cloud-fan commented on code in PR #56453:
URL: https://github.com/apache/spark/pull/56453#discussion_r3398709754
##########
dev/make-distribution.sh:
##########
@@ -272,9 +316,32 @@ if [ "$MAKE_R" == "true" ]; then
echo "Building R source package"
R_PACKAGE_VERSION=`grep Version "$SPARK_HOME/R/pkg/DESCRIPTION" | awk
'{print $NF}'`
pushd "$SPARK_HOME/R" > /dev/null
+ # Ship the Apache LICENSE and NOTICE inside the SparkR source package. These
+ # are removed again after the package is built.
+ cp "$SPARK_HOME/LICENSE" pkg/LICENSE
+ cp "$SPARK_HOME/NOTICE" pkg/NOTICE
+ # Reference the bundled LICENSE from DESCRIPTION so `R CMD check --as-cran`
does
+ # not emit "File LICENSE is not mentioned in the DESCRIPTION file". The
committed
+ # DESCRIPTION is left untouched because SparkR CI runs check-cran.sh without
the
+ # LICENSE file present; this edit is transient and restored after the build
(and
+ # by the EXIT trap on failure). The backup lives outside pkg/ so R CMD check
does
+ # not flag it as a non-standard file. NOTE: the "Non-standard file 'NOTICE'"
note
+ # cannot be silenced this way and is expected.
+ cp pkg/DESCRIPTION "$SPARK_HOME/R/DESCRIPTION.orig"
+ sed 's/^License: Apache License (== 2.0)$/License: Apache License (== 2.0) +
file LICENSE/' \
+ "$SPARK_HOME/R/DESCRIPTION.orig" > pkg/DESCRIPTION
# Build source package and run full checks
# Do not source the check-cran.sh - it should be run from where it is for it
to set SPARK_HOME
NO_TESTS=1 "$SPARK_HOME/R/check-cran.sh"
+ mv -f "$SPARK_HOME/R/DESCRIPTION.orig" pkg/DESCRIPTION
+ rm -f pkg/LICENSE pkg/NOTICE
+
+ # Guard against regressions: the SparkR source package must contain LICENSE
and
+ # NOTICE at the package root (SPARK-57393).
+ for required in LICENSE NOTICE; do
+ tar tzf "SparkR_$R_PACKAGE_VERSION.tar.gz" | grep -qE "/$required\$" || \
+ { echo "ERROR: SparkR source package is missing $required"; exit 1; }
+ done
Review Comment:
Same `pipefail` + `grep -q` SIGPIPE issue as the PySpark guard above — less
likely to fire here since the SparkR listing is small, but the same fix keeps
both guards reliable:
```suggestion
listing=$(tar tzf "SparkR_$R_PACKAGE_VERSION.tar.gz")
for required in LICENSE NOTICE; do
grep -qE "^[^/]+/$required\$" <<< "$listing" || \
{ echo "ERROR: SparkR source package is missing $required"; exit 1; }
done
```
##########
dev/make-distribution.sh:
##########
@@ -259,9 +274,38 @@ if [ "$MAKE_PIP" == "true" ]; then
pushd "$SPARK_HOME/python" > /dev/null
# Delete the egg info file if it exists, this can cache older setup files.
rm -rf pyspark.egg-info || echo "No existing egg info file, skipping
deletion"
+ # Ship the Apache LICENSE and NOTICE inside the PySpark source distributions
+ # (see MANIFEST.in). These are removed again after the sdists are built.
+ #
+ # The classic pyspark sdist bundles the assembly jars
(packaging/classic/setup.py
+ # builds a deps/jars symlink farm), so it ships the binary LICENSE/NOTICE
that
+ # enumerate the bundled third-party jars' licenses, mirroring the binary
+ # distribution above. The connect and client sdists bundle no jars and ship
the
+ # plain source LICENSE/NOTICE.
+ if [ -e "$SPARK_HOME/LICENSE-binary" ]; then
+ cp "$SPARK_HOME/LICENSE-binary" LICENSE
+ cp "$SPARK_HOME/NOTICE-binary" NOTICE
+ else
+ cp "$SPARK_HOME/LICENSE" LICENSE
+ cp "$SPARK_HOME/NOTICE" NOTICE
+ fi
python3 packaging/classic/setup.py sdist
+
+ cp "$SPARK_HOME/LICENSE" LICENSE
+ cp "$SPARK_HOME/NOTICE" NOTICE
python3 packaging/connect/setup.py sdist
python3 packaging/client/setup.py sdist
+ rm -f LICENSE NOTICE
+
+ # Guard against regressions: every PySpark sdist must contain LICENSE and
NOTICE
+ # at the package root. The missing files were only caught by a Spark 4.2.0
RC1
+ # vote -1 (SPARK-57393); fail the release build here instead of at vote time.
+ for f in dist/pyspark*.tar.gz; do
+ for required in LICENSE NOTICE; do
+ tar tzf "$f" | grep -qE "/$required\$" || \
+ { echo "ERROR: $f is missing $required at the package root"; exit 1; }
+ done
+ done
Review Comment:
This guard can kill the release build on a compliant tarball: under `set -o
pipefail` (line 27), `grep -q` exits at the first match and `tar` takes SIGPIPE
writing the rest of the listing, so the pipeline fails and the `||` branch
fires. Reproduced locally — a tarball containing both files failed with `ERROR:
... missing NOTICE`, and it's racy (LICENSE passed in the same run). This
pattern came from my example snippet in the last round; sorry about that.
Capturing the listing once fixes it, and the root-anchored regex makes the
check match the comment above (the current pattern matches at any depth):
```suggestion
for f in dist/pyspark*.tar.gz; do
listing=$(tar tzf "$f")
for required in LICENSE NOTICE; do
grep -qE "^[^/]+/$required\$" <<< "$listing" || \
{ echo "ERROR: $f is missing $required at the package root"; exit 1;
}
done
done
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]