This is an automated email from the ASF dual-hosted git repository.
pan3793 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 319dc6e05f1c [SPARK-57130][BUILD] `make-distribution.sh` copies only
git-tracked files for python
319dc6e05f1c is described below
commit 319dc6e05f1c2774142bbc4dadb5f1389cadd2b0
Author: Cheng Pan <[email protected]>
AuthorDate: Sat May 30 19:19:05 2026 +0800
[SPARK-57130][BUILD] `make-distribution.sh` copies only git-tracked files
for python
### What changes were proposed in this pull request?
`make-distribution.sh` now copies only git-tracked files from the `python`
folder, using `git ls-files` piped to `cpio`, when the `git` and `cpio`
commands are available and the script runs inside a git repo. Otherwise it
falls back to raw `cp -r`.
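As a rough illustration of the approach (a sketch, not the script itself), the demo below builds a throwaway repo with one tracked file and one generated, untracked file, then copies the `python` folder with the same `git ls-files | cpio` pipeline and the same `cp -r` fallback. The temporary layout, the file names, and the `copy_dir` helper are hypothetical and exist only for this example.

```shell
# Hypothetical helper: copy SRC (relative to the current directory) into DEST.
# Uses only git-tracked files when git and cpio are usable, else plain cp -r.
copy_dir() {
  src_dir="$1"
  dest_dir="$2"
  if command -v git >/dev/null 2>&1 && command -v cpio >/dev/null 2>&1 \
      && git rev-parse --git-dir >/dev/null 2>&1; then
    # -0: NUL-delimited names, -p: pass-through copy,
    # -d: create leading directories, -m: preserve modification times
    git ls-files -z "$src_dir" | cpio -0pdm "$dest_dir" 2>/dev/null
    echo "tracked-only"
  else
    cp -r "$src_dir" "$dest_dir"
    echo "plain-cp"
  fi
}

# Throwaway layout (made-up names): one tracked file, one generated file.
work="$(mktemp -d)"
mkdir -p "$work/repo/python/docs/_build" "$work/dist"
cd "$work/repo" || exit 1
echo "print('hi')" > python/app.py
echo "<html></html>" > python/docs/_build/index.html

# Staging is enough: git ls-files lists index entries, so no commit is needed.
if command -v git >/dev/null 2>&1; then
  git init -q . && git add python/app.py
fi

mode="$(copy_dir python "$work/dist")"
echo "mode: $mode"
find "$work/dist" -type f | sort
```

When both tools are present, only `python/app.py` should end up under the destination; in the fallback case the generated HTML file is copied as well.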
### Why are the changes needed?
I find that `make-distribution.sh` sometimes produces an unreasonably large
tarball because it copies the entire `python` folder, which may contain
generated files (e.g., compiled PySpark docs), to the `dist` directory.
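One way to check what a plain `cp -r` would drag along is to list the files under the folder that are not in the git index. The sketch below does this in a throwaway repo; the layout and file names are made up for illustration, and it assumes `git` is installed.

```shell
# Skip the demo when git is not installed.
command -v git >/dev/null 2>&1 || { echo "git not found"; exit 0; }

# Throwaway layout (made-up names): one tracked file, one generated file.
work="$(mktemp -d)"
mkdir -p "$work/repo/python/docs/_build"
cd "$work/repo" || exit 1
echo "print('hi')" > python/app.py
echo "<html></html>" > python/docs/_build/index.html
git init -q . && git add python/app.py

# --others lists files that are not in the index. Without --exclude-standard
# the ignore rules are not applied, so ignored build output is listed too.
untracked="$(git ls-files --others python)"
echo "$untracked"
```

Everything this prints would be copied by `cp -r` but skipped by the `git ls-files` based copy.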
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Run `dev/make-distribution.sh` manually.
I also tested the performance of the new command. On macOS, `cpio` is
slightly slower than raw `cp`, but good enough:
```
$ time git ls-files -z "$PWD/python" | cpio -0pdm "target"
42452 blocks
git ls-files -z "$PWD/python" 0.01s user 0.01s system 76% cpu 0.027 total
cpio -0pdm "target" 0.05s user 1.10s system 77% cpu 1.480 total
$ rm -rf target/python
$ time cp -r "$PWD/python" "target"
cp -r "$PWD/python" "target" 0.02s user 0.56s system 78% cpu 0.731 total
```
On Linux, `cpio` is faster:
```
$ time git ls-files -z "$PWD/python" | cpio -0pdm "target"
46385 blocks
git ls-files -z "$PWD/python" 0.01s user 0.01s system 81% cpu 0.022 total
cpio -0pdm "target" 0.05s user 1.02s system 84% cpu 1.260 total
$ rm -rf target/python
$ time cp -r "$PWD/python" "target"
cp -r "$PWD/python" "target" 0.02s user 0.57s system 73% cpu 0.807 total
```
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: DeepSeek V4 Pro.
Closes #56186 from pan3793/SPARK-57130.
Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>
---
dev/make-distribution.sh | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/dev/make-distribution.sh b/dev/make-distribution.sh
index e43e4afbd0e2..7cd9ea7889e8 100755
--- a/dev/make-distribution.sh
+++ b/dev/make-distribution.sh
@@ -294,7 +294,11 @@ mkdir "$DISTDIR/conf"
cp "$SPARK_HOME"/conf/*.template "$DISTDIR/conf"
cp "$SPARK_HOME/README.md" "$DISTDIR"
cp -r "$SPARK_HOME/bin" "$DISTDIR"
-cp -r "$SPARK_HOME/python" "$DISTDIR"
+if command -v git && command -v cpio && git rev-parse --git-dir 2>/dev/null; then
+ git ls-files -z "$SPARK_HOME/python" | cpio -0pdm "$DISTDIR"
+else
+ cp -r "$SPARK_HOME/python" "$DISTDIR"
+fi
# Remove the python distribution from dist/ if we built it
if [ "$MAKE_PIP" == "true" ]; then