HyukjinKwon opened a new pull request #30486:
URL: https://github.com/apache/spark/pull/30486
### What changes were proposed in this pull request?
TL;DR:
- This PR completes support for archives in Spark itself, rather than in YARN only.
- After this PR, PySpark users can use Conda to ship Python packages together, as below:
```bash
conda create -y -n pyspark_env -c conda-forge \
  pyarrow==2.0.0 pandas==1.1.4 conda-pack==0.5.0
conda activate pyspark_env
conda pack -f -o pyspark_env.tar.gz
PYSPARK_DRIVER_PYTHON=python PYSPARK_PYTHON=./environment/bin/python \
  pyspark --archives pyspark_env.tar.gz#environment
```
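The same can also be expressed through the new `spark.archives` configuration rather than the `--archives` flag; the `#environment` suffix names the directory the archive is unpacked into. A minimal sketch of the equivalent invocation:
```bash
# Equivalent form using the spark.archives configuration added by this PR.
PYSPARK_DRIVER_PYTHON=python PYSPARK_PYTHON=./environment/bin/python \
  pyspark --conf spark.archives=pyspark_env.tar.gz#environment
```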
This PR proposes to add Spark's native `--archives` option in Spark submit, and a `spark.archives` configuration. Currently, both are supported only in YARN mode:
```bash
./bin/spark-submit --help
```
```
Options:
  ...
 Spark on YARN only:
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
```
This `archives` feature is often useful when you have to ship a directory and unpack it on executors. One example is native libraries used via JNI. Another example is shipping Python packages together via a Conda environment. Especially for Conda, PySpark currently does not have a nice way to ship packages that works in general; please see also https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html#using-zipped-virtual-environment (PySpark new documentation demo for 3.1.0).
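As a quick way to see what executors receive, here is a minimal PySpark sketch (assuming the archive was shipped as `pyspark_env.tar.gz#environment` as above) that checks for the unpacked directory inside the executors' working directory:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def check_env(_):
    import os
    # Archives are unpacked into the executor working directory; "environment"
    # is the alias given after '#' in --archives / spark.archives.
    yield os.path.exists("./environment/bin/python")

# Expect [True] when the archive was distributed and unpacked correctly.
print(spark.sparkContext.parallelize(range(1), 1).mapPartitions(check_env).collect())
```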
The neatest way is arguably to ship a zipped Conda environment, but that currently depends on this archive feature. NOTE that we are able to use `spark.files` by relying on its undocumented behaviour of untarring `tar.gz` files, but I don't think we should document such a workaround or encourage people to rely on it.
Also, note that this PR does not yet aim for feature parity with `spark.files.overwrite`, `spark.files.useFetchCache`, etc. I documented this as an experimental feature as well.
### Why are the changes needed?
To complete the feature parity, and to better support shipping Python libraries together with a Conda environment.
### Does this PR introduce _any_ user-facing change?
Yes, this makes `--archives` work in Spark itself instead of YARN only, and adds a new configuration `spark.archives`.
### How was this patch tested?
I added unit tests. I also manually tested in standalone cluster, local-cluster, and local modes.
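For reference, a hedged sketch of the kind of manual check that can be run in local-cluster mode (the `local-cluster[2,1,1024]` master string means 2 workers with 1 core and 1024 MB each):
```bash
# Sketch: manual verification in local-cluster mode with the packed Conda env.
PYSPARK_DRIVER_PYTHON=python PYSPARK_PYTHON=./environment/bin/python \
  ./bin/pyspark --master "local-cluster[2,1,1024]" \
  --archives pyspark_env.tar.gz#environment
```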