[
https://issues.apache.org/jira/browse/AIRFLOW-3647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762588#comment-16762588
]
ASF subversion and git services commented on AIRFLOW-3647:
----------------------------------------------------------
Commit 13c63ffad05817bf4ed6ef948dc9672c26f8ffb6 in airflow's branch
refs/heads/master from Penumbra69
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=13c63ff ]
[AIRFLOW-3647] Add archives config option to SparkSubmitOperator (#4467)
This configuration attribute is necessary to enable Spark's behavior of
transporting and extracting an archive on job launch, making the
_contents_ of the archive available to the driver as well as the workers
(not just the jar or archive as a zip file).
This is required if you have no ability to modify the Python env on
the worker / driver nodes, but wish to use Python versions, modules, or
features that are not installed there.
We transport a full Python 3.5 environment to our CDH cluster using
this option with the alias "#PYTHON", paired with an additional Spark
configuration option that points at it:
--archives "hdfs:///user/myuser/my_python_env.zip#PYTHON"
--conf
"spark.yarn.appMasterEnv.PYSPARK_PYTHON=./PYTHON/python35/bin/python3"
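The mapping from the operator's archives value and conf entries to the spark-submit flags above can be sketched roughly as follows. This is a hypothetical helper for illustration, not the actual SparkSubmitHook code; the function name and signature are assumptions:

```python
def build_spark_submit_cmd(application, archives=None, conf=None):
    """Rough sketch of how an archives value and conf entries map to
    spark-submit CLI flags (hypothetical helper, not the real hook code)."""
    cmd = ["spark-submit"]
    if archives:
        # The "#PYTHON" suffix aliases the extracted directory on the
        # driver/workers, so conf values can refer to ./PYTHON/...
        cmd += ["--archives", archives]
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", "{}={}".format(key, value)]
    cmd.append(application)
    return cmd

cmd = build_spark_submit_cmd(
    "run_me.py",
    archives="hdfs:///user/myuser/my_python_env.zip#PYTHON",
    conf={"spark.yarn.appMasterEnv.PYSPARK_PYTHON":
          "./PYTHON/python35/bin/python3"},
)
```

The resulting list matches the flags shown above: `--archives` carries the HDFS path plus alias, and `--conf` tells YARN's application master to run the Python interpreter extracted from the archive.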
> Contributed SparkSubmitOperator doesn't honor --archives configuration
> ----------------------------------------------------------------------
>
> Key: AIRFLOW-3647
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3647
> Project: Apache Airflow
> Issue Type: Improvement
> Components: contrib
> Affects Versions: 1.10.1
> Environment: Linux (RHEL 7)
> Python 3.5 (using a virtual environment)
> spark-2.1.3-bin-hadoop26
> Airflow 1.10.1
> CDH 5.14 Hadoop [Yarn] cluster (no end user / dev modifications allowed)
> Reporter: Ken Melms
> Priority: Minor
> Labels: easyfix, newbie
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> The contributed SparkSubmitOperator has no way to honor the spark-submit
> configuration field "--archives", which is treated subtly differently from
> "--files" or "--py-files": Spark unzips the archive into the
> application's working directory and can optionally add an alias to the
> unzipped folder so that you can refer to it elsewhere in your submission.
> E.g.:
> spark-submit --archives=hdfs:///user/someone/python35_venv.zip#PYTHON
> --conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=./PYTHON/python35/bin/python3"
> run_me.py
> In our case, this behavior allows multiple Python virtual environments
> to be sourced from HDFS without incurring the penalty of pushing the whole
> virtual environment to the cluster on each submission. For us, this solves
> running Python-based Spark jobs on a cluster where the end user has no
> ability to define the Python modules in use.
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)