[
https://issues.apache.org/jira/browse/AIRFLOW-3647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ash Berlin-Taylor resolved AIRFLOW-3647.
----------------------------------------
Resolution: Fixed
Fix Version/s: 1.10.3
> Contributed SparkSubmitOperator doesn't honor --archives configuration
> ----------------------------------------------------------------------
>
> Key: AIRFLOW-3647
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3647
> Project: Apache Airflow
> Issue Type: Improvement
> Components: contrib
> Affects Versions: 1.10.1
> Environment: Linux (RHEL 7)
> Python 3.5 (using a virtual environment)
> spark-2.1.3-bin-hadoop2.6
> Airflow 1.10.1
> CDH 5.14 Hadoop [Yarn] cluster (no end user / dev modifications allowed)
> Reporter: Ken Melms
> Priority: Minor
> Labels: easyfix, newbie
> Fix For: 1.10.3
>
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> The contributed SparkSubmitOperator has no way to pass the spark-submit
> configuration flag "--archives", which is treated subtly differently from
> "--files" or "--py-files": Spark unzips each archive into the
> application's working directory, and an optional alias (appended after a
> "#") names the unzipped folder so it can be referenced elsewhere in the
> submission.
> EG:
> spark-submit --archives=hdfs:///user/someone/python35_venv.zip#PYTHON \
>   --conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=./PYTHON/python35/bin/python3" \
>   run_me.py
> In our case, this behavior allows multiple python virtual environments
> to be served from HDFS without incurring the penalty of pushing the whole
> virtual env to the cluster on each submission. For us it solves the
> problem of running python-based spark jobs on a cluster where the end
> user has no ability to define the python modules in use.
>
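For reference, the way the "--archives" value (a "uri#alias" pair) composes into a spark-submit invocation can be sketched in plain Python. The helper name and structure below are illustrative only, not Airflow's actual implementation of the fix:

```python
def build_spark_submit_cmd(app, archives=None, conf=None):
    """Assemble a spark-submit command line as a list of arguments.

    archives: list of "uri#alias" strings. Spark unpacks each archive
    into the application's working directory; the optional "#alias"
    suffix names the unpacked folder so other settings (e.g.
    spark.yarn.appMasterEnv.PYSPARK_PYTHON) can refer to it.
    """
    cmd = ["spark-submit"]
    for archive in archives or []:
        cmd += ["--archives", archive]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)  # the application script goes last
    return cmd

# Mirrors the example from the issue description:
cmd = build_spark_submit_cmd(
    "run_me.py",
    archives=["hdfs:///user/someone/python35_venv.zip#PYTHON"],
    conf={"spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./PYTHON/python35/bin/python3"},
)
```

The fix for this issue (shipped in 1.10.3) exposes this flag through the operator, so the archive list no longer has to be smuggled in via raw configuration.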
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)