[ 
https://issues.apache.org/jira/browse/AIRFLOW-3647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762588#comment-16762588
 ] 

ASF subversion and git services commented on AIRFLOW-3647:
----------------------------------------------------------

Commit 13c63ffad05817bf4ed6ef948dc9672c26f8ffb6 in airflow's branch 
refs/heads/master from Penumbra69
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=13c63ff ]

[AIRFLOW-3647] Add archives config option to SparkSubmitOperator (#4467)

To enable the Spark behavior of transporting and extracting an archive
on job launch, making the _contents_ of the archive available to the
driver as well as the workers (not just the jar or zip file itself),
this configuration attribute is necessary.

This is required if you have no ability to modify the Python
environment on the worker / driver nodes, but you wish to use
versions, modules, or features that are not installed there.

We transport a full Python 3.5 environment to our CDH cluster using
this option and the alias "#PYTHON", paired with an additional Spark
configuration setting to use it:

    --archives "hdfs:///user/myuser/my_python_env.zip#PYTHON"
    --conf 
"spark.yarn.appMasterEnv.PYSPARK_PYTHON=./PYTHON/python35/bin/python3"
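
The new `archives` option ultimately becomes a `--archives` flag on the
spark-submit command line. As a minimal sketch of that mapping (the helper
name below is hypothetical, for illustration only; the real operator
delegates command construction to SparkSubmitHook):

```python
def build_spark_submit_cmd(application, archives=None, conf=None):
    # Sketch: assemble a spark-submit argument list from an `archives`
    # string and a dict of Spark conf settings.
    cmd = ["spark-submit"]
    if archives:
        # Comma-separated archives; each entry may carry a "#alias"
        # fragment naming the directory the archive is unpacked into.
        cmd += ["--archives", archives]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(application)
    return cmd

cmd = build_spark_submit_cmd(
    "run_me.py",
    archives="hdfs:///user/myuser/my_python_env.zip#PYTHON",
    conf={
        "spark.yarn.appMasterEnv.PYSPARK_PYTHON":
            "./PYTHON/python35/bin/python3",
    },
)
print(" ".join(cmd))
```

This reproduces the two flags shown above followed by the application file.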

> Contributed SparkSubmitOperator doesn't honor --archives configuration
> ----------------------------------------------------------------------
>
>                 Key: AIRFLOW-3647
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3647
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: contrib
>    Affects Versions: 1.10.1
>         Environment: Linux (RHEL 7)
> Python 3.5 (using a virtual environment)
> spark-2.1.3-bin-hadoop26
> Airflow 1.10.1
> CDH 5.14 Hadoop [Yarn] cluster (no end user / dev modifications allowed)
>            Reporter: Ken Melms
>            Priority: Minor
>              Labels: easyfix, newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The contributed SparkSubmitOperator has no ability to honor the spark-submit 
> configuration field "--archives", which is treated subtly differently from 
> "--files" or "--py-files": it unzips the archive into the 
> application's working directory, and can optionally add an alias to the 
> unzipped folder so that you can refer to it elsewhere in your submission.
> E.g.:
> spark-submit  --archives=hdfs:///user/someone/python35_venv.zip#PYTHON 
> --conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=./PYTHON/python35/bin/python3" 
> run_me.py  
> In our case, this behavior allows multiple Python virtual environments 
> to be sourced from HDFS without incurring the penalty of pushing the whole 
> virtual environment to the cluster on each submission.  It solves (for us) 
> the problem of running Python-based Spark jobs on a cluster where the end 
> user has no ability to define the Python modules in use.
>  
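
The "#alias" fragment in an --archives entry is what lets the job refer to
the unpacked directory by a stable name. A minimal sketch of that split
(the helper name is hypothetical; YARN itself performs this when
localizing the archive into the container working directory):

```python
def split_archive_alias(entry):
    # Split an --archives entry of the form "path#alias".
    # YARN unpacks the archive under the alias name in the container's
    # working directory; with no alias, the file name is used instead.
    path, sep, alias = entry.partition("#")
    return path, (alias if sep else None)

print(split_archive_alias("hdfs:///user/someone/python35_venv.zip#PYTHON"))
# → ('hdfs:///user/someone/python35_venv.zip', 'PYTHON')
```

With the alias "PYTHON", the PYSPARK_PYTHON conf setting above can point at
./PYTHON/python35/bin/python3 regardless of the archive's file name.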



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
