[ 
https://issues.apache.org/jira/browse/AIRFLOW-3647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ken Melms updated AIRFLOW-3647:
-------------------------------
    Description: 
The contributed SparkSubmitOperator has no way to pass the spark-submit 
configuration field "--archives", which is treated subtly differently from 
"--files" or "--py-files": Spark unzips the archive into the application's 
working directory, and an optional alias can be attached to the unzipped 
folder so that it can be referred to elsewhere in the submission.

EG:

spark-submit --archives=hdfs:////user/someone/python35_venv.zip#PYTHON \
  --conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=./PYTHON/python35/bin/python3" \
  run_me.py

In our case, this behavior allows multiple Python virtual environments to be 
sourced from HDFS without incurring the penalty of pushing the whole virtual 
env to the cluster on each submission. For us, this makes Python-based Spark 
jobs workable on a cluster where end users have no ability to define which 
Python modules are installed.

 

  was:
The contributed SparkSubmitOperator has no ability to honor the spark-submit 
configuration field "--archives" which is treated subtly different than 
"--files" or "--py-files" in that it will unzip the archive into the 
application's working directory, and can optionally add an alias to the 
unzipped folder so that you can refer to it elsewhere in your submission.

EG:

spark-submit  --archives=hdfs:////user/someone/python35_venv.zip#PYTHON --conf 
"spark.yarn.appMasterEnv.PYSPARK_PYTHON=./PYTHON/python35/bin/python3" 
run_me.py  



In our case - this behavior allows for multiple python virtual environments to 
be sourced from HDFS without incurring the penalty of pushing the whole python 
virtual env to the cluster each submission.  This solves (for us) using 
python-based spark jobs on a cluster that the end user has no ability to define 
the python modules in use.

 


> Contributed SparkSubmitOperator doesn't honor --archives configuration
> ----------------------------------------------------------------------
>
>                 Key: AIRFLOW-3647
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3647
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: contrib
>    Affects Versions: 1.10.1
>         Environment: Linux (RHEL 7)
> Python 3.5 (using a virtual environment)
> spark-2.1.3-bin-hadoop26
> Airflow 1.10.1
> CDH 5.14 Hadoop [Yarn] cluster (no end user / dev modifications allowed)
>            Reporter: Ken Melms
>            Priority: Minor
>              Labels: easyfix, newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The contributed SparkSubmitOperator has no ability to honor the spark-submit 
> configuration field "--archives" which is treated subtly different than 
> "--files" or "--py-files" in that it will unzip the archive into the 
> application's working directory, and can optionally add an alias to the 
> unzipped folder so that you can refer to it elsewhere in your submission.
> EG:
> spark-submit  --archives=hdfs:////user/someone/python35_venv.zip#PYTHON 
> --conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=./PYTHON/python35/bin/python3" 
> run_me.py  
> In our case - this behavior allows for multiple python virtual environments 
> to be sourced from HDFS without incurring the penalty of pushing the whole 
> python virtual env to the cluster each submission.  This solves (for us) 
> using python-based spark jobs on a cluster that the end user has no ability 
> to define the python modules in use.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)