[ https://issues.apache.org/jira/browse/SPARK-17512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-17512:
------------------------------------

    Assignee:     (was: Apache Spark)

> Specifying remote files for Python based Spark jobs in YARN cluster mode not working
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-17512
>                 URL: https://issues.apache.org/jira/browse/SPARK-17512
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Submit, YARN
>    Affects Versions: 2.0.0
>            Reporter: Udit Mehrotra
>
> When I run a Python application and specify a remote path for the extra files to be included in the PYTHONPATH, using either the '--py-files' option or the 'spark.submit.pyFiles' configuration option in YARN cluster mode, I get the following error:
>
> Exception in thread "main" java.lang.IllegalArgumentException: Launching Python applications through spark-submit is currently only supported for local files: s3://xxxx/app.py
>     at org.apache.spark.deploy.PythonRunner$.formatPath(PythonRunner.scala:104)
>     at org.apache.spark.deploy.PythonRunner$$anonfun$formatPaths$3.apply(PythonRunner.scala:136)
>     at org.apache.spark.deploy.PythonRunner$$anonfun$formatPaths$3.apply(PythonRunner.scala:136)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>     at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>     at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>     at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>     at org.apache.spark.deploy.PythonRunner$.formatPaths(PythonRunner.scala:136)
>     at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$10.apply(SparkSubmit.scala:636)
>     at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$10.apply(SparkSubmit.scala:634)
>     at scala.Option.foreach(Option.scala:257)
>     at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:634)
>     at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:158)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> Here are sample commands which throw this error in Spark 2.0 (sparkApp.py requires app.py):
>
> spark-submit --deploy-mode cluster --py-files s3://xxxx/app.py s3://xxxx/sparkApp.py (works fine in 1.6)
> spark-submit --deploy-mode cluster --conf spark.submit.pyFiles=s3://xxxx/app.py s3://xxxx/sparkApp1.py (not working in 1.6)
>
> Both commands work fine if app.py is first downloaded locally and specified as a local path.
> In earlier versions of Spark this worked correctly with the '--py-files' option, though not with the 'spark.submit.pyFiles' configuration option. Now it does not work either way.
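>
> To make the failure concrete, here is a rough Scala sketch of the kind of check behind the stack trace above. The method name echoes PythonRunner.formatPath, but the body is a simplified assumption, not the actual Spark source:
>
>     import java.net.URI
>
>     // Simplified, assumed sketch of the local-only check: any py-file whose
>     // URI scheme is not local is rejected outright.
>     def formatPath(path: String): String = new URI(path).getScheme match {
>       case null | "file" | "local" => path // plain and local paths pass through
>       case _ =>
>         // This is the branch hit by s3://xxxx/app.py in the trace above.
>         throw new IllegalArgumentException(
>           "Launching Python applications through spark-submit is currently " +
>           s"only supported for local files: $path")
>     }
>
> Under this sketch, formatPath("s3://xxxx/app.py") reproduces the exception message above, while a plain local path such as formatPath("/home/hadoop/app.py") passes through untouched.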
> The following diff shows the comment stating that 'non-local' paths should work in YARN cluster mode, and that a separate validation is deliberately performed to fail if YARN client mode is used with remote paths:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L309
>
> And then this code gets triggered at the end of every run, irrespective of whether we are using client or cluster mode, and internally validates that the paths are local, failing for any remote path:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L634
>
> This validation was not triggered in earlier versions of Spark when using the '--py-files' option, because the arguments passed to '--py-files' were not stored in the 'spark.submit.pyFiles' configuration for YARN. However, the following code, newly added in 2.0, now stores them there, so the validation is triggered even when files are specified through '--py-files':
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L545
>
> Also, the logic in the YARN client was changed to read values directly from the 'spark.submit.pyFiles' configuration instead of from '--py-files' as before:
> https://github.com/apache/spark/commit/8ba2b7f28fee39c4839e5ea125bd25f5091a3a1e#diff-b050df3f55b82065803d6e83453b9706R543
>
> So it is now broken whether we use '--py-files' or 'spark.submit.pyFiles', since the validation is triggered in both cases, irrespective of whether we use client or cluster mode with YARN. A sketch of this interaction follows below.
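>
> Putting the pieces together, here is a simplified, assumed Scala sketch of the flow described above. The names echo SparkSubmit.prepareSubmitEnvironment, but this is illustrative, not the actual Spark source:
>
>     import java.net.URI
>     import scala.collection.mutable
>
>     // Assumed stand-in for the local-only check sketched earlier.
>     def assertLocal(path: String): String = {
>       val scheme = new URI(path).getScheme
>       if (scheme != null && scheme != "file" && scheme != "local") {
>         throw new IllegalArgumentException(
>           "Launching Python applications through spark-submit is currently " +
>           s"only supported for local files: $path")
>       }
>       path
>     }
>
>     // Illustrative flow: in 2.0, --py-files is copied into
>     // spark.submit.pyFiles (the L545 change above), and the local-only
>     // validation (the L634 code above) then runs for client and cluster
>     // mode alike, so remote paths fail either way.
>     def prepareSubmitEnvironment(pyFilesArg: Option[String],
>                                  conf: mutable.Map[String, String]): Unit = {
>       pyFilesArg.foreach(files => conf("spark.submit.pyFiles") = files)
>       conf.get("spark.submit.pyFiles").foreach { files =>
>         conf("spark.submit.pyFiles") =
>           files.split(",").map(assertLocal).mkString(",")
>       }
>     }
>
> Under this sketch, both --py-files s3://xxxx/app.py and --conf spark.submit.pyFiles=s3://xxxx/app.py feed the same validation, which matches the behaviour reported above: remote paths now fail regardless of which option is used and regardless of deploy mode.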