Saisai Shao commented on SPARK-17512:

This is due to a behavior change in how Spark applications are submitted on 
yarn with client or cluster deploy mode. In 2.0 we convert most of the arguments 
to system properties, while in 1.6 for "--py-files" we still use command 
arguments for yarn cluster mode, and {{PythonRunner}} only checks the system 
property {{spark.submit.pyFiles}}, so that's why it works under 1.6. It is 
really an issue that should be fixed; let me handle it.
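
For reference, the locality check that {{PythonRunner.formatPath}} performs can be sketched in Python. This is an illustrative approximation of the Scala logic, not a verbatim port; the function names here are made up:

```python
from urllib.parse import urlparse

def is_local_path(path: str) -> bool:
    # Paths with no scheme, or with the "file"/"local" schemes, count as
    # local; anything else (s3://, hdfs://, ...) is treated as remote.
    return urlparse(path).scheme in ("", "file", "local")

def format_path(path: str) -> str:
    # Mirrors the IllegalArgumentException raised when PythonRunner is
    # handed a remote path (the message below copies the reported error).
    if not is_local_path(path):
        raise ValueError(
            "Launching Python applications through spark-submit is "
            "currently only supported for local files: " + path)
    return path
```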

> Specifying remote files for Python based Spark jobs in Yarn cluster mode not 
> working
> ------------------------------------------------------------------------------------
>                 Key: SPARK-17512
>                 URL: https://issues.apache.org/jira/browse/SPARK-17512
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Submit
>    Affects Versions: 2.0.0
>            Reporter: Udit Mehrotra
> When I run a Python application and specify a remote path for the extra 
> files to be included in the PYTHONPATH, using the '--py-files' or 
> 'spark.submit.pyFiles' configuration option in YARN cluster mode, I get the 
> following error:
> Exception in thread "main" java.lang.IllegalArgumentException: Launching Python applications through spark-submit is currently only supported for local files: s3://xxxx/app.py
> at org.apache.spark.deploy.PythonRunner$.formatPath(PythonRunner.scala:104)
> at org.apache.spark.deploy.PythonRunner$$anonfun$formatPaths$3.apply(PythonRunner.scala:136)
> at org.apache.spark.deploy.PythonRunner$$anonfun$formatPaths$3.apply(PythonRunner.scala:136)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
> at org.apache.spark.deploy.PythonRunner$.formatPaths(PythonRunner.scala:136)
> at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$10.apply(SparkSubmit.scala:636)
> at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$10.apply(SparkSubmit.scala:634)
> at scala.Option.foreach(Option.scala:257)
> at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:634)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:158)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Here are sample commands which throw this error in Spark 2.0 
> (sparkApp.py requires app.py):
> spark-submit --deploy-mode cluster --py-files s3://xxxx/app.py s3://xxxx/sparkApp.py (works fine in 1.6)
> spark-submit --deploy-mode cluster --conf spark.submit.pyFiles=s3://xxxx/app.py s3://xxxx/sparkApp1.py (did not work in 1.6 either)
> This would work fine if app.py is downloaded locally and specified.
> This was working correctly with the '--py-files' option in earlier versions 
> of Spark, but not with the 'spark.submit.pyFiles' configuration option. Now 
> it does not work either way.
> The following code shows the comment which states that it should work with 
> 'non-local' paths for YARN cluster mode, and we specifically do a separate 
> validation to fail if YARN client mode is used with remote paths:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L309
> And then this code gets triggered at the end of each run, irrespective of 
> whether we are using client or cluster mode, and internally validates that 
> the paths are local:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L634
> The above validation was not triggered in earlier versions of Spark when 
> using the '--py-files' option, because we were not storing the arguments 
> passed to '--py-files' in the 'spark.submit.pyFiles' configuration for YARN. 
> However, the following code, newly added in 2.0, now stores it, and hence 
> this validation gets triggered even if we specify files through '--py-files':
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L545
> Also, the logic in the YARN client was changed to read values directly from 
> the 'spark.submit.pyFiles' configuration instead of from '--py-files' (as 
> it did earlier):
> https://github.com/apache/spark/commit/8ba2b7f28fee39c4839e5ea125bd25f5091a3a1e#diff-b050df3f55b82065803d6e83453b9706R543
> So now it is broken whether we use '--py-files' or 'spark.submit.pyFiles', 
> as the validation gets triggered in both cases, irrespective of whether we 
> use client or cluster mode with YARN.
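
Given the walkthrough above, one possible shape of a fix is to make the locality check deploy-mode aware: only client mode needs local paths, since in cluster mode YARN can localize remote files itself. The following is an illustrative Python sketch of that idea; the names and structure are hypothetical and do not reflect the actual Scala patch:

```python
from urllib.parse import urlparse

def validate_py_files(py_files, deploy_mode):
    # Hypothetical deploy-mode-aware validation: remote py-files are only
    # rejected in client mode, where the local Python process must read them.
    for path in py_files:
        is_local = urlparse(path).scheme in ("", "file", "local")
        if deploy_mode == "client" and not is_local:
            raise ValueError(
                "remote py-file not supported in client mode: " + path)
    return py_files
```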

This message was sent by Atlassian JIRA
