[
https://issues.apache.org/jira/browse/SPARK-17512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-17512:
------------------------------------
Assignee: Apache Spark
> Specifying remote files for Python based Spark jobs in Yarn cluster mode not
> working
> ------------------------------------------------------------------------------------
>
> Key: SPARK-17512
> URL: https://issues.apache.org/jira/browse/SPARK-17512
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Submit, YARN
> Affects Versions: 2.0.0
> Reporter: Udit Mehrotra
> Assignee: Apache Spark
>
> When I run a Python application and specify a remote path for the extra
> files to be included in the PYTHONPATH, using either the '--py-files' option
> or the 'spark.submit.pyFiles' configuration, in YARN cluster mode I get the
> following error:
> Exception in thread "main" java.lang.IllegalArgumentException: Launching Python applications through spark-submit is currently only supported for local files: s3://xxxx/app.py
> at org.apache.spark.deploy.PythonRunner$.formatPath(PythonRunner.scala:104)
> at org.apache.spark.deploy.PythonRunner$$anonfun$formatPaths$3.apply(PythonRunner.scala:136)
> at org.apache.spark.deploy.PythonRunner$$anonfun$formatPaths$3.apply(PythonRunner.scala:136)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
> at org.apache.spark.deploy.PythonRunner$.formatPaths(PythonRunner.scala:136)
> at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$10.apply(SparkSubmit.scala:636)
> at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$10.apply(SparkSubmit.scala:634)
> at scala.Option.foreach(Option.scala:257)
> at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:634)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:158)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Here are sample commands that throw this error in Spark 2.0 (sparkApp.py
> requires app.py):
> spark-submit --deploy-mode cluster --py-files s3://xxxx/app.py s3://xxxx/sparkApp.py    (works fine in 1.6)
> spark-submit --deploy-mode cluster --conf spark.submit.pyFiles=s3://xxxx/app.py s3://xxxx/sparkApp1.py    (did not work in 1.6 either)
> This works fine if app.py is first downloaded locally and specified by its
> local path.
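> A minimal sketch of that workaround (assuming the AWS CLI is available to
> fetch the file; the bucket paths are the placeholders from above):
>
>   aws s3 cp s3://xxxx/app.py /tmp/app.py
>   spark-submit --deploy-mode cluster --py-files /tmp/app.py s3://xxxx/sparkApp.py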
> In earlier versions of Spark this worked correctly through the '--py-files'
> option, though not through the 'spark.submit.pyFiles' configuration option.
> Now it works through neither.
> The following code shows the comment stating that 'non-local' paths should
> work in YARN cluster mode; there is already a separate validation that fails
> specifically when YARN client mode is used with remote paths:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L309
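> As an illustration, the check at that line looks roughly like this (a
> simplified sketch of the linked SparkSubmit.scala, not the verbatim source;
> Utils.nonLocalPaths and printErrorAndExit are helpers used in that file):
>
>   // Require all python files to be local, so they can be added to the
>   // PYTHONPATH. In YARN cluster mode, python files are distributed as
>   // regular files, which can be non-local, so the check is skipped there.
>   if (args.isPython && !isYarnCluster) {
>     if (Utils.nonLocalPaths(args.primaryResource).nonEmpty) {
>       printErrorAndExit(s"Only local python files are supported: ${args.primaryResource}")
>     }
>     val nonLocalPyFiles = Utils.nonLocalPaths(args.pyFiles).mkString(",")
>     if (nonLocalPyFiles.nonEmpty) {
>       printErrorAndExit(s"Only local additional python files are supported: $nonLocalPyFiles")
>     }
>   }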
> But then this code gets triggered at the end of every run, irrespective of
> whether we are using client or cluster mode, and internally validates that
> the paths are local:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L634
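> For illustration, that unconditional step looks roughly like this (simplified
> sketch of prepareSubmitEnvironment and of PythonRunner.formatPath, which is
> where the IllegalArgumentException above is thrown):
>
>   // SparkSubmit.prepareSubmitEnvironment (sketch): runs for every submission
>   sysProps.get("spark.submit.pyFiles").foreach { pyFiles =>
>     val resolvedPyFiles = Utils.resolveURIs(pyFiles)
>     // formatPaths calls formatPath on each entry, which throws for any
>     // non-local path regardless of deploy mode:
>     val formattedPyFiles = PythonRunner.formatPaths(resolvedPyFiles).mkString(",")
>     sysProps("spark.submit.pyFiles") = formattedPyFiles
>   }
>
>   // PythonRunner.formatPath (sketch; the real method also normalizes the path)
>   def formatPath(path: String): String = {
>     if (Utils.nonLocalPaths(path).nonEmpty) {
>       throw new IllegalArgumentException("Launching Python applications through " +
>         s"spark-submit is currently only supported for local files: $path")
>     }
>     path
>   }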
> This second validation was not triggered in earlier versions of Spark when
> using the '--py-files' option, because the arguments passed to '--py-files'
> were not stored in the 'spark.submit.pyFiles' configuration for YARN.
> However, the following code, newly added in 2.0, now stores them there, so
> the validation is triggered even when files are specified through the
> '--py-files' option:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L545
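> A sketch of what that addition amounts to (simplified; the net effect is
> that the YARN branch now copies the CLI argument into the very conf entry
> that the validation above inspects):
>
>   // SparkSubmit.prepareSubmitEnvironment (sketch), in the YARN-specific branch:
>   if (clusterManager == YARN && args.pyFiles != null) {
>     // store the --py-files argument so yarn.Client can read it from the conf;
>     // as a side effect it is now subject to the local-files-only check above
>     sysProps("spark.submit.pyFiles") = args.pyFiles
>   }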
> Also, the logic in the YARN client was changed to read the values directly
> from the 'spark.submit.pyFiles' configuration instead of from '--py-files'
> (as it did earlier):
> https://github.com/apache/spark/commit/8ba2b7f28fee39c4839e5ea125bd25f5091a3a1e#diff-b050df3f55b82065803d6e83453b9706R543
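> Conceptually, the change on the YARN client side was the following
> (simplified sketch of yarn/Client.scala; the exact accessor differs in the
> real code):
>
>   // before: py-files came straight from the --py-files CLI argument
>   val pyFiles = args.pyFiles
>
>   // after: they are read from the conf entry instead, i.e. from a value that
>   // SparkSubmit has already forced through the local-files-only validation
>   val pyFiles = sparkConf.getOption("spark.submit.pyFiles")
>     .map(_.split(",").toSeq).getOrElse(Nil)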
> So it is now broken whether we use '--py-files' or 'spark.submit.pyFiles',
> as the validation is triggered in both cases, irrespective of whether we use
> client or cluster mode with YARN.