Udit Mehrotra created SPARK-17512:
-------------------------------------
Summary: Specifying remote files for Python based Spark jobs in
Yarn cluster mode not working
Key: SPARK-17512
URL: https://issues.apache.org/jira/browse/SPARK-17512
Project: Spark
Issue Type: Bug
Components: PySpark, Spark Submit
Affects Versions: 2.0.0
Reporter: Udit Mehrotra
When I run a Python application and specify a remote path for the extra files
to be included in the PYTHONPATH, using either the '--py-files' option or the
'spark.submit.pyFiles' configuration option in YARN cluster mode, I get the
following error:
Exception in thread "main" java.lang.IllegalArgumentException: Launching Python applications through spark-submit is currently only supported for local files: s3://xxxx/app.py
    at org.apache.spark.deploy.PythonRunner$.formatPath(PythonRunner.scala:104)
    at org.apache.spark.deploy.PythonRunner$$anonfun$formatPaths$3.apply(PythonRunner.scala:136)
    at org.apache.spark.deploy.PythonRunner$$anonfun$formatPaths$3.apply(PythonRunner.scala:136)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
    at org.apache.spark.deploy.PythonRunner$.formatPaths(PythonRunner.scala:136)
    at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$10.apply(SparkSubmit.scala:636)
    at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$10.apply(SparkSubmit.scala:634)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:634)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:158)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Here are sample commands which throw this error in Spark 2.0 (sparkApp.py
requires app.py):

spark-submit --deploy-mode cluster --py-files s3://xxxx/app.py s3://xxxx/sparkApp.py (works fine in 1.6)
spark-submit --deploy-mode cluster --conf spark.submit.pyFiles=s3://xxxx/app.py s3://xxxx/sparkApp1.py (not working in 1.6 either)
Both commands work fine if app.py is first downloaded locally and specified
with a local path. In earlier versions of Spark, remote files worked correctly
with the '--py-files' option (though not with the 'spark.submit.pyFiles'
configuration option). Now neither way works.
The comment at the following line states that 'non-local' paths should work in
YARN cluster mode, and that a separate validation specifically fails when
remote paths are used in YARN client mode:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L309
Then the following code gets triggered at the end of each run, irrespective of
whether we are using client or cluster mode, and internally validates that the
paths are local:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L634
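For reference, the effect of that check (PythonRunner.formatPath, the first
frame in the stack trace above) can be illustrated with a rough Python
analogue; this is an illustration of the observed behaviour, not Spark's
actual code, and the set of accepted schemes is assumed from the error:

```python
from urllib.parse import urlparse

def format_path(path):
    """Rough analogue of PythonRunner.formatPath: accept only local
    paths (no scheme, or a 'file:'/'local:' scheme) and reject anything
    remote such as s3:// or hdfs:// with the error seen above."""
    scheme = urlparse(path).scheme
    if scheme not in ("", "file", "local"):
        raise ValueError(
            "Launching Python applications through spark-submit is "
            "currently only supported for local files: " + path)
    return path

format_path("/home/hadoop/app.py")       # accepted: no scheme
format_path("file:/home/hadoop/app.py")  # accepted: local scheme
# format_path("s3://xxxx/app.py")        # would raise ValueError
```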
In earlier versions of Spark this validation was not triggered for the
'--py-files' option, because the arguments passed to '--py-files' were not
stored in the 'spark.submit.pyFiles' configuration for YARN. The following
code, newly added in 2.0, now stores them there, so the validation is
triggered even when files are specified through the '--py-files' option:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L545
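The effect of that change can be sketched with a small Python simulation; the
helper name is hypothetical, but the configuration key and error message are
the ones from this report:

```python
def validate_py_files(conf):
    """Analogue of the validation in SparkSubmit.prepareSubmitEnvironment:
    every entry in spark.submit.pyFiles must be a local path."""
    for path in conf.get("spark.submit.pyFiles", "").split(","):
        if path and "://" in path and not path.startswith(("file:", "local:")):
            raise ValueError(
                "Launching Python applications through spark-submit is "
                "currently only supported for local files: " + path)

# Pre-2.0 behaviour: --py-files arguments went to YARN without being
# copied into the conf key, so the validation had nothing to reject.
conf_old = {}
validate_py_files(conf_old)  # passes

# 2.0 behaviour: --py-files is now stored into spark.submit.pyFiles,
# so the same validation fires for remote files in cluster mode too.
conf_new = {"spark.submit.pyFiles": "s3://xxxx/app.py"}
# validate_py_files(conf_new)  # raises ValueError
```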
Additionally, the logic in the YARN client was changed to read the values
directly from the 'spark.submit.pyFiles' configuration instead of from
'--py-files' (as it did earlier):
https://github.com/apache/spark/commit/8ba2b7f28fee39c4839e5ea125bd25f5091a3a1e#diff-b050df3f55b82065803d6e83453b9706R543
So it is now broken whether we use '--py-files' or 'spark.submit.pyFiles',
since the validation is triggered in both cases, irrespective of whether we
use client or cluster mode with YARN.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)