[ https://issues.apache.org/jira/browse/SPARK-17512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-17512:
------------------------------------

    Assignee:     (was: Apache Spark)

> Specifying remote files for Python based Spark jobs in YARN cluster mode not working
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-17512
>                 URL: https://issues.apache.org/jira/browse/SPARK-17512
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Submit, YARN
>    Affects Versions: 2.0.0
>            Reporter: Udit Mehrotra
>
> When I run a Python application and specify a remote path for the extra files to be included in the PYTHONPATH, using either the '--py-files' option or the 'spark.submit.pyFiles' configuration option in YARN cluster mode, I get the following error:
>
> Exception in thread "main" java.lang.IllegalArgumentException: Launching Python applications through spark-submit is currently only supported for local files: s3://xxxx/app.py
>     at org.apache.spark.deploy.PythonRunner$.formatPath(PythonRunner.scala:104)
>     at org.apache.spark.deploy.PythonRunner$$anonfun$formatPaths$3.apply(PythonRunner.scala:136)
>     at org.apache.spark.deploy.PythonRunner$$anonfun$formatPaths$3.apply(PythonRunner.scala:136)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>     at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>     at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>     at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>     at org.apache.spark.deploy.PythonRunner$.formatPaths(PythonRunner.scala:136)
>     at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$10.apply(SparkSubmit.scala:636)
>     at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$10.apply(SparkSubmit.scala:634)
>     at scala.Option.foreach(Option.scala:257)
>     at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:634)
>     at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:158)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> Here are sample commands which throw this error in Spark 2.0 (sparkApp.py requires app.py):
>
> spark-submit --deploy-mode cluster --py-files s3://xxxx/app.py s3://xxxx/sparkApp.py (works fine in 1.6)
> spark-submit --deploy-mode cluster --conf spark.submit.pyFiles=s3://xxxx/app.py s3://xxxx/sparkApp1.py (not working in 1.6)
>
> Both commands work fine if app.py is first downloaded locally and specified as a local path.
> In earlier versions of Spark this worked correctly with the '--py-files' option, though not with the 'spark.submit.pyFiles' configuration option. Now it does not work either way.
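>
> To make the failure concrete, here is a rough Scala sketch of the kind of check behind the stack trace above. The method name echoes PythonRunner.formatPath, but the body is a simplified assumption, not the actual Spark source:
>
>     import java.net.URI
>
>     // Simplified, assumed sketch of the local-only check: any py-file whose
>     // URI scheme is not local is rejected outright.
>     def formatPath(path: String): String = new URI(path).getScheme match {
>       case null | "file" | "local" => path // plain and local paths pass through
>       case _ =>
>         // This is the branch hit by s3://xxxx/app.py in the trace above.
>         throw new IllegalArgumentException(
>           "Launching Python applications through spark-submit is currently " +
>           s"only supported for local files: $path")
>     }
>
> Under this sketch, formatPath("s3://xxxx/app.py") reproduces the exception message above, while a plain local path such as formatPath("/home/hadoop/app.py") passes through untouched.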
> The following diff shows the comment stating that 'non-local' paths should work in YARN cluster mode, and that a separate validation is deliberately performed to fail if YARN client mode is used with remote paths:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L309
>
> And then this code gets triggered at the end of every run, irrespective of whether we are using client or cluster mode, and internally validates that the paths are local, failing for any remote path:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L634
>
> This validation was not triggered in earlier versions of Spark when using the '--py-files' option, because the arguments passed to '--py-files' were not stored in the 'spark.submit.pyFiles' configuration for YARN. However, the following code, newly added in 2.0, now stores them there, so the validation is triggered even when files are specified through '--py-files':
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L545
>
> Also, the logic in the YARN client was changed to read values directly from the 'spark.submit.pyFiles' configuration instead of from '--py-files' as before:
> https://github.com/apache/spark/commit/8ba2b7f28fee39c4839e5ea125bd25f5091a3a1e#diff-b050df3f55b82065803d6e83453b9706R543
>
> So it is now broken whether we use '--py-files' or 'spark.submit.pyFiles', since the validation is triggered in both cases, irrespective of whether we use client or cluster mode with YARN. A sketch of this interaction follows below.
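>
> Putting the pieces together, here is a simplified, assumed Scala sketch of the flow described above. The names echo SparkSubmit.prepareSubmitEnvironment, but this is illustrative, not the actual Spark source:
>
>     import java.net.URI
>     import scala.collection.mutable
>
>     // Assumed stand-in for the local-only check sketched earlier.
>     def assertLocal(path: String): String = {
>       val scheme = new URI(path).getScheme
>       if (scheme != null && scheme != "file" && scheme != "local") {
>         throw new IllegalArgumentException(
>           "Launching Python applications through spark-submit is currently " +
>           s"only supported for local files: $path")
>       }
>       path
>     }
>
>     // Illustrative flow: in 2.0, --py-files is copied into
>     // spark.submit.pyFiles (the L545 change above), and the local-only
>     // validation (the L634 code above) then runs for client and cluster
>     // mode alike, so remote paths fail either way.
>     def prepareSubmitEnvironment(pyFilesArg: Option[String],
>                                  conf: mutable.Map[String, String]): Unit = {
>       pyFilesArg.foreach(files => conf("spark.submit.pyFiles") = files)
>       conf.get("spark.submit.pyFiles").foreach { files =>
>         conf("spark.submit.pyFiles") =
>           files.split(",").map(assertLocal).mkString(",")
>       }
>     }
>
> Under this sketch, both --py-files s3://xxxx/app.py and --conf spark.submit.pyFiles=s3://xxxx/app.py feed the same validation, which matches the behaviour reported above: remote paths now fail regardless of which option is used and regardless of deploy mode.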