[ https://issues.apache.org/jira/browse/SPARK-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049929#comment-14049929 ]
Matthew Farrellee commented on SPARK-1030:
------------------------------------------
using the pyspark script to submit applications is deprecated in Spark 1.0 in favor of
spark-submit. I think this should be closed as resolved/workfix. /cc [~pwendell] [~joshrosen]
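for reference, a rough spark-submit equivalent of the command in the description could look like the sketch below (test1.py is the script from the report; the HADOOP_CONF_DIR path is an assumption about the cluster setup, and neither SPARK_JAR nor SPARK_YARN_APP_JAR should need to be set when going through spark-submit):
{code}
# sketch only: adjust HADOOP_CONF_DIR to wherever the cluster's Hadoop/YARN config lives
export HADOOP_CONF_DIR=/etc/hadoop/conf

# submit the same script through spark-submit instead of invoking pyspark directly
spark-submit --master yarn-client test1.py
{code}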
> unneeded file required when running pyspark program using yarn-client
> ---------------------------------------------------------------------
>
> Key: SPARK-1030
> URL: https://issues.apache.org/jira/browse/SPARK-1030
> Project: Spark
> Issue Type: Bug
> Components: Deploy, PySpark, YARN
> Affects Versions: 0.8.1
> Reporter: Diana Carroll
> Assignee: Josh Rosen
>
> I can successfully run a pyspark program using the yarn-client master using
> the following command:
> {code}
> SPARK_JAR=$SPARK_HOME/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.1-incubating-hadoop2.2.0.jar \
> SPARK_YARN_APP_JAR=~/testdata.txt pyspark \
>     test1.py
> {code}
> However, the SPARK_YARN_APP_JAR doesn't make any sense; it's a Python
> program, and therefore there's no JAR. If I don't set the value, or if I set
> the value to a non-existent file, Spark gives me an error message:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
> : org.apache.spark.SparkException: env SPARK_YARN_APP_JAR is not set
>     at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:46)
> {code}
> or
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
> : java.io.FileNotFoundException: File file:dummy.txt does not exist
>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
> {code}
> My program is very simple:
> {code}
> from pyspark import SparkContext
>
> def main():
>     sc = SparkContext("yarn-client", "Simple App")
>     logData = sc.textFile("hdfs://localhost/user/training/weblogs/2013-09-15.log")
>     numjpgs = logData.filter(lambda s: '.jpg' in s).count()
>     print "Number of JPG requests: " + str(numjpgs)
> {code}
> Although it reads the SPARK_YARN_APP_JAR file, it doesn't use the file at
> all; I can point it at anything, as long as it's a valid, accessible file,
> and it works the same.
> Although there's an obvious workaround for this bug, it's high priority from
> my perspective because I'm working on a course to teach people how to do
> this, and it's really hard to explain why this variable is needed!