[ https://issues.apache.org/jira/browse/SPARK-15969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kun Liu updated SPARK-15969:
----------------------------
    Remaining Estimate: 120h  (was: 168h)
     Original Estimate: 120h  (was: 168h)

> FileNotFoundException: Multiple arguments for py-files flag, (also jars) for
> spark-submit
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-15969
>                 URL: https://issues.apache.org/jira/browse/SPARK-15969
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Submit
>    Affects Versions: 1.5.0, 1.6.1
>         Environment: Mac OS X 10.11.5
>            Reporter: Kun Liu
>            Priority: Minor
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> This is my first JIRA issue and I am new to the Spark community, so please
> correct me if I am wrong. Thanks.
> A java.io.FileNotFoundException is thrown when multiple arguments are
> specified for the --py-files (also --jars) flag.
> I searched for a while but only found a similar issue on Windows:
> https://issues.apache.org/jira/browse/SPARK-6435
> My test environment was Mac OS X with Spark 1.5.0 and 1.6.1.
> 1.1 Observations:
> 1) Quoting the arguments makes no difference; the result is always the same.
> 2) The first path before the comma, as long as it is valid, is not a problem,
> whether it is an absolute or a relative path.
> 3) The second and further py-files paths are not a problem if ALL of them are:
>     a. relative paths under the working directory ($PWD); OR
>     b. specified with an environment variable at the beginning, e.g.
>        $ENV_VAR/path/to/file; OR
>     c. preprocessed by $(echo path/to/*.py | tr ' ' ','), whether absolute
>        or relative, as long as they are valid
> 4) The path of the driver program, assuming it is valid, does not matter, as
> it is a single file.
> 1.2 Experiments:
> Assume main.py calls functions from helper1.py and helper2.py, and all
> paths below are valid.
> ~/Desktop/testpath: main.py, helper1.py, helper2.py
> $SPARK_HOME/testpath: helper1.py, helper2.py
> 1) Successful output:
>     a. Multiple Python paths are relative paths under the working directory
>         cd $SPARK_HOME
>         bin/spark-submit --py-files testpath/helper1.py,testpath/helper2.py ~/Desktop/testpath/main.py
>         cd ~/Desktop
>         $SPARK_HOME/bin/spark-submit --py-files testpath/helper1.py,testpath/helper2.py testpath/main.py
>     b. Multiple Python paths are specified using an environment variable
>         export TEST_DIR=~/Desktop/testpath
>         cd ~
>         $SPARK_HOME/bin/spark-submit --py-files $TEST_DIR/helper1.py,$TEST_DIR/helper2.py ~/Desktop/testpath/main.py
>
>         cd ~/Documents
>         $SPARK_HOME/bin/spark-submit --py-files $TEST_DIR/helper1.py,$TEST_DIR/helper2.py ~/Desktop/testpath/main.py
>     c. Multiple paths (absolute or relative) after being preprocessed:
>         $SPARK_HOME/bin/spark-submit --py-files $(echo $SPARK_HOME/testpath/helper*.py | tr ' ' ',') ~/Desktop/testpath/main.py
>         cd ~/Desktop
>         $SPARK_HOME/bin/spark-submit --py-files $(echo testpath/helper*.py | tr ' ' ',') ~/Desktop/testpath/main.py
>         (reference link:
>         http://stackoverflow.com/questions/24855368/spark-throws-classnotfoundexception-when-using-jars-option)
> 2) Failure output: if the second Python path is an absolute one; the same
> problem happens for further paths
>         cd ~/Documents
>         $SPARK_HOME/bin/spark-submit --py-files ~/Desktop/testpath/helper1.py,~/Desktop/testpath/helper2.py ~/Desktop/testpath/main.py
>         py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
>         : java.io.FileNotFoundException: Added file file:/Users/kunliu/Documents/~/Desktop/testpath/helper2.py does not exist.
> 1.3 Conclusions
> I would suggest that the py-files flag of spark-submit support all absolute
> path arguments, not just relative paths under the working directory.
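
The path in the error message above, file:/Users/kunliu/Documents/~/Desktop/testpath/helper2.py, still contains a literal ~. That suggests the second path reached Spark unexpanded: POSIX shells expand a tilde only at the start of a word (and after = or : in assignments), so in a comma-separated list only the first ~ is replaced by the home directory. A minimal sketch of a workaround follows. It assumes a hypothetical helper named expand_path_list, which is not part of Spark, and expands every entry before the list is handed to spark-submit:

```python
import os


def expand_path_list(arg, cwd=None):
    """Expand '~' and make every entry of a comma-separated path list absolute.

    Hypothetical helper for illustration only; not part of Spark.
    """
    cwd = cwd if cwd is not None else os.getcwd()
    expanded = []
    for entry in arg.split(","):
        entry = entry.strip()
        if not entry:
            continue  # skip empty entries such as those left by a trailing comma
        path = os.path.expanduser(entry)  # '~/x' becomes '<home>/x'
        if not os.path.isabs(path):
            path = os.path.join(cwd, path)  # resolve relative entries against cwd
        expanded.append(os.path.normpath(path))
    return ",".join(expanded)


if __name__ == "__main__":
    # The failing argument from the report: every '~' is expanded, not only the first.
    print(expand_path_list("~/Desktop/testpath/helper1.py,~/Desktop/testpath/helper2.py"))
```

The returned string can be passed as the value of --py-files. Writing $HOME/... instead of ~/... on the command line should avoid the problem as well, since variables are expanded anywhere in a word; this would be consistent with observation 3) b.
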
> If necessary, I would like to submit a pull request and start working on it
> as my first contribution to the Spark community.
> 1.4 Note
> 1) I think the same issue will happen when multiple jar files delimited by
> commas are passed to the --jars flag for Java applications.
> 2) I suggest that wildcard path arguments also be supported, as indicated
> by https://issues.apache.org/jira/browse/SPARK-3451
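
Until wildcards are supported, the preprocessing from 1.2 c, $(echo path/to/*.py | tr ' ' ','), can also be done without relying on the shell. A rough sketch, assuming a hypothetical helper named glob_to_path_list (not a Spark API), expands a wildcard pattern into the comma-separated form that --py-files and --jars take:

```python
import glob
import os


def glob_to_path_list(pattern):
    """Return the files matching `pattern` as a comma-separated list of absolute paths.

    Hypothetical helper for illustration only; not a Spark API.
    """
    # Expand a leading '~' first, then let glob resolve the wildcard.
    matches = sorted(glob.glob(os.path.expanduser(pattern)))
    if not matches:
        raise FileNotFoundError("no files match pattern: " + pattern)
    return ",".join(os.path.abspath(m) for m in matches)
```

Unlike the echo | tr approach, this does not split file names that contain spaces. File names that contain commas would still be a problem, because the comma is the list separator.
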