[ https://issues.apache.org/jira/browse/TOREE-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16376776#comment-16376776 ]
Krzysztof Mierzejewski edited comment on TOREE-344 at 2/26/18 12:29 PM:
------------------------------------------------------------------------

The point is that for PySpark on *nix systems, Toree no longer forks worker processes from Java, but from Python. Technically, a new process is created with

{code}python -m pyspark.daemon{code}

Hence, PySpark must be available and importable on the executor nodes as well. This approach gave me some hassle with JupyterHub and Spark on YARN in a Hadoop cluster, with 2 executor nodes per notebook by default. Because I have not installed PySpark with pip / conda, but make use of the package bundled with Spark, the following steps resolved the issue in my case:
# Distribute the local _$HADOOP_HOME/spark/python/lib_ to all data nodes where YARN runs executors. I just did scp (see the sketch after this list).
# *Salient*: +for executors+, set the _PYTHONPATH_ environment variable to the zip packages in the local _$HADOOP_HOME/spark/python/lib_.

I wasted lots of time on the second point, trying to set the _PYTHONPATH_ environment variable by a variety of means, but to no avail. No matter what I did, the value was seemingly superseded by a mysterious culprit, and it turned out to be YARN. So I finally succeeded with the following configuration value in the _spark-defaults.conf_ file:

{code:java}spark.executorEnv.PYTHONPATH $HADOOP_HOME/spark/python/lib/py4j-0.10.4-src.zip:$HADOOP_HOME/spark/python/lib/pyspark.zip{code}

An interesting modification would be to upload these two _zip_ archives to HDFS and make use of the _spark.yarn.dist.files_ value in place of copying the files physically. But as of now I have no clue how to set _PYTHONPATH_ in such a case. Please leave a note if you find a way :)
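For reference, a minimal sketch of the distribution step above. The hostnames _datanode1_ / _datanode2_ are placeholders for whatever nodes YARN schedules executors on, and it assumes _$HADOOP_HOME_ points to the same path on the remote shells:

{code}
# copy the Spark-bundled PySpark libraries to each executor node (hostnames are placeholders)
for host in datanode1 datanode2; do
  scp -r "$HADOOP_HOME/spark/python/lib" "$host:$HADOOP_HOME/spark/python/"
done

# sanity check: can the remote Python import pyspark from the zip archives?
ssh datanode1 'PYTHONPATH="$HADOOP_HOME/spark/python/lib/py4j-0.10.4-src.zip:$HADOOP_HOME/spark/python/lib/pyspark.zip" python -c "import pyspark; print(pyspark.__file__)"'
{code}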
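As for the HDFS variant, a purely speculative, untested sketch: YARN localizes files listed in _spark.yarn.dist.files_ into each container's working directory, so relative _PYTHONPATH_ entries might resolve there. The _/user/spark/lib_ path is a placeholder:

{code}
# upload the two archives to HDFS once
hdfs dfs -mkdir -p /user/spark/lib
hdfs dfs -put $HADOOP_HOME/spark/python/lib/pyspark.zip /user/spark/lib/
hdfs dfs -put $HADOOP_HOME/spark/python/lib/py4j-0.10.4-src.zip /user/spark/lib/
{code}

and then in _spark-defaults.conf_:

{code}
spark.yarn.dist.files         hdfs:///user/spark/lib/pyspark.zip,hdfs:///user/spark/lib/py4j-0.10.4-src.zip
# relative entries, resolved against the container working directory (untested)
spark.executorEnv.PYTHONPATH  pyspark.zip:py4j-0.10.4-src.zip
{code}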
> No module named pyspark
> -----------------------
>
>          Key: TOREE-344
>          URL: https://issues.apache.org/jira/browse/TOREE-344
>      Project: TOREE
>   Issue Type: Bug
>     Reporter: haniar
>     Priority: Major
>
> I have installed Toree in my Jupyter environment
> (https://github.com/apache/incubator-toree) and written a piece of code that
> works with PySpark. YARN starts properly and I can see the containers
> running in the queue.
> When I run the code, I get the following error:
> Error from python worker:
>   /usr/local/bin/python2.7: No module named pyspark
> The kernel is set up as follows:
> {
>   "language": "python",
>   "display_name": "Apache Toree - PySpark",
>   "env": {
>     "__TOREE_SPARK_OPTS__": " --master yarn",
>     "SPARK_HOME": "/usr/hdp/2.4.2.0-258/spark",
>     "__TOREE_OPTS__": "",
>     "DEFAULT_INTERPRETER": "PySpark",
>     "PYTHONPATH": "/usr/hdp/2.4.2.0-258/spark/python:/usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip",
>     "PYTHON_EXEC": "python",
>     "PYTHONSTARTUP": "/usr/hdp/2.4.2.0-258/spark/python/pyspark/shell.py",
>     "PYSPARK_PYTHON": "/usr/local/bin/python2.7",
>     "PYSPARK_DRIVER_PYTHON": "/usr/local/bin/python2.7"
>   },
>   "argv": [
>     "/usr/local/share/jupyter/kernels/apache_toree_pyspark/bin/run.sh",
>     "--profile",
>     "{connection_file}"
>   ]
> }