[ https://issues.apache.org/jira/browse/TOREE-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16376776#comment-16376776 ]
Krzysztof Mierzejewski edited comment on TOREE-344 at 2/26/18 12:29 PM:
------------------------------------------------------------------------

The point is that for PySpark on *nix systems, Toree no longer forks worker processes from Java, but from Python. Technically, a new process is created with

{code}python -m pyspark.daemon{code}

Hence, PySpark must be available and importable on the executor nodes as well. This approach gave me some hassle with JupyterHub and Spark on YARN in a Hadoop cluster, with 2 executor nodes per notebook by default. Because I have not installed PySpark with pip / conda, but make use of the package bundled with Spark, the following steps resolved the issue in my case:
# Distribute the local _$HADOOP_HOME/spark/python/lib_ to all data nodes where YARN runs executors. I just did scp (see the sketch after this list).
# *Salient*: +for executors+, set the _PYTHONPATH_ environment variable to the zip packages in the local _$HADOOP_HOME/spark/python/lib_.

I wasted lots of time on the second point, trying to set the _PYTHONPATH_ environment variable by a variety of means, but to no avail. No matter what I did, the value was seemingly superseded by a mysterious culprit, and it turned out to be YARN. So I finally succeeded with the following configuration value in the _spark-defaults.conf_ file:

{code:java}spark.executorEnv.PYTHONPATH $HADOOP_HOME/spark/python/lib/py4j-0.10.4-src.zip:$HADOOP_HOME/spark/python/lib/pyspark.zip{code}

An interesting modification would be to upload these two _zip_ archives to HDFS and make use of the _spark.yarn.dist.files_ value in place of copying the files physically. But as of now I have no clue how to set _PYTHONPATH_ in such a case. Please leave a note if you find a way :)
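For reference, a minimal sketch of the distribution step above. The hostnames _datanode1_ / _datanode2_ are placeholders for whatever nodes YARN schedules executors on, and it assumes _$HADOOP_HOME_ points to the same path on the remote shells:

{code}
# copy the Spark-bundled PySpark libraries to each executor node (hostnames are placeholders)
for host in datanode1 datanode2; do
  scp -r "$HADOOP_HOME/spark/python/lib" "$host:$HADOOP_HOME/spark/python/"
done

# sanity check: can the remote Python import pyspark from the zip archives?
ssh datanode1 'PYTHONPATH="$HADOOP_HOME/spark/python/lib/py4j-0.10.4-src.zip:$HADOOP_HOME/spark/python/lib/pyspark.zip" python -c "import pyspark; print(pyspark.__file__)"'
{code}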
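As for the HDFS variant, a purely speculative, untested sketch: YARN localizes files listed in _spark.yarn.dist.files_ into each container's working directory, so relative _PYTHONPATH_ entries might resolve there. The _/user/spark/lib_ path is a placeholder:

{code}
# upload the two archives to HDFS once
hdfs dfs -mkdir -p /user/spark/lib
hdfs dfs -put $HADOOP_HOME/spark/python/lib/pyspark.zip /user/spark/lib/
hdfs dfs -put $HADOOP_HOME/spark/python/lib/py4j-0.10.4-src.zip /user/spark/lib/
{code}

and then in _spark-defaults.conf_:

{code}
spark.yarn.dist.files         hdfs:///user/spark/lib/pyspark.zip,hdfs:///user/spark/lib/py4j-0.10.4-src.zip
# relative entries, resolved against the container working directory (untested)
spark.executorEnv.PYTHONPATH  pyspark.zip:py4j-0.10.4-src.zip
{code}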
> No module named pyspark
> -----------------------
>
>          Key: TOREE-344
>          URL: https://issues.apache.org/jira/browse/TOREE-344
>      Project: TOREE
>   Issue Type: Bug
>     Reporter: haniar
>     Priority: Major
>
> I have installed Toree in my Jupyter environment
> (https://github.com/apache/incubator-toree) and written a piece of code that
> works with PySpark. YARN starts properly and I can see the containers
> running in the queue.
> When I run the code, I get the following error:
> Error from python worker:
>   /usr/local/bin/python2.7: No module named pyspark
> The kernel is set up as follows:
> {
>   "language": "python",
>   "display_name": "Apache Toree - PySpark",
>   "env": {
>     "__TOREE_SPARK_OPTS__": " --master yarn",
>     "SPARK_HOME": "/usr/hdp/2.4.2.0-258/spark",
>     "__TOREE_OPTS__": "",
>     "DEFAULT_INTERPRETER": "PySpark",
>     "PYTHONPATH": "/usr/hdp/2.4.2.0-258/spark/python:/usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip",
>     "PYTHON_EXEC": "python",
>     "PYTHONSTARTUP": "/usr/hdp/2.4.2.0-258/spark/python/pyspark/shell.py",
>     "PYSPARK_PYTHON": "/usr/local/bin/python2.7",
>     "PYSPARK_DRIVER_PYTHON": "/usr/local/bin/python2.7"
>   },
>   "argv": [
>     "/usr/local/share/jupyter/kernels/apache_toree_pyspark/bin/run.sh",
>     "--profile",
>     "{connection_file}"
>   ]
> }