Your error is showing your PYTHONPATH as
/disk3/local/filecache/103/spark-assembly.jar. Toree looks for pyspark
on your PYTHONPATH.

https://github.com/apache/incubator-toree/blob/master/pyspark-interpreter/src/main/scala/org/apache/toree/kernel/interpreter/pyspark/PySparkProcess.scala#L78

That code shows Toree augmenting the existing PYTHONPATH to include
$SPARK_HOME/python/, which is where it searches for your pyspark distribution.
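
In shell terms, the effect is roughly the following (a sketch of the idea,
not the actual Scala, using the spark_home from your install command):

    # Roughly what the kernel intends for the Python worker's environment
    export SPARK_HOME=/usr/iop/current/spark-client
    export PYTHONPATH="$SPARK_HOME/python/:$PYTHONPATH"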

Your PYTHONPATH doesn't even show $SPARK_HOME/python/ being added, which
is also troubling.
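
As a quick check, on the node where the task failed you could confirm that
your Anaconda interpreter can import pyspark once those directories are on
the path. A minimal sketch, assuming the /usr/iop/current/spark-client
layout from your install command and the python2.7 path from your error
(the py4j zip name will vary with your Spark version):

    SPARK_HOME=/usr/iop/current/spark-client
    ls "$SPARK_HOME/python/lib/"   # expect pyspark.zip and a py4j-*-src.zip here
    PYTHONPATH="$SPARK_HOME/python:$(echo "$SPARK_HOME"/python/lib/py4j-*-src.zip)" \
      /home/biadmin/anaconda2/bin/python2.7 -c 'import pyspark; print(pyspark.__file__)'

If that import works, the next thing I would look at is whether the executor
environment is actually picking up that path, e.g. by adding
--conf spark.executorEnv.PYTHONPATH=... to your --spark_opts, but the missing
$SPARK_HOME/python/ entry above is the first thing to chase down.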

On Wed, Dec 14, 2016 at 9:41 AM chris snow <[email protected]> wrote:

> I'm trying to set up toree as follows:
>
>     CLUSTER_NAME=$(curl -s -k -u $BI_USER:$BI_PASS -X GET https://${BI_HOST}:9443/api/v1/clusters \
>         | python -c 'import sys, json; print(json.load(sys.stdin)["items"][0]["Clusters"]["cluster_name"]);')
>     echo Cluster Name: $CLUSTER_NAME
>
>     CLUSTER_HOSTS=$(curl -s -k -u $BI_USER:$BI_PASS -X GET https://${BI_HOST}:9443/api/v1/clusters/${CLUSTER_NAME}/hosts \
>         | python -c 'import sys, json; items = json.load(sys.stdin)["items"]; hosts = [ item["Hosts"]["host_name"] for item in items ]; print(" ".join(hosts));')
>     echo Cluster Hosts: $CLUSTER_HOSTS
>
>     wget -c https://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh
>
>     # Install anaconda if it isn't already installed
>     [[ -d anaconda2 ]] || bash Anaconda2-4.1.1-Linux-x86_64.sh -b
>
>     # check toree is available, if not install it
>     ./anaconda2/bin/python -c 'import toree' || ./anaconda2/bin/pip install toree
>
>     # Install toree
>     ./anaconda2/bin/jupyter toree install \
>             --spark_home=/usr/iop/current/spark-client/ \
>             --user --interpreters Scala,PySpark,SparkR  \
>             --spark_opts="--master yarn" \
>             --python_exec=${HOME}/anaconda2/bin/python2.7
>
>     # Install anaconda on all of the cluster nodes
>     for CLUSTER_HOST in ${CLUSTER_HOSTS};
>     do
>        if [[ "$CLUSTER_HOST" != "$BI_HOST" ]];
>        then
>           echo "*** Processing $CLUSTER_HOST ***"
>           ssh $BI_USER@$CLUSTER_HOST "wget -q -c https://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh";
>           ssh $BI_USER@$CLUSTER_HOST "[[ -d anaconda2 ]] || bash Anaconda2-4.1.1-Linux-x86_64.sh -b"
>
>           # You can install your pip modules on each node using something like this:
>           # ssh $BI_USER@$CLUSTER_HOST "${HOME}/anaconda2/bin/python -c 'import yourlibrary' || ${HOME}/anaconda2/bin/pip install yourlibrary"
>        fi
>     done
>
>     echo 'Finished installing'
>
> However, when I try to run a pyspark job I get the following error:
>
>     Name: org.apache.toree.interpreter.broker.BrokerException
>     Message: Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
>     : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, bi4c-xxxx-data-3.bi.services.bluemix.net): org.apache.spark.SparkException:
>     Error from python worker:
>       /home/biadmin/anaconda2/bin/python2.7: No module named pyspark
>     PYTHONPATH was:
>       /disk3/local/filecache/103/spark-assembly.jar
>     java.io.EOFException
>
> Any ideas what is going wrong?
>
