The error is showing your PYTHONPATH as just /disk3/local/filecache/103/spark-assembly.jar, and Toree is looking for the pyspark module on your PYTHONPATH.
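A quick way to confirm this is to log on to one of the data nodes and check whether pyspark becomes importable once $SPARK_HOME/python is on the path. A rough sketch (the SPARK_HOME and python paths are taken from your install command and the error message; the py4j zip name is an assumption, so adjust it to whatever is actually in $SPARK_HOME/python/lib):

    # Run on a data node (e.g. the one named in the error). Paths assumed from your setup.
    export SPARK_HOME=/usr/iop/current/spark-client
    ls "$SPARK_HOME"/python/pyspark/__init__.py        # is the pyspark source shipped here?
    ls "$SPARK_HOME"/python/lib/py4j-*.zip             # py4j also has to be on PYTHONPATH
    export PYTHONPATH="$SPARK_HOME/python:$(ls "$SPARK_HOME"/python/lib/py4j-*-src.zip)"
    /home/biadmin/anaconda2/bin/python2.7 -c 'import pyspark; print(pyspark.__file__)'

If that import fails, the pyspark sources aren't where the interpreter expects them; if it succeeds, the problem is that this PYTHONPATH isn't reaching the Python workers.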
https://github.com/apache/incubator-toree/blob/master/pyspark-interpreter/src/main/scala/org/apache/toree/kernel/interpreter/pyspark/PySparkProcess.scala#L78

That code shows where we augment the existing PYTHONPATH to include $SPARK_HOME/python/, which is where we search for your pyspark distribution. Your PYTHONPATH isn't even showing $SPARK_HOME/python/ being added, which is also troubling.

On Wed, Dec 14, 2016 at 9:41 AM chris snow <[email protected]> wrote:

> I'm trying to setup toree as follows:
>
> CLUSTER_NAME=$(curl -s -k -u $BI_USER:$BI_PASS -X GET https://${BI_HOST}:9443/api/v1/clusters | python -c 'import sys, json; print(json.load(sys.stdin)["items"][0]["Clusters"]["cluster_name"]);')
> echo Cluster Name: $CLUSTER_NAME
>
> CLUSTER_HOSTS=$(curl -s -k -u $BI_USER:$BI_PASS -X GET https://${BI_HOST}:9443/api/v1/clusters/${CLUSTER_NAME}/hosts | python -c 'import sys, json; items = json.load(sys.stdin)["items"]; hosts = [ item["Hosts"]["host_name"] for item in items ]; print(" ".join(hosts));')
> echo Cluster Hosts: $CLUSTER_HOSTS
>
> wget -c https://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh
>
> # Install anaconda if it isn't already installed
> [[ -d anaconda2 ]] || bash Anaconda2-4.1.1-Linux-x86_64.sh -b
>
> # check toree is available, if not install it
> ./anaconda2/bin/python -c 'import toree' || ./anaconda2/bin/pip install toree
>
> # Install toree
> ./anaconda2/bin/jupyter toree install \
>     --spark_home=/usr/iop/current/spark-client/ \
>     --user --interpreters Scala,PySpark,SparkR \
>     --spark_opts="--master yarn" \
>     --python_exec=${HOME}/anaconda2/bin/python2.7
>
> # Install anaconda on all of the cluster nodes
> for CLUSTER_HOST in ${CLUSTER_HOSTS};
> do
>     if [[ "$CLUSTER_HOST" != "$BI_HOST" ]];
>     then
>         echo "*** Processing $CLUSTER_HOST ***"
>         ssh $BI_USER@$CLUSTER_HOST "wget -q -c https://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh"
>         ssh $BI_USER@$CLUSTER_HOST "[[ -d anaconda2 ]] || bash Anaconda2-4.1.1-Linux-x86_64.sh -b"
>
>         # You can install your pip modules on each node using something like this:
>         # ssh $BI_USER@$CLUSTER_HOST "${HOME}/anaconda2/bin/python -c 'import yourlibrary' || ${HOME}/anaconda2/pip install yourlibrary"
>     fi
> done
>
> echo 'Finished installing'
>
> However, when I try to run a pyspark job I get the following error:
>
> Name: org.apache.toree.interpreter.broker.BrokerException
> Message: Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, bi4c-xxxx-data-3.bi.services.bluemix.net): org.apache.spark.SparkException:
> Error from python worker:
>     /home/biadmin/anaconda2/bin/python2.7: No module named pyspark
> PYTHONPATH was:
>     /disk3/local/filecache/103/spark-assembly.jar
> java.io.EOFException
>
> Any ideas what is going wrong?
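One more note, below the quoted script: the failure above is on the YARN worker side, so even once the kernel's own PYTHONPATH is right, the workers need to see the same paths. A possible workaround to experiment with (not the Toree-side fix itself, and the py4j zip version here is an assumption) is to pass the path through to the executors in spark_opts, roughly:

    # Hypothetical workaround: propagate PYTHONPATH to the YARN executors via spark_opts.
    # Adjust the py4j version to match what is in $SPARK_HOME/python/lib on your cluster.
    ./anaconda2/bin/jupyter toree install \
        --spark_home=/usr/iop/current/spark-client/ \
        --user --interpreters Scala,PySpark,SparkR \
        --python_exec=${HOME}/anaconda2/bin/python2.7 \
        --spark_opts="--master yarn --conf spark.executorEnv.PYTHONPATH=/usr/iop/current/spark-client/python:/usr/iop/current/spark-client/python/lib/py4j-0.9-src.zip"

Whether that is needed depends on how your Spark build distributes pyspark to the executors, so treat it as something to try rather than a definitive fix.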
