These tips were very helpful! By setting SPARK_MASTER_IP as you suggest, I
was able to make progress. Unfortunately, it is unclear to me how to
specify the hadoop-client dependency for a PySpark stand-alone application,
so I still get the EOFException, since I am using a non-default Hadoop
distribution (2.3.0-cdh5.0.0, distributed with CDH 5). The documentation
describes how to add a hadoop-client dependency to the pom.xml for a Java
application, but not for PySpark. To work around the EOFException, I
created a multi-node Hadoop cluster with version 1.0.4 (the default Hadoop
for Spark 0.9.1). This worked, and I was able to run a multi-node Spark job
successfully.
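
For reference, this is roughly what I ended up with in conf/spark-env.sh on
the master, following your suggestion (the exact value is of course specific
to my cluster):

    # conf/spark-env.sh on the master node
    # Use the fully qualified domain name rather than the short hostname,
    # since Akka appears to require FQDNs.
    export SPARK_MASTER_IP=$(hostname -f)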

The question remains, though: how do you specify a hadoop-client dependency
for a Python stand-alone Spark application (i.e., the equivalent of adding
it to the pom.xml for a Java Spark application)? Thanks!
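
For anyone following along, this is roughly the kind of snippet the docs
show for the Java case; the version string below is just the one matching my
CDH distribution (and the Cloudera Maven repository may also need to be
declared):

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.3.0-cdh5.0.0</version>
    </dependency>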

-T.J.


On Thu, May 29, 2014 at 4:04 AM, jaranda <jordi.ara...@bsc.es> wrote:

> I finally got it working. Main points:
>
> - I had to add the hadoop-client dependency to avoid a strange EOFException.
> - I had to set SPARK_MASTER_IP in conf/start-master.sh to hostname -f
> instead of hostname, since Akka does not seem to work properly with short
> host names or IPs; it requires fully qualified domain names.
> - I also set SPARK_MASTER_IP in conf/spark-env.sh to hostname -f so that
> the workers can reach the master.
> - Be sure that conf/slaves also contains fully qualified domain names.
> - It seems that both the master and the workers need to be able to reach
> the driver client; since I was on a VPN, this caused me a lot of trouble,
> and it took me some time to realize it.
>
> After making these changes, everything just worked like a charm!
>
>
>
