[ 
https://issues.apache.org/jira/browse/SPARK-19770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe Olson updated SPARK-19770:
------------------------------
    Description: 
I am trying to submit the example SparkPi job to a 9-node Spark cluster running 
on Mesos. My spark-submit statement:

{quote}
./bin/spark-submit \
  --name "Test01:" \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://<IP Address>:7078 \
  --deploy-mode cluster \
  --executor-memory 16G \
  --executor-cores 1 \
  --driver-cores 10 \
  --driver-memory 48G \
  --num-executors 1 \
  file://mnt/ocarchive1/sstables/jars/spark-examples_2.11-2.1.0.jar \
  1000
{quote}
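One detail that may be worth double-checking in the command above: in a `file:` URI, the two slashes put the next token in the authority (host) position, so `file://mnt/...` is read as host `mnt` with path `/ocarchive1/...`; an absolute local path needs three slashes (`file:///mnt/...`). A small stdlib sketch of the parsing difference (whether Spark's URI handling trips on this in your case is an assumption to verify):

```python
from urllib.parse import urlparse

# With two slashes, "mnt" is consumed as the URI authority (host name),
# so the path that remains starts at /ocarchive1, not /mnt.
two = urlparse("file://mnt/ocarchive1/sstables/jars/spark-examples_2.11-2.1.0.jar")
three = urlparse("file:///mnt/ocarchive1/sstables/jars/spark-examples_2.11-2.1.0.jar")

print(two.netloc, two.path)    # mnt /ocarchive1/sstables/jars/spark-examples_2.11-2.1.0.jar
print(three.netloc, three.path)  # '' /mnt/ocarchive1/sstables/jars/spark-examples_2.11-2.1.0.jar
```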

When I do this, the job completes successfully. I can go into the stdout file 
on the driver machine, and see the "Pi is roughly...." output. 

However, on most of the slave machines, if I go into the stderr file for that 
same job, I see the following exceptions:

{quote}
17/02/28 10:20:59 ERROR RetryingBlockFetcher: Exception while beginning fetch 
of 1 outstanding blocks
java.io.IOException: Failed to connect to <machine 
name>/fe80:0:0:0:ec4:7aff:fea4:82e1%5:44791
{quote}

There appears to be an intermittent connectivity issue between some of the 9 
nodes. It is not consistent on any particular route between machines 
(sometimes nodes #2 and #7 can talk to each other, sometimes they cannot). 

How can I troubleshoot this? Network behavior seems normal otherwise.
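One clue in the stack trace: `fe80:0:0:0:ec4:7aff:fea4:82e1%5` is an IPv6 link-local address (the `%5` is an interface zone index). Link-local addresses are only reachable on the directly attached network segment, so if a node advertises one as its block-manager address, peers elsewhere in the cluster cannot connect to it, which would match the intermittent fetch failures. A quick stdlib check (the address is copied from the log; everything else here is illustrative):

```python
import ipaddress

# The address from the stack trace, minus the %5 zone index
# (ipaddress does not accept zone indexes in Python 3.8 and earlier).
addr = ipaddress.ip_address("fe80:0:0:0:ec4:7aff:fea4:82e1")
print(addr.is_link_local)  # True -> not routable beyond one segment
```

If that is what is happening, pinning each node to a routable address may help, e.g. setting `SPARK_LOCAL_IP` in `conf/spark-env.sh`, or adding the JVM flag `-Djava.net.preferIPv4Stack=true`; which of these applies to your setup is a guess from the log alone.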

Also, sometimes in the logs I'll see:

{quote}
17/02/28 10:20:54 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources
{quote}

However, my resource count in Mesos (via the UI) is accurate.
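The requested sizes may interact with that warning: each executor needs its JVM heap plus a memory overhead, and a single Mesos offer has to cover the whole amount on one agent. The constants below (384 MiB floor, 10% fraction) are my reading of the Spark 2.1 Mesos scheduler defaults, so treat them as an assumption for other versions:

```python
# Back-of-the-envelope: memory one Mesos agent must offer per executor.
# Overhead constants are assumed from Spark 2.1 Mesos defaults -- verify
# against your version before relying on the exact figure.
executor_mem_mib = 16 * 1024                       # --executor-memory 16G
overhead_mib = max(384, int(0.10 * executor_mem_mib))
total_mib = executor_mem_mib + overhead_mib

print(f"per-executor ask: {total_mib} MiB")        # 18022 MiB, ~17.6 GiB
```

As an aside, in Spark 2.1 `--num-executors` is, to my knowledge, only honored on YARN; on coarse-grained Mesos the total is bounded by `spark.cores.max` instead.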


> Running Example SparkPi Job on Mesos
> ------------------------------------
>
>                 Key: SPARK-19770
>                 URL: https://issues.apache.org/jira/browse/SPARK-19770
>             Project: Spark
>          Issue Type: Question
>          Components: Mesos, Spark Core, Spark Submit
>    Affects Versions: 2.1.0
>         Environment: spark-2.1.0-bin-hadoop2.3, mesos-1-1
>            Reporter: Joe Olson
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
