Behroz Sikander created SPARK-24794:
---------------------------------------

             Summary: DriverWrapper should have both master addresses in -Dspark.master
                 Key: SPARK-24794
                 URL: https://issues.apache.org/jira/browse/SPARK-24794
             Project: Spark
          Issue Type: Bug
          Components: Deploy
    Affects Versions: 2.2.1
            Reporter: Behroz Sikander


In standalone cluster mode, a driver can be launched with supervise mode 
enabled. Spark launches the driver with the JVM argument -Dspark.master, which 
is set to the [host and port of the current 
master|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala#L149].

 

During the lifetime of the context, the active Spark master can change for any 
number of reasons. If the driver then dies unexpectedly and is restarted, it 
tries to connect to the master that was originally set in -Dspark.master, but 
that master is now in STANDBY mode. The context makes several attempts to 
connect to the standby master and then kills itself.

 

*Suggestion:*

While launching the driver process, the Spark master should use the [spark.master 
passed as 
input|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala#L124]
 instead of the host and port of the current master.
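
To illustrate the difference, here is a minimal Python model of the situation. It is not Spark code: the function names, the second master address (10.100.100.23) and the ALIVE/STANDBY lookup table are assumptions made up for this sketch. It relies only on the standalone HA convention that spark.master can list several masters as spark://host1:port1,host2:port2.

```python
from typing import Dict, List, Optional

# Hypothetical model, not Spark code: shows why a restarted driver needs
# every master address, not just the one that was active at submission.


def parse_master_urls(spark_master: str) -> List[str]:
    """Split 'spark://h1:p1,h2:p2' into one spark:// URL per master."""
    prefix = "spark://"
    if not spark_master.startswith(prefix):
        raise ValueError("not a standalone master URL: " + spark_master)
    return [prefix + host_port for host_port in spark_master[len(prefix):].split(",")]


def find_active_master(spark_master: str, master_states: Dict[str, str]) -> Optional[str]:
    """Return the first listed master whose state is ALIVE, else None."""
    for url in parse_master_urls(spark_master):
        if master_states.get(url) == "ALIVE":
            return url
    return None


# Assumed cluster state after a failover: the original master is now standby.
# 10.100.100.23 is an invented address for the second master.
states = {
    "spark://10.100.100.22:7077": "STANDBY",
    "spark://10.100.100.23:7077": "ALIVE",
}

# Current behaviour: only the master that was active at submission time.
single = "spark://10.100.100.22:7077"
# Suggested behaviour: the full spark.master value passed at submission.
both = "spark://10.100.100.22:7077,10.100.100.23:7077"

print(find_active_master(single, states))  # None
print(find_active_master(both, states))    # spark://10.100.100.23:7077
```

With only the original address the restarted driver has no live master to register with, which matches the "All masters are unresponsive" outcome; with both addresses it can reach whichever master is currently active.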

Log messages that we observe:

 
{code:java}
2018-07-11 13:03:21,801 INFO appclient-register-master-threadpool-0 
org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []: 
Connecting to master spark://10.100.100.22:7077..
.....
2018-07-11 13:03:21,806 INFO netty-rpc-connection-0 
org.apache.spark.network.client.TransportClientFactory []: Successfully created 
connection to /10.100.100.22:7077 after 1 ms (0 ms spent in bootstraps)
.....
2018-07-11 13:03:41,802 INFO appclient-register-master-threadpool-0 
org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []: 
Connecting to master spark://10.100.100.22:7077...
.....
2018-07-11 13:04:01,802 INFO appclient-register-master-threadpool-0 
org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []: 
Connecting to master spark://10.100.100.22:7077...
.....
2018-07-11 13:04:21,806 ERROR appclient-registration-retry-thread 
org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend []: Application 
has been killed. Reason: All masters are unresponsive! Giving up.{code}



