[
https://issues.apache.org/jira/browse/SPARK-24794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549019#comment-16549019
]
Apache Spark commented on SPARK-24794:
--------------------------------------
User 'bsikander' has created a pull request for this issue:
https://github.com/apache/spark/pull/21816
> DriverWrapper should have both master addresses in -Dspark.master
> -----------------------------------------------------------------
>
> Key: SPARK-24794
> URL: https://issues.apache.org/jira/browse/SPARK-24794
> Project: Spark
> Issue Type: Bug
> Components: Deploy
> Affects Versions: 2.2.1
> Reporter: Behroz Sikander
> Priority: Major
>
> In standalone cluster mode, one can launch a driver with supervise mode
> enabled. Spark launches the driver with a JVM argument -Dspark.master, which
> is set to the [host and port of the current
> master|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala#L149].
>
> During the life of the context, the Spark masters can switch for any reason.
> If the driver then dies unexpectedly and is relaunched, it tries to connect
> to the master that was set initially via -Dspark.master, but that master may
> now be in STANDBY mode. The context tries multiple times to connect to the
> standby and then just kills itself.
>
> *Suggestion:*
> While launching the driver process, the Spark master should use the
> [spark.master passed as
> input|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala#L124],
> which can list both master addresses, instead of the host and port of the
> current master.
> Log messages that we observe:
>
> {code:java}
> 2018-07-11 13:03:21,801 INFO appclient-register-master-threadpool-0
> org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []:
> Connecting to master spark://10.100.100.22:7077..
> .....
> 2018-07-11 13:03:21,806 INFO netty-rpc-connection-0
> org.apache.spark.network.client.TransportClientFactory []: Successfully
> created connection to /10.100.100.22:7077 after 1 ms (0 ms spent in
> bootstraps)
> .....
> 2018-07-11 13:03:41,802 INFO appclient-register-master-threadpool-0
> org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []:
> Connecting to master spark://10.100.100.22:7077...
> .....
> 2018-07-11 13:04:01,802 INFO appclient-register-master-threadpool-0
> org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []:
> Connecting to master spark://10.100.100.22:7077...
> .....
> 2018-07-11 13:04:21,806 ERROR appclient-registration-retry-thread
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend []: Application
> has been killed. Reason: All masters are unresponsive! Giving up.{code}
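
Spark's standalone master URL syntax accepts a comma-separated list of masters for HA deployments (`spark://host1:port1,host2:port2`), which is why passing the submitted spark.master through unchanged would let a restarted driver retry both masters instead of only the one that was active at launch. A minimal sketch of how such a URL splits into candidate masters (the hostnames below are hypothetical, and this is illustrative Python, not Spark's actual Scala parsing):

```python
def parse_master_url(url: str) -> list[tuple[str, int]]:
    """Split a standalone HA master URL into (host, port) candidates."""
    prefix = "spark://"
    if not url.startswith(prefix):
        raise ValueError(f"not a standalone master URL: {url}")
    candidates = []
    for entry in url[len(prefix):].split(","):
        # rpartition on ":" keeps the port separate from the host
        host, _, port = entry.rpartition(":")
        candidates.append((host, int(port)))
    return candidates

# A relaunched driver could then try each candidate in turn rather than
# being pinned to the single master it was originally launched against.
masters = parse_master_url("spark://10.100.100.22:7077,10.100.100.23:7077")
```

With both addresses present, a connection failure against the standby (as in the log excerpt above) would be followed by an attempt against the other master rather than giving up.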
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]