[
https://issues.apache.org/jira/browse/SPARK-19900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sergey updated SPARK-19900:
---------------------------
Description:
I've found a problem when the node where the driver is running has an unstable network: a situation is possible in which two identical applications end up running on the cluster.
*Steps to Reproduce:*
# Prepare 3 nodes: one for the Spark master and two for the Spark workers.
# Submit an application with the parameter spark.driver.supervise = true (an example submit command is shown after these steps).
# Go to the node where the driver is running (for example spark-worker-1) and block port 7077
{code}
# iptables -A OUTPUT -p tcp --dport 7077 -j DROP
{code}
# Wait more than 60 seconds
# Look at the Spark master UI
There are two Spark applications and one driver. The new application is in the WAITING state and the original application is in the RUNNING state. The driver is in the RUNNING or RELAUNCHING state (depending on the available resources, as I understand it) and it is launched on another node (for example spark-worker-2)
# Open the port again
{code}
# iptables -D OUTPUT -p tcp --dport 7077 -j DROP
{code}
# Look at the Spark UI again
There are no changes
In addition, if you look at the processes on the node spark-worker-1
{code}
# ps ax | grep spark
{code}
you will see that the old driver process is still running!
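For reference, a submit command along these lines sets up this scenario, assuming standalone cluster deploy mode (supervision only applies there); the master URL, class name and jar path are placeholders:
{code}
# Example submit command (placeholders: master URL, class name, jar path).
# --supervise is equivalent to spark.driver.supervise=true and only takes
# effect together with --deploy-mode cluster against a standalone master.
spark-submit \
  --master spark://spark-master:7077 \
  --deploy-mode cluster \
  --supervise \
  --class com.example.TestApplication \
  /path/to/test-application.jar
{code}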
*Spark master logs:*
{code}
17/03/10 05:26:27 WARN Master: Removing
worker-20170310052240-spark-worker-1-35039 because we got no heartbeat in 60
seconds
17/03/10 05:26:27 INFO Master: Removing worker
worker-20170310052240-spark-worker-1-35039 on spark-worker-1:35039
17/03/10 05:26:27 INFO Master: Telling app of lost executor: 1
17/03/10 05:26:27 INFO Master: Telling app of lost executor: 0
17/03/10 05:26:27 INFO Master: Re-launching driver-20170310052347-0000
17/03/10 05:26:27 INFO Master: Launching driver driver-20170310052347-0000 on
worker worker-20170310052411-spark-worker-2-40473
17/03/10 05:26:35 INFO Master: Registering app TestApplication
17/03/10 05:26:35 INFO Master: Registered app TestApplication with ID
app-20170310052635-0001
17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker
worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker
worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker
worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker
worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
17/03/10 05:31:07 WARN Master: Got status update for unknown executor
app-20170310052354-0000/1
17/03/10 05:31:07 WARN Master: Got status update for unknown executor
app-20170310052354-0000/0
17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker
worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker
worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker
worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker
worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker
worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker
worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker
worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker
worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker
worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker
worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker
worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker
worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker
worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker
worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker
worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker
worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker
worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker
worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
17/03/10 05:31:07 INFO Master: Registering worker spark-worker-1:35039 with 8
cores, 10.8 GB RAM
17/03/10 05:31:07 INFO Master: Launching executor app-20170310052354-0000/4 on
worker worker-20170310052240-spark-worker-1-35039
17/03/10 05:31:07 INFO Master: Launching executor app-20170310052354-0000/5 on
worker worker-20170310052240-spark-worker-1-35039
{code}
I expect the following behaviour:
# When the driver is relaunched, a new application should not be registered, or the old application should be removed
# The old driver process should be killed (a manual cleanup sketch is shown below)
Please correct me if I misunderstand something or have missed some settings.
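As a manual workaround for the second point, the orphaned driver can currently be cleaned up by hand on the old node. This is only a sketch; the class name in the grep pattern assumes the driver was started through the standard standalone cluster-mode launcher and may need adjusting:
{code}
# On spark-worker-1: find the leftover driver JVM and kill it.
# DriverWrapper is how standalone cluster-mode drivers usually appear in the
# process list; adjust the pattern if your setup differs.
ps ax | grep org.apache.spark.deploy.worker.DriverWrapper | grep -v grep
kill <pid-of-old-driver>
{code}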
> [Standalone] Master registers application again when driver relaunched
> ----------------------------------------------------------------------
>
> Key: SPARK-19900
> URL: https://issues.apache.org/jira/browse/SPARK-19900
> Project: Spark
> Issue Type: Bug
> Components: Deploy, Spark Core
> Affects Versions: 1.6.2
> Environment: Centos 6.5, spark standalone
> Reporter: Sergey
> Labels: Spark, network, standalone, supervise
>