[ https://issues.apache.org/jira/browse/SPARK-4592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224108#comment-14224108 ]
Josh Rosen commented on SPARK-4592:
-----------------------------------
It looks like this is a bug that was introduced in [the
patch|https://github.com/apache/spark/pull/2828] for SPARK-3736. Prior to that
patch, a worker that became disassociated from a master would wait for a master
to initiate a reconnection. This behavior caused problems when a master failed
by stopping and restarting, since the restarted master would never know to
initiate a reconnection. That patch addressed this issue by having workers
attempt to reconnect to the master if they think they've become disconnected.
When reviewing that patch, I wrote a [long
comment|https://github.com/apache/spark/pull/2828#issuecomment-59602394] that
explains it in much more detail; that's a good reference for understanding its
motivation.
This change introduced a problem, though: there's now a race condition during
multi-master failover. In the old multi-master code, a worker never initiated a
reconnection attempt on its own; instead, it reconnected only after the new
active / primary master told it that it had become the new master. With the
addition of worker-initiated reconnect, there's a race condition where the
worker detects that it has become disconnected, goes down its list of known
masters and tries to register with each of them, successfully registers with
the new primary master, then receives a {{MasterChanged}} message and attempts
to register with that master again _even though it's already registered_,
causing a duplicate worker registration.
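To make the race concrete, here's a toy sketch of the two competing
registration paths (illustrative Scala only, not the actual Worker / Master
code; {{ToyMaster}}, {{registerWorker}}, and the worker ID are made up):
{code}
import scala.collection.mutable

// Toy stand-in for the master's registration table.
class ToyMaster {
  private val registered = mutable.Set[String]()
  def registerWorker(workerId: String): Either[String, Unit] =
    if (registered.add(workerId)) Right(()) else Left("Duplicate worker ID")
}

object DuplicateRegistrationRace extends App {
  val newPrimary = new ToyMaster
  val workerId = "worker-20141123022302-172.17.0.90-8888"

  // Path 1: the worker's own re-registration attempt reaches the new primary
  // master first and succeeds.
  println(newPrimary.registerWorker(workerId))  // Right(())

  // Path 2: the registration triggered by the master-change notification
  // arrives afterwards and is rejected, producing the "Duplicate worker ID"
  // error seen in the logs below.
  println(newPrimary.registerWorker(workerId))  // Left(Duplicate worker ID)
}
{code}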
There are a number of ways that we might fix this, but we have to be careful
because it seems likely that the worker-initiated reconnect could have
introduced other problems:
- What happens if a worker sends a reconnection attempt to a live master that
is not the new primary? Will that non-primary master reject or redirect the
registration, or will it register the worker and cause a split-brain scenario?
- The Worker is implemented as an actor and therefore does not synchronize
access to its internal state, since it assumes message-at-a-time processing.
The asynchronous re-registration timer thread may violate this assumption,
because it calls internal Worker methods directly instead of sending messages
to the Worker's own mailbox (see the sketch after this list).
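For the second point, one way to restore the message-at-a-time invariant would
be to have the timer only enqueue a message on the Worker's own mailbox rather
than call into the Worker directly. A rough sketch, assuming an Akka-style
actor ({{ReregisterWithMaster}}, {{WorkerSketch}}, and the method names are
illustrative, not the real Worker code):
{code}
import akka.actor.Actor
import scala.concurrent.duration._

case object ReregisterWithMaster

class WorkerSketch extends Actor {
  import context.dispatcher  // ExecutionContext for the scheduler

  // Problematic pattern: the scheduled callback runs on a scheduler thread and
  // mutates actor state with no synchronization:
  //   context.system.scheduler.scheduleOnce(10.seconds) { reregisterWithMaster() }

  // Safer pattern: the timer only sends a message; the state change happens
  // inside receive, one message at a time.
  context.system.scheduler.scheduleOnce(10.seconds) { self ! ReregisterWithMaster }

  private var connectionAttempts = 0

  private def reregisterWithMaster(): Unit = {
    // ... mutate internal state and resend RegisterWorker to the known masters ...
    connectionAttempts += 1
  }

  def receive = {
    case ReregisterWithMaster => reregisterWithMaster()
  }
}
{code}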
One simple fix might be to have the worker never initiate reconnection attempts
when running in a multi-master environment. I still need to think through
whether this will cause new problems similar to SPARK-3736, but I don't think
it will: that patch was motivated by cases where the master forgot who the
worker was and therefore couldn't initiate a reconnect. If the list of
registered workers is stored durably in ZooKeeper, such that a worker is never
told that it has registered until a record of its registration has been made
durable, then I think this is fine: if a live master thinks that a worker has
disconnected, it will initiate a reconnection; when a new master takes over
during failover, it will re-establish connections to all workers in the list
recovered from ZooKeeper.
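To sketch what that multi-master guard might look like (illustrative only;
{{masterUrls}} stands in for the Worker's configured master list, and the real
change would live in the Worker's re-registration logic):
{code}
// Illustrative policy for deciding whether the worker retries registration
// itself or waits for the master to contact it.
object ReregistrationPolicy {
  def shouldSelfInitiateReconnect(masterUrls: Seq[String]): Boolean = {
    // Multi-master (HA) mode: the recovered master reads the worker list from
    // ZooKeeper and contacts the workers itself, so the worker stays passive.
    // Single-master mode: a restarted master has no record of this worker, so
    // the worker must retry on its own (the SPARK-3736 behavior).
    masterUrls.size <= 1
  }
}

// For example:
//   shouldSelfInitiateReconnect(Seq("spark://172.17.0.86:7077"))  => true
//   shouldSelfInitiateReconnect(Seq("spark://172.17.0.86:7077",
//                                   "spark://172.17.0.87:7077"))  => false
{code}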
> "Worker registration failed: Duplicate worker ID" error during Master failover
> ------------------------------------------------------------------------------
>
> Key: SPARK-4592
> URL: https://issues.apache.org/jira/browse/SPARK-4592
> Project: Spark
> Issue Type: Bug
> Components: Deploy, Spark Core
> Affects Versions: 1.2.0
> Reporter: Josh Rosen
> Assignee: Josh Rosen
> Priority: Blocker
>
> When running Spark Standalone in high-availability mode, we sometimes see
> "Worker registration failed: Duplicate worker ID" errors which prevent
> workers from reconnecting to the new active master. I've attached full logs
> from a reproduction in my integration tests suite (which runs something
> similar to Spark's FaultToleranceTest). Here's the relevant excerpt from a
> worker log during a failed run of the "rolling outage" test, which creates a
> multi-master cluster then repeatedly kills the active master, waits for
> workers to reconnect to a new active master, then kills that master, and so
> on.
> {code}
> 14/11/23 02:23:02 INFO WorkerWebUI: Started WorkerWebUI at
> http://172.17.0.90:8081
> 14/11/23 02:23:02 INFO Worker: Connecting to master
> spark://172.17.0.86:7077...
> 14/11/23 02:23:02 INFO Worker: Connecting to master
> spark://172.17.0.87:7077...
> 14/11/23 02:23:02 INFO Worker: Connecting to master
> spark://172.17.0.88:7077...
> 14/11/23 02:23:02 INFO Worker: Successfully registered with master
> spark://172.17.0.86:7077
> 14/11/23 02:23:03 INFO Worker: Asked to launch executor
> app-20141123022303-0000/1 for spark-integration-tests
> 14/11/23 02:23:03 INFO ExecutorRunner: Launch command: "java" "-cp"
> "::/opt/sparkconf:/opt/spark/assembly/target/scala-2.10/spark-assembly-1.2.1-SNAPSHOT-hadoop1.0.4.jar"
> "-XX:MaxPermSize=128m" "-Dspark.driver.port=51271" "-Xms512M" "-Xmx512M"
> "org.apache.spark.executor.CoarseGrainedExecutorBackend"
> "akka.tcp://[email protected]:51271/user/CoarseGrainedScheduler"
> "1" "172.17.0.90" "8" "app-20141123022303-0000"
> "akka.tcp://[email protected]:8888/user/Worker"
> 14/11/23 02:23:14 INFO Worker: Disassociated
> [akka.tcp://[email protected]:8888] ->
> [akka.tcp://[email protected]:7077] Disassociated !
> 14/11/23 02:23:14 ERROR Worker: Connection to master failed! Waiting for
> master to reconnect...
> 14/11/23 02:23:14 INFO Worker: Connecting to master
> spark://172.17.0.86:7077...
> 14/11/23 02:23:14 INFO Worker: Connecting to master
> spark://172.17.0.87:7077...
> 14/11/23 02:23:14 INFO Worker: Connecting to master
> spark://172.17.0.88:7077...
> 14/11/23 02:23:14 WARN ReliableDeliverySupervisor: Association with remote
> system [akka.tcp://[email protected]:7077] has failed, address is now
> gated for [5000] ms. Reason is: [Disassociated].
> 14/11/23 02:23:14 INFO Worker: Disassociated
> [akka.tcp://[email protected]:8888] ->
> [akka.tcp://[email protected]:7077] Disassociated !
> 14/11/23 02:23:14 ERROR Worker: Connection to master failed! Waiting for
> master to reconnect...
> 14/11/23 02:23:14 INFO RemoteActorRefProvider$RemoteDeadLetterActorRef:
> Message [org.apache.spark.deploy.DeployMessages$RegisterWorker] from
> Actor[akka://sparkWorker/user/Worker#-1246122173] to
> Actor[akka://sparkWorker/deadLetters] was not delivered. [1] dead letters
> encountered. This logging can be turned off or adjusted with configuration
> settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/11/23 02:23:14 INFO Worker: Not spawning another attempt to register with
> the master, since there is an attempt scheduled already.
> 14/11/23 02:23:14 INFO LocalActorRef: Message
> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
> Actor[akka://sparkWorker/deadLetters] to
> Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40172.17.0.86%3A7077-2#343365613]
> was not delivered. [2] dead letters encountered. This logging can be turned
> off or adjusted with configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
> 14/11/23 02:23:25 INFO Worker: Retrying connection to master (attempt # 1)
> 14/11/23 02:23:25 INFO Worker: Connecting to master
> spark://172.17.0.86:7077...
> 14/11/23 02:23:25 INFO Worker: Connecting to master
> spark://172.17.0.87:7077...
> 14/11/23 02:23:25 INFO Worker: Connecting to master
> spark://172.17.0.88:7077...
> 14/11/23 02:23:36 INFO Worker: Retrying connection to master (attempt # 2)
> 14/11/23 02:23:36 INFO Worker: Connecting to master
> spark://172.17.0.86:7077...
> 14/11/23 02:23:36 INFO Worker: Connecting to master
> spark://172.17.0.87:7077...
> 14/11/23 02:23:36 INFO Worker: Connecting to master
> spark://172.17.0.88:7077...
> 14/11/23 02:23:42 INFO Worker: Master has changed, new master is at
> spark://172.17.0.87:7077
> 14/11/23 02:23:47 INFO Worker: Retrying connection to master (attempt # 3)
> 14/11/23 02:23:47 INFO Worker: Connecting to master
> spark://172.17.0.86:7077...
> 14/11/23 02:23:47 INFO Worker: Connecting to master
> spark://172.17.0.87:7077...
> 14/11/23 02:23:47 INFO Worker: Connecting to master
> spark://172.17.0.88:7077...
> 14/11/23 02:23:47 ERROR Worker: Worker registration failed: Duplicate worker
> ID
> 14/11/23 02:23:47 INFO ExecutorRunner: Killing process!
> {code}