[ 
https://issues.apache.org/jira/browse/SPARK-4592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224194#comment-14224194
 ] 

Andrew Or commented on SPARK-4592:
----------------------------------

I submitted a fix at https://github.com/apache/spark/pull/3447; see the PR 
description for details. Note that even with this fix, the worker is still 
subject to more obscure race conditions. For instance:

1. Master A dies, and the worker tries to reconnect.
2. Master B comes up and notifies the worker.
3. Master A comes back up, and the worker successfully re-registers with 
Master A.
4. The worker then also receives the notification from Master B, and now it 
listens to both masters.

As noted in the PR description, these race conditions are much less likely than 
the one this issue is trying to fix.
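
To make the interleaving above concrete, here is a toy replay of the four 
steps. It is only a sketch under stated assumptions: the event names 
("lost", "registered", "master_changed") and the replay() helper are 
hypothetical and are not Spark's Worker code or the change in the PR. It 
assumes the worker simply starts listening to any master that either accepts 
its registration or announces itself.

```python
# Toy model only: event names and replay() are hypothetical, not Spark APIs.

def replay(events):
    """Return the set of masters the toy worker listens to after the events.

    Events are (kind, master) tuples in the order the worker observes them.
    """
    masters = set()
    for kind, master in events:
        if kind == "lost":
            # The worker notices the master is gone and starts retrying.
            masters.discard(master)
        elif kind in ("registered", "master_changed"):
            # Assumption: either event makes the worker listen to that master.
            masters.add(master)
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return masters


# Steps 1-4 above, in the order the worker observes them: Master B's
# notification (sent in step 2) is only delivered after the worker has
# already re-registered with Master A (step 3).
RACY = [("lost", "A"), ("registered", "A"), ("master_changed", "B")]

# The common case: B's notification arrives while the worker is still
# disconnected, and the worker never re-registers with A.
BENIGN = [("lost", "A"), ("master_changed", "B")]

print(sorted(replay(RACY)))    # ['A', 'B']
print(sorted(replay(BENIGN)))  # ['B']
```

The racy ordering ends with the worker attached to both masters, while the 
common ordering ends with only the new master; the difference is purely the 
delivery order of the notification relative to the successful retry.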

> "Worker registration failed: Duplicate worker ID" error during Master failover
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-4592
>                 URL: https://issues.apache.org/jira/browse/SPARK-4592
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy, Spark Core
>    Affects Versions: 1.2.0
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>            Priority: Blocker
>         Attachments: log.txt
>
>
> When running Spark Standalone in high-availability mode, we sometimes see 
> "Worker registration failed: Duplicate worker ID" errors, which prevent 
> workers from reconnecting to the new active master. I've attached full logs 
> from a reproduction in my integration test suite (which runs something 
> similar to Spark's FaultToleranceTest). Here's the relevant excerpt from a 
> worker log during a failed run of the "rolling outage" test, which creates a 
> multi-master cluster, then repeatedly kills the active master, waits for 
> workers to reconnect to a new active master, kills that master, and so on.
> {code}
> 14/11/23 02:23:02 INFO WorkerWebUI: Started WorkerWebUI at http://172.17.0.90:8081
> 14/11/23 02:23:02 INFO Worker: Connecting to master spark://172.17.0.86:7077...
> 14/11/23 02:23:02 INFO Worker: Connecting to master spark://172.17.0.87:7077...
> 14/11/23 02:23:02 INFO Worker: Connecting to master spark://172.17.0.88:7077...
> 14/11/23 02:23:02 INFO Worker: Successfully registered with master spark://172.17.0.86:7077
> 14/11/23 02:23:03 INFO Worker: Asked to launch executor app-20141123022303-0000/1 for spark-integration-tests
> 14/11/23 02:23:03 INFO ExecutorRunner: Launch command: "java" "-cp" "::/opt/sparkconf:/opt/spark/assembly/target/scala-2.10/spark-assembly-1.2.1-SNAPSHOT-hadoop1.0.4.jar" "-XX:MaxPermSize=128m" "-Dspark.driver.port=51271" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://[email protected]:51271/user/CoarseGrainedScheduler" "1" "172.17.0.90" "8" "app-20141123022303-0000" "akka.tcp://[email protected]:8888/user/Worker"
> 14/11/23 02:23:14 INFO Worker: Disassociated [akka.tcp://[email protected]:8888] -> [akka.tcp://[email protected]:7077] Disassociated !
> 14/11/23 02:23:14 ERROR Worker: Connection to master failed! Waiting for master to reconnect...
> 14/11/23 02:23:14 INFO Worker: Connecting to master spark://172.17.0.86:7077...
> 14/11/23 02:23:14 INFO Worker: Connecting to master spark://172.17.0.87:7077...
> 14/11/23 02:23:14 INFO Worker: Connecting to master spark://172.17.0.88:7077...
> 14/11/23 02:23:14 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://[email protected]:7077] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> 14/11/23 02:23:14 INFO Worker: Disassociated [akka.tcp://[email protected]:8888] -> [akka.tcp://[email protected]:7077] Disassociated !
> 14/11/23 02:23:14 ERROR Worker: Connection to master failed! Waiting for master to reconnect...
> 14/11/23 02:23:14 INFO RemoteActorRefProvider$RemoteDeadLetterActorRef: Message [org.apache.spark.deploy.DeployMessages$RegisterWorker] from Actor[akka://sparkWorker/user/Worker#-1246122173] to Actor[akka://sparkWorker/deadLetters] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/11/23 02:23:14 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
> 14/11/23 02:23:14 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40172.17.0.86%3A7077-2#343365613] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/11/23 02:23:25 INFO Worker: Retrying connection to master (attempt # 1)
> 14/11/23 02:23:25 INFO Worker: Connecting to master spark://172.17.0.86:7077...
> 14/11/23 02:23:25 INFO Worker: Connecting to master spark://172.17.0.87:7077...
> 14/11/23 02:23:25 INFO Worker: Connecting to master spark://172.17.0.88:7077...
> 14/11/23 02:23:36 INFO Worker: Retrying connection to master (attempt # 2)
> 14/11/23 02:23:36 INFO Worker: Connecting to master spark://172.17.0.86:7077...
> 14/11/23 02:23:36 INFO Worker: Connecting to master spark://172.17.0.87:7077...
> 14/11/23 02:23:36 INFO Worker: Connecting to master spark://172.17.0.88:7077...
> 14/11/23 02:23:42 INFO Worker: Master has changed, new master is at spark://172.17.0.87:7077
> 14/11/23 02:23:47 INFO Worker: Retrying connection to master (attempt # 3)
> 14/11/23 02:23:47 INFO Worker: Connecting to master spark://172.17.0.86:7077...
> 14/11/23 02:23:47 INFO Worker: Connecting to master spark://172.17.0.87:7077...
> 14/11/23 02:23:47 INFO Worker: Connecting to master spark://172.17.0.88:7077...
> 14/11/23 02:23:47 ERROR Worker: Worker registration failed: Duplicate worker ID
> 14/11/23 02:23:47 INFO ExecutorRunner: Killing process!
> {code}


