[ https://issues.apache.org/jira/browse/SPARK-4592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Or reassigned SPARK-4592:
--------------------------------

    Assignee: Andrew Or  (was: Josh Rosen)

> "Worker registration failed: Duplicate worker ID" error during Master failover
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-4592
>                 URL: https://issues.apache.org/jira/browse/SPARK-4592
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy, Spark Core
>    Affects Versions: 1.2.0
>            Reporter: Josh Rosen
>            Assignee: Andrew Or
>            Priority: Blocker
>         Attachments: log.txt
>
>
> When running Spark Standalone in high-availability mode, we sometimes see
> "Worker registration failed: Duplicate worker ID" errors which prevent
> workers from reconnecting to the new active master. I've attached full logs
> from a reproduction in my integration tests suite (which runs something
> similar to Spark's FaultToleranceTest). Here's the relevant excerpt from a
> worker log during a failed run of the "rolling outage" test, which creates a
> multi-master cluster, then repeatedly kills the active master, waits for
> workers to reconnect to a new active master, then kills that master, and so
> on.
> {code}
> 14/11/23 02:23:02 INFO WorkerWebUI: Started WorkerWebUI at http://172.17.0.90:8081
> 14/11/23 02:23:02 INFO Worker: Connecting to master spark://172.17.0.86:7077...
> 14/11/23 02:23:02 INFO Worker: Connecting to master spark://172.17.0.87:7077...
> 14/11/23 02:23:02 INFO Worker: Connecting to master spark://172.17.0.88:7077...
> 14/11/23 02:23:02 INFO Worker: Successfully registered with master spark://172.17.0.86:7077
> 14/11/23 02:23:03 INFO Worker: Asked to launch executor app-20141123022303-0000/1 for spark-integration-tests
> 14/11/23 02:23:03 INFO ExecutorRunner: Launch command: "java" "-cp" "::/opt/sparkconf:/opt/spark/assembly/target/scala-2.10/spark-assembly-1.2.1-SNAPSHOT-hadoop1.0.4.jar" "-XX:MaxPermSize=128m" "-Dspark.driver.port=51271" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://sparkdri...@joshs-mbp.att.net:51271/user/CoarseGrainedScheduler" "1" "172.17.0.90" "8" "app-20141123022303-0000" "akka.tcp://sparkWorker@172.17.0.90:8888/user/Worker"
> 14/11/23 02:23:14 INFO Worker: Disassociated [akka.tcp://sparkWorker@172.17.0.90:8888] -> [akka.tcp://sparkMaster@172.17.0.86:7077] Disassociated !
> 14/11/23 02:23:14 ERROR Worker: Connection to master failed! Waiting for master to reconnect...
> 14/11/23 02:23:14 INFO Worker: Connecting to master spark://172.17.0.86:7077...
> 14/11/23 02:23:14 INFO Worker: Connecting to master spark://172.17.0.87:7077...
> 14/11/23 02:23:14 INFO Worker: Connecting to master spark://172.17.0.88:7077...
> 14/11/23 02:23:14 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster@172.17.0.86:7077] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> 14/11/23 02:23:14 INFO Worker: Disassociated [akka.tcp://sparkWorker@172.17.0.90:8888] -> [akka.tcp://sparkMaster@172.17.0.86:7077] Disassociated !
> 14/11/23 02:23:14 ERROR Worker: Connection to master failed! Waiting for master to reconnect...
> 14/11/23 02:23:14 INFO RemoteActorRefProvider$RemoteDeadLetterActorRef: Message [org.apache.spark.deploy.DeployMessages$RegisterWorker] from Actor[akka://sparkWorker/user/Worker#-1246122173] to Actor[akka://sparkWorker/deadLetters] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/11/23 02:23:14 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
> 14/11/23 02:23:14 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40172.17.0.86%3A7077-2#343365613] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/11/23 02:23:25 INFO Worker: Retrying connection to master (attempt # 1)
> 14/11/23 02:23:25 INFO Worker: Connecting to master spark://172.17.0.86:7077...
> 14/11/23 02:23:25 INFO Worker: Connecting to master spark://172.17.0.87:7077...
> 14/11/23 02:23:25 INFO Worker: Connecting to master spark://172.17.0.88:7077...
> 14/11/23 02:23:36 INFO Worker: Retrying connection to master (attempt # 2)
> 14/11/23 02:23:36 INFO Worker: Connecting to master spark://172.17.0.86:7077...
> 14/11/23 02:23:36 INFO Worker: Connecting to master spark://172.17.0.87:7077...
> 14/11/23 02:23:36 INFO Worker: Connecting to master spark://172.17.0.88:7077...
> 14/11/23 02:23:42 INFO Worker: Master has changed, new master is at spark://172.17.0.87:7077
> 14/11/23 02:23:47 INFO Worker: Retrying connection to master (attempt # 3)
> 14/11/23 02:23:47 INFO Worker: Connecting to master spark://172.17.0.86:7077...
> 14/11/23 02:23:47 INFO Worker: Connecting to master spark://172.17.0.87:7077...
> 14/11/23 02:23:47 INFO Worker: Connecting to master spark://172.17.0.88:7077...
> 14/11/23 02:23:47 ERROR Worker: Worker registration failed: Duplicate worker ID
> 14/11/23 02:23:47 INFO ExecutorRunner: Killing process!
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
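Not part of the original report, but one plausible reading of the log above is that the new master restores the worker's entry from persisted recovery state while the worker's already-scheduled registration retry is still in flight, so the retry is rejected as a duplicate ID. A toy Python model of that race (illustrative only — these class and method names are not Spark's API):

```python
# Toy model of the suspected registration race (NOT Spark code).
class ToyMaster:
    """A master that, like a failed-over Spark master, may start with
    worker IDs restored from a persistent recovery store."""

    def __init__(self, recovered_worker_ids=()):
        # A newly elected master restores known workers before it begins
        # accepting registrations.
        self.workers = set(recovered_worker_ids)

    def register_worker(self, worker_id):
        # A registration for an ID the master already knows is rejected,
        # mirroring the "Duplicate worker ID" ERROR in the log.
        if worker_id in self.workers:
            return "Duplicate worker ID"
        self.workers.add(worker_id)
        return "registered"


wid = "worker-172.17.0.90-8888"

# Before failover: a fresh master accepts the worker normally.
original = ToyMaster()
assert original.register_worker(wid) == "registered"

# After failover: the new master recovered the same worker from persisted
# state, so the worker's pending registration retry collides with it.
new_master = ToyMaster(recovered_worker_ids={wid})
print(new_master.register_worker(wid))  # -> Duplicate worker ID
```

Under this reading, the retry loop visible in the log (attempts #1–#3 racing with the "Master has changed" notification) is what turns an otherwise clean failover into a rejected registration.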