[
https://issues.apache.org/jira/browse/SPARK-4592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Or reassigned SPARK-4592:
--------------------------------
Assignee: Andrew Or (was: Josh Rosen)
> "Worker registration failed: Duplicate worker ID" error during Master failover
> ------------------------------------------------------------------------------
>
> Key: SPARK-4592
> URL: https://issues.apache.org/jira/browse/SPARK-4592
> Project: Spark
> Issue Type: Bug
> Components: Deploy, Spark Core
> Affects Versions: 1.2.0
> Reporter: Josh Rosen
> Assignee: Andrew Or
> Priority: Blocker
> Attachments: log.txt
>
>
> When running Spark Standalone in high-availability mode, we sometimes see
> "Worker registration failed: Duplicate worker ID" errors, which prevent
> workers from reconnecting to the new active master. I've attached full logs
> from a reproduction in my integration test suite (which runs something
> similar to Spark's FaultToleranceTest). Here's the relevant excerpt from a
> worker log during a failed run of the "rolling outage" test, which creates a
> multi-master cluster, then repeatedly kills the active master, waits for the
> workers to reconnect to a new active master, kills that master, and so on.
> {code}
> 14/11/23 02:23:02 INFO WorkerWebUI: Started WorkerWebUI at http://172.17.0.90:8081
> 14/11/23 02:23:02 INFO Worker: Connecting to master spark://172.17.0.86:7077...
> 14/11/23 02:23:02 INFO Worker: Connecting to master spark://172.17.0.87:7077...
> 14/11/23 02:23:02 INFO Worker: Connecting to master spark://172.17.0.88:7077...
> 14/11/23 02:23:02 INFO Worker: Successfully registered with master spark://172.17.0.86:7077
> 14/11/23 02:23:03 INFO Worker: Asked to launch executor app-20141123022303-0000/1 for spark-integration-tests
> 14/11/23 02:23:03 INFO ExecutorRunner: Launch command: "java" "-cp" "::/opt/sparkconf:/opt/spark/assembly/target/scala-2.10/spark-assembly-1.2.1-SNAPSHOT-hadoop1.0.4.jar" "-XX:MaxPermSize=128m" "-Dspark.driver.port=51271" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://[email protected]:51271/user/CoarseGrainedScheduler" "1" "172.17.0.90" "8" "app-20141123022303-0000" "akka.tcp://[email protected]:8888/user/Worker"
> 14/11/23 02:23:14 INFO Worker: Disassociated [akka.tcp://[email protected]:8888] -> [akka.tcp://[email protected]:7077] Disassociated !
> 14/11/23 02:23:14 ERROR Worker: Connection to master failed! Waiting for master to reconnect...
> 14/11/23 02:23:14 INFO Worker: Connecting to master spark://172.17.0.86:7077...
> 14/11/23 02:23:14 INFO Worker: Connecting to master spark://172.17.0.87:7077...
> 14/11/23 02:23:14 INFO Worker: Connecting to master spark://172.17.0.88:7077...
> 14/11/23 02:23:14 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://[email protected]:7077] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> 14/11/23 02:23:14 INFO Worker: Disassociated [akka.tcp://[email protected]:8888] -> [akka.tcp://[email protected]:7077] Disassociated !
> 14/11/23 02:23:14 ERROR Worker: Connection to master failed! Waiting for master to reconnect...
> 14/11/23 02:23:14 INFO RemoteActorRefProvider$RemoteDeadLetterActorRef: Message [org.apache.spark.deploy.DeployMessages$RegisterWorker] from Actor[akka://sparkWorker/user/Worker#-1246122173] to Actor[akka://sparkWorker/deadLetters] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/11/23 02:23:14 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
> 14/11/23 02:23:14 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40172.17.0.86%3A7077-2#343365613] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/11/23 02:23:25 INFO Worker: Retrying connection to master (attempt # 1)
> 14/11/23 02:23:25 INFO Worker: Connecting to master spark://172.17.0.86:7077...
> 14/11/23 02:23:25 INFO Worker: Connecting to master spark://172.17.0.87:7077...
> 14/11/23 02:23:25 INFO Worker: Connecting to master spark://172.17.0.88:7077...
> 14/11/23 02:23:36 INFO Worker: Retrying connection to master (attempt # 2)
> 14/11/23 02:23:36 INFO Worker: Connecting to master spark://172.17.0.86:7077...
> 14/11/23 02:23:36 INFO Worker: Connecting to master spark://172.17.0.87:7077...
> 14/11/23 02:23:36 INFO Worker: Connecting to master spark://172.17.0.88:7077...
> 14/11/23 02:23:42 INFO Worker: Master has changed, new master is at spark://172.17.0.87:7077
> 14/11/23 02:23:47 INFO Worker: Retrying connection to master (attempt # 3)
> 14/11/23 02:23:47 INFO Worker: Connecting to master spark://172.17.0.86:7077...
> 14/11/23 02:23:47 INFO Worker: Connecting to master spark://172.17.0.87:7077...
> 14/11/23 02:23:47 INFO Worker: Connecting to master spark://172.17.0.88:7077...
> 14/11/23 02:23:47 ERROR Worker: Worker registration failed: Duplicate worker ID
> 14/11/23 02:23:47 INFO ExecutorRunner: Killing process!
> {code}
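> The sequence in the log is consistent with the new active master still holding
> a live record for the worker's ID (e.g. one recovered during failover) when the
> worker's delayed registration retry arrives. A minimal, hypothetical Python
> sketch of that kind of duplicate-ID check follows; the class and method names
> are purely illustrative and are not Spark's actual Master code:
> {code}
```python
# Hypothetical model (NOT Spark's actual source) of a master-side worker
# registry that rejects a second registration for an ID it still considers
# alive -- the behavior that would produce "Duplicate worker ID" when a
# worker's retried RegisterWorker reaches the new active master.

class Master:
    def __init__(self):
        # One record per known worker ID. A master taking over after
        # failover might repopulate this from persisted recovery state.
        self.id_to_worker = {}

    def register_worker(self, worker_id):
        """Return None on success, or an error string if registration fails."""
        if self.id_to_worker.get(worker_id) == "ALIVE":
            # A live record for this ID already exists, so the fresh
            # registration attempt is refused.
            return "Duplicate worker ID"
        self.id_to_worker[worker_id] = "ALIVE"
        return None

master = Master()
# The initial registration succeeds.
assert master.register_worker("worker-20141123022302") is None
# A retried registration arriving while the old record is still marked
# alive reproduces the error seen in the worker log above.
assert master.register_worker("worker-20141123022302") == "Duplicate worker ID"
```
> {code}
> Under this reading, the fix would involve the master treating a re-registration
> from the same worker address as a reconnect rather than a conflicting duplicate,
> though the sketch above only illustrates the failing check itself.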
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)