[ https://issues.apache.org/jira/browse/SPARK-4592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224108#comment-14224108 ]
Josh Rosen commented on SPARK-4592:
-----------------------------------
It looks like this is a bug that was introduced in [the
patch|https://github.com/apache/spark/pull/2828] for SPARK-3736. Prior to that
patch, a worker that became disassociated from a master would wait for a master
to initiate a reconnection. This behavior caused problems when a master failed
by stopping and restarting, since the restarted master would never know to
initiate a reconnection. That patch addressed this issue by having workers
attempt to reconnect to the master if they think they've become disconnected.
When reviewing that patch, I wrote a [long
comment|https://github.com/apache/spark/pull/2828#issuecomment-59602394] that
explains it in much more detail; that's a good reference for understanding its
motivation.
This change introduced a problem, though: there's now a race condition during
multi-master failover. In the old multi-master code, a worker never initiated a
reconnection attempt on its own; instead, it reconnected only after the new
active / primary master told it that it had become the new master. With the
addition of worker-initiated reconnect, there's a race condition where the
worker detects that it has become disconnected, goes down its list of known
masters and tries to register with each of them, successfully registers with
the new primary master, then receives a {{MasterChanged}} message and attempts
to register with that master again _even though it's already registered_,
causing a duplicate worker registration.
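To make the race concrete, here's a toy sketch of the two competing
registration paths (illustrative Scala only, not the actual Worker / Master
code; {{ToyMaster}}, {{registerWorker}}, and the worker ID are made up):
{code}
import scala.collection.mutable

// Toy stand-in for the master's registration table.
class ToyMaster {
  private val registered = mutable.Set[String]()
  def registerWorker(workerId: String): Either[String, Unit] =
    if (registered.add(workerId)) Right(()) else Left("Duplicate worker ID")
}

object DuplicateRegistrationRace extends App {
  val newPrimary = new ToyMaster
  val workerId = "worker-20141123022302-172.17.0.90-8888"

  // Path 1: the worker's own re-registration attempt reaches the new primary
  // master first and succeeds.
  println(newPrimary.registerWorker(workerId))  // Right(())

  // Path 2: the registration triggered by the master-change notification
  // arrives afterwards and is rejected, producing the "Duplicate worker ID"
  // error seen in the logs below.
  println(newPrimary.registerWorker(workerId))  // Left(Duplicate worker ID)
}
{code}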
There are a number of ways that we might fix this, but we have to be careful
because it seems likely that the worker-initiated reconnect could have
introduced other problems:
- What happens if a worker sends a reconnection attempt to a live master that
is not the new primary? Will that non-primary master reject or redirect the
registration, or will it register the worker and cause a split-brain scenario?
- The Worker is implemented as an actor and therefore does not synchronize
access to its internal state, since it assumes message-at-a-time processing.
The asynchronous re-registration timer thread may violate this assumption,
because it calls internal Worker methods directly instead of sending messages
to the Worker's own mailbox (see the sketch after this list).
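For the second point, one way to restore the message-at-a-time invariant would
be to have the timer only enqueue a message on the Worker's own mailbox rather
than call into the Worker directly. A rough sketch, assuming an Akka-style
actor ({{ReregisterWithMaster}}, {{WorkerSketch}}, and the method names are
illustrative, not the real Worker code):
{code}
import akka.actor.Actor
import scala.concurrent.duration._

case object ReregisterWithMaster

class WorkerSketch extends Actor {
  import context.dispatcher  // ExecutionContext for the scheduler

  // Problematic pattern: the scheduled callback runs on a scheduler thread and
  // mutates actor state with no synchronization:
  //   context.system.scheduler.scheduleOnce(10.seconds) { reregisterWithMaster() }

  // Safer pattern: the timer only sends a message; the state change happens
  // inside receive, one message at a time.
  context.system.scheduler.scheduleOnce(10.seconds) { self ! ReregisterWithMaster }

  private var connectionAttempts = 0

  private def reregisterWithMaster(): Unit = {
    // ... mutate internal state and resend RegisterWorker to the known masters ...
    connectionAttempts += 1
  }

  def receive = {
    case ReregisterWithMaster => reregisterWithMaster()
  }
}
{code}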
One simple fix might be to have the worker never initiate reconnection attempts
when running in a multi-master environment. I still need to think through
whether this will cause new problems similar to SPARK-3736, but I don't think
it will: that patch was motivated by cases where the master forgot who the
worker was and therefore couldn't initiate a reconnect. If the list of
registered workers is stored durably in ZooKeeper, such that a worker is never
told that it has registered until a record of its registration has been made
durable, then I think this is fine: if a live master thinks that a worker has
disconnected, it will initiate a reconnection; when a new master takes over
during failover, it will re-establish connections to all workers in the list
recovered from ZooKeeper.
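To sketch what that multi-master guard might look like (illustrative only;
{{masterUrls}} stands in for the Worker's configured master list, and the real
change would live in the Worker's re-registration logic):
{code}
// Illustrative policy for deciding whether the worker retries registration
// itself or waits for the master to contact it.
object ReregistrationPolicy {
  def shouldSelfInitiateReconnect(masterUrls: Seq[String]): Boolean = {
    // Multi-master (HA) mode: the recovered master reads the worker list from
    // ZooKeeper and contacts the workers itself, so the worker stays passive.
    // Single-master mode: a restarted master has no record of this worker, so
    // the worker must retry on its own (the SPARK-3736 behavior).
    masterUrls.size <= 1
  }
}

// For example:
//   shouldSelfInitiateReconnect(Seq("spark://172.17.0.86:7077"))  => true
//   shouldSelfInitiateReconnect(Seq("spark://172.17.0.86:7077",
//                                   "spark://172.17.0.87:7077"))  => false
{code}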
> "Worker registration failed: Duplicate worker ID" error during Master failover
> ------------------------------------------------------------------------------
>
> Key: SPARK-4592
> URL: https://issues.apache.org/jira/browse/SPARK-4592
> Project: Spark
> Issue Type: Bug
> Components: Deploy, Spark Core
> Affects Versions: 1.2.0
> Reporter: Josh Rosen
> Assignee: Josh Rosen
> Priority: Blocker
>
> When running Spark Standalone in high-availability mode, we sometimes see
> "Worker registration failed: Duplicate worker ID" errors which prevent
> workers from reconnecting to the new active master. I've attached full logs
> from a reproduction in my integration tests suite (which runs something
> similar to Spark's FaultToleranceTest). Here's the relevant excerpt from a
> worker log during a failed run of the "rolling outage" test, which creates a
> multi-master cluster then repeatedly kills the active master, waits for
> workers to reconnect to a new active master, then kills that master, and so
> on.
> {code}
> 14/11/23 02:23:02 INFO WorkerWebUI: Started WorkerWebUI at
> http://172.17.0.90:8081
> 14/11/23 02:23:02 INFO Worker: Connecting to master
> spark://172.17.0.86:7077...
> 14/11/23 02:23:02 INFO Worker: Connecting to master
> spark://172.17.0.87:7077...
> 14/11/23 02:23:02 INFO Worker: Connecting to master
> spark://172.17.0.88:7077...
> 14/11/23 02:23:02 INFO Worker: Successfully registered with master
> spark://172.17.0.86:7077
> 14/11/23 02:23:03 INFO Worker: Asked to launch executor
> app-20141123022303-0000/1 for spark-integration-tests
> 14/11/23 02:23:03 INFO ExecutorRunner: Launch command: "java" "-cp"
> "::/opt/sparkconf:/opt/spark/assembly/target/scala-2.10/spark-assembly-1.2.1-SNAPSHOT-hadoop1.0.4.jar"
> "-XX:MaxPermSize=128m" "-Dspark.driver.port=51271" "-Xms512M" "-Xmx512M"
> "org.apache.spark.executor.CoarseGrainedExecutorBackend"
> "akka.tcp://[email protected]:51271/user/CoarseGrainedScheduler"
> "1" "172.17.0.90" "8" "app-20141123022303-0000"
> "akka.tcp://[email protected]:8888/user/Worker"
> 14/11/23 02:23:14 INFO Worker: Disassociated
> [akka.tcp://[email protected]:8888] ->
> [akka.tcp://[email protected]:7077] Disassociated !
> 14/11/23 02:23:14 ERROR Worker: Connection to master failed! Waiting for
> master to reconnect...
> 14/11/23 02:23:14 INFO Worker: Connecting to master
> spark://172.17.0.86:7077...
> 14/11/23 02:23:14 INFO Worker: Connecting to master
> spark://172.17.0.87:7077...
> 14/11/23 02:23:14 INFO Worker: Connecting to master
> spark://172.17.0.88:7077...
> 14/11/23 02:23:14 WARN ReliableDeliverySupervisor: Association with remote
> system [akka.tcp://[email protected]:7077] has failed, address is now
> gated for [5000] ms. Reason is: [Disassociated].
> 14/11/23 02:23:14 INFO Worker: Disassociated
> [akka.tcp://[email protected]:8888] ->
> [akka.tcp://[email protected]:7077] Disassociated !
> 14/11/23 02:23:14 ERROR Worker: Connection to master failed! Waiting for
> master to reconnect...
> 14/11/23 02:23:14 INFO RemoteActorRefProvider$RemoteDeadLetterActorRef:
> Message [org.apache.spark.deploy.DeployMessages$RegisterWorker] from
> Actor[akka://sparkWorker/user/Worker#-1246122173] to
> Actor[akka://sparkWorker/deadLetters] was not delivered. [1] dead letters
> encountered. This logging can be turned off or adjusted with configuration
> settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/11/23 02:23:14 INFO Worker: Not spawning another attempt to register with
> the master, since there is an attempt scheduled already.
> 14/11/23 02:23:14 INFO LocalActorRef: Message
> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
> Actor[akka://sparkWorker/deadLetters] to
> Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40172.17.0.86%3A7077-2#343365613]
> was not delivered. [2] dead letters encountered. This logging can be turned
> off or adjusted with configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
> 14/11/23 02:23:25 INFO Worker: Retrying connection to master (attempt # 1)
> 14/11/23 02:23:25 INFO Worker: Connecting to master
> spark://172.17.0.86:7077...
> 14/11/23 02:23:25 INFO Worker: Connecting to master
> spark://172.17.0.87:7077...
> 14/11/23 02:23:25 INFO Worker: Connecting to master
> spark://172.17.0.88:7077...
> 14/11/23 02:23:36 INFO Worker: Retrying connection to master (attempt # 2)
> 14/11/23 02:23:36 INFO Worker: Connecting to master
> spark://172.17.0.86:7077...
> 14/11/23 02:23:36 INFO Worker: Connecting to master
> spark://172.17.0.87:7077...
> 14/11/23 02:23:36 INFO Worker: Connecting to master
> spark://172.17.0.88:7077...
> 14/11/23 02:23:42 INFO Worker: Master has changed, new master is at
> spark://172.17.0.87:7077
> 14/11/23 02:23:47 INFO Worker: Retrying connection to master (attempt # 3)
> 14/11/23 02:23:47 INFO Worker: Connecting to master
> spark://172.17.0.86:7077...
> 14/11/23 02:23:47 INFO Worker: Connecting to master
> spark://172.17.0.87:7077...
> 14/11/23 02:23:47 INFO Worker: Connecting to master
> spark://172.17.0.88:7077...
> 14/11/23 02:23:47 ERROR Worker: Worker registration failed: Duplicate worker
> ID
> 14/11/23 02:23:47 INFO ExecutorRunner: Killing process!
> {code}