cloud-fan edited a comment on issue #24569: [SPARK-23191][CORE] Warn rather 
than terminate when duplicate worker register happens
URL: https://github.com/apache/spark/pull/24569#issuecomment-492637963
 
 
   I think we need to go back to the design and think about how to fix the root 
cause. Here is what I know:
   1. ZooKeeper knows who the actual leader is and is the single source of truth
   2. masters have states (standby, active), which should eventually be 
consistent with ZooKeeper, but can be out of sync for a while
   3. each worker keeps the URL of the active master, which should eventually 
be consistent with ZooKeeper, but can be out of sync for a while
   
   A worker searches for the active master by sending messages to all masters 
in three cases:
   1. on startup
   2. when the heartbeat times out (master disconnected)
   3. when a master sends it a `ReconnectWorker` message
   
   I see 2 problems:
   1. sending messages to all masters may not find the active master, because 
more than one master may think it is the leader
   2. a non-active master may also send the `ReconnectWorker` message.
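
To make the two problems concrete, here is a minimal sketch of the state described above. It is not Spark code: the `Master` class, the helper functions, and the master URLs are all hypothetical, and the snapshot assumes a failover in which ZooKeeper has already elected a new leader while the old one has not yet noticed.

```python
from dataclasses import dataclass
from enum import Enum


class MasterState(Enum):
    STANDBY = "standby"
    ACTIVE = "active"


@dataclass
class Master:
    url: str
    state: MasterState  # the master's local belief; may lag behind ZooKeeper


def masters_claiming_active(masters):
    """What a worker learns by messaging every master: all URLs that claim to be active."""
    return [m.url for m in masters if m.state is MasterState.ACTIVE]


def stale_masters(masters, zk_leader_url):
    """Masters whose local state disagrees with the leader recorded in ZooKeeper."""
    return [
        m.url
        for m in masters
        if (m.state is MasterState.ACTIVE) != (m.url == zk_leader_url)
    ]


# Hypothetical snapshot right after a failover: ZooKeeper has elected master-2,
# but master-1 has not yet noticed that it lost leadership.
zk_leader_url = "spark://master-2:7077"
masters = [
    Master("spark://master-1:7077", MasterState.ACTIVE),   # stale belief
    Master("spark://master-2:7077", MasterState.ACTIVE),   # actual leader
    Master("spark://master-3:7077", MasterState.STANDBY),
]

claims = masters_claiming_active(masters)      # two masters claim leadership
stale = stale_masters(masters, zk_leader_url)  # only master-1 is out of sync
```

In this snapshot the worker gets two "active" answers and cannot tell them apart from the replies alone (problem 1), and the stale master-1 still believes it is entitled to send `ReconnectWorker` (problem 2). Only the comparison against ZooKeeper resolves the ambiguity.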
