cloud-fan commented on a change in pull request #24569: [SPARK-23191][CORE] 
Warn rather than terminate when duplicate worker register happens
URL: https://github.com/apache/spark/pull/24569#discussion_r283867767
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
 ##########
 @@ -485,8 +493,13 @@ private[deploy] class Worker(
       masterRef.send(WorkerSchedulerStateResponse(workerId, execs.toList, 
drivers.keys.toSeq))
 
     case ReconnectWorker(masterUrl) =>
-      logInfo(s"Master with url $masterUrl requested this worker to 
reconnect.")
-      registerWithMaster()
+      if (masterUrl != activeMasterUrl) {
 
 Review comment:
   This is the part that worries me worst. We need to be super sure that it's 
100% safe.
   
   It looks to me that
   1. `activeMasterUrl` is the master that this worker thinks it's active. This 
may not always be the actual active master. For example, when the zookeeper 
finishes leader election, there is a time window before this worker gets the 
`MasterChanged` message.
   2. `ReconnectWorker` may be sent by a standby master, as you explained in 
the PR description.
   
   Can we list all the cases that a `ReconnectWorker` is sent, and make sure we 
don't miss valid cases here?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to