Ngone51 commented on a change in pull request #24569: [SPARK-23191][CORE] Warn 
rather than terminate when duplicate worker register happens
URL: https://github.com/apache/spark/pull/24569#discussion_r284147066
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
 ##########
 @@ -485,8 +493,13 @@ private[deploy] class Worker(
       masterRef.send(WorkerSchedulerStateResponse(workerId, execs.toList, drivers.keys.toSeq))
 
     case ReconnectWorker(masterUrl) =>
-      logInfo(s"Master with url $masterUrl requested this worker to reconnect.")
-      registerWithMaster()
+      if (masterUrl != activeMasterUrl) {
 
 Review comment:
   > 2. `ReconnectWorker` may be sent by a standby master, as you explained in 
the PR description.
   
   My PR description had the wrong step order in CASE 2 (I have revised it). 
   Sorry about that. In fact, when it sends `ReconnectWorker`, Master A is still 
   active but about to die (the race condition mentioned above).
   
   There's no doubt that the msg `ReconnectWorker(master)` is always sent by a 
   Master that is active at the time of sending.
   So, when the Worker receives that msg from Master X, the cases are:
   
   1) Master X is still active
       1.1) Master X is the initial active master (no `MasterChanged` msg)
           1.1.1) master == activeMasterUrl
               just reconnect to (all) masters
           1.1.2) master != activeMasterUrl
               impossible case
       1.2) Master X has been elected as the new active master
           1.2.1) master == activeMasterUrl (`MasterChanged` comes before `ReconnectWorker`)
               just reconnect to (all) masters
           1.2.2) master != activeMasterUrl (`MasterChanged` comes after `ReconnectWorker`)
               seems very unlikely, but can be a valid case as you mentioned above.
               In this case, we'll always ignore the reconnect msg until we receive
               `MasterChanged`.
   2) Master X is inactive; Master Y takes over after Master X sends `ReconnectWorker`
       2.1) master == activeMasterUrl (`MasterChanged` from Y comes after `ReconnectWorker` from X)
           the active master has changed, but the Worker hasn't realized it yet.
           It will still try to reconnect to (all) masters. In this case (contrary
           to CASE 2), we'll hit the duplicate register issue.
       2.2) master != activeMasterUrl (`MasterChanged` from Y comes before `ReconnectWorker` from X)
           ignore it, since the Worker has already changed the active master to
           Master Y.
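
The case analysis above can be sketched as a small decision table. This is a toy model, not Spark code: `ReconnectCases`, `Outcome`, `classify` and its three boolean parameters are hypothetical names introduced only for illustration, and the outcomes assume the `masterUrl != activeMasterUrl` check from this diff is in place.

```scala
// Toy model of the cases listed above; none of these names exist in Spark.
object ReconnectCases {
  sealed trait Outcome
  case object Reconnect extends Outcome                  // 1.1.1 and 1.2.1
  case object Impossible extends Outcome                 // 1.1.2
  case object IgnoredUntilMasterChanged extends Outcome  // 1.2.2
  case object DuplicateRegister extends Outcome          // 2.1
  case object Ignored extends Outcome                    // 2.2

  // senderStillActive: Master X is still the active master when the msg arrives
  // senderIsInitial:   Master X is the initial active master (no MasterChanged sent)
  // urlMatches:        master == activeMasterUrl on the Worker side
  def classify(
      senderStillActive: Boolean,
      senderIsInitial: Boolean,
      urlMatches: Boolean): Outcome =
    (senderStillActive, senderIsInitial, urlMatches) match {
      case (true, _, true)      => Reconnect
      case (true, true, false)  => Impossible
      case (true, false, false) => IgnoredUntilMasterChanged
      case (false, _, true)     => DuplicateRegister
      case (false, _, false)    => Ignored
    }
}
```

Only case 2.1 ends in a duplicate registration, and the URL check cannot catch it because the Worker still believes Master X is active.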
   
   
   **Since this PR changes the outcome of a duplicate worker registration from 
   exit to warn, I think it's OK to remove this condition check here. The worst 
   result of accepting `ReconnectWorker` is a duplicate registration with the 
   active master, which this PR's fix already covers.**
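
To illustrate why dropping the check looks safe, here is a minimal sketch of the proposed behaviour, assuming the Master only warns on a duplicate registration. `ToyMaster`, `ToyWorker`, and the URL string in the usage note are hypothetical stand-ins, not the real `Master`/`Worker` classes or their RPC messages.

```scala
import scala.collection.mutable

// Toy stand-ins for illustration only; these are not Spark classes.
object DuplicateRegisterSketch {
  final class ToyMaster {
    private val registered = mutable.Set.empty[String]
    val warnings = mutable.ListBuffer.empty[String]

    // Returns true for a first registration. A duplicate is recorded as a
    // warning and otherwise ignored, instead of telling the worker to exit.
    def register(workerId: String): Boolean = {
      val isNew = registered.add(workerId)
      if (!isNew) {
        warnings += s"Duplicate registration from worker $workerId, ignoring it"
      }
      isNew
    }
  }

  final class ToyWorker(val id: String, master: ToyMaster) {
    def register(): Boolean = master.register(id)

    // No masterUrl != activeMasterUrl check: every reconnect request is
    // accepted, so the worst outcome is a duplicate registration.
    def onReconnectWorker(masterUrl: String): Boolean = register()
  }
}
```

In this sketch, calling `register()` and then `onReconnectWorker("spark://master-x:7077")` on the same `ToyWorker` leaves exactly one entry in `warnings`, and nothing terminates the worker.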
   
   
   

