Ngone51 opened a new pull request #24569: [SPARK-23191][CORE] Warn rather than 
terminate when duplicate worker register happens
URL: https://github.com/apache/spark/pull/24569
 
 
   ## What changes were proposed in this pull request?
   
   #2828 first introduced the *duplicate Worker registration* bug, and #3447 
fixed it. But there are still corner cases (see SPARK-23191 for details) that 
#3447 does not cover:
   
   * CASE 1
   (1) Master A is disconnected (e.g. due to a network drop), but has not failed
   (2) Worker attempts to reconnect to all masters
   (3) Meanwhile, the network recovers and Worker registers with Master A again
   (4) Master A responds with `RegisterWorkerFailed("Duplicate worker ID")`
   (5) Worker receives that message and exits
   
   * CASE 2
   (1) Master A loses leadership and Master B takes over
   (2) Worker receives `MasterChanged` and changes masterRef to Master B
   (3) Master A receives a late heartbeat from Worker and asks it to reconnect
   (4) Worker receives `ReconnectWorker` from Master A and reconnects to all 
masters
   (5) Master B receives a register request from the Worker again and responds 
with `RegisterWorkerFailed("Duplicate worker ID")`
   (6) Worker receives that message and exits
   
   For CASE 2, we can avoid the duplicate worker registration by comparing the 
current active master URL with the URL in the reconnect request in step (4).
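   
   The check for CASE 2 can be sketched roughly as follows. This is a minimal 
illustration, not the code in this PR: the `ReconnectWorker` case class here is 
a simplified stand-in, and `ReconnectCheck`, `shouldReconnect`, and 
`activeMasterUrl` are hypothetical names chosen for the example.
   
   ```scala
   // Simplified stand-in for the reconnect message; NOT Spark's actual class.
   final case class ReconnectWorker(masterUrl: String)

   // Hypothetical helper illustrating the URL comparison in step (4).
   object ReconnectCheck {
     // Returns true only when the reconnect request comes from the master
     // that the worker currently considers active. A request from a stale
     // master (e.g. Master A after it lost leadership) is ignored.
     def shouldReconnect(
         activeMasterUrl: Option[String],
         msg: ReconnectWorker): Boolean =
       activeMasterUrl.contains(msg.masterUrl)
   }
   ```
   
   With Master B active, a late `ReconnectWorker` carrying Master A's URL would 
make `shouldReconnect` return false, so the worker would not re-register with 
Master B.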
   
   For CASE 1, I found it troublesome to avoid the *duplicate Worker 
registration* in this special case. So this PR proposes logging a warning 
rather than terminating the worker process when a duplicate worker registration 
happens. This way, the worker and master can keep working with each other.
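   
   The intended worker-side behavior can be sketched as below. This is only an 
illustration under assumptions: `RegistrationOutcome`, `WorkerAction`, and 
`RegistrationHandler` are hypothetical names and do not reflect the actual 
message types or handler changed by this PR. Only the warning text matches the 
log snippet in the test section.
   
   ```scala
   // Hypothetical model of what the master tells the worker after a register
   // request; NOT Spark's actual message types.
   sealed trait RegistrationOutcome
   case object Registered extends RegistrationOutcome
   case object DuplicateRegistration extends RegistrationOutcome
   final case class Failed(reason: String) extends RegistrationOutcome

   // Hypothetical model of what the worker does next.
   sealed trait WorkerAction
   final case class Continue(warning: Option[String]) extends WorkerAction
   final case class Exit(reason: String) extends WorkerAction

   object RegistrationHandler {
     // A duplicate registration produces a warning and the worker keeps
     // running; only a genuine failure makes the worker exit.
     def handle(outcome: RegistrationOutcome, masterUrl: String): WorkerAction =
       outcome match {
         case Registered =>
           Continue(None)
         case DuplicateRegistration =>
           Continue(Some(s"Duplicate registration at master $masterUrl"))
         case Failed(reason) =>
           Exit(reason)
       }
   }
   ```
   
   Modeling the duplicate case as its own outcome, instead of folding it into a 
failure, is what lets the worker stay alive in CASE 1.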
   
   ## How was this patch tested?
   
   Tested manually.
   
   I followed the steps Neeraj Gupta suggested in JIRA SPARK-23191 to 
reproduce CASE 1.
   
   Before this PR, the Worker would be shown as DEAD in the UI.
   After this PR, the Worker just warns about the duplicate registration (see 
the second-to-last row in the log snippet below) and is still shown as ALIVE in 
the UI.
   
   ```
   19/05/09 20:58:32 ERROR Worker: Connection to master failed! Waiting for 
master to reconnect...
   19/05/09 20:58:32 INFO Worker: wuyi.local:7077 Disassociated !
   19/05/09 20:58:32 INFO Worker: Connecting to master wuyi.local:7077...
   19/05/09 20:58:32 ERROR Worker: Connection to master failed! Waiting for 
master to reconnect...
   19/05/09 20:58:32 INFO Worker: Not spawning another attempt to register with 
the master, since there is an attempt scheduled already.
   19/05/09 20:58:37 WARN TransportClientFactory: DNS resolution for 
wuyi.local/127.0.0.1:7077 took 5005 ms
   19/05/09 20:58:37 INFO TransportClientFactory: Found inactive connection to 
wuyi.local/127.0.0.1:7077, creating a new one.
   19/05/09 20:58:37 INFO TransportClientFactory: Successfully created 
connection to wuyi.local/127.0.0.1:7077 after 3 ms (0 ms spent in bootstraps)
   19/05/09 20:58:37 WARN Worker: Duplicate registration at master 
spark://wuyi.local:7077
   19/05/09 20:58:37 INFO Worker: Successfully registered with master 
spark://wuyi.local:7077
   ```
   
