Ngone51 commented on issue #24569: [SPARK-23191][CORE] Warn rather than 
terminate when duplicate worker register happens
URL: https://github.com/apache/spark/pull/24569#issuecomment-492269385
 
 
   > Then how do the workers know which master is active?
   
   Hmm... a worker does not know which master is active while it is not 
connected to one. It just tries to connect to all masters and waits for 
their replies. 
   
   If a master loses the leader election, its state remains `STANDBY`. So, 
when it receives a worker's register request, the standby master ignores the 
request and just tells the worker, 'I'm in standby'. In turn, the worker 
ignores this message (it continues to wait for the active master's 
response).
   
   If a master is successfully elected leader, its state changes to `ALIVE`. 
So, when it receives a worker's register request, the alive (or active) 
master registers the worker and responds with `RegisteredWorker`. When the 
worker receives `RegisteredWorker`, it points its masterRef (an 
RpcEndpointRef) at the active master. Afterwards, the worker can communicate 
with the active master directly through masterRef.
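
   The handshake described above can be sketched roughly as follows. This is 
a simplified model, not Spark's actual classes: the `Master`, `Worker`, 
`handleRegister` and `tryRegisterWithAll` names are hypothetical, and real 
registration is asynchronous RPC rather than a direct method call.

```scala
// Simplified, hypothetical model of the registration handshake.
// None of these types are Spark's real classes.
object RegistrationSketch {
  sealed trait MasterState
  case object Standby extends MasterState
  case object Alive extends MasterState

  sealed trait Reply
  case object MasterInStandby extends Reply
  final case class RegisteredWorker(masterUrl: String) extends Reply

  final case class Master(url: String, state: MasterState) {
    // A standby master declines; an alive master registers the worker.
    def handleRegister(workerId: String): Reply = state match {
      case Standby => MasterInStandby
      case Alive   => RegisteredWorker(url)
    }
  }

  final class Worker(val id: String) {
    // Stays None until some master answers with RegisteredWorker.
    var activeMaster: Option[String] = None

    // The worker does not know who is active, so it asks every master.
    def tryRegisterWithAll(masters: Seq[Master]): Unit =
      masters.foreach { m =>
        m.handleRegister(id) match {
          case MasterInStandby       => () // ignore and keep waiting
          case RegisteredWorker(url) => activeMaster = Some(url)
        }
      }
  }
}
```

   In this model a worker that only reaches standby masters never sets 
`activeMaster`, which matches the "keep waiting" behavior described above.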
   
   If the active master crashes, the worker retries connecting to all 
masters (the behavior may be a little different after #3447). Then the 
process above repeats.
   
   Now, I'm thinking that we can indeed tell whether a master is active just 
by checking whether its state is STANDBY.
   
   But in case 2, even if we could recognize whether a master is active, we 
still might not be able to avoid step (3).
   
   See the log snippet provided by @zuotingbing (details are in JIRA 
SPARK-23191):
   
   ```
   2019-03-15 20:22:09,441 INFO ZooKeeperLeaderElectionAgent: We have lost leadership
   2019-03-15 20:22:14,544 WARN Master: Removing worker-20190218183101-vmax18-33129 because we got no heartbeat in 60 seconds
   2019-03-15 20:22:14,544 INFO Master: Removing worker worker-20190218183101-vmax18-33129 on vmax18:33129
   2019-03-15 20:22:14,864 WARN Master: Got heartbeat from unregistered worker worker-20190218183101-vmax18-33129. Asking it to re-register.
   2019-03-15 20:22:14,975 ERROR Master: Leadership has been revoked -- master shutting down.
   ```
   
   From the log, it looks like we have a race condition between 
ZooKeeperLeaderElectionAgent and Master. When the master receives the late 
heartbeat, it is still active. But, almost simultaneously, it becomes 
inactive.
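
   To illustrate the ordering problem, here is a minimal sketch that replays 
the messages in the order the log suggests. It assumes the leadership-loss 
notification reaches the master as an ordinary queued message that is handled 
only after the timeout check and the late heartbeat. The `Master` class, its 
fields and `replay` are hypothetical and much simpler than the real `Master`.

```scala
// Hypothetical, single-threaded replay of the message ordering in the log.
object LeadershipRaceSketch {
  sealed trait Msg
  case object CheckForWorkerTimeOut extends Msg
  final case class Heartbeat(workerId: String) extends Msg
  case object RevokedLeadership extends Msg

  final class Master(initialWorkers: Set[String]) {
    var alive = true
    var workers: Set[String] = initialWorkers
    var log: Vector[String] = Vector.empty

    def receive(msg: Msg): Unit = msg match {
      case CheckForWorkerTimeOut if alive =>
        // Assumption for the sketch: every known worker has timed out.
        workers.foreach(w => log :+= s"Removing $w")
        workers = Set.empty
      case Heartbeat(w) if alive && !workers.contains(w) =>
        // The master still believes it is active, so it asks for re-registration.
        log :+= s"Asking $w to re-register"
      case RevokedLeadership =>
        alive = false
        log :+= "Leadership revoked"
      case _ => () // everything else is dropped in this sketch
    }
  }

  // Feed the messages to a fresh master in the given order and return its log.
  def replay(order: Seq[Msg]): Vector[String] = {
    val master = new Master(Set("worker-1"))
    order.foreach(master.receive)
    master.log
  }
}
```

   If `RevokedLeadership` is handled last, the sketch asks the worker to 
re-register right before shutting down, as in the log; if it is handled 
first, the timeout check and the heartbeat are both ignored.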
   
   
   And shouldn't we synchronize on the master's receive if it extends 
`ThreadSafeRpcEndpoint` (though this may not be the reason the race condition 
exists)?
   
https://github.com/apache/spark/blob/d9e4cf67c06b2d6daa4cd24b056e33dfb5eb35f5/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L221-L222
   
