Ngone51 commented on issue #24569: [SPARK-23191][CORE] Warn rather than terminate when duplicate worker register happens
URL: https://github.com/apache/spark/pull/24569#issuecomment-492269385

> Then how do the workers know which master is active?

Hmmm... a worker itself does not know which master is active until it is connected to one. It simply tries to connect to all masters and waits for their replies.

If a master loses the leader election, its state remains `STANDBY`. So, when it receives a worker's register request, the standby master ignores the request and just tells the worker "I'm in standby". The worker, in turn, also ignores this message and keeps waiting for the active master's response.

If a master is elected leader successfully, its state changes to `ALIVE`. So, when it receives a worker's register request, the alive (or active) master registers the worker and responds with `RegisteredWorker`. When the worker receives `RegisteredWorker`, it switches its masterRef (`RpcEndpointRef`) to the active master. Afterwards, the worker communicates with the active master through that masterRef directly.

And if the active master crashes, the worker re-tries to connect to all masters (the behavior may be a little different after #3447), and the process above repeats. (A simplified sketch of this handshake is at the end of this comment.)

Now, I'm thinking that we actually can tell whether a master is active just by checking whether its state is `STANDBY`. But in case 2, even if we could recognize whether a master is active, we may still not be able to avoid step (3). See the log snippets (see details in JIRA SPARK-23191) provided by @zuotingbing:

```
2019-03-15 20:22:09,441 INFO ZooKeeperLeaderElectionAgent: We have lost leadership
2019-03-15 20:22:14,544 WARN Master: Removing worker-20190218183101-vmax18-33129 because we got no heartbeat in 60 seconds
2019-03-15 20:22:14,544 INFO Master: Removing worker worker-20190218183101-vmax18-33129 on vmax18:33129
2019-03-15 20:22:14,864 WARN Master: Got heartbeat from unregistered worker worker-20190218183101-vmax18-33129. Asking it to re-register.
2019-03-15 20:22:14,975 ERROR Master: Leadership has been revoked -- master shutting down.
```

From the log, it looks like there is a race condition between ZooKeeperLeaderElectionAgent and Master: when the master receives the late heartbeat it is still active, but almost simultaneously it becomes inactive. (A toy illustration of this window is also at the end of this comment.)

Also, shouldn't we synchronize on the master's `receive`, given that it extends `ThreadSafeRpcEndpoint` (though this may not be the reason the race condition exists)?

https://github.com/apache/spark/blob/d9e4cf67c06b2d6daa4cd24b056e33dfb5eb35f5/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L221-L222
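To make the handshake above concrete, here is a standalone toy model (not the real `Master`/`Worker` code): the message and state names mirror the Spark ones (`RegisterWorker` handling, `MasterInStandby`, `RegisteredWorker`, `STANDBY`/`ALIVE`), but the types and bodies are simplified for illustration only.

```scala
// Standalone sketch of the register handshake described above. Names mirror the
// Spark ones, but this is a simplified model, not the actual implementation.
object RegisterHandshakeSketch {

  sealed trait RecoveryState
  case object STANDBY extends RecoveryState
  case object ALIVE extends RecoveryState

  sealed trait Reply
  case object MasterInStandby extends Reply                    // standby master: worker keeps waiting
  case class RegisteredWorker(masterId: String) extends Reply  // active master: worker pins masterRef

  // What a master replies to a worker's register request, depending on its state.
  def handleRegisterWorker(state: RecoveryState, masterId: String): Reply = state match {
    case STANDBY => MasterInStandby
    case ALIVE   => RegisteredWorker(masterId)
  }

  // What the worker does with each reply: it ignores MasterInStandby and only
  // switches its masterRef once it sees RegisteredWorker.
  def updateMasterRef(current: Option[String], reply: Reply): Option[String] = reply match {
    case MasterInStandby          => current
    case RegisteredWorker(master) => Some(master)
  }

  def main(args: Array[String]): Unit = {
    // The worker registers with every master; only the ALIVE one "sticks".
    val replies = Seq(
      handleRegisterWorker(STANDBY, "master-1"),
      handleRegisterWorker(ALIVE, "master-2")
    )
    val masterRef = replies.foldLeft(Option.empty[String])(updateMasterRef)
    println(s"worker's masterRef: $masterRef") // Some(master-2)
  }
}
```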
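And here is a toy illustration (again, not Spark code or a proposed patch) of the check-then-act window suggested by the log: even if the late-heartbeat path checked `state == ALIVE` before asking the worker to re-register, a leadership-revocation callback on another thread could flip the state right after the check. Funnelling both events through one single-threaded message loop (which is what `ThreadSafeRpcEndpoint` is meant to guarantee for RPC messages) is one way such a window gets closed.

```scala
// Toy model of the race between a late heartbeat and leadership revocation.
object CheckThenActRaceSketch {

  @volatile private var state: String = "ALIVE"

  // Models the heartbeat-from-unregistered-worker path with an added state check.
  private def onLateHeartbeat(): String = {
    if (state == "ALIVE") {
      // ... the revocation may land exactly here ...
      Thread.sleep(10) // widen the window so the race shows up reliably
      s"asked worker to re-register while state=$state"
    } else {
      "dropped heartbeat"
    }
  }

  // Models the leader-election agent revoking leadership on its own thread.
  private def revokeLeadership(): Unit = { state = "STANDBY" }

  def main(args: Array[String]): Unit = {
    val revoker = new Thread(() => { Thread.sleep(5); revokeLeadership() })
    revoker.start()
    // Typically prints "asked worker to re-register while state=STANDBY":
    // the check passed while ALIVE, but the action happened after revocation.
    println(onLateHeartbeat())
    revoker.join()
  }
}
```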
