Ngone51 commented on issue #24569: [SPARK-23191][CORE] Warn rather than terminate when duplicate worker register happens
URL: https://github.com/apache/spark/pull/24569#issuecomment-492269385

> Then how do the workers know which master is active?

Hmmm... a worker itself does not know which master is active until it is connected to one. It simply tries to connect to all masters and waits for their replies.

If a master loses the leader election, its state remains `STANDBY`. So, when it receives a worker's register request, the standby master ignores the request and just tells the worker "I'm in standby". The worker, in turn, also ignores this message and keeps waiting for the active master's response.

If a master is elected leader successfully, its state changes to `ALIVE`. So, when it receives a worker's register request, the alive (or active) master registers the worker and responds with `RegisteredWorker`. When the worker receives `RegisteredWorker`, it switches its masterRef (`RpcEndpointRef`) to the active master. Afterwards, the worker communicates with the active master through that masterRef directly.

And if the active master crashes, the worker re-tries to connect to all masters (the behavior may be a little different after #3447), and the process above repeats. (A simplified sketch of this handshake is at the end of this comment.)

Now, I'm thinking that we actually can tell whether a master is active just by checking whether its state is `STANDBY`. But in case 2, even if we could recognize whether a master is active, we may still not be able to avoid step (3). See the log snippets (see details in JIRA SPARK-23191) provided by @zuotingbing:

```
2019-03-15 20:22:09,441 INFO ZooKeeperLeaderElectionAgent: We have lost leadership
2019-03-15 20:22:14,544 WARN Master: Removing worker-20190218183101-vmax18-33129 because we got no heartbeat in 60 seconds
2019-03-15 20:22:14,544 INFO Master: Removing worker worker-20190218183101-vmax18-33129 on vmax18:33129
2019-03-15 20:22:14,864 WARN Master: Got heartbeat from unregistered worker worker-20190218183101-vmax18-33129. Asking it to re-register.
2019-03-15 20:22:14,975 ERROR Master: Leadership has been revoked -- master shutting down.
```

From the log, it looks like there is a race condition between ZooKeeperLeaderElectionAgent and Master: when the master receives the late heartbeat it is still active, but almost simultaneously it becomes inactive. (A toy illustration of this window is also at the end of this comment.)

Also, shouldn't we synchronize on the master's `receive`, given that it extends `ThreadSafeRpcEndpoint` (though this may not be the reason the race condition exists)?

https://github.com/apache/spark/blob/d9e4cf67c06b2d6daa4cd24b056e33dfb5eb35f5/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L221-L222
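To make the handshake above concrete, here is a standalone toy model (not the real `Master`/`Worker` code): the message and state names mirror the Spark ones (`RegisterWorker` handling, `MasterInStandby`, `RegisteredWorker`, `STANDBY`/`ALIVE`), but the types and bodies are simplified for illustration only.

```scala
// Standalone sketch of the register handshake described above. Names mirror the
// Spark ones, but this is a simplified model, not the actual implementation.
object RegisterHandshakeSketch {

  sealed trait RecoveryState
  case object STANDBY extends RecoveryState
  case object ALIVE extends RecoveryState

  sealed trait Reply
  case object MasterInStandby extends Reply                    // standby master: worker keeps waiting
  case class RegisteredWorker(masterId: String) extends Reply  // active master: worker pins masterRef

  // What a master replies to a worker's register request, depending on its state.
  def handleRegisterWorker(state: RecoveryState, masterId: String): Reply = state match {
    case STANDBY => MasterInStandby
    case ALIVE   => RegisteredWorker(masterId)
  }

  // What the worker does with each reply: it ignores MasterInStandby and only
  // switches its masterRef once it sees RegisteredWorker.
  def updateMasterRef(current: Option[String], reply: Reply): Option[String] = reply match {
    case MasterInStandby          => current
    case RegisteredWorker(master) => Some(master)
  }

  def main(args: Array[String]): Unit = {
    // The worker registers with every master; only the ALIVE one "sticks".
    val replies = Seq(
      handleRegisterWorker(STANDBY, "master-1"),
      handleRegisterWorker(ALIVE, "master-2")
    )
    val masterRef = replies.foldLeft(Option.empty[String])(updateMasterRef)
    println(s"worker's masterRef: $masterRef") // Some(master-2)
  }
}
```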
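And here is a toy illustration (again, not Spark code or a proposed patch) of the check-then-act window suggested by the log: even if the late-heartbeat path checked `state == ALIVE` before asking the worker to re-register, a leadership-revocation callback on another thread could flip the state right after the check. Funnelling both events through one single-threaded message loop (which is what `ThreadSafeRpcEndpoint` is meant to guarantee for RPC messages) is one way such a window gets closed.

```scala
// Toy model of the race between a late heartbeat and leadership revocation.
object CheckThenActRaceSketch {

  @volatile private var state: String = "ALIVE"

  // Models the heartbeat-from-unregistered-worker path with an added state check.
  private def onLateHeartbeat(): String = {
    if (state == "ALIVE") {
      // ... the revocation may land exactly here ...
      Thread.sleep(10) // widen the window so the race shows up reliably
      s"asked worker to re-register while state=$state"
    } else {
      "dropped heartbeat"
    }
  }

  // Models the leader-election agent revoking leadership on its own thread.
  private def revokeLeadership(): Unit = { state = "STANDBY" }

  def main(args: Array[String]): Unit = {
    val revoker = new Thread(() => { Thread.sleep(5); revokeLeadership() })
    revoker.start()
    // Typically prints "asked worker to re-register while state=STANDBY":
    // the check passed while ALIVE, but the action happened after revocation.
    println(onLateHeartbeat())
    revoker.join()
  }
}
```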
