Ngone51 commented on a change in pull request #24569: [SPARK-23191][CORE] Warn
rather than terminate when duplicate worker register happens
URL: https://github.com/apache/spark/pull/24569#discussion_r284147066
##########
File path: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
##########
@@ -485,8 +493,13 @@ private[deploy] class Worker(
masterRef.send(WorkerSchedulerStateResponse(workerId, execs.toList,
drivers.keys.toSeq))
case ReconnectWorker(masterUrl) =>
- logInfo(s"Master with url $masterUrl requested this worker to
reconnect.")
- registerWithMaster()
+ if (masterUrl != activeMasterUrl) {
Review comment:
> 2. `ReconnectWorker` may be sent by a standby master, as you explained in
the PR description.
I got the step order wrong in CASE 2 of the PR description (now revised). Sorry
about that. In fact, Master A is still active when it sends `ReconnectWorker`,
but it dies shortly afterwards (the race condition mentioned above).
So there's no doubt that the msg `ReconnectWorker(master)` must come from a
Master that is active at the time it sends the msg.
So, when the Worker receives that msg from Master X, the cases are:
1) Master X is active
  1.1) Master X is the initial active master (no `MasterChanged` msg)
    1.1.1) master == activeMasterUrl
      just reconnect to (all) masters
    1.1.2) master != activeMasterUrl
      impossible case
  1.2) Master X has been elected as the new active master
    1.2.1) master == activeMasterUrl (`MasterChanged` comes before `ReconnectWorker`)
      just reconnect to (all) masters
    1.2.2) master != activeMasterUrl (`MasterChanged` comes after `ReconnectWorker`)
      seems very unlikely, but can be a valid case as you mentioned above. In this case, we'll always ignore the reconnect msg until we receive `MasterChanged`.
2) Master X is inactive; Master Y takes over after Master X sends `ReconnectWorker`
  2.1) master == activeMasterUrl (`MasterChanged` from Y comes after `ReconnectWorker` from X)
    the active master has changed, but the Worker hasn't realized it yet. It will still try to reconnect to (all) masters. In this case (contrary to CASE 2), we'll hit the duplicate register issue.
  2.2) master != activeMasterUrl (`MasterChanged` from Y comes before `ReconnectWorker` from X)
    ignore it, since the Worker has already changed the active master to Master Y.
**Since this PR changes the result of a duplicate worker registration from exit
to warn, I think it's ok to remove this condition check here. The worst result
of accepting `ReconnectWorker` is a duplicate registration with the active
master, which is covered by this PR's fix.**
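
To make the cases above easier to compare, here is a minimal standalone sketch. It is not the actual `Worker.scala` code: the `Outcome` type, the `withCheck`/`withoutCheck` helpers, and the master URLs are hypothetical, and it assumes the guard simply ignores a `ReconnectWorker` whose URL differs from `activeMasterUrl`.

```scala
object ReconnectCases {
  sealed trait Outcome
  case object Reconnect extends Outcome // worker re-registers with (all) masters
  case object Ignore extends Outcome    // worker drops the message

  // Assumed guard: ignore the request unless it names the master that the
  // worker currently believes is active.
  def withCheck(masterUrl: String, activeMasterUrl: String): Outcome =
    if (masterUrl != activeMasterUrl) Ignore else Reconnect

  // Proposed behavior without the check: always honor the request.
  def withoutCheck(masterUrl: String, activeMasterUrl: String): Outcome =
    Reconnect

  // (case label, URL carried by ReconnectWorker, worker's activeMasterUrl)
  // The URLs are made up for illustration only.
  val cases: Seq[(String, String, String)] = Seq(
    ("1.1.1", "spark://x:7077", "spark://x:7077"),
    ("1.2.1", "spark://x:7077", "spark://x:7077"),
    ("1.2.2", "spark://x:7077", "spark://w:7077"), // MasterChanged not yet seen
    ("2.1",   "spark://x:7077", "spark://x:7077"), // X has died, worker unaware
    ("2.2",   "spark://x:7077", "spark://y:7077")  // worker already switched to Y
  )

  def main(args: Array[String]): Unit =
    cases.foreach { case (label, masterUrl, active) =>
      println(s"$label: withCheck=${withCheck(masterUrl, active)}, " +
        s"withoutCheck=${withoutCheck(masterUrl, active)}")
    }
}
```

Under these assumptions the check only changes the outcome in 1.2.2 and 2.2, and it does not prevent the reconnect in 2.1, where the duplicate register can still happen.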