GitHub user andrewor14 opened a pull request:

    https://github.com/apache/spark/pull/3447

    [SPARK-4592] Avoid duplicate worker registrations in standalone mode

    **Symptom.** On failover, the Master may receive duplicate registrations 
from the same worker, causing the worker to exit.
    
    **Cause.** An earlier commit, 
https://github.com/apache/spark/commit/4afe9a4852ebeb4cc77322a14225cd3dec165f3f, 
added logic for the worker to re-register with the master in case of failures. 
However, the following race condition may occur:
    
    (1) Master A fails and Worker attempts to reconnect to all masters
    (2) Master B takes over and notifies Worker
    (3) Worker responds by registering with Master B
    (4) Meanwhile, Worker's previous reconnection attempt reaches Master B, 
causing the same Worker to register with Master B twice
    
    **Fix.** Instead of attempting to register with all known masters, the 
worker should re-register with only the one that it has been communicating 
with. Then, when it is finally notified of the change in master, the worker 
gives up on the old master and communicates with the new one.
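
The race and the fix described above can be illustrated with a toy model. This is only a sketch and not the actual Worker or Master code: the `Master` class, the `simulate_failover` function, the `in_flight` list standing in for delayed messages, and the rule that a standby master silently drops registrations are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    """Hypothetical stand-in for a standalone Master."""
    name: str
    active: bool = False
    registered: set = field(default_factory=set)
    duplicates: int = 0

    def register(self, worker_id: str) -> None:
        if not self.active:
            return  # assumption: a failed or standby master drops the request
        if worker_id in self.registered:
            self.duplicates += 1  # duplicate registration from the same worker
        else:
            self.registered.add(worker_id)


def simulate_failover(register_with_all: bool) -> int:
    """Replay steps (1)-(4); return the duplicates seen by Master B."""
    a = Master("A", active=True)
    b = Master("B")
    worker_id = "worker-1"
    in_flight = []  # registration requests sent but not yet delivered

    # (1) Master A fails and the worker sends its reconnection attempt.
    #     Old behavior: all known masters. Fixed behavior: only the master
    #     the worker has been communicating with.
    a.active = False
    in_flight.extend([a, b] if register_with_all else [a])

    # (2) Master B takes over and notifies the worker.
    b.active = True

    # (3) The worker responds by registering with Master B.
    b.register(worker_id)

    # (4) The delayed reconnection attempt from step (1) finally arrives.
    for master in in_flight:
        master.register(worker_id)

    return b.duplicates
```

In this model, `simulate_failover(True)` returns 1 (Master B sees the same worker twice), while `simulate_failover(False)` returns 0 because the stale request only ever reaches the failed Master A.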
    
    **Caveat.** Even this fix is subject to more obscure race conditions. For 
instance, if Master B fails and Master A recovers immediately, then Master A 
may still observe duplicate worker registrations. However, this, and other 
potential race conditions summarized in 
[SPARK-4592](https://issues.apache.org/jira/browse/SPARK-4592), are much, much 
less likely than the one described above, which is deterministically 
reproducible.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/andrewor14/spark standalone-failover

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3447.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3447
    
----
commit b6f269e6460ecc441c319b5e92437e47d141c361
Author: Andrew Or <[email protected]>
Date:   2014-11-25T07:40:00Z

    Avoid duplicate worker registrations
    
    The gist is that we only reconnect to the master we've been
    communicating with instead of making a registration request
    to all known masters. More details in the code comments.

commit 1fce6a9343d6f563dac0c793480420c6511091ac
Author: Andrew Or <[email protected]>
Date:   2014-11-25T08:06:04Z

    Active master actor could be null in the beginning
    
    If a worker cannot initially reach a master, then it will attempt
    a retry. In this case, the active master actor must be null. This
    commit removes an assert that falsely assumes the contrary.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

