GitHub user mccheah opened a pull request:

    https://github.com/apache/spark/pull/2828

    [SPARK-3736] Workers reconnect when disassociated from the master.

    Previously, if the master node was killed and restarted, the worker
    nodes did not attempt to reconnect to it. As a result, whenever the
    master was restarted, the workers had to be restarted as well.
    
    Now, when a worker is disconnected from the master, it repeatedly
    pings the master in an attempt to reconnect. Once the master
    restarts, it picks up the registration requests from its former
    workers, and the cluster returns to a healthy state.
    
    In addition, the master removes a worker when it stops receiving
    heartbeats from that worker. Previously, if the removed worker later
    sent a heartbeat, the master ignored it. Now, a master that receives
    a heartbeat from a worker it had removed asks the worker to repeat
    the registration process, at which point the worker sends a
    RegisterWorker request and is reconnected.
    
    Each worker submits a reconnection attempt every N seconds, where N
    is set by the property spark.worker.reconnect.interval; the default
    is currently 60 seconds.
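
    The heartbeat handling described above can be sketched roughly as
    follows. This is an illustrative Python model, not the Scala code in
    the patch: the Master class, its method names, and the "Ack" and
    "ReconnectWorker" reply strings are assumptions made for the example.
    Only the RegisterWorker request is named in the description.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    """Toy model of the master's worker bookkeeping (hypothetical names)."""

    # worker_id -> time of the last heartbeat or registration
    workers: dict = field(default_factory=dict)

    def register_worker(self, worker_id: str, now: float) -> str:
        # Stands in for handling a RegisterWorker request.
        self.workers[worker_id] = now
        return "RegisteredWorker"

    def heartbeat(self, worker_id: str, now: float) -> str:
        if worker_id in self.workers:
            self.workers[worker_id] = now
            return "Ack"  # hypothetical reply, used only in this sketch
        # Heartbeat from a worker the master no longer tracks: ask it to
        # register again instead of ignoring the heartbeat.
        return "ReconnectWorker"

    def remove_timed_out(self, now: float, timeout: float) -> list:
        # Drop workers whose last heartbeat is older than the timeout.
        dead = [w for w, seen in self.workers.items() if now - seen > timeout]
        for w in dead:
            del self.workers[w]
        return dead
```

    In this model, a worker that was dropped for missing heartbeats gets
    a "ReconnectWorker" reply on its next heartbeat and can then call
    register_worker again to rejoin.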
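
    As a rough illustration of the retry behaviour, the loop below
    attempts registration at a fixed interval until it succeeds. It is a
    minimal Python sketch under stated assumptions, not the worker code
    from the patch: reconnect_until_registered, try_register, and the
    injectable sleep parameter are hypothetical names. The 60-second
    default is the one given in the description.

```python
import time
from typing import Callable

# Default from the PR description; in Spark it would come from the
# spark.worker.reconnect.interval property (not read in this sketch).
DEFAULT_RECONNECT_INTERVAL_S = 60.0


def reconnect_until_registered(
    try_register: Callable[[], bool],
    interval_s: float = DEFAULT_RECONNECT_INTERVAL_S,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call try_register every interval_s seconds until it returns True.

    Returns the number of attempts that were made.
    """
    attempts = 0
    while True:
        attempts += 1
        if try_register():
            return attempts
        # Master not reachable yet: wait one interval before retrying.
        sleep(interval_s)
```

    Passing sleep as a parameter keeps the loop testable without real
    delays; a caller would normally leave it at time.sleep.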

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mccheah/spark reconnect-dead-workers

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2828.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2828
    
----
commit b5b34af964199af296e12490413225f55d93a6cd
Author: mcheah <[email protected]>
Date:   2014-10-15T22:27:21Z

    [SPARK-3736] Workers reconnect when disassociated from the master.
    

----

