[GitHub] spark pull request: [SPARK-10622] [core] Differentiate dead from "...

vanzin Wed, 23 Sep 2015 14:21:32 -0700

GitHub user vanzin opened a pull request:

    https://github.com/apache/spark/pull/8887


    [SPARK-10622] [core] Differentiate dead from "mostly dead" executors.

    In YARN mode, when preemption is enabled, we may leave executors in a
    zombie state while we wait to retrieve the reason the executor exited.
    This is so that tasks that were running on a preempted executor are
    not counted as failures.
    
    The issue is that while we wait for this information, the scheduler
    might decide to schedule tasks on the executor, which will never be
    able to run them. Other side effects include the block manager still
    considering the executor available to cache blocks, for example.
    
    So, when we know that an executor went down but we don't know why,
    stop everything related to the executor, except its running tasks.
    Only when we know the reason for the exit (or give up waiting for
    it) do we update the running tasks.
    
    This is achieved by a new `disableExecutor()` method in the
    `Schedulable` interface. For managers that do not behave like this
    (i.e. every one but YARN), the existing `executorLost()` method
    will behave the same way it did before.
    
    On top of that change, this PR includes a few minor changes that make
    debugging easier and fix some other minor issues:
    - The cluster-mode AM was printing a misleading log message every
      time an executor disconnected from the driver (because the akka
      actor system was shared between driver and AM).
    - Avoid sending unnecessary requests for an executor's exit reason
      when we already know it was explicitly disabled / killed. This
      avoids both multiple requests, and unnecessary requests that would
      just cause warning messages on the AM (in the explicit kill case).
    - Tone down a log message about the executor being lost when it
      exited normally (e.g. due to preemption).
    - Wake up the AM monitor thread when requests for executor loss
      reasons arrive too, so that we can more quickly remove executors
      from this zombie state.
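
The distinction between a disabled ("mostly dead") executor and a lost one can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the code in this pull request: `ToyScheduler`, `ExecutorState`, `offer()` and the `preempted` flag are hypothetical names, and the real change lives in Spark's Scala scheduler classes. It shows the behavior described above: disabling stops new scheduling but leaves running tasks untouched, and the tasks are only settled once the exit reason is known.

```python
from enum import Enum, auto


class ExecutorState(Enum):
    ALIVE = auto()
    DISABLED = auto()  # "mostly dead": process is gone, exit reason still pending
    LOST = auto()      # dead: exit reason known, or we gave up waiting for it


class ToyScheduler:
    """Hypothetical model of the scheduler; names do not match Spark's API."""

    def __init__(self):
        self.state = {}        # executor id -> ExecutorState
        self.running = {}      # executor id -> set of running task ids
        self.resubmitted = []  # tasks queued to run again elsewhere
        self.failures = 0      # task failures counted against the job

    def add_executor(self, exec_id):
        self.state[exec_id] = ExecutorState.ALIVE
        self.running[exec_id] = set()

    def offer(self, exec_id, task_id):
        """Try to schedule a task; only ALIVE executors accept new work."""
        if self.state.get(exec_id) is not ExecutorState.ALIVE:
            return False
        self.running[exec_id].add(task_id)
        return True

    def disable_executor(self, exec_id):
        """Executor went down, reason unknown: stop scheduling, keep its tasks."""
        if self.state.get(exec_id) is ExecutorState.ALIVE:
            self.state[exec_id] = ExecutorState.DISABLED

    def executor_lost(self, exec_id, preempted):
        """Reason known (or wait timed out): now settle the running tasks."""
        if self.state.get(exec_id) in (None, ExecutorState.LOST):
            return  # unknown executor, or already handled
        self.state[exec_id] = ExecutorState.LOST
        tasks = sorted(self.running.pop(exec_id, set()))
        if not preempted:
            self.failures += len(tasks)  # preempted tasks are not counted
        self.resubmitted.extend(tasks)
```

In this model, calling `disable_executor("e1")` makes later `offer("e1", ...)` calls return `False` while `running["e1"]` is left intact. A later `executor_lost("e1", preempted=True)` resubmits those tasks without increasing `failures`.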

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vanzin/spark SPARK-10622

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8887.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8887
    
----
commit 6f834fdd376cb0cd225fef3d8a1b9c05c4918982
Author: Marcelo Vanzin <[email protected]>
Date:   2015-09-22T21:07:43Z

    [SPARK-10622] [core] Differentiate dead from "mostly dead" executors.
    

----

