GitHub user kayousterhout opened a pull request:

    https://github.com/apache/spark/pull/9164

    [SPARK-11178] Improving naming around task failures.

    Commit af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0 introduced new
    functionality so that if an executor dies for a reason that's not
    caused by one of the tasks running on the executor (e.g., due to
    pre-emption), Spark doesn't count the failure towards the maximum
    number of failures for the task.  That commit introduced some vague
    naming that this commit attempts to fix; in particular:
    
    (1) The variable "isNormalExit", which was used to refer to cases where
    the executor died for a reason unrelated to the tasks running on the
    machine, has been renamed to "exitUnrelatedToRunningTasks". The problem
    with the existing name is that it's not clear (at least to me!) what it
    means for an exit to be "normal"; the new name is intended to make the
    purpose of this variable more clear.
    
    (2) The variable "shouldEventuallyFailJob" has been renamed to
    "countTowardsTaskFailures". This variable is used to determine whether
    a task's failure should be counted towards the maximum number of
    failures allowed for a task before the associated Stage is aborted. The
    problem with the existing name is that it can be misread as implying
    that the task's failure should immediately cause the stage to fail
    because it is somehow fatal (this is the case for a fetch failure, for
    example: if a task fails because of a fetch failure, there's no point in
    retrying, and the whole stage should be failed). Both renames are
    sketched below.
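    To make the intent of both renames concrete, here is a minimal,
    self-contained Scala sketch. These are simplified stand-ins for
    illustration, not the actual Spark classes; everything other than the
    two renamed flags is hypothetical.

    ```scala
    // Simplified stand-ins for illustration only (not the real Spark classes).

    // (1) Reason an executor was lost, recording whether the exit was unrelated
    //     to the tasks it was running (e.g., the executor was pre-empted).
    case class ExecutorExited(
        exitCode: Int,
        exitUnrelatedToRunningTasks: Boolean,  // formerly "isNormalExit"
        reason: String)

    // (2) Reason a task failed, indicating whether the failure should count
    //     towards the maximum number of failures allowed for that task.
    sealed trait TaskFailedReason {
      def countTowardsTaskFailures: Boolean   // formerly "shouldEventuallyFailJob"
    }

    // A task lost because its executor died: only blame the task if the
    // executor's death was related to the tasks it was running.
    case class ExecutorLostFailure(
        executorId: String,
        exitUnrelatedToRunningTasks: Boolean) extends TaskFailedReason {
      override def countTowardsTaskFailures: Boolean = !exitUnrelatedToRunningTasks
    }
    ```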
    
    There are a few remaining issues that it would be helpful if folks
    commented on:
    
    (a) In YarnAllocator.scala, line 447: what does "exited normally" mean
    here? I'm hoping to improve the comment to be clearer to those less
    familiar with YARN.
    
    (b) In SparkDeploySchedulerBackend, the original commit seems to have
    changed the behavior here so that when executorRemoved is called, the
    failure isn't counted towards the task's maximum number of failures.
    Why is this? It looks like executorRemoved can be called because
    the JVM died, which can be the result of an issue with a running task?
    
    (c) This commit changes the JsonProtocol. I think this is OK because the
    existing JSON was never part of a release, so it only affects folks who
    run on master; it just means that if someone uses the new code to load
    old event logs, the value of "isNormalExit" (now
    "exitUnrelatedToRunningTasks") won't display correctly. It would be
    great if others could verify this.
    
    (d) I don't love using a negative in the variable name
    ("exitUnrelatedToRunningTasks") because it leads to a double negative
    when the variable is false.  But I also thought this might be clearer
    than a variable called "exitRelatedToRunningTasks", because with the
    latter name, it seems like it should always be true.  More importantly,
    I didn't want to reverse the variable unless others agreed it should be
    done, because it's annoying to reverse all of the uses of it.  Let me
    know if you think it should be reversed, and I'll do that.
    
    (e) For the case where the commit failed, we never count that towards
    the max failures.  Is there any legitimate reason the commit could fail?
    I'm wondering if it's possible to have a stage that fails infinitely
    many times due to a real problem with committing the task (see the
    sketch after this list).
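    To make (e) concrete, here is a minimal sketch of how a flag like
    "countTowardsTaskFailures" would typically gate the per-task failure
    count against a limit such as spark.task.maxFailures. It reuses the
    hypothetical TaskFailedReason trait from the sketch above, and the
    tracker class here is made up for illustration (it is not the real
    TaskSetManager): a reason whose failures never count can never trip the
    limit, which is why a commit that always fails could in principle retry
    forever.

    ```scala
    import scala.collection.mutable

    // Hypothetical, simplified tracker; illustrative only, not TaskSetManager.
    class SimpleFailureTracker(maxTaskFailures: Int) {
      // Failure count per task index; only failures that "count" are recorded.
      private val numFailures = mutable.Map[Int, Int]().withDefaultValue(0)

      /** Returns true if the task (and hence the stage) should be given up on. */
      def handleFailedTask(taskIndex: Int, reason: TaskFailedReason): Boolean = {
        if (reason.countTowardsTaskFailures) {
          numFailures(taskIndex) += 1
        }
        // A reason that never counts (e.g., a failed commit, or an executor lost
        // for reasons unrelated to its tasks) never increments the count, so it
        // can never push the task over maxTaskFailures.
        numFailures(taskIndex) >= maxTaskFailures
      }
    }
    ```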
    
    cc @andrewor14 @mccheah and @vanzin who were involved in the original naming

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kayousterhout/spark-1 SPARK-11178

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9164.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9164
    
----
commit cf6f21ee6d692827cebb7b238c0218387fcf6032
Author: Kay Ousterhout <[email protected]>
Date:   2015-10-17T03:09:11Z

    [SPARK-11178] Improving naming around task failures.
    
    Commit af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0 introduced new
    functionality so that if an executor dies for a reason that's not
    caused by one of the tasks running on the executor (e.g., due to
    pre-emption), Spark doesn't count the failure towards the maximum
    number of failures for the task.  That commit introduced some vague
    naming that this commit attempts to fix; in particular:
    
    (1) The variable "isNormalExit", which was used to refer to cases where
    the executor died for a reason unrelated to the tasks running on the
    machine, has been renamed to "exitUnrelatedToRunningTasks". The problem
    with the existing name is that it's not clear (at least to me!) what it
    means for an exit to be "normal"; the new name is intended to make the
    purpose of this variable more clear.
    
    (2) The variable "shouldEventuallyFailJob" has been renamed to
    "countTowardsTaskFailures". This variable is used to determine whether
    a task's failure should be counted towards the maximum number of
    failures allowed for a task before the associated Stage is aborted. The
    problem with the existing name is that it can be misread as implying
    that the task's failure should immediately cause the stage to fail
    because it is somehow fatal (this is the case for a fetch failure, for
    example: if a task fails because of a fetch failure, there's no point in
    retrying, and the whole stage should be failed).
    
    There are a few remaining issues that it would be helpful if folks
    commented on:
    
    (a) In YarnAllocator.scala, line 447: what does "exited normally" mean
    here? I'm hoping to improve the comment to be clearer to those less
    familiar with YARN.
    
    (b) In SparkDeploySchedulerBackend, the original commit seems to have
    changed the behavior here so that when executorRemoved is called, the
    failure isn't counted towards the task's maximum number of failures.
    Why is this? It looks like executorRemoved can be called because
    the JVM died, which can be the result of an issue with a running task?
    
    (c) This commit changes the JsonProtocol. I think this is OK because the
    existing JSON was never part of a release, so it only affects folks who
    run on master; it just means that if someone uses the new code to load
    old event logs, the value of "isNormalExit" (now
    "exitUnrelatedToRunningTasks") won't display correctly. It would be
    great if others could verify this.
    
    (d) I don't love using a negative in the variable name
    ("exitUnrelatedToRunningTasks") because it leads to a double negative
    when the variable is false.  But I also thought this might be clearer
    than a variable called "exitRelatedToRunningTasks", because with the
    latter name, it seems like it should always be true.  More importantly,
    I didn't want to reverse the variable unless others agreed it should be
    done, because it's annoying to reverse all of the uses of it.  Let me
    know if you think it should be reversed, and I'll do that.
    
    (e) For the case where the commit failed, we never count that towards
    the max failures.  Is there any legitimate reason the commit could fail?
    I'm wondering if it's possible to have a stage that fails infinitely
    many times due to a real problem with committing the task.

----

