GitHub user kayousterhout opened a pull request:
https://github.com/apache/spark/pull/9164
[SPARK-11178] Improving naming around task failures.
Commit af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0 introduced new
functionality so that if an executor dies for a reason that's not
caused by one of the tasks running on the executor (e.g., due to
pre-emption), Spark doesn't count the failure towards the maximum
number of failures for the task. That commit introduced some vague
naming that this commit attempts to fix; in particular:
(1) The variable "isNormalExit", which was used to refer to cases where
the executor died for a reason unrelated to the tasks running on the
machine, has been renamed to "exitUnrelatedToRunningTasks". The problem
with the existing name is that it's not clear (at least to me!) what it
means for an exit to be "normal"; the new name is intended to make the
purpose of this variable clearer.
(2) The variable "shouldEventuallyFailJob" has been renamed to
"countTowardsTaskFailures". This variable is used to determine whether
a task's failure should be counted towards the maximum number of failures
allowed for a task before the associated Stage is aborted. The problem
with the existing name is that it could be read as implying that the
task's failure should immediately cause the stage to fail because it is
somehow fatal (as is the case for a fetch failure, for example: if a task
fails because of a fetch failure, there's no point in retrying the task,
and the whole stage should be failed). Both renames are illustrated in
the sketch below.
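To make the two renames concrete, here's a rough Scala sketch. The types,
signatures, and values below are simplified stand-ins for illustration, not
the actual Spark classes this patch touches:

    // Simplified, illustrative types only (not the actual Spark classes
    // touched by this patch).

    // "exitUnrelatedToRunningTasks" (formerly "isNormalExit") is true when the
    // executor died for reasons outside the tasks' control, e.g. preemption.
    case class ExecutorExited(
        exitCode: Int,
        exitUnrelatedToRunningTasks: Boolean,
        reason: String)

    // "countTowardsTaskFailures" (formerly "shouldEventuallyFailJob"): whether
    // a failure counts towards spark.task.maxFailures is independent of whether
    // the failure is "fatal" for the stage.
    sealed trait TaskFailureReason {
      def countTowardsTaskFailures: Boolean
    }

    // A fetch failure fails the whole stage immediately, so per-task retry
    // counting is beside the point.
    case object FetchFailed extends TaskFailureReason {
      val countTowardsTaskFailures: Boolean = false
    }

    // An executor lost to something like preemption should not penalize the
    // tasks that happened to be running on it.
    case class ExecutorLostFailure(exitUnrelatedToRunningTasks: Boolean)
        extends TaskFailureReason {
      def countTowardsTaskFailures: Boolean = !exitUnrelatedToRunningTasks
    }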
There are a few remaining issues that it would be helpful if
folks commented on:
(a) In YarnAllocator.scala, line 447: what does "exited normally" mean
here? I'm hoping to improve the comment to be clearer to those less
familiar with YARN.
(b) In SparkDeploySchedulerBackend, the original commit seems to have
changed the behavior so that when executorRemoved is called, the
failure isn't counted towards the task's maximum number of failures.
Why is this? It looks like executorRemoved can be called because the
executor JVM died, which could itself be caused by a problem with one of
the running tasks.
(c) This commit changes the JsonProtocol. I think this is OK because the
existing JSON was never part of a release, so this only affects folks
who run on master; it just means that if someone uses the new code to
load old event logs, the value of "isNormalExit" (now
"exitUnrelatedToRunningTasks") won't display correctly. It would be
great if others could verify this. (A possible read-side fallback is
sketched after this list.)
(d) I don't love using a negative in the variable name
("exitUnrelatedToRunningTasks") because it leads to a double negative
when the variable is false. But I also thought this might be clearer
than a variable called "exitRelatedToRunningTasks", because with the
latter name, it seems like it should always be true. More importantly,
I didn't want to reverse the variable unless others agreed it should be
done, because it's annoying to reverse all of its uses. Let me know if
you think it should be reversed, and I'll do that.
(e) For the case where the task's commit failed, we never count that
towards the max failures. Is there any legitimate reason the commit could
fail? I'm wondering if it's possible to have a stage that fails infinitely
many times due to a real problem with committing the task. (See the
retry-accounting sketch after this list.)
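Regarding (c): if backwards compatibility for event logs written by older
master builds were wanted, a read-side fallback along these lines would work.
This is only a sketch; the JSON field names below are made up for
illustration and the real JsonProtocol keys may differ:

    import org.json4s._

    // Sketch only: read the new field if present, fall back to the old field
    // from pre-rename event logs, and default to false if neither exists.
    // The keys "Exit Unrelated To Running Tasks" and "Is Normal Exit" are
    // hypothetical placeholders.
    def exitUnrelatedToRunningTasksFromJson(json: JValue): Boolean = {
      implicit val formats: Formats = DefaultFormats
      (json \ "Exit Unrelated To Running Tasks").extractOpt[Boolean]
        .orElse((json \ "Is Normal Exit").extractOpt[Boolean])
        .getOrElse(false)
    }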
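Regarding (e): the concern is easier to see against the retry accounting
itself. Roughly (a simplified sketch, not the actual TaskSetManager logic):

    // Sketch: only failures flagged with countTowardsTaskFailures increment
    // the per-task failure counter, and the stage is aborted only when that
    // counter reaches spark.task.maxFailures. A failure mode that never counts
    // (such as a failed task commit here) could therefore be retried forever
    // if the underlying problem is permanent.
    def handleFailedTask(
        taskIndex: Int,
        countTowardsTaskFailures: Boolean,
        failureCounts: Array[Int],
        maxTaskFailures: Int): Unit = {
      if (countTowardsTaskFailures) {
        failureCounts(taskIndex) += 1
        if (failureCounts(taskIndex) >= maxTaskFailures) {
          sys.error(s"Task $taskIndex failed $maxTaskFailures times; aborting stage")
        }
      }
      // Otherwise the task is re-queued without being penalized.
    }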
cc @andrewor14, @mccheah, and @vanzin, who were involved in the original naming
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/kayousterhout/spark-1 SPARK-11178
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/9164.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #9164
----
commit cf6f21ee6d692827cebb7b238c0218387fcf6032
Author: Kay Ousterhout <[email protected]>
Date: 2015-10-17T03:09:11Z
[SPARK-11178] Improving naming around task failures.
----