[
https://issues.apache.org/jira/browse/SPARK-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507957#comment-14507957
]
Kay Ousterhout commented on SPARK-5945:
---------------------------------------
Commenting here rather than on the github for archiving purposes!
I took a look at the proposed pull request, and I'd be in favor of a much
simpler approach, where for each stage, we track the number of failures (from
any cause), and then fail the job once a stage fails 4 times (4, for
consistency with the max task failures). If the stage succeeds, we can reset
the count to 0, to avoid the potential problem [~imranr] mentioned for stages
that are re-used by many jobs (so the counter would be numConsecutiveFailures
or something like that). This can just be added to the Stage class, I think.
This is consistent with the approach we use for tasks, where if a task has
failed 4 times (for any reason), we abort the stage.
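To make that concrete, here's a rough, self-contained sketch of the counter I have in mind (the class and method names are illustrative only, not the actual Stage / DAGScheduler API):
{code}
// Illustrative sketch only -- not the real Stage / DAGScheduler API.
// Each stage would carry a consecutive-failure counter, checked against the
// same limit we already use for task failures.
class StageFailureCounter(maxConsecutiveFailures: Int = 4) {
  private var consecutiveFailures = 0

  // Called each time a stage attempt fails (e.g. after a FetchFailedException).
  // Returns true if the job should be aborted instead of resubmitting the stage.
  def failedAttempt(): Boolean = {
    consecutiveFailures += 1
    consecutiveFailures >= maxConsecutiveFailures
  }

  // Called when a stage attempt succeeds, so a stage that is re-used by many
  // jobs doesn't accumulate failures across successful runs.
  def succeededAttempt(): Unit = {
    consecutiveFailures = 0
  }
}
{code}
The scheduler would call failedAttempt() wherever it currently resubmits a failed stage (aborting the job once it returns true) and succeededAttempt() when the stage completes, mirroring the per-task failure limit.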
I also would advocate against adding a configuration parameter for this. I
can't imagine a case where someone would want to keep trying after 4 failures,
and I think sometimes configuration parameters for things like this lead people
to believe they can fix a problem by changing the configuration variable (just
up the max number of failures!!) when really there is some bigger underlying
issue they should fix. 4 seems to have worked well for tasks, so I'd just use
the same default here (and it's always easy to add a configuration variable
later on if lots of people say they need it).
> Spark should not retry a stage infinitely on a FetchFailedException
> -------------------------------------------------------------------
>
> Key: SPARK-5945
> URL: https://issues.apache.org/jira/browse/SPARK-5945
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Imran Rashid
> Assignee: Ilya Ganelin
>
> While investigating SPARK-5928, I noticed some very strange behavior in the
> way Spark retries stages after a FetchFailedException. It seems that on a
> FetchFailedException, instead of simply killing the task and retrying, Spark
> aborts the stage and retries. If it just retried the task, the task might
> fail 4 times and then trigger the usual job killing mechanism. But by
> killing the stage instead, the max retry logic is skipped (it looks to me
> like there is no limit for retries on a stage).
> After a bit of discussion with Kay Ousterhout, it seems the idea is that if a
> fetch fails, we assume that the block manager we are fetching from has
> failed, and that it will succeed if we retry the stage w/out that block
> manager. In that case, it wouldn't make any sense to retry the task, since
> it's doomed to fail every time, so we might as well kill the whole stage. But
> this raises two questions:
> 1) Is it really safe to assume that a FetchFailedException means that the
> BlockManager has failed, and it will work if we just try another one?
> SPARK-5928 shows that there are at least some cases where that assumption is
> wrong. Even if we fix that case, this logic seems brittle to the next case
> we find. I guess the idea is that this behavior is what gives us the "R" in
> RDD ... but it seems like it's not really that robust and maybe should be
> reconsidered.
> 2) Should stages only be retried a limited number of times? It would be
> pretty easy to put in a limited number of retries per stage. Though again,
> we encounter issues with keeping things resilient. Theoretically one stage
> could rack up many retries that are really caused by failures in different
> stages further downstream, so we might need to track the cause of each retry
> as well to still get the desired behavior.
> In general it just seems there is some flakiness in the retry logic. This is
> the only reproducible example I have at the moment, but I vaguely recall
> hitting other cases of strange behavior w/ retries when trying to run long
> pipelines. E.g., if one executor is stuck in a GC pause during a fetch, the
> fetch fails; the executor eventually comes back and the stage gets retried,
> but the same GC issues happen the second time around, etc.
> Copied from SPARK-5928, here's the example program that can regularly produce
> a loop of stage failures. Note that it will only fail from a remote fetch,
> so it can't be run locally -- I ran with {{MASTER=yarn-client spark-shell
> --num-executors 2 --executor-memory 4000m}}
> {code}
> val rdd = sc.parallelize(1 to 1e6.toInt, 1).map { ignore =>
>   val n = 3e3.toInt
>   val arr = new Array[Byte](n)
>   // need to make sure the array doesn't compress to something small
>   scala.util.Random.nextBytes(arr)
>   arr
> }
> rdd.map { x => (1, x) }.groupByKey().count()
> {code}