GitHub user ilganeli opened a pull request:

    https://github.com/apache/spark/pull/5636

    [SPARK-5945] Spark should not retry a stage infinitely on a 
FetchFailedException

    All - I've added a map that tracks the reasons for stage failures and the 
number of failures per reason. I then decide whether to abort a stage based on 
whether it has failed a sufficient number of times for a single reason.
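
    As a rough illustration of the bookkeeping described above (not the code 
in this pull request, which changes Spark's Scala scheduler), here is a minimal 
Python sketch. The class name, method names, and the default threshold are 
hypothetical and chosen only for this example.

```python
from collections import defaultdict


class StageFailureTracker:
    """Hypothetical sketch: counts stage failures per (stage, reason)."""

    def __init__(self, max_failures_per_reason=4):
        # Assumed threshold; the right value is one of the open questions.
        self.max_failures_per_reason = max_failures_per_reason
        # stage_id -> reason -> number of failures seen for that reason
        self._counts = defaultdict(lambda: defaultdict(int))

    def record_failure(self, stage_id, reason):
        """Record one failure and return the updated count for that reason."""
        self._counts[stage_id][reason] += 1
        return self._counts[stage_id][reason]

    def should_abort(self, stage_id):
        """True once any single reason has occurred more than the threshold."""
        reasons = self._counts.get(stage_id, {})
        return any(n > self.max_failures_per_reason for n in reasons.values())
```

    Counting per reason rather than per stage means that a stage hitting 
several unrelated failures is not aborted as quickly as one that keeps failing 
in the same way.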
    
    The open questions are:
    
    1) Is it really safe to assume that a FetchFailedException means that the 
BlockManager has failed, and that it will work if we just try another one? In 
other words, does retrying the failed stage even make sense?
    2) For the equality check, I'm using the failureMessage inside the handler 
for FetchFailed. Is it safe to assume that this String will be consistent 
between subsequent stage failures, or does it contain counters or similar 
values that will change? If it isn't safe to use for comparison, is there some 
other metric that can be used to track the number of failures for a particular 
reason?
    3) How many attempts should be made for a stage before aborting it? I know 
that a task can have up to 4 failures before it's aborted, but since a stage 
may see multiple fetch failures from concurrent tasks, I assume this number 
should be higher.
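
    To make the concern in question 2 concrete, here is a small Python sketch 
of what happens if the message used as the key contains a changing counter. The 
message strings are invented for illustration and are not actual Spark output, 
and the `stable_key` helper is hypothetical; whether the real failureMessage 
varies like this is exactly what the question asks.

```python
import re
from collections import Counter


def stable_key(message):
    """Hypothetical normalizer: replace every run of digits with '#'."""
    return re.sub(r"\d+", "#", message)


# Made-up messages that differ only by an embedded counter.
messages = [
    "fetch failed: attempt 1",
    "fetch failed: attempt 2",
    "fetch failed: attempt 3",
]

# Keyed by the raw string, every failure looks like a new reason.
raw_counts = Counter(messages)

# Keyed by the normalized string, the failures collapse into one reason.
normalized_counts = Counter(stable_key(m) for m in messages)
```

    With raw strings as keys, no single reason would ever reach the abort 
threshold, so the stage could still be retried indefinitely.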

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ilganeli/spark SPARK-5945

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5636.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5636
    
----
commit 40aefbedda98828d191ba463725cd1278b0b25ad
Author: Ilya Ganelin <[email protected]>
Date:   2015-04-22T18:07:23Z

    [SPARK-5945] Added map to track reasons for stage failures and supporting 
function to check whether to abort a stage when it fails for a single reason 
more than N times.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
