GitHub user tigerquoll opened a pull request:

    https://github.com/apache/spark/pull/5250

    [CORE] [SPARK-6593] Provide option for HadoopRDD to skip bad data splits

    When reading a large number of files from HDFS, e.g. with 
sc.textFile("hdfs:///user/cloudera/logs*.gz"), a single corrupted split 
cancels the entire job. As default behaviour this is probably for the best, 
but in circumstances where you know it is safe to do so, it would be nice to 
have the option to skip the corrupted portion and continue the job.
    
    Ideally I'd like to be able to report the list of bad files directly back 
to the master, but that would involve a public API change and is a discussion 
for later.  For now, I propose a new option, 
'spark.hadoop.ignoreInputErrors', which defaults to false to maintain the 
existing behaviour of terminating the job upon any exception from the Hadoop 
input libraries.  When set to true, this option will simply log the error to 
the executor's log, skip the input split that caused the error, and continue 
processing.
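    
    For reference, a minimal usage sketch assuming the property name and 
default described above (this is the PR's proposal, not a released Spark 
setting; the application/object names are illustrative):
    
        import org.apache.spark.{SparkConf, SparkContext}
    
        object SkipBadSplits {
          def main(args: Array[String]): Unit = {
            val conf = new SparkConf()
              .setAppName("SkipBadSplits")
              // Proposed in this PR: log and skip input splits that throw
              // from the Hadoop input libraries instead of cancelling the job.
              .set("spark.hadoop.ignoreInputErrors", "true")
            val sc = new SparkContext(conf)
    
            // With the option enabled, a corrupted gzip split would be skipped
            // rather than failing the whole job.
            val lines = sc.textFile("hdfs:///user/cloudera/logs*.gz")
            println(s"Lines read from good splits: ${lines.count()}")
    
            sc.stop()
          }
        }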

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tigerquoll/spark SPARK-6593a

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5250.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5250
    
----
commit 37d2cbde5b67c94cac26e2c55030e9bb00678d7a
Author: Dale <[email protected]>
Date:   2015-03-29T11:23:01Z

    [SPARK-6593] Added spark.hadoop.ignoreInputErrors config option and 
implementation

----
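
A rough, hypothetical sketch of the kind of change the description implies: 
wrap reads from the Hadoop RecordReader, log failures to the executor's log, 
and treat the rest of the split as exhausted. This is not the PR's actual 
diff; the class name and ignoreInputErrors parameter below are illustrative 
only.

    import org.apache.hadoop.mapred.RecordReader
    import org.slf4j.LoggerFactory

    class SkippingRecordIterator[K, V](
        reader: RecordReader[K, V],
        ignoreInputErrors: Boolean) extends Iterator[(K, V)] {

      private val log = LoggerFactory.getLogger(getClass)
      private val key = reader.createKey()
      private val value = reader.createValue()
      private var finished = false
      private var havePair = false

      override def hasNext: Boolean = {
        if (!finished && !havePair) {
          try {
            finished = !reader.next(key, value)
          } catch {
            case e: Exception if ignoreInputErrors =>
              // Log to the executor's log and skip the rest of this split
              // instead of failing the task.
              log.warn("Skipping remainder of corrupted input split", e)
              finished = true
          }
          havePair = !finished
        }
        !finished
      }

      override def next(): (K, V) = {
        if (!hasNext) throw new NoSuchElementException("End of stream")
        havePair = false
        (key, value)
      }
    }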

