[
https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dale Richardson updated SPARK-6593:
-----------------------------------
Description:
When reading a large number of files from HDFS, e.g. with
sc.textFile("hdfs:///user/cloudera/logs*.gz"), the entire job is canceled if
the Hadoop input libraries report an exception. As default behaviour this is
probably for the best, but in circumstances where you know it is safe, it
would be nice to have the option to skip the corrupted portion and continue
the job.
was:
When reading a large number of files from HDFS, e.g. with
sc.textFile("hdfs:///user/cloudera/logs*.gz"), the entire job is canceled if a
single split is corrupted. As default behaviour this is probably for the best,
but in circumstances where you know it is safe, it would be nice to have the
option to skip the corrupted portion and continue the job.
> Provide option for HadoopRDD to skip bad data splits.
> -----------------------------------------------------
>
> Key: SPARK-6593
> URL: https://issues.apache.org/jira/browse/SPARK-6593
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.3.0
> Reporter: Dale Richardson
> Priority: Minor
>
> When reading a large number of files from HDFS, e.g. with
> sc.textFile("hdfs:///user/cloudera/logs*.gz"), the entire job is canceled if
> the Hadoop input libraries report an exception. As default behaviour this is
> probably for the best, but in circumstances where you know it is safe, it
> would be nice to have the option to skip the corrupted portion and continue
> the job.
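
The requested behaviour can be illustrated outside Spark. The following is a
minimal plain-Python sketch, not Spark's or Hadoop's API: the function name
read_lines_skipping_bad and its skip_bad and errors parameters are
hypothetical, and local gzip files stand in for HDFS splits. By default the
first read error aborts the whole read, mirroring the current behaviour; with
skip_bad=True the unreadable remainder of a file is skipped and reading
continues with the next file.

```python
import gzip
import zlib
from typing import Iterable, Iterator, List, Optional, Tuple


def read_lines_skipping_bad(
    paths: Iterable[str],
    skip_bad: bool = False,
    errors: Optional[List[Tuple[str, Exception]]] = None,
) -> Iterator[str]:
    """Yield text lines from each gzip file in `paths` (hypothetical helper).

    skip_bad=False: the first read error propagates and stops everything.
    skip_bad=True:  the rest of the failing file is skipped, the failure is
                    appended to `errors` (if given), and reading continues.
    """
    for path in paths:
        try:
            with gzip.open(path, "rt", encoding="utf-8") as handle:
                for line in handle:
                    yield line.rstrip("\n")
        except (OSError, EOFError, zlib.error) as exc:
            # OSError covers bad gzip headers and CRC mismatches, EOFError
            # covers truncated files, zlib.error covers a corrupt stream.
            if not skip_bad:
                raise
            if errors is not None:
                errors.append((path, exc))
```

Lines decoded before the error are still yielded, so only the corrupted
portion is lost, and recording the failures in errors lets the caller see what
was skipped.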