[ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592560#action_12592560 ]

Chris Douglas commented on HADOOP-3144:
---------------------------------------

I was going by Zheng's last comment, i.e. "This is actually what was happening 
on our cluster - for binary file that a user mistakenly treats as a text file." 
I didn't mean anything by it.

Assuming we're discussing the third bullet in the last two comments:

If one is loading corrupted data into HDFS, then I don't think it's fair to 
assume that the most generic of text readers can do anything with it. I mention 
the archive format because it seems unavoidable that opening a file in an 
archive will return a bounded stream within a larger, composite file, i.e. it 
will be agnostic to the particular InputFormat employed but act a lot like a 
bounded FileSplit. If that's the sort of thing you could use to deal with 
sketchy data, then it seemed a useful issue to monitor. Alternatively, a new 
InputFormat that generated bounded splits for Text files to recover from this 
condition might work for your case, and probably for others' too, if you felt 
like contributing it.
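
For concreteness, here is a rough sketch of the skip-long-lines idea in plain 
java.io; it is not the attached patch, and both BoundedLineReader and the 
maxLineLength cap are made-up names for illustration. The point is that once 
the cap is exceeded, the reader discards bytes up to the next newline, so it 
re-synchronizes on a record boundary instead of bleeding into the following 
record.

    import java.io.IOException;
    import java.io.InputStream;

    // Rough sketch only: a byte-at-a-time, ASCII-only line reader that drops
    // any line longer than a caller-supplied cap and resumes at the next
    // newline, so an over-long (likely corrupt) record never reaches the
    // mapper and never spills into the record that follows it.
    public class BoundedLineReader {
      private final InputStream in;
      private final int maxLineLength;   // hypothetical cap, e.g. 1 MB

      public BoundedLineReader(InputStream in, int maxLineLength) {
        this.in = in;
        this.maxLineLength = maxLineLength;
      }

      // Returns the next line that fits under the cap, or null at EOF.
      public String readLine() throws IOException {
        StringBuilder buf = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
          if (b == '\n') {
            return buf.toString();
          }
          if (buf.length() >= maxLineLength) {
            // Too long: discard the rest of this line, then start over on
            // the line that follows it.
            while ((b = in.read()) != -1 && b != '\n') { /* skip */ }
            buf.setLength(0);
            continue;
          }
          buf.append((char) b);
        }
        return buf.length() > 0 ? buf.toString() : null;
      }
    }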

The insurance analogy doesn't seem to describe this error. It's not like a car 
accident; it's like filling one's gas tank with creamed corn. Though the owner 
had every reason to believe it was gasoline, and is understandably angry that 
his engine is full of creamed corn, anger at the car for failing to run on 
creamed corn is misspent. Though I like the idea in general (skipping 
unexpectedly long lines, or even just truncating records), my original question 
was trying to determine whether the reader skipped to the next record, 
continued reading bytes into the next record from wherever it stopped, or quit 
outright on extremely long lines. At a glance, it looked like it continued 
reading from wherever it left off in the stream, but I haven't looked at it as 
closely as the contributor and wanted to ask about its behavior. I'm still 
curious how, exactly, this patch effects its solution.
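
To make the distinction concrete, a toy driver for the hypothetical 
BoundedLineReader sketched above: given a newline-free blob of garbage followed 
by one sane record, the skip-to-next-newline approach drops the blob wholesale 
and yields only the good record, rather than carrying leftover bytes into it.

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    // Toy demonstration: a 10,000-byte newline-free blob stands in for a
    // corrupt record, followed by one sane record. With skip-to-next-newline
    // behavior, only "good record" comes out.
    public class SkipBehaviorDemo {
      public static void main(String[] args) throws IOException {
        byte[] garbage = new byte[10000];
        Arrays.fill(garbage, (byte) 'x');        // corrupt, newline-free "line"
        byte[] tail = "\ngood record\n".getBytes(StandardCharsets.US_ASCII);
        byte[] data = Arrays.copyOf(garbage, garbage.length + tail.length);
        System.arraycopy(tail, 0, data, garbage.length, tail.length);

        BoundedLineReader reader =
            new BoundedLineReader(new ByteArrayInputStream(data), 1024);

        String line;
        while ((line = reader.readLine()) != null) {
          System.out.println("got: " + line);    // prints only "good record"
        }
      }
    }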

> better fault tolerance for corrupted text files
> -----------------------------------------------
>
>                 Key: HADOOP-3144
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3144
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>            Assignee: Zheng Shao
>         Attachments: 3144-ignore-spaces-2.patch, 3144-ignore-spaces-3.patch
>
>
> Every once in a while we encounter corrupted text files (corrupted at the 
> source, prior to copying into Hadoop). Inevitably, some of the data looks 
> like a really, really long line, and Hadoop trips over trying to stuff it 
> into an in-memory object and gets an out-of-memory error. The code looks the 
> same way in trunk as well.
> So we're looking for an option to TextInputFormat (and the like) to ignore 
> long lines. Ideally, we would just skip errant lines above a certain size limit.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
