[ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592560#action_12592560 ]
Chris Douglas commented on HADOOP-3144:
---------------------------------------

I was going by Zheng's last comment, i.e. "This is actually what was happening on our cluster - for binary file that a user mistakenly treats as a text file." I didn't mean anything by it.

Assuming we're discussing the third bullet in the last two comments: if one is loading corrupted data into HDFS, then I don't think it's fair to assume that the most generic of text readers can do anything with it. I mention the archive format because it seems unavoidable that opening a file in an archive will return a bounded stream within a larger, composite file, i.e. it will be agnostic to the particular InputFormat employed but act a lot like a bounded FileSplit. If that's the sort of thing you could use to deal with sketchy data, then it seemed a useful issue to monitor. Alternatively, a new InputFormat that generates bounded splits for text files to recover from this condition might work for your case, and probably for others' if you felt like contributing it.

The insurance analogy doesn't seem to describe this error. It's not like a car accident; it's like filling one's gas tank with creamed corn. Though the driver had every reason to believe it was gasoline, and is understandably angry that his engine is full of creamed corn, anger at the car for failing to run on the creamed corn is misspent.

Though I like the idea in general, i.e. skipping unexpectedly long lines or even just truncating records, my original question was trying to determine whether the reader skips to the next record, continues reading bytes into the next record from wherever it stopped, or quits entirely when it hits an extremely long line. At a glance, it looked like it continued reading from wherever it left off in the stream, but I haven't looked at it as closely as the contributor and wanted to ask after its behavior. I'm still curious how, exactly, this patch effects its solution.

> better fault tolerance for corrupted text files
> -----------------------------------------------
>
>                 Key: HADOOP-3144
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3144
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>            Assignee: Zheng Shao
>         Attachments: 3144-ignore-spaces-2.patch, 3144-ignore-spaces-3.patch
>
>
> Every once in a while we encounter corrupted text files (corrupted at the source, prior to copying into Hadoop). Inevitably, some of the data looks like a really, really long line, and Hadoop trips over trying to stuff it into an in-memory object and gets an out-of-memory error. The code looks the same way in trunk as well.
> So we are looking for an option to TextInputFormat (and the like) to ignore long lines. Ideally, we would just skip errant lines above a certain size limit.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
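
For reference, a minimal standalone Java sketch (not the attached patch; the class and method names are illustrative only) of the two behaviors the comment distinguishes: truncating an oversized line and continuing from wherever the read stopped, versus discarding the rest of the oversized line and resuming at the next newline.

{code:java}
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Illustrative sketch only: contrasts truncating a long line and leaving the
 * stream positioned mid-line with skipping ahead to the next newline so the
 * following read starts on a fresh record.
 */
public class LongLineSketch {

  /**
   * Reads one line of at most maxLen characters. If skipRemainder is true,
   * bytes beyond maxLen are consumed and discarded up to and including the
   * next '\n'. If false, the next call continues reading mid-line, which is
   * the behavior the comment asks about.
   */
  static String readLine(InputStream in, int maxLen, boolean skipRemainder)
      throws IOException {
    StringBuilder sb = new StringBuilder();
    int b;
    while ((b = in.read()) != -1 && b != '\n') {
      if (sb.length() < maxLen) {
        sb.append((char) b);
      } else if (skipRemainder) {
        // Discard everything up to and including the next newline.
        while ((b = in.read()) != -1 && b != '\n') { /* skip */ }
        break;
      } else {
        break; // truncate; the stream stays positioned mid-line
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) throws IOException {
    // One short record, one 20-byte oversized record, one trailing record.
    byte[] data = "short\nXXXXXXXXXXXXXXXXXXXX\nnext\n".getBytes("US-ASCII");

    InputStream skip = new ByteArrayInputStream(data);
    System.out.println(readLine(skip, 10, true));  // "short"
    System.out.println(readLine(skip, 10, true));  // first 10 X's; rest of line skipped
    System.out.println(readLine(skip, 10, true));  // "next" -- a clean record follows

    InputStream truncate = new ByteArrayInputStream(data);
    System.out.println(readLine(truncate, 10, false)); // "short"
    System.out.println(readLine(truncate, 10, false)); // first 10 X's (truncated)
    System.out.println(readLine(truncate, 10, false)); // leftover X's surface as a bogus record
  }
}
{code}

With skipRemainder=true the reader recovers at the next real record; with it false, the tail of the oversized line bleeds into the following read, which is the failure mode being asked about.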