[ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592526#action_12592526 ]
Chris Douglas commented on HADOOP-3144:
---------------------------------------

* Do these limits really need to be longs? Changing the public API of readLine seems unnecessary when an int should be, and has been, sufficient.
* There is some odd spacing around LineRecordReader::157,268 that makes it difficult to tell which block the closing brace belongs to.
* I'm not sure I understand the skip logic. For the case where a line is larger than 64k (the buffer size), it looks like this reads up to a threshold, then discards input that exceeds what was requested, then returns the next record as the segment between that threshold and the following newline (i.e. the trailing bytes of the too-long record). Is this accurate? Instead of getting a random segment of a record, wouldn't it be preferable to discard input until the next record boundary is found? (A sketch of the skip-to-boundary behavior I have in mind is at the end of this message.)

> better fault tolerance for corrupted text files
> -----------------------------------------------
>
>                 Key: HADOOP-3144
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3144
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>            Assignee: Zheng Shao
>         Attachments: 3144-ignore-spaces-2.patch, 3144-ignore-spaces-3.patch
>
>
> Every once in a while we encounter corrupted text files (corrupted at the source, prior to copying into Hadoop). Inevitably, some of the data looks like a really, really long line, and Hadoop trips over trying to stuff it into an in-memory object and gets an out-of-memory error. The code looks the same way in trunk as well.
> So we are looking for an option to TextInputFormat (and the like) to ignore long lines. Ideally, we would just skip errant lines above a certain size limit.
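Here is that sketch: a minimal, self-contained illustration of discarding input until the next record boundary, not code from either attached patch. The names (SkipLongLines, readLine, maxLineLength) and the unbuffered, ASCII-only reading are simplifications of what LineRecordReader actually does.

{code:java}
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Sketch of "skip to the next record boundary": when a line exceeds the
 * limit, the rest of that line is consumed and discarded, so the next
 * call starts at a real record instead of the line's trailing bytes.
 * Unbuffered and ASCII-only, for brevity.
 */
public class SkipLongLines {

  /**
   * Reads the next line (at most maxLineLength bytes) into sb.
   * Returns the number of bytes consumed, including discarded bytes
   * and the newline, or -1 at end of stream. If the line was too
   * long, sb is left empty and the whole line has been skipped.
   * (A real implementation would flag "skipped" separately so an
   * empty line is not mistaken for a dropped record.)
   */
  static long readLine(InputStream in, StringBuilder sb, int maxLineLength)
      throws IOException {
    sb.setLength(0);
    long consumed = 0;
    int b = in.read();
    if (b == -1) {
      return -1;                                  // end of stream
    }
    for (; b != -1 && b != '\n'; b = in.read()) {
      consumed++;
      if (sb.length() < maxLineLength) {
        sb.append((char) b);
      } else {
        sb.setLength(0);                          // drop the partial record
        while ((b = in.read()) != -1 && b != '\n') {
          consumed++;                             // discard up to the boundary
        }
        break;
      }
    }
    if (b == '\n') {
      consumed++;                                 // count the terminator
    }
    return consumed;
  }

  public static void main(String[] args) throws IOException {
    byte[] data = "short\nthis-line-is-much-too-long-to-keep\nnext\n"
        .getBytes();
    InputStream in = new ByteArrayInputStream(data);
    StringBuilder line = new StringBuilder();
    long n;
    while ((n = readLine(in, line, 16)) != -1) {
      System.out.println(line.length() == 0
          ? "(skipped a " + n + "-byte line)"
          : "record: " + line);
    }
  }
}
{code}

The relevant part is the else branch: once the limit is hit, the partial record is dropped and the stream is advanced past the newline, so a trailing fragment of a too-long line is never emitted as a record, and the byte count still advances the reader's position correctly.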