[ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594414#action_12594414 ]
Chris Douglas commented on HADOOP-3144:
---------------------------------------

The actual value calculated for maxBytesToConsume in next() can still cause the overflow error you mentioned above; it's not sufficient to cast it to int, unfortunately. The purpose of this patch is to prevent LineRecordReader from reading absurdly far into the next split, right? Is there something wrong with max(min(end - pos, MAX_INT), maxLineLen)?

> better fault tolerance for corrupted text files
> -----------------------------------------------
>
>                 Key: HADOOP-3144
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3144
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>            Assignee: Zheng Shao
>         Attachments: 3144-4.patch, 3144-5.patch, 3144-ignore-spaces-2.patch, 3144-ignore-spaces-3.patch
>
> Every once in a while we encounter corrupted text files (corrupted at the source, prior to copying into Hadoop). Inevitably, some of the data looks like a really, really long line, and Hadoop trips over trying to stuff it into an in-memory object and gets an out-of-memory error. The code looks the same way in trunk as well.
> So we are looking for an option to TextInputFormat (and the like) to ignore long lines. Ideally, we would just skip errant lines above a certain size limit.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
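
A minimal sketch of the clamp suggested in the comment above, assuming LineRecordReader-style values for the split end offset, the current position, and the configured maximum line length (the method and parameter names here are illustrative, not taken from the attached patches):

    // Illustrative only: clamp the remaining bytes in the split to
    // Integer.MAX_VALUE before the narrowing cast, then take the max with
    // the configured line-length cap, i.e. max(min(end - pos, MAX_INT), maxLineLen).
    public class MaxBytesToConsumeSketch {
      static int maxBytesToConsume(long pos, long end, int maxLineLength) {
        long remaining = Math.min(end - pos, Integer.MAX_VALUE); // never exceeds an int
        return (int) Math.max(remaining, maxLineLength);         // safe narrowing cast
      }

      public static void main(String[] args) {
        // A split ending ~3 GB past the current position: a bare
        // (int) (end - pos) would wrap negative, but the clamped form
        // stays at Integer.MAX_VALUE.
        System.out.println(maxBytesToConsume(0L, 3_000_000_000L, 64 * 1024));
      }
    }

With the terms in that order, a record that starts near the end of a split can still be read up to maxLineLen bytes into the next split, but never absurdly farther than that, which appears to be the intent of the formula in the comment.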