[ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592550#action_12592550 ]
Chris Douglas commented on HADOOP-3144:
---------------------------------------

bq. It used to be sufficient does not mean that they will be sufficient in the future - that's why we have open64. The cost of using a long instead of an int is minimal, while we do avoid potential overflow problems.

True, but this is accumulating bytes read from a text file into memory for a single record. It is not at all obvious to me that this requires a long. Future-proofing a case that would be a total disaster for the rest of the framework seems premature, particularly when the change is to a generic text parser. If someone truly needs to slurp >2GB of text data _per record_, surely their requirements justify a less general RecordReader. It's not the cost of the int that concerns me, but rather the API change to support a case that is not only degenerate, but implausible.

bq. The reason for "maxBytesToConsume" is to tell readLine the end of this block - there is no reason for readLine to go through tens of gigs of data searching for an end of line while the current block is only 128MB.

A far more portable way to express this would be an InputFormat generating a subclass of FileSplit annotated with a hard limit enforced by the RecordReader (i.e. one that returns EOF at some position within the file). Some of this will inevitably be done as part of the Hadoop archive work (HADOOP-3307). As a workaround, don't point text readers at binary data. ;)

> better fault tolerance for corrupted text files
> -----------------------------------------------
>
>                 Key: HADOOP-3144
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3144
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>            Assignee: Zheng Shao
>         Attachments: 3144-ignore-spaces-2.patch, 3144-ignore-spaces-3.patch
>
>
> Every once in a while we encounter corrupted text files (corrupted at the source, prior to copying into Hadoop).
> Inevitably, some of the data looks like a really, really long line, and Hadoop trips over trying to stuff it into an in-memory object and gets an out-of-memory error. The code looks the same in trunk as well.
> So we are looking for an option on TextInputFormat (and the like) to ignore long lines. Ideally, we would just skip errant lines above a certain size limit.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
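The maxBytesToConsume idea debated above can be illustrated with a minimal sketch. This is not Hadoop's actual LineRecordReader; the class name BoundedLineReader, its readLine signature, and the byte cap are hypothetical, chosen only to show how a reader can bound its scan for a newline and let the caller discard records that exceed a size limit instead of buffering them until it runs out of memory:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of a bounded line reader. It never consumes more
// than maxBytesToConsume bytes while looking for '\n', so a corrupt
// "line" spanning gigabytes cannot force unbounded buffering.
public class BoundedLineReader {

    /**
     * Reads one '\n'-terminated line into {@code out}, consuming at most
     * {@code maxBytesToConsume} bytes from {@code in}. Returns the number
     * of bytes consumed (including the newline, if found); 0 means EOF.
     * If the return value equals the cap and no newline was seen, the
     * caller may treat the record as corrupt and skip it.
     */
    public static int readLine(InputStream in, StringBuilder out,
                               int maxBytesToConsume) throws IOException {
        int consumed = 0;
        int b;
        while (consumed < maxBytesToConsume && (b = in.read()) != -1) {
            consumed++;
            if (b == '\n') {
                return consumed;  // complete line
            }
            out.append((char) b);
        }
        return consumed;  // hit the cap or EOF without finding a newline
    }

    public static void main(String[] args) throws IOException {
        StringBuilder line = new StringBuilder();
        InputStream in = new ByteArrayInputStream("short\nrest".getBytes());
        int n = readLine(in, line, 128);
        System.out.println(n + " bytes, line=\"" + line + "\"");
    }
}
```

Note that `consumed` is an int, which is sufficient here precisely because the cap bounds how much a single call can read; this is the crux of the int-vs-long disagreement in the comment above.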