[ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592550#action_12592550 ]

Chris Douglas commented on HADOOP-3144:
---------------------------------------

bq. That it used to be sufficient does not mean it will be sufficient in the 
future - that's why we have open64. The cost of using a long instead of an int 
is minimal, while we avoid potential overflow problems

True, but it's accumulating bytes read from a text file into memory for a 
single record. It's not at all obvious to me that this requires a long. 
Future-proofing a case that will be a total disaster for the rest of the 
framework seems premature, particularly when the change is to a generic text 
parser. If someone truly needs to slurp >2GB of text data _per record_, surely 
their requirements justify a less general RecordReader. It's not the cost of 
the int that concerns me, but the API change to support a case that's not only 
degenerate, but implausible.
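The point about an int being enough for a single in-memory text record can be shown with a minimal sketch. This is not Hadoop's actual LineRecordReader; the class name, constructor, and `maxLineLength` parameter are made up for illustration. It buffers one line with an int cap and skips (rather than accumulates) any line longer than the cap, which is also the behavior the issue asks for:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch, not a Hadoop class: reads one line at a time and
// discards any line longer than maxLineLength bytes instead of buffering it.
// An int cap suffices here because a single record approaching
// Integer.MAX_VALUE bytes would exhaust the heap long before int overflow
// becomes the problem.
public class BoundedLineReader {
    private final InputStream in;
    private final int maxLineLength;

    public BoundedLineReader(InputStream in, int maxLineLength) {
        this.in = in;
        this.maxLineLength = maxLineLength;
    }

    /** Returns the next line of at most maxLineLength bytes, or null at EOF.
     *  Overlong lines are drained and skipped rather than kept in memory. */
    public String readLine() throws IOException {
        StringBuilder sb = new StringBuilder();
        boolean tooLong = false;
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                if (tooLong) {          // end of an oversized line: skip it
                    sb.setLength(0);
                    tooLong = false;
                    continue;
                }
                return sb.toString();
            }
            if (tooLong) continue;      // keep draining the oversized line
            if (sb.length() >= maxLineLength) {
                sb.setLength(0);        // drop what we buffered so far
                tooLong = true;         // and skip ahead to the next newline
                continue;
            }
            sb.append((char) b);
        }
        // EOF: return a trailing unterminated line, or null if nothing is left
        if (tooLong || sb.length() == 0) return null;
        return sb.toString();
    }
}
```

With `maxLineLength = 5`, feeding it `"ok\nxxxxxxxxxx\nend"` yields `"ok"`, then `"end"` (the ten-x line is skipped), then null.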

bq. The reason for "maxBytesToConsume" is to tell readLine the end of this 
block - there is no reason for readLine to go through tens of gigs of data 
searching for an end of line, while the current block is only 128MB.

A far more portable solution for what this expresses would be an InputFormat 
generating a subclass of FileSplit annotated with a hard limit enforced by the 
RecordReader (i.e. returns EOF at some position within the file). Some of this 
will inevitably be done as part of the Hadoop archive work (HADOOP-3307). As a 
workaround, don't point text readers at binary data. ;)
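The annotated-split idea above can be sketched without any Hadoop dependencies. This is only an illustration of the mechanism, not an actual Hadoop API: a hypothetical FileSplit subclass would carry a `hardLimit`, and the RecordReader would wrap its input stream so reads report EOF at that limit instead of scanning gigabytes past the split boundary for a newline:

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of a hard limit enforced at the stream level: once
// `hardLimit` bytes have been consumed, every read reports EOF, so a line
// reader stops at the boundary no matter how far away the next newline is.
public class HardLimitInputStream extends FilterInputStream {
    private long remaining;  // bytes the reader may still consume

    public HardLimitInputStream(InputStream in, long hardLimit) {
        super(in);
        this.remaining = hardLimit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) return -1;  // synthetic EOF at the limit
        int b = in.read();
        if (b != -1) remaining--;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (remaining <= 0) return -1;
        int n = in.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) remaining -= n;
        return n;
    }
}
```

A RecordReader built on such a stream simply sees EOF at the limit, which is exactly the "returns EOF at some position within the file" behavior described above.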

> better fault tolerance for corrupted text files
> -----------------------------------------------
>
>                 Key: HADOOP-3144
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3144
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>            Assignee: Zheng Shao
>         Attachments: 3144-ignore-spaces-2.patch, 3144-ignore-spaces-3.patch
>
>
> every once in a while we encounter corrupted text files (corrupted at the 
> source, prior to copying into hadoop). inevitably, some of the data looks 
> like a really really long line, and hadoop trips over trying to stuff it into 
> an in-memory object and gets an out-of-memory error. the code looks the same 
> in trunk as well.
> so we're looking for an option on the textinputformat (and the like) to 
> ignore long lines. ideally we would just skip errant lines above a certain 
> size limit.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
