[ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592526#action_12592526 ]
Chris Douglas commented on HADOOP-3144:
---------------------------------------

* Do these limits really need to be longs? Changing the public API of readLine seems unnecessary when an int should be, and has been, sufficient.
* There is some odd spacing around LineRecordReader::157,268 that makes it difficult to tell which block the closing brace belongs to.
* I'm not sure I understand the skip logic. For the case where a line is larger than 64k (the buffer size), it looks like this reads up to a threshold, then discards input that exceeds what was requested, then returns the next record as the segment between that threshold and the following newline (i.e. the trailing bytes of the too-long record). Is this accurate? Instead of getting a random segment of a record, wouldn't it be preferable to discard input until the next record boundary is found? (A sketch of the skip-to-boundary behavior I have in mind is at the end of this message.)

> better fault tolerance for corrupted text files
> -----------------------------------------------
>
>                 Key: HADOOP-3144
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3144
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>            Assignee: Zheng Shao
>         Attachments: 3144-ignore-spaces-2.patch, 3144-ignore-spaces-3.patch
>
>
> Every once in a while we encounter corrupted text files (corrupted at the source, prior to copying into Hadoop). Inevitably, some of the data looks like a really, really long line, and Hadoop trips over trying to stuff it into an in-memory object and gets an out-of-memory error. The code looks the same way in trunk as well.
> So we are looking for an option to TextInputFormat (and the like) to ignore long lines. Ideally, we would just skip errant lines above a certain size limit.
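Here is that sketch: a minimal, self-contained illustration of discarding input until the next record boundary, not code from either attached patch. The names (SkipLongLines, readLine, maxLineLength) and the unbuffered, ASCII-only reading are simplifications of what LineRecordReader actually does.

{code:java}
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Sketch of "skip to the next record boundary": when a line exceeds the
 * limit, the rest of that line is consumed and discarded, so the next
 * call starts at a real record instead of the line's trailing bytes.
 * Unbuffered and ASCII-only, for brevity.
 */
public class SkipLongLines {

  /**
   * Reads the next line (at most maxLineLength bytes) into sb.
   * Returns the number of bytes consumed, including discarded bytes
   * and the newline, or -1 at end of stream. If the line was too
   * long, sb is left empty and the whole line has been skipped.
   * (A real implementation would flag "skipped" separately so an
   * empty line is not mistaken for a dropped record.)
   */
  static long readLine(InputStream in, StringBuilder sb, int maxLineLength)
      throws IOException {
    sb.setLength(0);
    long consumed = 0;
    int b = in.read();
    if (b == -1) {
      return -1;                                  // end of stream
    }
    for (; b != -1 && b != '\n'; b = in.read()) {
      consumed++;
      if (sb.length() < maxLineLength) {
        sb.append((char) b);
      } else {
        sb.setLength(0);                          // drop the partial record
        while ((b = in.read()) != -1 && b != '\n') {
          consumed++;                             // discard up to the boundary
        }
        break;
      }
    }
    if (b == '\n') {
      consumed++;                                 // count the terminator
    }
    return consumed;
  }

  public static void main(String[] args) throws IOException {
    byte[] data = "short\nthis-line-is-much-too-long-to-keep\nnext\n"
        .getBytes();
    InputStream in = new ByteArrayInputStream(data);
    StringBuilder line = new StringBuilder();
    long n;
    while ((n = readLine(in, line, 16)) != -1) {
      System.out.println(line.length() == 0
          ? "(skipped a " + n + "-byte line)"
          : "record: " + line);
    }
  }
}
{code}

The relevant part is the else branch: once the limit is hit, the partial record is dropped and the stream is advanced past the newline, so a trailing fragment of a too-long line is never emitted as a record, and the byte count still advances the reader's position correctly.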