[jira] Commented: (HADOOP-3144) better fault tolerance for corrupted text files

Zheng Shao (JIRA) Fri, 25 Apr 2008 18:47:29 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592539#action_12592539
 ]


Zheng Shao commented on HADOOP-3144:
------------------------------------

* The fact that ints used to be sufficient does not mean that they will be 
sufficient in the future - that's why we have open64. The cost of using a long 
instead of an int is minimal, and it avoids potential overflow problems. The 
only interesting use of this return value is accumulating the number of bytes 
read, which definitely should be stored in a long. So I don't see a problem here.

* I will fix the spacing problem when we get a consensus on other problems.

* The skip logic skips the whole long line - not just "maxLineLength" 
bytes of it.
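
To illustrate the overflow concern in the first point, here is a minimal 
sketch. It is not code from the patch: the class and method names are 
hypothetical, and 128MB per call is simply an assumed figure chosen to match a 
typical block size.

```java
// Hypothetical demo, not part of the HADOOP-3144 patch.
final class ByteCounterDemo {

    /** Accumulates per-call byte counts in an int; wraps past Integer.MAX_VALUE. */
    static int totalAsInt(int bytesPerCall, int calls) {
        int total = 0;
        for (int i = 0; i < calls; i++) {
            total += bytesPerCall; // silently overflows once the sum exceeds 2^31 - 1
        }
        return total;
    }

    /** Same accumulation in a long; stays correct far beyond 2GB. */
    static long totalAsLong(int bytesPerCall, int calls) {
        long total = 0L;
        for (int i = 0; i < calls; i++) {
            total += bytesPerCall; // int is widened to long before the addition
        }
        return total;
    }
}
```

With 16 calls of 128MB each (2GB in total), the int version wraps around to a 
negative number while the long version keeps the correct count.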

The reason for "maxBytesToConsume" is to tell readLine where the current block 
ends - there is no reason for readLine to go through tens of gigs of data 
searching for an end of line when the current block is only 128MB.  This is 
actually what was happening on our cluster, for a binary file that a user 
mistakenly treated as a text file: all the map tasks just swamped the cluster. 
The only use of maxBytesToConsume is to let readLine know when to stop. What 
would be the best way to fix this?
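
The behaviour described above could be sketched roughly as follows. This is a 
simplified illustration, not the code in the attached patches: the class 
BoundedLineReader and the exact signature are assumptions, it reads a plain 
InputStream one byte at a time, and it only treats '\n' as a line terminator.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch, not the actual HADOOP-3144 implementation.
final class BoundedLineReader {

    /**
     * Reads one '\n'-terminated line from in.
     *
     * At most maxLineLength bytes are stored in line; the rest of an over-long
     * line is still consumed (so the whole line is skipped past) but discarded.
     * Reading stops once maxBytesToConsume bytes have been consumed, even if no
     * newline was found, so a caller can pass the bytes left in its block.
     *
     * @return the number of bytes consumed, including the newline; 0 at EOF
     */
    static long readLine(InputStream in, ByteArrayOutputStream line,
                         int maxLineLength, long maxBytesToConsume)
            throws IOException {
        line.reset();
        long consumed = 0L; // a long, so huge "lines" cannot overflow the count
        while (consumed < maxBytesToConsume) {
            int b = in.read();
            if (b == -1) {
                break; // end of stream
            }
            consumed++;
            if (b == '\n') {
                break; // end of line; the terminator is not stored
            }
            if (line.size() < maxLineLength) {
                line.write(b); // keep only the first maxLineLength bytes
            }
        }
        return consumed;
    }
}
```

In this sketch a caller can detect an over-long line by checking whether the 
returned count exceeds line.size() by more than the one newline byte, and drop 
the record in that case.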


> better fault tolerance for corrupted text files
> -----------------------------------------------
>
>                 Key: HADOOP-3144
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3144
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>            Assignee: Zheng Shao
>         Attachments: 3144-ignore-spaces-2.patch, 3144-ignore-spaces-3.patch
>
>
> every once in a while - we encounter corrupted text files (corrupted at 
> source prior to copying into hadoop). inevitably - some of the data looks 
> like a really really long line and hadoop trips over trying to stuff it into 
> an in memory object and gets outofmem error. Code looks same way in trunk as 
> well .. 
> so looking for an option to the textinputformat (and like) to ignore long 
> lines. ideally - we would just skip errant lines above a certain size limit.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
