[ 
https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592732#action_12592732
 ] 

Joydeep Sen Sarma commented on HADOOP-3144:
-------------------------------------------

> isn't recovery effected by skipping the record that caused a failure on the 
> map (HADOOP-153)?

Thanks for pointing this out. HADOOP-153 is not fixed yet, and it looks like 
there is still a debate about the right approach. Even if it were fixed, 
LineRecordReader would have to implement an additional API to skip to the next 
record boundary (to skip the bad record on map retry), so it looks like we 
would need similar code, albeit under a different API. 
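
To make "skip to the next record boundary" concrete, here is a minimal sketch 
in plain java.io. It is not the attached patch and not Hadoop's 
LineRecordReader; the class name BoundedLineReader, the maxLineLength 
parameter, and the readAll helper are all hypothetical. It assumes records are 
delimited by '\n' and that any line longer than the limit should be dropped 
without ever being buffered in full.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: reads '\n'-delimited lines and skips any
// line longer than maxLineLength bytes, resuming at the next line boundary.
class BoundedLineReader {
    private final InputStream in;
    private final int maxLineLength;
    private long skippedLines = 0;

    BoundedLineReader(InputStream in, int maxLineLength) {
        this.in = in;
        this.maxLineLength = maxLineLength;
    }

    // Returns the next line of at most maxLineLength bytes, or null at EOF.
    String readLine() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        while (true) {
            buf.reset();
            boolean tooLong = false;
            boolean sawByte = false;
            int b;
            while ((b = in.read()) != -1 && b != '\n') {
                sawByte = true;
                if (tooLong) {
                    continue; // discard bytes until the next boundary
                }
                if (buf.size() < maxLineLength) {
                    buf.write(b);
                } else {
                    tooLong = true; // over the limit: drop what was buffered
                    buf.reset();
                }
            }
            if (b == -1 && !sawByte) {
                return null; // clean EOF, nothing left to return
            }
            if (!tooLong) {
                return new String(buf.toByteArray(), StandardCharsets.UTF_8);
            }
            skippedLines++;
            if (b == -1) {
                return null; // the overlong line was the last one
            }
        }
    }

    long getSkippedLines() {
        return skippedLines;
    }

    // Convenience helper for experiments: reads every acceptable line of text.
    static List<String> readAll(String text, int maxLineLength) {
        BoundedLineReader r = new BoundedLineReader(
            new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)),
            maxLineLength);
        List<String> lines = new ArrayList<>();
        try {
            String line;
            while ((line = r.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

With a limit of 10 bytes, readAll("ok\n" + <a 20-byte line> + "\nfine\n", 10) 
returns only "ok" and "fine". Memory use stays bounded by the limit because 
the overlong line is consumed byte by byte rather than accumulated. A real 
reader would read through a buffer rather than one byte at a time, and would 
also need to handle '\r' and split boundaries.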

That said, I am not sure I agree with the design of HADOOP-153. It's not clear 
to me why it doesn't suffice to let the RecordReaders skip bad records 
themselves (as they must be able to do even with 153's additional APIs). But 
that's a separate discussion.

What's the status of HADOOP-153? Depending on where it goes, these changes may 
conflict or overlap.

> better fault tolerance for corrupted text files
> -----------------------------------------------
>
>                 Key: HADOOP-3144
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3144
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>            Assignee: Zheng Shao
>         Attachments: 3144-ignore-spaces-2.patch, 3144-ignore-spaces-3.patch
>
>
> Every once in a while we encounter corrupted text files (corrupted at the 
> source, prior to copying into Hadoop). Inevitably, some of the data looks 
> like a really, really long line, and Hadoop trips over trying to stuff it 
> into an in-memory object and gets an out-of-memory error. The code looks the 
> same in trunk as well. 
> So we are looking for an option on TextInputFormat (and the like) to ignore 
> long lines. Ideally, we would just skip errant lines above a certain size 
> limit.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
