[jira] [Commented] (MAPREDUCE-5948) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

Jason Lowe (JIRA) Fri, 19 Jun 2015 09:35:57 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593600#comment-14593600
 ]


Jason Lowe commented on MAPREDUCE-5948:
---------------------------------------

Thanks for taking this up, Akira.  Will try to review the patch shortly.

[~Markovich] could you provide more details on the duplicate records with bz2?  
A similar problem was reported in MAPREDUCE-6299 but there are no details to 
work with.  Note that the latest patch will not address any issues with bz2, as 
it only fixes the handling of duplicate records with uncompressed input.

> org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record 
> delimiters well
> ------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5948
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5948
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.2, 0.23.9, 2.2.0
>         Environment: CDH3U2 Redhat linux 5.7
>            Reporter: Kris Geusebroek
>            Assignee: Akira AJISAKA
>            Priority: Critical
>         Attachments: HADOOP-9867.patch, HADOOP-9867.patch, HADOOP-9867.patch, 
> HADOOP-9867.patch, MAPREDUCE-5948.002.patch, MAPREDUCE-5948.003.patch
>
>
> Defining a multibyte record delimiter in a new InputFileFormat sometimes 
> causes records to be skipped from the input.
> This happens when an input split begins just after a record delimiter. 
> The starting point for the next split is then nonzero and skipFirstLine 
> is true. A seek into the file is done to start - 1, and the text up to 
> the first record delimiter is ignored (on the presumption that this 
> record was already handled by the previous map task). Since the record 
> delimiter is multibyte, the seek brings only the last byte of the 
> delimiter into scope, and it is not recognized as a full delimiter. So 
> the text is skipped until the next delimiter (ignoring a full record!!)
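
The skipping described above can be reproduced with a small simulation. This is only a sketch of the seek-and-skip logic as the description states it, not Hadoop's LineRecordReader code; the function name read_split, the back_up parameter, and the sample data are all hypothetical. Backing up by the full delimiter length is shown purely as one way to avoid the loss in this toy model and is not necessarily what the attached patches do.

```python
def read_split(data: bytes, start: int, end: int, delim: bytes, back_up: int):
    """Return the records a simplified reader would emit for split [start, end).

    back_up is how many bytes the reader seeks back before discarding the
    (presumed partial) first record of a split that does not start at 0.
    """
    pos = start
    if start != 0:
        # Seek back, then discard everything up to and including the first
        # complete delimiter found from that position.
        pos = start - back_up
        idx = data.find(delim, pos)
        if idx == -1:
            return []
        pos = idx + len(delim)

    records = []
    # A record belongs to this split if it begins before the split's end.
    while pos < end and pos < len(data):
        idx = data.find(delim, pos)
        if idx == -1:
            records.append(data[pos:])
            break
        records.append(data[pos:idx])
        pos = idx + len(delim)
    return records


DATA = b"aaa||bbb||ccc||"   # three records, two-byte delimiter "||"
DELIM = b"||"
SPLIT = 5                   # the second split begins right after the first delimiter

first = read_split(DATA, 0, SPLIT, DELIM, back_up=1)
# Seeking to start - 1 sees only the last "|" of the delimiter, so the reader
# skips ahead to the next full delimiter and "bbb" is never emitted.
second_buggy = read_split(DATA, SPLIT, len(DATA), DELIM, back_up=1)
# Seeking back by the whole delimiter length keeps the delimiter in scope.
second_wide = read_split(DATA, SPLIT, len(DATA), DELIM, back_up=len(DELIM))

print(first)         # [b'aaa']
print(second_buggy)  # [b'ccc']
print(second_wide)   # [b'bbb', b'ccc']
```

With a single-byte delimiter the start - 1 seek is enough, which is why the problem only shows up for multibyte delimiters and only when a split boundary lands directly after one.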



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
