[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

Jason Lowe (JIRA) Tue, 10 Dec 2013 09:15:51 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844428#comment-13844428
 ]


Jason Lowe commented on HADOOP-9867:
------------------------------------

Thanks for updating the patch, Vinay.  Comments:

* I don't think LineReader is the best place to put split-specific code.  Its 
sole purpose is to read lines from an input stream regardless of split 
boundaries.  There are users of this class that are not necessarily processing 
splits.  That's why I created SplitLineReader in MapReduce, and I believe this 
logic is better placed there.
* I don't think we want to change Math.max(maxBytesToConsume(pos), 
maxLineLength) to Math.min(maxBytesToConsume(pos), maxLineLength).  We need 
to be able to read a record past the end of the split when the record crosses 
the split boundary, but I think this change could allow a truncated record to 
be returned for an uncompressed input stream.  For example: fillBuffer happens 
to return data only up to the end of the split, the record is incomplete (no 
delimiter found), but maxBytesToConsume keeps us from filling the buffer with 
more data, so a truncated record is returned.
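
To make the truncation scenario concrete, here is a minimal sketch.  It is 
not the Hadoop code: readRecord, firstFill and the 8-byte refill size are 
hypothetical stand-ins for a readLine loop and its fillBuffer calls, and the 
delimiter is a single newline to keep the loop short.

```java
import java.nio.charset.StandardCharsets;

class ConsumeLimitSketch {
    /**
     * Toy stand-in for a readLine loop (hypothetical, single-byte delimiter).
     * The first fill returns at most firstFill bytes, mimicking a fillBuffer
     * call that happens to stop at the end of the split; later fills return
     * up to 8 bytes. The loop stops once a delimiter is seen or the number
     * of bytes consumed reaches maxBytesToConsume.
     */
    static String readRecord(byte[] data, int pos, int firstFill, int maxBytesToConsume) {
        StringBuilder record = new StringBuilder();
        int consumed = 0;
        int fill = firstFill;
        boolean delimiterFound = false;
        while (!delimiterFound && consumed < maxBytesToConsume && pos < data.length) {
            int limit = Math.min(data.length, pos + fill);
            while (pos < limit && !delimiterFound) {
                byte b = data[pos++];
                consumed++;
                if (b == '\n') {
                    delimiterFound = true;
                } else {
                    record.append((char) b);
                }
            }
            fill = 8; // assumed size of every later fill
        }
        return record.toString();
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo-charlie\ndelta\n".getBytes(StandardCharsets.US_ASCII);
        int pos = 6;                            // start of "bravo-charlie"
        int splitEnd = 10;                      // split ends inside that record
        int maxLineLength = 1024;
        int bytesLeftInSplit = splitEnd - pos;  // 4

        // Limit taken with Math.min: the loop gives up at the split boundary.
        System.out.println(readRecord(data, pos, bytesLeftInSplit,
                Math.min(bytesLeftInSplit, maxLineLength)));
        // Limit taken with Math.max: the loop keeps filling until the delimiter.
        System.out.println(readRecord(data, pos, bytesLeftInSplit,
                Math.max(bytesLeftInSplit, maxLineLength)));
    }
}
```

With the Math.min limit the sketch returns only the bytes that were left in 
the split; with the Math.max limit it reads on to the delimiter.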

I think a more straightforward approach would be to have SplitLineReader be 
aware of the end of the split and track it in fillBuffer() much like 
CompressedSplitLineReader does.  The fillBuffer callback already indicates 
whether we're mid-delimiter or not, so we can simply check if fillBuffer is 
being called after the split has ended but we're mid-delimiter.  In that case 
we need to read an additional record, because the reader of the next split 
will not recognize the partial delimiter.
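
A rough sketch of that idea, under simplifying assumptions: it works on an 
in-memory byte array rather than through fillBuffer, and readSplit, 
extraRecordFix and needAdditionalRecord are illustrative names, not the 
SplitLineReader API.  A split that does not start at offset 0 discards 
everything through the first full delimiter it sees, and a split whose end 
falls strictly inside a delimiter reads one more record.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SplitAwareReaderSketch {
    /** Index of the first full delimiter at or after from, or -1 if none. */
    static int indexOf(byte[] data, byte[] delim, int from) {
        for (int i = Math.max(from, 0); i + delim.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < delim.length; j++) {
                if (data[i + j] != delim[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    /** Reads the records owned by the split [start, end). */
    static List<String> readSplit(byte[] data, byte[] delim, int start, int end,
                                  boolean extraRecordFix) {
        List<String> records = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Discard the first (possibly partial) record; the previous
            // split is assumed to have read it.
            int d = indexOf(data, delim, start);
            pos = (d < 0) ? data.length : d + delim.length;
        }
        boolean needAdditionalRecord = false;
        while (pos < data.length && (pos <= end || needAdditionalRecord)) {
            needAdditionalRecord = false;
            int d = indexOf(data, delim, pos);
            int recordEnd = (d < 0) ? data.length : d;
            records.add(new String(data, pos, recordEnd - pos, StandardCharsets.US_ASCII));
            pos = (d < 0) ? data.length : d + delim.length;
            // The split ended strictly inside the delimiter just consumed, so
            // the next split cannot recognize it: read one more record here.
            if (extraRecordFix && d >= 0 && d < end && end < d + delim.length) {
                needAdditionalRecord = true;
            }
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "r1||r2||r3||".getBytes(StandardCharsets.US_ASCII);
        byte[] delim = "||".getBytes(StandardCharsets.US_ASCII);
        int boundary = 3; // falls between the two bytes of the first "||"

        // Without the extra-record check, r2 is read by neither split.
        System.out.println(readSplit(data, delim, 0, boundary, false));           // [r1]
        System.out.println(readSplit(data, delim, boundary, data.length, false)); // [r3]
        // With the check, the first split also reads r2.
        System.out.println(readSplit(data, delim, 0, boundary, true));            // [r1, r2]
        System.out.println(readSplit(data, delim, boundary, data.length, true));  // [r3]
    }
}
```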

> org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record 
> delimiters well
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-9867
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9867
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 0.20.2, 0.23.9, 2.2.0
>         Environment: CDH3U2 Redhat linux 5.7
>            Reporter: Kris Geusebroek
>            Assignee: Vinay
>            Priority: Critical
>         Attachments: HADOOP-9867.patch, HADOOP-9867.patch, HADOOP-9867.patch
>
>
> Defining a multi-byte record delimiter in a new InputFileFormat sometimes 
> has the effect of skipping records from the input.
> This happens when the input splits are split off just after a record 
> delimiter. The starting point for the next split would be non-zero and 
> skipFirstLine would be true. A seek into the file is done to start - 1 and 
> the text until the first record delimiter is ignored (on the presumption 
> that this record is already handled by the previous map task). Since the 
> record delimiter is multi-byte, the seek only gets the last byte of the 
> delimiter into scope and it is not recognized as a full delimiter. So the 
> text is skipped until the next delimiter (ignoring a full record!!)
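
The skip described in the issue can be reproduced with a small sketch.  The 
helper names (indexOf, resumeAfterSkip) and the sample data are assumptions 
made for illustration; this is not the LineRecordReader code, only the 
seek-to-start-minus-one step applied to a byte array.

```java
import java.nio.charset.StandardCharsets;

class SkipFirstLineSketch {
    /** Index of the first full delimiter at or after from, or -1 if none. */
    static int indexOf(byte[] data, byte[] delim, int from) {
        for (int i = Math.max(from, 0); i + delim.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < delim.length; j++) {
                if (data[i + j] != delim[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Offset at which reading resumes after seeking to start - 1 and
     * discarding everything through the first full delimiter found there.
     */
    static int resumeAfterSkip(byte[] data, byte[] delim, int start) {
        int d = indexOf(data, delim, start - 1);
        return (d < 0) ? data.length : d + delim.length;
    }

    public static void main(String[] args) {
        // Single-byte delimiter: the split starts at 3, right after "aa\n".
        byte[] single = "aa\nbb\ncc\n".getBytes(StandardCharsets.US_ASCII);
        byte[] newline = "\n".getBytes(StandardCharsets.US_ASCII);
        // The seek to offset 2 sees the whole delimiter; reading resumes at "bb".
        System.out.println(resumeAfterSkip(single, newline, 3)); // 3

        // Two-byte delimiter: the split starts at 4, right after "aa||".
        byte[] multi = "aa||bb||cc||".getBytes(StandardCharsets.US_ASCII);
        byte[] bars = "||".getBytes(StandardCharsets.US_ASCII);
        // The seek to offset 3 sees only the last '|'; the first full
        // delimiter found is the one after "bb", so "bb" is skipped.
        System.out.println(resumeAfterSkip(multi, bars, 4)); // 8
    }
}
```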



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
