[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

Jason Lowe (JIRA) Wed, 20 Nov 2013 08:10:57 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13827795#comment-13827795
 ]


Jason Lowe commented on HADOOP-9867:
------------------------------------

Thanks for the patch, Vinay.  I think this approach can work when the input is 
uncompressed, but I don't think it will work for block-compressed inputs.  
Block codecs often report the file position as the start of the current codec 
block, and the position then "teleports" to the start of the next block once 
the first byte of that block is consumed.  See HADOOP-9622 for a similar issue 
with the default delimiter and how it's being addressed.  Also, 
getFilePosition() for a compressed input returns a compressed stream 
offset, so doing math on it with an uncompressed delimiter length 
mixes different units.

Since LineRecordReader::getFilePosition() can mean different things for 
different inputs, I think a better approach would be to change LineReader (not 
LineRecordReader) so that the file position reported for multi-byte custom 
delimiters is the position after the record but not including its 
delimiter.  Alternatively, wait for HADOOP-9622 to be committed and update the 
SplitLineReader interface from that patch so the uncompressed input 
reader indicates that an additional record needs to be read if the split ends 
mid-delimiter.
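
To make the second option concrete, here is a minimal, self-contained 
simulation of it. It is not the Hadoop code: the class, the method names and 
the extraRecordIfMidDelimiter flag are hypothetical, and it assumes an 
uncompressed input, a reader that always discards the first (possibly 
partial) record of a non-first split, and a read loop that continues while 
the position is <= the split end.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class MidDelimiterSplitSketch {

    // Index of the first occurrence of delim in data at or after 'from', or -1.
    static int indexOf(byte[] data, byte[] delim, int from) {
        for (int i = Math.max(from, 0); i + delim.length <= data.length; i++) {
            int j = 0;
            while (j < delim.length && data[i + j] == delim[j]) {
                j++;
            }
            if (j == delim.length) {
                return i;
            }
        }
        return -1;
    }

    // Records emitted for the split [start, end]. When the flag is set, the
    // reader asks for one more record if 'end' falls strictly inside the
    // delimiter that terminated the record it just read.
    static List<String> readSplit(byte[] data, byte[] delim, int start, int end,
                                  boolean extraRecordIfMidDelimiter) {
        int pos = start;
        if (start != 0) {
            // Non-first split: discard everything up to and including the
            // first complete delimiter found at or after 'start'.
            int d = indexOf(data, delim, start);
            pos = (d < 0) ? data.length : d + delim.length;
        }
        List<String> records = new ArrayList<>();
        boolean needExtra = false;
        while (pos < data.length && (pos <= end || needExtra)) {
            needExtra = false;
            int d = indexOf(data, delim, pos);
            int recEnd = (d < 0) ? data.length : d;
            records.add(new String(data, pos, recEnd - pos, StandardCharsets.UTF_8));
            pos = (d < 0) ? data.length : d + delim.length;
            // The delimiter occupied [d, pos); is the split end inside it?
            if (extraRecordIfMidDelimiter && d >= 0 && d < end && end < pos) {
                needExtra = true;
            }
        }
        return records;
    }

    // Reads the input as two splits cut at splitPoint and concatenates the output.
    static List<String> readAll(String text, String delim, int splitPoint,
                                boolean extraRecordIfMidDelimiter) {
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        byte[] dl = delim.getBytes(StandardCharsets.UTF_8);
        List<String> out = new ArrayList<>(
            readSplit(data, dl, 0, splitPoint, extraRecordIfMidDelimiter));
        out.addAll(readSplit(data, dl, splitPoint, data.length, extraRecordIfMidDelimiter));
        return out;
    }

    public static void main(String[] args) {
        // Offset 4 is the second byte of the first "||" delimiter.
        System.out.println(readAll("aaa||bbb||ccc", "||", 4, false)); // [aaa, ccc]
        System.out.println(readAll("aaa||bbb||ccc", "||", 4, true));  // [aaa, bbb, ccc]
    }
}
```

With the flag off, the split that ends at offset 4 stops after "aaa", and the 
next split sees only half a delimiter and throws away "bbb". With the flag 
on, the first split reads "bbb" as well, and the next split's discard of its 
first record becomes harmless.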

> org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record 
> delimiters well
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-9867
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9867
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 0.20.2, 0.23.9, 2.2.0
>         Environment: CDH3U2 Redhat linux 5.7
>            Reporter: Kris Geusebroek
>            Assignee: Vinay
>            Priority: Critical
>         Attachments: HADOOP-9867.patch, HADOOP-9867.patch
>
>
> Defining a record delimiter of multiple bytes in a new InputFileFormat 
> sometimes has the effect of skipping records from the input.
> This happens when the input splits are split off just after a 
> record separator. The starting point for the next split is then nonzero and 
> skipFirstLine is true. A seek into the file is done to start - 1, and 
> the text up to the first record delimiter is ignored (on the presumption 
> that this record is already handled by the previous map task). Since the 
> record delimiter is multibyte, the seek only gets the last byte of the 
> delimiter into scope, and it's not recognized as a full delimiter. So the 
> text is skipped until the next delimiter (ignoring a full record!!)
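
The failure described above can be reproduced with a small standalone 
simulation. This is a simplified model of the behaviour the report describes 
(seek to start - 1, skip to the first delimiter, read while pos < end), not 
the actual LineRecordReader source; the class and method names are made up 
for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class MultiByteDelimiterSkipDemo {

    // Index of the first occurrence of delim in data at or after 'from', or -1.
    static int indexOf(byte[] data, byte[] delim, int from) {
        for (int i = Math.max(from, 0); i + delim.length <= data.length; i++) {
            int j = 0;
            while (j < delim.length && data[i + j] == delim[j]) {
                j++;
            }
            if (j == delim.length) {
                return i;
            }
        }
        return -1;
    }

    // Records emitted for the split [start, end) under the described logic.
    static List<String> readSplit(byte[] data, byte[] delim, int start, int end) {
        int pos = start;
        if (start != 0) {
            // skipFirstLine: back up one byte, then discard everything up to
            // and including the first complete delimiter seen from there.
            int d = indexOf(data, delim, start - 1);
            pos = (d < 0) ? data.length : d + delim.length;
        }
        List<String> records = new ArrayList<>();
        while (pos < end && pos < data.length) {
            int d = indexOf(data, delim, pos);
            int recEnd = (d < 0) ? data.length : d;
            records.add(new String(data, pos, recEnd - pos, StandardCharsets.UTF_8));
            pos = (d < 0) ? data.length : d + delim.length;
        }
        return records;
    }

    // Reads the input as two splits cut at splitPoint and concatenates the output.
    static List<String> readAll(String text, String delim, int splitPoint) {
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        byte[] dl = delim.getBytes(StandardCharsets.UTF_8);
        List<String> out = new ArrayList<>(readSplit(data, dl, 0, splitPoint));
        out.addAll(readSplit(data, dl, splitPoint, data.length));
        return out;
    }

    public static void main(String[] args) {
        // Split boundary at offset 5, immediately after the first "||".
        System.out.println(readAll("aaa||bbb||ccc", "||", 5)); // [aaa, ccc]
        // The same input with a one-byte delimiter is read completely.
        System.out.println(readAll("aaa\nbbb\nccc", "\n", 4)); // [aaa, bbb, ccc]
    }
}
```

In the first call the second split seeks to offset 4, which holds only the 
last byte of "||", so the search for a full delimiter runs on to offset 8 and 
"bbb" is never emitted by either split.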



--
This message was sent by Atlassian JIRA
(v6.1#6144)
