[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046477#comment-14046477
 ] 

Jason Lowe commented on MAPREDUCE-5948:
---------------------------------------

Moved this to MAPREDUCE since that's where the change should go.  Thanks for the 
patch, Rushabh.  Some comments on the patch:

* LineReader: I'm not sure changing LineReader's byteConsumed semantics will be 
OK.  LineReader is used by things besides MapReduce, so changing the return 
result for some corner cases may be problematic for other downstream projects.  
Rather than rely on the bytes consumed result from LineReader it seems we could 
track this using the existing totalBytesRead value.  We simply need to track 
how many bytes from the input stream we've read before we try to read the next 
line.  If we've read past the end of the split (i.e.: bytes read > split length) 
then we can set the finished flag; see the first sketch after this list.
* UncompressedSplitLineReader: Given how complicated split processing can be 
and all the corner cases, it'd be nice to have some comments explaining what's 
going on similar to what's in CompressedSplitLineReader.  Otherwise many will 
wonder why we're going out of our way to make sure we stop right at the split 
when we fill the buffer, etc.
* bufferPosition is not really a position in the buffer but a position within 
the split, and that's similar to totalBytesRead.  I'm not sure we can 
completely combine them, but it'd be nice if their names were more specific to 
what they're tracking (one is the total bytes read from the input stream and 
the other is the total bytes consumed by the line reader).
* {{(inDelimiter && actualBytesRead > 0)}} can just be {{(inDelimiter)}} 
because we already know actualBytesRead > 0 at that point due to the previous 
condition check; a small illustration follows after this list.
* Test doesn't clean up the temporary files it creates when it's done; see the 
cleanup sketch after this list.  (Actually just noticed the tests are creating 
files in the wrong directory, should be something like 
TestLineRecordReader.class.getName() and not TestTextInputFormat.)
* Nit: add whitespace between {{if}} and {{(}}, and keep lines formatted to 80 
columns, to follow the coding standard.
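
To illustrate the first point, here's a rough sketch of counting bytes read 
against the split length.  The class, field, and method names below 
(SplitTrackingReader, fillBuffer, hasMoreRecordsInSplit) are placeholders for 
whatever the reader already has, not the actual UncompressedSplitLineReader 
members:

{code:java}
// Rough sketch only: all names are placeholders, not the actual
// UncompressedSplitLineReader members.
import java.io.IOException;
import java.io.InputStream;

class SplitTrackingReader {
  private final InputStream in;
  private final long splitLength;    // length of this input split in bytes
  private long totalBytesRead = 0;   // bytes pulled from the stream so far
  private boolean finished = false;  // set once we've read past the split

  SplitTrackingReader(InputStream in, long splitLength) {
    this.in = in;
    this.splitLength = splitLength;
  }

  // Called whenever the line reader needs more data; counts every byte we
  // pull from the underlying stream.
  int fillBuffer(byte[] buffer) throws IOException {
    int n = in.read(buffer);
    if (n > 0) {
      totalBytesRead += n;
    }
    return n;
  }

  // Before trying to read the next line, check whether we've already read
  // past the end of the split; if so, the next record belongs to the next
  // split and this reader is finished.
  boolean hasMoreRecordsInSplit() {
    if (totalBytesRead > splitLength) {
      finished = true;
    }
    return !finished;
  }
}
{code}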
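
For the {{inDelimiter}} point, a simplified illustration of why the second 
operand is redundant (the method and parameter names here are made up, not the 
actual code):

{code:java}
// Simplified illustration only, not the actual UncompressedSplitLineReader
// logic; the method and parameter names are made up.
class RedundantCheckDemo {
  static boolean needsMoreInput(boolean inDelimiter, int actualBytesRead) {
    if (actualBytesRead <= 0) {
      return false;               // this case is handled before we get here
    }
    // Original: if (inDelimiter && actualBytesRead > 0) -- the second
    // operand is redundant because the check above already guarantees it.
    return inDelimiter;
  }
}
{code}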
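
And for the test cleanup point, a sketch of what I had in mind, assuming the 
test stays on JUnit 4; the directory layout and field names are just examples, 
not the existing test code:

{code:java}
// Sketch only: directory layout and field names are examples, not the
// existing TestLineRecordReader code.
import java.io.File;
import org.junit.After;

public class TestLineRecordReader {
  // Create temp files under a directory named after this test class rather
  // than TestTextInputFormat, e.g. <test.build.data>/<this test's class name>.
  private final File workDir = new File(
      System.getProperty("test.build.data", "target/test/data"),
      TestLineRecordReader.class.getName());

  @After
  public void cleanup() {
    // Delete whatever the test wrote so repeated runs start from a clean
    // directory.
    File[] files = workDir.listFiles();
    if (files != null) {
      for (File f : files) {
        f.delete();
      }
    }
    workDir.delete();
  }
}
{code}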


> org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record 
> delimiters well
> ------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5948
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5948
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.2, 0.23.9, 2.2.0
>         Environment: CDH3U2 Redhat linux 5.7
>            Reporter: Kris Geusebroek
>            Assignee: Vinayakumar B
>            Priority: Critical
>         Attachments: HADOOP-9867.patch, HADOOP-9867.patch, HADOOP-9867.patch, 
> HADOOP-9867.patch
>
>
> Having defined a recorddelimiter of multiple bytes in a new InputFileFormat 
> sometimes has the effect of skipping records from the input.
> This happens when the input splits are split off just after a 
> recordseparator. The starting point for the next split would be non-zero and 
> skipFirstLine would be true. A seek into the file is done to start - 1 and 
> the text until the first recorddelimiter is ignored (due to the presumption 
> that this record is already handled by the previous maptask). Since the 
> record delimiter is multibyte, the seek only got the last byte of the 
> delimiter into scope and it's not recognized as a full delimiter. So the text 
> is skipped until the next delimiter (ignoring a full record!!)
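
To make the failure mode described above concrete, here's a small standalone 
simulation of the skip-first-line logic with a two-byte delimiter (this is not 
Hadoop code; the data, offsets, and class name are made up):

{code:java}
// Standalone simulation only, not Hadoop code: a split boundary that falls
// right after a two-byte delimiter causes the next record to be skipped.
public class MultibyteDelimiterSkipDemo {
  public static void main(String[] args) {
    String data = "record1\r\nrecord2\r\nrecord3\r\n";
    String delim = "\r\n";

    // Suppose the split boundary falls exactly at the start of "record2",
    // i.e. just after a record delimiter.
    int splitStart = data.indexOf("record2");

    // The reader seeks to splitStart - 1, which lands on the last byte of
    // the delimiter ('\n'), and then discards everything up to and
    // including the next full delimiter it can see from there.
    int seekPos = splitStart - 1;
    int nextDelim = data.indexOf(delim, seekPos); // lone '\n' doesn't match "\r\n"
    int recordStart = nextDelim + delim.length();
    int recordEnd = data.indexOf(delim, recordStart);
    String firstRecordRead = data.substring(recordStart, recordEnd);

    // Prints "record3": "record2" was silently skipped even though it
    // belongs to this split.
    System.out.println(firstRecordRead);
  }
}
{code}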



--
This message was sent by Atlassian JIRA
(v6.2#6252)
