Joe Ellis created HADOOP-13064:
----------------------------------

             Summary: LineReader reports incorrect number of bytes read 
resulting in correctness issues using LineRecordReader
                 Key: HADOOP-13064
                 URL: https://issues.apache.org/jira/browse/HADOOP-13064
             Project: Hadoop Common
          Issue Type: Bug
    Affects Versions: 2.7.1
            Reporter: Joe Ellis
            Priority: Critical


The specific issue we were seeing with LineReader is that when we pass in 
'\r\n' as the line delimiter, the number of bytes it claims to have read is 
less than the number it actually read. We narrowed this down to the case where 
the delimiter is split across the internal buffer boundary: if one fillBuffer 
call fills the buffer with "row\r" and the next call fills it with "\n", the 
number of bytes reported is 4 rather than 5.
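
To make the boundary case concrete, here is a minimal sketch of a toy reader. It is not Hadoop's LineReader code; the class SplitDelimiterDemo, the method bytesConsumed and the countCarried flag are hypothetical names made up for illustration. With countCarried set to false, delimiter bytes that were matched at the end of one buffer are left out of the total, which imitates the symptom described above without claiming to be the actual cause.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

// Toy illustration only; this is NOT Hadoop's LineReader.
class SplitDelimiterDemo {
    private static final byte[] DELIM = {'\r', '\n'};

    /**
     * Returns the number of bytes consumed to read the first "\r\n"-terminated
     * line of text (line bytes plus delimiter), refilling a buffer of
     * bufferSize bytes as needed. bufferSize must be positive.
     *
     * If countCarried is false, delimiter bytes matched at the end of a
     * buffer are dropped from the total, imitating the reported undercount.
     */
    static int bytesConsumed(String text, int bufferSize, boolean countCarried) {
        ByteArrayInputStream in =
            new ByteArrayInputStream(text.getBytes(StandardCharsets.US_ASCII));
        byte[] buffer = new byte[bufferSize];
        int consumed = 0; // bytes accounted for in earlier buffers
        int matched = 0;  // delimiter bytes matched so far
        int n;
        while ((n = in.read(buffer, 0, bufferSize)) > 0) {
            for (int i = 0; i < n; i++) {
                if (buffer[i] == DELIM[matched]) {
                    matched++;
                    if (matched == DELIM.length) {
                        // Full delimiter seen: the line ends at index i.
                        return consumed + i + 1;
                    }
                } else {
                    // Mismatch: restart, allowing this byte to begin a delimiter.
                    matched = (buffer[i] == DELIM[0]) ? 1 : 0;
                }
            }
            // Buffer exhausted; a partial delimiter may be pending in 'matched'.
            consumed += countCarried ? n : n - matched;
        }
        return consumed; // end of input without a complete delimiter
    }

    public static void main(String[] args) {
        System.out.println(bytesConsumed("row\r\n", 4, true));  // correct accounting
        System.out.println(bytesConsumed("row\r\n", 4, false)); // undercounting variant
    }
}
```

With a 4-byte buffer the input "row\r\n" arrives as "row\r" followed by "\n", so the correct accounting yields 5 and the undercounting variant yields 4. With a buffer large enough to hold the whole line (for example 8 bytes), both variants yield 5, which is consistent with the problem only showing up when the delimiter straddles the buffer boundary.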

This results in correctness issues in LineRecordReader: if this off-by-one 
error occurs enough times while reading a split, the reader continues to read 
records past its split boundary, so records appear to come from multiple 
splits.
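
The effect on split boundaries can be sketched with a simplified model. It assumes a record reader that advances a position counter by the number of bytes each read reports and keeps reading while that position has not passed the end of the split. The class SplitBoundaryDemo and the method recordsRead are hypothetical, the loop is a simplification rather than LineRecordReader's actual code, and undercounting every record is an exaggeration chosen to keep the arithmetic short.

```java
// Simplified model for illustration; this is NOT Hadoop's LineRecordReader.
class SplitBoundaryDemo {

    /**
     * Counts how many records a reader consumes for a split that starts at
     * offset 0 and ends at splitEnd, when each read reports
     * reportedBytesPerRecord bytes. totalRecords caps the loop at the number
     * of records available in the file.
     */
    static int recordsRead(long splitEnd, int reportedBytesPerRecord, int totalRecords) {
        long pos = 0;    // position as the reader believes it to be
        int records = 0;
        while (pos <= splitEnd && records < totalRecords) {
            pos += reportedBytesPerRecord; // trust the reported byte count
            records++;
        }
        return records;
    }

    public static void main(String[] args) {
        // Each record is "row\r\n": 5 bytes on disk.
        System.out.println(recordsRead(20, 5, 100)); // accurate byte counts
        System.out.println(recordsRead(20, 4, 100)); // every record undercounted by 1
    }
}
```

For a split ending at byte 20, accurate counts of 5 bytes per record stop the reader after 5 records, while counts of 4 bytes per record let it read a 6th record. In this model the extra record lies in the range that belongs to the following split.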



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
