Joe Ellis created HADOOP-13064:
----------------------------------
Summary: LineReader reports incorrect number of bytes read
resulting in correctness issues using LineRecordReader
Key: HADOOP-13064
URL: https://issues.apache.org/jira/browse/HADOOP-13064
Project: Hadoop Common
Issue Type: Bug
Affects Versions: 2.7.1
Reporter: Joe Ellis
Priority: Critical
The specific issue we were seeing with LineReader is that when we pass in
'\r\n' as the line delimiter, the number of bytes it claims to have read is
less than what it actually read. We narrowed this down to the case where the
delimiter is split across the internal buffer boundary: if fillBuffer fills
the buffer with "row\r" and the next call fills it with "\n", the number of
bytes reported is 4 rather than 5.
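
To make the expected behaviour concrete, here is a minimal sketch of a reader
that keeps counting consumed bytes across a buffer refill. It is not Hadoop's
LineReader: the class DelimiterBoundarySketch, its fields, and its readLine
signature are hypothetical, and the partial-match handling assumes a delimiter
such as "\r\n" whose prefix does not overlap itself.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical, simplified reader for illustration; not Hadoop's LineReader. */
class DelimiterBoundarySketch {
    private final InputStream in;
    private final byte[] buffer;
    private int bufferLength = 0; // number of valid bytes in buffer
    private int bufferPos = 0;    // index of the next unread byte

    DelimiterBoundarySketch(InputStream in, int bufferSize) {
        this.in = in;
        this.buffer = new byte[bufferSize];
    }

    /**
     * Reads one record into out and returns the number of bytes consumed
     * from the stream, delimiter bytes included.
     */
    int readLine(StringBuilder out, byte[] delimiter) {
        out.setLength(0);
        int consumed = 0;
        int matched = 0; // delimiter bytes matched so far; survives refills
        while (true) {
            if (bufferPos >= bufferLength) {
                try {
                    bufferLength = in.read(buffer); // refill the buffer
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                bufferPos = 0;
                if (bufferLength <= 0) {
                    // End of stream: a partial delimiter match was plain data.
                    for (int i = 0; i < matched; i++) out.append((char) delimiter[i]);
                    return consumed;
                }
            }
            byte b = buffer[bufferPos++];
            consumed++; // every byte taken from the stream is counted once
            if (b == delimiter[matched]) {
                matched++;
                if (matched == delimiter.length) return consumed;
            } else {
                // The partial match was data after all; emit it and restart.
                for (int i = 0; i < matched; i++) out.append((char) delimiter[i]);
                matched = 0;
                if (b == delimiter[0]) {
                    matched = 1;
                } else {
                    out.append((char) b);
                }
            }
        }
    }
}
```

With a 4-byte buffer and the input "row\r\nnext", the first fill holds "row\r"
and the second starts with "\n". Because the match state and the byte count
both survive the refill, the first readLine call in this sketch returns 5 and
yields the record "row".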
This results in correctness issues in LineRecordReader: if this off-by-one
error occurs enough times while reading a split, the reader continues to
read records past its split boundary, so records appear to come from
multiple splits.
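
The effect on a split can be illustrated with a small model. This is only a
sketch under assumptions: SplitOvershootSketch and recordsEmitted are
hypothetical names, the loop simply keeps reading while the tracked position
is at or before the split end, and the shortfall is applied to every record
(the worst case) rather than only to records whose delimiter straddles a
buffer boundary.

```java
/** Hypothetical model of position tracking within a split; not Hadoop code. */
class SplitOvershootSketch {
    /**
     * Returns how many records are emitted for a split ending at splitEnd.
     *
     * @param recordSizes          true size of each record, delimiter included
     * @param splitEnd             byte offset at which the split ends
     * @param underReportPerRecord bytes missing from each reported read size
     */
    static int recordsEmitted(int[] recordSizes, long splitEnd, int underReportPerRecord) {
        long pos = 0;    // position as tracked from the reported byte counts
        int emitted = 0;
        for (int size : recordSizes) {
            if (pos > splitEnd) {
                break; // the reader believes it has passed the split end
            }
            pos += size - underReportPerRecord; // advance by the reported size
            emitted++;
        }
        return emitted;
    }
}
```

For ten 5-byte records ("row\r\n") and a split ending at byte 20, accurate
counts emit 5 records in this model, while counts that are one byte short
emit 6: the tracked position lags behind the real one, so the reader takes a
record that belongs to the next split.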
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)