[ https://issues.apache.org/jira/browse/HADOOP-13064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joe Ellis updated HADOOP-13064:
-------------------------------
Attachment: LineReaderTest.java
Attached is an example test that fails.
> LineReader reports incorrect number of bytes read resulting in correctness
> issues using LineRecordReader
> --------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-13064
> URL: https://issues.apache.org/jira/browse/HADOOP-13064
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: Joe Ellis
> Priority: Critical
> Attachments: LineReaderTest.java
>
>
> The specific issue we are seeing with LineReader is that when we pass in
> '\r\n' as the line delimiter, the number of bytes it reports having read is
> less than the number it actually read. We narrowed this down to the case
> where the delimiter is split across the internal buffer boundary: if
> fillBuffer fills the buffer with "row\r" and the next call fills it with
> "\n", the number of bytes reported is 4 rather than 5.
> This causes correctness issues in LineRecordReader: if this off-by-one
> error occurs enough times while reading a split, the reader's tracked
> position falls behind its true position and it continues to read records
> past its split boundary, so records appear to come from multiple splits.
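
To make the arithmetic above concrete, here is a minimal, self-contained sketch. It is a toy model and not Hadoop's LineReader or LineRecordReader: the class CrlfCountSketch, the method recordsReadForSplit, and the undercount parameter are all hypothetical names introduced for illustration. The toy reader counts every consumed byte even when "\r" and "\n" arrive in different buffer fills, and the split loop shows how subtracting one byte per record (an assumed stand-in for the reported undercount) could let a reader run past its split end.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Toy model of a buffered "\r\n" line reader. This is NOT Hadoop's LineReader. */
class CrlfCountSketch {
    private final InputStream in;
    private final byte[] buffer;
    private int bufferLength = 0; // number of valid bytes currently in buffer
    private int bufferPos = 0;    // index of the next unread byte in buffer

    CrlfCountSketch(InputStream in, int bufferSize) {
        this.in = in;
        this.buffer = new byte[bufferSize];
    }

    /**
     * Reads one "\r\n"-terminated line (ASCII only) into line and returns the
     * number of bytes consumed, delimiter included. Returns 0 at end of stream.
     */
    int readLine(StringBuilder line) {
        line.setLength(0);
        int consumed = 0;
        boolean sawCr = false; // must survive a buffer refill
        while (true) {
            if (bufferPos >= bufferLength) {
                try {
                    bufferLength = in.read(buffer);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                bufferPos = 0;
                if (bufferLength <= 0) { // end of stream
                    bufferLength = 0;
                    if (sawCr) line.append('\r');
                    return consumed;
                }
            }
            byte b = buffer[bufferPos++];
            consumed++; // every byte taken from the stream is counted
            if (sawCr) {
                if (b == '\n') return consumed; // delimiter complete
                line.append('\r');              // lone "\r" was data
                sawCr = false;
            }
            if (b == '\r') sawCr = true;
            else line.append((char) b);
        }
    }

    /**
     * Simplified model of a split reader: keeps reading records while the
     * tracked position is at or before splitEnd. The undercount argument is a
     * hypothetical knob that subtracts bytes from every reported count.
     */
    static int recordsReadForSplit(byte[] data, int bufferSize, long splitEnd, int undercount) {
        CrlfCountSketch reader = new CrlfCountSketch(new ByteArrayInputStream(data), bufferSize);
        StringBuilder line = new StringBuilder();
        long pos = 0;
        int records = 0;
        while (pos <= splitEnd) {
            int consumed = reader.readLine(line);
            if (consumed == 0) break;
            pos += consumed - undercount;
            records++;
        }
        return records;
    }

    public static void main(String[] args) {
        // Buffer of 4 bytes: the first fill is "row\r", the second is "\n".
        byte[] one = "row\r\n".getBytes(StandardCharsets.US_ASCII);
        CrlfCountSketch reader = new CrlfCountSketch(new ByteArrayInputStream(one), 4);
        StringBuilder line = new StringBuilder();
        System.out.println(reader.readLine(line) + " bytes for \"" + line + "\""); // 5 bytes for "row"

        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 6; i++) sb.append("row\r\n");
        byte[] data = sb.toString().getBytes(StandardCharsets.US_ASCII);
        System.out.println(recordsReadForSplit(data, 4, 14, 0)); // 3
        System.out.println(recordsReadForSplit(data, 4, 14, 1)); // 4
    }
}
```

In this model, with 5-byte records and a split end of 14, accurate counts stop the loop after three records, while a one-byte undercount per record lets it read a fourth. The real reader only undercounts when the delimiter straddles a buffer boundary, so the drift there would accumulate more slowly than in this sketch.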