Kris Geusebroek created HADOOP-9867:
---------------------------------------
Summary: org.apache.hadoop.mapred.LineRecordReader does not handle
multibyte record delimiters well
Key: HADOOP-9867
URL: https://issues.apache.org/jira/browse/HADOOP-9867
Project: Hadoop Common
Issue Type: Bug
Components: io
Affects Versions: 0.20.2
Environment: CDH3U2 Redhat linux 5.7
Reporter: Kris Geusebroek
Defining a multi-byte record delimiter in a new InputFileFormat sometimes
causes records to be skipped from the input.
This happens when an input split boundary falls immediately after a record
delimiter. The starting point of the next split is then non-zero, so
skipFirstLine is true. The reader seeks to start - 1 and ignores the text up
to the first record delimiter (on the presumption that this record was already
handled by the previous map task). Because the record delimiter is multi-byte,
the seek brings only the last byte of the delimiter into scope, and it is not
recognized as a full delimiter. The text is therefore skipped up to the next
delimiter (dropping a full record!)
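
The behaviour described above can be reproduced outside Hadoop with a small
stand-alone model. The sketch below is not the LineRecordReader code: the class
SplitSkipSketch, the helpers indexOf and readSplit, and the backoff parameter
are all hypothetical names introduced for illustration. It assumes a reader
that, for a non-zero start, seeks to start - backoff and discards everything up
to and including the first full delimiter it finds. Backing off by the delimiter
length is shown only as one possible way to avoid the skip in this simplified
model, not as the fix adopted in Hadoop.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitSkipSketch {

    // Index of the first full occurrence of delim in data at or after 'from', or -1.
    static int indexOf(byte[] data, byte[] delim, int from) {
        for (int i = Math.max(from, 0); i + delim.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < delim.length; j++) {
                if (data[i + j] != delim[j]) { match = false; break; }
            }
            if (match) return i;
        }
        return -1;
    }

    // Returns the records a reader would emit for the split [start, end).
    // 'backoff' (hypothetical parameter) is how far before 'start' the reader
    // seeks before discarding the presumed partial first record.
    static List<String> readSplit(byte[] data, byte[] delim,
                                  int start, int end, int backoff) {
        List<String> records = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Skip everything up to and including the first full delimiter.
            int hit = indexOf(data, delim, start - backoff);
            pos = (hit < 0) ? data.length : hit + delim.length;
        }
        // A record belongs to this split if it begins before 'end'.
        while (pos < end && pos < data.length) {
            int hit = indexOf(data, delim, pos);
            int recordEnd = (hit < 0) ? data.length : hit;
            records.add(new String(data, pos, recordEnd - pos, StandardCharsets.UTF_8));
            pos = (hit < 0) ? data.length : hit + delim.length;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "r1||r2||r3".getBytes(StandardCharsets.UTF_8);
        byte[] delim = "||".getBytes(StandardCharsets.UTF_8);
        int boundary = 4; // split boundary right after the first "||"

        // First split: reads r1 and stops at the boundary.
        System.out.println(readSplit(data, delim, 0, boundary, 1));
        // Second split, seeking to start - 1: only the last '|' is in scope.
        System.out.println(readSplit(data, delim, boundary, data.length, 1));
        // Second split, seeking back by the full delimiter length.
        System.out.println(readSplit(data, delim, boundary, data.length, delim.length));
    }
}
```

In this model the three calls print [r1], [r3] and [r2, r3]: with a one-byte
backoff the second split loses r2, which is the skipped record described in
this report.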