Changing LineRecordReader algo so that it does not need to skip backwards in the
stream
--------------------------------------------------------------------------------------
Key: HADOOP-4010
URL: https://issues.apache.org/jira/browse/HADOOP-4010
Project: Hadoop Core
Issue Type: Improvement
Components: mapred
Affects Versions: 0.19.0
Reporter: Abdul Qadeer
Assignee: Abdul Qadeer
Fix For: 0.19.0
The current LineRecordReader algorithm needs to move backwards in the stream
(in its constructor) to position itself correctly. It moves back one byte from
the start of its split, tries to read a record (i.e. a line), and throws that
record away, because it knows that this line will be taken care of by some
other mapper. This algorithm is difficult and inefficient to use on a
compressed stream, where data reaches the LineRecordReader via a codec. (In
the current implementation, Hadoop does not split a compressed file; it makes
a single split from the start to the end of the file, so only one mapper
handles it. We are currently working on making the BZip2 codec splittable in
Hadoop, so this proposed change will make it possible to handle plain and
compressed streams uniformly.)
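For illustration, the backward-seeking approach described above can be sketched
roughly as follows. This is not the Hadoop source: the function name
read_split_backward_seek, the use of io.BytesIO as a stand-in for the input
stream, the guard for a split starting at offset 0, and the exact stopping
condition (pos < end) are assumptions made for this sketch.

```python
import io


def read_split_backward_seek(stream, start, end):
    """Return the lines owned by split [start, end), old style (hypothetical sketch).

    Assumes a seekable binary stream with "\n" line endings.
    """
    pos = start
    if start != 0:
        # Step back one byte so that a line beginning exactly at `start`
        # is not thrown away by the skip below.
        stream.seek(start - 1)
        # Discard everything up to and including the next newline; the
        # mapper that owns the previous split handles that record.
        pos = (start - 1) + len(stream.readline())
    else:
        stream.seek(0)

    lines = []
    # Assumed stopping rule: only records that begin before `end` are read.
    while pos < end:
        line = stream.readline()
        if not line:  # end of file
            break
        pos += len(line)
        lines.append(line.rstrip(b"\n"))
    return lines
```

The seek to start - 1 is the step that is awkward for a compressed stream: a
codec generally cannot step back a single decompressed byte cheaply.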
In the new algorithm, each mapper always skips its first line, because it
knows that this line will have been read by some other mapper. In turn, each
mapper must finish its reading at a record boundary that is always beyond its
upper split limit. With this change, LineRecordReader no longer needs to move
backwards in the stream.
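A minimal sketch of the forward-only idea, under stated assumptions, might look
like the following. The function name read_split_forward_only and the
io.BytesIO stand-in for the input stream are hypothetical. The sketch also
assumes that the split starting at offset 0 does not skip its first line (no
earlier mapper exists to read it), a detail the description above does not
spell out.

```python
import io


def read_split_forward_only(stream, start, end):
    """Return the lines owned by split [start, end), forward reads only (sketch).

    Assumes a binary stream with "\n" line endings. The only seek is the
    initial positioning at `start`; the reader never moves backwards.
    """
    stream.seek(start)
    pos = start
    if start != 0:
        # Always discard the first (possibly partial) line: the mapper for
        # the previous split reads past its own upper limit to cover it.
        pos += len(stream.readline())

    lines = []
    # Keep reading while the next record begins at or before `end`, so the
    # last record read finishes beyond the upper split limit.
    while pos <= end:
        line = stream.readline()
        if not line:  # end of file
            break
        pos += len(line)
        lines.append(line.rstrip(b"\n"))
    return lines
```

Under these assumptions, a record that begins at position p (with p > 0) is
read by the one split for which start < p <= end, so running the function over
consecutive splits returns every line exactly once.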
--
This message is automatically generated by JIRA.