[ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Abdul Qadeer updated HADOOP-4010:
---------------------------------
Attachment: Hadoop-4010_version2.patch
Bug fixes.
> Changing LineRecordReader algo so that it does not need to skip backwards in
> the stream
> --------------------------------------------------------------------------------------
>
> Key: HADOOP-4010
> URL: https://issues.apache.org/jira/browse/HADOOP-4010
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.19.0
> Reporter: Abdul Qadeer
> Assignee: Abdul Qadeer
> Fix For: 0.19.0
>
> Attachments: Hadoop-4010.patch, Hadoop-4010_version2.patch
>
>
> The current LineRecordReader algorithm has to move backwards in the stream
> (in its constructor) to position itself correctly. It moves back one byte
> from the start of its split, reads one record (i.e. a line), and throws it
> away, because that line is certain to be handled by another mapper. This
> approach is difficult and inefficient for compressed streams, where data
> reaches the LineRecordReader through a codec. (In the current
> implementation, Hadoop does not split a compressed file; it makes a single
> split from the start to the end of the file, so only one mapper handles it.
> We are currently working on making the BZip2 codec splittable in Hadoop, and
> the proposed change will make it possible to handle plain and compressed
> streams uniformly.)
> In the new algorithm, each mapper always skips its first line, because that
> line is certain to have been read by another mapper. Each mapper must
> therefore finish reading at a record boundary that always lies beyond its
> upper split limit. With this change, LineRecordReader no longer needs to
> move backwards in the stream.
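
The forward-only scheme described above can be illustrated with a small, self-contained sketch. This is not the code from the attached patch: the class SplitLineReaderSketch and the methods endOfLine and readSplit are hypothetical names, and an in-memory byte array stands in for the input stream. It also assumes that the split starting at offset 0 keeps its first line (no earlier mapper exists to read it) and that a reader keeps going while the next line starts at or before its upper limit, so that a line beginning exactly on a split boundary is read by the earlier split.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitLineReaderSketch {

    /** Returns the index just past the next '\n' at or after pos, or data.length if there is none. */
    static int endOfLine(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    /** Reads the lines belonging to the split [start, end), moving forward only. */
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // Assumption: only the split at offset 0 keeps its first line.
        // Every other split discards it, because the previous split reads it.
        if (start != 0) {
            pos = endOfLine(data, pos);
        }
        // Keep reading while the next line starts at or before the upper limit,
        // so the last record read ends beyond the split's end.
        while (pos <= end && pos < data.length) {
            int next = endOfLine(data, pos);
            int len = next - pos;
            if (data[next - 1] == '\n') {
                len--; // drop the line terminator
            }
            records.add(new String(data, pos, len, StandardCharsets.UTF_8));
            pos = next;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "aaa\nbbb\nccc\nddd\n".getBytes(StandardCharsets.UTF_8);
        int splitSize = 5; // deliberately not aligned with line boundaries
        for (int start = 0; start < data.length; start += splitSize) {
            int end = Math.min(start + splitSize, data.length);
            System.out.println("[" + start + "," + end + ") -> " + readSplit(data, start, end));
        }
    }
}
```

With the sample input, every line is returned by exactly one split, even though the split boundaries fall in the middle of lines, and no reader ever seeks to a position before its own start offset.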
--
This message is automatically generated by JIRA.