[ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627581#action_12627581 ]
Abdul Qadeer commented on HADOOP-4010:
--------------------------------------
(1) In TestLineInputFormat, as you mentioned, an equal number of lines is
placed in each split, except the last one. With the new LineRecordReader
algorithm, the first split will process one more line than the other
mappers. For this reason I am leaving out the first split as well.
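A rough sketch of the check I mean (illustrative Java only, not the actual
TestLineInputFormat code; the per-split counts would come from running the
record reader over each split):

    // Sketch: verify the per-split line counts, leaving out the last split
    // (it may hold fewer lines) and now also the first split (with the new
    // LineRecordReader it reads one line more than the other splits).
    static void checkSplitCounts(int[] linesPerSplit, int expectedLines) {
      for (int i = 1; i < linesPerSplit.length - 1; i++) {
        if (linesPerSplit[i] != expectedLines) {
          throw new AssertionError("split " + i + " holds " + linesPerSplit[i]
              + " lines, expected " + expectedLines);
        }
      }
    }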
(2) About the caching test failure, I am not really sure what is happening.
I tried the LineRecordReader in isolation, for the same kind of test, and it
works. Something is going wrong in the symlink handling. I want to debug
the test case, but doing so in Eclipse gives an error that the WebApps are
not on the classpath, when in fact I have put them on the Eclipse classpath.
Any suggestions for debugging this test case?
Thanks,
Abdul Qadeer
> Changing LineRecordReader algorithm so that it does not need to skip backwards in
> the stream
> --------------------------------------------------------------------------------------
>
> Key: HADOOP-4010
> URL: https://issues.apache.org/jira/browse/HADOOP-4010
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.19.0
> Reporter: Abdul Qadeer
> Assignee: Abdul Qadeer
> Fix For: 0.19.0
>
> Attachments: Hadoop-4010.patch, Hadoop-4010_version2.patch
>
>
> The current algorithm of the LineRecordReader needs to move backwards in the
> stream (in its constructor) to position itself correctly in the stream. It
> moves back one byte from the start of its split, tries to read a record
> (i.e. a line), and throws it away, because it is sure that this line will be
> taken care of by some other mapper. This algorithm is awkward and inefficient
> when used with a compressed stream, where data reaches the LineRecordReader
> via some codec. (Although in the current implementation Hadoop does not split
> a compressed file and only makes one split from the start to the end of the
> file, so only one mapper handles it, we are currently working on a BZip2
> codec where splitting will work with Hadoop. So this proposed change will
> make it possible to handle plain as well as compressed streams uniformly.)
> In the new algorithm, each mapper always skips its first line, because it is
> sure that that line will have been read by some other mapper. Consequently,
> each mapper must finish its reading at a record boundary, which always lies
> beyond its upper split limit. With this change, LineRecordReader no longer
> needs to move backwards in the stream.
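>
> To make the difference concrete, here is a rough sketch of the two
> positioning strategies (illustrative Java only, heavily simplified; the
> class and method names are made up and this is not the actual
> LineRecordReader code or the patch):
>
>   import java.io.IOException;
>   import java.io.InputStream;
>   import org.apache.hadoop.fs.FSDataInputStream;
>
>   // Illustrative sketch only; simplified, not the real LineRecordReader.
>   class SplitPositioningSketch {
>
>     // Old approach: back up one byte before the split start, then read and
>     // discard a line, so the reader lands exactly on a record boundary.
>     // The backward seek is what makes this awkward for compressed streams.
>     static void positionOldWay(FSDataInputStream in, long splitStart)
>         throws IOException {
>       if (splitStart != 0) {
>         in.seek(splitStart - 1);   // backward seek into the previous split
>         skipLine(in);              // throw the partial line away
>       }
>     }
>
>     // New approach: no backward seek.  A mapper whose split does not start
>     // at offset 0 skips the first line it sees (the previous mapper reads
>     // it), and every mapper keeps reading until the first record boundary
>     // past its split end.
>     static void positionNewWay(InputStream in, long splitStart)
>         throws IOException {
>       if (splitStart != 0) {
>         skipLine(in);              // skip the first, possibly partial, line
>       }
>     }
>
>     // Minimal line skipper: consume bytes up to and including '\n' or EOF.
>     static void skipLine(InputStream in) throws IOException {
>       int b;
>       do {
>         b = in.read();
>       } while (b != -1 && b != '\n');
>     }
>   }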