[
https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627626#action_12627626
]
Chris Douglas commented on HADOOP-4010:
---------------------------------------
bq. Due to new LineRecordReader algorithm, the first split will process one
more line as compared to other mappers
That's probably not going to be acceptable to users of NLineInputFormat. Users
employing N formatted lines to initialize and run a mapper may find their jobs
no longer work if the input is offset or if a map receives N+1 lines. If this
is necessary for the new algorithm, rewriting or somehow accommodating this
case may be required.
bq. Something is going wrong in symlink stuff. I want to debug the test case
but doing so in Eclipse gives error[...]
Sorry, I don't use Eclipse. It looks like the symlink resolution is working;
both cache files are picked up as arguments from the input file. At a glance,
what appears to be going wrong is newline detection or propagation between
invocations of cat from xargs, a bad interaction with streaming (it also uses
LineRecordReader, IIRC), or input exercising an edge case for LineRecordReader.
Since it sounds like you've ruled out the latter, have you tried running a
streaming job like the one in the test case? I suspect the cache isn't necessary
to reproduce this.
> Changing the LineRecordReader algorithm so that it does not need to skip
> backwards in the stream
> --------------------------------------------------------------------------------------
>
> Key: HADOOP-4010
> URL: https://issues.apache.org/jira/browse/HADOOP-4010
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.19.0
> Reporter: Abdul Qadeer
> Assignee: Abdul Qadeer
> Fix For: 0.19.0
>
> Attachments: Hadoop-4010.patch, Hadoop-4010_version2.patch
>
>
> The current LineRecordReader algorithm needs to move backwards in the
> stream (in its constructor) to position itself correctly. It moves back one
> byte from the start of its split, reads a record (i.e. a line), and throws
> it away, since that line is guaranteed to be handled by some other mapper.
> This algorithm is awkward and inefficient for compressed streams, where
> data reaches the LineRecordReader through a codec. (In the current
> implementation, Hadoop does not split a compressed file at all: it makes a
> single split from the start to the end of the file, so only one mapper
> handles it. We are currently working on a BZip2 codec that supports
> splitting, so this proposed change will make it possible to handle plain
> and compressed streams uniformly.)
> In the new algorithm, each mapper always skips its first line, since that
> line is guaranteed to have been read by some other mapper. Each mapper must
> then finish reading at a record boundary, which is always beyond its upper
> split limit. With this change, LineRecordReader no longer needs to move
> backwards in the stream.
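The proposed split-reading rule can be sketched roughly as follows. This is a minimal illustration over an in-memory byte array, not Hadoop's actual LineRecordReader; the class and method names are made up, and it assumes (per the comment above about the first split processing one extra line) that the split starting at offset zero does not skip a line.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only, not the Hadoop implementation.
public class SplitReaderSketch {

    // Records emitted by the mapper that owns the byte range [start, end).
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // Every split except the first skips its first line: that line
        // (whole or partial) is guaranteed to be read by the previous split.
        if (start != 0) {
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // step past the newline
        }
        // Keep reading whole records while a record starts at or before
        // `end`; the last record may extend beyond the upper split limit,
        // so the reader never has to seek backwards in the stream.
        while (pos <= end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            records.add(new String(data, lineStart, pos - lineStart));
            pos++; // step past the newline
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "aaa\nbbbb\ncc\ndd\n".getBytes();
        // Three arbitrary byte-offset splits; line boundaries do not align
        // with split boundaries, yet every line lands in exactly one split.
        System.out.println(readSplit(data, 0, 5));   // [aaa, bbbb]
        System.out.println(readSplit(data, 5, 10));  // [cc]
        System.out.println(readSplit(data, 10, 15)); // [dd]
    }
}
```

Because a reader only ever consumes bytes forward from its split's start, the same logic works whether the bytes come straight from HDFS or out of a decompression codec, which is the point of the proposed change.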
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.