Hi Bwolen,
Thanks for investigating this.
You should file a Jira on this (with your traces etc.). Let me know if
you want me to file it.
> and I think this is precisely why, when you have mark()/reset() (as in
> the \r\n case), the read can be small. I can track down Java's
> BufferedInputStream code to check for details, but it seems pretty
> clear from the actual code execution stack.
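
Here is a minimal, self-contained demo of that effect (my own sketch,
not Hadoop code; QuietStream mimics a socket-like stream whose
available() returns 0, and the buffer sizes are made up for
illustration):

    import java.io.BufferedInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    // Minimal demo: with no mark set, a request larger than the internal
    // buffer is read directly into the caller's array; once mark() is
    // called, data must be staged through the internal buffer so reset()
    // can replay it, and the read comes back short.
    public class ShortReadDemo {

        // Serves endless zero bytes but reports available() == 0, which
        // stops BufferedInputStream's internal retry loop after one pass
        // (as a socket with no buffered data would).
        static class QuietStream extends InputStream {
            @Override public int read() { return 0; }
            @Override public int read(byte[] b, int off, int len) {
                java.util.Arrays.fill(b, off, off + len, (byte) 0);
                return len;   // always satisfies the full request
            }
            @Override public int available() { return 0; }
        }

        public static void main(String[] args) throws IOException {
            // Small internal buffer (512 bytes) to make the effect obvious.
            BufferedInputStream in =
                new BufferedInputStream(new QuietStream(), 512);
            byte[] dest = new byte[8192];

            // Prints 8192: no mark, so the internal buffer is bypassed.
            System.out.println("no mark:   " + in.read(dest, 0, dest.length));

            in.mark(4096);
            // Prints 512: the read is capped by the internal buffer size.
            System.out.println("with mark: " + in.read(dest, 0, dest.length));
        }
    }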
As you noted, this could be caused by "marking" the input stream. I think
Hadoop's dependence on the length of reads returned by
BufferedInputStream should be fixed. Alternatively, we could make this
stream non-markable (we would still be depending on BufferedInputStream
behavior that is not part of its contract).
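
Making the stream non-markable could look something like the sketch
below (NonMarkableStream is a hypothetical name, not an existing Hadoop
class):

    import java.io.FilterInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    // Hypothetical sketch: refuse mark/reset so callers cannot force
    // reads to be staged through an internal buffer.
    class NonMarkableStream extends FilterInputStream {
        NonMarkableStream(InputStream in) { super(in); }

        @Override public boolean markSupported() { return false; }

        // FilterInputStream would otherwise delegate these to the
        // wrapped stream, so override them explicitly.
        @Override public synchronized void mark(int readlimit) { /* no-op */ }

        @Override public synchronized void reset() throws IOException {
            throw new IOException("mark/reset not supported");
        }
    }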
You might find HADOOP-1470 and HADOOP-1134 (last 4-6 comments) relevant.
Raghu.
Bwolen Yang wrote:
> taking values at runtime (I have it through exceptions when the
> result is 0, and print out the values).
>
> The \r\n problem was observed on the 0.13.0 release.
>
> To study the behavior, I instrumented the Hadoop source from the head
> of the tree. More specifically, attached are two sample stacks. (I
> have readBuffer throw when it gets 0 bytes, and have the input checker
> catch the exception and rethrow it. This way, I catch the values from
> both caller and callee.)
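
The instrumentation described above amounts to something like this
sketch (method names are illustrative, not the actual Hadoop ones):

    import java.io.IOException;

    // Sketch: the callee throws when a read returns 0 bytes, and the
    // caller catches and rethrows with its own arguments attached, so a
    // single stack trace records the values seen on both sides.
    class InstrumentedRead {
        // Stand-in for the low-level read; returns 0 to trip the check.
        private int doRead(byte[] buf, int off, int len) {
            return 0;
        }

        int readBuffer(byte[] buf, int off, int len) throws IOException {
            int n = doRead(buf, off, len);
            if (n == 0) {
                // callee side: record the values it was called with
                throw new IOException("readBuffer got 0 bytes: off=" + off
                    + " len=" + len);
            }
            return n;
        }

        int checksummedRead(byte[] buf, int off, int len) throws IOException {
            try {
                return readBuffer(buf, off, len);
            } catch (IOException e) {
                // caller side: rethrow with its own view of the arguments
                throw new IOException("input checker: off=" + off
                    + " len=" + len, e);
            }
        }
    }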
> On a separate note, if the assumption (len >= bytesPerSum) exists,
> would it be OK to throw an exception when it is violated? Most of the
> time (e.g., in crawl/indexing), people won't notice that some part of
> the input data is getting thrown away. An exception would make this a
> lot easier to debug as the code changes (and the assumption gets
> violated), and the cost in this case is probably not too bad, since a
> good part of the cost is in the network and in going to disk.
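
The check proposed here could be as simple as the following sketch
(names are hypothetical, not the actual Hadoop code):

    import java.io.IOException;

    // Hypothetical sketch of the proposed guard: fail loudly instead of
    // silently returning 0 and letting callers drop input.
    class ChecksumReadGuard {
        private final int bytesPerSum;

        ChecksumReadGuard(int bytesPerSum) {
            this.bytesPerSum = bytesPerSum;
        }

        // Call at the top of the checksummed read path.
        void checkReadLen(int len) throws IOException {
            if (len < bytesPerSum) {
                throw new IOException("read length " + len
                    + " < bytesPerSum " + bytesPerSum
                    + "; caller would silently lose data");
            }
        }
    }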