Hi Bwolen,

Thanks for investigating this.

You should file a Jira on this (with your traces etc.). Let me know if you want me to file it.

> and I think this precisely is why when you have mark()/reset() (as in
> \r\n case), the read can be small.  I can track down java's
> bufferedinputstream code to check for details, but it seems pretty
> clear from actual code execution stack.

As you noted, this could be caused by "marking" the input stream. I think Hadoop's dependence on the length returned by BufferedInputStream.read() should be fixed. Alternately, we could make this stream non-markable (we would still be depending on BufferedInputStream behavior that is not part of its contract).
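
For illustration, here is a minimal read loop that relies only on what InputStream.read() actually guarantees: it may return fewer bytes than requested, and BufferedInputStream in particular can return a short read when a mark() forces it to preserve already-buffered bytes. The readFully helper below is a hypothetical sketch, not the Hadoop code:

  import java.io.ByteArrayInputStream;
  import java.io.IOException;
  import java.io.InputStream;

  public class ReadLoop {
      // Read exactly len bytes by looping, since read() is only
      // required to return at least 1 byte (or -1 at end of stream),
      // never the full len.
      static int readFully(InputStream in, byte[] buf, int off, int len)
              throws IOException {
          int total = 0;
          while (total < len) {
              int n = in.read(buf, off + total, len - total);
              if (n < 0) {
                  break; // end of stream
              }
              total += n;
          }
          return total;
      }

      public static void main(String[] args) throws IOException {
          InputStream in = new ByteArrayInputStream(new byte[1024]);
          byte[] buf = new byte[512];
          System.out.println(readFully(in, buf, 0, buf.length));
      }
  }
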

You might find HADOOP-1470 and HADOOP-1134 (last 4-6 comments) relevant.

Raghu.

Bwolen Yang wrote:
I am taking the values at runtime (I get them through exceptions when the
result is 0, and print out the values).

The \r\n problem was observed on the 0.13.0 release.
To study the behavior, I instrumented the Hadoop source from the head of the tree.

More specifically, attached are two sample stacks. (I have readBuffer
throw when it gets 0 bytes, and have inputChecker catch the exception
and rethrow; this way, I capture the values from both caller and
callee.)
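
Sketch of that instrumentation, with illustrative names (readBuffer and readChecked here stand in for the real callee and caller):

  import java.io.IOException;

  public class Instrumented {
      // Callee: throw when the underlying read returns 0 bytes,
      // carrying its own arguments in the message.
      static int readBuffer(byte[] buf, int off, int len) throws IOException {
          int bytesRead = 0; // pretend the underlying read returned 0
          if (bytesRead == 0) {
              throw new IOException("readBuffer: off=" + off + " len=" + len);
          }
          return bytesRead;
      }

      // Caller: catch and rethrow with its own arguments attached, so
      // one stack trace shows the values from both frames.
      static int readChecked(byte[] buf, int off, int len) throws IOException {
          try {
              return readBuffer(buf, off, len);
          } catch (IOException e) {
              throw new IOException("inputChecker: off=" + off + " len=" + len
                      + " (" + e.getMessage() + ")", e);
          }
      }

      public static void main(String[] args) {
          try {
              readChecked(new byte[4096], 0, 4096);
          } catch (IOException e) {
              e.printStackTrace(); // both caller and callee values appear here
          }
      }
  }
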

On a separate note, if the assumption (len >= bytesPerSum) exists, would
it be OK to throw an exception when it is violated? Most of the time
(e.g., in crawl/indexing), people won't notice that some part of the
input data is getting thrown away. It would be a lot easier to debug as
the code changes (and the assumption gets violated), and the cost in
this case is probably not too bad, as a good part of the cost is in the
network and in going to disk. A sketch of what I mean is below.
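
Something along these lines; the class, method, and bytesPerSum value are illustrative, not the actual ChecksumFileSystem code:

  import java.io.IOException;

  public class ChecksumReadSketch {
      private final int bytesPerSum = 512; // illustrative value

      // Fail loudly instead of silently returning a short or zero-byte
      // read when the assumed invariant is violated.
      int readBuffer(byte[] buf, int off, int len) throws IOException {
          if (len < bytesPerSum) {
              throw new IOException("read length " + len
                  + " violates assumed invariant len >= bytesPerSum ("
                  + bytesPerSum + ")");
          }
          // ... the normal checksum-verified read would go here ...
          return len;
      }

      public static void main(String[] args) throws IOException {
          new ChecksumReadSketch().readBuffer(new byte[256], 0, 256); // throws
      }
  }
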
