Hi Bwolen,
Thanks for investigating this.
You should file a Jira on this (with your traces etc.). Let me know if
you want me to file it.
> and I think this is precisely why, when you have mark()/reset() (as in
> the \r\n case), the read can be small. I can track down Java's
> BufferedInputStream code to check for details, but it seems pretty
> clear from the actual code execution stack.
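
Here is a minimal, self-contained demo of that effect (my own sketch,
not Hadoop code; QuietStream mimics a socket-like stream whose
available() returns 0, and the buffer sizes are made up for
illustration):

    import java.io.BufferedInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    // Minimal demo: with no mark set, a request larger than the internal
    // buffer is read directly into the caller's array; once mark() is
    // called, data must be staged through the internal buffer so reset()
    // can replay it, and the read comes back short.
    public class ShortReadDemo {

        // Serves endless zero bytes but reports available() == 0, which
        // stops BufferedInputStream's internal retry loop after one pass
        // (as a socket with no buffered data would).
        static class QuietStream extends InputStream {
            @Override public int read() { return 0; }
            @Override public int read(byte[] b, int off, int len) {
                java.util.Arrays.fill(b, off, off + len, (byte) 0);
                return len;   // always satisfies the full request
            }
            @Override public int available() { return 0; }
        }

        public static void main(String[] args) throws IOException {
            // Small internal buffer (512 bytes) to make the effect obvious.
            BufferedInputStream in =
                new BufferedInputStream(new QuietStream(), 512);
            byte[] dest = new byte[8192];

            // Prints 8192: no mark, so the internal buffer is bypassed.
            System.out.println("no mark:   " + in.read(dest, 0, dest.length));

            in.mark(4096);
            // Prints 512: the read is capped by the internal buffer size.
            System.out.println("with mark: " + in.read(dest, 0, dest.length));
        }
    }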
As you noted, this could be caused by "marking" the input stream. I think
Hadoop's dependence on the length of reads returned by
BufferedInputStream should be fixed. Alternatively, we could make this
stream non-markable (we would still be depending on BufferedInputStream
behavior that is not part of its contract).
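
Making the stream non-markable could look something like the sketch
below (NonMarkableStream is a hypothetical name, not an existing Hadoop
class):

    import java.io.FilterInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    // Hypothetical sketch: refuse mark/reset so callers cannot force
    // reads to be staged through an internal buffer.
    class NonMarkableStream extends FilterInputStream {
        NonMarkableStream(InputStream in) { super(in); }

        @Override public boolean markSupported() { return false; }

        // FilterInputStream would otherwise delegate these to the
        // wrapped stream, so override them explicitly.
        @Override public synchronized void mark(int readlimit) { /* no-op */ }

        @Override public synchronized void reset() throws IOException {
            throw new IOException("mark/reset not supported");
        }
    }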
You might find HADOOP-1470 and HADOOP-1134 (last 4-6 comments) relevant.
Raghu.
Bwolen Yang wrote:
> taking values at runtime (I have it through exceptions when the
> result is 0, and print out the values).
>
> The \r\n problem was observed on the 0.13.0 release.
>
> To study the behavior, I instrumented the Hadoop source from the head
> of the tree. More specifically, attached are two sample stacks. (I
> have readBuffer throw when it gets 0 bytes, and have the input checker
> catch the exception and rethrow it. This way, I catch the values from
> both caller and callee.)
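
The instrumentation described above amounts to something like this
sketch (method names are illustrative, not the actual Hadoop ones):

    import java.io.IOException;

    // Sketch: the callee throws when a read returns 0 bytes, and the
    // caller catches and rethrows with its own arguments attached, so a
    // single stack trace records the values seen on both sides.
    class InstrumentedRead {
        // Stand-in for the low-level read; returns 0 to trip the check.
        private int doRead(byte[] buf, int off, int len) {
            return 0;
        }

        int readBuffer(byte[] buf, int off, int len) throws IOException {
            int n = doRead(buf, off, len);
            if (n == 0) {
                // callee side: record the values it was called with
                throw new IOException("readBuffer got 0 bytes: off=" + off
                    + " len=" + len);
            }
            return n;
        }

        int checksummedRead(byte[] buf, int off, int len) throws IOException {
            try {
                return readBuffer(buf, off, len);
            } catch (IOException e) {
                // caller side: rethrow with its own view of the arguments
                throw new IOException("input checker: off=" + off
                    + " len=" + len, e);
            }
        }
    }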
> On a separate note, if the assumption (len >= bytesPerSum) exists,
> would it be OK to throw an exception when it is violated? Most of the
> time (e.g., in crawl/indexing), people won't notice that some part of
> the input data is getting thrown away. An exception would make this a
> lot easier to debug as the code changes (and the assumption gets
> violated), and the cost in this case is probably not too bad, since a
> good part of the cost is in the network and in going to disk.
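
The check proposed here could be as simple as the following sketch
(names are hypothetical, not the actual Hadoop code):

    import java.io.IOException;

    // Hypothetical sketch of the proposed guard: fail loudly instead of
    // silently returning 0 and letting callers drop input.
    class ChecksumReadGuard {
        private final int bytesPerSum;

        ChecksumReadGuard(int bytesPerSum) {
            this.bytesPerSum = bytesPerSum;
        }

        // Call at the top of the checksummed read path.
        void checkReadLen(int len) throws IOException {
            if (len < bytesPerSum) {
                throw new IOException("read length " + len
                    + " < bytesPerSum " + bytesPerSum
                    + "; caller would silently lose data");
            }
        }
    }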