jay vyas created MAPREDUCE-5511:
-----------------------------------
Summary: Multifilewc and the mapred.* API: Is the use of getPos() valid?
Key: MAPREDUCE-5511
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5511
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: examples
Reporter: jay vyas
Priority: Minor
The MultiFileWordCount class in the Hadoop examples library uses a record
reader that switches between files. This behaviour can break
RawLocalFileSystem in a concurrent environment because of the way buffering
works: in RawLocalFileSystem, switching between streams leaves the inner
stream temporarily null, and the getPos() implementation in the custom
RecordReader for MultiFileWordCount calls that inner stream.
There are basically two ways to handle this:
1) Wrap the getPos() implementation in the object returned by open() in
RawLocalFileSystem so that it caches the value of getPos() every time it is
called; calls to getPos() can then return a valid long even if the underlying
stream is null. OR
2) Update the RecordReader in multifilewc so that it does not rely on the inner
input stream, and instead caches the position (or returns 0) if the stream
cannot return a valid value.
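To make the caching idea concrete, here is a minimal sketch of a reader that
remembers the last position it saw and falls back to it (initially 0) while no
stream is attached. It does not use the Hadoop classes: PositionedStream,
MultiFileReader and switchTo() are hypothetical names invented for this
illustration, and the sketch does not address thread safety.

```java
// Sketch only: none of these types are Hadoop classes.
class CachedPositionSketch {

    /**
     * Hypothetical stand-in for a stream that can report its offset.
     * Hadoop's real getPos() methods declare IOException; that is
     * omitted here to keep the example short.
     */
    interface PositionedStream {
        long getPos();
    }

    /** Hypothetical reader that may have no current stream while switching files. */
    static class MultiFileReader {
        private PositionedStream current;  // null between files
        private long lastKnownPos = 0L;    // returned while current is null

        /** Attach the next stream, or pass null to model the gap between files. */
        void switchTo(PositionedStream next) {
            current = next;
        }

        /** Refresh the cached value when a stream is attached, then return the cache. */
        long getPos() {
            if (current != null) {
                lastKnownPos = current.getPos();
            }
            return lastKnownPos;
        }
    }

    public static void main(String[] args) {
        MultiFileReader reader = new MultiFileReader();
        System.out.println(reader.getPos()); // no stream yet: prints 0

        reader.switchTo(() -> 42L);          // a stream reporting offset 42
        System.out.println(reader.getPos()); // prints 42

        reader.switchTo(null);               // stream temporarily unavailable
        System.out.println(reader.getPos()); // cached value: prints 42
    }
}
```

Whether returning a cached or zero position is actually correct depends on
what callers expect from getPos(), which is the open question below.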
The final question here is: is the RecordReader for MultiFileWordCount doing
the right thing, or is it breaking the contract of getPos()? And really,
what SHOULD getPos() return if the underlying stream has already been consumed?