[ https://issues.apache.org/jira/browse/MAPREDUCE-5511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
jay vyas updated MAPREDUCE-5511:
--------------------------------
Affects Version/s: 1.0.0, 1.2.0
> Multifilewc and the mapred.* API: Is the use of getPos() valid?
> ----------------------------------------------------------------
>
> Key: MAPREDUCE-5511
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5511
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: examples
> Affects Versions: 1.0.0, 1.2.0
> Reporter: jay vyas
> Priority: Minor
>
> The MultiFileWordCount class in the Hadoop examples library uses a record
> reader that switches between files. This behaviour can break
> RawLocalFileSystem in a concurrent environment because of the way buffering
> works: in RawLocalFileSystem, switching between streams leaves the inner
> stream temporarily null, and the getPos() implementation in
> MultiFileWordCount's custom RecordReader calls that inner stream.
> There are basically two ways to handle this:
> 1) Wrap the getPos() implementation in the object returned by open() in
> RawLocalFileSystem so that it caches the value of getPos() every time it is
> called; calls to getPos() can then return a valid long even if the
> underlying stream is null. OR
> 2) Update the RecordReader in multifilewc so that it does not rely on the
> inner input stream, and instead caches the position, or returns 0 if the
> stream cannot return a valid value.
> The final question here is: Is the RecordReader for MultiFileWordCount doing
> the right thing? Or is it breaking the contract of getPos()? And what
> SHOULD getPos() return if the underlying stream has already been consumed?
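
For illustration only, here is a minimal sketch of the caching idea that both options share: remember the last offset that was read successfully and return it while the inner stream is null. It does not use the Hadoop classes; PositionSource, CachedPosition and GetPosCacheSketch are hypothetical names invented for this sketch, and whether returning a cached value satisfies the getPos() contract is exactly the open question in this issue.

```java
import java.io.IOException;

/** Hypothetical stand-in for a stream that can report its byte offset. */
interface PositionSource {
    long getPos() throws IOException;
}

/** Remembers the last offset successfully read from the current source. */
final class CachedPosition {
    private PositionSource current;   // null while the reader is between files
    private long lastKnownPos = 0L;   // returned when no source is available

    /** Switch to a new source, or pass null when the stream goes away. */
    synchronized void setSource(PositionSource source) throws IOException {
        if (current != null) {
            // Capture the final offset before dropping the old source.
            lastKnownPos = current.getPos();
        }
        current = source;
    }

    /** Returns the live offset if a source exists, else the cached one. */
    synchronized long getPos() throws IOException {
        if (current != null) {
            lastKnownPos = current.getPos();
        }
        return lastKnownPos;
    }
}

final class GetPosCacheSketch {
    public static void main(String[] args) throws IOException {
        CachedPosition pos = new CachedPosition();
        System.out.println(pos.getPos()); // prints 0: no stream opened yet
        pos.setSource(() -> 42L);
        System.out.println(pos.getPos()); // prints 42: live value from the source
        pos.setSource(null);
        System.out.println(pos.getPos()); // prints 42: cached value, source is null
    }
}
```

The methods are synchronized because the failure described above occurs in a concurrent environment; a real fix would also have to decide whether the cached value should be per file or cumulative across files.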