[ https://issues.apache.org/jira/browse/MAPREDUCE-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13047284#comment-13047284 ]
Robert Joseph Evans commented on MAPREDUCE-2583:
------------------------------------------------
I am not really sure what you are getting at here. I have tried this sort of
thing before by writing small statistics files as out-of-band files and reading
them back in the next map/reduce as part of the distributed cache, but it did
not turn out very well. Even with the distributed cache, if you are scaling up
to 100s of mappers/reducers, reading many small files will put a lot of load on
the name node. If it really is a requirement, it is best to post-process the
files, turning them into a single highly replicated file before passing it off
to the next phase.
If you are turning this into a formal Map/Reduce feature, then you probably want
to do this compaction in the cleanup task, and have some sort of size limit on
how much data can flow through this.
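
A minimal sketch of the compaction step, using only the local filesystem
(java.nio) so that it runs without a cluster. The class name StatsCompactor,
the method compact, and the maxBytes parameter are hypothetical names chosen
for illustration; they are not part of Hadoop. On HDFS the same idea would
presumably be followed by raising the replication of the merged file (for
example with FileSystem.setReplication) and registering it with
DistributedCache.addCacheFile for the next job, neither of which is shown here.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: merges many small per-task statistics files into one
// file, refusing to do so if the combined size exceeds a configured limit.
final class StatsCompactor {

    // Concatenates the part files (in sorted path order) into target.
    // Returns the number of bytes written. Throws IOException before
    // creating target if the total size is larger than maxBytes.
    static long compact(List<Path> parts, Path target, long maxBytes) throws IOException {
        List<Path> sorted = new ArrayList<>(parts);
        Collections.sort(sorted); // deterministic order, e.g. part-00000, part-00001

        // Enforce the size limit up front so oversized side data never
        // reaches the next phase.
        long total = 0;
        for (Path p : sorted) {
            total += Files.size(p);
        }
        if (total > maxBytes) {
            throw new IOException("side data is " + total + " bytes, limit is " + maxBytes);
        }

        // One output file means one lookup per downstream task instead of
        // one lookup per small file.
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path p : sorted) {
                Files.copy(p, out);
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("stats");
        Path a = Files.write(dir.resolve("part-00000"), "k1\t3\n".getBytes(StandardCharsets.UTF_8));
        Path b = Files.write(dir.resolve("part-00001"), "k2\t7\n".getBytes(StandardCharsets.UTF_8));
        long written = compact(Arrays.asList(b, a), dir.resolve("stats-merged"), 1024L);
        System.out.println("merged bytes: " + written);
    }
}
```

Checking the limit before writing keeps the failure cheap: a cleanup task using
this approach can fail the job (or skip the side data) without leaving a
partial merged file behind.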
> DistributedCache for M-R chains
> -------------------------------
>
> Key: MAPREDUCE-2583
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2583
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Reporter: Mitch McCuiston
>
> Currently the DistributedCache appears to be created at the granularity of a
> job. In the case of a M-R chain, it is sometimes useful to share information
> out-of-band (as small files in hdfs) with each task in the chain. For
> instance, the first M-R phase within a two-phase M-R chain might produce
> useful statistics that could be used to configure the second phase.