[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13047284#comment-13047284
 ] 

Robert Joseph Evans commented on MAPREDUCE-2583:
------------------------------------------------

I am not really sure what you are getting at here.  I have tried this sort of 
thing before by writing small statistics files as out-of-band files and reading 
them back in the next map/reduce as part of the distributed cache, but it did 
not turn out very well.  Even with the distributed cache, once you scale up 
to 100s of mappers/reducers this approach puts a lot of load on the name node.  
If it really is a requirement, it is best to post-process the files, turning 
them into a single highly replicated file, before passing them off to the next 
phase.

If you are turning this into a formal Map/Reduce feature, then you probably 
want to do this compaction in the cleanup task, and to put some sort of size 
limit on how much data can flow through it.
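
As a rough sketch of what the compaction step with a size limit might look 
like: the code below merges every small side file in a directory into one 
output file and refuses to do so if the combined size is over a cap.  It uses 
plain java.nio on a local directory as a stand-in for HDFS.  The names 
SideFileCompactor, compact and maxBytes are made up for illustration and are 
not part of Hadoop.  On a real cluster the same idea would go through the 
Hadoop FileSystem API, with the replication of the merged file raised 
afterwards.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: merges per-task side files into one file, with a size cap. */
public class SideFileCompactor {

    /**
     * Concatenates every regular file directly under sideDir into target,
     * in sorted path order. The target must live outside sideDir.
     * Throws before writing anything if the combined size exceeds maxBytes.
     * Returns the number of bytes merged.
     */
    public static long compact(Path sideDir, Path target, long maxBytes) throws IOException {
        List<Path> parts = new ArrayList<>();
        long total = 0;
        // First pass: collect the side files and add up their sizes.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(sideDir)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    parts.add(p);
                    total += Files.size(p);
                }
            }
        }
        // Enforce the cap on how much data may flow through the side channel.
        if (total > maxBytes) {
            throw new IOException("side data is " + total + " bytes, limit is " + maxBytes);
        }
        // Sort so the merged file has a deterministic layout.
        Collections.sort(parts);
        // Second pass: append each side file to the single merged file.
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path p : parts) {
                Files.copy(p, out);
            }
        }
        return total;
    }
}
```

The next phase would then put only the merged file in its distributed cache, 
so each task opens one file instead of one per upstream task.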

> DistributedCache for M-R chains
> -------------------------------
>
>                 Key: MAPREDUCE-2583
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2583
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Mitch McCuiston
>
> Currently the DistributedCache appears to be created at the granularity of a 
> job.  In the case of an M-R chain, it is sometimes useful to share information 
> out-of-band (as small files in HDFS) with each task in the chain.  For 
> instance, the first M-R phase within a two-phase M-R chain might produce 
> useful statistics that could be used to configure the second phase.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
