[
https://issues.apache.org/jira/browse/MAPREDUCE-264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Harsh J resolved MAPREDUCE-264.
-------------------------------
Resolution: Won't Fix
This discussion has gone stale. Closing out.
For the last comment, perhaps the fact that Combiners run on reduce (merge)
side today as well, may help.
> When combiners exist, postpone mappers' spills of map output to disk until
> combiners are unsuccessful.
> ------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-264
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-264
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Reporter: Dick King
> Assignee: Owen O'Malley
> Priority: Critical
>
> When a map/reduce job is set up with a combiner, the mapper tasks each build
> up an in-heap collection of 100K key/value pairs -- and then apply the
> combiner to reduce that to whatever it becomes by applying the combiner to
> sets with like keys before spilling to disk to send it to the reducers.
> Typically running the combiner consumes a lot less resources than shipping
> the data, especially since the data end up in a reducer where probably the
> same code will be run anyway.
> I would like to see this changed so that when the combiner shrinks the 100K
> key/value pairs to less than, say, 90K, we just keep running the mapper and
> combiner alternately until we get enough distinct keys to make this unlikely
> to be worthwhile [or until we run out of input, of course].
> This has two costs: the whole internal buffer has to be re-sorted so we can
> apply the combiner even though as few as 10K new elements have been added,
> and in some cases we'll call the combiner on many singletons.
> The first of these costs can be avoided by doing a mini-sort in the new pairs
> section and doing a merge to develop the combiner sets and the new sorted
> retained elements section.
> The second of these costs can be avoided by detecting what would otherwise be
> singleton combiner calls and not making them, which is a good idea in itself
> even if we don't decide to do this reform.
> The two techniques combine well; recycled elements of the buffer need not be
> combined if there's no new element with the same key.
> -dk
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira