[ http://issues.apache.org/jira/browse/HADOOP-363?page=comments#action_12421184 ]

eric baldeschwieler commented on HADOOP-363:
--------------------------------------------

Sounds like overkill to me. Do we have a concrete example where this would make a big performance difference today?

I'd suggest a simpler rule of thumb: if the combined output is less than a quarter of the allowable RAM/keys, keep it and recombine it, assuming you get at least a quarter of a buffer full of new keys; otherwise spill it. This should catch the important cases and avoid silly corner cases.

But unless you are tracking RAM usage, this is a bad idea, because combining might not be reducing the RAM footprint, and we don't want to keep combining until the VM blows up or thrashes. The 100K magic number sounds really dangerous to begin with; we need to be tracking RAM.
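As a minimal sketch of that rule of thumb (the names and the "expected new pairs" estimate are made up for illustration; none of this is taken from MapTask):

    // Hypothetical decision rule for the quarter-buffer heuristic above.
    static boolean keepAndRecombine(int combinedPairs,
                                    long expectedNewPairs,
                                    int maxBufferedPairs) {
      boolean combinerPaidOff = combinedPairs < maxBufferedPairs / 4;
      // "assuming you get at least a quarter buffer full of new keys":
      // expectedNewPairs is some estimate of how much map output is still coming.
      boolean worthAnotherPass = expectedNewPairs >= maxBufferedPairs / 4;
      // Keep the combined output and recombine later only when combining paid
      // off and the next combine amortizes over a decent batch of new keys;
      // otherwise spill to disk now.
      return combinerPaidOff && worthAnotherPass;
    }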
> When combiners exist, postpone mappers' spills of map output to disk until
> combiners are unsuccessful.
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-363
>                 URL: http://issues.apache.org/jira/browse/HADOOP-363
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Dick King
>
> When a map/reduce job is set up with a combiner, the mapper tasks each build
> up an in-heap collection of 100K key/value pairs, then apply the combiner to
> sets with like keys to shrink that collection before spilling it to disk to
> send it to the reducers. Typically, running the combiner consumes far fewer
> resources than shipping the data, especially since the data end up in a
> reducer where probably the same code will be run anyway.
>
> I would like to see this changed so that when the combiner shrinks the 100K
> key/value pairs to less than, say, 90K, we just keep running the mapper and
> combiner alternately until we accumulate enough distinct keys to make further
> combining unlikely to be worthwhile [or until we run out of input, of course].
>
> This has two costs: the whole internal buffer has to be re-sorted so we can
> apply the combiner, even though as few as 10K new elements have been added,
> and in some cases we'll call the combiner on many singletons.
>
> The first of these costs can be avoided by doing a mini-sort of just the
> new-pairs section and then merging it with the retained elements to develop
> the combiner sets and the new sorted retained section.
>
> The second of these costs can be avoided by detecting what would otherwise be
> singleton combiner calls and not making them, which is a good idea in itself
> even if we don't adopt this reform.
>
> The two techniques combine well: recycled elements of the buffer need not be
> combined if there's no new element with the same key.
>
> -dk
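Below is a toy, self-contained model of the control flow the description proposes: fill the buffer from the mapper, combine, and spill only when the combined result is still above the shrink threshold. The Pair record, the word-count-style combiner, and the synthetic input are all illustrative, not Hadoop's MapTask code.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;
    import java.util.TreeMap;

    // Toy model of the proposal: fill a 100K-pair buffer from the mapper,
    // run the combiner over it, and spill only when the combined result is
    // still above the shrink threshold.
    public class CombineUntilUnsuccessful {
      static final int BUFFER_CAPACITY = 100_000;  // the 100K-pair buffer
      static final int SHRINK_THRESHOLD = 90_000;  // the "say, 90K" cutoff

      record Pair(String key, long value) {}

      public static void main(String[] args) {
        Random rng = new Random(42);
        List<Pair> buffer = new ArrayList<>();
        int spills = 0;

        for (int rec = 0; rec < 2_000_000; rec++) {
          // "Run the mapper": each synthetic record emits one (key, 1) pair
          // drawn from a key space small enough that combining shrinks it a lot.
          buffer.add(new Pair("key-" + rng.nextInt(50_000), 1));

          if (buffer.size() >= BUFFER_CAPACITY) {
            buffer = combine(buffer);              // sum values of like keys
            if (buffer.size() >= SHRINK_THRESHOLD) {
              spills++;                            // combiner unsuccessful: spill
              buffer.clear();
            }
            // else: keep the combined pairs and let the mapper refill the buffer
          }
        }
        if (!buffer.isEmpty()) spills++;           // final spill at end of input
        System.out.println("spills: " + spills);   // 1 here, vs 20 without combining
      }

      // The combiner: group like keys and sum their values.
      static List<Pair> combine(List<Pair> pairs) {
        TreeMap<String, Long> sums = new TreeMap<>();
        for (Pair p : pairs) sums.merge(p.key(), p.value(), Long::sum);
        List<Pair> out = new ArrayList<>(sums.size());
        sums.forEach((k, v) -> out.add(new Pair(k, v)));
        return out;
      }
    }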
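And a sketch of the two cost-avoidance techniques from the last three quoted paragraphs, written as methods that could be added to the class above (plus an import of java.util.Comparator): mini-sort only the new-pairs section, merge it with the already-sorted retained section, and call the combiner only on key groups that contain at least one new pair, so recycled-only groups and would-be singleton calls pass through untouched.

    // Mini-sort + merge, reusing the Pair record from the sketch above.
    static List<Pair> mergeAndCombine(List<Pair> retainedSorted, List<Pair> newPairs) {
      newPairs.sort(Comparator.comparing(Pair::key));   // mini-sort: new pairs only
      List<Pair> merged = new ArrayList<>(retainedSorted.size() + newPairs.size());
      int i = 0, j = 0;
      while (i < retainedSorted.size() || j < newPairs.size()) {
        // Gather one key group by merging the two sorted runs.
        String key = smallerHeadKey(retainedSorted, i, newPairs, j);
        List<Pair> group = new ArrayList<>();
        boolean hasNewPair = false;
        while (i < retainedSorted.size() && retainedSorted.get(i).key().equals(key))
          group.add(retainedSorted.get(i++));
        while (j < newPairs.size() && newPairs.get(j).key().equals(key)) {
          group.add(newPairs.get(j++));
          hasNewPair = true;
        }
        if (hasNewPair && group.size() > 1) {
          merged.add(runCombiner(key, group));  // only groups worth combining
        } else {
          merged.addAll(group);                 // skip would-be singleton calls
        }
      }
      return merged;                            // the new sorted retained section
    }

    static String smallerHeadKey(List<Pair> a, int i, List<Pair> b, int j) {
      if (i >= a.size()) return b.get(j).key();
      if (j >= b.size()) return a.get(i).key();
      return a.get(i).key().compareTo(b.get(j).key()) <= 0
          ? a.get(i).key() : b.get(j).key();
    }

    static Pair runCombiner(String key, List<Pair> group) {
      long sum = 0;
      for (Pair p : group) sum += p.value();    // word-count-style combiner
      return new Pair(key, sum);
    }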
