[ http://issues.apache.org/jira/browse/HADOOP-363?page=comments#action_12421184 ]

eric baldeschwieler commented on HADOOP-363:
--------------------------------------------

Sounds like overkill to me. Do we have a concrete example where this would make a big performance difference today?

I'd suggest a simpler rule of thumb: if the combined output is less than a quarter of the allowable RAM/keys, keep it and recombine it, assuming you get at least a quarter of a buffer full of new keys; otherwise spill it. This should catch the important cases and avoid silly corner cases.

But unless you are tracking RAM usage, this is a bad idea, because combining might not be reducing the RAM footprint, and we don't want to keep combining until the VM blows up or thrashes. The 100K magic number sounds really dangerous to begin with; we need to be tracking RAM.
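As a minimal sketch of that rule of thumb (the names and the "expected new pairs" estimate are made up for illustration; none of this is taken from MapTask):

    // Hypothetical decision rule for the quarter-buffer heuristic above.
    static boolean keepAndRecombine(int combinedPairs,
                                    long expectedNewPairs,
                                    int maxBufferedPairs) {
      boolean combinerPaidOff = combinedPairs < maxBufferedPairs / 4;
      // "assuming you get at least a quarter buffer full of new keys":
      // expectedNewPairs is some estimate of how much map output is still coming.
      boolean worthAnotherPass = expectedNewPairs >= maxBufferedPairs / 4;
      // Keep the combined output and recombine later only when combining paid
      // off and the next combine amortizes over a decent batch of new keys;
      // otherwise spill to disk now.
      return combinerPaidOff && worthAnotherPass;
    }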
> When combiners exist, postpone mappers' spills of map output to disk until
> combiners are unsuccessful.
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-363
>                 URL: http://issues.apache.org/jira/browse/HADOOP-363
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Dick King
>
> When a map/reduce job is set up with a combiner, the mapper tasks each build
> up an in-heap collection of 100K key/value pairs, then apply the combiner to
> sets with like keys to shrink that collection before spilling it to disk to
> send it to the reducers. Typically, running the combiner consumes far fewer
> resources than shipping the data, especially since the data end up in a
> reducer where probably the same code will be run anyway.
>
> I would like to see this changed so that when the combiner shrinks the 100K
> key/value pairs to less than, say, 90K, we just keep running the mapper and
> combiner alternately until we accumulate enough distinct keys to make further
> combining unlikely to be worthwhile [or until we run out of input, of course].
>
> This has two costs: the whole internal buffer has to be re-sorted so we can
> apply the combiner, even though as few as 10K new elements have been added,
> and in some cases we'll call the combiner on many singletons.
>
> The first of these costs can be avoided by doing a mini-sort of just the
> new-pairs section and then merging it with the retained elements to develop
> the combiner sets and the new sorted retained section.
>
> The second of these costs can be avoided by detecting what would otherwise be
> singleton combiner calls and not making them, which is a good idea in itself
> even if we don't adopt this reform.
>
> The two techniques combine well: recycled elements of the buffer need not be
> combined if there's no new element with the same key.
>
> -dk
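Below is a toy, self-contained model of the control flow the description proposes: fill the buffer from the mapper, combine, and spill only when the combined result is still above the shrink threshold. The Pair record, the word-count-style combiner, and the synthetic input are all illustrative, not Hadoop's MapTask code.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;
    import java.util.TreeMap;

    // Toy model of the proposal: fill a 100K-pair buffer from the mapper,
    // run the combiner over it, and spill only when the combined result is
    // still above the shrink threshold.
    public class CombineUntilUnsuccessful {
      static final int BUFFER_CAPACITY = 100_000;  // the 100K-pair buffer
      static final int SHRINK_THRESHOLD = 90_000;  // the "say, 90K" cutoff

      record Pair(String key, long value) {}

      public static void main(String[] args) {
        Random rng = new Random(42);
        List<Pair> buffer = new ArrayList<>();
        int spills = 0;

        for (int rec = 0; rec < 2_000_000; rec++) {
          // "Run the mapper": each synthetic record emits one (key, 1) pair
          // drawn from a key space small enough that combining shrinks it a lot.
          buffer.add(new Pair("key-" + rng.nextInt(50_000), 1));

          if (buffer.size() >= BUFFER_CAPACITY) {
            buffer = combine(buffer);              // sum values of like keys
            if (buffer.size() >= SHRINK_THRESHOLD) {
              spills++;                            // combiner unsuccessful: spill
              buffer.clear();
            }
            // else: keep the combined pairs and let the mapper refill the buffer
          }
        }
        if (!buffer.isEmpty()) spills++;           // final spill at end of input
        System.out.println("spills: " + spills);   // 1 here, vs 20 without combining
      }

      // The combiner: group like keys and sum their values.
      static List<Pair> combine(List<Pair> pairs) {
        TreeMap<String, Long> sums = new TreeMap<>();
        for (Pair p : pairs) sums.merge(p.key(), p.value(), Long::sum);
        List<Pair> out = new ArrayList<>(sums.size());
        sums.forEach((k, v) -> out.add(new Pair(k, v)));
        return out;
      }
    }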
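And a sketch of the two cost-avoidance techniques from the last three quoted paragraphs, written as methods that could be added to the class above (plus an import of java.util.Comparator): mini-sort only the new-pairs section, merge it with the already-sorted retained section, and call the combiner only on key groups that contain at least one new pair, so recycled-only groups and would-be singleton calls pass through untouched.

    // Mini-sort + merge, reusing the Pair record from the sketch above.
    static List<Pair> mergeAndCombine(List<Pair> retainedSorted, List<Pair> newPairs) {
      newPairs.sort(Comparator.comparing(Pair::key));   // mini-sort: new pairs only
      List<Pair> merged = new ArrayList<>(retainedSorted.size() + newPairs.size());
      int i = 0, j = 0;
      while (i < retainedSorted.size() || j < newPairs.size()) {
        // Gather one key group by merging the two sorted runs.
        String key = smallerHeadKey(retainedSorted, i, newPairs, j);
        List<Pair> group = new ArrayList<>();
        boolean hasNewPair = false;
        while (i < retainedSorted.size() && retainedSorted.get(i).key().equals(key))
          group.add(retainedSorted.get(i++));
        while (j < newPairs.size() && newPairs.get(j).key().equals(key)) {
          group.add(newPairs.get(j++));
          hasNewPair = true;
        }
        if (hasNewPair && group.size() > 1) {
          merged.add(runCombiner(key, group));  // only groups worth combining
        } else {
          merged.addAll(group);                 // skip would-be singleton calls
        }
      }
      return merged;                            // the new sorted retained section
    }

    static String smallerHeadKey(List<Pair> a, int i, List<Pair> b, int j) {
      if (i >= a.size()) return b.get(j).key();
      if (j >= b.size()) return a.get(i).key();
      return a.get(i).key().compareTo(b.get(j).key()) <= 0
          ? a.get(i).key() : b.get(j).key();
    }

    static Pair runCombiner(String key, List<Pair> group) {
      long sum = 0;
      for (Pair p : group) sum += p.value();    // word-count-style combiner
      return new Pair(key, sum);
    }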
