I'd be interested in actual data showing that this matters.
On Jul 15, 2006, at 10:01 AM, Dick King (JIRA) wrote:
[ http://issues.apache.org/jira/browse/HADOOP-363?page=comments#action_12421313 ]
Dick King commented on HADOOP-363:
----------------------------------
I have to leave soon, but I'll write a quick comment.
By including a combiner at all you're saying that the cost of one
extra deserialization and serialization [to do a little combine
internally rather than a massive one in a reducer] is lower than
the cost of shuffling an extra datum in the big shuffle. Note that
as the output buffer fills with stable items, the benefits of an
early combine decrease, but so do the costs. Only the cost of the
comparisons remains high as the buffer fills.
Having said that, the point about some records' values getting large
as many values get folded in under one key is salient. However,
this problem isn't necessarily mitigated by changing the buffer
residue trigger point. Still, I suppose I can support making the
test relatively modest simply because it's a conservative thing to
do, but this should be a configuration option for those who have
different ideas for a specific job.
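For example, such a knob might be read roughly like this; the
property name and default are made up purely for illustration, and
the job variable is assumed to be the JobConf:

    // Hypothetical property, not something that exists today: how full the
    // buffer may remain after a combine pass before we give up and spill.
    float trigger = job.getFloat("mapred.combine.retrigger.fraction", 0.9f);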
-dk
When combiners exist, postpone mappers' spills of map output to
disk until combiners are unsuccessful.
------------------------------------------------------------------------------------------------------
Key: HADOOP-363
URL: http://issues.apache.org/jira/browse/HADOOP-363
Project: Hadoop
Issue Type: Improvement
Components: mapred
Reporter: Dick King
When a map/reduce job is set up with a combiner, each mapper task
builds up an in-heap collection of 100K key/value pairs, applies
the combiner to sets of pairs with like keys to shrink that
collection, and then spills the result to disk to send it to the
reducers.
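For reference, the current behavior amounts to roughly the
following. This is a toy paraphrase rather than the real MapTask
code, with the combiner modeled as "sum the values per key" for
concreteness:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    // Toy paraphrase of the current spill path; not the real MapTask internals.
    class SpillingBuffer {
      static final int SPILL_THRESHOLD = 100_000;              // record count, not bytes
      final List<Map.Entry<String, Long>> buffer = new ArrayList<>();

      void collect(String key, long value) {
        buffer.add(Map.entry(key, value));
        if (buffer.size() >= SPILL_THRESHOLD) {
          spill();
        }
      }

      void spill() {
        // Sort so equal keys are adjacent, fold them with the "combiner",
        // and write the result out no matter how small it turned out to be.
        Map<String, Long> combined = new TreeMap<>();
        for (Map.Entry<String, Long> e : buffer) {
          combined.merge(e.getKey(), e.getValue(), Long::sum);
        }
        writeToDisk(combined);
        buffer.clear();
      }

      void writeToDisk(Map<String, Long> combined) { /* stand-in for the real spill */ }
    }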
Typically, running the combiner consumes far fewer resources than
shipping the data, especially since the data end up in a reducer
where the same code will probably be run anyway.
I would like to see this changed so that when the combiner shrinks
the 100K key/value pairs to less than, say, 90K, we just keep
running the mapper and combiner alternately until we get enough
distinct keys to make this unlikely to be worthwhile [or until we
run out of input, of course].
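A sketch of that decision, continuing the toy SpillingBuffer above;
the 0.9 fraction stands in for the "say, 90K" of 100K and would
presumably be configurable:

    // Sketch only: spill just when the combiner fails to shrink the buffer enough.
    static final double RETRIGGER_FRACTION = 0.9;              // the "say, 90K" of 100K

    void maybeSpill() {
      Map<String, Long> combined = new TreeMap<>();            // sort + fold like keys
      for (Map.Entry<String, Long> e : buffer) {
        combined.merge(e.getKey(), e.getValue(), Long::sum);
      }
      if (combined.size() < RETRIGGER_FRACTION * SPILL_THRESHOLD) {
        // The combine was "successful": keep the combined pairs in memory and
        // let the mapper keep filling the buffer instead of spilling now.
        buffer.clear();
        for (Map.Entry<String, Long> e : combined.entrySet()) {
          buffer.add(Map.entry(e.getKey(), e.getValue()));
        }
      } else {
        // Too many distinct keys; another combine pass is unlikely to pay off.
        writeToDisk(combined);
        buffer.clear();
      }
    }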
This has two costs: the whole internal buffer has to be re-sorted
so we can apply the combiner even though as few as 10K new
elements have been added, and in some cases we'll call the
combiner on many singletons.
The first of these costs can be avoided by doing a mini-sort of
just the new-pairs section and then merging it with the
already-sorted retained section, which yields both the combiner
sets and the new sorted retained section.
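A sketch of that mini-sort-plus-merge, with illustrative names:
only the roughly 10K new pairs get sorted, then a single linear
merge with the already-sorted retained section produces the buffer
over which the combiner sets (runs of equal keys) are formed:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Sketch: merge an already-sorted retained section with a freshly
    // mini-sorted new-pairs section instead of re-sorting the whole buffer.
    static <T extends Comparable<T>> List<T> mergeSections(List<T> retainedSorted,
                                                           List<T> newPairs) {
      List<T> tail = new ArrayList<>(newPairs);
      Collections.sort(tail);                                  // mini-sort of the new pairs only
      List<T> merged = new ArrayList<>(retainedSorted.size() + tail.size());
      int i = 0, j = 0;
      while (i < retainedSorted.size() && j < tail.size()) {
        if (retainedSorted.get(i).compareTo(tail.get(j)) <= 0) {
          merged.add(retainedSorted.get(i++));
        } else {
          merged.add(tail.get(j++));
        }
      }
      merged.addAll(retainedSorted.subList(i, retainedSorted.size()));
      merged.addAll(tail.subList(j, tail.size()));
      return merged;            // the combiner sets are the runs of equal keys in here
    }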
The second of these costs can be avoided by detecting what would
otherwise be singleton combiner calls and not making them, which
is a good idea in itself even if we don't decide to do this reform.
The two techniques combine well; recycled elements of the buffer
need not be combined if there's no new element with the same key.
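Those two ideas together might look something like this, with the
same toy String/Long key/value types and the combiner again modeled
as a sum: while walking the merged, sorted buffer, a run of length
one is passed through untouched, and the combiner-style fold runs
only on genuine multi-element runs:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Sketch: group the merged, sorted pairs by key and skip the combiner
    // for singletons; a retained record with no new partner passes through.
    static List<Map.Entry<String, Long>> combineSkippingSingletons(
        List<Map.Entry<String, Long>> mergedSorted) {
      List<Map.Entry<String, Long>> out = new ArrayList<>();
      int i = 0;
      while (i < mergedSorted.size()) {
        int j = i + 1;
        while (j < mergedSorted.size()
               && mergedSorted.get(j).getKey().equals(mergedSorted.get(i).getKey())) {
          j++;
        }
        if (j - i == 1) {
          out.add(mergedSorted.get(i));                        // singleton: no combiner call
        } else {
          long sum = 0;                                        // "combiner" modeled as a sum
          for (int k = i; k < j; k++) {
            sum += mergedSorted.get(k).getValue();
          }
          out.add(Map.entry(mergedSorted.get(i).getKey(), sum));
        }
        i = j;
      }
      return out;
    }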
-dk