I'd be interested in actual data showing that this matters.
On Jul 15, 2006, at 10:01 AM, Dick King (JIRA) wrote:
[ http://issues.apache.org/jira/browse/HADOOP-363?page=comments#action_12421313 ]
Dick King commented on HADOOP-363:
----------------------------------
I have to leave soon, but I'll write a quick comment.
By including a combiner at all you're saying that the cost of one
extra deserialization and serialization [to do a little combine
internally rather than a massive one in a reducer] is lower than
the cost of shuffling an extra datum in the big shuffle. Note that
as the output buffer fills with stable items, the benefits of an
early combine decrease, but so do the costs. Only the cost of the
comparisons remains high as the buffer fills.
Having said that, the point about some records' values getting large
as many values get folded in under one key is salient. However,
this problem isn't necessarily mitigated by changing the buffer
residue trigger point. Still, I suppose I can support making the
test relatively modest simply because it's a conservative thing to
do, but this should be a configuration option for those who have
different ideas for a specific job.
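For example, such a knob might be read roughly like this; the
property name and default are made up purely for illustration, and
the job variable is assumed to be the JobConf:

    // Hypothetical property, not something that exists today: how full the
    // buffer may remain after a combine pass before we give up and spill.
    float trigger = job.getFloat("mapred.combine.retrigger.fraction", 0.9f);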
-dk
When combiners exist, postpone mappers' spills of map output to
disk until combiners are unsuccessful.
------------------------------------------------------------------------------------------------------
Key: HADOOP-363
URL: http://issues.apache.org/jira/browse/HADOOP-363
Project: Hadoop
Issue Type: Improvement
Components: mapred
Reporter: Dick King
When a map/reduce job is set up with a combiner, each mapper task
builds up an in-heap collection of 100K key/value pairs, applies
the combiner to sets of pairs with like keys to shrink that
collection, and then spills the result to disk to send it to the
reducers.
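For reference, the current behavior amounts to roughly the
following. This is a toy paraphrase rather than the real MapTask
code, with the combiner modeled as "sum the values per key" for
concreteness:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    // Toy paraphrase of the current spill path; not the real MapTask internals.
    class SpillingBuffer {
      static final int SPILL_THRESHOLD = 100_000;              // record count, not bytes
      final List<Map.Entry<String, Long>> buffer = new ArrayList<>();

      void collect(String key, long value) {
        buffer.add(Map.entry(key, value));
        if (buffer.size() >= SPILL_THRESHOLD) {
          spill();
        }
      }

      void spill() {
        // Sort so equal keys are adjacent, fold them with the "combiner",
        // and write the result out no matter how small it turned out to be.
        Map<String, Long> combined = new TreeMap<>();
        for (Map.Entry<String, Long> e : buffer) {
          combined.merge(e.getKey(), e.getValue(), Long::sum);
        }
        writeToDisk(combined);
        buffer.clear();
      }

      void writeToDisk(Map<String, Long> combined) { /* stand-in for the real spill */ }
    }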
Typically, running the combiner consumes far fewer resources than
shipping the data, especially since the data end up in a reducer
where the same code will probably be run anyway.
I would like to see this changed so that when the combiner shrinks
the 100K key/value pairs to less than, say, 90K, we just keep
running the mapper and combiner alternately until we get enough
distinct keys to make this unlikely to be worthwhile [or until we
run out of input, of course].
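A sketch of that decision, continuing the toy SpillingBuffer above;
the 0.9 fraction stands in for the "say, 90K" of 100K and would
presumably be configurable:

    // Sketch only: spill just when the combiner fails to shrink the buffer enough.
    static final double RETRIGGER_FRACTION = 0.9;              // the "say, 90K" of 100K

    void maybeSpill() {
      Map<String, Long> combined = new TreeMap<>();            // sort + fold like keys
      for (Map.Entry<String, Long> e : buffer) {
        combined.merge(e.getKey(), e.getValue(), Long::sum);
      }
      if (combined.size() < RETRIGGER_FRACTION * SPILL_THRESHOLD) {
        // The combine was "successful": keep the combined pairs in memory and
        // let the mapper keep filling the buffer instead of spilling now.
        buffer.clear();
        for (Map.Entry<String, Long> e : combined.entrySet()) {
          buffer.add(Map.entry(e.getKey(), e.getValue()));
        }
      } else {
        // Too many distinct keys; another combine pass is unlikely to pay off.
        writeToDisk(combined);
        buffer.clear();
      }
    }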
This has two costs: the whole internal buffer has to be re-sorted
so we can apply the combiner even though as few as 10K new
elements have been added, and in some cases we'll call the
combiner on many singletons.
The first of these costs can be avoided by doing a mini-sort of
just the new-pairs section and then merging it with the
already-sorted retained section, which yields both the combiner
sets and the new sorted retained section.
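A sketch of that mini-sort-plus-merge, with illustrative names:
only the roughly 10K new pairs get sorted, then a single linear
merge with the already-sorted retained section produces the buffer
over which the combiner sets (runs of equal keys) are formed:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Sketch: merge an already-sorted retained section with a freshly
    // mini-sorted new-pairs section instead of re-sorting the whole buffer.
    static <T extends Comparable<T>> List<T> mergeSections(List<T> retainedSorted,
                                                           List<T> newPairs) {
      List<T> tail = new ArrayList<>(newPairs);
      Collections.sort(tail);                                  // mini-sort of the new pairs only
      List<T> merged = new ArrayList<>(retainedSorted.size() + tail.size());
      int i = 0, j = 0;
      while (i < retainedSorted.size() && j < tail.size()) {
        if (retainedSorted.get(i).compareTo(tail.get(j)) <= 0) {
          merged.add(retainedSorted.get(i++));
        } else {
          merged.add(tail.get(j++));
        }
      }
      merged.addAll(retainedSorted.subList(i, retainedSorted.size()));
      merged.addAll(tail.subList(j, tail.size()));
      return merged;            // the combiner sets are the runs of equal keys in here
    }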
The second of these costs can be avoided by detecting what would
otherwise be singleton combiner calls and not making them, which
is a good idea in itself even if we don't decide to do this reform.
The two techniques combine well; recycled elements of the buffer
need not be combined if there's no new element with the same key.
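Those two ideas together might look something like this, with the
same toy String/Long key/value types and the combiner again modeled
as a sum: while walking the merged, sorted buffer, a run of length
one is passed through untouched, and the combiner-style fold runs
only on genuine multi-element runs:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Sketch: group the merged, sorted pairs by key and skip the combiner
    // for singletons; a retained record with no new partner passes through.
    static List<Map.Entry<String, Long>> combineSkippingSingletons(
        List<Map.Entry<String, Long>> mergedSorted) {
      List<Map.Entry<String, Long>> out = new ArrayList<>();
      int i = 0;
      while (i < mergedSorted.size()) {
        int j = i + 1;
        while (j < mergedSorted.size()
               && mergedSorted.get(j).getKey().equals(mergedSorted.get(i).getKey())) {
          j++;
        }
        if (j - i == 1) {
          out.add(mergedSorted.get(i));                        // singleton: no combiner call
        } else {
          long sum = 0;                                        // "combiner" modeled as a sum
          for (int k = i; k < j; k++) {
            sum += mergedSorted.get(k).getValue();
          }
          out.add(Map.entry(mergedSorted.get(i).getKey(), sum));
        }
        i = j;
      }
      return out;
    }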
-dk