I agree with the requirement that the key does not change. Of course,
the values can change.
I am primarily worried that the combiner might not be run at all. I
have 'successfully' integrated Hadoop and R, i.e. the user can provide
map/reduce functions written in R. However, R is not great with memory
management, and if I have N (N is huge) values for a given key K, then
R will balk when it comes to processing them.
Hence the combiner. The combiner processes n values for K at a time,
so the reducer ultimately sees only a few values for K. If the combiner
were not to run, R would collapse under the load.
1) I am guaranteed a reducer.
So, regarding your statement that

"The combiner, if defined, will run zero or more times on records
emitted from the map, before being fed to the reduce."

this zero-run possibility worries me. You mention that it occurs when
the collector spills in the map. I have noticed this happening - what
does 'spilling' mean?
Thank you
Saptarshi
On Jan 5, 2009, at 10:22 PM, Chris Douglas wrote:
The combiner, if defined, will run zero or more times on records
emitted from the map, before being fed to the reduce. It is run when
the collector spills in the map and in some merge cases. If the
combiner transforms the key, it is illegal to change its type, the
partition to which it is assigned, or its ordering.
For example, if you emit a record (k,v) from your map and (k',v)
from the combiner, and your comparator is C(K,K) and your partitioner
function is P(K), then it must be the case that P(k) == P(k') and
C(k,k') == 0. If either of these does not hold, the semantics of the
reduce are broken. Clearly, if k is not transformed (as is true for
most combiners), this holds trivially.
As was mentioned earlier, the purpose of the combiner is to compress
data pulled across the network and spilled to disk. It should not
affect the correctness or, in most cases, the output of the job. -C
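To make the key-preservation point concrete, here is a minimal sketch of a
typical combiner, assuming the classic org.apache.hadoop.mapred API (the
class name SumCombiner is illustrative only, not something from Hadoop):

// Minimal sketch of a typical combiner (classic org.apache.hadoop.mapred
// API). The key is passed through unchanged, so P(k) == P(k') and
// C(k,k') == 0 hold trivially, and running it zero, one, or many times
// does not change the final result.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumCombiner extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  public void reduce(Text key, Iterator<LongWritable> values,
                     OutputCollector<Text, LongWritable> output,
                     Reporter reporter) throws IOException {
    long sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();   // fold n partial values into one
    }
    // Same key, same types as the map output; only the value count shrinks.
    output.collect(key, new LongWritable(sum));
  }
}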
On Jan 2, 2009, at 9:57 AM, Saptarshi Guha wrote:
Hello,
I would just like to confirm when the Combiner runs (since it might
not be run at all, see below). I read somewhere that it is run if
there is at least one reduce (which in my case I can be sure of).
I also read that the combiner is an optimization. However, it is also
a chance for a function to transform the key/value (keeping the class
the same, i.e. the combiner semantics are not changed) and deal with a
smaller set (this could be done in the reducer, but the number of
values for a key might be relatively large).
However, I guess it would be a mistake for the reducer to expect its
input to come from a combiner? E.g. if there are only 10 values
corresponding to a key (as output by the mapper), will these 10 values
go straight to the reducer, or to the reducer via the combiner?
Here I am assuming my reduce operation does not need all the values
for a key at once (so that a combiner can be used), i.e. additive
operations.
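For additive operations like this, the combiner is typically the same class
as the reducer. A sketch of the usual driver wiring in the classic JobConf
API, reusing the illustrative SumCombiner above together with Hadoop's
library TokenCountMapper:

// Sketch of the usual driver wiring (classic API). For an additive
// operation the same class can double as combiner and reducer.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

public class SumJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SumJob.class);
    conf.setJobName("sum-example");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);

    conf.setMapperClass(TokenCountMapper.class); // emits (token, 1)
    conf.setCombinerClass(SumCombiner.class);    // may run zero or more times
    conf.setReducerClass(SumCombiner.class);     // always runs when reduces > 0

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}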
Thank you
Saptarshi
On Sun, Nov 16, 2008 at 6:18 PM, Owen O'Malley <[email protected]>
wrote:
The Combiner may be called 0, 1, or many times on each key between the
mapper and reducer. Combiners are just an application-specific
optimization that compresses the intermediate output. They should not
have side effects or transform the types. Unfortunately, since there
isn't a separate interface for Combiners, there isn't a great place to
document this requirement.
I've just filed HADOOP-4668 to improve the documentation.
--
Saptarshi Guha | [email protected] | http://www.stat.purdue.edu/~sguha
The way of the world is to praise dead saints and prosecute live ones.
-- Nathaniel Howe