Re: Combiner run specification and questions

2009-01-07 Thread Saptarshi Guha
. So as long as the correctness of the computation doesn't
 rely on a transformation performed in the combiner, it should be OK. In

Right, i had the same thought.




 However, this restriction limits the scalability of your solution. It might
 be necessary to work around R's limitations by breaking up large
 computations into intermediate steps, possibly by explicitly instantiating
 and running the combiner in the reduce.


So, i explicitly call the combiner? However at times, the reducer
needs all the values so calling the combiner would not always work
here. However, if i recall correctly(from reading the google paper)
one does not **humongous**  number values for a single key


 1) I am guaranteed a reducer.
 So,

 The combiner, if defined, will run zero or more times on records emitted
 from the map, before being fed to the reduce.


 This zero case possibility worries me. However you mention, that it occurs

 collector spills in the map

 I have noticed this happening - what 'spilling' mean?

 Records emitted from the map are serialized into a buffer, which is
 periodically written to disk when it is (sufficiently) full. Each of these
 batch writes is a spill. In casual usage, it refers to any time when
 records need to be written to disk. The merge of intermediate files into the
 final map output and merging in-memory segments to disk in the reduce are
 two examples. -C


Thanks for the explanation.
Regards
Saptarshi


Re: Combiner run specification and questions

2009-01-06 Thread Saptarshi Guha
I agree with the requirement that the key does not change. Of course,  
the values can change.
I am primarily worried that the combiner might not be run at all - I  
have 'successfully' integrated Hadoop and R i.e the user can provide  
map/reduce functions written in R. However, R is not great with memory  
management, and if I have N (N is huge) values for a given key K, then  
R will baulk when it comes to processing this.
Thus the combiner. The combiner will process n values for K, and  
ultimately, a few values for K in the reducer . If the combiner where  
not to run, R would collapse under the load.


1) I am guaranteed a reducer.
So,
The combiner, if defined, will run zero or more times on records  
emitted from the map, before being fed to the reduce.



This zero case possibility worries me. However you mention, that it  
occurs

collector spills in the map


I have noticed this happening - what 'spilling' mean?

Thank you
Saptarshi


On Jan 5, 2009, at 10:22 PM, Chris Douglas wrote:

The combiner, if defined, will run zero or more times on records  
emitted from the map, before being fed to the reduce. It is run when  
the collector spills in the map and in some merge cases. If the  
combiner transforms the key, it is illegal to change its type, the  
partition to which it is assigned, or its ordering.


For example, if you emit a record (k,v) from your map and (k',v)  
from the combiner, your comparator is C(K,K) and your partitioner  
function is P(K), it must be the case that P(k) == P(k') and C(k,k')  
== 0. If either of these does not hold, the semantics to the reduce  
are broken. Clearly, if k is not transformed (as in true for most  
combiners), this holds trivially.


As was mentioned earlier, the purpose of the combiner is to compress  
data pulled across the network and spilled to disk. It should not  
affect the correctness or, in most cases, the output of the job. -C


On Jan 2, 2009, at 9:57 AM, Saptarshi Guha wrote:


Hello,
I would just like to confirm, when does the Combiner run(since it
might not be run at all,see below). I read somewhere that it is run,
if there is at least one reduce (which in my case i can be sure of).
I also read, that the combiner is an optimization. However, it is  
also

a chance for a function to transform the key/value (keeping the class
the same i.e the combiner semantics are not changed) and deal with a
smaller set ( this could be done in the reducer but the number of
values for a key might be relatively large).

However, I guess it would be a mistake for reducer to expect its  
input

coming from a combiner? E.g if there are only 10 value corresponding
to a key(as outputted by the mapper), will these 10 values go  
straight

to the reducer or to the reducer via the combiner?

Here I am assuming my reduce operations does not need all the values
for a key to work(so that a combiner can be used) i.e additive
operations.

Thank you
Saptarshi


On Sun, Nov 16, 2008 at 6:18 PM, Owen O'Malley omal...@apache.org  
wrote:
The Combiner may be called 0, 1, or many times on each key between  
the
mapper and reducer. Combiners are just an application specific  
optimization
that compress the intermediate output. They should not have side  
effects or
transform the types. Unfortunately, since there isn't a separate  
interface
for Combiners, there is isn't a great place to document this  
requirement.

I've just filed HADOOP-4668 to improve the documentation.




--
Saptarshi Guha - saptarshi.g...@gmail.com




Saptarshi Guha | saptarshi.g...@gmail.com | http://www.stat.purdue.edu/~sguha
The way of the world is to praise dead saints and prosecute live ones.
-- Nathaniel Howe



Re: Combiner run specification and questions

2009-01-06 Thread Chris Douglas
I agree with the requirement that the key does not change. Of  
course, the values can change.


Yes. Key attributes not relevant to ordering are also mutable. If your  
key is a (t1, t2) tuple ordered by t1, then t2 can change without  
affecting the expected semantics.


I am primarily worried that the combiner might not be run at all - I  
have 'successfully' integrated Hadoop and R i.e the user can provide  
map/reduce functions written in R. However, R is not great with  
memory management, and if I have N (N is huge) values for a given  
key K, then R will baulk when it comes to processing this.


That sounds really cool. The cases where the combiner is not run are  
currently obscure and future exceptions are unlikely to affect this  
particular case. For example, it doesn't make sense to run the  
combiner for a single record (i.e. there is nothing to combine), for a  
partition with no common keys, etc. So as long as the correctness of  
the computation doesn't rely on a transformation performed in the  
combiner, it should be OK. In general, it shouldn't be- and in the  
future may not be- run where there is little or no compression for it  
to effect, which doesn't exacerbate this case.


However, this restriction limits the scalability of your solution. It  
might be necessary to work around R's limitations by breaking up large  
computations into intermediate steps, possibly by explicitly  
instantiating and running the combiner in the reduce.



1) I am guaranteed a reducer.
So,
The combiner, if defined, will run zero or more times on records  
emitted from the map, before being fed to the reduce.



This zero case possibility worries me. However you mention, that it  
occurs

collector spills in the map


I have noticed this happening - what 'spilling' mean?


Records emitted from the map are serialized into a buffer, which is  
periodically written to disk when it is (sufficiently) full. Each of  
these batch writes is a spill. In casual usage, it refers to any  
time when records need to be written to disk. The merge of  
intermediate files into the final map output and merging in-memory  
segments to disk in the reduce are two examples. -C



Thank you
Saptarshi


On Jan 5, 2009, at 10:22 PM, Chris Douglas wrote:

The combiner, if defined, will run zero or more times on records  
emitted from the map, before being fed to the reduce. It is run  
when the collector spills in the map and in some merge cases. If  
the combiner transforms the key, it is illegal to change its type,  
the partition to which it is assigned, or its ordering.


For example, if you emit a record (k,v) from your map and (k',v)  
from the combiner, your comparator is C(K,K) and your partitioner  
function is P(K), it must be the case that P(k) == P(k') and  
C(k,k') == 0. If either of these does not hold, the semantics to  
the reduce are broken. Clearly, if k is not transformed (as in true  
for most combiners), this holds trivially.


As was mentioned earlier, the purpose of the combiner is to  
compress data pulled across the network and spilled to disk. It  
should not affect the correctness or, in most cases, the output of  
the job. -C


On Jan 2, 2009, at 9:57 AM, Saptarshi Guha wrote:


Hello,
I would just like to confirm, when does the Combiner run(since it
might not be run at all,see below). I read somewhere that it is run,
if there is at least one reduce (which in my case i can be sure of).
I also read, that the combiner is an optimization. However, it is  
also
a chance for a function to transform the key/value (keeping the  
class

the same i.e the combiner semantics are not changed) and deal with a
smaller set ( this could be done in the reducer but the number of
values for a key might be relatively large).

However, I guess it would be a mistake for reducer to expect its  
input

coming from a combiner? E.g if there are only 10 value corresponding
to a key(as outputted by the mapper), will these 10 values go  
straight

to the reducer or to the reducer via the combiner?

Here I am assuming my reduce operations does not need all the values
for a key to work(so that a combiner can be used) i.e additive
operations.

Thank you
Saptarshi


On Sun, Nov 16, 2008 at 6:18 PM, Owen O'Malley  
omal...@apache.org wrote:
The Combiner may be called 0, 1, or many times on each key  
between the
mapper and reducer. Combiners are just an application specific  
optimization
that compress the intermediate output. They should not have side  
effects or
transform the types. Unfortunately, since there isn't a separate  
interface
for Combiners, there is isn't a great place to document this  
requirement.

I've just filed HADOOP-4668 to improve the documentation.




--
Saptarshi Guha - saptarshi.g...@gmail.com




Saptarshi Guha | saptarshi.g...@gmail.com | http://www.stat.purdue.edu/~sguha
The way of the world is to praise dead saints and prosecute live ones.
-- Nathaniel Howe





Re: Combiner run specification and questions

2009-01-05 Thread Chris Douglas
The combiner, if defined, will run zero or more times on records  
emitted from the map, before being fed to the reduce. It is run when  
the collector spills in the map and in some merge cases. If the  
combiner transforms the key, it is illegal to change its type, the  
partition to which it is assigned, or its ordering.


For example, if you emit a record (k,v) from your map and (k',v) from  
the combiner, your comparator is C(K,K) and your partitioner function  
is P(K), it must be the case that P(k) == P(k') and C(k,k') == 0. If  
either of these does not hold, the semantics to the reduce are broken.  
Clearly, if k is not transformed (as in true for most combiners), this  
holds trivially.


As was mentioned earlier, the purpose of the combiner is to compress  
data pulled across the network and spilled to disk. It should not  
affect the correctness or, in most cases, the output of the job. -C


On Jan 2, 2009, at 9:57 AM, Saptarshi Guha wrote:


Hello,
I would just like to confirm, when does the Combiner run(since it
might not be run at all,see below). I read somewhere that it is run,
if there is at least one reduce (which in my case i can be sure of).
I also read, that the combiner is an optimization. However, it is also
a chance for a function to transform the key/value (keeping the class
the same i.e the combiner semantics are not changed) and deal with a
smaller set ( this could be done in the reducer but the number of
values for a key might be relatively large).

However, I guess it would be a mistake for reducer to expect its input
coming from a combiner? E.g if there are only 10 value corresponding
to a key(as outputted by the mapper), will these 10 values go straight
to the reducer or to the reducer via the combiner?

Here I am assuming my reduce operations does not need all the values
for a key to work(so that a combiner can be used) i.e additive
operations.

Thank you
Saptarshi


On Sun, Nov 16, 2008 at 6:18 PM, Owen O'Malley omal...@apache.org  
wrote:
The Combiner may be called 0, 1, or many times on each key between  
the
mapper and reducer. Combiners are just an application specific  
optimization
that compress the intermediate output. They should not have side  
effects or
transform the types. Unfortunately, since there isn't a separate  
interface
for Combiners, there is isn't a great place to document this  
requirement.

I've just filed HADOOP-4668 to improve the documentation.




--
Saptarshi Guha - saptarshi.g...@gmail.com




Combiner run specification and questions

2009-01-02 Thread Saptarshi Guha
Hello,
I would just like to confirm, when does the Combiner run(since it
might not be run at all,see below). I read somewhere that it is run,
if there is at least one reduce (which in my case i can be sure of).
I also read, that the combiner is an optimization. However, it is also
a chance for a function to transform the key/value (keeping the class
the same i.e the combiner semantics are not changed) and deal with a
smaller set ( this could be done in the reducer but the number of
values for a key might be relatively large).

However, I guess it would be a mistake for reducer to expect its input
coming from a combiner? E.g if there are only 10 value corresponding
to a key(as outputted by the mapper), will these 10 values go straight
to the reducer or to the reducer via the combiner?

Here I am assuming my reduce operations does not need all the values
for a key to work(so that a combiner can be used) i.e additive
operations.

Thank you
Saptarshi


On Sun, Nov 16, 2008 at 6:18 PM, Owen O'Malley omal...@apache.org wrote:
 The Combiner may be called 0, 1, or many times on each key between the
 mapper and reducer. Combiners are just an application specific optimization
 that compress the intermediate output. They should not have side effects or
 transform the types. Unfortunately, since there isn't a separate interface
 for Combiners, there is isn't a great place to document this requirement.
 I've just filed HADOOP-4668 to improve the documentation.



-- 
Saptarshi Guha - saptarshi.g...@gmail.com


Re: Combiner run specification and questions

2009-01-02 Thread Jim Twensky
Hello Saptarshi,

E.g if there are only 10 value corresponding
to a key(as outputted by the mapper), will these 10 values go straight
to the reducer or to the reducer via the combiner?

It depends on whether or not you use the method JobConf.setCombinerClass()
or not. If you don't, Hadoop  does not run any combiners by default. If you
use your reducer class as the combiner, you must make sure that your mapper
and reducer outputs are of same type. Because otherwise you will get a
runtime error about types not matching. In your case, I strongly recommend
you to use a combiner to reduce the size of the intermediate data. My
understanding is that, combiners are just local reducers that run right
after the completion of the map step.

Jim

On Fri, Jan 2, 2009 at 11:57 AM, Saptarshi Guha saptarshi.g...@gmail.comwrote:

 Hello,
 I would just like to confirm, when does the Combiner run(since it
 might not be run at all,see below). I read somewhere that it is run,
 if there is at least one reduce (which in my case i can be sure of).
 I also read, that the combiner is an optimization. However, it is also
 a chance for a function to transform the key/value (keeping the class
 the same i.e the combiner semantics are not changed) and deal with a
 smaller set ( this could be done in the reducer but the number of
 values for a key might be relatively large).

 However, I guess it would be a mistake for reducer to expect its input
 coming from a combiner? E.g if there are only 10 value corresponding
 to a key(as outputted by the mapper), will these 10 values go straight
 to the reducer or to the reducer via the combiner?

 Here I am assuming my reduce operations does not need all the values
 for a key to work(so that a combiner can be used) i.e additive
 operations.

 Thank you
 Saptarshi


 On Sun, Nov 16, 2008 at 6:18 PM, Owen O'Malley omal...@apache.org wrote:
  The Combiner may be called 0, 1, or many times on each key between the
  mapper and reducer. Combiners are just an application specific
 optimization
  that compress the intermediate output. They should not have side effects
 or
  transform the types. Unfortunately, since there isn't a separate
 interface
  for Combiners, there is isn't a great place to document this requirement.
  I've just filed HADOOP-4668 to improve the documentation.



 --
 Saptarshi Guha - saptarshi.g...@gmail.com