Re: Combiner run specification and questions
. So as long as the correctness of the computation doesn't rely on a transformation performed in the combiner, it should be OK. In Right, i had the same thought. However, this restriction limits the scalability of your solution. It might be necessary to work around R's limitations by breaking up large computations into intermediate steps, possibly by explicitly instantiating and running the combiner in the reduce. So, i explicitly call the combiner? However at times, the reducer needs all the values so calling the combiner would not always work here. However, if i recall correctly(from reading the google paper) one does not **humongous** number values for a single key 1) I am guaranteed a reducer. So, The combiner, if defined, will run zero or more times on records emitted from the map, before being fed to the reduce. This zero case possibility worries me. However you mention, that it occurs collector spills in the map I have noticed this happening - what 'spilling' mean? Records emitted from the map are serialized into a buffer, which is periodically written to disk when it is (sufficiently) full. Each of these batch writes is a spill. In casual usage, it refers to any time when records need to be written to disk. The merge of intermediate files into the final map output and merging in-memory segments to disk in the reduce are two examples. -C Thanks for the explanation. Regards Saptarshi
Re: Combiner run specification and questions
I agree with the requirement that the key does not change. Of course, the values can change. I am primarily worried that the combiner might not be run at all - I have 'successfully' integrated Hadoop and R i.e the user can provide map/reduce functions written in R. However, R is not great with memory management, and if I have N (N is huge) values for a given key K, then R will baulk when it comes to processing this. Thus the combiner. The combiner will process n values for K, and ultimately, a few values for K in the reducer . If the combiner where not to run, R would collapse under the load. 1) I am guaranteed a reducer. So, The combiner, if defined, will run zero or more times on records emitted from the map, before being fed to the reduce. This zero case possibility worries me. However you mention, that it occurs collector spills in the map I have noticed this happening - what 'spilling' mean? Thank you Saptarshi On Jan 5, 2009, at 10:22 PM, Chris Douglas wrote: The combiner, if defined, will run zero or more times on records emitted from the map, before being fed to the reduce. It is run when the collector spills in the map and in some merge cases. If the combiner transforms the key, it is illegal to change its type, the partition to which it is assigned, or its ordering. For example, if you emit a record (k,v) from your map and (k',v) from the combiner, your comparator is C(K,K) and your partitioner function is P(K), it must be the case that P(k) == P(k') and C(k,k') == 0. If either of these does not hold, the semantics to the reduce are broken. Clearly, if k is not transformed (as in true for most combiners), this holds trivially. As was mentioned earlier, the purpose of the combiner is to compress data pulled across the network and spilled to disk. It should not affect the correctness or, in most cases, the output of the job. -C On Jan 2, 2009, at 9:57 AM, Saptarshi Guha wrote: Hello, I would just like to confirm, when does the Combiner run(since it might not be run at all,see below). I read somewhere that it is run, if there is at least one reduce (which in my case i can be sure of). I also read, that the combiner is an optimization. However, it is also a chance for a function to transform the key/value (keeping the class the same i.e the combiner semantics are not changed) and deal with a smaller set ( this could be done in the reducer but the number of values for a key might be relatively large). However, I guess it would be a mistake for reducer to expect its input coming from a combiner? E.g if there are only 10 value corresponding to a key(as outputted by the mapper), will these 10 values go straight to the reducer or to the reducer via the combiner? Here I am assuming my reduce operations does not need all the values for a key to work(so that a combiner can be used) i.e additive operations. Thank you Saptarshi On Sun, Nov 16, 2008 at 6:18 PM, Owen O'Malley omal...@apache.org wrote: The Combiner may be called 0, 1, or many times on each key between the mapper and reducer. Combiners are just an application specific optimization that compress the intermediate output. They should not have side effects or transform the types. Unfortunately, since there isn't a separate interface for Combiners, there is isn't a great place to document this requirement. I've just filed HADOOP-4668 to improve the documentation. -- Saptarshi Guha - saptarshi.g...@gmail.com Saptarshi Guha | saptarshi.g...@gmail.com | http://www.stat.purdue.edu/~sguha The way of the world is to praise dead saints and prosecute live ones. -- Nathaniel Howe
Re: Combiner run specification and questions
I agree with the requirement that the key does not change. Of course, the values can change. Yes. Key attributes not relevant to ordering are also mutable. If your key is a (t1, t2) tuple ordered by t1, then t2 can change without affecting the expected semantics. I am primarily worried that the combiner might not be run at all - I have 'successfully' integrated Hadoop and R i.e the user can provide map/reduce functions written in R. However, R is not great with memory management, and if I have N (N is huge) values for a given key K, then R will baulk when it comes to processing this. That sounds really cool. The cases where the combiner is not run are currently obscure and future exceptions are unlikely to affect this particular case. For example, it doesn't make sense to run the combiner for a single record (i.e. there is nothing to combine), for a partition with no common keys, etc. So as long as the correctness of the computation doesn't rely on a transformation performed in the combiner, it should be OK. In general, it shouldn't be- and in the future may not be- run where there is little or no compression for it to effect, which doesn't exacerbate this case. However, this restriction limits the scalability of your solution. It might be necessary to work around R's limitations by breaking up large computations into intermediate steps, possibly by explicitly instantiating and running the combiner in the reduce. 1) I am guaranteed a reducer. So, The combiner, if defined, will run zero or more times on records emitted from the map, before being fed to the reduce. This zero case possibility worries me. However you mention, that it occurs collector spills in the map I have noticed this happening - what 'spilling' mean? Records emitted from the map are serialized into a buffer, which is periodically written to disk when it is (sufficiently) full. Each of these batch writes is a spill. In casual usage, it refers to any time when records need to be written to disk. The merge of intermediate files into the final map output and merging in-memory segments to disk in the reduce are two examples. -C Thank you Saptarshi On Jan 5, 2009, at 10:22 PM, Chris Douglas wrote: The combiner, if defined, will run zero or more times on records emitted from the map, before being fed to the reduce. It is run when the collector spills in the map and in some merge cases. If the combiner transforms the key, it is illegal to change its type, the partition to which it is assigned, or its ordering. For example, if you emit a record (k,v) from your map and (k',v) from the combiner, your comparator is C(K,K) and your partitioner function is P(K), it must be the case that P(k) == P(k') and C(k,k') == 0. If either of these does not hold, the semantics to the reduce are broken. Clearly, if k is not transformed (as in true for most combiners), this holds trivially. As was mentioned earlier, the purpose of the combiner is to compress data pulled across the network and spilled to disk. It should not affect the correctness or, in most cases, the output of the job. -C On Jan 2, 2009, at 9:57 AM, Saptarshi Guha wrote: Hello, I would just like to confirm, when does the Combiner run(since it might not be run at all,see below). I read somewhere that it is run, if there is at least one reduce (which in my case i can be sure of). I also read, that the combiner is an optimization. However, it is also a chance for a function to transform the key/value (keeping the class the same i.e the combiner semantics are not changed) and deal with a smaller set ( this could be done in the reducer but the number of values for a key might be relatively large). However, I guess it would be a mistake for reducer to expect its input coming from a combiner? E.g if there are only 10 value corresponding to a key(as outputted by the mapper), will these 10 values go straight to the reducer or to the reducer via the combiner? Here I am assuming my reduce operations does not need all the values for a key to work(so that a combiner can be used) i.e additive operations. Thank you Saptarshi On Sun, Nov 16, 2008 at 6:18 PM, Owen O'Malley omal...@apache.org wrote: The Combiner may be called 0, 1, or many times on each key between the mapper and reducer. Combiners are just an application specific optimization that compress the intermediate output. They should not have side effects or transform the types. Unfortunately, since there isn't a separate interface for Combiners, there is isn't a great place to document this requirement. I've just filed HADOOP-4668 to improve the documentation. -- Saptarshi Guha - saptarshi.g...@gmail.com Saptarshi Guha | saptarshi.g...@gmail.com | http://www.stat.purdue.edu/~sguha The way of the world is to praise dead saints and prosecute live ones. -- Nathaniel Howe
Re: Combiner run specification and questions
The combiner, if defined, will run zero or more times on records emitted from the map, before being fed to the reduce. It is run when the collector spills in the map and in some merge cases. If the combiner transforms the key, it is illegal to change its type, the partition to which it is assigned, or its ordering. For example, if you emit a record (k,v) from your map and (k',v) from the combiner, your comparator is C(K,K) and your partitioner function is P(K), it must be the case that P(k) == P(k') and C(k,k') == 0. If either of these does not hold, the semantics to the reduce are broken. Clearly, if k is not transformed (as in true for most combiners), this holds trivially. As was mentioned earlier, the purpose of the combiner is to compress data pulled across the network and spilled to disk. It should not affect the correctness or, in most cases, the output of the job. -C On Jan 2, 2009, at 9:57 AM, Saptarshi Guha wrote: Hello, I would just like to confirm, when does the Combiner run(since it might not be run at all,see below). I read somewhere that it is run, if there is at least one reduce (which in my case i can be sure of). I also read, that the combiner is an optimization. However, it is also a chance for a function to transform the key/value (keeping the class the same i.e the combiner semantics are not changed) and deal with a smaller set ( this could be done in the reducer but the number of values for a key might be relatively large). However, I guess it would be a mistake for reducer to expect its input coming from a combiner? E.g if there are only 10 value corresponding to a key(as outputted by the mapper), will these 10 values go straight to the reducer or to the reducer via the combiner? Here I am assuming my reduce operations does not need all the values for a key to work(so that a combiner can be used) i.e additive operations. Thank you Saptarshi On Sun, Nov 16, 2008 at 6:18 PM, Owen O'Malley omal...@apache.org wrote: The Combiner may be called 0, 1, or many times on each key between the mapper and reducer. Combiners are just an application specific optimization that compress the intermediate output. They should not have side effects or transform the types. Unfortunately, since there isn't a separate interface for Combiners, there is isn't a great place to document this requirement. I've just filed HADOOP-4668 to improve the documentation. -- Saptarshi Guha - saptarshi.g...@gmail.com
Combiner run specification and questions
Hello, I would just like to confirm, when does the Combiner run(since it might not be run at all,see below). I read somewhere that it is run, if there is at least one reduce (which in my case i can be sure of). I also read, that the combiner is an optimization. However, it is also a chance for a function to transform the key/value (keeping the class the same i.e the combiner semantics are not changed) and deal with a smaller set ( this could be done in the reducer but the number of values for a key might be relatively large). However, I guess it would be a mistake for reducer to expect its input coming from a combiner? E.g if there are only 10 value corresponding to a key(as outputted by the mapper), will these 10 values go straight to the reducer or to the reducer via the combiner? Here I am assuming my reduce operations does not need all the values for a key to work(so that a combiner can be used) i.e additive operations. Thank you Saptarshi On Sun, Nov 16, 2008 at 6:18 PM, Owen O'Malley omal...@apache.org wrote: The Combiner may be called 0, 1, or many times on each key between the mapper and reducer. Combiners are just an application specific optimization that compress the intermediate output. They should not have side effects or transform the types. Unfortunately, since there isn't a separate interface for Combiners, there is isn't a great place to document this requirement. I've just filed HADOOP-4668 to improve the documentation. -- Saptarshi Guha - saptarshi.g...@gmail.com
Re: Combiner run specification and questions
Hello Saptarshi, E.g if there are only 10 value corresponding to a key(as outputted by the mapper), will these 10 values go straight to the reducer or to the reducer via the combiner? It depends on whether or not you use the method JobConf.setCombinerClass() or not. If you don't, Hadoop does not run any combiners by default. If you use your reducer class as the combiner, you must make sure that your mapper and reducer outputs are of same type. Because otherwise you will get a runtime error about types not matching. In your case, I strongly recommend you to use a combiner to reduce the size of the intermediate data. My understanding is that, combiners are just local reducers that run right after the completion of the map step. Jim On Fri, Jan 2, 2009 at 11:57 AM, Saptarshi Guha saptarshi.g...@gmail.comwrote: Hello, I would just like to confirm, when does the Combiner run(since it might not be run at all,see below). I read somewhere that it is run, if there is at least one reduce (which in my case i can be sure of). I also read, that the combiner is an optimization. However, it is also a chance for a function to transform the key/value (keeping the class the same i.e the combiner semantics are not changed) and deal with a smaller set ( this could be done in the reducer but the number of values for a key might be relatively large). However, I guess it would be a mistake for reducer to expect its input coming from a combiner? E.g if there are only 10 value corresponding to a key(as outputted by the mapper), will these 10 values go straight to the reducer or to the reducer via the combiner? Here I am assuming my reduce operations does not need all the values for a key to work(so that a combiner can be used) i.e additive operations. Thank you Saptarshi On Sun, Nov 16, 2008 at 6:18 PM, Owen O'Malley omal...@apache.org wrote: The Combiner may be called 0, 1, or many times on each key between the mapper and reducer. Combiners are just an application specific optimization that compress the intermediate output. They should not have side effects or transform the types. Unfortunately, since there isn't a separate interface for Combiners, there is isn't a great place to document this requirement. I've just filed HADOOP-4668 to improve the documentation. -- Saptarshi Guha - saptarshi.g...@gmail.com