> So as long as the correctness of the computation doesn't rely on a
> transformation performed in the combiner, it should be OK.

Right, I had the same thought.

> However, this restriction limits the scalability of your solution. It might
> be necessary to work around R's limitations by breaking up large
> computations into intermediate steps, possibly by explicitly instantiating
> and running the combiner in the reduce.

So, I explicitly call the combiner? At times, though, the reducer needs all
the values, so calling the combiner would not always work here. However, if I
recall correctly (from reading the Google paper), one does not expect a
**humongous** number of values for a single key.

>> 1) I am guaranteed a reducer. So,
>>>
>>> The combiner, if defined, will run zero or more times on records emitted
>>> from the map, before being fed to the reduce.
>>
>> This zero-case possibility worries me. However, you mention that it occurs
>> when the
>>>
>>> collector spills in the map
>>
>> I have noticed this happening - what does 'spilling' mean?
>
> Records emitted from the map are serialized into a buffer, which is
> periodically written to disk when it is (sufficiently) full. Each of these
> batch writes is a "spill". In casual usage, it refers to any time when
> records need to be written to disk. The merge of intermediate files into the
> final map output and merging in-memory segments to disk in the reduce are
> two examples. -C

Thanks for the explanation.

Regards,
Saptarshi
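As an aside, the "zero or more times" rule discussed above can be sketched in a few lines. This is illustrative Python, not Hadoop code; the point is that a correct combiner must leave the reducer's answer unchanged no matter how many times the framework happens to run it:

```python
def word_count_map(record):
    """Map phase: emit (word, 1) for each word in the record."""
    return [(word, 1) for word in record.split()]

def combine(values):
    """Combiner: pre-sum counts for one key. Because summing is
    associative, applying it zero or more times is harmless."""
    return [sum(values)]

def reduce_counts(values):
    """Reducer: final sum over all values for a key."""
    return sum(values)

def run(records, combiner_runs):
    """Simulate the framework: group map output by key, apply the
    combiner 0..n times (no guarantee is given), then reduce."""
    grouped = {}
    for record in records:
        for key, value in word_count_map(record):
            grouped.setdefault(key, []).append(value)
    result = {}
    for key, values in grouped.items():
        for _ in range(combiner_runs):  # zero or more combiner passes
            values = combine(values)
        result[key] = reduce_counts(values)
    return result

records = ["a b a", "b a"]
# The answer is identical whether the combiner ran 0, 1, or 3 times.
assert run(records, 0) == run(records, 1) == run(records, 3) == {"a": 3, "b": 2}
```

A combiner that is not associative/commutative in this way (e.g. one that computes an average of its inputs and emits it as a plain value) would give different answers depending on how often it ran, which is exactly the correctness hazard mentioned above.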
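The spilling behaviour described in the reply can also be modelled with a toy buffer. This is a sketch only (the class and names are made up, not Hadoop's API): records accumulate in memory and each batch write when the buffer fills stands in for one spill file, with a final merge at the end:

```python
class SpillingBuffer:
    """Toy model of the map-side output buffer: records accumulate in
    memory and are 'spilled' (batch-written) to disk when the buffer is
    sufficiently full. Illustrative only, not Hadoop's implementation."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.spills = []  # each entry stands in for one on-disk spill file

    def emit(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.capacity:
            self.spill()

    def spill(self):
        """One batch write to 'disk' -- this is what a spill is."""
        if self.buffer:
            self.spills.append(list(self.buffer))
            self.buffer.clear()

    def finish(self):
        """Final flush, then the merge of spill files into one map output."""
        self.spill()
        return [r for batch in self.spills for r in batch]

buf = SpillingBuffer(capacity=3)
for i in range(7):
    buf.emit(i)
output = buf.finish()
# 7 records with capacity 3 -> three spills (3 + 3 + 1 records)
assert len(buf.spills) == 3
assert output == list(range(7))
```

In real Hadoop, the combiner (if set) may run on each spill before it is written, which is why it can execute a variable number of times per key.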
