On 30 October 2011 20:22, Owen O'Malley <[email protected]> wrote:
> On Sat, Oct 29, 2011 at 3:52 AM, Mathias Herberts < > [email protected]> wrote: > > > My question is, what happens if the combiner outputs different keys > > than what it is being fed? The output of the combiner will suffer two > > flaws: > > > > 1. It won't be sorted > > 2. It might end up in the wrong partition > > > > Yes. We've talked about adding various checks, but I don't think anyone has > added them. We obviously have the input key and one option would be to > ignore the output key. > > > > Since a Combiner is simply a Reducer with no other constraints, > > > > That isn't true. Combiners are required to be: > 1. Idempotent - The number of times the combiner is applied can't change > the output > 2. Transititive - The order of the inputs can't change the output > 3. Side-effect free - Combiners can't have side effects (or they won't be > idempotent). > 4. Preserve the sort order - They can't change the keys to disrupt the > sort order > 5. Preserve the partitioning - They can't change the keys to change the > parititioning > > All 5 of them are required for combiners. > I can't add to Owen's reply about Hadoop, other than perhaps to suggest "associative" might be the word? However, I also mess with these things at the theoretical level for my own entertainment. This same question caught me out at first, because mathematically speaking, the ability to change the keys in the combiner does not prevent an efficient implementation from being constructed, but would make the output buffer management somewhat more complex. So if the restriction seems nonobvious, that's why. Also, if you're explaining pure mapreduce, it also works without sorting (assuming the user application doesn't require sorted output), all you need is a gather operator, but (skipping tricks like bucket sort) since the most efficient way to gather is to use a comparison-sort, one might as well. Back to the grindstone. S.
