On Tue, 2 Aug 2011 21:25:47 +0800 (CST), "Daniel,Wu" <[email protected]> wrote:
> at page 243:
> Per my understanding, The reducer is supposed to output the first value
> (the maximum) for each year. But I just don't know how it work.
>
> suppose we have the data
> 1901 200
> 1901 300
> 1901 400
>
> Since group is done by the year, so we have only one group, but we have 3
> different key as the key is a combination of year and temperature. for the
> reduce, the output should be key, list(value) pair, since we have 3 key,
> so we should output 3 rows, but since we have only one group, we only
> output 1 rows. So where is the conflict? Where do I misunderstand?

Keep reading the section in the book:

"This still isn't enough to achieve our goal, however. A partitioner ensures only that one reducer receives all the records for a year; it doesn't change the fact that the reducer groups by key within the partition... The final piece of the puzzle is the setting to control the grouping. If we group values in the reducer by the year part of the key, then we will see all the records for the same year in one reduce group. And since they are sorted by temperature in descending order, the first is the maximum temperature."

That is, in that example they also change the way the reducer groups its inputs. There is no conflict: the three composite keys still exist and are still sorted, but reduce() is called once per group rather than once per distinct key. With grouping by year, your three records form a single group, so you get one output row, and the first key in that group is the one with the maximum temperature.
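
If it helps, here is a rough plain-Python sketch of that behaviour. It is not Hadoop code and not the book's listing, just a simulation of "sort by the full key, group by the year part only". The names sort_key, group_key and reduce_max are made up for illustration, and the 1902 rows are invented sample data added to show a second group.

```python
from itertools import groupby

# Composite keys (year, temperature). The 1901 rows come from the question;
# the 1902 rows are hypothetical, added only to show a second group.
records = [(1901, 200), (1901, 300), (1901, 400), (1902, 150), (1902, 90)]


def sort_key(key):
    """Sort order: year ascending, then temperature descending."""
    year, temp = key
    return (year, -temp)


def group_key(key):
    """Grouping looks at the year part of the composite key only."""
    return key[0]


def reduce_max(records):
    """Emit one row per group: the first composite key in sort order."""
    ordered = sorted(records, key=sort_key)
    output = []
    for _year, group in groupby(ordered, key=group_key):
        # One iteration here stands in for one reduce() call. The first key
        # in the group carries the maximum temperature because of the sort.
        output.append(next(group))
    return output


result = reduce_max(records)
print(result)  # [(1901, 400), (1902, 150)]
```

If group_key returned the whole key instead of just the year, every record would become its own group and you would get one output row per record, which is the three-row result you were expecting.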
