Arun, >From the Tenzing paper:
<quote> Hash table based aggregation is common in RDBMS sys- tems. However, it is impossible to implement eciently on the basic MapReduce framework, since the reducer al- ways unnecessarily sorts the data by key. We enhanced the MapReduce framework to relax this restriction so that all values for the same key end up in the same reducer shard, but not necessarily in the same Reduce() call. This made it possible to completely avoid the sorting step on the re- ducer and implement a pure hash table based aggregation on MapReduce. This can have a signicant impact on the performance on certain types of queries. Due to optimizer limitations, the user must explicitly indicate that they want hash based aggregation. A query using hash-based aggre- gation will fail if there is not enough memory for the hash table. </quote> So, you need sorting by partition anyway, which is exactly what would happen if I set key comparator to return equals always. - milind On 10/19/11 3:26 PM, "Arun C Murthy" <[email protected]> wrote: >Milind the map-side sort uses the partiion as the primary key. So, you >still sort. > >See MR-1639 for more details. > >On Oct 19, 2011, at 3:22 PM, <[email protected]> wrote: > >> How is that different from specifying a comparator that always returns >> that k1 and k2 are equal regardless of k1 and k2 ? So, you will get only >> partitioning, and not sorting. >> >> - Milind >> >> >> On 10/19/11 2:42 PM, "Zheng Shao" <[email protected]> wrote: >> >>> Google's Tenzing paper mentioned that they modified MR to make sorting >>>in >>> reducer optional: >>> >>>http://static.googleusercontent.com/external_content/untrusted_dlcp/rese >>>ar >>> ch.google.com/en/us/pubs/archive/37200.pdf >>> >>> Is there any plan to support that in MR 2.0? >>> >>> Zheng >> > >
