I'm talking about the Hadoop impl of map-task. On Oct 19, 2011, at 3:45 PM, <[email protected]> <[email protected]> wrote:
> Arun, > > From the Tenzing paper: > > <quote> > Hash table based aggregation is common in RDBMS sys- > tems. However, it is impossible to implement eciently > on the basic MapReduce framework, since the reducer al- > ways unnecessarily sorts the data by key. We enhanced the > MapReduce framework to relax this restriction so that all > values for the same key end up in the same reducer shard, > but not necessarily in the same Reduce() call. This made > it possible to completely avoid the sorting step on the re- > ducer and implement a pure hash table based aggregation > on MapReduce. This can have a signicant impact on the > performance on certain types of queries. Due to optimizer > limitations, the user must explicitly indicate that they want > hash based aggregation. A query using hash-based aggre- > gation will fail if there is not enough memory for the hash > table. > </quote> > > So, you need sorting by partition anyway, which is exactly what would > happen if I set key comparator to return equals always. > > - milind > > > > > On 10/19/11 3:26 PM, "Arun C Murthy" <[email protected]> wrote: > >> Milind the map-side sort uses the partiion as the primary key. So, you >> still sort. >> >> See MR-1639 for more details. >> >> On Oct 19, 2011, at 3:22 PM, <[email protected]> wrote: >> >>> How is that different from specifying a comparator that always returns >>> that k1 and k2 are equal regardless of k1 and k2 ? So, you will get only >>> partitioning, and not sorting. >>> >>> - Milind >>> >>> >>> On 10/19/11 2:42 PM, "Zheng Shao" <[email protected]> wrote: >>> >>>> Google's Tenzing paper mentioned that they modified MR to make sorting >>>> in >>>> reducer optional: >>>> >>>> http://static.googleusercontent.com/external_content/untrusted_dlcp/rese >>>> ar >>>> ch.google.com/en/us/pubs/archive/37200.pdf >>>> >>>> Is there any plan to support that in MR 2.0? >>>> >>>> Zheng >>> >> >> >
