Yeah - we definitely want to convert it to an MFU-type flush algorithm. If someone wants to take a crack at it before we can get to it, that would be awesome.
________________________________
From: Namit Jain [mailto:[email protected]]
Sent: Friday, February 27, 2009 1:59 PM
To: [email protected]
Subject: RE: Combine() optimization

It dumps 10% of the hash table randomly today.

From: Scott Carey [mailto:[email protected]]
Sent: Friday, February 27, 2009 1:41 PM
To: [email protected]
Subject: Re: Combine() optimization

Does it dump all contents and start over, or use an LRU or MFU algorithm? LinkedHashMap makes LRUs and similar constructs fairly easy to make.

My guess is that most data types have biased value distributions that will take advantage of map-side partial aggregation fairly well.

On 2/26/09 6:02 PM, "Namit Jain" <[email protected]> wrote:

Yes, it flushes the data when the hash table is occupying too much memory.

From: Qing Yan [mailto:[email protected]]
Sent: Thursday, February 26, 2009 5:58 PM
To: [email protected]
Subject: Re: Combine() optimization

Got it.

Does map-side aggregation have any special requirements about the dataset? E.g., the number of unique group-by keys could be too big to hold in memory. Will it still work?

On Fri, Feb 27, 2009 at 5:50 AM, Zheng Shao <[email protected]> wrote:

Hi Qing,

We did think about Combiner when we started Hive. However, earlier discussions led us to believe that hash-based aggregation inside the mapper will be as competitive as using the combiner in most cases.

In order to enable map-side aggregation, we just need to do the following before running the Hive query:

    set hive.map.aggr=true;

Zheng

On Thu, Feb 26, 2009 at 6:03 AM, Raghu Murthy <[email protected]> wrote:

Right now Hive does not exploit the combiner. But hash-based map-side aggregation in Hive (controlled by hints) provides a similar optimization. Using the combiner in addition to map-side aggregation should improve the performance even more if the combiner can further aggregate the partial aggregates generated from the mapper.

On 2/26/09 5:57 AM, "Qing Yan" <[email protected]> wrote:

> Is there any way/plan for Hive to take advantage of M/R's combine()
> phase? There can be either rules embedded in the query optimizer or hints
> passed by the user...
> GROUP BY should benefit from this a lot.
>
> Any comment?
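
For anyone wanting to experiment with the LinkedHashMap idea raised in the thread, here is a minimal sketch of map-side partial aggregation with LRU eviction. It is not Hive's code: the class `LruPartialAggregator`, the `emitted` list (standing in for rows sent on to the reducer), and the entry-count limit `maxEntries` (standing in for a memory threshold) are all hypothetical, and only a COUNT per group-by key is modeled.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch: COUNT per group-by key, held in a bounded hash table.
 * When the table is full, the least recently used key is flushed downstream.
 */
class LruPartialAggregator {
    /** Flushed partial aggregates; a stand-in for output sent to the reducer. */
    final List<Map.Entry<String, Long>> emitted = new ArrayList<>();

    private final LinkedHashMap<String, Long> table;

    LruPartialAggregator(final int maxEntries) {
        // accessOrder = true: iteration runs from least to most recently accessed.
        this.table = new LinkedHashMap<String, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                if (size() > maxEntries) {
                    // Emit the partial count before the map drops the entry.
                    emitted.add(new AbstractMap.SimpleImmutableEntry<>(
                            eldest.getKey(), eldest.getValue()));
                    return true;
                }
                return false;
            }
        };
    }

    /** Adds one input row for the given group-by key. */
    void add(String key) {
        Long current = table.get(key); // get() marks the key as recently used
        table.put(key, current == null ? 1L : current + 1L);
    }

    /** Flushes everything still in the table, e.g. when the mapper finishes. */
    void close() {
        for (Map.Entry<String, Long> e : table.entrySet()) {
            emitted.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
        }
        table.clear();
    }
}
```

A key that is evicted and then seen again is emitted more than once, each time with a partial count, so the reducer still has to sum the partials per key. With a skewed key distribution, the frequently seen keys tend to stay in the table and the rarely seen ones are flushed first.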
