[jira] Commented: (HIVE-535) Memory-efficient hash-based Aggregation

Joydeep Sen Sarma (JIRA) Tue, 02 Jun 2009 15:52:33 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715737#action_12715737
 ]


Joydeep Sen Sarma commented on HIVE-535:
----------------------------------------

+1 on A3 (it's become one of my interview questions :-)).

one of the issues with it is that one would need a rewritten version of 
HashMap that stores a primitive field directly in the value rather than an 
object reference (the current HashMap maps object to object). 
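
To make the point above concrete, here is a minimal sketch of such a map. It is not Hive code: the class name ObjectToLongMap and its methods are hypothetical, and it assumes linear probing, no deletions, and a single long as the aggregation state. Values live in a long[] parallel to the key array, so no Long wrapper or per-entry node object is allocated.

```java
// Hypothetical sketch (not Hive code): an open-addressing map from object
// keys to primitive long values, stored in two parallel arrays.
final class ObjectToLongMap<K> {
    private Object[] keys;   // null marks an empty slot
    private long[] values;   // aggregation state stored inline, no boxed Long
    private int size;

    ObjectToLongMap(int initialCapacity) {
        int cap = 16;
        while (cap < initialCapacity) {
            cap <<= 1;       // power-of-two capacity allows cheap masking
        }
        keys = new Object[cap];
        values = new long[cap];
    }

    // Returns the slot holding key, or the empty slot where it would go.
    private static int slotFor(Object key, Object[] table) {
        int mask = table.length - 1;
        int h = key.hashCode();
        int i = (h ^ (h >>> 16)) & mask;
        while (table[i] != null && !table[i].equals(key)) {
            i = (i + 1) & mask;   // linear probing
        }
        return i;
    }

    /** Adds delta to the value for key; an absent key starts from 0. */
    void add(K key, long delta) {
        if ((size + 1) * 2 > keys.length) {
            resize();             // keep the load factor at or below 0.5
        }
        int i = slotFor(key, keys);
        if (keys[i] == null) {
            keys[i] = key;
            size++;
        }
        values[i] += delta;
    }

    long get(K key, long defaultValue) {
        int i = slotFor(key, keys);
        return keys[i] == null ? defaultValue : values[i];
    }

    int size() {
        return size;
    }

    // Doubles the table and re-inserts every occupied slot.
    private void resize() {
        Object[] oldKeys = keys;
        long[] oldValues = values;
        Object[] newKeys = new Object[oldKeys.length * 2];
        long[] newValues = new long[newKeys.length];
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldKeys[j] != null) {
                int i = slotFor(oldKeys[j], newKeys);
                newKeys[i] = oldKeys[j];
                newValues[i] = oldValues[j];
            }
        }
        keys = newKeys;
        values = newValues;
    }
}
```

A count-style aggregation would call add(key, 1L) once per row. The keys are still ordinary objects here, so this only removes the per-value overhead; serializing the keys is a separate step.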

there is a tradeoff between memory usage and speed (serialization costs 
CPU), so it might be worth going the serialized route only if cardinality is 
observed to be high enough.

the other low-hanging fruit is to not spill hash map contents randomly (that's 
probably a separate JIRA somewhere and somewhat unrelated).

> Memory-efficient hash-based Aggregation
> ---------------------------------------
>
>                 Key: HIVE-535
>                 URL: https://issues.apache.org/jira/browse/HIVE-535
>             Project: Hadoop Hive
>          Issue Type: Improvement
>    Affects Versions: 0.4.0
>            Reporter: Zheng Shao
>
> Currently there is a lot of memory overhead in the hash-based aggregation in 
> GroupByOperator.
> The net result is that GroupByOperator cannot store many entries in its 
> HashTable, flushes frequently, and cannot achieve a very good partial 
> aggregation result.
> Here are some initial thoughts (some of them are from Joydeep a long time ago):
> A1. Serialize the key of the HashTable. This will eliminate the 16-byte 
> per-object overhead of Java in keys (depending on how many objects there are 
> in the key, the saving can be substantial).
> A2. Use more memory-efficient hash tables - java.util.HashMap has about 64 
> bytes of overhead per entry.
> A3. Use primitive arrays to store aggregation results. Basically, the UDAF 
> should manage the arrays of aggregation results, so UDAFCount should manage a 
> long[], and UDAFAvg should manage a double[] and a long[]. The external code 
> should pass an index to iterate/merge/terminate an aggregation result. This 
> will eliminate the 16-byte per-object overhead of Java.
> More ideas are welcome.
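
A rough sketch of what A3 could look like for an average, following the description above. This is not the Hive UDAF interface: the class name AvgAggregator, the newSlot method, and the method signatures are assumptions made for illustration. The aggregator owns a double[] of sums and a long[] of counts, and the caller addresses each group's state by an int index.

```java
import java.util.Arrays;

// Hypothetical sketch of A3 (not the Hive UDAF API): the aggregator owns
// primitive arrays, and callers refer to a group's state by slot index.
final class AvgAggregator {
    private double[] sums = new double[16];
    private long[] counts = new long[16];
    private int used;   // number of slots handed out so far

    /** Allocates state for a new group and returns its index. */
    int newSlot() {
        if (used == sums.length) {
            // Grow both arrays together so indices stay aligned.
            sums = Arrays.copyOf(sums, used * 2);
            counts = Arrays.copyOf(counts, used * 2);
        }
        return used++;
    }

    /** Folds one input row into the group's state. */
    void iterate(int slot, double value) {
        sums[slot] += value;
        counts[slot]++;
    }

    /** Folds a partial (sum, count) pair into the group's state. */
    void merge(int slot, double partialSum, long partialCount) {
        sums[slot] += partialSum;
        counts[slot] += partialCount;
    }

    /** Returns the average for the group, or NaN if it saw no rows. */
    double terminate(int slot) {
        long n = counts[slot];
        return n == 0 ? Double.NaN : sums[slot] / n;
    }
}
```

With this layout the hash table only needs to map a group key to an int slot, which saves memory only if that mapping also stores the int as a primitive rather than as a boxed Integer.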

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
