Memory-efficient hash-based Aggregation
---------------------------------------
Key: HIVE-535
URL: https://issues.apache.org/jira/browse/HIVE-535
Project: Hadoop Hive
Issue Type: Improvement
Affects Versions: 0.4.0
Reporter: Zheng Shao
Currently there are a lot of memory overhead in the hash-based aggregation in
GroupByOperator.
The net result is that GroupByOperator won't be able to store many entries in
its HashTable, and flushes frequently, and won't be able to achieve very good
partial aggregation result.
Here are some initial thoughts (some of them are from Joydeep long time ago):
A1. Serialize the key of the HashTable. This will eliminate the 16-byte
per-object overhead of Java in keys (depending on how many objects there are in
the key, the saving can be substantial).
A2. Use more memory-efficient hash tables - java.util.HashMap has about 64
bytes of overhead per entry.
A3. Use primitive array to store aggregation results. Basically, the UDAF
should manage the array of aggregation results, so UDAFCount should manage a
long[], UDAFAvg should manage a double[] and a long[]. The external code should
pass an index to iterate/merge/terminal an aggregation result. This will
eliminate the 16-byte per-object overhead of Java.
More ideas are welcome.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.