[
https://issues.apache.org/jira/browse/HIVE-535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715727#action_12715727
]
Zheng Shao edited comment on HIVE-535 at 6/2/09 3:34 PM:
---------------------------------------------------------
Some details about A3:
We use a special hash map whose values must be primitive ints (4 bytes each).
We let the UDAFs manage the arrays of aggregation results, and store only the
index of each key's aggregation result in the hash table.
{code}
SELECT departmentid, count(1), sum(revenue)
FROM sales
GROUP BY departmentid;
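
// For each input row (departmentid, revenue):
int key_id = -1;
if (!hashmap.contains(departmentid)) {
  // New key: assign the next index and initialize its aggregation slots.
  key_id = hashmap.size() + 1;
  hashmap.insert(departmentid, key_id);
  count_1.initAggregation(key_id);
  sum_revenue.initAggregation(key_id);
} else {
  // Existing key: look up the index assigned earlier.
  key_id = hashmap.get(departmentid);
}
// Update the aggregation results stored at that index.
count_1.iterate(key_id, 1);
sum_revenue.iterate(key_id, revenue);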
{code}
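
The pseudocode above could be fleshed out roughly as follows. This is only a minimal sketch, not the actual Hive code: the class names (IndexedAggregationSketch, CountAgg, SumAgg, GroupBy) and the terminate/idOf helpers are hypothetical, a boxed java.util.HashMap stands in for the special primitive-int hash map, and key ids start at 0 rather than hashmap.size() + 1 so that they can index the arrays directly.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class IndexedAggregationSketch {

    /** count(1): one long slot per group, addressed by key id. */
    static final class CountAgg {
        private long[] counts = new long[16];

        void initAggregation(int keyId) {
            if (keyId >= counts.length) {
                // Grow the primitive array; no per-group object is allocated.
                counts = Arrays.copyOf(counts, Math.max(keyId + 1, counts.length * 2));
            }
            counts[keyId] = 0L;
        }

        void iterate(int keyId) {
            counts[keyId]++;
        }

        long terminate(int keyId) {
            return counts[keyId];
        }
    }

    /** sum(revenue): one double slot per group, addressed by key id. */
    static final class SumAgg {
        private double[] sums = new double[16];

        void initAggregation(int keyId) {
            if (keyId >= sums.length) {
                sums = Arrays.copyOf(sums, Math.max(keyId + 1, sums.length * 2));
            }
            sums[keyId] = 0.0;
        }

        void iterate(int keyId, double value) {
            sums[keyId] += value;
        }

        double terminate(int keyId) {
            return sums[keyId];
        }
    }

    /** Hypothetical driver for: SELECT departmentid, count(1), sum(revenue) ... GROUP BY departmentid. */
    static final class GroupBy {
        // Stand-in only: a real implementation would use a map with primitive int values.
        private final Map<Integer, Integer> keyToId = new HashMap<>();
        final CountAgg count1 = new CountAgg();
        final SumAgg sumRevenue = new SumAgg();

        void processRow(int departmentId, double revenue) {
            Integer keyId = keyToId.get(departmentId);
            if (keyId == null) {
                keyId = keyToId.size(); // dense, 0-based ids (assumption of this sketch)
                keyToId.put(departmentId, keyId);
                count1.initAggregation(keyId);
                sumRevenue.initAggregation(keyId);
            }
            count1.iterate(keyId);
            sumRevenue.iterate(keyId, revenue);
        }

        int idOf(int departmentId) {
            return keyToId.get(departmentId);
        }
    }

    public static void main(String[] args) {
        GroupBy groupBy = new GroupBy();
        groupBy.processRow(10, 5.0);
        groupBy.processRow(20, 1.5);
        groupBy.processRow(10, 2.5);
        for (int dept : new int[] {10, 20}) {
            int id = groupBy.idOf(dept);
            System.out.println(dept + "\t" + groupBy.count1.terminate(id)
                + "\t" + groupBy.sumRevenue.terminate(id));
        }
    }
}
```

The point of the layout is that each new group costs one slot in each primitive array plus one int in the hash map, instead of one aggregation object per UDAF per group.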
> Memory-efficient hash-based Aggregation
> ---------------------------------------
>
> Key: HIVE-535
> URL: https://issues.apache.org/jira/browse/HIVE-535
> Project: Hadoop Hive
> Issue Type: Improvement
> Affects Versions: 0.4.0
> Reporter: Zheng Shao
>
> Currently there is a lot of memory overhead in the hash-based aggregation in
> GroupByOperator.
> The net result is that GroupByOperator cannot store many entries in its
> HashTable, flushes frequently, and therefore gets much less benefit from
> partial aggregation.
> Here are some initial thoughts (some of them came from Joydeep a long time ago):
> A1. Serialize the key of the HashTable. This will eliminate the 16-byte
> per-object overhead of Java in keys (depending on how many objects there are
> in the key, the saving can be substantial).
> A2. Use more memory-efficient hash tables - java.util.HashMap has about 64
> bytes of overhead per entry.
> A3. Use primitive arrays to store aggregation results. Basically, the UDAF
> should manage the arrays of aggregation results, so UDAFCount should manage a
> long[], and UDAFAvg should manage a double[] and a long[]. The external code
> should pass an index to iterate/merge/terminate an aggregation result. This
> will eliminate the 16-byte per-object overhead of Java.
> More ideas are welcome.
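
To make A3 concrete, here is a rough sketch of what an index-addressed average could look like, with a double[] of sums and a long[] of counts. It is not the real UDAFAvg: the class name IndexedAvgSketch, the method signatures, and the choice to return NaN for an empty group are all assumptions made for illustration.

```java
import java.util.Arrays;

final class IndexedAvgSketch {
    // Parallel primitive arrays: slot i holds the running state of group i.
    private double[] sums = new double[8];
    private long[] counts = new long[8];

    /** Reserve and reset the slot for a new group. */
    void initAggregation(int index) {
        if (index >= sums.length) {
            int newLength = Math.max(index + 1, sums.length * 2);
            sums = Arrays.copyOf(sums, newLength);
            counts = Arrays.copyOf(counts, newLength);
        }
        sums[index] = 0.0;
        counts[index] = 0L;
    }

    /** Fold one input value into the group at the given index. */
    void iterate(int index, double value) {
        sums[index] += value;
        counts[index]++;
    }

    /** Fold a partial (sum, count) pair, e.g. from another task, into the group. */
    void merge(int index, double partialSum, long partialCount) {
        sums[index] += partialSum;
        counts[index] += partialCount;
    }

    /** Final result for the group; NaN for an empty group is this sketch's choice. */
    double terminate(int index) {
        return counts[index] == 0L ? Double.NaN : sums[index] / counts[index];
    }

    public static void main(String[] args) {
        IndexedAvgSketch avg = new IndexedAvgSketch();
        avg.initAggregation(0);
        avg.iterate(0, 2.0);
        avg.iterate(0, 4.0);
        avg.merge(0, 6.0, 1L);
        System.out.println(avg.terminate(0));
    }
}
```

The caller only ever holds an int index per group, so the per-group state is 16 bytes of array payload (one double plus one long) with no object header.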