[
https://issues.apache.org/jira/browse/HIVE-12763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Pengcheng Xiong updated HIVE-12763:
-----------------------------------
Attachment: aggrStatsPerformance.png
As per [~jpullokkaran]'s request, I tested the time and space cost of
aggrStats on my Mac. In the attached plot, the x-axis is the number of
partitions and the y-axis is the time taken to aggregate the stats of those
partitions, in ms. As the number of partitions increases, the aggrStats time
increases, but it still runs quite fast: 475 ms for 1000 partitions. I cannot
go beyond 1000 because my Mac dies when I increase it to 2000. The time cost is
low mainly because the merge operation is simple (bitwise OR). The space cost
is also good. We use 16 bit vectors, and each bit vector is an array of at most
31 integers of 4 bytes each, so one partition needs at most 16*31*4B (about
2KB), multiplied by the number of partitions. In an extreme case of 1 million
partitions, the total space is 16*31*4B*1M (around 2GB). This is the space we
need when we want to store every bit vector in HBaseStore (without considering
serialization). When we aggregate the partition stats one by one, we only need
memory for two sets of bit vectors at a time, 16*31*4B*2 (around 4KB).
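To make the arithmetic and the one-by-one merge concrete, here is a minimal
Python sketch. It is not the code from the patch: the names (bytes_per_partition,
merge, aggregate) are hypothetical, and it assumes every bit vector is padded to
the 31-integer upper bound so that vectors can be OR-ed word by word.

```python
from functools import reduce

# Sizes taken from the comment above; everything else is illustrative.
NUM_BIT_VECTORS = 16       # bit vectors kept per partition
MAX_INTS_PER_VECTOR = 31   # upper bound on integers per bit vector
BYTES_PER_INT = 4          # one 32-bit integer


def bytes_per_partition():
    """Upper bound on the raw size of one partition's bit vectors."""
    return NUM_BIT_VECTORS * MAX_INTS_PER_VECTOR * BYTES_PER_INT


def merge(acc, other):
    """Bitwise-OR two sets of bit vectors, integer by integer."""
    return [[a | b for a, b in zip(va, vb)] for va, vb in zip(acc, other)]


def aggregate(partitions):
    """Fold partitions into one accumulator, one partition at a time.

    Only the accumulator and the current partition are needed at once,
    which is where the 16*31*4B*2 working-memory figure comes from.
    """
    empty = [[0] * MAX_INTS_PER_VECTOR for _ in range(NUM_BIT_VECTORS)]
    return reduce(merge, partitions, empty)


if __name__ == "__main__":
    print(bytes_per_partition())               # bytes for one partition
    print(bytes_per_partition() * 1_000_000)   # bytes for 1M partitions
    print(bytes_per_partition() * 2)           # bytes while merging one by one
```

Because aggregate accepts any iterable, partitions can be streamed from the
store instead of being loaded all at once, which is what keeps the working
memory at roughly 4KB regardless of the partition count.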
> Use bit vector to track NDV
> ---------------------------
>
> Key: HIVE-12763
> URL: https://issues.apache.org/jira/browse/HIVE-12763
> Project: Hive
> Issue Type: Improvement
> Reporter: Pengcheng Xiong
> Assignee: Pengcheng Xiong
> Attachments: HIVE-12763.01.patch, HIVE-12763.02.patch,
> HIVE-12763.03.patch, HIVE-12763.04.patch, HIVE-12763.05.patch,
> aggrStatsPerformance.png
>
>
> This will improve merging of per-partition stats. It will also help merge
> NDV for auto-gather column stats.