[ https://issues.apache.org/jira/browse/KYLIN-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiaoxiang Yu closed KYLIN-3487.
-------------------------------
Resolved in release 3.1.0 (2020-07-03)
> Create a new measure for precise count distinct
> -----------------------------------------------
>
> Key: KYLIN-3487
> URL: https://issues.apache.org/jira/browse/KYLIN-3487
> Project: Kylin
> Issue Type: Improvement
> Reporter: Zhong Yanghong
> Assignee: Zhong Yanghong
> Priority: Major
> Fix For: v3.1.0
>
>
> To compute a precise count distinct, we can use a bitmap together with a
> global dictionary. However, the global dictionary has a limitation: it maps
> values to integer ids, so the number of ids must stay below about 2 billion
> (2B). It is also like a Pixiu: it only ever grows and never shrinks.
> At eBay, there is a requirement to calculate the precise count distinct of
> sessions. The session cardinality is large and keeps growing over time, so
> the global dictionary becomes infeasible once its cardinality exceeds the
> 2B upper bound. How can we deal with this?
> The good news is that a session never spans more than one day. Given this
> property, we don't need to merge bitmaps across days. To calculate the
> precise session cardinality, we can assign each day its own bitmap and
> simply sum the cardinalities reported by the bitmaps. No bitmap merge is
> needed.
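
The per-day summation described above can be sketched roughly as follows.
This is a minimal illustration under stated assumptions, not Kylin's
implementation: java.util.BitSet stands in for whatever bitmap implementation
is used in practice, and the class name DailySessionCount, the method
totalDistinct, and the day keys are all hypothetical.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: one bitmap per day, cardinalities added up without merging.
final class DailySessionCount {

    // Sums the per-day distinct counts.
    // Assumption: no session appears on more than one day, so nothing is double counted.
    static long totalDistinct(Map<String, BitSet> bitmapsByDay) {
        long total = 0;
        for (BitSet dayBitmap : bitmapsByDay.values()) {
            total += dayBitmap.cardinality();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, BitSet> bitmapsByDay = new LinkedHashMap<>();

        BitSet day1 = new BitSet();
        day1.set(0);
        day1.set(1);
        day1.set(1); // the same session seen twice on day 1 is counted once

        BitSet day2 = new BitSet();
        day2.set(0); // id 0 on day 2 is a different session than id 0 on day 1
        day2.set(1);
        day2.set(2);

        bitmapsByDay.put("2018-08-01", day1);
        bitmapsByDay.put("2018-08-02", day2);

        System.out.println(totalDistinct(bitmapsByDay)); // prints 5
    }
}
```

Note that ids only have to be unique within a single day here. Reusing id 0
on both days still yields the right total, which holds only because a session
never spans days.
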
> To use a bitmap for cardinality calculation, we need to map each raw value
> to an integer id, which is done by encoding the value with a dictionary.
> Previously, a global dictionary was used so that bitmaps from multiple
> segments could be merged. In this case, however, bitmaps never need to be
> merged, so the global dictionary is not needed.
> We also don't need to filter by or group by session, so there is no need to
> map from value to id or from id to value once the bitmap has been
> constructed. Therefore, we don't need to store dictionaries for session;
> the bitmap alone is enough.
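
One way to picture "only the bitmap is enough" is a build step that uses a
throwaway dictionary and keeps nothing but the bitmap. The sketch below
assumes a segment-scoped, in-memory dictionary. The issue does not spell out
how ids are assigned at build time, so the class SegmentBitmapBuilder, the
method build, and the id assignment scheme are hypothetical, and
java.util.BitSet is again only a stand-in.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: encode values with a temporary dictionary, keep only the bitmap.
final class SegmentBitmapBuilder {

    static BitSet build(List<String> sessionIds) {
        // Temporary value -> id dictionary, local to this one build.
        Map<String, Integer> localDict = new HashMap<>();
        BitSet bitmap = new BitSet();
        for (String sessionId : sessionIds) {
            Integer id = localDict.get(sessionId);
            if (id == null) {
                id = localDict.size(); // next unused id, starting from 0
                localDict.put(sessionId, id);
            }
            bitmap.set(id);
        }
        // The dictionary is not returned or stored; only the bitmap survives.
        return bitmap;
    }

    public static void main(String[] args) {
        BitSet bitmap = build(Arrays.asList("s1", "s2", "s1", "s3"));
        System.out.println(bitmap.cardinality()); // prints 3
    }
}
```

Because nothing ever filters or groups by session, the id-to-value direction
is never needed, so dropping the dictionary loses no information that a query
would use.
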
> To deal with segment merge, since the bitmaps of the individual segments
> cannot be merged into one bitmap, we use a map to store multiple bitmaps.
> In the map, the key is the segment name and the value is the segment-level
> bitmap.
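
The segment-merge behaviour described above might look roughly like this. It
is a sketch under assumptions, not the code from the patch: the class
SegmentBitmapMap, its methods, and the segment names are hypothetical,
segment names are assumed to be unique, and java.util.BitSet stands in for
the real bitmap type.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a measure value holding one bitmap per source segment.
final class SegmentBitmapMap {
    // Key: segment name. Value: that segment's bitmap.
    private final Map<String, BitSet> bitmapsBySegment = new TreeMap<>();

    void put(String segmentName, BitSet bitmap) {
        bitmapsBySegment.put(segmentName, bitmap);
    }

    // Merging keeps every segment's bitmap as its own entry; bitmaps are never OR-ed.
    // Assumption: segment names are unique, so no entry overwrites another.
    static SegmentBitmapMap merge(SegmentBitmapMap left, SegmentBitmapMap right) {
        SegmentBitmapMap merged = new SegmentBitmapMap();
        merged.bitmapsBySegment.putAll(left.bitmapsBySegment);
        merged.bitmapsBySegment.putAll(right.bitmapsBySegment);
        return merged;
    }

    int segmentCount() {
        return bitmapsBySegment.size();
    }

    // Precise count distinct: the sum of the per-segment cardinalities.
    long count() {
        long total = 0;
        for (BitSet bitmap : bitmapsBySegment.values()) {
            total += bitmap.cardinality();
        }
        return total;
    }

    public static void main(String[] args) {
        BitSet first = new BitSet();
        first.set(0);
        first.set(1);
        BitSet second = new BitSet();
        second.set(0);
        second.set(1);
        second.set(2);

        SegmentBitmapMap left = new SegmentBitmapMap();
        left.put("seg_20180801", first);
        SegmentBitmapMap right = new SegmentBitmapMap();
        right.put("seg_20180802", second);

        SegmentBitmapMap merged = merge(left, right);
        System.out.println(merged.segmentCount()); // prints 2
        System.out.println(merged.count());        // prints 5
    }
}
```

The trade-off is that a merged segment carries as many bitmaps as it has
source segments, in exchange for never needing ids that are comparable across
segments.
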
--
This message was sent by Atlassian Jira
(v8.3.4#803005)