[ 
https://issues.apache.org/jira/browse/KYLIN-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nichunen updated KYLIN-3487:
----------------------------
    Sprint: Sprint 51

> Create a new measure for precise count distinct
> -----------------------------------------------
>
>                 Key: KYLIN-3487
>                 URL: https://issues.apache.org/jira/browse/KYLIN-3487
>             Project: Kylin
>          Issue Type: Improvement
>            Reporter: Zhong Yanghong
>            Assignee: Zhong Yanghong
>            Priority: Major
>             Fix For: v3.1.0
>
>
> To compute a precise count distinct, we can use a bitmap together with a 
> global dictionary. However, the global dictionary has a limitation: it maps 
> values to ids of type integer, so the number of ids is capped at about 2B. 
> It is also like a Pixiu: it only grows and never shrinks. 
> At eBay, there is a requirement to calculate the precise count distinct of 
> sessions. The session cardinality is large and keeps growing over time, so 
> the global dictionary stops being feasible once its cardinality exceeds the 
> 2B upper bound. How can we deal with this?
> The good news is that a session never crosses days. Given this property, we 
> don't need to merge bitmaps across days. To calculate the precise session 
> cardinality, we can assign each day its own bitmap and directly sum the 
> cardinalities counted by each bitmap. No bitmap merge is needed. 
> To use a bitmap for cardinality calculation, we need to map each raw value 
> to an integer id, which is done by encoding the value with a dictionary. 
> Previously, the global dictionary was used so that bitmaps from multiple 
> segments could be merged. In this case, however, no bitmap merge is needed, 
> so the global dictionary is not needed either. 
> We also don't need to filter by or group by session, so there is no need to 
> map from value to id or from id to value once the related bitmap has been 
> constructed. Therefore, we don't need to store dictionaries for session; 
> the bitmap alone is enough.
> To deal with segment merge, since the bitmaps of different segments cannot 
> be merged into one bitmap, we use a map to store multiple bitmaps. In the 
> map, the key is the segment name and the value is the segment-level bitmap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
