[ 
https://issues.apache.org/jira/browse/KYLIN-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhong Yanghong updated KYLIN-3487:
----------------------------------
    Description: 
To compute a precise count distinct, we can use a bitmap together with a global 
dictionary. However, the global dictionary has a limitation: it maps values to 
ids whose type is integer, so the number of ids must stay below 2B (about 2^31). 
It is also like a Pixiu: it only grows and never shrinks. 

At eBay, there is a requirement to calculate the precise count distinct of 
sessions. The session cardinality is large and keeps growing over time, so the 
global dictionary stops being feasible once its cardinality exceeds the 2B 
upper bound. How can we deal with this?

The good news is that a session never crosses days. Given this property, we 
don't need to merge bitmaps across days. To calculate the precise session 
cardinality, we can assign each day its own bitmap and directly sum the 
cardinalities of the per-day bitmaps. No bitmap merge is needed. 
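
A minimal sketch of the per-day summation, under stated assumptions: 
java.util.BitSet stands in for whatever compressed bitmap the measure would 
really use, and the class name, method names and day keys are hypothetical, 
not part of Kylin.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; BitSet is a stand-in for the real bitmap type.
class PerDayCardinality {

    /**
     * Sums the cardinality of each per-day bitmap. This equals the precise
     * distinct count only because no session appears on two different days.
     */
    static long totalSessions(Map<String, BitSet> bitmapByDay) {
        long total = 0;
        for (BitSet dayBitmap : bitmapByDay.values()) {
            total += dayBitmap.cardinality();
        }
        return total;
    }

    /** Two days whose ids overlap because each day numbers its sessions from 0. */
    static Map<String, BitSet> sampleDays() {
        BitSet day1 = new BitSet();
        day1.set(0);
        day1.set(1);
        day1.set(2);
        BitSet day2 = new BitSet();
        day2.set(0);
        day2.set(1);
        Map<String, BitSet> bitmapByDay = new LinkedHashMap<>();
        bitmapByDay.put("day-1", day1);
        bitmapByDay.put("day-2", day2);
        return bitmapByDay;
    }

    public static void main(String[] args) {
        // 3 sessions on day-1 plus 2 sessions on day-2.
        System.out.println(totalSessions(sampleDays())); // prints 5
    }
}
```

Note that OR-ing the two sample bitmaps would give 3 rather than 5, because 
the ids are only meaningful within a single day. That is why the 
cardinalities are summed instead of the bitmaps being merged.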

To use a bitmap for cardinality calculation, we need to map each raw value to 
an integer id, which is done by encoding the value with a dictionary. 
Previously, the global dictionary was used so that bitmaps from multiple 
segments could be merged. In this case, however, no bitmap merge is needed, so 
the global dictionary is not needed either. 

We also don't need to filter by or group by session, so there is no need to map 
between values and ids once the related bitmap has been constructed. 
Therefore, we don't need to store dictionaries for session; the bitmap alone 
is enough.
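
As a rough sketch of that idea (not the actual Kylin implementation), a 
segment-level bitmap could be built with a dictionary that exists only during 
the build and is never stored. The class and method names are hypothetical, 
and java.util.BitSet again stands in for the real bitmap type.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a build-time-only dictionary.
class SegmentBitmapBuilder {

    /** Encodes each session value to a segment-local id and sets that bit. */
    static BitSet build(List<String> sessionValues) {
        // Local dictionary: used only while building, never persisted.
        Map<String, Integer> localDict = new HashMap<>();
        BitSet bitmap = new BitSet();
        for (String value : sessionValues) {
            Integer id = localDict.get(value);
            if (id == null) {
                id = localDict.size(); // next unused id within this segment
                localDict.put(value, id);
            }
            bitmap.set(id);
        }
        // Only the bitmap is returned; the dictionary is dropped here.
        return bitmap;
    }
}
```

Because the query never filters or groups by session, nothing downstream has 
to translate an id back to a session value, so discarding the dictionary 
loses no information the query needs.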

To deal with segment merge, since the bitmaps of different segments cannot be 
merged into one bitmap, we use a map to store multiple bitmaps. In the map, the 
key is the segment name and the value is the segment-level bitmap.
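
The segment-merge behaviour described above might look roughly like the 
following. This is a sketch under assumptions: the class name, its methods, 
and the rule that segment names must be unique are illustrative choices, and 
java.util.BitSet is a stand-in for the real bitmap type.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical measure state: segment name -> segment-level bitmap.
class SegmentBitmapMap {
    private final Map<String, BitSet> bitmapBySegment = new LinkedHashMap<>();

    void put(String segmentName, BitSet bitmap) {
        // Assumption: each segment name appears at most once.
        if (bitmapBySegment.putIfAbsent(segmentName, bitmap) != null) {
            throw new IllegalArgumentException("duplicate segment: " + segmentName);
        }
    }

    /** Merging keeps every entry side by side; bitmaps are never OR-ed. */
    SegmentBitmapMap merge(SegmentBitmapMap other) {
        SegmentBitmapMap merged = new SegmentBitmapMap();
        for (Map.Entry<String, BitSet> e : this.bitmapBySegment.entrySet()) {
            merged.put(e.getKey(), e.getValue());
        }
        for (Map.Entry<String, BitSet> e : other.bitmapBySegment.entrySet()) {
            merged.put(e.getKey(), e.getValue());
        }
        return merged;
    }

    /** Number of segment-level bitmaps held in the map. */
    int size() {
        return bitmapBySegment.size();
    }

    /** Precise count distinct: the sum of the per-segment cardinalities. */
    long countDistinct() {
        long total = 0;
        for (BitSet bitmap : bitmapBySegment.values()) {
            total += bitmap.cardinality();
        }
        return total;
    }
}
```

With this shape, a merged segment simply carries more map entries, and the 
count is still obtained by summing the cardinality of each entry.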

  was:
At eBay, there will be around 20M sessions each day, and there is a 
requirement to calculate the count distinct of sessions.

For deep dives, users want to get the session cardinality over a year, or even 
several years. For just one year, the total cardinality will be around 
20M*360 = 7.2B > 2B. It will exceed the upper limit of the bitmap, and will 
not be good for 


To calculate the count distinct of sessions, if a session never crosses days, 
it is meaningless to merge the related counters, bitmap or hll, across days.


For count distinct of sessions, merging across days is meaningless, because a 
session never crosses days. Therefore, we may need a new measure containing a 
map that uses the date info as the key and a bitmap or hll as the value. When 
calculating count distinct, we only need to get the state of each key-value 
entry and then sum the states. We don't need to merge bitmaps or hlls across 
different key-value entries.


> Create a new measure for precise count distinct
> -----------------------------------------------
>
>                 Key: KYLIN-3487
>                 URL: https://issues.apache.org/jira/browse/KYLIN-3487
>             Project: Kylin
>          Issue Type: Improvement
>            Reporter: Zhong Yanghong
>            Assignee: Zhong Yanghong
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
