[jira] [Created] (KYLIN-5640) Support to automatically adjust the Bloom Filter based on data distribution

Zhiting Guo (Jira) Mon, 17 Jul 2023 18:35:04 -0700

Zhiting Guo created KYLIN-5640:
----------------------------------

             Summary: Support to automatically adjust the Bloom Filter based on 
data distribution
                 Key: KYLIN-5640
                 URL: https://issues.apache.org/jira/browse/KYLIN-5640
             Project: Kylin
          Issue Type: Improvement
          Components: Query Engine
    Affects Versions: 5.0-alpha
            Reporter: Zhiting Guo
             Fix For: 5.0-alpha



h3. Why are the changes needed?

Currently, a Bloom filter can only be built after specifying the NDV (number 
of distinct values) up front. In general scenarios, the actual number of 
distinct values is not known in advance.
If the Bloom filter could be sized automatically according to the data, the 
file size could be reduced and read efficiency improved.

h3. What changes were proposed in this pull request?

{{DynamicBlockBloomFilter}} holds multiple {{BlockSplitBloomFilter}} instances 
as candidates and inserts each value into all of them at the same time. The 
largest candidate serves as an approximate distinct-value counter, and 
candidates that are too small for the number of distinct values seen so far 
are removed during data insertion.
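
The candidate-pruning idea described above can be sketched roughly as follows. 
This is an illustrative Python sketch only, not the code from the pull request: 
{{SimpleBloomFilter}} is a hypothetical plain Bloom filter standing in for 
{{BlockSplitBloomFilter}}, {{DynamicBloomFilterSketch}} and its method names 
are assumptions, and the sizing uses the textbook formula (m = -n ln p / (ln 2)^2 
bits for n distinct values at false-positive rate p) rather than whatever the 
real implementation uses.

```python
import hashlib
import math


class SimpleBloomFilter:
    """Hypothetical plain Bloom filter standing in for BlockSplitBloomFilter."""

    def __init__(self, expected_ndv, fpp=0.01):
        self.expected_ndv = expected_ndv
        # Textbook sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-expected_ndv * math.log(fpp) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / expected_ndv * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, value):
        # Derive k bit positions from two 64-bit hashes (double hashing).
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def might_contain(self, value):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(value))

    def insert(self, value):
        for p in self._positions(value):
            self.bits[p >> 3] |= 1 << (p & 7)


class DynamicBloomFilterSketch:
    """Keeps several candidate filters and drops those that become too small."""

    def __init__(self, candidate_ndvs, fpp=0.01):
        # Candidates are kept sorted by capacity; the last one is the largest.
        self.candidates = [SimpleBloomFilter(n, fpp) for n in sorted(candidate_ndvs)]
        self.distinct_estimate = 0

    def insert(self, value):
        largest = self.candidates[-1]
        # The largest filter doubles as an approximate distinct counter:
        # a value it has not seen yet is counted as new. False positives
        # can only make this estimate too low, never too high.
        if not largest.might_contain(value):
            self.distinct_estimate += 1
            # Drop candidates whose capacity is exceeded; always keep the largest.
            smaller = [c for c in self.candidates[:-1]
                       if c.expected_ndv >= self.distinct_estimate]
            self.candidates = smaller + [largest]
        for candidate in self.candidates:
            candidate.insert(value)

    def best_candidate(self):
        # Smallest surviving filter, i.e. the most compact one that still fits.
        return self.candidates[0]


if __name__ == "__main__":
    dyn = DynamicBloomFilterSketch([100, 1000, 10000])
    for i in range(500):
        dyn.insert(f"key-{i}")
    chosen = dyn.best_candidate()
    print("estimated NDV:", dyn.distinct_estimate)
    print("chosen capacity:", chosen.expected_ndv, "bits:", chosen.num_bits)
```

In this sketch the smallest surviving candidate is the one that would be 
written out, so the caller never has to supply an NDV; the cost is the extra 
memory and hashing work of maintaining several filters during the build.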



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
