[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mars updated PARQUET-2254:
--------------------------
    Description: 
h3. Why are the changes needed?

Currently, a bloom filter is built from a user-specified NDV (number of 
distinct values). In practice, the number of distinct values is usually not 
known in advance.
If the BloomFilter can be sized automatically according to the data, the file 
size can be reduced and the reading efficiency can also be improved.
h3. What changes were proposed in this pull request?

{{DynamicBlockBloomFilter}} holds multiple {{BlockSplitBloomFilter}} instances 
as candidates and inserts each value into all of them at the same time. The 
largest bloom filter serves as an approximate distinct-value counter, and 
candidates that are too small for the values seen so far are removed during 
data insertion.
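
The candidate-elimination idea described above could be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: {{DynamicBloomSketch}}, {{SimpleBloom}}, the plain (non-blocked) filter layout, the hash mixing, and the "about 10 bits per distinct value" capacity rule are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Illustrative sketch only; this is not the parquet-mr DynamicBlockBloomFilter.
class DynamicBloomSketch {

    // Plain Bloom filter used here as a stand-in for BlockSplitBloomFilter.
    static final class SimpleBloom {
        static final int NUM_HASHES = 7;
        final int numBits;
        final int capacity; // assumed rule: about 10 bits per distinct value
        private final BitSet bits;

        SimpleBloom(int numBits) {
            this.numBits = numBits;
            this.capacity = numBits / 10;
            this.bits = new BitSet(numBits);
        }

        // Double hashing: derive the i-th bit position from the two hash halves.
        private int index(long hash, int i) {
            int h1 = (int) hash;
            int h2 = (int) (hash >>> 32) | 1;
            return Math.floorMod(h1 + i * h2, numBits);
        }

        void insert(long hash) {
            for (int i = 0; i < NUM_HASHES; i++) bits.set(index(hash, i));
        }

        boolean mightContain(long hash) {
            for (int i = 0; i < NUM_HASHES; i++) {
                if (!bits.get(index(hash, i))) return false;
            }
            return true;
        }
    }

    // Candidates in ascending size order; the last one is the largest.
    private final List<SimpleBloom> candidates = new ArrayList<>();
    private int distinctEstimate = 0;

    // sizesInBits must be non-empty and sorted in ascending order.
    DynamicBloomSketch(int[] sizesInBits) {
        for (int size : sizesInBits) candidates.add(new SimpleBloom(size));
    }

    // 64-bit finalizer from MurmurHash3, used to spread the input bits.
    static long mix(long x) {
        x ^= x >>> 33;
        x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33;
        x *= 0xc4ceb9fe1a85ec53L;
        x ^= x >>> 33;
        return x;
    }

    void insert(long value) {
        long hash = mix(value);
        SimpleBloom largest = candidates.get(candidates.size() - 1);
        // A miss in the largest filter counts as a new distinct value.
        boolean unseen = !largest.mightContain(hash);
        for (SimpleBloom c : candidates) c.insert(hash);
        if (unseen) {
            distinctEstimate++;
            // Drop candidates that are too small, but always keep the largest.
            candidates.removeIf(c -> c != largest && c.capacity < distinctEstimate);
        }
    }

    boolean mightContain(long value) { return best().mightContain(mix(value)); }

    // Smallest candidate that is still large enough for the values seen so far.
    SimpleBloom best() { return candidates.get(0); }

    int distinctEstimate() { return distinctEstimate; }

    public static void main(String[] args) {
        DynamicBloomSketch f = new DynamicBloomSketch(new int[] {1 << 10, 1 << 12, 1 << 14, 1 << 16});
        for (long v = 0; v < 1000; v++) f.insert(v);
        System.out.println("estimate=" + f.distinctEstimate() + ", chosen bits=" + f.best().numBits);
    }
}
```

Because every value is inserted into all surviving candidates from the start, whichever candidate is kept at the end contains every value and gives no false negatives. The counter can only undercount, since a false positive in the largest filter makes a new value look like a repeat.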

  was:
Currently the user specifies the size, and then the BloomFilter is built. In 
practice, the number of distinct values is usually not known in advance. 
If the BloomFilter can be sized automatically according to the data, the file 
size can be reduced and the reading efficiency can also be improved.

My idea is that the user specifies a maximum BloomFilter size, and we build 
multiple BloomFilters at the same time. The largest BloomFilter can serve as a 
counting tool (if a value gets no hit when it is inserted, the counter is 
incremented by 1; this may be imprecise, but it is accurate enough).
At the end of the write, a BloomFilter of a more appropriate size is chosen 
when the file is finally written.

I want to implement this feature and hope to get your opinions. Thank you.


> Build a BloomFilter with a more precise size
> --------------------------------------------
>
>                 Key: PARQUET-2254
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2254
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Mars
>            Assignee: Mars
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
