[jira] [Updated] (PARQUET-2254) Build a BloomFilter with a more precise size

Mars (Jira) Thu, 11 May 2023 19:50:06 -0700


     [ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Mars updated PARQUET-2254:
--------------------------
    Description: 
*Why are the changes needed?*
Currently, a bloom filter is built by specifying either the NDV (number of 
distinct values) or the max bytes up front. In practice, the number of 
distinct values is usually not known in advance.
If the BloomFilter can be sized automatically according to the data, the file 
size can be reduced and the reading efficiency can also be improved.

*What changes were proposed in this pull request?*
`AdaptiveBlockSplitBloomFilter` holds multiple `BlockSplitBloomFilter` 
instances as candidates and inserts each value into all of the candidates at 
the same time. At the end, the smallest candidate is chosen and written out.
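
The candidate approach described above can be illustrated with a minimal sketch. This is not the parquet-mr implementation. Everything in it is an assumption made for illustration: the class name `AdaptiveBloomSketch`, a plain Bloom filter with 4 probes standing in for `BlockSplitBloomFilter`, candidate sizes that halve from the maximum, a capacity of one value per 10 bits, the use of the largest candidate as an approximate distinct counter (borrowed from the earlier wording of this issue), and reading "smallest" as the smallest candidate whose assumed capacity still covers the estimated number of distinct values.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical illustration only; not the AdaptiveBlockSplitBloomFilter from the PR. */
final class AdaptiveBloomSketch {
    static final int K = 4;                // probes per value (assumed)
    static final int BITS_PER_VALUE = 10;  // assumed sizing rule for candidate capacity

    /** A plain Bloom filter with an assumed capacity in distinct values. */
    static final class Candidate {
        final int numBits;
        final long expectedNdv;
        private final BitSet bits;

        Candidate(int numBits) {
            this.numBits = numBits;
            this.expectedNdv = numBits / BITS_PER_VALUE;
            this.bits = new BitSet(numBits);
        }

        // Double hashing: derive the i-th probe position from one 64-bit hash.
        private int position(long hash, int i) {
            long step = (hash >>> 32) | 1L;
            return (int) Long.remainderUnsigned(hash + i * step, numBits);
        }

        void insert(long hash) {
            for (int i = 0; i < K; i++) bits.set(position(hash, i));
        }

        boolean mightContain(long hash) {
            for (int i = 0; i < K; i++) {
                if (!bits.get(position(hash, i))) return false;
            }
            return true;
        }
    }

    private final List<Candidate> candidates = new ArrayList<>();
    private long distinctEstimate = 0;

    /** Builds up to numCandidates filters; index 0 is the largest, each next one is half the size. */
    AdaptiveBloomSketch(int maxBytes, int numCandidates) {
        if (maxBytes < 8 || numCandidates < 1) {
            throw new IllegalArgumentException("need maxBytes >= 8 and numCandidates >= 1");
        }
        int numBits = maxBytes * 8;
        for (int i = 0; i < numCandidates && numBits >= 64; i++) {
            candidates.add(new Candidate(numBits));
            numBits /= 2;
        }
    }

    /** Inserts the hash into every candidate; the largest one doubles as a rough distinct counter. */
    void insert(long hash) {
        if (!candidates.get(0).mightContain(hash)) distinctEstimate++;
        for (Candidate c : candidates) c.insert(hash);
    }

    long distinctEstimate() {
        return distinctEstimate;
    }

    /** Returns the smallest candidate whose assumed capacity covers the estimate, else the largest. */
    Candidate choose() {
        Candidate best = candidates.get(0);
        for (Candidate c : candidates) {
            if (c.expectedNdv >= distinctEstimate && c.numBits < best.numBits) best = c;
        }
        return best;
    }

    /** Simple 64-bit mixer (splitmix64 finalizer) used here in place of a real value hash. */
    static long mix(long x) {
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }
}
```

The cost of this design is that every value is inserted into all candidates, so write-time work grows with the number of candidates, in exchange for not having to know the NDV before writing.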


*Does this PR introduce any user-facing change?*
Yes, it adds new configuration options:
`parquet.bloom.filter.adaptive.enabled`: default false. Whether to enable 
writing an adaptive bloom filter.  
If it is true, the bloom filter is generated with the optimal bit size for 
the actual number of distinct values in the data. If it is false, the setting 
has no effect.
Note that the size of the bloom filter will not exceed the 
`parquet.bloom.filter.max.bytes` configuration (if that value is set too 
small, the generated bloom filter will not be efficient).

`parquet.bloom.filter.candidates.number`: default 5. The number of candidate 
bloom filters built at the same time.  
When `parquet.bloom.filter.adaptive.enabled` is true, values are inserted into 
multiple candidate bloom filters at the same time; at the end, the bloom 
filter with the optimal bit size is selected and written to the file.
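
As a rough illustration of how the options above fit together, the following sketch reads them with the defaults stated in this description. It uses `java.util.Properties` as a stand-in for the writer's real configuration object, and the class `AdaptiveBloomSettings`, its method names, and the example values are hypothetical; only the three property keys and the two defaults come from the description.

```java
import java.util.Properties;

/** Hypothetical helper; only the property keys and defaults are taken from the issue text. */
final class AdaptiveBloomSettings {
    static final String ADAPTIVE_ENABLED = "parquet.bloom.filter.adaptive.enabled";
    static final String CANDIDATES_NUMBER = "parquet.bloom.filter.candidates.number";
    static final String MAX_BYTES = "parquet.bloom.filter.max.bytes";

    /** Defaults to false, as stated in the description. */
    static boolean adaptiveEnabled(Properties p) {
        return Boolean.parseBoolean(p.getProperty(ADAPTIVE_ENABLED, "false"));
    }

    /** Defaults to 5, as stated in the description. */
    static int candidatesNumber(Properties p) {
        return Integer.parseInt(p.getProperty(CANDIDATES_NUMBER, "5"));
    }

    /** Example settings: adaptive mode on, 8 candidates, a 64 KiB cap (values chosen arbitrarily). */
    static Properties example() {
        Properties p = new Properties();
        p.setProperty(ADAPTIVE_ENABLED, "true");
        p.setProperty(CANDIDATES_NUMBER, "8");
        p.setProperty(MAX_BYTES, String.valueOf(64 * 1024));
        return p;
    }
}
```

With an empty `Properties` object the helper reports adaptive mode as disabled and 5 candidates, matching the defaults listed above.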

 

  was:
h3. Why are the changes needed?

Currently, a bloom filter is built by specifying the NDV (number of distinct 
values) up front. In practice, the number of distinct values is usually not 
known in advance.
If the BloomFilter can be sized automatically according to the data, the file 
size can be reduced and the reading efficiency can also be improved.
h3. What changes were proposed in this pull request?

{{DynamicBlockBloomFilter}} holds multiple {{BlockSplitBloomFilter}} instances 
as candidates and inserts each value into all of the candidates at the same 
time. It uses the largest bloom filter as an approximate deduplication counter 
and removes candidates that are too small for the data during insertion.


> Build a BloomFilter with a more precise size
> --------------------------------------------
>
>                 Key: PARQUET-2254
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2254
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Mars
>            Assignee: Mars
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
