[ https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mars updated PARQUET-2254:
--------------------------
Description:
*Why are the changes needed?*
Currently, a bloom filter is built by specifying the NDV (number of distinct
values) or max bytes up front. In practice, the number of distinct values is
usually not known in advance.
If the BloomFilter can be sized automatically according to the data, the file
size can be reduced and read efficiency can also be improved.
*What changes were proposed in this pull request?*
`AdaptiveBlockSplitBloomFilter` holds multiple `BlockSplitBloomFilter` instances
as candidates and inserts each value into all candidates at the same time. At
the end, the smallest candidate is chosen and written out.
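The candidate idea can be sketched roughly as follows. This is a toy illustration, not the code in the pull request: `AdaptiveBloomSketch`, `Candidate`, `choose()` and `mix()` are hypothetical names, the filter is a plain bit-set bloom filter rather than Parquet's split-block layout, the sizing uses the textbook formula m = -n ln(p) / (ln 2)^2, and the hash is a SplitMix64-style stand-in. Dropping candidates that are too small for the observed distinct count is an assumption about how "smallest candidate" stays meaningful.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy illustration only; not the Parquet implementation or its on-disk format. */
class AdaptiveBloomSketch {

    /** A plain bloom filter sized for an expected number of distinct values. */
    static final class Candidate {
        static final int NUM_HASHES = 7;
        final int expectedNdv;
        final int numBits;
        final BitSet bits;

        Candidate(int expectedNdv, double fpp) {
            this.expectedNdv = expectedNdv;
            // Textbook sizing: m = -n * ln(p) / (ln 2)^2 (assumption, not Parquet's formula).
            double ln2 = Math.log(2);
            this.numBits = Math.max(64, (int) Math.ceil(-expectedNdv * Math.log(fpp) / (ln2 * ln2)));
            this.bits = new BitSet(numBits);
        }

        void insert(long hash) {
            for (int i = 0; i < NUM_HASHES; i++) bits.set(index(hash, i));
        }

        boolean mightContain(long hash) {
            for (int i = 0; i < NUM_HASHES; i++) {
                if (!bits.get(index(hash, i))) return false;
            }
            return true;
        }

        // Double hashing: derive the i-th bit position from the two halves of the hash.
        private int index(long hash, int i) {
            int h1 = (int) hash;
            int h2 = (int) (hash >>> 32);
            return Math.floorMod(h1 + i * h2, numBits);
        }
    }

    private final List<Candidate> candidates = new ArrayList<>();
    private final Candidate largest;
    private long distinctSeen = 0;

    AdaptiveBloomSketch(int maxNdv, int numCandidates, double fpp) {
        // Candidates sized for maxNdv, maxNdv/2, maxNdv/4, ...
        int ndv = maxNdv;
        for (int i = 0; i < numCandidates && ndv > 0; i++, ndv /= 2) {
            candidates.add(new Candidate(ndv, fpp));
        }
        largest = candidates.get(0);
    }

    void insert(long hash) {
        // The largest candidate doubles as an approximate distinct-value counter.
        if (!largest.mightContain(hash)) {
            distinctSeen++;
            // Drop candidates that are now too small; the largest is always kept.
            candidates.removeIf(c -> c != largest && c.expectedNdv < distinctSeen);
        }
        for (Candidate c : candidates) c.insert(hash);
    }

    /** Returns the smallest candidate that is still alive. */
    Candidate choose() {
        Candidate best = largest;
        for (Candidate c : candidates) {
            if (c.numBits < best.numBits) best = c;
        }
        return best;
    }

    /** SplitMix64-style finalizer, used here only as a stand-in hash function. */
    static long mix(long x) {
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    public static void main(String[] args) {
        AdaptiveBloomSketch sketch = new AdaptiveBloomSketch(1024, 5, 0.01);
        for (long v = 0; v < 100; v++) sketch.insert(mix(v));
        Candidate chosen = sketch.choose();
        System.out.println("chosen expectedNdv=" + chosen.expectedNdv + ", bits=" + chosen.numBits);
    }
}
```

Every surviving candidate has received every value, so whichever one is chosen can be written out without re-reading the data. The cost is the extra memory and insert work for the candidates that are thrown away.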
*Does this PR introduce any user-facing change?*
Adds new configuration:
`parquet.bloom.filter.adaptive.enabled`: default false. Whether to enable
writing an adaptive bloom filter.
If it is true, the bloom filter is generated with the optimal bit size for the
actual number of distinct values in the data. If it is false, the adaptive
behavior is not applied.
Note that the size of the bloom filter will not exceed the
`parquet.bloom.filter.max.bytes` configuration (if that is set too small, the
generated bloom filter will not be effective).
`parquet.bloom.filter.candidates.number`: default 5. The number of candidate
bloom filters populated at the same time.
When `parquet.bloom.filter.adaptive.enabled` is true, values are inserted into
multiple candidate bloom filters at the same time, and finally the bloom filter
with the optimal bit size is selected and written to the file.
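How the three settings interact can be illustrated with a small stand-alone sketch. The property keys and the defaults (false, 5) are taken from the description above; the `AdaptiveBloomSettings` class, its `Map`-based lookup, and the `assumedDefaultMaxBytes` parameter are hypothetical and are not part of Parquet.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical holder for the settings described above; not a Parquet class. */
class AdaptiveBloomSettings {
    static final String ADAPTIVE_ENABLED = "parquet.bloom.filter.adaptive.enabled";
    static final String CANDIDATES_NUMBER = "parquet.bloom.filter.candidates.number";
    static final String MAX_BYTES = "parquet.bloom.filter.max.bytes";

    final boolean adaptiveEnabled;
    final int candidatesNumber;
    final int maxBytes;

    AdaptiveBloomSettings(Map<String, String> conf, int assumedDefaultMaxBytes) {
        // Defaults follow the description: adaptive mode off, 5 candidates.
        adaptiveEnabled = Boolean.parseBoolean(conf.getOrDefault(ADAPTIVE_ENABLED, "false"));
        candidatesNumber = Integer.parseInt(conf.getOrDefault(CANDIDATES_NUMBER, "5"));
        // The description does not state a default for max bytes, so the caller supplies one.
        maxBytes = conf.containsKey(MAX_BYTES)
                ? Integer.parseInt(conf.get(MAX_BYTES))
                : assumedDefaultMaxBytes;
    }

    /** Filters populated while writing: the candidate count only matters in adaptive mode. */
    int filtersPopulated() {
        return adaptiveEnabled ? candidatesNumber : 1;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(ADAPTIVE_ENABLED, "true");
        conf.put(CANDIDATES_NUMBER, "3");
        AdaptiveBloomSettings s = new AdaptiveBloomSettings(conf, 1024 * 1024);
        System.out.println("filters populated: " + s.filtersPopulated() + ", max bytes: " + s.maxBytes);
    }
}
```

In a real job the same keys would presumably be set on the configuration object handed to the Parquet writer rather than on a plain map.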
was:
h3. Why are the changes needed?
Currently, a bloom filter is built by specifying the NDV (number of distinct
values) up front. In practice, the number of distinct values is usually not
known in advance.
If the BloomFilter can be sized automatically according to the data, the file
size can be reduced and read efficiency can also be improved.
h3. What changes were proposed in this pull request?
{{DynamicBlockBloomFilter}} holds multiple {{BlockSplitBloomFilter}} instances
as candidates and inserts each value into all candidates at the same time. The
largest bloom filter is used as an approximate distinct-value counter, and
candidates that become too small for the data are removed during insertion.
> Build a BloomFilter with a more precise size
> --------------------------------------------
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
> Issue Type: Improvement
> Reporter: Mars
> Assignee: Mars
> Priority: Major
>