[
https://issues.apache.org/jira/browse/PARQUET-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061614#comment-17061614
]
Walid Gara commented on PARQUET-1815:
-------------------------------------
In parquet-mr, we use bloom filters to filter values. Since they are already
computed and exist in the footer, they can be put to use beyond this internal
purpose. By taking the union of all the bloom filters in a Parquet file, we
can build a single file-level bloom filter, at the cost of a higher
false-positive rate. That filter can then serve as an index (a kind of
metadata) in projects such as [Apache Iceberg|https://iceberg.apache.org/].
This is just one simple use case. The following paper describes more, such as
bloom joins:
[Role of Bloom Filter in Big Data Research: A
Survey|https://arxiv.org/pdf/1903.06565.pdf]
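
To make the union idea concrete, here is a minimal sketch. It assumes all
filters being merged have the same number of bits and use the same hash
functions, so the union is just a bitwise OR of the bitsets. The class
SimpleBloomFilter, its methods, and the FNV-1a style hashing are hypothetical
and for illustration only; they do not mirror the parquet-mr BloomFilter
interface or its actual hashing scheme.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical, simplified Bloom filter used only to illustrate union.
// It is NOT the parquet-mr BloomFilter interface.
final class SimpleBloomFilter {
    private final int numBits;
    private final int numHashes;
    private final BitSet bits;

    SimpleBloomFilter(int numBits, int numHashes) {
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
    }

    // Seeded FNV-1a style hash mapped to a bit index (illustrative choice).
    private int index(String value, int seed) {
        int h = 0x811C9DC5 ^ (seed * 0x9E3779B9);
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return Math.floorMod(h, numBits);
    }

    void insert(String value) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(value, i));
        }
    }

    // May return true for values never inserted (false positive),
    // but never returns false for a value that was inserted.
    boolean mightContain(String value) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(value, i))) {
                return false;
            }
        }
        return true;
    }

    // In-place union: OR the other filter's bits into this one.
    // Only valid when both filters share size and hash functions.
    void union(SimpleBloomFilter other) {
        if (other.numBits != numBits || other.numHashes != numHashes) {
            throw new IllegalArgumentException("incompatible bloom filters");
        }
        bits.or(other.bits);
    }

    public static void main(String[] args) {
        // Two filters standing in for two row groups of one file.
        SimpleBloomFilter rowGroup1 = new SimpleBloomFilter(1024, 3);
        SimpleBloomFilter rowGroup2 = new SimpleBloomFilter(1024, 3);
        rowGroup1.insert("alice");
        rowGroup2.insert("bob");

        // File-level filter built by union of the row-group filters.
        SimpleBloomFilter fileLevel = new SimpleBloomFilter(1024, 3);
        fileLevel.union(rowGroup1);
        fileLevel.union(rowGroup2);

        System.out.println(fileLevel.mightContain("alice"));
        System.out.println(fileLevel.mightContain("bob"));
    }
}
```

The merged filter has more bits set than any single input, which is where the
higher false-positive rate comes from. Filters of different sizes cannot be
merged this way, so the sketch rejects them.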
> Add union API to BloomFilter interface
> --------------------------------------
>
> Key: PARQUET-1815
> URL: https://issues.apache.org/jira/browse/PARQUET-1815
> Project: Parquet
> Issue Type: Improvement
> Reporter: Junjie Chen
> Priority: Minor
> Labels: pull-request-available
>
> Sometimes, one may want to build a file-level bloom filter by taking the
> union of all row-group bloom filters in order to save some memory. Add a
> union API to make this easy to do.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)