[jira] [Commented] (PARQUET-1815) Add union API to BloomFilter interface

Walid Gara (Jira) Wed, 18 Mar 2020 03:51:49 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061614#comment-17061614
 ]


Walid Gara commented on PARQUET-1815:
-------------------------------------

parquet-mr uses bloom filters to filter values. Since they are already 
computed and exist in the footer, they can be exploited beyond internal use. 
Simply taking the union of all the bloom filters in a Parquet file yields a 
single file-level bloom filter, at the cost of a higher false-positive rate. 
That filter can then serve as an index (a kind of metadata) in projects such 
as [Apache Iceberg|https://iceberg.apache.org/].
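
To make the idea concrete, here is a minimal sketch of what such a union 
could look like. It is not the parquet-mr implementation: the class 
SketchBloomFilter, its hashing scheme, and the method names are all 
hypothetical, and it assumes every filter being merged has the same bit 
count and the same number of hash functions. Under that assumption the union 
is simply a bitwise OR of the underlying bitsets.

```java
// Hypothetical, simplified bloom filter used only to illustrate union.
// It is NOT the parquet-mr BloomFilter interface or its hashing scheme.
final class SketchBloomFilter {
    private static final long GOLDEN = 0x9e3779b97f4a7c15L;

    private final long[] words;   // bitset, 64 bits per word
    private final int numBits;
    private final int numHashes;

    SketchBloomFilter(int numBits, int numHashes) {
        if (numBits <= 0 || numBits % 64 != 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits must be a positive multiple of 64");
        }
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.words = new long[numBits / 64];
    }

    // SplitMix64-style finalizer, used here as a stand-in hash function.
    private static long mix(long x) {
        x = (x ^ (x >>> 30)) * 0xbf58476d1ce4e5b9L;
        x = (x ^ (x >>> 27)) * 0x94d049bb133111ebL;
        return x ^ (x >>> 31);
    }

    // Double hashing: the i-th probe position is (h1 + i * h2) mod numBits.
    private int bitIndex(long value, int i) {
        long h1 = mix(value + GOLDEN);
        long h2 = mix(h1) | 1L;
        return (int) Long.remainderUnsigned(h1 + i * h2, numBits);
    }

    void insert(long value) {
        for (int i = 0; i < numHashes; i++) {
            int b = bitIndex(value, i);
            words[b >>> 6] |= 1L << (b & 63);
        }
    }

    boolean mightContain(long value) {
        for (int i = 0; i < numHashes; i++) {
            int b = bitIndex(value, i);
            if ((words[b >>> 6] & (1L << (b & 63))) == 0) {
                return false;
            }
        }
        return true;
    }

    // Merges 'other' into this filter by OR-ing the bitsets word by word.
    // Only valid when both filters share the same size and hash count.
    void union(SketchBloomFilter other) {
        if (other.numBits != numBits || other.numHashes != numHashes) {
            throw new IllegalArgumentException("filters must have identical parameters");
        }
        for (int i = 0; i < words.length; i++) {
            words[i] |= other.words[i];
        }
    }
}
```

Calling fileLevel.union(rowGroupFilter) once per row group leaves fileLevel 
with exactly the bits it would have had if every value had been inserted 
into it directly. This is why the merged filter never produces false 
negatives but has a higher false-positive rate than any of its inputs: more 
bits are set in a bitset of the same size.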

This is just one simple use case. The following paper describes more, such as 
bloom joins:
[Role of Bloom Filter in Big Data Research: A 
Survey|https://arxiv.org/pdf/1903.06565.pdf]

> Add union API to BloomFilter interface
> --------------------------------------
>
>                 Key: PARQUET-1815
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1815
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Junjie Chen
>            Priority: Minor
>              Labels: pull-request-available
>
> Sometimes, one may want to build a file-level bloom filter by taking the 
> union of all row group bloom filters in order to save some memory. Add a 
> union API to make this easy.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
