[
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597859#comment-14597859
]
Jason Altekruse commented on PARQUET-41:
----------------------------------------
I have not had a chance to look through the code yet, but here is one
consideration regarding Hive. For their ACID support, it would be useful if
the bloom filter could be updated when a record is deleted, before the
original data and the diff file are compacted. If you design the bloom filter
as an array of integers instead of an array of bits, you can remove elements
from it as well as add them: each position stores the number of hashes that
landed there, rather than a flag saying that at least one did. This would
allow Hive and other users of Parquet that may want to implement update/delete
to subtract deleted or changed elements from the bloom filter.
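
To make the idea concrete, here is a minimal sketch of an integer-array
filter (usually called a counting bloom filter). It is not taken from the
patch under review: the class name, the default sizes, and the salted SHA-256
hashing are all assumptions chosen for illustration.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter variant whose slots hold counters instead of single bits.

    Hypothetical illustration only; not the Parquet implementation.
    """

    def __init__(self, num_slots=1024, num_hashes=3):
        self.num_slots = num_slots
        self.num_hashes = num_hashes
        # One integer counter per slot, where a plain bloom filter has one bit.
        self.counts = [0] * num_slots

    def _positions(self, value):
        # Derive num_hashes slot indexes by salting a SHA-256 digest.
        # (Assumed hashing scheme, picked for simplicity.)
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_slots

    def add(self, value):
        # Count every hash that lands in a slot.
        for pos in self._positions(value):
            self.counts[pos] += 1

    def might_contain(self, value):
        # False means definitely absent; True means possibly present.
        return all(self.counts[pos] > 0 for pos in self._positions(value))

    def remove(self, value):
        # Subtract the value's hashes again. This is only correct for values
        # that were really added; the check below catches obvious misuse but
        # cannot detect a false positive.
        if not self.might_contain(value):
            raise ValueError("value is not in the filter")
        for pos in self._positions(value):
            self.counts[pos] -= 1


bf = CountingBloomFilter()
bf.add("row-1")
bf.add("row-2")
bf.remove("row-1")              # e.g. the record was deleted in a diff file
print(bf.might_contain("row-2"))  # True: remaining values are never lost
```

The trade-off is space: each position needs several bits for its counter
rather than one, so the filter is larger than a plain bit array with the same
number of positions.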
> Add bloom filters to parquet statistics
> ---------------------------------------
>
> Key: PARQUET-41
> URL: https://issues.apache.org/jira/browse/PARQUET-41
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-format, parquet-mr
> Reporter: Alex Levenson
> Assignee: Ferdinand Xu
> Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter.
> This could be very useful in filtering entire row groups.