[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

Jason Altekruse (JIRA) Tue, 23 Jun 2015 09:03:45 -0700

    [ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597859#comment-14597859 ]


Jason Altekruse commented on PARQUET-41:
----------------------------------------

I have not had a chance to look through the code yet, but here is one 
consideration regarding Hive. For their ACID support, it would be useful if 
the bloom filter could be modified when a record is deleted, before the 
original data and the diff file are compacted. If you design the bloom filter 
as an array of integers instead of an array of bits, you can remove elements 
from it as well as add them: each position stores the number of hashes that 
landed there, rather than a flag saying that at least one did. This would let 
Hive and other Parquet users that want to implement update/delete subtract 
the deleted or changed elements out of the bloom filter.
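
To make the idea concrete, here is a minimal sketch of such a counting bloom 
filter in Python. It is only an illustration, not the design under review: 
the class name, the slot count, the number of hashes, and the use of salted 
SHA-256 digests as hash functions are all assumptions made for the example.

```python
import hashlib


class CountingBloomFilter:
    """Hypothetical bloom filter whose slots hold counters, not single bits."""

    def __init__(self, num_slots=1024, num_hashes=3):
        # Sizes are arbitrary illustration values, not tuned recommendations.
        self.num_slots = num_slots
        self.num_hashes = num_hashes
        self.counts = [0] * num_slots  # one integer counter per position

    def _positions(self, value):
        # Derive num_hashes slot indices from salted SHA-256 digests.
        data = str(value).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_slots

    def add(self, value):
        # Count every hash that lands in a slot instead of setting a flag.
        for pos in self._positions(value):
            self.counts[pos] += 1

    def remove(self, value):
        # Only valid for values that were really added. This guard rejects
        # values the filter has definitely never seen, but it cannot detect
        # a false positive, so callers must track what they inserted.
        if not self.might_contain(value):
            raise KeyError(value)
        for pos in self._positions(value):
            self.counts[pos] -= 1

    def might_contain(self, value):
        # Same membership test as a bit-based filter: every slot non-zero.
        return all(self.counts[pos] > 0 for pos in self._positions(value))


bf = CountingBloomFilter()
bf.add("row-1")
bf.add("row-2")
bf.remove("row-1")              # e.g. the record was deleted by a diff file
print(bf.might_contain("row-2"))
```

The tradeoff is space: each position needs several bits for its counter 
rather than one, so the filter is larger than a plain bit array of the same 
length.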

> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
