[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597943#comment-14597943
 ] 

Jason Altekruse commented on PARQUET-41:
----------------------------------------

If rewriting happens frequently enough, there may not be a need for it. However, 
I think there are some cases that cannot be completely solved by two 
independent bloom filters in the base and delta files. Additions can certainly 
work, but I don't think it is possible to incorporate deletes or updates with 
this strategy. A standard bloom filter cannot have entries removed, so the 
bloom filter for your base file will still return true when you look up a 
deleted record.

Even if the delta file completely replaces rows from your base file, there is 
no association between the result of a bloom filter lookup and a row number. 
The lookup only answers "possibly present" or "definitely absent", so you 
could not know that a particular update in the delta file relates to a 
lookup in the base file bloom filter.
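
To make the problem concrete, here is a minimal sketch using a toy bloom 
filter. Everything in it is an assumption for illustration: the BloomFilter 
class, the SHA-256 based hashing, and the base/delta layout are hypothetical 
and are not Parquet's format or API. It shows that an insert recorded in the 
delta file is visible through the combined lookup, while a delete is not, 
and that a lookup returns only a boolean with no row number attached.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter (illustrative only, not Parquet's implementation)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


# Base file: the filter is built once, when the file is written.
base_filter = BloomFilter()
for key in ["alice", "bob", "carol"]:
    base_filter.add(key)

# Delta file (hypothetical layout): one insert and one delete.
delta_inserts = ["dave"]
delta_deletes = ["bob"]
delta_filter = BloomFilter()
for key in delta_inserts:
    delta_filter.add(key)


def might_contain_either(key):
    """Combine the two independent filters with a logical OR."""
    return base_filter.might_contain(key) or delta_filter.might_contain(key)


# Additions work: the inserted key is found through the delta filter.
insert_visible = might_contain_either("dave")

# Deletes do not: the base filter cannot forget "bob", so the lookup
# still reports the deleted record as possibly present.
deleted_still_positive = base_filter.might_contain("bob")

# The lookup result is a bare boolean; it carries no row number that
# could tie it to a specific update or delete in the delta file.
lookup_result = base_filter.might_contain("alice")
```

Since bloom filters never produce false negatives, the deleted key is 
guaranteed to keep matching in the base filter until the base file (and its 
filter) is rewritten.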

> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
