[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597943#comment-14597943 ]
Jason Altekruse commented on PARQUET-41:
----------------------------------------
If rewriting happens frequently enough, there may not be a need for it.
However, I think there are some cases that cannot be completely solved by two
independent bloom filters in the base and delta files. Additions can certainly
work, but I don't think it is possible to incorporate deletes or updates with
this strategy: the bloom filter for your base file will still return true when
you look up a deleted record.
Even if the delta file completely replaces rows from your base file, there is
no association between the result of a bloom filter lookup and a row number.
You could not know that a particular update in the delta file relates to a
positive lookup in the base file's bloom filter.
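
To make the delete case concrete, here is a minimal sketch in Python. It is not
Parquet code: the BloomFilter class, its bit count and hash scheme, and the
names base_filter, delta_filter and lookup are all hypothetical, chosen only to
illustrate the argument. It assumes the base file holds keys a, b, c, the delta
file adds d, and b has since been deleted.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter (hypothetical sizes and hashing, illustration only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(key))


def build_filter(keys):
    bf = BloomFilter()
    for key in keys:
        bf.add(key)
    return bf


# Assumed scenario: the base file holds a, b, c; the delta file adds d.
base_filter = build_filter(["a", "b", "c"])
delta_filter = build_filter(["d"])


def lookup(key):
    # Combining two independent filters: a hit in either one counts.
    return base_filter.might_contain(key) or delta_filter.might_contain(key)


# Additions work: the new key is found through the delta filter.
print(lookup("d"))  # True

# Deletes do not: "b" was deleted after the base file was written, but its
# bits cannot be cleared from the base filter, so it still reports a hit.
print(lookup("b"))  # True
```

Note that might_contain returns only a boolean. Nothing in the answer says
which row produced the hit, so a reader has no way to pair a positive lookup in
the base filter with a specific update or delete recorded in the delta file.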
> Add bloom filters to parquet statistics
> ---------------------------------------
>
> Key: PARQUET-41
> URL: https://issues.apache.org/jira/browse/PARQUET-41
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-format, parquet-mr
> Reporter: Alex Levenson
> Assignee: Ferdinand Xu
> Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter.
> This could be very useful in filtering entire row groups.