[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598090#comment-14598090 ]
Nezih Yigitbasi commented on PARQUET-41:
----------------------------------------
I'm not sure we need a counting bloom filter to support Hive ACID transactions
(at least for now). The deltas are periodically merged into the base file
(this is called major compaction; how often it runs depends on the
hive.compactor.delta.pct.threshold config parameter), and the base file's bloom
filter gets rewritten as part of each major compaction. One thing I don't
understand is why a delta file would need a bloom filter. As far as I
understand Hive's ACID support, it's enough for just the base file to contain a
filter, and I guess it's OK for the bloom filter to return true for a record
that a delta file has deleted, since Hive currently supports only snapshot
isolation.
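
To make the argument above concrete, here is a minimal sketch in Python. It is
not Parquet's or Hive's actual implementation: the BloomFilter class, the
lookup and major_compaction helpers, and the dict/set stand-ins for the base
file and a delete delta are all hypothetical names chosen for illustration,
and only delete deltas are modeled. The idea it tries to show is that a stale
"maybe present" answer from the base file's filter is harmless, because the
delete delta still decides visibility, and that the filter is simply rebuilt
when the base is rewritten.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter (not Parquet's on-disk format)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


def build_filter(keys):
    bf = BloomFilter()
    for key in keys:
        bf.add(key)
    return bf


def lookup(key, base_rows, base_filter, deleted_keys):
    """Hypothetical read path: the filter prunes the base file,
    the delete delta decides whether the row is still visible."""
    if not base_filter.might_contain(key):
        return None  # definitely not in the base file, so skip reading it
    if key in deleted_keys:
        return None  # filter said "maybe", but the delta deleted the row
    return base_rows.get(key)


def major_compaction(base_rows, deleted_keys):
    """Merge the delete delta into the base and rebuild the filter."""
    merged = {k: v for k, v in base_rows.items() if k not in deleted_keys}
    return merged, build_filter(merged)


if __name__ == "__main__":
    base = {"row-1": "a", "row-2": "b"}
    base_filter = build_filter(base)
    deleted = {"row-2"}  # stand-in for a delete recorded in a delta file

    # The base filter still answers "maybe" for the deleted row.
    print(base_filter.might_contain("row-2"))
    # The delta hides the row anyway, so the stale answer costs only a read.
    print(lookup("row-2", base, base_filter, deleted))

    base, base_filter = major_compaction(base, deleted)
    print(sorted(base))
```

In this sketch a plain (non-counting) filter is enough because nothing is ever
removed from it in place; stale entries simply disappear when major compaction
rewrites the base and its filter.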
> Add bloom filters to parquet statistics
> ---------------------------------------
>
> Key: PARQUET-41
> URL: https://issues.apache.org/jira/browse/PARQUET-41
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-format, parquet-mr
> Reporter: Alex Levenson
> Assignee: Ferdinand Xu
> Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter.
> This could be very useful in filtering entire row groups.