[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

Nezih Yigitbasi (JIRA) Tue, 23 Jun 2015 11:04:50 -0700

    [ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598090#comment-14598090 ]


Nezih Yigitbasi commented on PARQUET-41:
----------------------------------------

I'm not sure we need a counting bloom filter to support Hive ACID transactions 
(at least for now). The deltas are periodically merged into the base file (this 
is called major compaction, and its frequency depends on the 
hive.compactor.delta.pct.threshold config parameter), and the bloom filter of 
the base file gets rewritten with each major compaction. One thing I don't 
understand is why a delta file needs a bloom filter. As far as I understand 
Hive's ACID support, it's enough for just the base file to contain a filter, 
and I guess it's OK for the bloom filter to return true when a delta file has 
a delete for that particular record, as Hive currently supports only snapshot 
isolation.
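
To make the argument above concrete, here is a minimal sketch in Python. It is 
not Hive or Parquet code: the BloomFilter class, build_filter, and 
major_compaction are hypothetical names invented for illustration, and the 
sketch assumes a reader always merges the delta files on top of the base. It 
shows that a plain (non-counting) filter on the base file stays safe after a 
delete lands in a delta, because a stale "true" is only a false positive, and 
that rebuilding the filter at major compaction makes per-key removal 
unnecessary.

```python
import hashlib


class BloomFilter:
    """Plain (non-counting) bloom filter: supports add and lookup, no remove."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


def build_filter(keys):
    """Build a fresh filter over the given keys (hypothetical helper)."""
    bf = BloomFilter()
    for key in keys:
        bf.add(key)
    return bf


def major_compaction(base_keys, inserted_keys, deleted_keys):
    """Merge delta inserts/deletes into a new base and rebuild its filter."""
    new_base = (set(base_keys) | set(inserted_keys)) - set(deleted_keys)
    return new_base, build_filter(new_base)


base = {"a", "b", "c"}
base_filter = build_filter(base)

# A delta deletes "b". The base filter is left untouched and still answers
# True for "b": a false positive that the delta merge resolves at read time.
stale_answer = base_filter.might_contain("b")

# Major compaction rewrites the base, so the filter is rebuilt from scratch
# rather than having "b" removed from it.
new_base, new_filter = major_compaction(base, {"d"}, {"b"})
```

The stale answer costs an unnecessary read but never hides a live record, 
which is why the sketch gets by without a counting filter.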

> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
