[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598800#comment-14598800 ]

Ferdinand Xu commented on PARQUET-41:
-------------------------------------

Hi [~nezihyigitbasi] [~jaltekruse], I don't think we need to handle the 
deletion case for the bloom filter when it comes to an ACID table. For ACID in 
Hive, we should consider three things: read, write, and compaction (merge). 
Neither read nor write operations require deletion support in the bloom 
filter. For read operations, the Hive merge reader handles deletions by 
constructing the user-visible view of the data from the base & delta files. 
For compaction, there are two kinds: *Minor Compaction* and *Major 
Compaction*.
Minor compaction compacts the delta files: several delta files are merged 
into a single delta file (_delta_x_y_, where x is the first and y the last 
transaction id). So the result is a different file after compaction.
Major compaction compacts the delta files into the base file (_base_z_, 
where z is the highest transaction id compacted). 
So we should generate a new bloom filter instead of merging the previous 
ones (the compactor is in fact writing a new file). Any thoughts about this 
[~spena] [~rdblue]?
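The reason deletion support isn't needed can be sketched in a few lines: a bloom filter's bits may be shared between entries, so individual values can never be removed; a compactor that drops deleted rows simply rebuilds the filter over the surviving rows. The snippet below is a minimal illustration only (the `BloomFilter` and `compact` names are hypothetical, not parquet-mr or Hive code):

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter sketch: supports add and membership test only."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value):
        # Derive k deterministic bit positions from the value.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))

def compact(base_rows, deleted):
    """Sketch of a compactor: drop deleted rows, then build a *fresh*
    bloom filter over the survivors. Bits can't be unset in place,
    because they may also belong to other entries."""
    surviving = [r for r in base_rows if r not in deleted]
    bf = BloomFilter()
    for r in surviving:
        bf.add(r)
    return surviving, bf

rows, bf = compact(["a", "b", "c"], deleted={"b"})
assert bf.might_contain("a") and bf.might_contain("c")
```

Note that `might_contain` can return false positives but never false negatives, which is exactly why the new filter must be built from scratch over the post-compaction rows.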

Besides, I re-read the configuration for bloom filter support in ORC. It 
seems we could add one more configuration option to specify which columns 
the bloom filter should be applied to.
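For reference, ORC exposes this per-column choice through table properties; a Parquet option could follow the same shape. The `orc.*` keys below are ORC's existing properties, while any Parquet equivalent would still need to be designed:

```sql
-- ORC's existing per-column bloom filter configuration in Hive:
CREATE TABLE t (id BIGINT, name STRING)
STORED AS ORC
TBLPROPERTIES (
  "orc.bloom.filter.columns" = "id,name",  -- columns to build filters for
  "orc.bloom.filter.fpp"     = "0.05"      -- target false positive rate
);
```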


> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
