[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605183#comment-14605183
 ] 

Ferdinand Xu commented on PARQUET-41:
-------------------------------------

Hi [~rdblue], I really appreciate your detailed comments and the concrete data.
To make sure I follow your points, I’d like to give a short summary first.
The current design has two drawbacks. The first is space efficiency: according
to the calculations in the sheet, the bloom filter bit set will occupy much
more space than expected. The second is how to obtain the setting for the
expected number of entries.
For the first one, I am thinking about adding a header (a kind of Statistics
header), as was done for the dictionary. We could create a map-like data
structure with the data page as key and the bloom filter as value. This is just
a rough idea and needs more investigation. With regard to the expected number
of entries, I like your second idea too. We could obtain it at runtime and
write it to the bloom filter when a flush happens. Any thoughts?
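
To make the space concern concrete, here is a rough sketch of the textbook
sizing formula for a classic bloom filter, showing how the bit set grows with
the expected number of entries. It is not taken from parquet-mr: the function
name and the false-positive parameter (fpp) are hypothetical and only for
illustration.

```python
import math
from typing import Tuple


def bloom_filter_size(expected_entries: int, fpp: float) -> Tuple[int, int]:
    """Return (num_bits, num_hashes) for a classic bloom filter.

    Uses the standard formulas m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2,
    where n is the expected number of entries and p is the target
    false-positive probability. Both names are illustrative assumptions.
    """
    if expected_entries <= 0:
        raise ValueError("expected_entries must be positive")
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be strictly between 0 and 1")

    # Bits needed to hold n entries at false-positive probability p.
    num_bits = math.ceil(-expected_entries * math.log(fpp) / (math.log(2) ** 2))
    # Number of hash functions that minimizes the false-positive rate.
    num_hashes = max(1, round(num_bits / expected_entries * math.log(2)))
    return num_bits, num_hashes


if __name__ == "__main__":
    # Compare a few entry counts at a 1% false-positive target.
    for n in (10_000, 100_000, 1_000_000):
        bits, hashes = bloom_filter_size(n, 0.01)
        print(f"n={n}: {bits} bits ({bits // 8} bytes), {hashes} hash functions")
```

Since the size depends directly on the entry count, an overestimated setting
wastes space, which is one reason to take the count at runtime and size the
filter at flush time.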

Thank you!
Ferd


> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
