[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14617693#comment-14617693 ]

Ryan Blue commented on PARQUET-41:
----------------------------------

[~Ferd], I think you're right about the conclusions from my last note:

1. The filter is larger than expected and might not be worth storing
2. We need a way to avoid over-filling the filter

For the first, we need a plan for deciding whether or not a bloom filter 
should be written at all. For the second, I think the estimate I proposed is 
good enough, though we will probably also want rules that detect early when 
the bloom filter won't be needed, so we can fail fast and avoid the memory 
overhead of keeping those hashes around.
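As a rough sketch of that fail-fast idea (the class name, the hash choice, and
the size cutoff here are all illustrative assumptions, not from the patch): keep
only hashes per page, estimate the eventual filter size as values arrive, and
drop everything as soon as the filter could no longer pay for itself.

```python
import hashlib
import math

class PageFilterBuilder:
    """Sketch: accumulate 64-bit hashes for a page instead of raw values,
    and abandon the filter early once it would grow past a size cutoff.
    All names and the cutoff policy are hypothetical."""

    def __init__(self, fpp, max_bytes):
        self.fpp = fpp            # target false-positive probability
        self.max_bytes = max_bytes
        self.hashes = set()       # distinct hashes seen so far
        self.abandoned = False

    def add(self, value: bytes):
        if self.abandoned:
            return
        # Any well-distributed 64-bit hash works; SHA-1 truncation is
        # just a convenient stand-in here.
        h = int.from_bytes(hashlib.sha1(value).digest()[:8], 'big')
        self.hashes.add(h)
        # Classic optimal bloom filter size for n distinct values:
        # m = -n * ln(p) / (ln 2)^2 bits
        n = len(self.hashes)
        bits = -n * math.log(self.fpp) / (math.log(2) ** 2)
        if bits / 8 > self.max_bytes:
            self.hashes = set()   # fail fast: free the memory early
            self.abandoned = True
```

The point is that the builder never buffers raw values, only hashes, and the
memory is released the moment the filter is known to be too large.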

I've been thinking more about where we will put this data in the format and 
whether the data will be per page or per row group. We definitely want to be 
able to skip entire row groups. But if the filter is large, it would be nice to 
be able to skip pages as well because one positive check would require reading 
the whole row group, even if the value is in a single page. I think it makes 
sense to use page-level filters but keep them at the column-chunk level (index 
page?) so they can be used for both filtering row groups and pages.

I've added another page to the [bloom filter 
spreadsheet|https://docs.google.com/spreadsheets/d/1LQqGZ1EQSkPBXtdi9nyANiQOhwNFwqiiFe8Sazclf5Y/edit#gid=1995371636]
 for some calculations around this, in the "row group filters" page. I compared 
two strategies:

1. The idea from the Scalable Bloom Filters paper to geometrically decrease the 
FPP
2. To estimate the number of pages and calculate a uniform per-page FPP that 
results in an overall row group FPP (page-FPP = 1 - (1 - 
row-group-FPP)^(1/num-pages))
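The second formula follows from assuming the per-page filters are independent:
the row group reports a false positive if any of its pages does, so solving
1 - (1 - page-FPP)^num-pages = row-group-FPP for page-FPP gives:

```python
def page_fpp(row_group_fpp, num_pages):
    # Uniform per-page FPP such that num_pages independent page filters
    # combine to the desired overall row-group FPP:
    # page_fpp = 1 - (1 - row_group_fpp)^(1/num_pages)
    return 1 - (1 - row_group_fpp) ** (1 / num_pages)
```

For example, a 10% row-group FPP over 10 pages requires each page filter to be
built for roughly a 1% FPP.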

The second strategy is the clear winner. The FPP for pages is the same for all 
pages in a row group, you get an overall FPP for the row group, and the size 
required is under 3x the size of one big bloom filter for the row group. This 
also performs much better (space and FPP) than the scalable bloom filters 
strategy when there are few pages. I think the drawback is that this will limit 
the ideal number of pages per row group.
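The "under 3x" figure can be checked with the standard optimal-size formula
m = -n * ln(p) / (ln 2)^2 bits (function names below are illustrative; the
comparison assumes values split evenly across pages):

```python
import math

def bloom_bits(n, p):
    # Optimal bloom filter size in bits for n values at FPP p:
    # m = -n * ln(p) / (ln 2)^2
    return -n * math.log(p) / (math.log(2) ** 2)

def size_ratio(n, row_group_fpp, num_pages):
    # Total size of num_pages per-page filters (each built for the uniform
    # per-page FPP) relative to one big row-group filter at the same
    # overall FPP.
    p_page = 1 - (1 - row_group_fpp) ** (1 / num_pages)
    per_page_total = num_pages * bloom_bits(n / num_pages, p_page)
    single = bloom_bits(n, row_group_fpp)
    return per_page_total / single
```

At a 10% row-group FPP and 100 pages this ratio comes out just under 3, and it
shrinks as the page count drops, which matches the spreadsheet's conclusion.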

So I'm proposing we go with this tentative design:
* Build one bloom filter per page (or one for the dictionary page) using the 
strategy I proposed above (keep hashes around, but not data)
* Collect all page filters and write them in index pages to allow skipping the 
filters since they may be quite large
* Make the FPP setting control the column-chunk/row-group FPP and calculate the 
per-page FPP needed to achieve it. A default of around 10% should work.
* Base fallback on the filter size as a fraction of the total column chunk 
size: only write a bloom filter if it is less than X% of the total size.
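The size-based fallback in the last bullet amounts to a single check; the 5%
threshold below is a placeholder for X, not a proposed default:

```python
def should_write_filter(filter_bytes, chunk_bytes, max_fraction=0.05):
    # Hypothetical fallback rule: keep the bloom filter only when it costs
    # less than max_fraction of the column chunk it protects.
    return filter_bytes < max_fraction * chunk_bytes
```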

Then we would have a bunch of rules for detecting that the bloom filter is too 
large so that we can free up memory early.


> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
