[
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14617693#comment-14617693
]
Ryan Blue commented on PARQUET-41:
----------------------------------
[~Ferd], I think you're right about the conclusions from my last note:
1. The filter is larger than expected and might not be worth storing
2. We need a way to avoid over-filling the filter
For the first, we need to plan how to determine whether or not a bloom filter
should be written. For the second, I think the solution I proposed is a good
enough estimate, though we will probably want to fail fast: define rules up
front for detecting that the bloom filter will not be needed, so we can avoid
the memory overhead of keeping those hashes around.
I've been thinking more about where we will put this data in the format and
whether the data will be per page or per row group. We definitely want to be
able to skip entire row groups. But if the filter is large, it would be nice to
be able to skip pages as well because one positive check would require reading
the whole row group, even if the value is in a single page. I think it makes
sense to use page-level filters but keep them at the column-chunk level (index
page?) so they can be used for both filtering row groups and pages.
I've added another page to the [bloom filter
spreadsheet|https://docs.google.com/spreadsheets/d/1LQqGZ1EQSkPBXtdi9nyANiQOhwNFwqiiFe8Sazclf5Y/edit#gid=1995371636]
for some calculations around this, in the "row group filters" page. I compared
two strategies:
1. The approach from the Scalable Bloom Filters paper: geometrically decrease
the FPP of successive filters
2. Estimate the number of pages and calculate a uniform per-page FPP that
yields the desired overall row group FPP:
page-FPP = 1 - (1 - row-group-FPP)^(1/num-pages)
The second strategy is the clear winner: the FPP is the same for all pages in
a row group, you get an overall FPP for the row group, and the total size
required is under 3x the size of one big bloom filter for the whole row group.
It also performs much better (in both space and FPP) than the scalable bloom
filter strategy when there are few pages. The drawback is that this limits the
ideal number of pages per row group.
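The numbers behind that comparison are easy to reproduce. Here's a small sketch
(the function names are mine, not from any Parquet code) that computes the
uniform per-page FPP from the formula above and compares the summed size of the
page filters against one big row-group filter, using the standard optimal bloom
filter sizing m = -n * ln(p) / (ln 2)^2:

```python
import math

def page_fpp(row_group_fpp, num_pages):
    """Per-page FPP such that the whole row group meets row_group_fpp.

    A row group lookup is a false positive if any page filter fires
    falsely, so: 1 - (1 - page_fpp)**num_pages == row_group_fpp.
    """
    return 1 - (1 - row_group_fpp) ** (1 / num_pages)

def bloom_bits(n, fpp):
    """Optimal bloom filter size in bits for n distinct values at the
    given FPP (standard formula m = -n * ln(p) / (ln 2)**2)."""
    return -n * math.log(fpp) / math.log(2) ** 2

# Example: 1,000,000 values split across 100 pages, 10% row-group FPP.
n, pages, rg_fpp = 1_000_000, 100, 0.10
p = page_fpp(rg_fpp, pages)
total = pages * bloom_bits(n / pages, p)  # sum of the per-page filters
single = bloom_bits(n, rg_fpp)            # one big row-group filter
print(p)               # per-page FPP, ~0.00105
print(total / single)  # overhead vs. one big filter, ~2.98x
```

Even at 100 pages per row group the page filters stay just under 3x the size
of a single row-group filter, which matches the spreadsheet; pushing the page
count much higher is what erodes that bound.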
So I'm proposing we go with this tentative design:
* Build one bloom filter per page (or one for the dictionary page) using the
strategy I proposed above (keep hashes around, but not data)
* Collect all page filters and write them in index pages, so readers can skip
over the filters themselves since they may be quite large
* Make the FPP setting control the column chunk/row group FPP and calculate the
per-page FPP needed to achieve it. A default of around 10% should work.
* Base fallback on the filter size as a fraction of the total column chunk
size: only write a bloom filter if it is less than X% of the total size.
Then we would need a set of rules for detecting early that the bloom filter
will be too large, so that we can free up memory as soon as possible.
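To make the fail-fast idea concrete, here's a minimal sketch of the write path
this implies: keep only hashes (not values) while the page is open, and drop
them as soon as the estimated filter size blows the budget. All names here are
hypothetical, not the parquet-mr API, and the size budget stands in for the
"X% of column chunk size" rule:

```python
import math

class PageBloomFilterBuilder:
    """Illustrative builder: collect hashes, not values, and bail out
    early once the filter would exceed its size budget."""

    def __init__(self, page_fpp, max_filter_bytes):
        self.page_fpp = page_fpp
        self.max_filter_bytes = max_filter_bytes
        self.hashes = set()      # 64-bit hashes of the values seen so far
        self.abandoned = False   # fail-fast flag

    def add(self, value):
        if self.abandoned:
            return               # hashes already freed; nothing to track
        self.hashes.add(hash(value) & 0xFFFFFFFFFFFFFFFF)
        # Fail fast: if the filter for the distinct values seen so far
        # already exceeds the budget, free the hashes immediately.
        if self._estimated_bytes() > self.max_filter_bytes:
            self.hashes = set()
            self.abandoned = True

    def _estimated_bytes(self):
        # Standard optimal sizing: m = -n * ln(p) / (ln 2)^2 bits.
        n = len(self.hashes)
        if n == 0:
            return 0
        bits = -n * math.log(self.page_fpp) / math.log(2) ** 2
        return bits / 8

builder = PageBloomFilterBuilder(page_fpp=0.01, max_filter_bytes=64)
for v in range(1000):
    builder.add(f"value-{v}")
print(builder.abandoned)  # True: too many distinct values for the budget
```

The point of estimating from distinct hashes is that the decision can be made
mid-page, long before we know the final column chunk size.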
> Add bloom filters to parquet statistics
> ---------------------------------------
>
> Key: PARQUET-41
> URL: https://issues.apache.org/jira/browse/PARQUET-41
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-format, parquet-mr
> Reporter: Alex Levenson
> Assignee: Ferdinand Xu
> Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter.
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)