[
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606664#comment-14606664
]
Ryan Blue commented on PARQUET-41:
----------------------------------
No problem.
The second page calculates the effective false-positive probability (FPP) for
a filter of a given size and a given amount of overloading. The first table
calculates the size of a bloom filter for a given FPP and number of expected
values. The second table, to its right, shows the actual FPP for each of the
filter sizes on the left when the filter is overloaded by the overloading
factor, which is set in the green box just below.
For example, the table on the left calculates a size for storing 512 values
with a 1% FPP: 614 bytes. The table on the right then multiplies the number of
values by the overloading factor: 512 * 1.25 = 640. Assuming we stored 640
values in that 614-byte filter, it calculates that the actual FPP will be 2.5%
instead of the 1% we wanted.
This shows that we need to base the size of a filter on the actual number of
values stored. As the example above shows, loading a 1% filter to 125% of its
capacity results in a 2.5% actual FPP. A 200% load results in a 10% actual FPP.
And in practice I expect the capacity we guess to be off by an order of
magnitude, not just a factor of two.
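
As a rough sketch of the arithmetic above: the spreadsheet's formulas are not
included in this message, so the code below assumes the standard bloom filter
sizing formula, bits = -n * ln(p) / (ln 2)^2, and the matching estimate
p = exp(-(bits / n) * (ln 2)^2) for the FPP at a given load. The function names
are hypothetical and chosen only for illustration. Under those assumptions it
reproduces the 614-byte size and the 2.5% and 10% figures quoted above.

```python
import math

LN2_SQUARED = math.log(2) ** 2


def filter_size_bytes(expected_values, target_fpp):
    """Size in bytes of a filter for expected_values at target_fpp.

    Assumes the standard formula bits = -n * ln(p) / (ln 2)^2,
    rounded up to a whole number of bytes.
    """
    bits = -expected_values * math.log(target_fpp) / LN2_SQUARED
    return math.ceil(bits / 8)


def effective_fpp(size_bytes, actual_values):
    """Estimated FPP when actual_values are stored in a size_bytes filter.

    Assumes p = exp(-(bits / n) * (ln 2)^2), which treats the number of
    hash functions as optimal for the actual load.
    """
    bits_per_value = size_bytes * 8 / actual_values
    return math.exp(-bits_per_value * LN2_SQUARED)


if __name__ == "__main__":
    expected = 512
    size = filter_size_bytes(expected, 0.01)
    print(f"{expected} values at 1% FPP: {size} bytes")
    for factor in (1.25, 2.0, 10.0):
        stored = int(expected * factor)
        fpp = effective_fpp(size, stored)
        print(f"overload {factor:g}x ({stored} values): FPP ~ {fpp:.1%}")
```

Note that a real filter keeps the number of hash functions it was built with,
so its FPP under overload would come out somewhat higher than this estimate.
The conclusion is the same either way: the filter has to be sized from the
actual number of values.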
> Add bloom filters to parquet statistics
> ---------------------------------------
>
> Key: PARQUET-41
> URL: https://issues.apache.org/jira/browse/PARQUET-41
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-format, parquet-mr
> Reporter: Alex Levenson
> Assignee: Ferdinand Xu
> Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter.
> This could be very useful in filtering entire row groups.