[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606664#comment-14606664
 ] 

Ryan Blue commented on PARQUET-41:
----------------------------------

No problem.

The second page calculates the effective false-positive probability (FPP) 
given a filter of some size and an amount of overloading. The first table 
calculates the size of a bloom filter for a given FPP and number of expected 
values. The second table, to its right, shows the actual FPP for each of the 
filter sizes on the left when the filter is overloaded by the overloading 
factor, which is given in the green box just below.

For example, the table on the left calculates a size for storing 512 values 
with a 1% FPP: 614 bytes. The table on the right then multiplies the number of 
values by the overloading factor: 512 * 1.25 = 640. Then, assuming we stored 
640 values in that 614-byte filter, it calculates that the actual FPP will be 
2.5% instead of the 1% we wanted.
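
A rough sketch of the same calculation, assuming the textbook bloom filter 
formulas (m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hash functions, 
FPP ~= (1 - e^(-k*n/m))^k). The function names are made up for illustration, 
and the spreadsheet may round or approximate differently, so the overloaded 
FPP this prints will not necessarily match the spreadsheet's figure exactly.

```python
import math


def bloom_size_bytes(n_values, target_fpp):
    """Filter size in bytes for n_values at target_fpp (textbook formula)."""
    bits = math.ceil(-n_values * math.log(target_fpp) / (math.log(2) ** 2))
    return math.ceil(bits / 8)


def optimal_hash_count(size_bytes, n_values):
    """Number of hash functions that minimizes FPP, rounded to an integer."""
    return max(1, round(size_bytes * 8 / n_values * math.log(2)))


def actual_fpp(size_bytes, n_hashes, n_stored):
    """Approximate FPP once n_stored values are in a filter of this size."""
    bits = size_bytes * 8
    return (1.0 - math.exp(-n_hashes * n_stored / bits)) ** n_hashes


expected_values = 512
overloading_factor = 1.25  # the "green box" value

size = bloom_size_bytes(expected_values, 0.01)   # 614 bytes
k = optimal_hash_count(size, expected_values)
stored = int(expected_values * overloading_factor)  # 640 values

print("filter size (bytes):", size)
print("hash functions:", k)
print("FPP at design load:", actual_fpp(size, k, expected_values))
print("FPP when overloaded:", actual_fpp(size, k, stored))
```

Changing overloading_factor to 2.0 or 10.0 shows how quickly the actual FPP 
grows once the filter holds more values than it was sized for.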

This shows that we need to base the size of a filter on the actual number of 
values stored. As in the example above, loading a 1% filter to 125% of its 
capacity results in a 2.5% actual FPP, and a 200% load results in a 10% 
actual FPP. In practice, I expect the capacity we guess to be off by an 
order of magnitude, not just a factor of two.

> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.



