[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602532#comment-14602532 ]

Ryan Blue commented on PARQUET-41:
----------------------------------

Thanks for working on this, [~Ferd]; it's great to be making some good progress 
on it. This is getting to be a pretty long comment. I don't have all that many 
conclusions, but I wanted to share some observations to start a discussion 
about how this feature should be done.

I've mostly been thinking lately about the bloom filter configuration. I like 
that FPP is a user setting because the query patterns really affect what value 
you want for it. You can get much better space savings with a high FPP if you 
know that typical queries will only look for a few items.

We can think of FPP as the probability that we will have to read a data page 
even though it doesn't actually have the item we are looking for. That 
probability roughly scales with the number of items in a query, which could be 
large but I think will generally be less than ~10 elements (a reasonable basis 
for a default). That puts a general upper limit on the FPP: if it is too high, 
say 10%, a fair number of queries will end up reading unnecessary data with a 
50+% probability (anything checking for 5 or more unique items).
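
To make that concrete, here is the quick arithmetic I'm using. The p * k figure 
is the simple union-bound estimate; the exact chance of at least one false 
positive across k independent lookups is 1 - (1 - p)^k, which is a bit lower:

{code:java}
// Rough arithmetic for the chance that a query reads a page it didn't need,
// given a per-lookup false-positive probability p and k distinct query items.
// p * k is the simple union-bound estimate; 1 - (1 - p)^k is the exact value
// when the lookups are independent.
public class QueryFppEstimate {
  public static void main(String[] args) {
    double p = 0.10;                      // per-lookup FPP
    for (int k : new int[] {1, 5, 10}) {
      double linear = Math.min(1.0, p * k);
      double exact = 1.0 - Math.pow(1.0 - p, k);
      System.out.printf("k=%d  linear=%.2f  exact=%.2f%n", k, linear, exact);
    }
  }
}
{code}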

I think we should have a way to read the page stats without the filter, since 
the filters can be pretty big. I took a look at a real-world dataset with 
8-byte timestamps that are ~75% unique, which put the expected filter size for 
a 2.5% false-positive rate at 9% of the block size. If I'm looking for 32 
timestamps at once, I have an 80% chance of reading pages I don't need, and I 
end up reading an extra 9% for every page's bloom filter alone.
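
Roughly how that size estimate works out with the standard sizing formula, 
bits = -n * ln(p) / (ln 2)^2 for n distinct values at FPP p. The 1 MB page of 
8-byte values is an assumed size for the example, not a property of the 
dataset itself:

{code:java}
// Sketch of the sizing math: the filter needs -n * ln(p) / (ln 2)^2 bits for
// n distinct values at false-positive probability p. The 1 MB page of 8-byte
// values is assumed for illustration.
public class FilterSizeEstimate {
  public static void main(String[] args) {
    double pageBytes = 1 << 20;                // assumed 1 MB data page
    double valuesPerPage = pageBytes / 8;      // 8-byte timestamps
    double distinctValues = 0.75 * valuesPerPage;  // ~75% unique
    double fpp = 0.025;
    double bits = -distinctValues * Math.log(fpp) / Math.pow(Math.log(2), 2);
    double filterBytes = bits / 8;
    System.out.printf("filter ~= %.0f bytes (%.1f%% of the page)%n",
        filterBytes, 100.0 * filterBytes / pageBytes);
  }
}
{code}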

I don't think we want a setting for the expected number of entries. For one 
thing, this varies widely across pages: I have a dataset with 20-30 values per 
page in one column and 131,000 values per page in another. A single setting 
for all columns will definitely be a problem, and I don't think we can trust 
users to set this correctly for every column in their data.

We also don't know much about how many unique values are in a column or how 
that column will compress with the encodings. Bloom filters are surprisingly 
expensive in terms of space considering some of the encoding sizes we can get 
in Parquet. For example, if we have a column where delta integer encoding is 
doing a good job, values might be ~2 bytes each. If the column is 75% unique, 
then even a 10% FPP will create a bloom filter that is ~22.5% of the page size, 
and a 1% FPP is ~44.9% of the page size. For comparison, a weaker encoding at 
8 bytes per value ends up with a filter that is ~11.2% of the page size for a 
1% FPP, which is still significant. As the encoding gets better, pages hold 
more values and the bloom filter needs to be larger.
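
Those ratios fall out of the filter needing -ln(p) / (ln 2)^2 bits per distinct 
value; a quick sketch of the arithmetic, with the 75% uniqueness from the 
example above:

{code:java}
// The filter's share of the page is
//   (uniqueFraction * bitsPerDistinctValue / 8) / encodedBytesPerValue,
// where bitsPerDistinctValue = -ln(p) / (ln 2)^2 for false-positive rate p.
public class FilterToPageRatio {
  static double ratio(double bytesPerValue, double uniqueFraction, double fpp) {
    double bitsPerDistinct = -Math.log(fpp) / Math.pow(Math.log(2), 2);
    double filterBytesPerValue = uniqueFraction * bitsPerDistinct / 8;
    return filterBytesPerValue / bytesPerValue;
  }

  public static void main(String[] args) {
    System.out.printf("2-byte values, 10%% FPP: %.1f%%%n", 100 * ratio(2, 0.75, 0.10));
    System.out.printf("2-byte values,  1%% FPP: %.1f%%%n", 100 * ratio(2, 0.75, 0.01));
    System.out.printf("8-byte values,  1%% FPP: %.1f%%%n", 100 * ratio(8, 0.75, 0.01));
  }
}
{code}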

Without knowing the percentage of unique values or the encoded size, choosing 
the expected number of values for a page is impossible. And because of the 
potential size of the filters compared to the page size, simply over-estimating 
the expected number of values isn't a good answer either: we don't want a 
filter that is 10% of the page size or larger. That means that if we chose an 
estimate for the number of values, we would still end up overloading filters 
fairly often. I took a look at the false-positive probability for overloaded 
filters: if a filter is 125% loaded, the actual false-positive probability at 
least doubles, and for an original 1% FPP it triples. It gets much worse as the 
overloading increases: 200% loaded results in a 9% actual FPP for an original 
1% FPP. Keep in mind that the actual overloading could easily be much worse 
than 200%, given that the number of values per page can vary from tens to tens 
of thousands.
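
For reference, the standard approximation for an overloaded filter is 
FPP ~= (1 - e^(-k * n' / m))^k for n' inserted values in an m-bit filter with 
k hash functions. A quick sketch of that formula is below; the exact figures 
depend on how m and k get rounded, so treat it as showing the trend rather 
than reproducing the spreadsheet's numbers:

{code:java}
// Standard overloaded-filter approximation: with m bits, k hash functions,
// and n' inserted values, FPP ~= (1 - e^(-k * n' / m))^k. The design capacity
// below is an arbitrary example value.
public class OverloadedFpp {
  public static void main(String[] args) {
    double targetFpp = 0.01;
    double designCapacity = 100_000;
    double m = -designCapacity * Math.log(targetFpp) / Math.pow(Math.log(2), 2);
    long k = Math.round(m / designCapacity * Math.log(2));   // optimal hash count
    for (double load : new double[] {1.0, 1.25, 2.0}) {
      double actual = Math.pow(1 - Math.exp(-k * load * designCapacity / m), k);
      System.out.printf("load=%.0f%%  actual FPP ~= %.1f%%%n", 100 * load, 100 * actual);
    }
  }
}
{code}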

I think there are two approaches to fixing this. First, there's a paper, 
[Scalable Bloom Filters|http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf], 
that describes a strategy of using a series of bloom filters so you don't have 
to know the size in advance. It's a good paper, but we would want to change the 
heuristics for growing the filter because we know when we are getting close to 
the total number of elements in the page. Another drawback is that it uses a 
series of filters, so testing for an element has to be done against each 
filter in the series.
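
Roughly what the series-of-filters approach looks like, as a minimal sketch: 
the class name, growth factor, and FPP-tightening ratio here are placeholder 
choices rather than the paper's exact heuristics, and I'm using Guava's 
BloomFilter for the individual filters:

{code:java}
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.util.ArrayList;
import java.util.List;

// Sketch of a scalable bloom filter: when the current filter reaches its
// design capacity, start a new, larger one with a tighter FPP so the
// compounded false-positive rate stays bounded.
public class ScalableLongBloomFilter {
  private final List<BloomFilter<Long>> filters = new ArrayList<>();
  private final double initialFpp;
  private int currentCapacity;
  private int currentCount;

  public ScalableLongBloomFilter(int initialCapacity, double initialFpp) {
    this.initialFpp = initialFpp;
    this.currentCapacity = initialCapacity;
    addFilter();
  }

  private void addFilter() {
    // Tighten each new filter's FPP geometrically so the series stays near the target.
    double fpp = initialFpp * Math.pow(0.5, filters.size());
    filters.add(BloomFilter.create(Funnels.longFunnel(), currentCapacity, fpp));
    currentCount = 0;
  }

  public void put(long value) {
    if (currentCount >= currentCapacity) {
      currentCapacity *= 2;   // grow the next filter in the series
      addFilter();
    }
    filters.get(filters.size() - 1).put(value);
    currentCount += 1;
  }

  public boolean mightContain(long value) {
    // The drawback noted above: a lookup has to probe every filter in the series.
    for (BloomFilter<Long> filter : filters) {
      if (filter.mightContain(value)) {
        return true;
      }
    }
    return false;
  }
}
{code}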

I think a second approach is to keep the data in memory until we have enough to 
determine the properties of the bloom filter. This would only need to be done 
for the first few pages, while memory consumption is still small. We could keep 
the hashed values instead of the actual data, which gets the size down to a set 
of integers whose count is approximately the number of unique items in the page 
(minus collisions). I like this option better because it is all on the write 
side and trades a reasonable amount of memory for avoiding a more complicated 
filter structure. The read side would stay as it is now.
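
Here's a rough sketch of what I mean on the write side. The class and method 
names are placeholders, and I'm using Guava's BloomFilter to stand in for 
whatever filter implementation we end up with:

{code:java}
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.util.HashSet;
import java.util.Set;

// Sketch of the buffer-then-size idea: hold the 64-bit hashes for the first
// page(s), then size the filter from the observed distinct count instead of
// guessing it up front.
public class DeferredBloomFilterWriter {
  private final double fpp;
  private final Set<Long> bufferedHashes = new HashSet<>();
  private BloomFilter<Long> filter = null;

  public DeferredBloomFilterWriter(double fpp) {
    this.fpp = fpp;
  }

  // Called with each value's hash while the page is being written.
  public void add(long hash) {
    if (filter == null) {
      bufferedHashes.add(hash);   // roughly one long per distinct value, minus collisions
    } else {
      filter.put(hash);
    }
  }

  // Called once we've seen enough data to trust the distinct count.
  public void finishBuffering() {
    int observedDistinct = Math.max(bufferedHashes.size(), 1);
    filter = BloomFilter.create(Funnels.longFunnel(), observedDistinct, fpp);
    for (long hash : bufferedHashes) {
      filter.put(hash);
    }
    bufferedHashes.clear();
  }

  public BloomFilter<Long> filter() {
    return filter;
  }
}
{code}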

Okay, this is long enough. I'll clean up the spreadsheet I'm basing my numbers 
on and share it tomorrow.

> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.



