[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602532#comment-14602532 ]
Ryan Blue commented on PARQUET-41:
----------------------------------

Thanks for working on this, [~Ferd], it's great to be making good progress on it. This has turned into a pretty long comment. I don't have many firm conclusions, but I wanted to share some observations to start a discussion around how this feature should be done.

I've mostly been thinking lately about the bloom filter configuration. I like that FPP is a user setting, because query patterns really affect what value you want for it. You can get much better space savings with a high FPP if you know that typical queries will only look for a few items. We can think of FPP as the probability that we will have to read a data page even though it doesn't actually contain the item we are looking for. That probability compounds with the number of items in a query, which could be large but I think will generally be fewer than ~10 elements (for the purpose of choosing a default). That puts a general upper limit on the FPP: if it is too high, like 10%, a fair number of queries will end up reading unnecessary data with a 50+% probability (anything checking for 5 or more unique items).

I think we should have a way to read the page stats without the filter, since filters can be pretty big. I took a look at a real-world dataset with 8-byte timestamps that are ~75% unique, which put the expected filter size for a 2.5% false-positive rate at 9% of the block size. If I'm looking for 32 timestamps at once, I have an 80% chance of reading pages I don't need to read, and I end up reading an extra 9% for every page's bloom filter alone.

I don't think we want a setting for the expected number of entries. For one thing, this varies widely across pages: I have a dataset with 20-30 values per page in one column and 131,000 values per page in another. A single setting for all columns will definitely be a problem, and I don't think we can trust users to set this correctly for their data on every column.
We also don't know much about how many unique values are in a column or how that column will compress with the encodings. Bloom filters are surprisingly expensive in terms of space considering some of the encoding sizes we can get in Parquet. For example, if we have a column where delta integer encoding is doing a good job, values might be ~2 bytes each. If the column is 75% unique, then even a 10% FPP produces a bloom filter that is ~22.5% of the page size, and a 1% FPP is ~44.9% of the page size. For comparison, with a less effective encoding at 8 bytes per value, the filter is ~11.2% of the page size for a 1% FPP, which is still significant. As encoding gets better, pages hold more values and the bloom filter needs to be larger. Without knowing the percentage of unique values or the encoded value size, choosing the expected number of values for a page is impossible.

Because of the potential size of the filters compared to the page size, simply over-estimating the filter size isn't an acceptable fix: we don't want something 10% of the page size or larger. That means that if we chose an estimate for the number of values, we would still end up overloading filters fairly often. I took a look at the false-positive probability for overloaded filters: if a filter is 125% loaded, the actual false-positive probability at least doubles, and for an original 1% FPP, it triples. It gets much worse as the overloading increases: 200% loaded results in a 9% actual FPP from a 1% design FPP. Keep in mind that the expected overloading is probably not as low as 200%, given that the number of values per page can vary from tens to tens of thousands.

I think there are 2 approaches to fixing this. First, there's a paper, [Scalable Bloom Filters|http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf], that describes a strategy for using a series of bloom filters so you don't have to know the size in advance.
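The size ratios and overload penalties above follow from the standard bloom filter formulas, m = -n·ln(p)/(ln 2)² bits for n distinct items at FPP p, with k = ln 2 · m/n hash functions. A quick sketch to reproduce them — note the exact overload numbers depend on modeling details such as how the hash count is rounded, so they may differ somewhat from the figures in the comment:

```python
import math

LN2_SQ = math.log(2) ** 2

def filter_to_page_ratio(fpp, unique_fraction, bytes_per_value):
    """Optimally sized filter as a fraction of the encoded page size."""
    bits_per_distinct = -math.log(fpp) / LN2_SQ
    return unique_fraction * bits_per_distinct / 8 / bytes_per_value

def overloaded_fpp(design_fpp, load):
    """Actual false-positive rate when a filter holds `load` times the
    number of distinct items it was sized for (standard fill model)."""
    bits_per_item = -math.log(design_fpp) / LN2_SQ      # m/n at design time
    k = round(math.log(2) * bits_per_item)              # number of hashes
    fill = 1 - math.exp(-k * load / bits_per_item)      # fraction of set bits
    return fill ** k

# 75% unique column, ~2-byte delta-encoded values:
print(filter_to_page_ratio(0.10, 0.75, 2))   # ~0.225 of the page size
print(filter_to_page_ratio(0.01, 0.75, 2))   # ~0.449
print(filter_to_page_ratio(0.01, 0.75, 8))   # ~0.112 with 8-byte values
# overload penalty for a filter designed for 1% FPP:
print(overloaded_fpp(0.01, 1.25))            # ~0.027, roughly triple
print(overloaded_fpp(0.01, 2.0))             # far above the 1% design target
```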
It's a good paper, but we would want to change the heuristics for growing the filter, because we know when we are getting close to the total number of elements in the page. Another drawback is that it uses a series of filters, so testing for an element has to be done in each filter.

A second approach is to keep the data in memory until we have enough to determine the properties of the bloom filter. This would only need to be done for the first few pages, while memory consumption is still small. We could keep the hashed values instead of the actual data to get the size down to a set of integers that will be approximately the number of unique items in the page (minus collisions). I like this option better because it is all on the write side and trades a reasonable amount of memory for a more complicated filter. The read side would stay as it is now.

Okay, this is long enough. I'll clean up the spreadsheet I'm basing my numbers on and share it tomorrow.

> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter.
> This could be very useful in filtering entire row groups.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
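The write-side buffering approach described in the comment — collect hashed values for a page, then size the filter from the observed distinct-hash count — might be sketched as follows. Everything here is hypothetical illustration (class name, hashing scheme, bit layout), not parquet-mr code:

```python
import hashlib
import math

def _hash64(value):
    """Stable 64-bit hash of a value's repr (illustration only)."""
    digest = hashlib.sha256(repr(value).encode()).digest()
    return int.from_bytes(digest[:8], "little")

class PageBloomBuilder:
    """Buffer value hashes for a page, then size the filter from the
    observed distinct-hash count (approximately the unique-value count)."""

    def __init__(self, fpp):
        self.fpp = fpp
        self.hashes = set()          # distinct 64-bit hashes seen so far

    def add(self, value):
        self.hashes.add(_hash64(value))

    def build(self):
        n = max(1, len(self.hashes))
        m = max(8, round(-n * math.log(self.fpp) / math.log(2) ** 2))  # bits
        k = max(1, round(math.log(2) * m / n))                         # hashes
        bits = bytearray((m + 7) // 8)
        for h in self.hashes:
            h1, h2 = h & 0xFFFFFFFF, h >> 32
            for i in range(k):       # double hashing: position_i = h1 + i*h2
                pos = (h1 + i * h2) % m
                bits[pos // 8] |= 1 << (pos % 8)
        return bits, m, k

def might_contain(bits, m, k, value):
    """Read-side probe of a filter produced by PageBloomBuilder."""
    h = _hash64(value)
    h1, h2 = h & 0xFFFFFFFF, h >> 32
    return all((bits[((h1 + i * h2) % m) // 8] >> (((h1 + i * h2) % m) % 8)) & 1
               for i in range(k))

builder = PageBloomBuilder(fpp=0.01)
for v in range(1000):
    builder.add(v)
bits, m, k = builder.build()
```

Because the filter is sized only after the page's values are buffered, it is never overloaded; the cost is holding one 64-bit hash per distinct value in memory while the page is open.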