[
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602532#comment-14602532
]
Ryan Blue commented on PARQUET-41:
----------------------------------
Thanks for working on this, [~Ferd], it's great to be making some good progress
on it. This is getting to be a pretty long comment. I don't have all that many
conclusions, but I wanted to share some observations to start a discussion
around how this feature should be done.
Lately I've mostly been thinking about the bloom filter configuration. I like
that FPP is a user setting because the query patterns really affect what value
you want for it. You can get much better space savings with a high FPP if you
know that typical queries will only look for a few items.
We can think of FPP as the probability that we will have to read a data page
even though it doesn't actually have the item we are looking for. That is
multiplied by the number of items in a query, which could be large but I think
will generally be fewer than ~10 elements (for basing a default). That puts a
general upper limit on the FPP: if it is too high, like 10%, a fair number of
queries will end up reading unnecessary data with roughly 50% probability
(anything checking for 5 or more unique items).
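A quick sketch of that threshold: multiplying the per-lookup FPP by the number of items in a query is the union-bound upper estimate; the exact combined probability for q independent lookups is 1 - (1 - p)^q, which is close but slightly lower.

```python
def any_false_positive(fpp, q):
    """Probability that at least one of q independent lookups
    against a bloom filter is a false positive."""
    return 1 - (1 - fpp) ** q

# At a 10% FPP, a query probing 5 distinct items reads unneeded
# data with ~41% probability; the union bound q * fpp gives 50%.
print(round(any_false_positive(0.10, 5), 2))  # 0.41
```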
I think we should have a way to read the page stats without the filter, since
they can be pretty big. I took a look at a real-world dataset with 8-byte
timestamps that are ~75% unique, which put the expected filter size for a 2.5%
false-positive rate at 9% of the block size. If I'm looking for 32 timestamps
at once, I have an 80% chance of reading pages I don't need to read, and end up
reading an extra 9% for every page's bloom filter alone.
I don't think we want a setting for the expected number of entries. For one
thing, this varies widely across pages. I have a dataset with 20-30 values per
page in one column and 131,000 values per page in another. A setting for all
columns will definitely be a problem and I don't think we can trust users to
set this correctly for their data on every column.
We also don't know much about how many unique values are in a column or how
that column will compress with the encodings. Bloom filters are surprisingly
expensive in terms of space considering some of the encoding sizes we can get
in Parquet. For example, if we have a column where delta integer encoding is
doing a good job, values might be ~2 bytes each. If the column is 75% unique,
then even a 10% FPP will create a bloom filter that is ~22.5% of the page size,
and a 1% FPP is ~44.9% of the page size. For comparison, with a less effective
encoding at 8 bytes per value, the filter is still ~11.2% of the page size for a
1% FPP, which is significant. As encoding gets better, pages have more values and the
still significant. As encoding gets better, pages have more values and the
bloom filter needs to be larger.
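Those percentages follow from the classic bloom filter sizing formula, m/n = -ln(p) / (ln 2)^2 bits per distinct key. A sketch reproducing them (the 75%-unique ratio and per-value byte sizes are the examples from the text):

```python
import math

def bloom_bytes_per_key(fpp):
    # classic sizing: m/n = -ln(p) / (ln 2)^2 bits per distinct key
    return -math.log(fpp) / (math.log(2) ** 2) / 8

def filter_fraction(fpp, unique_ratio, encoded_bytes_per_value):
    # filter size relative to the encoded page size
    return unique_ratio * bloom_bytes_per_key(fpp) / encoded_bytes_per_value

# 75%-unique column:
print(round(filter_fraction(0.10, 0.75, 2), 3))  # 0.225 (2-byte values, 10% FPP)
print(round(filter_fraction(0.01, 0.75, 2), 3))  # 0.449 (2-byte values, 1% FPP)
print(round(filter_fraction(0.01, 0.75, 8), 3))  # 0.112 (8-byte values, 1% FPP)
```

The same formula gives the ~9% figure for the timestamp dataset above: filter_fraction(0.025, 0.75, 8) is ~0.090.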
Without knowing the percentage of unique values or the encoding size, choosing
the expected number of values for a page is impossible. Because of the
potential size of the filters compared to the page size, over-estimating the
filter size isn't enough: we don't want something 10% of the page size or
larger. That means that if we chose an estimate for the number of values, we
would still end up overloading filters fairly often. I took a look at the
false-positive probability for overloaded filters: if a filter is 125% loaded,
then the actual false-positive probability at least doubles, and for an
original 1% FPP, it triples. It gets much worse as the overloading increases:
200% loaded results in a 9% actual FPP based on a 1% original FPP. Keep in mind
that the expected overloading is probably not as low as 200% given that the
number of values per page can vary from tens to tens of thousands.
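To see the shape of that degradation, here is a sketch using the standard (1 - e^(-kn/m))^k model. The exact figures depend on how the hash count k is rounded and which approximation is used, so these numbers won't match the spreadsheet's exactly, but the trend is the same:

```python
import math

def overloaded_fpp(design_fpp, load):
    """Actual FPP of a filter sized for n keys at design_fpp after
    inserting load * n keys, per the standard (1 - e^(-kn/m))^k model."""
    bits_per_key = -math.log(design_fpp) / (math.log(2) ** 2)  # m/n
    k = round(bits_per_key * math.log(2))                      # optimal hash count
    return (1 - math.exp(-k * load / bits_per_key)) ** k

print(round(overloaded_fpp(0.01, 1.00), 3))  # ~0.01 at design load
print(round(overloaded_fpp(0.01, 1.25), 3))  # ~0.028, nearly triple
print(round(overloaded_fpp(0.01, 2.00), 3))  # ~0.157, an order of magnitude worse
```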
I think there are two approaches to fixing this. First, there's a paper,
[Scalable Bloom Filters|http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf],
that has a strategy to use a series of bloom filters so you don't have to know
the size in advance. It's a good paper, but we would want to change the
heuristics for growing the filter because we know when we are getting close to
the total number of elements in the page. Another drawback is that it uses a
series of filters, so testing for an element has to be done in each filter.
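For concreteness, a minimal sketch of the series-of-filters idea. This is not the paper's exact growth heuristics; the hash scheme, bits per key, and growth factor here are all illustrative choices:

```python
import hashlib

class Bloom:
    """A simple fixed-size bloom filter (double hashing over an MD5 digest)."""
    def __init__(self, n, k=7):
        self.m, self.k = n * 10, k  # ~10 bits/key, roughly a 1% FPP
        self.bits = bytearray(self.m // 8 + 1)
        self.count = 0

    def _positions(self, item):
        h = hashlib.md5(item.encode()).digest()
        a = int.from_bytes(h[:8], "little")
        b = int.from_bytes(h[8:], "little")
        return [(a + i * b) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)
        self.count += 1

    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))

class ScalableBloom:
    """Series of filters; start a larger one when the current fills up."""
    def __init__(self, initial=64, growth=2):
        self.filters = [Bloom(initial)]
        self.cap, self.growth = initial, growth

    def add(self, item):
        if self.filters[-1].count >= self.cap:
            self.cap *= self.growth
            self.filters.append(Bloom(self.cap))
        self.filters[-1].add(item)

    def __contains__(self, item):
        # a lookup has to probe every filter in the series
        return any(item in f for f in self.filters)
```

The membership test in the last method is the drawback mentioned above: it gets slower as the series grows.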
I think a second approach is to keep the data in memory until we have enough to
determine the properties of the bloom filter. This would only need to be done
for the first few pages, while memory consumption is still small. We could keep
the hashed values instead of the actual data to get the size down to a set of
integers whose count is approximately the number of unique items in the page
(minus collisions). I like this option better because it is all on the write
side and trades a reasonable amount of memory for a more complicated filter.
The read side would be as it is now.
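A sketch of that buffering idea (the function name and the 64-bit hash choice are illustrative, not a proposed API): buffer hashes instead of raw values, count the distinct hashes, and size the filter from that count before building it.

```python
import hashlib
import math

def size_filter_from_buffer(values, target_fpp):
    """Buffer 64-bit hashes of the first page's values, then size the
    bloom filter from the observed distinct count before building it."""
    hashes = {int.from_bytes(hashlib.md5(str(v).encode()).digest()[:8], "little")
              for v in values}
    n = len(hashes)  # ~number of unique items, minus hash collisions
    m_bits = max(8, round(-n * math.log(target_fpp) / math.log(2) ** 2))
    return hashes, m_bits  # the buffered hashes seed the real filter

hashes, m_bits = size_filter_from_buffer(["a", "b", "a", "c"], 0.01)
print(len(hashes), m_bits)  # 3 29
```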
Okay, this is long enough. I'll clean up the spreadsheet I'm basing my numbers
on and share it tomorrow.
> Add bloom filters to parquet statistics
> ---------------------------------------
>
> Key: PARQUET-41
> URL: https://issues.apache.org/jira/browse/PARQUET-41
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-format, parquet-mr
> Reporter: Alex Levenson
> Assignee: Ferdinand Xu
> Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter.
> This could be very useful in filtering entire row groups.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)