The main reason for bloom filter header encryption is tamper-proofing it with AES GCM. If not done, an attacker can alter the hash/algorithm fields, making the readers lose relevant data on the one hand, and load/filter irrelevant data on the other hand. Bloom filters are encrypted differently from pages (GCM only, no page ordinal, etc).
On Sun, Aug 4, 2019 at 4:47 PM 俊杰陈 <[email protected]> wrote: > Many thanks for these useful comments! > > I will create a PR to update format according to these comments. > Reply some of them inline: > > The spec should give more detail on how to choose the number of blocks > and on false positive rates. The sentence with “11.54 bits for each > distinct value inserted into the filter” is vague: is this the > multi-block filter? Why is a 1% false-positive rate “recommended”? I > think it is okay to use 0.5% as each block’s false-positive rate, but > then this should state how to achieve an overall false-positive rate > as a function of the number of distinct values. > [junjie]: IIUC, 0.5% is overall FPR. The reason to recommend 1% FPR > is that it can use about 1.2MB bitset to store one million distinct > values which should satisfy most cases, but I cannot guarantee that > fit extreme cases such as only one column in a table, so I also > recommend to use a configuration to set the FPR. In the implementation > in the bloom-filter branch, I also set default max bloom filter size > to 1MB. > > Should the bloom filters support compression? > [junjie], The compression should help as you mentioned. I didn't add > it because I think the size of a bloom filter should be small for a > column chunk. Also, we may skip building bloom filter if there is an > existing dictionary. > > Why is the bloom filter header encrypted? > In a discussion with Gidon, we 'd like to follow the current way when > encrypting pages. > > > On Sun, Aug 4, 2019 at 4:42 AM Ryan Blue <[email protected]> wrote: > > > > Right now, I’m voting -1 on the spec as written because I don’t think > that it is clear enough to be correctly implemented from just the spec. > I’ll change that with a few corrections. > > > > Here are the items I think need to be fixed before I’ll change my vote > to support this spec: > > > > The algorithm should be more clear: > > > > How is the bloom filter block selected from the 32 most-significant bits > from of the hash function? These details must be in the spec and not in > papers linked from the spec. > > How is the number of blocks determined? From the overall filter size? > > I think that the exact procedure for a lookup in each block should be > covered in a section, followed by a section for how to perform a look up in > the multi-block filter. The wording also needs to be cleaned up so that it > is always clear whether the filter that is referenced is a block or the > multi-block filter. > > > > The spec should give more detail on how to choose the number of blocks > and on false positive rates. The sentence with “11.54 bits for each > distinct value inserted into the filter” is vague: is this the multi-block > filter? Why is a 1% false-positive rate “recommended”? > > > > I think it is okay to use 0.5% as each block’s false-positive rate, but > then this should state how to achieve an overall false-positive rate as a > function of the number of distinct values. > > > > This should be more clear about the scope of bloom filters. The only > reference right now is “The Bloom filter data of a column chunk, …”. If > each filter is for a column chunk, then this should be stated as a > requirement. > > The position of bloom filter headers and bloom filter bit sets is too > vague: “The Bloom filter data of a column chunk, which contains the size of > the filter in bytes, the algorithm, the hash function and the Bloom filter > bitset, is stored near the footer.” > > > > This should state that the column chunk offset points to the start of > the bloom filter header, which immediately precedes the bloom filter bytes. > The number of bytes must be stored as numBytes in the header. Include > layout diagrams like the ones in the encryption spec. > > The position of all bloom filters in relation to the page indexes should > be clear. Bloom filters should be written just before column indexes? > > The position of each bloom filter within the bloom filter data section > should be clear. Are bloom filters for a column located together, or bloom > filters for a row group? (R1C1, R2C1, R1C2, R2C2 vs R1C1, R1C2, R2C1, R2C2) > > Does the format allow locating bloom filter data anywhere else other > than just before the header? Maybe the locations are recommendations and > not requirements? > > > > Should the bloom filters support compression? If the strategy for a > lower false-positive rate is to under-fill the multi-block filter, then > would compression help reduce the storage cost? > > Why is the bloom filter header encrypted? > > > > > > On Wed, Jul 31, 2019 at 10:01 AM 俊杰陈 <[email protected]> wrote: > >> > >> Hi Wes, > >> > >> Thanks for the reply. It indeed does not have automation support in > >> the test. Since now we have the repo parquet-testing for the > >> integration test, I think we can do some automation base on that. I > >> will take some time to look at this and update the integration plan in > >> PARQUET-1326. > >> > >> Besides that, we can open a thread to discuss building automation > >> integration test framework for features WIP and in the future. > >> > >> On Thu, Aug 1, 2019 at 12:10 AM Ryan Blue <[email protected]> > wrote: > >> > > >> > Wes, > >> > > >> > Since the v2 format is unfinished, I don't think anyone should be > writing > >> > v2 pages. In fact, I don't think v2 pages would be recommended in the > v2 > >> > spec at this point. Do you know why Spark is writing this data? I > didn't > >> > know that Spark would write v2 by default. > >> > > >> > rb > >> > > >> > On Wed, Jul 31, 2019 at 8:28 AM Wes McKinney <[email protected]> > wrote: > >> > > >> > > I'm not sure when I can have a more thorough look at this. > >> > > > >> > > To be honest, I'm personally struggling with the burden of > supporting > >> > > the existing feature set in the Parquet C++ library. The integration > >> > > testing strategy for this as well as other features that are being > >> > > added to both the Java and C++ libraries (e.g. encryption) make me > >> > > uncomfortable due to the lack of automation. As an example, > DataPageV2 > >> > > support in parquet-cpp has been broken since the beginning of the > >> > > project (see PARQUET-458) but it's only recently become an issue > when > >> > > people have been trying to read such files produced by Spark. More > >> > > comprehensive integration testing would help ensure that the > libraries > >> > > remain compatible. > >> > > > >> > > On Tue, Jul 30, 2019 at 9:17 PM 俊杰陈 <[email protected]> wrote: > >> > > > > >> > > > Dear Parquet developers > >> > > > > >> > > > We still need your vote! > >> > > > > >> > > > > >> > > > On Wed, Jul 24, 2019 at 9:30 PM 俊杰陈 <[email protected]> wrote: > >> > > > > > >> > > > > Hi @Ryan Blue @Wes McKinney > >> > > > > > >> > > > > We need your valuable vote, any feedback is welcome as well. > >> > > > > > >> > > > > On Tue, Jul 23, 2019 at 1:24 PM 俊杰陈 <[email protected]> wrote: > >> > > > > > > >> > > > > > Call for voting again. > >> > > > > > > >> > > > > > On Fri, Jul 19, 2019 at 1:17 PM 俊杰陈 <[email protected]> > wrote: > >> > > > > > > > >> > > > > > > Dear Parquet developers > >> > > > > > > > >> > > > > > > We need more votes, please help to vote on this. > >> > > > > > > > >> > > > > > > On Wed, Jul 17, 2019 at 3:42 PM Gabor Szadovszky > >> > > > > > > <[email protected]> wrote: > >> > > > > > > > > >> > > > > > > > After getting in PARQUET-1625 I vote again for having > bloom > >> > > filter spec and > >> > > > > > > > the thrift file update as is in parquet-format master. > >> > > > > > > > +1 (binding) > >> > > > > > > > > >> > > > > > > > On Mon, Jul 15, 2019 at 3:23 PM 俊杰陈 <[email protected]> > wrote: > >> > > > > > > > > >> > > > > > > > > Thanks Gabor, It's never too late to make it better. We > don't > >> > > have to > >> > > > > > > > > run it in a hurry, it has been developed for a long > time yet.:) > >> > > > > > > > > > >> > > > > > > > > The thrift file is indeed a bit lag behind the spec. As > the > >> > > spec > >> > > > > > > > > defined, the bloom filter data is stored near the > footer which > >> > > means > >> > > > > > > > > we don't have to handle it like the page. Therefore, I > just > >> > > opened a > >> > > > > > > > > jira to remove bloom_filter_page_header in PageHeader > >> > > structure, while > >> > > > > > > > > the BloomFitlerHeader is kept intentionally for > convenience. > >> > > Since the > >> > > > > > > > > spec and the thrift should be aligned with each other > >> > > eventually, so > >> > > > > > > > > the vote target is both of them. > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > On Mon, Jul 15, 2019 at 7:48 PM Gabor Szadovszky > >> > > > > > > > > <[email protected]> wrote: > >> > > > > > > > > > > >> > > > > > > > > > Hi Junjie, > >> > > > > > > > > > > >> > > > > > > > > > Sorry for bringing up this a bit late but I have some > >> > > problems with the > >> > > > > > > > > > format update. The parquet.thrift file is updated to > have > >> > > the bloom > >> > > > > > > > > filters > >> > > > > > > > > > as a page (just as dictionaries and data pages). > Meanwhile, > >> > > the spec > >> > > > > > > > > > (BloomFilter.md) says that the bloom filter is stored > near > >> > > the footer. > >> > > > > > > > > So, > >> > > > > > > > > > if the bloom filter is not part of the row-groups > (like > >> > > column indexes) I > >> > > > > > > > > > would not add it as a page. See the struct > ColumnIndex in > >> > > the thrift > >> > > > > > > > > file. > >> > > > > > > > > > This struct is not referenced anywhere in it only > declared. > >> > > It was done > >> > > > > > > > > > this way because we don't parse it in the same way as > we > >> > > parse the pages. > >> > > > > > > > > > > >> > > > > > > > > > Currently, I am not 100% sure about the target of > this vote. > >> > > If it is a > >> > > > > > > > > > vote about adding bloom filters in general then it is > a +1 > >> > > (binding). If > >> > > > > > > > > it > >> > > > > > > > > > is about adding the bloom filters to parquet-format > as is > >> > > then, it is a > >> > > > > > > > > -1 > >> > > > > > > > > > (binding) until we fix the issue above. > >> > > > > > > > > > > >> > > > > > > > > > Regards, > >> > > > > > > > > > Gabor > >> > > > > > > > > > > >> > > > > > > > > > On Mon, Jul 15, 2019 at 11:45 AM Gidon Gershinsky < > >> > > [email protected]> > >> > > > > > > > > wrote: > >> > > > > > > > > > > >> > > > > > > > > > > +1 (non-binding) > >> > > > > > > > > > > > >> > > > > > > > > > > On Mon, Jul 15, 2019 at 12:08 PM Zoltan Ivanfi > >> > > <[email protected] > >> > > > > > > > > > > >> > > > > > > > > > > wrote: > >> > > > > > > > > > > > >> > > > > > > > > > > > +1 (binding) > >> > > > > > > > > > > > > >> > > > > > > > > > > > On Mon, Jul 15, 2019 at 9:57 AM 俊杰陈 < > [email protected]> > >> > > wrote: > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > Dear Parquet developers > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > I'd like to resume this vote, you can start to > vote > >> > > now. Thanks for > >> > > > > > > > > > > your > >> > > > > > > > > > > > time. > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 < > >> > > [email protected]> wrote: > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > I see, will resume this next week. Thanks. > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi > >> > > > > > > > > > > <[email protected]> > >> > > > > > > > > > > > wrote: > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Hi Junjie, > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Since there are ongoing improvements > addressing > >> > > review > >> > > > > > > > > comments, I > >> > > > > > > > > > > > would > >> > > > > > > > > > > > > > > hold off with the vote for a few more days > until > >> > > the > >> > > > > > > > > specification > >> > > > > > > > > > > > settles. > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Br, > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Zoltan > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 < > >> > > [email protected]> > >> > > > > > > > > wrote: > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > Hi Parquet committers and developers > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > We are waiting for your important ballot:) > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 < > >> > > [email protected]> > >> > > > > > > > > wrote: > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > Yes, there are some public benchmark > results, > >> > > such as the > >> > > > > > > > > > > > official > >> > > > > > > > > > > > > > > > > benchmark from xxhash site ( > >> > > http://www.xxhash.com/) and > >> > > > > > > > > > > > published > >> > > > > > > > > > > > > > > > > comparison from smhasher project > >> > > > > > > > > > > > > > > > > (https://github.com/rurban/smhasher/). > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 5:25 AM Wes > McKinney < > >> > > > > > > > > > > [email protected]> > >> > > > > > > > > > > > wrote: > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > Do you have any benchmark data to > support > >> > > the choice of > >> > > > > > > > > hash > >> > > > > > > > > > > > function? > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 < > >> > > [email protected]> > >> > > > > > > > > > > wrote: > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > Dear Parquet developers > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > To simplify the voting, I 'd like to > >> > > update voting > >> > > > > > > > > content > >> > > > > > > > > > > > to the > >> > > > > > > > > > > > > > > > spec > >> > > > > > > > > > > > > > > > > > > with xxHash hash strategy. Now you > can > >> > > reply with +1 > >> > > > > > > > > or -1. > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > Thanks for your participation. > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 > < > >> > > > > > > > > [email protected]> > >> > > > > > > > > > > > wrote: > >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > Dear Parquet developers > >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > Parquet Bloom filter has been > developed > >> > > for a while, > >> > > > > > > > > per > >> > > > > > > > > > > > the > >> > > > > > > > > > > > > > > > discussion on the mail list, it's time to > call a > >> > > vote for > >> > > > > > > > > spec to > >> > > > > > > > > > > > move > >> > > > > > > > > > > > > > > > forward. The current spec can be found at > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > https://github.com/apache/parquet-format/blob/master/BloomFilter.md > . > >> > > > > > > > > > > > > > > > There are some different options about the > >> > > internal hash > >> > > > > > > > > choice > >> > > > > > > > > > > of > >> > > > > > > > > > > > Bloom > >> > > > > > > > > > > > > > > > filter and the PR is for that concern. > >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > So I 'd like to propose to vote > the spec > >> > > + hash > >> > > > > > > > > option, > >> > > > > > > > > > > for > >> > > > > > > > > > > > > > > > example: > >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > +1 to spec and xxHash > >> > > > > > > > > > > > > > > > > > > > +1 to spec and murmur3 > >> > > > > > > > > > > > > > > > > > > > ... > >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > Please help to vote, any feedback > is > >> > > also welcome in > >> > > > > > > > > the > >> > > > > > > > > > > > > > > > discussion thread. > >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > Thanks & Best Regards > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > -- > >> > > > > > > > > > > > > > > > > > > Thanks & Best Regards > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > -- > >> > > > > > > > > > > > > > > > > Thanks & Best Regards > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > -- > >> > > > > > > > > > > > > > > > Thanks & Best Regards > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > -- > >> > > > > > > > > > > > > > Thanks & Best Regards > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > -- > >> > > > > > > > > > > > > Thanks & Best Regards > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > -- > >> > > > > > > > > Thanks & Best Regards > >> > > > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > -- > >> > > > > > > Thanks & Best Regards > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > -- > >> > > > > > Thanks & Best Regards > >> > > > > > >> > > > > > >> > > > > > >> > > > > -- > >> > > > > Thanks & Best Regards > >> > > > > >> > > > > >> > > > > >> > > > -- > >> > > > Thanks & Best Regards > >> > > > >> > > >> > > >> > -- > >> > Ryan Blue > >> > Software Engineer > >> > Netflix > >> > >> > >> > >> -- > >> Thanks & Best Regards > > > > > > > > -- > > Ryan Blue > > Software Engineer > > Netflix > > > > -- > Thanks & Best Regards >
