Thanks Gidon to correct me. On Sun, Aug 4, 2019 at 10:37 PM Gidon Gershinsky <[email protected]> wrote: > > The main reason for bloom filter header encryption is tamper-proofing it > with AES GCM. > If not done, an attacker can alter the hash/algorithm fields, making the > readers lose relevant data on the one hand, and load/filter irrelevant data > on the other hand. > Bloom filters are encrypted differently from pages (GCM only, no page > ordinal, etc). > > On Sun, Aug 4, 2019 at 4:47 PM 俊杰陈 <[email protected]> wrote: > > > Many thanks for these useful comments! > > > > I will create a PR to update format according to these comments. > > Reply some of them inline: > > > > The spec should give more detail on how to choose the number of blocks > > and on false positive rates. The sentence with “11.54 bits for each > > distinct value inserted into the filter” is vague: is this the > > multi-block filter? Why is a 1% false-positive rate “recommended”? I > > think it is okay to use 0.5% as each block’s false-positive rate, but > > then this should state how to achieve an overall false-positive rate > > as a function of the number of distinct values. > > [junjie]: IIUC, 0.5% is overall FPR. The reason to recommend 1% FPR > > is that it can use about 1.2MB bitset to store one million distinct > > values which should satisfy most cases, but I cannot guarantee that > > fit extreme cases such as only one column in a table, so I also > > recommend to use a configuration to set the FPR. In the implementation > > in the bloom-filter branch, I also set default max bloom filter size > > to 1MB. > > > > Should the bloom filters support compression? > > [junjie], The compression should help as you mentioned. I didn't add > > it because I think the size of a bloom filter should be small for a > > column chunk. Also, we may skip building bloom filter if there is an > > existing dictionary. > > > > Why is the bloom filter header encrypted? > > In a discussion with Gidon, we 'd like to follow the current way when > > encrypting pages. > > > > > > On Sun, Aug 4, 2019 at 4:42 AM Ryan Blue <[email protected]> wrote: > > > > > > Right now, I’m voting -1 on the spec as written because I don’t think > > that it is clear enough to be correctly implemented from just the spec. > > I’ll change that with a few corrections. > > > > > > Here are the items I think need to be fixed before I’ll change my vote > > to support this spec: > > > > > > The algorithm should be more clear: > > > > > > How is the bloom filter block selected from the 32 most-significant bits > > from of the hash function? These details must be in the spec and not in > > papers linked from the spec. > > > How is the number of blocks determined? From the overall filter size? > > > I think that the exact procedure for a lookup in each block should be > > covered in a section, followed by a section for how to perform a look up in > > the multi-block filter. The wording also needs to be cleaned up so that it > > is always clear whether the filter that is referenced is a block or the > > multi-block filter. > > > > > > The spec should give more detail on how to choose the number of blocks > > and on false positive rates. The sentence with “11.54 bits for each > > distinct value inserted into the filter” is vague: is this the multi-block > > filter? Why is a 1% false-positive rate “recommended”? > > > > > > I think it is okay to use 0.5% as each block’s false-positive rate, but > > then this should state how to achieve an overall false-positive rate as a > > function of the number of distinct values. > > > > > > This should be more clear about the scope of bloom filters. The only > > reference right now is “The Bloom filter data of a column chunk, …”. If > > each filter is for a column chunk, then this should be stated as a > > requirement. > > > The position of bloom filter headers and bloom filter bit sets is too > > vague: “The Bloom filter data of a column chunk, which contains the size of > > the filter in bytes, the algorithm, the hash function and the Bloom filter > > bitset, is stored near the footer.” > > > > > > This should state that the column chunk offset points to the start of > > the bloom filter header, which immediately precedes the bloom filter bytes. > > The number of bytes must be stored as numBytes in the header. Include > > layout diagrams like the ones in the encryption spec. > > > The position of all bloom filters in relation to the page indexes should > > be clear. Bloom filters should be written just before column indexes? > > > The position of each bloom filter within the bloom filter data section > > should be clear. Are bloom filters for a column located together, or bloom > > filters for a row group? (R1C1, R2C1, R1C2, R2C2 vs R1C1, R1C2, R2C1, R2C2) > > > Does the format allow locating bloom filter data anywhere else other > > than just before the header? Maybe the locations are recommendations and > > not requirements? > > > > > > Should the bloom filters support compression? If the strategy for a > > lower false-positive rate is to under-fill the multi-block filter, then > > would compression help reduce the storage cost? > > > Why is the bloom filter header encrypted? > > > > > > > > > On Wed, Jul 31, 2019 at 10:01 AM 俊杰陈 <[email protected]> wrote: > > >> > > >> Hi Wes, > > >> > > >> Thanks for the reply. It indeed does not have automation support in > > >> the test. Since now we have the repo parquet-testing for the > > >> integration test, I think we can do some automation base on that. I > > >> will take some time to look at this and update the integration plan in > > >> PARQUET-1326. > > >> > > >> Besides that, we can open a thread to discuss building automation > > >> integration test framework for features WIP and in the future. > > >> > > >> On Thu, Aug 1, 2019 at 12:10 AM Ryan Blue <[email protected]> > > wrote: > > >> > > > >> > Wes, > > >> > > > >> > Since the v2 format is unfinished, I don't think anyone should be > > writing > > >> > v2 pages. In fact, I don't think v2 pages would be recommended in the > > v2 > > >> > spec at this point. Do you know why Spark is writing this data? I > > didn't > > >> > know that Spark would write v2 by default. > > >> > > > >> > rb > > >> > > > >> > On Wed, Jul 31, 2019 at 8:28 AM Wes McKinney <[email protected]> > > wrote: > > >> > > > >> > > I'm not sure when I can have a more thorough look at this. > > >> > > > > >> > > To be honest, I'm personally struggling with the burden of > > supporting > > >> > > the existing feature set in the Parquet C++ library. The integration > > >> > > testing strategy for this as well as other features that are being > > >> > > added to both the Java and C++ libraries (e.g. encryption) make me > > >> > > uncomfortable due to the lack of automation. As an example, > > DataPageV2 > > >> > > support in parquet-cpp has been broken since the beginning of the > > >> > > project (see PARQUET-458) but it's only recently become an issue > > when > > >> > > people have been trying to read such files produced by Spark. More > > >> > > comprehensive integration testing would help ensure that the > > libraries > > >> > > remain compatible. > > >> > > > > >> > > On Tue, Jul 30, 2019 at 9:17 PM 俊杰陈 <[email protected]> wrote: > > >> > > > > > >> > > > Dear Parquet developers > > >> > > > > > >> > > > We still need your vote! > > >> > > > > > >> > > > > > >> > > > On Wed, Jul 24, 2019 at 9:30 PM 俊杰陈 <[email protected]> wrote: > > >> > > > > > > >> > > > > Hi @Ryan Blue @Wes McKinney > > >> > > > > > > >> > > > > We need your valuable vote, any feedback is welcome as well. > > >> > > > > > > >> > > > > On Tue, Jul 23, 2019 at 1:24 PM 俊杰陈 <[email protected]> wrote: > > >> > > > > > > > >> > > > > > Call for voting again. > > >> > > > > > > > >> > > > > > On Fri, Jul 19, 2019 at 1:17 PM 俊杰陈 <[email protected]> > > wrote: > > >> > > > > > > > > >> > > > > > > Dear Parquet developers > > >> > > > > > > > > >> > > > > > > We need more votes, please help to vote on this. > > >> > > > > > > > > >> > > > > > > On Wed, Jul 17, 2019 at 3:42 PM Gabor Szadovszky > > >> > > > > > > <[email protected]> wrote: > > >> > > > > > > > > > >> > > > > > > > After getting in PARQUET-1625 I vote again for having > > bloom > > >> > > filter spec and > > >> > > > > > > > the thrift file update as is in parquet-format master. > > >> > > > > > > > +1 (binding) > > >> > > > > > > > > > >> > > > > > > > On Mon, Jul 15, 2019 at 3:23 PM 俊杰陈 <[email protected]> > > wrote: > > >> > > > > > > > > > >> > > > > > > > > Thanks Gabor, It's never too late to make it better. We > > don't > > >> > > have to > > >> > > > > > > > > run it in a hurry, it has been developed for a long > > time yet.:) > > >> > > > > > > > > > > >> > > > > > > > > The thrift file is indeed a bit lag behind the spec. As > > the > > >> > > spec > > >> > > > > > > > > defined, the bloom filter data is stored near the > > footer which > > >> > > means > > >> > > > > > > > > we don't have to handle it like the page. Therefore, I > > just > > >> > > opened a > > >> > > > > > > > > jira to remove bloom_filter_page_header in PageHeader > > >> > > structure, while > > >> > > > > > > > > the BloomFitlerHeader is kept intentionally for > > convenience. > > >> > > Since the > > >> > > > > > > > > spec and the thrift should be aligned with each other > > >> > > eventually, so > > >> > > > > > > > > the vote target is both of them. > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > On Mon, Jul 15, 2019 at 7:48 PM Gabor Szadovszky > > >> > > > > > > > > <[email protected]> wrote: > > >> > > > > > > > > > > > >> > > > > > > > > > Hi Junjie, > > >> > > > > > > > > > > > >> > > > > > > > > > Sorry for bringing up this a bit late but I have some > > >> > > problems with the > > >> > > > > > > > > > format update. The parquet.thrift file is updated to > > have > > >> > > the bloom > > >> > > > > > > > > filters > > >> > > > > > > > > > as a page (just as dictionaries and data pages). > > Meanwhile, > > >> > > the spec > > >> > > > > > > > > > (BloomFilter.md) says that the bloom filter is stored > > near > > >> > > the footer. > > >> > > > > > > > > So, > > >> > > > > > > > > > if the bloom filter is not part of the row-groups > > (like > > >> > > column indexes) I > > >> > > > > > > > > > would not add it as a page. See the struct > > ColumnIndex in > > >> > > the thrift > > >> > > > > > > > > file. > > >> > > > > > > > > > This struct is not referenced anywhere in it only > > declared. > > >> > > It was done > > >> > > > > > > > > > this way because we don't parse it in the same way as > > we > > >> > > parse the pages. > > >> > > > > > > > > > > > >> > > > > > > > > > Currently, I am not 100% sure about the target of > > this vote. > > >> > > If it is a > > >> > > > > > > > > > vote about adding bloom filters in general then it is > > a +1 > > >> > > (binding). If > > >> > > > > > > > > it > > >> > > > > > > > > > is about adding the bloom filters to parquet-format > > as is > > >> > > then, it is a > > >> > > > > > > > > -1 > > >> > > > > > > > > > (binding) until we fix the issue above. > > >> > > > > > > > > > > > >> > > > > > > > > > Regards, > > >> > > > > > > > > > Gabor > > >> > > > > > > > > > > > >> > > > > > > > > > On Mon, Jul 15, 2019 at 11:45 AM Gidon Gershinsky < > > >> > > [email protected]> > > >> > > > > > > > > wrote: > > >> > > > > > > > > > > > >> > > > > > > > > > > +1 (non-binding) > > >> > > > > > > > > > > > > >> > > > > > > > > > > On Mon, Jul 15, 2019 at 12:08 PM Zoltan Ivanfi > > >> > > <[email protected] > > >> > > > > > > > > > > > >> > > > > > > > > > > wrote: > > >> > > > > > > > > > > > > >> > > > > > > > > > > > +1 (binding) > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > On Mon, Jul 15, 2019 at 9:57 AM 俊杰陈 < > > [email protected]> > > >> > > wrote: > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > Dear Parquet developers > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > I'd like to resume this vote, you can start to > > vote > > >> > > now. Thanks for > > >> > > > > > > > > > > your > > >> > > > > > > > > > > > time. > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 < > > >> > > [email protected]> wrote: > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > I see, will resume this next week. Thanks. > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi > > >> > > > > > > > > > > <[email protected]> > > >> > > > > > > > > > > > wrote: > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Hi Junjie, > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Since there are ongoing improvements > > addressing > > >> > > review > > >> > > > > > > > > comments, I > > >> > > > > > > > > > > > would > > >> > > > > > > > > > > > > > > hold off with the vote for a few more days > > until > > >> > > the > > >> > > > > > > > > specification > > >> > > > > > > > > > > > settles. > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Br, > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Zoltan > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 < > > >> > > [email protected]> > > >> > > > > > > > > wrote: > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > Hi Parquet committers and developers > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > We are waiting for your important ballot:) > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 < > > >> > > [email protected]> > > >> > > > > > > > > wrote: > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > Yes, there are some public benchmark > > results, > > >> > > such as the > > >> > > > > > > > > > > > official > > >> > > > > > > > > > > > > > > > > benchmark from xxhash site ( > > >> > > http://www.xxhash.com/) and > > >> > > > > > > > > > > > published > > >> > > > > > > > > > > > > > > > > comparison from smhasher project > > >> > > > > > > > > > > > > > > > > (https://github.com/rurban/smhasher/). > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 5:25 AM Wes > > McKinney < > > >> > > > > > > > > > > [email protected]> > > >> > > > > > > > > > > > wrote: > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > Do you have any benchmark data to > > support > > >> > > the choice of > > >> > > > > > > > > hash > > >> > > > > > > > > > > > function? > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 < > > >> > > [email protected]> > > >> > > > > > > > > > > wrote: > > >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > Dear Parquet developers > > >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > To simplify the voting, I 'd like to > > >> > > update voting > > >> > > > > > > > > content > > >> > > > > > > > > > > > to the > > >> > > > > > > > > > > > > > > > spec > > >> > > > > > > > > > > > > > > > > > > with xxHash hash strategy. Now you > > can > > >> > > reply with +1 > > >> > > > > > > > > or -1. > > >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > Thanks for your participation. > > >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 > > < > > >> > > > > > > > > [email protected]> > > >> > > > > > > > > > > > wrote: > > >> > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > Dear Parquet developers > > >> > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > Parquet Bloom filter has been > > developed > > >> > > for a while, > > >> > > > > > > > > per > > >> > > > > > > > > > > > the > > >> > > > > > > > > > > > > > > > discussion on the mail list, it's time to > > call a > > >> > > vote for > > >> > > > > > > > > spec to > > >> > > > > > > > > > > > move > > >> > > > > > > > > > > > > > > > forward. The current spec can be found at > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > https://github.com/apache/parquet-format/blob/master/BloomFilter.md > > . > > >> > > > > > > > > > > > > > > > There are some different options about the > > >> > > internal hash > > >> > > > > > > > > choice > > >> > > > > > > > > > > of > > >> > > > > > > > > > > > Bloom > > >> > > > > > > > > > > > > > > > filter and the PR is for that concern. > > >> > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > So I 'd like to propose to vote > > the spec > > >> > > + hash > > >> > > > > > > > > option, > > >> > > > > > > > > > > for > > >> > > > > > > > > > > > > > > > example: > > >> > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > +1 to spec and xxHash > > >> > > > > > > > > > > > > > > > > > > > +1 to spec and murmur3 > > >> > > > > > > > > > > > > > > > > > > > ... > > >> > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > Please help to vote, any feedback > > is > > >> > > also welcome in > > >> > > > > > > > > the > > >> > > > > > > > > > > > > > > > discussion thread. > > >> > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > Thanks & Best Regards > > >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > -- > > >> > > > > > > > > > > > > > > > > > > Thanks & Best Regards > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > -- > > >> > > > > > > > > > > > > > > > > Thanks & Best Regards > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > -- > > >> > > > > > > > > > > > > > > > Thanks & Best Regards > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > -- > > >> > > > > > > > > > > > > > Thanks & Best Regards > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > -- > > >> > > > > > > > > > > > > Thanks & Best Regards > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > -- > > >> > > > > > > > > Thanks & Best Regards > > >> > > > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > -- > > >> > > > > > > Thanks & Best Regards > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > -- > > >> > > > > > Thanks & Best Regards > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > -- > > >> > > > > Thanks & Best Regards > > >> > > > > > >> > > > > > >> > > > > > >> > > > -- > > >> > > > Thanks & Best Regards > > >> > > > > >> > > > >> > > > >> > -- > > >> > Ryan Blue > > >> > Software Engineer > > >> > Netflix > > >> > > >> > > >> > > >> -- > > >> Thanks & Best Regards > > > > > > > > > > > > -- > > > Ryan Blue > > > Software Engineer > > > Netflix > > > > > > > > -- > > Thanks & Best Regards > >
-- Thanks & Best Regards
