Re: [VOTE] Parquet Bloom filter spec sign-off

俊杰陈 Sun, 04 Aug 2019 08:11:09 -0700

Thanks, Gidon, for correcting me.

On Sun, Aug 4, 2019 at 10:37 PM Gidon Gershinsky <[email protected]> wrote:
>
> The main reason for bloom filter header encryption is tamper-proofing it
> with AES GCM.
> If not done, an attacker can alter the hash/algorithm fields, making the
> readers lose relevant data on the one hand, and load/filter irrelevant data
> on the other hand.
> Bloom filters are encrypted differently from pages (GCM only, no page
> ordinal, etc).
>
> On Sun, Aug 4, 2019 at 4:47 PM 俊杰陈 <[email protected]> wrote:
>
> > Many thanks for these useful comments!
> >
> > I will create a PR to update the format according to these comments.
> > Replies to some of them are inline:
> >
> > The spec should give more detail on how to choose the number of blocks
> > and on false positive rates. The sentence with “11.54 bits for each
> > distinct value inserted into the filter” is vague: is this the
> > multi-block filter? Why is a 1% false-positive rate “recommended”? I
> > think it is okay to use 0.5% as each block’s false-positive rate, but
> > then this should state how to achieve an overall false-positive rate
> > as a function of the number of distinct values.
> > [junjie]: IIUC, 0.5% is the overall FPR. The reason for recommending a
> > 1% FPR is that a bitset of about 1.2MB can then store one million
> > distinct values, which should satisfy most cases. I cannot guarantee
> > that this fits extreme cases, such as a table with only one column, so
> > I also recommend making the FPR configurable. In the implementation in
> > the bloom-filter branch, I also set the default maximum bloom filter
> > size to 1MB.
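As a rough check of the sizing figure above, here is a minimal sketch that assumes the classic (non-blocked) Bloom filter formula m = -n * ln(p) / (ln 2)^2 with an optimal number of hash functions. The function names are hypothetical and are not taken from the spec or the bloom-filter branch; a block-based filter would typically need somewhat more space for the same FPR.

```python
import math


def classic_bloom_bits(ndv: int, fpr: float) -> int:
    """Bits needed by a classic Bloom filter for `ndv` distinct values.

    Assumes the textbook formula m = -n * ln(p) / (ln 2)^2, which holds
    when the number of hash functions is chosen optimally.
    """
    if ndv <= 0:
        raise ValueError("ndv must be positive")
    if not 0.0 < fpr < 1.0:
        raise ValueError("fpr must be strictly between 0 and 1")
    return math.ceil(-ndv * math.log(fpr) / (math.log(2) ** 2))


def classic_bloom_bytes(ndv: int, fpr: float) -> int:
    """Same estimate as classic_bloom_bits, rounded up to whole bytes."""
    return math.ceil(classic_bloom_bits(ndv, fpr) / 8)


if __name__ == "__main__":
    # One million distinct values at a 1% false-positive rate.
    print(classic_bloom_bytes(1_000_000, 0.01))
```

Under that assumption, one million distinct values at a 1% FPR come out close to the 1.2MB mentioned above.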
> >
> > Should the bloom filters support compression?
> > [junjie]: Compression should help, as you mentioned. I didn't add it
> > because I think a bloom filter for a column chunk should be small.
> > Also, we may skip building a bloom filter if there is an existing
> > dictionary.
> >
> > Why is the bloom filter header encrypted?
> > [junjie]: After discussing with Gidon, we'd like to follow the approach
> > currently used for encrypting pages.
> >
> >
> > On Sun, Aug 4, 2019 at 4:42 AM Ryan Blue <[email protected]> wrote:
> > >
> > > Right now, I’m voting -1 on the spec as written because I don’t think
> > > that it is clear enough to be correctly implemented from just the spec.
> > > I’ll change that with a few corrections.
> > >
> > > Here are the items I think need to be fixed before I’ll change my vote
> > > to support this spec:
> > >
> > > The algorithm should be more clear:
> > >
> > > How is the bloom filter block selected from the 32 most-significant bits
> > > of the hash function? These details must be in the spec and not in
> > > papers linked from the spec.
> > > How is the number of blocks determined? From the overall filter size?
> > > I think that the exact procedure for a lookup in each block should be
> > > covered in a section, followed by a section for how to perform a lookup in
> > > the multi-block filter. The wording also needs to be cleaned up so that it
> > > is always clear whether the filter that is referenced is a block or the
> > > multi-block filter.
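One possible answer to the block-selection question, sketched under assumptions: treat the upper 32 bits of a 64-bit hash as a fraction of 2^32 and scale it by the number of blocks (a multiply-shift mapping). Both the function name and the choice of mapping are assumptions for illustration; nothing in this thread confirms that this is the mapping the spec intends.

```python
def select_block(hash64: int, num_blocks: int) -> int:
    """Map a 64-bit hash to a block index in [0, num_blocks).

    Hypothetical mapping: take the 32 most-significant bits of the hash
    and scale them into the block range with a multiply and a shift,
    which avoids a modulo and works for any positive block count.
    """
    if num_blocks <= 0:
        raise ValueError("num_blocks must be positive")
    upper32 = (hash64 >> 32) & 0xFFFFFFFF
    return (upper32 * num_blocks) >> 32


if __name__ == "__main__":
    # A hash whose top bit is set lands in the middle of the block range.
    print(select_block(0x8000000000000000, 8))
```

With this kind of mapping the number of blocks follows directly from the overall filter size divided by the block size, so the two questions above would be answered together.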
> > >
> > > The spec should give more detail on how to choose the number of blocks
> > > and on false positive rates. The sentence with “11.54 bits for each
> > > distinct value inserted into the filter” is vague: is this the multi-block
> > > filter? Why is a 1% false-positive rate “recommended”?
> > >
> > > I think it is okay to use 0.5% as each block’s false-positive rate, but
> > > then this should state how to achieve an overall false-positive rate as a
> > > function of the number of distinct values.
> > >
> > > This should be more clear about the scope of bloom filters. The only
> > > reference right now is “The Bloom filter data of a column chunk, …”. If
> > > each filter is for a column chunk, then this should be stated as a
> > > requirement.
> > > The position of bloom filter headers and bloom filter bit sets is too
> > > vague: “The Bloom filter data of a column chunk, which contains the size of
> > > the filter in bytes, the algorithm, the hash function and the Bloom filter
> > > bitset, is stored near the footer.”
> > >
> > > This should state that the column chunk offset points to the start of
> > > the bloom filter header, which immediately precedes the bloom filter bytes.
> > > The number of bytes must be stored as numBytes in the header. Include
> > > layout diagrams like the ones in the encryption spec.
> > > The position of all bloom filters in relation to the page indexes should
> > > be clear. Bloom filters should be written just before column indexes?
> > > The position of each bloom filter within the bloom filter data section
> > > should be clear. Are bloom filters for a column located together, or bloom
> > > filters for a row group? (R1C1, R2C1, R1C2, R2C2 vs R1C1, R1C2, R2C1, R2C2)
> > > Does the format allow locating bloom filter data anywhere else other
> > > than just before the header? Maybe the locations are recommendations and
> > > not requirements?
> > >
> > > Should the bloom filters support compression? If the strategy for a
> > > lower false-positive rate is to under-fill the multi-block filter, then
> > > would compression help reduce the storage cost?
> > > Why is the bloom filter header encrypted?
> > >
> > >
> > > On Wed, Jul 31, 2019 at 10:01 AM 俊杰陈 <[email protected]> wrote:
> > >>
> > >> Hi Wes,
> > >>
> > >> Thanks for the reply. The tests indeed have no automation support.
> > >> Since we now have the parquet-testing repo for integration tests, I
> > >> think we can build some automation based on that. I will take some
> > >> time to look at this and update the integration plan in
> > >> PARQUET-1326.
> > >>
> > >> Besides that, we can open a thread to discuss building an automated
> > >> integration test framework for features that are WIP and for future ones.
> > >>
> > >> On Thu, Aug 1, 2019 at 12:10 AM Ryan Blue <[email protected]>
> > wrote:
> > >> >
> > >> > Wes,
> > >> >
> > >> > Since the v2 format is unfinished, I don't think anyone should be
> > >> > writing v2 pages. In fact, I don't think v2 pages would be recommended
> > >> > in the v2 spec at this point. Do you know why Spark is writing this
> > >> > data? I didn't know that Spark would write v2 by default.
> > >> >
> > >> > rb
> > >> >
> > >> > On Wed, Jul 31, 2019 at 8:28 AM Wes McKinney <[email protected]>
> > wrote:
> > >> >
> > >> > > I'm not sure when I can have a more thorough look at this.
> > >> > >
> > >> > > To be honest, I'm personally struggling with the burden of
> > >> > > supporting the existing feature set in the Parquet C++ library. The
> > >> > > integration testing strategy for this as well as other features that
> > >> > > are being added to both the Java and C++ libraries (e.g. encryption)
> > >> > > makes me uncomfortable due to the lack of automation. As an example,
> > >> > > DataPageV2 support in parquet-cpp has been broken since the beginning
> > >> > > of the project (see PARQUET-458) but it's only recently become an
> > >> > > issue when people have been trying to read such files produced by
> > >> > > Spark. More comprehensive integration testing would help ensure that
> > >> > > the libraries remain compatible.
> > >> > >
> > >> > > On Tue, Jul 30, 2019 at 9:17 PM 俊杰陈 <[email protected]> wrote:
> > >> > > >
> > >> > > > Dear Parquet developers
> > >> > > >
> > >> > > > We still need your vote!
> > >> > > >
> > >> > > >
> > >> > > > On Wed, Jul 24, 2019 at 9:30 PM 俊杰陈 <[email protected]> wrote:
> > >> > > > >
> > >> > > > > Hi @Ryan Blue  @Wes McKinney
> > >> > > > >
> > >> > > > > We need your valuable votes; any feedback is welcome as well.
> > >> > > > >
> > >> > > > > On Tue, Jul 23, 2019 at 1:24 PM 俊杰陈 <[email protected]> wrote:
> > >> > > > > >
> > >> > > > > > Call for voting again.
> > >> > > > > >
> > >> > > > > > On Fri, Jul 19, 2019 at 1:17 PM 俊杰陈 <[email protected]>
> > wrote:
> > >> > > > > > >
> > >> > > > > > > Dear Parquet developers
> > >> > > > > > >
> > >> > > > > > > We need more votes; please help by voting on this.
> > >> > > > > > >
> > >> > > > > > > On Wed, Jul 17, 2019 at 3:42 PM Gabor Szadovszky
> > >> > > > > > > <[email protected]> wrote:
> > >> > > > > > > >
> > >> > > > > > > > After getting in PARQUET-1625, I vote again for having the bloom
> > >> > > > > > > > filter spec and the thrift file update as is in parquet-format
> > >> > > > > > > > master.
> > >> > > > > > > > +1 (binding)
> > >> > > > > > > >
> > >> > > > > > > > On Mon, Jul 15, 2019 at 3:23 PM 俊杰陈 <[email protected]>
> > wrote:
> > >> > > > > > > >
> > >> > > > > > > > > Thanks Gabor. It's never too late to make it better. We don't
> > >> > > > > > > > > have to rush it; it has already been in development for a
> > >> > > > > > > > > long time. :)
> > >> > > > > > > > >
> > >> > > > > > > > > The thrift file indeed lags a bit behind the spec. As the spec
> > >> > > > > > > > > defines, the bloom filter data is stored near the footer, which
> > >> > > > > > > > > means we don't have to handle it like a page. Therefore, I just
> > >> > > > > > > > > opened a jira to remove bloom_filter_page_header from the
> > >> > > > > > > > > PageHeader structure, while the BloomFilterHeader is kept
> > >> > > > > > > > > intentionally for convenience. Since the spec and the thrift
> > >> > > > > > > > > file should eventually be aligned with each other, the vote
> > >> > > > > > > > > target is both of them.
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > On Mon, Jul 15, 2019 at 7:48 PM Gabor Szadovszky
> > >> > > > > > > > > <[email protected]> wrote:
> > >> > > > > > > > > >
> > >> > > > > > > > > > Hi Junjie,
> > >> > > > > > > > > >
> > >> > > > > > > > > > Sorry for bringing this up a bit late, but I have some problems
> > >> > > > > > > > > > with the format update. The parquet.thrift file is updated to
> > >> > > > > > > > > > have the bloom filters as a page (just as dictionaries and data
> > >> > > > > > > > > > pages). Meanwhile, the spec (BloomFilter.md) says that the bloom
> > >> > > > > > > > > > filter is stored near the footer. So, if the bloom filter is not
> > >> > > > > > > > > > part of the row-groups (like column indexes), I would not add it
> > >> > > > > > > > > > as a page. See the struct ColumnIndex in the thrift file. This
> > >> > > > > > > > > > struct is not referenced anywhere in it, only declared. It was
> > >> > > > > > > > > > done this way because we don't parse it in the same way as we
> > >> > > > > > > > > > parse the pages.
> > >> > > > > > > > > >
> > >> > > > > > > > > > Currently, I am not 100% sure about the target of this vote. If
> > >> > > > > > > > > > it is a vote about adding bloom filters in general, then it is a
> > >> > > > > > > > > > +1 (binding). If it is about adding the bloom filters to
> > >> > > > > > > > > > parquet-format as is, then it is a -1 (binding) until we fix the
> > >> > > > > > > > > > issue above.
> > >> > > > > > > > > >
> > >> > > > > > > > > > Regards,
> > >> > > > > > > > > > Gabor
> > >> > > > > > > > > >
> > >> > > > > > > > > > On Mon, Jul 15, 2019 at 11:45 AM Gidon Gershinsky <
> > >> > > [email protected]>
> > >> > > > > > > > > wrote:
> > >> > > > > > > > > >
> > >> > > > > > > > > > > +1 (non-binding)
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > On Mon, Jul 15, 2019 at 12:08 PM Zoltan Ivanfi
> > >> > > <[email protected]
> > >> > > > > > > > > >
> > >> > > > > > > > > > > wrote:
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > > +1 (binding)
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > On Mon, Jul 15, 2019 at 9:57 AM 俊杰陈 <
> > [email protected]>
> > >> > > wrote:
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > Dear Parquet developers
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > I'd like to resume this vote; you can start voting now. Thanks
> > >> > > > > > > > > > > > > for your time.
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 <
> > >> > > [email protected]> wrote:
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > I see, will resume this next week.  Thanks.
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi
> > >> > > > > > > > > > > <[email protected]>
> > >> > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Hi Junjie,
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Since there are ongoing improvements addressing review
> > >> > > > > > > > > > > > > > > comments, I would hold off with the vote for a few more
> > >> > > > > > > > > > > > > > > days until the specification settles.
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Br,
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Zoltan
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <
> > >> > > [email protected]>
> > >> > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > Hi Parquet committers and developers
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > We are waiting for your important ballot:)
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <
> > >> > > [email protected]>
> > >> > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > Yes, there are some public benchmark results, such as the
> > >> > > > > > > > > > > > > > > > > official benchmark from the xxhash site
> > >> > > > > > > > > > > > > > > > > (http://www.xxhash.com/) and the published comparison from
> > >> > > > > > > > > > > > > > > > > the smhasher project (https://github.com/rurban/smhasher/).
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 5:25 AM Wes
> > McKinney <
> > >> > > > > > > > > > > [email protected]>
> > >> > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > Do you have any benchmark data to support the choice of
> > >> > > > > > > > > > > > > > > > > > hash function?
> > >> > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <
> > >> > > [email protected]>
> > >> > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > Dear Parquet developers
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > To simplify the voting, I'd like to update the voting
> > >> > > > > > > > > > > > > > > > > > > content to the spec with the xxHash hash strategy. Now
> > >> > > > > > > > > > > > > > > > > > > you can reply with +1 or -1.
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > Thanks for your participation.
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈
> > <
> > >> > > > > > > > > [email protected]>
> > >> > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > Dear Parquet developers
> > >> > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > The Parquet Bloom filter has been in development for a
> > >> > > > > > > > > > > > > > > > > > > > while; per the discussion on the mailing list, it's time
> > >> > > > > > > > > > > > > > > > > > > > to call a vote for the spec to move forward. The current
> > >> > > > > > > > > > > > > > > > > > > > spec can be found at
> > >> > > > > > > > > > > > > > > > > > > > https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
> > >> > > > > > > > > > > > > > > > > > > > There are some different options for the internal hash
> > >> > > > > > > > > > > > > > > > > > > > choice of the Bloom filter, and the PR is for that concern.
> > >> > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > So I'd like to propose voting on the spec + hash option,
> > >> > > > > > > > > > > > > > > > > > > > for example:
> > >> > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > +1 to spec and xxHash
> > >> > > > > > > > > > > > > > > > > > > > +1 to spec and murmur3
> > >> > > > > > > > > > > > > > > > > > > > ...
> > >> > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > Please help by voting; any feedback is also welcome in
> > >> > > > > > > > > > > > > > > > > > > > the discussion thread.
> > >> > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > Thanks & Best Regards
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > --
> > >> > > > > > > > > > > > > > > > > > > Thanks & Best Regards
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > --
> > >> > > > > > > > > > > > > > > > > Thanks & Best Regards
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > --
> > >> > > > > > > > > > > > > > > > Thanks & Best Regards
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > --
> > >> > > > > > > > > > > > > > Thanks & Best Regards
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > --
> > >> > > > > > > > > > > > > Thanks & Best Regards
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > --
> > >> > > > > > > > > Thanks & Best Regards
> > >> > > > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > --
> > >> > > > > > > Thanks & Best Regards
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > --
> > >> > > > > > Thanks & Best Regards
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > --
> > >> > > > > Thanks & Best Regards
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > --
> > >> > > > Thanks & Best Regards
> > >> > >
> > >> >
> > >> >
> > >> > --
> > >> > Ryan Blue
> > >> > Software Engineer
> > >> > Netflix
> > >>
> > >>
> > >>
> > >> --
> > >> Thanks & Best Regards
> > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> >
> >
> >
> > --
> > Thanks & Best Regards
> >




-- 
Thanks & Best Regards
