The main reason for bloom filter header encryption is tamper-proofing it
with AES GCM.
If not done, an attacker can alter the hash/algorithm fields, making the
readers lose relevant data on the one hand, and load/filter irrelevant data
on the other hand.
Bloom filters are encrypted differently from pages (GCM only, no page
ordinal, etc).

On Sun, Aug 4, 2019 at 4:47 PM 俊杰陈 <[email protected]> wrote:

> Many thanks for these useful comments!
>
> I will create a PR to update format according to these comments.
> Reply some of them inline:
>
> The spec should give more detail on how to choose the number of blocks
> and on false positive rates. The sentence with “11.54 bits for each
> distinct value inserted into the filter” is vague: is this the
> multi-block filter? Why is a 1% false-positive rate “recommended”? I
> think it is okay to use 0.5% as each block’s false-positive rate, but
> then this should state how to achieve an overall false-positive rate
> as a function of the number of distinct values.
> [junjie]: IIUC, 0.5% is overall FPR.  The reason to recommend 1% FPR
> is that it can use about 1.2MB bitset to store one million distinct
> values which should satisfy most cases, but I cannot guarantee that
> fit extreme cases such as only one column in a table, so I also
> recommend to use a configuration to set the FPR. In the implementation
> in the bloom-filter branch, I also set default max bloom filter size
> to 1MB.
>
> Should the bloom filters support compression?
> [junjie], The compression should help as you mentioned. I didn't add
> it because I think the size of a bloom filter should be small for a
> column chunk. Also, we may skip building bloom filter if there is an
> existing dictionary.
>
> Why is the bloom filter header encrypted?
> In a discussion with Gidon, we 'd like to follow the current way when
> encrypting pages.
>
>
> On Sun, Aug 4, 2019 at 4:42 AM Ryan Blue <[email protected]> wrote:
> >
> > Right now, I’m voting -1 on the spec as written because I don’t think
> that it is clear enough to be correctly implemented from just the spec.
> I’ll change that with a few corrections.
> >
> > Here are the items I think need to be fixed before I’ll change my vote
> to support this spec:
> >
> > The algorithm should be more clear:
> >
> > How is the bloom filter block selected from the 32 most-significant bits
> from of the hash function? These details must be in the spec and not in
> papers linked from the spec.
> > How is the number of blocks determined? From the overall filter size?
> > I think that the exact procedure for a lookup in each block should be
> covered in a section, followed by a section for how to perform a look up in
> the multi-block filter. The wording also needs to be cleaned up so that it
> is always clear whether the filter that is referenced is a block or the
> multi-block filter.
> >
> > The spec should give more detail on how to choose the number of blocks
> and on false positive rates. The sentence with “11.54 bits for each
> distinct value inserted into the filter” is vague: is this the multi-block
> filter? Why is a 1% false-positive rate “recommended”?
> >
> > I think it is okay to use 0.5% as each block’s false-positive rate, but
> then this should state how to achieve an overall false-positive rate as a
> function of the number of distinct values.
> >
> > This should be more clear about the scope of bloom filters. The only
> reference right now is “The Bloom filter data of a column chunk, …”. If
> each filter is for a column chunk, then this should be stated as a
> requirement.
> > The position of bloom filter headers and bloom filter bit sets is too
> vague: “The Bloom filter data of a column chunk, which contains the size of
> the filter in bytes, the algorithm, the hash function and the Bloom filter
> bitset, is stored near the footer.”
> >
> > This should state that the column chunk offset points to the start of
> the bloom filter header, which immediately precedes the bloom filter bytes.
> The number of bytes must be stored as numBytes in the header. Include
> layout diagrams like the ones in the encryption spec.
> > The position of all bloom filters in relation to the page indexes should
> be clear. Bloom filters should be written just before column indexes?
> > The position of each bloom filter within the bloom filter data section
> should be clear. Are bloom filters for a column located together, or bloom
> filters for a row group? (R1C1, R2C1, R1C2, R2C2 vs R1C1, R1C2, R2C1, R2C2)
> > Does the format allow locating bloom filter data anywhere else other
> than just before the header? Maybe the locations are recommendations and
> not requirements?
> >
> > Should the bloom filters support compression? If the strategy for a
> lower false-positive rate is to under-fill the multi-block filter, then
> would compression help reduce the storage cost?
> > Why is the bloom filter header encrypted?
> >
> >
> > On Wed, Jul 31, 2019 at 10:01 AM 俊杰陈 <[email protected]> wrote:
> >>
> >> Hi Wes,
> >>
> >> Thanks for the reply. It indeed does not have automation support in
> >> the test. Since now we have the repo parquet-testing for the
> >> integration test, I think we can do some automation base on that. I
> >> will take some time to look at this and update the integration plan in
> >> PARQUET-1326.
> >>
> >> Besides that, we can open a thread to discuss building automation
> >> integration test framework for features WIP and in the future.
> >>
> >> On Thu, Aug 1, 2019 at 12:10 AM Ryan Blue <[email protected]>
> wrote:
> >> >
> >> > Wes,
> >> >
> >> > Since the v2 format is unfinished, I don't think anyone should be
> writing
> >> > v2 pages. In fact, I don't think v2 pages would be recommended in the
> v2
> >> > spec at this point. Do you know why Spark is writing this data? I
> didn't
> >> > know that Spark would write v2 by default.
> >> >
> >> > rb
> >> >
> >> > On Wed, Jul 31, 2019 at 8:28 AM Wes McKinney <[email protected]>
> wrote:
> >> >
> >> > > I'm not sure when I can have a more thorough look at this.
> >> > >
> >> > > To be honest, I'm personally struggling with the burden of
> supporting
> >> > > the existing feature set in the Parquet C++ library. The integration
> >> > > testing strategy for this as well as other features that are being
> >> > > added to both the Java and C++ libraries (e.g. encryption) make me
> >> > > uncomfortable due to the lack of automation. As an example,
> DataPageV2
> >> > > support in parquet-cpp has been broken since the beginning of the
> >> > > project (see PARQUET-458) but it's only recently become an issue
> when
> >> > > people have been trying to read such files produced by Spark. More
> >> > > comprehensive integration testing would help ensure that the
> libraries
> >> > > remain compatible.
> >> > >
> >> > > On Tue, Jul 30, 2019 at 9:17 PM 俊杰陈 <[email protected]> wrote:
> >> > > >
> >> > > > Dear Parquet developers
> >> > > >
> >> > > > We still need your vote!
> >> > > >
> >> > > >
> >> > > > On Wed, Jul 24, 2019 at 9:30 PM 俊杰陈 <[email protected]> wrote:
> >> > > > >
> >> > > > > Hi @Ryan Blue  @Wes McKinney
> >> > > > >
> >> > > > > We need your valuable vote, any feedback is welcome as well.
> >> > > > >
> >> > > > > On Tue, Jul 23, 2019 at 1:24 PM 俊杰陈 <[email protected]> wrote:
> >> > > > > >
> >> > > > > > Call for voting again.
> >> > > > > >
> >> > > > > > On Fri, Jul 19, 2019 at 1:17 PM 俊杰陈 <[email protected]>
> wrote:
> >> > > > > > >
> >> > > > > > > Dear Parquet developers
> >> > > > > > >
> >> > > > > > > We need more votes, please help to vote on this.
> >> > > > > > >
> >> > > > > > > On Wed, Jul 17, 2019 at 3:42 PM Gabor Szadovszky
> >> > > > > > > <[email protected]> wrote:
> >> > > > > > > >
> >> > > > > > > > After getting in PARQUET-1625 I vote again for having
> bloom
> >> > > filter spec and
> >> > > > > > > > the thrift file update as is in parquet-format master.
> >> > > > > > > > +1 (binding)
> >> > > > > > > >
> >> > > > > > > > On Mon, Jul 15, 2019 at 3:23 PM 俊杰陈 <[email protected]>
> wrote:
> >> > > > > > > >
> >> > > > > > > > > Thanks Gabor, It's never too late to make it better. We
> don't
> >> > > have to
> >> > > > > > > > > run it in a hurry, it has been developed for a long
> time yet.:)
> >> > > > > > > > >
> >> > > > > > > > > The thrift file is indeed a bit lag behind the spec. As
> the
> >> > > spec
> >> > > > > > > > > defined, the bloom filter data is stored near the
> footer which
> >> > > means
> >> > > > > > > > > we don't have to handle it like the page. Therefore, I
> just
> >> > > opened a
> >> > > > > > > > > jira to remove bloom_filter_page_header in PageHeader
> >> > > structure, while
> >> > > > > > > > > the BloomFitlerHeader is kept intentionally for
> convenience.
> >> > > Since the
> >> > > > > > > > > spec and the thrift should be aligned with each other
> >> > > eventually, so
> >> > > > > > > > > the vote target is both of them.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > On Mon, Jul 15, 2019 at 7:48 PM Gabor Szadovszky
> >> > > > > > > > > <[email protected]> wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > Hi Junjie,
> >> > > > > > > > > >
> >> > > > > > > > > > Sorry for bringing up this a bit late but I have some
> >> > > problems with the
> >> > > > > > > > > > format update. The parquet.thrift file is updated to
> have
> >> > > the bloom
> >> > > > > > > > > filters
> >> > > > > > > > > > as a page (just as dictionaries and data pages).
> Meanwhile,
> >> > > the spec
> >> > > > > > > > > > (BloomFilter.md) says that the bloom filter is stored
> near
> >> > > the footer.
> >> > > > > > > > > So,
> >> > > > > > > > > > if the bloom filter is not part of the row-groups
> (like
> >> > > column indexes) I
> >> > > > > > > > > > would not add it as a page. See the struct
> ColumnIndex in
> >> > > the thrift
> >> > > > > > > > > file.
> >> > > > > > > > > > This struct is not referenced anywhere in it only
> declared.
> >> > > It was done
> >> > > > > > > > > > this way because we don't parse it in the same way as
> we
> >> > > parse the pages.
> >> > > > > > > > > >
> >> > > > > > > > > > Currently, I am not 100% sure about the target of
> this vote.
> >> > > If it is a
> >> > > > > > > > > > vote about adding bloom filters in general then it is
> a +1
> >> > > (binding). If
> >> > > > > > > > > it
> >> > > > > > > > > > is about adding the bloom filters to parquet-format
> as is
> >> > > then, it is a
> >> > > > > > > > > -1
> >> > > > > > > > > > (binding) until we fix the issue above.
> >> > > > > > > > > >
> >> > > > > > > > > > Regards,
> >> > > > > > > > > > Gabor
> >> > > > > > > > > >
> >> > > > > > > > > > On Mon, Jul 15, 2019 at 11:45 AM Gidon Gershinsky <
> >> > > [email protected]>
> >> > > > > > > > > wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > > +1 (non-binding)
> >> > > > > > > > > > >
> >> > > > > > > > > > > On Mon, Jul 15, 2019 at 12:08 PM Zoltan Ivanfi
> >> > > <[email protected]
> >> > > > > > > > > >
> >> > > > > > > > > > > wrote:
> >> > > > > > > > > > >
> >> > > > > > > > > > > > +1 (binding)
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > On Mon, Jul 15, 2019 at 9:57 AM 俊杰陈 <
> [email protected]>
> >> > > wrote:
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Dear Parquet developers
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > I'd like to resume this vote, you can start to
> vote
> >> > > now. Thanks for
> >> > > > > > > > > > > your
> >> > > > > > > > > > > > time.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 <
> >> > > [email protected]> wrote:
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > I see, will resume this next week.  Thanks.
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi
> >> > > > > > > > > > > <[email protected]>
> >> > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Hi Junjie,
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Since there are ongoing improvements
> addressing
> >> > > review
> >> > > > > > > > > comments, I
> >> > > > > > > > > > > > would
> >> > > > > > > > > > > > > > > hold off with the vote for a few more days
> until
> >> > > the
> >> > > > > > > > > specification
> >> > > > > > > > > > > > settles.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Br,
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Zoltan
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <
> >> > > [email protected]>
> >> > > > > > > > > wrote:
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > Hi Parquet committers and developers
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > We are waiting for your important ballot:)
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <
> >> > > [email protected]>
> >> > > > > > > > > wrote:
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > Yes, there are some public benchmark
> results,
> >> > > such as the
> >> > > > > > > > > > > > official
> >> > > > > > > > > > > > > > > > > benchmark from xxhash site (
> >> > > http://www.xxhash.com/) and
> >> > > > > > > > > > > > published
> >> > > > > > > > > > > > > > > > > comparison from smhasher project
> >> > > > > > > > > > > > > > > > > (https://github.com/rurban/smhasher/).
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 5:25 AM Wes
> McKinney <
> >> > > > > > > > > > > [email protected]>
> >> > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > Do you have any benchmark data to
> support
> >> > > the choice of
> >> > > > > > > > > hash
> >> > > > > > > > > > > > function?
> >> > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <
> >> > > [email protected]>
> >> > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > Dear Parquet developers
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > To simplify the voting, I 'd like to
> >> > > update voting
> >> > > > > > > > > content
> >> > > > > > > > > > > > to the
> >> > > > > > > > > > > > > > > > spec
> >> > > > > > > > > > > > > > > > > > > with xxHash hash strategy. Now you
> can
> >> > > reply with +1
> >> > > > > > > > > or -1.
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > Thanks for your participation.
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈
> <
> >> > > > > > > > > [email protected]>
> >> > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > Dear Parquet developers
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > Parquet Bloom filter has been
> developed
> >> > > for a while,
> >> > > > > > > > > per
> >> > > > > > > > > > > > the
> >> > > > > > > > > > > > > > > > discussion on the mail list, it's time to
> call a
> >> > > vote for
> >> > > > > > > > > spec to
> >> > > > > > > > > > > > move
> >> > > > > > > > > > > > > > > > forward. The current spec can be found at
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > https://github.com/apache/parquet-format/blob/master/BloomFilter.md
> .
> >> > > > > > > > > > > > > > > > There are some different options about the
> >> > > internal hash
> >> > > > > > > > > choice
> >> > > > > > > > > > > of
> >> > > > > > > > > > > > Bloom
> >> > > > > > > > > > > > > > > > filter and the PR is for that concern.
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > So I 'd like to propose to vote
> the spec
> >> > > + hash
> >> > > > > > > > > option,
> >> > > > > > > > > > > for
> >> > > > > > > > > > > > > > > > example:
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > +1 to spec and xxHash
> >> > > > > > > > > > > > > > > > > > > > +1 to spec and murmur3
> >> > > > > > > > > > > > > > > > > > > > ...
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > Please help to vote, any feedback
> is
> >> > > also welcome in
> >> > > > > > > > > the
> >> > > > > > > > > > > > > > > > discussion thread.
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > Thanks & Best Regards
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > --
> >> > > > > > > > > > > > > > > > > > > Thanks & Best Regards
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > --
> >> > > > > > > > > > > > > > > > > Thanks & Best Regards
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > --
> >> > > > > > > > > > > > > > > > Thanks & Best Regards
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > --
> >> > > > > > > > > > > > > > Thanks & Best Regards
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > --
> >> > > > > > > > > > > > > Thanks & Best Regards
> >> > > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > --
> >> > > > > > > > > Thanks & Best Regards
> >> > > > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > --
> >> > > > > > > Thanks & Best Regards
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > --
> >> > > > > > Thanks & Best Regards
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > --
> >> > > > > Thanks & Best Regards
> >> > > >
> >> > > >
> >> > > >
> >> > > > --
> >> > > > Thanks & Best Regards
> >> > >
> >> >
> >> >
> >> > --
> >> > Ryan Blue
> >> > Software Engineer
> >> > Netflix
> >>
> >>
> >>
> >> --
> >> Thanks & Best Regards
> >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
>
>
> --
> Thanks & Best Regards
>

Reply via email to