Hi Antoine,

Thanks. Replying inline to the discussion threads.


> > https://github.com/apache/parquet-testing/pull/100/files.
> > You might have to checkout that branch to be able to run the benchmarks.
>

Yeah, just cherry-picking the branch should work. Let me know if it ends
up being some kind of permission issue.


>
> Hmm, I'll take a look at that later and come back.
>
> > * the encoding of integers uses a custom framing with frame-of-reference
> >> encoding inside it, but Parquet implementations already implement
> >> DELTA_BINARY_PACKED which should have similar characteristics, so why
> >> not reuse that?
>
> Thanks. At worse it seems it would need one more bit per integer than
> the proposed FOR scheme (it might need less if delta encoding allows
> reducing entropy among the encoded integers). I'm not sure how that
> makes it "fail".
>


My bad. I wanted to add another point.
1) So I looked at a few other examples and in most cases FOR ended up
    using fewer bits per value. My thinking is that for 1M (or 1B) values,
    it will add up.
2) Another major point was that the bit-unpacker + FOR was much, much
    faster than the DeltaBitPack decoder. BitUnpacker + FOR is easily
    SIMDable, but DeltaBitUnpack is not so much (rough sketch below). I
    vaguely remember the difference being around 2x. I can try to compute
    the numbers again, but today the entire bottleneck in the decoder
    shows up in bit-unpacking.
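
To illustrate point 2, here is a minimal scalar sketch (hypothetical
helper signatures, not actual Parquet code) of why FOR decoding maps
more naturally onto SIMD than delta decoding: FOR has no loop-carried
dependency, while delta decoding is a prefix sum.

    #include <cstddef>
    #include <cstdint>

    // FOR: each output depends only on its own bit-unpacked value plus a
    // shared base, so the loop auto-vectorizes across SIMD lanes.
    void decode_for(const int64_t* unpacked, size_t n, int64_t base,
                    int64_t* out) {
      for (size_t i = 0; i < n; ++i) {
        out[i] = base + unpacked[i];
      }
    }

    // Delta: each output depends on the previous one (a running prefix
    // sum), a loop-carried dependency that is much harder to vectorize.
    void decode_delta(const int64_t* deltas, size_t n, int64_t first,
                      int64_t* out) {
      int64_t acc = first;
      for (size_t i = 0; i < n; ++i) {
        acc += deltas[i];
        out[i] = acc;
      }
    }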


> >> * there are a lot of fields in the headers that look a bit superfluous
> >> (though of course those bits are relatively cheap); for example, why
> >> have a format "version" while we could define a new encoding for
> >> incompatible evolutions?
> >
> > We discussed this point in the Spec
> > <
> https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit>
> > document a lot and have gravitated
> > towards a versioning scheme for easier evolution.
>

> * Meta-answer:
>
> I don't see any such discussion in the comments and the document doesn't
> state any rationale for it.
>
> As a reference, Python PEPs include a discussion of rejected
> alternatives so that people don't ask the same questions over and over
> (see https://peps.python.org/pep-0810/#alternate-implementation-ideas
> for an example).
>
> * Actual answer:
>
> My problem with this is that it will complicate understanding and
> communicating the feature set supported by each Parquet implementation.
> "ALP" will not be a single spec but an evolving one with its own version
> numbers. I'm not sure why that is better than adding a "ALP2" if we even
> want to evolve the spec.
>
> It will also complicate APIs that currently accept encoding numbers,
> such as
>
> https://github.com/apache/arrow/blob/5272a68c134deea82040f2f29bb6257ad7b52be0/cpp/src/parquet/properties.h#L221
>
> We need a clear explanation of what makes ALP so special that it *needs*
> its own versioning scheme. Otherwise we should remove it IMHO.
>

Based on the feedback I think it is more of a clarification/API
complication discussion. I was thinking along these lines.

Things that I thought would have to change, but for which we added enough
flexibility in the metadata structs:
1) Addition of AlpRd for real doubles (if that takes time to get in).
    (This is easily addable with the provided AlgorithmEnum.)
2) Addition of AlpRd + modified BSS (suggested by Azim). (This is easily
    addable with the provided AlgorithmEnum.)
3) Addition of a different encoding. (This is easily addable with the
    provided AlgorithmEnum.)

Things that I think might need a version field:
1) Right now the interleaved FOR and ALP structs store a certain set of
    fields. Given the many incoming suggestions, my thinking was that
    having a version will allow us to change them easily, with minor
    changes to the code in all writers/readers (maybe that is a very big
    task). See the sketch after this list.
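
To make the distinction concrete, here is a purely hypothetical C++-style
sketch (not the actual spec structs; the names and fields are made up) of
the two extension points: the AlgorithmEnum covers adding new schemes,
while a version field would cover changing the layout of the existing
interleaved FOR/ALP structs.

    #include <cstdint>

    enum class AlgorithmEnum : uint8_t {  // hypothetical values
      ALP = 0,
      ALP_RD = 1,  // new schemes slot in here without touching old readers
      // ...
    };

    struct AlpVectorMeta {   // hypothetical layout
      uint8_t version;       // lets the FOR/ALP field set below evolve
                             // without defining a whole new encoding
      AlgorithmEnum algorithm;
      uint8_t exponent;      // illustrative ALP parameters
      uint8_t factor;
      uint16_t exception_count;
    };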


>
> >> * the "Total Encoded Element count" duplicates information already in
> >> the page header, with risks of inconsistent values (including security
> >> risks that require specific care in implementations)
> >
> > 'num_elements' : let me re-read and get back on this.
>
> Ok, Micah's answer cleared that up.
>

Cool.


>
> >> * what happens if the number of exceptions is above 65535? their indices
> >> are coded as 16-bit uints. How about using the same encoding as for
> >> bit-packed integers (e.g. DELTA_BINARY_PACKED), which will also remove
> >> the 65535 limitation.
> >
> > So, I don't see a need for a vector larger than 65535. With that large
> > vectors the
> > overhead of metadata is small and you might as well break it into
> multiple
> > vectors. I'm gonna give it some more thought and get back.
>
> Ah, yes, I think you're right. Let's forget this :-)
>


:).


>
> > Sampling process should be statistically significant. It should pick
> enough
> > values
> > and not have bias towards just the values towards the start. ALP
> algorithm
> > ensures
> > that and tries to balance between not spending enough cycles to get right
> > parameters
> > and picking incorrect parameters.
> >
> > For a very large row group we can change the constant and have it
> > selects over a larger data set.
> > Or one can do it at page level too. Happy to discuss more on this.
>
> In Parquet C++, encoding happens at page level, and I would guess other
> implementations do something similar. Sampling cannot reasonably be done
> at a higher level, that would require invasive architectural changes.
>
> But this raises another question: why does this have to be a constant?
> When encoding a page, you know the population size (i.e. the actual
> number of values that will be encoded). You don't need to estimate it
> with a constant.
>

Some engines might decide to write very large pages (a large number of
values in a page, ~2B). One might not want to feed all of those values
into the sampling process, to keep sampling a relatively lightweight
operation. Also, as @Andrew Lamb <[email protected]> pointed out, engines
that at times already know the best hyper-parameter values might want to
provide them as input via an API that accepts pre-populated presets. This
can happen across multiple pages or multiple row groups, after having
calculated things for the first batch of data, which would make sampling
even more lightweight. I would gravitate towards flexibility here :).
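
For example, a lightweight sampler along these lines (a hedged sketch,
not the ALP reference implementation; the function and budget are made
up) could cap the cost for ~2B-value pages by taking evenly spaced values
up to a fixed budget, while an engine that already knows good parameters
could skip it entirely and pass presets instead:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    std::vector<double> sample_page(const double* values, size_t num_values,
                                    size_t budget /* e.g. a few thousand */) {
      std::vector<double> sample;
      if (num_values == 0 || budget == 0) return sample;
      // An evenly spaced stride avoids biasing the sample towards the
      // start of the page and bounds the work regardless of page size.
      size_t stride = std::max<size_t>(1, num_values / budget);
      sample.reserve(std::min(num_values, budget) + 1);
      for (size_t i = 0; i < num_values; i += stride) {
        sample.push_back(values[i]);
      }
      return sample;
    }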


Thanks for this discussion,
Prateek
