Hi Prateek,
Thanks a lot for the answers.
On 27/01/2026 at 04:14, PRATEEK GAUR wrote:
* I cannot seem to run the C++ benchmarks because of the git submodule
configuration. It may be easy to fix, but I'm looking for guidance here
:-)
```
$ LANG=C git submodule update
fatal: transport 'file' not allowed
fatal: Fetched in submodule path 'submodules/parquet-testing', but it
did not contain 66dfde8b2a569e7cbc8e998153e8dd6f2b36f940. Direct
fetching of that commit failed.
```
I think that is because the dataset branch hasn't been merged in yet.
The files are in this pull request
https://github.com/apache/parquet-testing/pull/100/files.
You might have to check out that branch to be able to run the benchmarks.
Hmm, I'll take a look at that later and come back.
* the encoding of integers uses a custom framing with frame-of-reference
encoding inside it, but Parquet implementations already implement
DELTA_BINARY_PACKED which should have similar characteristics, so why
not reuse that?
I did look at DELTA_BINARY_PACKED. Unless I understood it wrong, it
didn't fit the needs.
My understanding of DELTA_BINARY_PACKED is this:
delta[i] = value[i] - value[i-1]
Pasting an example of why it may fail.
Thanks. At worst it seems it would need one more bit per integer than
the proposed FOR scheme (it might need fewer if delta encoding allows
reducing entropy among the encoded integers). I'm not sure how that
makes it "fail".
* there are a lot of fields in the headers that look a bit superfluous
(though of course those bits are relatively cheap); for example, why
have a format "version" while we could define a new encoding for
incompatible evolutions?
We discussed this point a lot in the Spec document
<https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit>
and have gravitated towards a versioning scheme for easier evolution.
* Meta-answer:
I don't see any such discussion in the comments and the document doesn't
state any rationale for it.
As a reference, Python PEPs include a discussion of rejected
alternatives so that people don't ask the same questions over and over
(see https://peps.python.org/pep-0810/#alternate-implementation-ideas
for an example).
* Actual answer:
My problem with this is that it will complicate understanding and
communicating the feature set supported by each Parquet implementation.
"ALP" will not be a single spec but an evolving one with its own version
numbers. I'm not sure why that is better than adding an "ALP2" if we
ever want to evolve the spec.
It will also complicate APIs that currently accept encoding numbers,
such as
https://github.com/apache/arrow/blob/5272a68c134deea82040f2f29bb6257ad7b52be0/cpp/src/parquet/properties.h#L221
We need a clear explanation of what makes ALP so special that it *needs*
its own versioning scheme. Otherwise we should remove it IMHO.
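For illustration, a hypothetical sketch of the API impact (the names
and enum numbers below are made up, not Arrow's actual API):

```
# Hypothetical sketch (names and numbers are made up, not Arrow's API):
# today a single enum value fully identifies an encoding.
from enum import Enum

class Encoding(Enum):
    PLAIN = 0
    DELTA_BINARY_PACKED = 5
    ALP = 99  # hypothetical value

def set_encoding(column: str, encoding: Encoding) -> None:
    """Current shape: one value selects the encoding."""

# With an internally versioned ALP, the enum alone no longer identifies
# the format; such APIs grow an extra, ALP-specific parameter, and
# readers advertising supported encodings must carry a version too.
def set_encoding_versioned(column: str, encoding: Encoding,
                           alp_version: int | None = None) -> None:
    """Versioned shape: callers must also negotiate a version."""
```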
* the "Total Encoded Element count" duplicates information already in
the page header, with risks of inconsistent values (including security
risks that require specific care in implementations)
'num_elements': let me re-read and get back on this.
Ok, Micah's answer cleared that up.
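For readers following along, the "specific care" mentioned above boils
down to cross-checking the duplicated count before trusting it; a
minimal sketch with illustrative names, not a real implementation:

```
# Minimal sketch (illustrative names, not a real implementation):
# a duplicated count must be cross-checked before sizing any buffers,
# otherwise an inconsistent value can cause over-reads or over-writes.
def validate_counts(page_header_num_values: int,
                    alp_num_elements: int) -> int:
    if alp_num_elements != page_header_num_values:
        raise ValueError("corrupt page: duplicated counts disagree")
    return page_header_num_values
```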
* what happens if the number of exceptions is above 65535? Their indices
are coded as 16-bit uints. How about using the same encoding as for
bit-packed integers (e.g. DELTA_BINARY_PACKED), which would also remove
the 65535 limitation?
So, I don't see a need for a vector larger than 65535. With vectors that
large, the overhead of metadata is small and you might as well break it
into multiple vectors. I'm gonna give it some more thought and get back.
Ah, yes, I think you're right. Let's forget this :-)
The sampling process should be statistically significant. It should pick
enough values and not be biased towards just the values at the start.
The ALP algorithm ensures that and tries to balance spending too many
cycles on finding the right parameters against picking incorrect ones.
For a very large row group we can change the constant and have it select
over a larger data set. Or one can do it at the page level too. Happy to
discuss more on this.
In Parquet C++, encoding happens at the page level, and I would guess
other implementations do something similar. Sampling cannot reasonably
be done at a higher level; that would require invasive architectural
changes.
But this raises another question: why does this have to be a constant?
When encoding a page, you know the population size (i.e. the actual
number of values that will be encoded). You don't need to estimate it
with a constant.
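To illustrate (a sketch only, with a made-up signature, not the ALP
algorithm): once the page's value count is known, the sample can be
spread evenly over the whole page rather than sized by a constant:

```
# Sketch only (made-up signature, not the ALP algorithm): with the
# page's value count known up front, spread the sample evenly over the
# page instead of deriving it from a constant.
def sample_indices(num_values: int, sample_size: int = 64) -> list[int]:
    if num_values <= sample_size:
        return list(range(num_values))  # small page: sample everything
    step = num_values / sample_size
    # Evenly spaced indices: no bias towards the start of the page.
    return [int(i * step) for i in range(sample_size)]

print(sample_indices(10))       # [0, 1, ..., 9]
print(sample_indices(100_000))  # 64 indices spread over the whole page
```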
Thank you
Antoine.