Hi Antoine,

I think I can perhaps add some more details to Prateek's answer.

>
> * the "Total Encoded Element count" duplicates information already in
> the page header, with risks of inconsistent values (including security
> risks that require specific care in implementations)
>

'num_elements' : let me re-read and get back on this.


This is confusing, but the data page header only contains the total number
of values "including null" [1], so it effectively gives the number of
repetition and definition levels, not the encoded value count. We expect
these counts to differ (DataPageV2 and statistics have null counts, but
generally I don't think these are passed to decoders anyway). The only
other way of retrieving this count is to decode all repetition/definition
levels up front to get the true count.
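
To make that concrete, here is a minimal sketch (plain Python with
illustrative names, not the actual parquet-cpp API) of what a decoder has
to do to recover the encoded value count from definition levels when the
header only gives the count including nulls:

```python
# Hypothetical helper: derive the encoded (non-null) value count from
# definition levels. The page header's value count includes nulls; only
# slots whose definition level equals the column's max definition level
# actually carry an encoded value in the data section.
def encoded_value_count(def_levels, max_def_level):
    return sum(1 for level in def_levels if level == max_def_level)

# Example: an optional column (max_def_level = 1) with 5 slots, 2 of them
# null. The page header reports 5 values, but only 3 are encoded.
print(encoded_value_count([1, 0, 1, 1, 0], max_def_level=1))  # -> 3
```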

DELTA_BINARY_PACKED also stores the total number of values [2].

BYTE_STREAM_SPLIT actually required a spec update [3] to state that no
padding is allowed, because otherwise there would be no other way to get
this number.
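
Here too, a tiny sketch (illustrative names, not a real API) of why the
no-padding rule is what makes the count recoverable for BYTE_STREAM_SPLIT:

```python
# Hypothetical sketch: with BYTE_STREAM_SPLIT and no padding allowed, the
# element count falls out of the encoded byte length directly.
def byte_stream_split_count(encoded: bytes, element_width: int) -> int:
    # Any padding bytes would make this division ambiguous, hence the
    # spec update.
    assert len(encoded) % element_width == 0
    return len(encoded) // element_width

# Example: 40 encoded bytes of 4-byte floats -> 10 elements.
print(byte_stream_split_count(b"\x00" * 40, element_width=4))  # -> 10
```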

> * the C++ implementation has a `kSamplerRowgroupSize` constant, which
> worries me; row group size can vary *a lot* between workloads (from
> thousands to millions of elements typically), the sampling process
> should not depend on that.

The last time I reviewed the C++ implementation I think we were actually
recalculating these values per page, so this might just be a naming issue
(as long as the implementation doesn't change). Page sizes could get
larger than this constant, but probably not by too much?
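
If it helps, here is a heavily simplified, hypothetical sketch of what
per-page parameter selection could look like (the sampling loop, names,
and candidate search are illustrative only, not the actual C++ code):

```python
import random

# Illustrative only: choose an ALP-style decimal exponent per page from a
# small sample of its values; kSampleSize is a made-up name.
kSampleSize = 32

def choose_exponent(page_values):
    sample = random.sample(page_values, min(kSampleSize, len(page_values)))
    best_exp, best_exceptions = 0, len(sample) + 1
    for exp in range(0, 19):  # candidate decimal exponents
        scale = 10 ** exp
        # Count "exceptions": values that do not round-trip exactly when
        # scaled to an integer and back.
        exceptions = sum(1 for v in sample if round(v * scale) / scale != v)
        if exceptions < best_exceptions:
            best_exp, best_exceptions = exp, exceptions
    return best_exp

# Each page picks its own parameters, so the sampling window does not have
# to follow row group boundaries.
print(choose_exponent([19.99, 5.49, 149.00, 0.99, 299.99]))  # -> 2
```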

Cheers,
Micah

[1]
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L679
[2]
https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5
[3]
https://github.com/apache/parquet-format/commit/230711fbfd8d3399cce935a4f39d1be7b6ad5ad5

On Mon, Jan 26, 2026 at 7:15 PM PRATEEK GAUR <[email protected]> wrote:

> Thanks Andrew for building momentum.
>
> Hi Antoine,
>
> Replies to your questions are inline.
>
> On Mon, Jan 26, 2026 at 2:45 AM Antoine Pitrou <[email protected]> wrote:
>
> >
> > Hey all,
> >
> > Thanks Prateek and Dhirhan for submitting this as it's clear you've been
> > putting quite a bit of work into it. IMHO, the ALP encoding looks very
> > promising as an addition to Parquet format.
> >
> > That said, I have a few technical concerns:
> >
> > * I cannot seem to run the C++ benchmarks because of the git submodule
> > configuration. It may be easier to fix but I'm looking for guidance here
> > :-)
> >
> > ```
> > $ LANG=C git submodule update
> > fatal: transport 'file' not allowed
> > fatal: Fetched in submodule path 'submodules/parquet-testing', but it
> > did not contain 66dfde8b2a569e7cbc8e998153e8dd6f2b36f940. Direct
> > fetching of that commit failed.
> > ```
> >
>
> I think that is because the dataset branch hasn't been merged in yet.
> The files are in this pull request
> https://github.com/apache/parquet-testing/pull/100/files.
> You might have to checkout that branch to be able to run the benchmarks.
>
> > * the encoding of integers uses a custom framing with frame-of-reference
> > encoding inside it, but Parquet implementations already implement
> > DELTA_BINARY_PACKED which should have similar characteristics, so why
> > not reuse that?
> >
>
> I did look at DELTA_BINARY_PACKED. Unless I understood it wrong, it didn't
> fit the needs.
>
> My understanding of DELTA_BINARY_PACKED is this
> delta[i] = value[i] - value[i-1]
>
> Pasting an example of why it may fail.
>
> Input:    [19.99, 5.49, 149.00, 0.99, 299.99, ...] // ALP encoded prices:
> [1999, 549, 14900, 99, 29999]
> ═══════════════════════════════════════════════════════════════════════════
>                         DELTA_BINARY_PACKED (Two Levels)
> ═══════════════════════════════════════════════════════════════════════════
>
> STEP 1: Adjacent differences
>   first_value = 1999  (stored in header)
>
>   delta[0] = 549 - 1999     = -1450
>   delta[1] = 14900 - 549    = +14351
>   delta[2] = 99 - 14900     = -14801
>   delta[3] = 29999 - 99     = +29900
>
>   Adjacent deltas = [-1450, +14351, -14801, +29900]
>
> STEP 2: Frame of reference on the deltas
>   min_delta = -14801  (stored as zigzag ULEB128)
>
>   adjusted[0] = -1450 - (-14801)  = 13351
>   adjusted[1] = +14351 - (-14801) = 29152
>   adjusted[2] = -14801 - (-14801) = 0
>   adjusted[3] = +29900 - (-14801) = 44701
>
>   Adjusted deltas = [13351, 29152, 0, 44701]
>
>   Range: 0 to 44701 → bit_width = ceil(log2(44702)) = 16 bits
>
> ═══════════════════════════════════════════════════════════════════════════
>                               FOR (Single Level)
> ═══════════════════════════════════════════════════════════════════════════
>
>   min = 99  (stored as frame_of_reference)
>
>   delta[0] = 1999 - 99   = 1900
>   delta[1] = 549 - 99    = 450
>   delta[2] = 14900 - 99  = 14801
>   delta[3] = 99 - 99     = 0
>   delta[4] = 29999 - 99  = 29900
>
>   Deltas = [1900, 450, 14801, 0, 29900]
>
>   Range: 0 to 29900 → bit_width = ceil(log2(29901)) = 15 bits
>
> I think for floating point we need the min, not adjacent subtractions,
> which will also include negative values.
>
>
> > * there are a lot of fields in the headers that look a bit superfluous
> > (though of course those bits are relatively cheap); for example, why
> > have a format "version" while we could define a new encoding for
> > incompatible evolutions?
> >
>
> We discussed this point a lot in the Spec document
> <https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit>
> and have gravitated towards a versioning scheme for easier evolution.
>
>
> >
> > * the "Total Encoded Element count" duplicates information already in
> > the page header, with risks of inconsistent values (including security
> > risks that require specific care in implementations)
> >
>
> 'num_elements' : let me re-read and get back on this.
>
>
> >
> > * what happens if the number of exceptions is above 65535? their indices
> > are coded as 16-bit uints. How about using the same encoding as for
> > bit-packed integers (e.g. DELTA_BINARY_PACKED), which will also remove
> > the 65535 limitation.
> >
>
> So, I don't see a need for a vector larger than 65535. With vectors that
> large, the overhead of metadata is small and you might as well break the
> data into multiple vectors. I'm going to give it some more thought and get
> back.
>
>
> >
> > * the C++ implementation has a `kSamplerRowgroupSize` constant, which
> > worries me; row group size can vary *a lot* between workloads (from
> > thousands to millions of elements typically), the sampling process
> > should not depend on that.
> >
>
> The sampling process should be statistically significant. It should pick
> enough values and not be biased towards just the values at the start. The
> ALP algorithm ensures that and tries to balance the cycles spent on
> finding the right parameters against the risk of picking incorrect
> parameters.
>
> For a very large row group we can change the constant and have it sample
> over a larger data set. Or one can do it at the page level too. Happy to
> discuss more on this.
>
>
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
> > On 16/10/2025 at 23:47, PRATEEK GAUR wrote:
> > > Hi team,
> > >
> > > We spent some time evaluating ALP compression and decompression compared
> > > to other encoding alternatives like CHIMP/GORILLA and compression
> > > techniques like SNAPPY/LZ4/ZSTD. We presented these numbers to the
> > > community members on October 15th in the biweekly parquet meeting. (I
> > > can't seem to access the recording, so please let me know what access
> > > rules I need to be able to view it.)
> > >
> > > We did this evaluation over some datasets pointed to by the ALP paper and
> > > some pointed to by the parquet community.
> > >
> > > The results are available in the following document:
> > > https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg
> > >
> > > Based on the numbers we see
> > >
> > >     - ALP is comparable to ZSTD(level=1) in terms of compression ratio
> > >     and much better than other schemes (numbers in the sheet are bytes
> > >     needed to encode each value).
> > >     - ALP is doing quite well in terms of decompression speed (numbers
> > >     in the sheet are bytes decompressed per second).
> > >
> > > As next steps we will
> > >
> > >     - Get the numbers for compression on top of byte stream split.
> > >     - Evaluate the algorithm over a few more datasets.
> > >     - Have an implementation in the arrow-parquet repo.
> > >
> > > Looking forward to feedback from the community.
> > >
> > > Best
> > > Prateek and Dhirhan
> > >
> >
> >
> >
>
