Thanks Andrew for building momentum.

Hi Antoine,

Replies to your questions are inline.

On Mon, Jan 26, 2026 at 2:45 AM Antoine Pitrou <[email protected]> wrote:

>
> Hey all,
>
> Thanks Prateek and Dhirhan for submitting this as it's clear you've been
> putting quite a bit of work into it. IMHO, the ALP encoding looks very
> promising as an addition to Parquet format.
>
> That said, I have a few technical concerns:
>
> * I cannot seem to run the C++ benchmarks because of the git submodule
> configuration. It may be easier to fix but I'm looking for guidance here
> :-)
>
> ```
> $ LANG=C git submodule update
> fatal: transport 'file' not allowed
> fatal: Fetched in submodule path 'submodules/parquet-testing', but it
> did not contain 66dfde8b2a569e7cbc8e998153e8dd6f2b36f940. Direct
> fetching of that commit failed.
> ```
>

I think that is because the dataset branch hasn't been merged yet.
The files are in this pull request:
https://github.com/apache/parquet-testing/pull/100/files
You might have to check out that branch to be able to run the benchmarks.

> * the encoding of integers uses a custom framing with frame-of-reference
> encoding inside it, but Parquet implementations already implement
> DELTA_BINARY_PACKED which should have similar characteristics, so why
> not reuse that?
>

I did look at DELTA_BINARY_PACKED. Unless I understood it wrong, it didn't
fit the needs here.

My understanding of DELTA_BINARY_PACKED is this:

  delta[i] = value[i] - value[i-1]

Here is an example of why it may fail.

Input prices:        [19.99, 5.49, 149.00, 0.99, 299.99, ...]
ALP-encoded prices:  [1999, 549, 14900, 99, 29999]
═══════════════════════════════════════════════════════════════════════════
                        DELTA_BINARY_PACKED (Two Levels)
═══════════════════════════════════════════════════════════════════════════

STEP 1: Adjacent differences
  first_value = 1999  (stored in header)

  delta[0] = 549 - 1999     = -1450
  delta[1] = 14900 - 549    = +14351
  delta[2] = 99 - 14900     = -14801
  delta[3] = 29999 - 99     = +29900

  Adjacent deltas = [-1450, +14351, -14801, +29900]

STEP 2: Frame of reference on the deltas
  min_delta = -14801  (stored as zigzag ULEB128)

  adjusted[0] = -1450 - (-14801)  = 13351
  adjusted[1] = +14351 - (-14801) = 29152
  adjusted[2] = -14801 - (-14801) = 0
  adjusted[3] = +29900 - (-14801) = 44701

  Adjusted deltas = [13351, 29152, 0, 44701]

  Range: 0 to 44701 → bit_width = ceil(log2(44702)) = 16 bits

═══════════════════════════════════════════════════════════════════════════
                              FOR (Single Level)
═══════════════════════════════════════════════════════════════════════════

  min = 99  (stored as frame_of_reference)

  delta[0] = 1999 - 99   = 1900
  delta[1] = 549 - 99    = 450
  delta[2] = 14900 - 99  = 14801
  delta[3] = 99 - 99     = 0
  delta[4] = 29999 - 99  = 29900

  Deltas = [1900, 450, 14801, 0, 29900]

  Range: 0 to 29900 → bit_width = ceil(log2(29901)) = 15 bits

I think for floating-point data we want the minimum as a frame of reference
rather than adjacent subtractions, which also produce negative deltas and,
in this example, cost an extra bit per value.
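
In case it helps, here is a quick Python sketch reproducing the arithmetic
above (the values are the ALP-encoded prices from the example; bit_width()
is just an illustrative helper, not Parquet code):

```python
import math

values = [1999, 549, 14900, 99, 29999]

def bit_width(max_value):
    # bits needed to represent the range 0..max_value
    return max(1, math.ceil(math.log2(max_value + 1)))

# DELTA_BINARY_PACKED style: adjacent deltas, then frame-of-reference
# on the deltas (min_delta is subtracted so everything is non-negative).
deltas = [b - a for a, b in zip(values, values[1:])]   # [-1450, 14351, -14801, 29900]
adjusted = [d - min(deltas) for d in deltas]           # [13351, 29152, 0, 44701]
print(bit_width(max(adjusted)))                        # 16

# Single-level FOR: subtract the minimum value directly.
for_adjusted = [v - min(values) for v in values]       # [1900, 450, 14801, 0, 29900]
print(bit_width(max(for_adjusted)))                    # 15
```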


> * there are a lot of fields in the headers that look a bit superfluous
> (though of course those bits are relatively cheap); for example, why
> have a format "version" while we could define a new encoding for
> incompatible evolutions?
>

We discussed this point at length in the Spec document
<https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit>
and have gravitated towards a versioning scheme for easier evolution.


>
> * the "Total Encoded Element count" duplicates information already in
> the page header, with risks of inconsistent values (including security
> risks that require specific care in implementations)
>

Regarding 'num_elements': let me re-read and get back to you on this.


>
> * what happens if the number of exceptions is above 65535? their indices
> are coded as 16-bit uints. How about using the same encoding as for
> bit-packed integers (e.g. DELTA_BINARY_PACKED), which will also remove
> the 65535 limitation.
>

So far I don't see a need for a vector larger than 65535 values. With
vectors that large the metadata overhead is already small, and you might as
well break the data into multiple vectors. I'm going to give it some more
thought and get back.
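
To illustrate the splitting argument, a rough Python sketch (not part of
the proposed spec; 65536 is simply the point at which uint16 indices stop
being sufficient):

```python
# Sketch only: if a vector never holds more than 65536 values, an
# exception position relative to its own vector always fits in a uint16.
MAX_VECTOR_SIZE = 65536  # largest size for which indices 0..65535 suffice

def split_into_vectors(values, vector_size=MAX_VECTOR_SIZE):
    """Yield consecutive slices whose local indices fit in 16 bits."""
    for start in range(0, len(values), vector_size):
        yield values[start:start + vector_size]

# A 1M-value column chunk becomes 16 vectors (the last one partially
# filled), so the fixed per-vector metadata is amortised over 65536 values.
```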


>
> * the C++ implementation has a `kSamplerRowgroupSize` constant, which
> worries me; row group size can vary *a lot* between workloads (from
> thousands to millions of elements typically), the sampling process
> should not depend on that.
>

The sampling process should be statistically significant: it should pick
enough values and not be biased towards values at the start of the data.
The ALP algorithm ensures that, balancing the cycles spent on finding the
right parameters against the risk of picking incorrect ones.

For a very large row group we can change the constant so that the sampler
selects over a larger data set, or the sampling could be done at the page
level instead. Happy to discuss this more.
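
For instance, the sampler could pick a fixed number of evenly spaced
positions regardless of row group size. A rough sketch of that idea (this
is not the actual ALP sampler, and num_samples is an arbitrary illustrative
value):

```python
# Sketch only: spread a fixed number of sample positions evenly over the
# row group, so parameter selection does not depend on a hard-coded
# row-group size and is not biased towards the start of the data.
def sample_positions(num_values, num_samples=64):
    if num_values <= num_samples:
        return list(range(num_values))
    step = num_values / num_samples
    return [int(i * step) for i in range(num_samples)]

# A row group of 50k values and one of 5M values both yield 64 sample
# points spread across the whole range.
```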


>
> Regards
>
> Antoine.
>
>
>
> Le 16/10/2025 à 23:47, PRATEEK GAUR a écrit :
> > Hi team,
> >
> > We spent some time evaluating ALP compression and decompression compared
> to
> > other encoding alternatives like CHIMP/GORILLA and compression techniques
> > like SNAPPY/LZ4/ZSTD. We presented these numbers to the community members
> > on October 15th in the biweekly parquet meeting. ( I can't seem to access
> > the recording, so please let me know what access rules I need to get to
> be
> > able to view it )
> >
> > We did this evaluation over some datasets pointed by the ALP paper and
> some
> > pointed by the parquet community.
> >
> > The results are available in the following document
> > <
> https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg/edit?tab=t.0
> >
> > :
> >
> https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg
> >
> > Based on the numbers we see
> >
> >     -  ALP is comparable to ZSTD(level=1) in terms of compression ratio
> and
> >     much better compared to other schemes. (numbers in the sheet are
> bytes
> >     needed to encode each value )
> >     - ALP going quite well in terms of decompression speed (numbers in
> the
> >     sheet are bytes decompressed per second)
> >
> > As next steps we will
> >
> >     - Get the numbers for compression on top of byte stream split.
> >     - Evaluate the algorithm over a few more datasets.
> >     - Have an implementation in the arrow-parquet repo.
> >
> > Looking forward to feedback from the community.
> >
> > Best
> > Prateek and Dhirhan
> >
>
>
>
