Hey all,
Thanks, Prateek and Dhirhan, for submitting this; it's clear you've put
quite a bit of work into it. IMHO, the ALP encoding looks very
promising as an addition to the Parquet format.
That said, I have a few technical concerns:
* I cannot seem to run the C++ benchmarks because of the git submodule
configuration. It may be easy to fix, but I'm looking for guidance here :-)
```
$ LANG=C git submodule update
fatal: transport 'file' not allowed
fatal: Fetched in submodule path 'submodules/parquet-testing', but it
did not contain 66dfde8b2a569e7cbc8e998153e8dd6f2b36f940. Direct
fetching of that commit failed.
```
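(For what it's worth, the "transport 'file' not allowed" error looks
like git's newer default of refusing the file:// transport for
submodules; if so, running `git config protocol.file.allow always`
before the update might work around it, though the submodule setup
itself is probably worth fixing.)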
* the encoding of integers uses a custom framing with frame-of-reference
encoding inside it, but Parquet implementations already implement
DELTA_BINARY_PACKED, which should have similar characteristics, so why
not reuse that? (see the first sketch after this list)
* there are a lot of fields in the headers that look a bit superfluous
(though of course those bits are relatively cheap); for example, why
have a format "version" when we could instead define a new encoding for
incompatible evolutions?
* the "Total Encoded Element count" duplicates information already in
the page header, with risks of inconsistent values (including security
risks that require specific care in implementations)
* what happens if the number of exceptions exceeds 65535? Their indices
are coded as 16-bit uints. How about using the same encoding as for the
bit-packed integers (e.g. DELTA_BINARY_PACKED), which would also remove
the 65535 limitation?
* the C++ implementation has a `kSamplerRowgroupSize` constant, which
worries me; row group size can vary *a lot* between workloads (typically
from thousands to millions of elements), and the sampling process should
not depend on it (see the last sketch below)
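To make the DELTA_BINARY_PACKED point concrete, here is a minimal
frame-of-reference sketch (names and layout are mine, not from the
proposal). DELTA_BINARY_PACKED subtracts the minimum delta per
miniblock before bit-packing, which is this same primitive applied to
deltas rather than raw values:
```cpp
#include <algorithm>
#include <bit>
#include <cstdint>
#include <vector>

// Frame-of-reference: store the block minimum once, then bit-pack the
// small non-negative offsets from it.  DELTA_BINARY_PACKED does the
// same per miniblock, only on deltas of consecutive values, so a FOR
// stream is close to a degenerate case of what readers already decode.
struct ForBlock {
  int64_t reference = 0;          // block minimum
  uint8_t bit_width = 0;          // bits per packed offset
  std::vector<uint64_t> offsets;  // value - reference (bit-packed on disk)
};

ForBlock ForEncode(const std::vector<int64_t>& values) {
  ForBlock block;
  if (values.empty()) return block;
  block.reference = *std::min_element(values.begin(), values.end());
  uint64_t max_offset = 0;
  for (int64_t v : values) {
    // Unsigned subtraction is well-defined and non-negative here,
    // since v >= reference.
    uint64_t off = static_cast<uint64_t>(v) -
                   static_cast<uint64_t>(block.reference);
    block.offsets.push_back(off);
    max_offset = std::max(max_offset, off);
  }
  block.bit_width = static_cast<uint8_t>(std::bit_width(max_offset));
  return block;
}
```
If the exception positions were stored with the same kind of encoding
instead of fixed 16-bit uints, the 65535 limitation mentioned above
would disappear as well.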
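On the duplicated element count, the practical consequence is that
every reader has to treat the two values as potentially disagreeing. A
minimal sketch of the kind of guard I mean (hypothetical names):
```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical guard: when the same count is stored in two places, a
// reader must cross-check them before sizing buffers or indexing,
// otherwise a corrupted or malicious file can drive out-of-bounds
// accesses.  Dropping the redundant field removes the hazard entirely.
void ValidateElementCount(uint32_t alp_total_elements,
                          uint32_t page_num_values) {
  if (alp_total_elements != page_num_values) {
    throw std::runtime_error("ALP element count disagrees with page header");
  }
}
```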
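And on the sampler, what I would expect is a sampling budget expressed
in ALP vectors rather than row groups; a rough sketch, assuming ALP's
1024-value vectors (the constants here are hypothetical):
```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch: sample a fixed number of equally spaced
// 1024-value vectors from whatever buffer is being encoded, so the
// sampling cost and quality do not depend on how large a row group
// happens to be.
constexpr size_t kVectorSize = 1024;    // ALP's vector granularity
constexpr size_t kVectorsToSample = 8;  // fixed budget, not tied to row groups

std::vector<size_t> PickSampleOffsets(size_t num_values) {
  std::vector<size_t> offsets;
  const size_t num_vectors = num_values / kVectorSize;
  if (num_vectors == 0) return offsets;  // buffer smaller than one vector
  const size_t take = std::min(num_vectors, kVectorsToSample);
  const size_t stride = num_vectors / take;
  for (size_t i = 0; i < take; ++i) {
    offsets.push_back(i * stride * kVectorSize);  // start of sampled vector
  }
  return offsets;
}
```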
Regards
Antoine.
On 16/10/2025 at 23:47, PRATEEK GAUR wrote:
Hi team,
We spent some time evaluating ALP compression and decompression compared to
other encoding alternatives like CHIMP/GORILLA and compression techniques
like SNAPPY/LZ4/ZSTD. We presented these numbers to the community members
on October 15th in the biweekly Parquet meeting. (I can't seem to access
the recording, so please let me know what access rules I need in order to
view it.)
We ran this evaluation over datasets referenced by the ALP paper and
others suggested by the Parquet community.
The results are available in the following document:
https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg
Based on the numbers, we see:
- ALP is comparable to ZSTD (level=1) in terms of compression ratio and
much better than the other schemes (numbers in the sheet are bytes
needed to encode each value).
- ALP does quite well in terms of decompression speed (numbers in the
sheet are bytes decompressed per second).
As next steps, we will:
- Get the numbers for compression on top of byte stream split.
- Evaluate the algorithm over a few more datasets.
- Have an implementation in the arrow-parquet repo.
Looking forward to feedback from the community.
Best
Prateek and Dhirhan