Hey all,

Thanks, Prateek and Dhirhan, for submitting this; it's clear you've put quite a bit of work into it. IMHO, the ALP encoding looks very promising as an addition to the Parquet format.

That said, I have a few technical concerns:

* I cannot seem to run the C++ benchmarks because of the git submodule configuration. It may be easy to fix, but I'm looking for guidance here :-)

```
$ LANG=C git submodule update
fatal: transport 'file' not allowed
fatal: Fetched in submodule path 'submodules/parquet-testing', but it did not contain 66dfde8b2a569e7cbc8e998153e8dd6f2b36f940. Direct fetching of that commit failed.
```
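
(For the record, the first error suggests Git's `protocol.file.allow` restriction on the `file` transport is kicking in; `git -c protocol.file.allow=always submodule update --init` might work around it locally, but the benchmark setup should probably not require that, and the missing-commit error may mean the parquet-testing pin also needs updating.)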

* the encoding of integers uses a custom framing with frame-of-reference encoding inside it, but Parquet implementations already implement DELTA_BINARY_PACKED, which should have similar characteristics; why not reuse that?
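
To make the comparison concrete, here is a minimal sketch of the frame-of-reference step (my own illustration, not the proposal's actual layout; the bit-packing of the residuals is omitted):

```
#include <algorithm>
#include <cstdint>
#include <vector>

// Frame-of-reference: subtract the block minimum so that only small,
// non-negative residuals remain to be bit-packed (assumes a non-empty
// input block).
std::vector<uint64_t> ForEncode(const std::vector<int64_t>& values,
                                int64_t* reference) {
  *reference = *std::min_element(values.begin(), values.end());
  std::vector<uint64_t> residuals;
  residuals.reserve(values.size());
  for (int64_t v : values) {
    // Cast before subtracting to avoid signed overflow on extreme inputs.
    residuals.push_back(static_cast<uint64_t>(v) -
                        static_cast<uint64_t>(*reference));
  }
  return residuals;
}
```

DELTA_BINARY_PACKED stores consecutive differences minus a per-block minimum difference, bit-packed per miniblock, so on the clustered integers ALP produces it should reach a similar size without introducing a new framing.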

* there are a lot of fields in the headers that look a bit superfluous (though of course those bits are relatively cheap); for example, why have a format "version" when we could simply define a new encoding for incompatible evolutions?

* the "Total Encoded Element count" duplicates information already in the page header, with risks of inconsistent values (including security risks that require specific care in implementations)

* what happens if the number of exceptions is above 65535? Their indices are coded as 16-bit uints. How about using the same encoding as for the bit-packed integers (e.g. DELTA_BINARY_PACKED), which would also remove the 65535 limitation?

* the C++ implementation has a `kSamplerRowgroupSize` constant, which worries me; row group sizes can vary *a lot* between workloads (typically from thousands to millions of elements), so the sampling process should not depend on that.
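
As a rough sketch of what I would expect instead (the names are mine, not the implementation's): derive the sample positions from however many values are actually buffered, e.g. a bounded number of evenly spaced vectors, so the sampling density does not silently change with row group size.

```
#include <algorithm>
#include <cstddef>
#include <vector>

// Return the starting offsets of up to `max_samples` evenly spaced
// sample vectors, based on the actual number of buffered values
// rather than an assumed row group size.
std::vector<size_t> SampleOffsets(size_t num_values, size_t vector_size,
                                  size_t max_samples) {
  std::vector<size_t> offsets;
  const size_t num_vectors = num_values / vector_size;
  if (num_vectors == 0 || max_samples == 0) return offsets;
  const size_t samples = std::min(max_samples, num_vectors);
  const size_t stride = num_vectors / samples;
  for (size_t i = 0; i < samples; ++i) {
    offsets.push_back(i * stride * vector_size);
  }
  return offsets;
}
```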

Regards

Antoine.



On 16/10/2025 at 23:47, PRATEEK GAUR wrote:
Hi team,

We spent some time evaluating ALP compression and decompression compared to
other encoding alternatives like CHIMP/GORILLA and compression techniques
like SNAPPY/LZ4/ZSTD. We presented these numbers to the community members
on October 15th in the biweekly Parquet meeting. (I can't seem to access
the recording, so please let me know what access rules I need in order to
view it.)

We did this evaluation over some datasets pointed to by the ALP paper and
some suggested by the Parquet community.

The results are available in the following document:
https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg

Based on the numbers, we see:

    - ALP is comparable to ZSTD (level=1) in terms of compression ratio and
    much better than the other schemes (numbers in the sheet are bytes
    needed to encode each value).
    - ALP does quite well in terms of decompression speed (numbers in the
    sheet are bytes decompressed per second).

As next steps, we will:

    - Get the numbers for compression on top of byte stream split.
    - Evaluate the algorithm over a few more datasets.
    - Have an implementation in the arrow-parquet repo.

Looking forward to feedback from the community.

Best
Prateek and Dhirhan


