Hey all,
Thanks, Prateek and Dhirhan, for submitting this; it's clear you've put
quite a bit of work into it. IMHO, the ALP encoding looks very
promising as an addition to the Parquet format.
That said, I have a few technical concerns:
* I cannot seem to run the C++ benchmarks because of the git submodule
configuration. It may be easy to fix, but I'm looking for guidance here :-)
```
$ LANG=C git submodule update
fatal: transport 'file' not allowed
fatal: Fetched in submodule path 'submodules/parquet-testing', but it
did not contain 66dfde8b2a569e7cbc8e998153e8dd6f2b36f940. Direct
fetching of that commit failed.
```
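(For what it's worth, the "transport 'file' not allowed" error looks
like git's newer default of refusing the file:// transport for
submodules; if so, running `git config protocol.file.allow always`
before the update might work around it, though the submodule setup
itself is probably worth fixing.)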
* the encoding of integers uses a custom framing with frame-of-reference
encoding inside it, but Parquet implementations already implement
DELTA_BINARY_PACKED, which should have similar characteristics, so why
not reuse that? (see the first sketch after this list)
* there are a lot of fields in the headers that look a bit superfluous
(though of course those bits are relatively cheap); for example, why
have a format "version" when we could instead define a new encoding for
incompatible evolutions?
* the "Total Encoded Element count" duplicates information already in
the page header, with risks of inconsistent values (including security
risks that require specific care in implementations)
* what happens if the number of exceptions exceeds 65535? Their indices
are coded as 16-bit uints. How about using the same encoding as for the
bit-packed integers (e.g. DELTA_BINARY_PACKED), which would also remove
the 65535 limitation?
* the C++ implementation has a `kSamplerRowgroupSize` constant, which
worries me; row group size can vary *a lot* between workloads (typically
from thousands to millions of elements), and the sampling process should
not depend on it (see the last sketch below)
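To make the DELTA_BINARY_PACKED point concrete, here is a minimal
frame-of-reference sketch (names and layout are mine, not from the
proposal). DELTA_BINARY_PACKED subtracts the minimum delta per
miniblock before bit-packing, which is this same primitive applied to
deltas rather than raw values:
```cpp
#include <algorithm>
#include <bit>
#include <cstdint>
#include <vector>

// Frame-of-reference: store the block minimum once, then bit-pack the
// small non-negative offsets from it.  DELTA_BINARY_PACKED does the
// same per miniblock, only on deltas of consecutive values, so a FOR
// stream is close to a degenerate case of what readers already decode.
struct ForBlock {
  int64_t reference = 0;          // block minimum
  uint8_t bit_width = 0;          // bits per packed offset
  std::vector<uint64_t> offsets;  // value - reference (bit-packed on disk)
};

ForBlock ForEncode(const std::vector<int64_t>& values) {
  ForBlock block;
  if (values.empty()) return block;
  block.reference = *std::min_element(values.begin(), values.end());
  uint64_t max_offset = 0;
  for (int64_t v : values) {
    // Unsigned subtraction is well-defined and non-negative here,
    // since v >= reference.
    uint64_t off = static_cast<uint64_t>(v) -
                   static_cast<uint64_t>(block.reference);
    block.offsets.push_back(off);
    max_offset = std::max(max_offset, off);
  }
  block.bit_width = static_cast<uint8_t>(std::bit_width(max_offset));
  return block;
}
```
If the exception positions were stored with the same kind of encoding
instead of fixed 16-bit uints, the 65535 limitation mentioned above
would disappear as well.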
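On the duplicated element count, the practical consequence is that
every reader has to treat the two values as potentially disagreeing. A
minimal sketch of the kind of guard I mean (hypothetical names):
```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical guard: when the same count is stored in two places, a
// reader must cross-check them before sizing buffers or indexing,
// otherwise a corrupted or malicious file can drive out-of-bounds
// accesses.  Dropping the redundant field removes the hazard entirely.
void ValidateElementCount(uint32_t alp_total_elements,
                          uint32_t page_num_values) {
  if (alp_total_elements != page_num_values) {
    throw std::runtime_error("ALP element count disagrees with page header");
  }
}
```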
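And on the sampler, what I would expect is a sampling budget expressed
in ALP vectors rather than row groups; a rough sketch, assuming ALP's
1024-value vectors (the constants here are hypothetical):
```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch: sample a fixed number of equally spaced
// 1024-value vectors from whatever buffer is being encoded, so the
// sampling cost and quality do not depend on how large a row group
// happens to be.
constexpr size_t kVectorSize = 1024;    // ALP's vector granularity
constexpr size_t kVectorsToSample = 8;  // fixed budget, not tied to row groups

std::vector<size_t> PickSampleOffsets(size_t num_values) {
  std::vector<size_t> offsets;
  const size_t num_vectors = num_values / kVectorSize;
  if (num_vectors == 0) return offsets;  // buffer smaller than one vector
  const size_t take = std::min(num_vectors, kVectorsToSample);
  const size_t stride = num_vectors / take;
  for (size_t i = 0; i < take; ++i) {
    offsets.push_back(i * stride * kVectorSize);  // start of sampled vector
  }
  return offsets;
}
```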
Regards
Antoine.
On 16/10/2025 at 23:47, PRATEEK GAUR wrote:
Hi team,
We spent some time evaluating ALP compression and decompression compared to
other encoding alternatives like CHIMP/GORILLA and compression techniques
like SNAPPY/LZ4/ZSTD. We presented these numbers to the community members
on October 15th in the biweekly Parquet meeting. (I can't seem to access
the recording, so please let me know what access rules I need in order to
view it.)
We ran this evaluation over datasets referenced by the ALP paper and
others suggested by the Parquet community.
The results are available in the following document:
https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg
Based on the numbers, we see:
- ALP is comparable to ZSTD (level=1) in terms of compression ratio and
much better than the other schemes (numbers in the sheet are bytes
needed to encode each value).
- ALP does quite well in terms of decompression speed (numbers in the
sheet are bytes decompressed per second).
As next steps, we will:
- Get the numbers for compression on top of byte stream split.
- Evaluate the algorithm over a few more datasets.
- Have an implementation in the arrow-parquet repo.
Looking forward to feedback from the community.
Best
Prateek and Dhirhan