Thank you for sharing that. Very interesting. I do think decompression speed is generally more important than compression speed. Another thing to consider is the possibility of operating directly on the compressed data, e.g. for filtering: zstd-compressed data has to be fully decompressed before any filtering, arithmetic, etc. can be applied, whereas I believe at least filtering could be done in place on some of these other encodings. Apologies if this was already discussed in the meeting.
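To make that concrete, here is a rough sketch of what filtering on encoded data could look like. It assumes an ALP-style decimal encoding where each vector stores integers plus a single scale factor whose product reconstructs the original doubles; that single-scale model, the Rust language choice, and the filter_ge helper are simplifications of mine for illustration, not the actual ALP layout or any existing implementation.

    // A ">= threshold" filter pushed into the integer domain of a
    // decimal-style encoding: translate the threshold once per vector,
    // then compare the stored integers without decoding to f64.
    // A real implementation would also have to handle rounding edge
    // cases and exception values, which are ignored here.
    fn filter_ge(enc: &[i64], scale: f64, threshold: f64) -> Vec<usize> {
        let int_threshold = (threshold / scale).ceil() as i64;
        enc.iter()
            .enumerate()
            .filter(|&(_, &v)| v >= int_threshold)
            .map(|(i, _)| i)
            .collect()
    }

    fn main() {
        // 0.01, 0.02, 0.03 stored as 1, 2, 3 with a per-vector scale of 0.01.
        let enc = vec![1i64, 2, 3];
        let hits = filter_ge(&enc, 0.01, 0.02);
        assert_eq!(hits, vec![1, 2]);
        println!("matching positions: {:?}", hits);
    }

The point being that a comparison in the double domain can often be rewritten as a comparison over the stored integers, which is exactly the kind of operation a general-purpose codec like zstd rules out until after decompression.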
> On Oct 16, 2025, at 4:47 PM, PRATEEK GAUR <[email protected]> wrote:
>
> Hi team,
>
> We spent some time evaluating ALP compression and decompression compared to
> other encoding alternatives like CHIMP/GORILLA and compression techniques
> like SNAPPY/LZ4/ZSTD. We presented these numbers to the community members on
> October 15th in the biweekly Parquet meeting. (I can't seem to access the
> recording, so please let me know what access rules I need in order to view it.)
>
> We did this evaluation over some datasets pointed to by the ALP paper and
> some pointed to by the Parquet community.
>
> The results are available in the following document:
> https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg
>
> Based on the numbers we see:
>
> - ALP is comparable to ZSTD (level=1) in terms of compression ratio and much
>   better than the other schemes (numbers in the sheet are bytes needed to
>   encode each value).
> - ALP does quite well in terms of decompression speed (numbers in the sheet
>   are bytes decompressed per second).
>
> As next steps we will:
>
> - Get the numbers for compression on top of byte stream split.
> - Evaluate the algorithm over a few more datasets.
> - Have an implementation in the arrow-parquet repo.
>
> Looking forward to feedback from the community.
>
> Best,
> Prateek and Dhirhan
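On the byte stream split item in the next steps: for anyone not familiar with it, the transform gathers byte j of every value into its own stream before the page is handed to a general-purpose codec, which tends to make the buffer more compressible. A minimal sketch, again in Rust and purely illustrative (byte_stream_split is a made-up helper, not the Parquet implementation):

    // Scatter the 8 bytes of each f64 into 8 separate streams,
    // laid out one after another in the output buffer.
    fn byte_stream_split(values: &[f64]) -> Vec<u8> {
        let n = values.len();
        let mut out = vec![0u8; n * 8];
        for (i, v) in values.iter().enumerate() {
            for (j, b) in v.to_le_bytes().iter().enumerate() {
                // Stream j occupies out[j * n .. (j + 1) * n].
                out[j * n + i] = *b;
            }
        }
        out
    }

    fn main() {
        let split = byte_stream_split(&[1.0f64, 2.0, 3.0]);
        assert_eq!(split.len(), 24);
    }

It will be interesting to see how ALP compares once that transform plus a codec is in the mix.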
