Hi team,

We spent some time evaluating ALP compression and decompression against
other encoding alternatives such as CHIMP/GORILLA and compression
techniques such as SNAPPY/LZ4/ZSTD. We presented these numbers to the
community in the biweekly Parquet meeting on October 15th. (I can't seem
to access the recording, so please let me know what access I need in order
to view it.)

We did this evaluation over datasets referenced in the ALP paper as well
as some suggested by the Parquet community.

The results are available in the following document:
https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg

Based on the numbers we see:

   - ALP is comparable to ZSTD(level=1) in terms of compression ratio and
   much better than the other schemes. (The numbers in the sheet are bytes
   needed to encode each value.)
   - ALP does quite well in terms of decompression speed. (The numbers in
   the sheet are bytes decompressed per second.) A short sketch of how both
   metrics are computed follows this list.
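
For readers of the sheet, here is a minimal sketch of how those two
metrics can be derived, assuming an f64 column. decode_alp and the
compressed buffer below are placeholders for illustration only, not a real
decoder API or actual ALP output; the real harness is in the linked doc.

    use std::time::Instant;

    // Placeholder decoder: a real ALP decoder would reconstruct the
    // original doubles here. This one just reinterprets raw bytes.
    fn decode_alp(compressed: &[u8]) -> Vec<f64> {
        compressed
            .chunks_exact(8)
            .map(|c| f64::from_le_bytes(c.try_into().unwrap()))
            .collect()
    }

    fn main() {
        let num_values = 1_000_000usize;
        let compressed: Vec<u8> = vec![0u8; 8 * num_values]; // stand-in buffer

        // Compression ratio as reported: bytes needed to encode each value.
        let bytes_per_value = compressed.len() as f64 / num_values as f64;

        // Decompression speed as reported: bytes decompressed per second.
        let start = Instant::now();
        let decoded = decode_alp(&compressed);
        let secs = start.elapsed().as_secs_f64();
        let bytes_per_sec = (decoded.len() * 8) as f64 / secs;

        println!("{bytes_per_value:.2} bytes/value, {bytes_per_sec:.0} bytes/s");
    }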

As next steps we will:

   - Get the numbers for compression on top of byte stream split (a rough
   sketch of the byte stream split idea is included after this list).
   - Evaluate the algorithm over a few more datasets.
   - Have an implementation in the arrow-parquet repo.
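
For context on that first item, here is a minimal sketch of what byte
stream split does for f64 values; the byte_stream_split helper is written
for illustration and is not the parquet-rs API.

    // Byte k of every value is grouped into stream k. For similar values the
    // high-order streams are nearly constant, which is what a follow-up
    // compression step (e.g. ZSTD) can exploit.
    fn byte_stream_split(values: &[f64]) -> Vec<Vec<u8>> {
        let mut streams = vec![Vec::with_capacity(values.len()); 8];
        for v in values {
            for (k, byte) in v.to_le_bytes().iter().enumerate() {
                streams[k].push(*byte);
            }
        }
        streams
    }

    fn main() {
        let data = [1.0_f64, 1.5, 2.0, 2.5];
        for (k, stream) in byte_stream_split(&data).iter().enumerate() {
            println!("stream {k}: {stream:?}");
        }
    }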

Looking forward to feedback from the community.

Best
Prateek and Dhirhan
