Hi team,

We spent some time evaluating ALP compression and decompression against other encodings such as CHIMP/GORILLA and general-purpose compression codecs such as SNAPPY/LZ4/ZSTD. We presented these numbers to the community in the biweekly Parquet meeting on October 15th. (I can't seem to access the recording, so please let me know what access I need in order to view it.)
We ran this evaluation over some of the datasets referenced in the ALP paper and some suggested by the Parquet community. The results are available in the following document: https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg

Based on the numbers we see:
- ALP is comparable to ZSTD (level=1) in terms of compression ratio and much better than the other schemes (the numbers in the sheet are bytes needed to encode each value).
- ALP does quite well in terms of decompression speed (the numbers in the sheet are bytes decompressed per second).

As next steps we will:
- Get the compression numbers on top of byte stream split.
- Evaluate the algorithm over a few more datasets.
- Have an implementation in the arrow-parquet repo.

Looking forward to feedback from the community.

Best,
Prateek and Dhirhan
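
P.S. For anyone sanity-checking the sheet: the two metrics are simply compressed bytes per value and decompressed bytes per second. A minimal sketch of that arithmetic is below; the codec here (stdlib zlib) is only a stand-in for ALP/CHIMP/GORILLA/SNAPPY/LZ4/ZSTD and is not our actual harness.

import struct
import time
import zlib

def bytes_per_value(compressed_size_bytes: int, num_values: int) -> float:
    # Compression-ratio metric in the sheet: average bytes needed to encode one value.
    return compressed_size_bytes / num_values

def decompression_throughput(decompress, compressed: bytes) -> float:
    # Decompression-speed metric in the sheet: bytes of decoded output per second of wall-clock time.
    start = time.perf_counter()
    decoded = decompress(compressed)
    elapsed = time.perf_counter() - start
    return len(decoded) / elapsed

if __name__ == "__main__":
    values = [i * 0.1 for i in range(100_000)]          # illustrative doubles, not a benchmark dataset
    raw = struct.pack(f"<{len(values)}d", *values)       # 8 bytes per value, little-endian
    compressed = zlib.compress(raw, level=1)             # stand-in codec for illustration only
    print("bytes/value:", bytes_per_value(len(compressed), len(values)))
    print("decompressed bytes/sec:", decompression_throughput(zlib.decompress, compressed))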
