Thanks, Micah, for the round of feedback. Here is a link to the spec document: https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit
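[For anyone reading the archive without the ALP paper at hand, here is a toy sketch of the pseudo-decimal idea discussed in this thread, including the exception mechanism for values such as NaN/Inf that do not round-trip. This is my own illustration, not the proposed spec: the real scheme also has a second "factor" parameter, samples a vector instead of scanning it, and bit-packs the resulting integers.]

```python
import math

def alp_encode(values, max_exp=10):
    """Toy ALP pseudo-decimal encoder (illustration only, not the spec).

    Tries each decimal exponent e and keeps the one with the fewest
    exceptions. A value round-trips when round(x * 10^e) / 10^e == x;
    anything else (including NaN/Inf) is stored as a positional exception.
    """
    best = None
    for e in range(max_exp + 1):
        ints, exceptions = [], []
        for i, x in enumerate(values):
            scaled = x * 10**e
            if math.isfinite(scaled) and round(scaled) / 10**e == x:
                ints.append(round(scaled))      # exact decimal round-trip
            else:
                ints.append(0)                  # placeholder, patched on decode
                exceptions.append((i, x))       # non-finite or non-decimal double
        if best is None or len(exceptions) < len(best[2]):
            best = (e, ints, exceptions)
    return best

def alp_decode(e, ints, exceptions):
    out = [d / 10**e for d in ints]
    for i, x in exceptions:                     # restore exceptions verbatim
        out[i] = x
    return out
```

[In the real encoding the integer vector would then be compressed further, which is where the FOR vs. DELTA_BINARY_PACKED question later in the thread comes in.]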
On Tue, Nov 25, 2025 at 8:57 AM PRATEEK GAUR <[email protected]> wrote:

> On Sat, Nov 22, 2025 at 4:49 AM Steve Loughran <[email protected]> wrote:
>
>> First, sorry: I think I accidentally marked as done the comment in the
>> doc about x86 performance.
>
> No worries, I restored the thread :).
>
>> Those x86 numbers are critical, especially AVX512 in a recent Intel part.
>> There's a notorious feature in the early ones where the cores would
>> reduce frequency after you used the opcodes, as a way of managing die
>> temperature
>> (https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency);
>> the later ones and AMD models are the ones to worry about.
>
> We did collect performance numbers in our early prototype, and they looked
> good on x86 hardware, though I didn't check the processor family.
> For our Arrow implementation we are also working on a comprehensive
> benchmarking script, which will help everyone run it on different CPU
> families to get a good idea of performance.
>
> Best
> Prateek
>
>> On Sat, 22 Nov 2025 at 04:15, Prateek Gaur via dev
>> <[email protected]> wrote:
>>
>>> Hi team,
>>>
>>> *ALP ---> ALP PseudoDecimal*
>>>
>>> As is visible from the numbers above, and as stated in the paper, it is
>>> very difficult to get a good compression ratio for real double values,
>>> i.e. values with high-precision fractional parts.
>>>
>>> Given that, combined with the fact that we want to keep the
>>> spec/implementation simpler (quoting Antoine directly):
>>>
>>> `*2. Do not include the ALPrd fallback, which is a homegrown dictionary
>>> encoding without dictionary reuse across pages, and instead rely on a
>>> well-known Parquet encoding (such as BYTE_STREAM_SPLIT?)*`
>>>
>>> and also based on some discussions I had with Julien in person and in
>>> the biweekly meeting with a number of you:
>>>
>>> We'll be going with ALPpd (pseudo-decimal) as the first implementation,
>>> relying on the query engine to decide, based on its own heuristics, on
>>> the right fallback to BYTE_STREAM_SPLIT or ZSTD.
>>>
>>> Best
>>> Prateek
>>>
>>> On Thu, Nov 20, 2025 at 5:09 PM Prateek Gaur <[email protected]> wrote:
>>>
>>>> Sheet with numbers
>>>> <https://docs.google.com/spreadsheets/d/1NmCg0WZKeZUc6vNXXD8M3GIyNqF_H3goj6mVbT8at7A/edit?gid=1351944517#gid=1351944517>.
>>>>
>>>> On Thu, Nov 20, 2025 at 5:09 PM PRATEEK GAUR <[email protected]> wrote:
>>>>
>>>>> Hi team,
>>>>>
>>>>> There was a request from a few folks, Antoine Pitrou and Adam Reeve
>>>>> if I remember correctly, to run the experiment on the datasets from
>>>>> some of the papers that discussed BYTE_STREAM_SPLIT, for
>>>>> completeness. I wanted to share the numbers for the same in this
>>>>> sheet. At this point we have numbers on a wide variety of data.
>>>>> (I will have to share the sheet from my Snowflake account, as our
>>>>> laptops have a fair bit of restriction with respect to copy/paste
>>>>> permissions :) )
>>>>>
>>>>> Best
>>>>> Prateek
>>>>>
>>>>> On Thu, Nov 20, 2025 at 2:25 PM PRATEEK GAUR <[email protected]> wrote:
>>>>>
>>>>>> Hi Julien,
>>>>>>
>>>>>> Yes. Based on
>>>>>>
>>>>>> - the numbers presented,
>>>>>> - the discussions over the doc, and
>>>>>> - multiple discussions in the biweekly meeting,
>>>>>>
>>>>>> we are at a stage where we agree this is the right encoding to add,
>>>>>> and we can move from the DISCUSS stage to the DRAFT/POC stage.
>>>>>> I will start working on the PR for the same.
>>>>>>
>>>>>> Thanks for bringing this up.
>>>>>> Prateek
>>>>>>
>>>>>> On Thu, Nov 20, 2025 at 8:16 AM Julien Le Dem <[email protected]> wrote:
>>>>>>
>>>>>>> @PRATEEK GAUR <[email protected]>: Would you agree that we are past
>>>>>>> the DISCUSS step and into the DRAFT/POC phase, according to the
>>>>>>> proposals process
>>>>>>> <https://github.com/apache/parquet-format/tree/master/proposals>?
>>>>>>> If yes, could you open a PR on this page to add this proposal to
>>>>>>> the list?
>>>>>>> https://github.com/apache/parquet-format/tree/master/proposals
>>>>>>> Thank you!
>>>>>>>
>>>>>>> On Thu, Oct 30, 2025 at 2:38 PM Andrew Lamb <[email protected]> wrote:
>>>>>>>
>>>>>>>> I have filed a ticket[1] in arrow-rs to track prototyping ALP in
>>>>>>>> the Rust Parquet reader, if anyone is interested.
>>>>>>>>
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>> [1]: https://github.com/apache/arrow-rs/issues/8748
>>>>>>>>
>>>>>>>> On Wed, Oct 22, 2025 at 1:33 PM Micah Kornfield <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> > C++, Java and Rust support them for sure. I feel like we
>>>>>>>>> > should probably default to V2 at some point.
>>>>>>>>>
>>>>>>>>> I seem to recall that some of the vectorized Java readers
>>>>>>>>> (Iceberg, Spark) might not support V2 data pages (but I might
>>>>>>>>> be confusing this with encodings). This is only a vague
>>>>>>>>> recollection, though.
>>>>>>>>> On Wed, Oct 22, 2025 at 6:38 AM Andrew Lamb <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> > Someone has to add V2 data pages to
>>>>>>>>>> > https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
>>>>>>>>>> > :)
>>>>>>>>>>
>>>>>>>>>> Your wish is my command:
>>>>>>>>>> https://github.com/apache/parquet-site/pull/124
>>>>>>>>>>
>>>>>>>>>> As the format grows in popularity and momentum builds to evolve
>>>>>>>>>> it, I feel the content on the parquet.apache.org site could use
>>>>>>>>>> refreshing/updating. So, while I had the site open, I made some
>>>>>>>>>> other PRs to scratch various itches
>>>>>>>>>>
>>>>>>>>>> (I am absolutely 🎣 for someone to please review 🙏):
>>>>>>>>>>
>>>>>>>>>> 1. Add Variant/Geometry/Geography types to the implementation
>>>>>>>>>> status matrix: https://github.com/apache/parquet-site/pull/123
>>>>>>>>>> 2. Improve the introduction/overview and add more links to the
>>>>>>>>>> spec and implementation status:
>>>>>>>>>> https://github.com/apache/parquet-site/pull/125
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Andrew
>>>>>>>>>>
>>>>>>>>>> On Wed, Oct 22, 2025 at 4:09 AM Antoine Pitrou <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Julien, hi all,
>>>>>>>>>>>
>>>>>>>>>>> On Mon, 20 Oct 2025 15:14:58 -0700
>>>>>>>>>>> Julien Le Dem <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Another question from me:
>>>>>>>>>>>>
>>>>>>>>>>>> Since the goal is to not use compression at all in this case
>>>>>>>>>>>> (no ZSTD), I'm assuming we would be using either:
>>>>>>>>>>>>
>>>>>>>>>>>> - the Data Page V1 with UNCOMPRESSED in the
>>>>>>>>>>>> ColumnMetadata.column
>>>>>>>>>>>> <https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L887>
>>>>>>>>>>>> field, or
>>>>>>>>>>>> - the Data Page V2 with false in the
>>>>>>>>>>>> DataPageHeaderV2.is_compressed
>>>>>>>>>>>> <https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L746>
>>>>>>>>>>>> field.
>>>>>>>>>>>>
>>>>>>>>>>>> The second helps decide whether we can selectively compress
>>>>>>>>>>>> some pages if they are less compressed by the
>>>>>>>>>>>> A few years ago there was a question on the support of
>>>>>>>>>>>> DATA_PAGE_V2, and I was curious to hear a refresh on how
>>>>>>>>>>>> that's generally supported in Parquet implementations.
>>>>>>>>>>>> The is_compressed field was exactly intended to avoid block
>>>>>>>>>>>> compression when the encoding itself is good enough.
>>>>>>>>>>>
>>>>>>>>>>> Someone has to add V2 data pages to
>>>>>>>>>>> https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
>>>>>>>>>>> :)
>>>>>>>>>>>
>>>>>>>>>>> C++, Java and Rust support them for sure. I feel like we should
>>>>>>>>>>> probably default to V2 at some point.
>>>>>>>>>>>
>>>>>>>>>>> Also see https://github.com/apache/parquet-java/issues/3344 for
>>>>>>>>>>> Java.
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>>
>>>>>>>>>>> Antoine.
>>>>>>>>>>>
>>>>>>>>>>>> Julien
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 20, 2025 at 11:57 AM Andrew Lamb <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks again, Prateek and co, for pushing this along!
>>>>>>>>>>>>>
>>>>>>>>>>>>> > 1. Design and write our own Parquet-ALP spec so that
>>>>>>>>>>>>> > implementations know exactly how to encode and represent
>>>>>>>>>>>>> > data
>>>>>>>>>>>>>
>>>>>>>>>>>>> 100% agree with this (similar to what was done for
>>>>>>>>>>>>> ParquetVariant).
>>>>>>>>>>>>>
>>>>>>>>>>>>> > 2. I may be missing something, but the paper doesn't seem
>>>>>>>>>>>>> > to mention non-finite values (such as +/-Inf and NaNs).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think they are handled via the "Exception" mechanism.
>>>>>>>>>>>>> Vortex's ALP implementation (below) does appear to handle
>>>>>>>>>>>>> non-finite numbers[2].
>>>>>>>>>>>>>
>>>>>>>>>>>>> > 3. It seems there is a single implementation, which is the
>>>>>>>>>>>>> > one published together with the paper. It is not obvious
>>>>>>>>>>>>> > that it will be maintained in the future, and reusing it
>>>>>>>>>>>>> > is probably not an option for non-C++ Parquet
>>>>>>>>>>>>> > implementations.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My understanding from the call was that Prateek and team
>>>>>>>>>>>>> re-implemented ALP (they did not use the implementation from
>>>>>>>>>>>>> CWI[3]), but that would be good to confirm.
>>>>>>>>>>>>>
>>>>>>>>>>>>> There is also a Rust implementation of ALP[1] that is part of
>>>>>>>>>>>>> the Vortex file format implementation. I have not reviewed it
>>>>>>>>>>>>> to see if it deviates from the algorithm presented in the
>>>>>>>>>>>>> paper.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]: https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/lib.rs
>>>>>>>>>>>>> [2]: https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/alp/compress.rs#L266-L281
>>>>>>>>>>>>> [3]: https://github.com/cwida/ALP
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 20, 2025 at 4:47 AM Antoine Pitrou <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for doing this, and I agree the numbers look
>>>>>>>>>>>>>> impressive.
>>>>>>>>>>>>>> I would ask, if possible, for more data points:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. More datasets: you could for example look at the datasets
>>>>>>>>>>>>>> that were used to originally evaluate BYTE_STREAM_SPLIT (see
>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/PARQUET-1622 and
>>>>>>>>>>>>>> specifically the Google Doc linked there).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. Comparison to BYTE_STREAM_SPLIT + LZ4 and
>>>>>>>>>>>>>> BYTE_STREAM_SPLIT + ZSTD.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 3. Optionally, some perf numbers on x86 too, but I expect
>>>>>>>>>>>>>> that ALP will remain very good there as well.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I also have the following reservations towards ALP:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. There is no published official spec AFAICT, just a
>>>>>>>>>>>>>> research paper.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. I may be missing something, but the paper doesn't seem to
>>>>>>>>>>>>>> mention non-finite values (such as +/-Inf and NaNs).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 3. It seems there is a single implementation, which is the
>>>>>>>>>>>>>> one published together with the paper. It is not obvious
>>>>>>>>>>>>>> that it will be maintained in the future, and reusing it is
>>>>>>>>>>>>>> probably not an option for non-C++ Parquet implementations.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 4. The encoding itself is complex, since it involves a
>>>>>>>>>>>>>> fallback on another encoding if the primary encoding (which
>>>>>>>>>>>>>> constitutes the real innovation) doesn't work out on a piece
>>>>>>>>>>>>>> of data.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Based on this, I would say that if we think ALP is
>>>>>>>>>>>>>> attractive for us, we may want to incorporate our own
>>>>>>>>>>>>>> version of ALP with the following changes:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Design and write our own Parquet-ALP spec so that
>>>>>>>>>>>>>> implementations know exactly how to encode and represent
>>>>>>>>>>>>>> data.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. Do not include the ALPrd fallback, which is a homegrown
>>>>>>>>>>>>>> dictionary encoding without dictionary reuse across pages,
>>>>>>>>>>>>>> and instead rely on a well-known Parquet encoding (such as
>>>>>>>>>>>>>> BYTE_STREAM_SPLIT?).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 3. Replace the FOR encoding inside ALP, which aims at
>>>>>>>>>>>>>> compressing integers efficiently, with our own
>>>>>>>>>>>>>> DELTA_BINARY_PACKED (which has the same qualities and is
>>>>>>>>>>>>>> already available in Parquet implementations).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Antoine.
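[To make change 3 above concrete for readers of the archive: frame-of-reference (FOR) and delta encoding both shrink the integer vector that ALP produces. A minimal Python sketch of the two ideas, as my own illustration; the real DELTA_BINARY_PACKED layout with its block header, miniblocks, and zigzag varints is omitted:]

```python
def frame_of_reference(ints):
    """FOR, as used inside ALP: subtract the block minimum so the
    residuals are small non-negative numbers that can be bit-packed
    into a few bits each."""
    base = min(ints)
    residuals = [v - base for v in ints]
    bit_width = max(residuals).bit_length()   # bits needed per residual
    return base, bit_width, residuals

def delta(ints):
    """The core idea behind Parquet's DELTA_BINARY_PACKED: store the
    first value plus successive differences (real on-disk layout with
    miniblocks and zigzag encoding omitted)."""
    return ints[0], [b - a for a, b in zip(ints, ints[1:])]
```

[For the slowly varying decimal digits ALP tends to produce, both transforms leave residuals of similar magnitude, which is why swapping one for the other is plausible.]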
>>>>>>>>>>>>>> On Thu, 16 Oct 2025 14:47:33 -0700
>>>>>>>>>>>>>> PRATEEK GAUR <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi team,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We spent some time evaluating ALP compression and
>>>>>>>>>>>>>>> decompression compared to other encoding alternatives like
>>>>>>>>>>>>>>> CHIMP/GORILLA and compression techniques like
>>>>>>>>>>>>>>> SNAPPY/LZ4/ZSTD. We presented these numbers to the
>>>>>>>>>>>>>>> community members on October 15th in the biweekly Parquet
>>>>>>>>>>>>>>> meeting. (I can't seem to access the recording, so please
>>>>>>>>>>>>>>> let me know what access rules I need in order to be able to
>>>>>>>>>>>>>>> view it.)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We did this evaluation over some datasets pointed to by the
>>>>>>>>>>>>>>> ALP paper and some pointed to by the Parquet community.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The results are available in the following document
>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg/edit?tab=t.0>:
>>>>>>>>>>>>>>> https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Based on the numbers, we see:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - ALP is comparable to ZSTD(level=1) in terms of
>>>>>>>>>>>>>>> compression ratio and much better than the other schemes.
>>>>>>>>>>>>>>> (Numbers in the sheet are bytes needed to encode each
>>>>>>>>>>>>>>> value.)
>>>>>>>>>>>>>>> - ALP is doing quite well in terms of decompression speed.
>>>>>>>>>>>>>>> (Numbers in the sheet are bytes decompressed per second.)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As next steps we will:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Get the numbers for compression on top of byte stream
>>>>>>>>>>>>>>> split.
>>>>>>>>>>>>>>> - Evaluate the algorithm over a few more datasets.
>>>>>>>>>>>>>>> - Have an implementation in the arrow-parquet repo.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looking forward to feedback from the community.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best
>>>>>>>>>>>>>>> Prateek and Dhirhan
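[As a coda on the is_compressed discussion earlier in the thread: the per-page flag in DataPageHeaderV2 lets a writer skip block compression when an encoding like ALP has already done the work. A hedged sketch of the decision a writer could make; zlib stands in here for whatever codec the column chunk is configured with, and the 5% threshold is an arbitrary choice of mine:]

```python
import zlib

def maybe_compress_page(page: bytes, min_saving=0.05):
    """Per-page decision a Data Page V2 writer could make: compress the
    page only when it actually saves space, and record the choice in the
    page header's is_compressed flag (sketch only; zlib is a stand-in
    for the column chunk's configured codec)."""
    compressed = zlib.compress(page, 1)
    if len(compressed) <= len(page) * (1 - min_saving):
        return True, compressed     # is_compressed = true
    return False, page              # is_compressed = false, stored raw
```

[With Data Page V1 the choice is all-or-nothing per column chunk via the codec field, which is why the thread treats the V2 flag as the more attractive vehicle for ALP-encoded pages.]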
