Thanks, Micah, for the round of feedback.

Here is a link to the spec document:
https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit

On Tue, Nov 25, 2025 at 8:57 AM PRATEEK GAUR <[email protected]> wrote:

> On Sat, Nov 22, 2025 at 4:49 AM Steve Loughran <[email protected]>
> wrote:
>
>> First, sorry: I think I accidentally marked as done the comment in the
>> doc about x86 performance.
>>
>
> No worries, I restored the thread :).
>
>> Those x86 numbers are critical, especially AVX512 in a recent Intel part.
>> There's a notorious feature in the early ones where the cores would reduce
>> frequency after you used the opcodes as a way of managing die temperature (
>> https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency
>> ); the later ones and AMD models are the ones to worry about.
>>
>
> We did collect performance numbers in our early prototype, and they looked
> good on x86 hardware, though I didn't check the processor family.
> In our Arrow implementation we are also working on a comprehensive
> benchmarking script, which will help everyone run it on different CPU
> families to get a good idea of performance.
>
> Best
> Prateek
>
>
>> On Sat, 22 Nov 2025 at 04:15, Prateek Gaur via dev <
>> [email protected]> wrote:
>>
>>> Hi team,
>>>
>>> *ALP ---> ALP PseudoDecimal*
>>>
>>> As the numbers above show, and as the paper also states, it is very
>>> difficult to get a good compression ratio for real double values, i.e.
>>> values with high-precision fractional parts.
>>>
>>> This, combined with the fact that we want to keep the spec/implementation
>>> simple, quoting Antoine directly here:
>>>
>>> `*2. Do not include the ALPrd fallback which is a homegrown dictionary*
>>> *encoding without dictionary reuse across pages, and instead rely on a*
>>> *well-known Parquet encoding (such as BYTE_STREAM_SPLIT?)*`
>>>
>>> and also based on the discussions I had with Julien in person and in the
>>> biweekly meeting with a number of you, leads to the following decision:
>>>
>>> We'll be going with ALPpd (pseudo-decimal) as the first implementation,
>>> relying on the query engine's own heuristics to decide on the right
>>> fallback to BYTE_STREAM_SPLIT or ZSTD.
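
[For illustration, a minimal sketch of the pseudo-decimal scheme described in
the ALP paper. The exponent/factor pair (e, f) is taken as given here; ALP
proper picks it per vector by sampling, and a production implementation
compares bit patterns rather than relying on float equality.]

    import math

    def alp_encode(values, e, f):
        # Encode a vector of doubles as integers d = round(v * 10^e / 10^f).
        # Values that do not round-trip exactly, and non-finite values
        # (NaN, +/-Inf), are kept verbatim as "exceptions".
        ints, exceptions = [], {}
        for i, v in enumerate(values):
            if math.isfinite(v):
                scaled = v * 10.0 ** e / 10.0 ** f
                if math.isfinite(scaled):
                    d = round(scaled)
                    if d * 10.0 ** f / 10.0 ** e == v:  # exact round-trip?
                        ints.append(d)
                        continue
            exceptions[i] = v  # patched back in on decode
            ints.append(0)     # placeholder keeps positions aligned
        return ints, exceptions

    def alp_decode(ints, exceptions, e, f):
        out = [d * 10.0 ** f / 10.0 ** e for d in ints]
        for i, v in exceptions.items():
            out[i] = v
        return out

With e=2, f=0, a value like 1.23 encodes as the integer 123 and round-trips
exactly, while math.pi or NaN lands in the exceptions; the integer stream is
what then gets bit-packed (FOR in the paper, possibly DELTA_BINARY_PACKED in
Parquet, per the discussion further down).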
>>>
>>> Best
>>> Prateek
>>>
>>>
>>>
>>> On Thu, Nov 20, 2025 at 5:09 PM Prateek Gaur <[email protected]>
>>> wrote:
>>>
>>> > Sheet with numbers:
>>> > <https://docs.google.com/spreadsheets/d/1NmCg0WZKeZUc6vNXXD8M3GIyNqF_H3goj6mVbT8at7A/edit?gid=1351944517#gid=1351944517>
>>> >
>>> > On Thu, Nov 20, 2025 at 5:09 PM PRATEEK GAUR <[email protected]>
>>> > wrote:
>>> >
>>> >> Hi team,
>>> >>
>>> >> There was a request from a few folks, Antoine Pitrou and Adam Reeve if
>>> >> I remember correctly, to run the evaluation, for completeness, on the
>>> >> datasets from some of the papers that discussed BYTE_STREAM_SPLIT.
>>> >> I wanted to share the numbers for those in this sheet. At this point
>>> >> we have numbers on a wide variety of data.
>>> >> (I will have to share the sheet from my Snowflake account, as our
>>> >> laptops have a fair bit of restriction on copy/paste permissions :) )
>>> >>
>>> >> Best
>>> >> Prateek
>>> >>
>>> >> On Thu, Nov 20, 2025 at 2:25 PM PRATEEK GAUR <[email protected]>
>>> >> wrote:
>>> >>
>>> >>> Hi Julien,
>>> >>>
>>> >>> Yes. Based on:
>>> >>>
>>> >>>    - the numbers presented,
>>> >>>    - the discussions on the doc, and
>>> >>>    - multiple discussions in the biweekly meeting,
>>> >>>
>>> >>> we are at a stage where we agree this is the right encoding to add, and
>>> >>> we can move from the DISCUSS stage to the DRAFT/POC stage.
>>> >>> I will start working on the PR for this.
>>> >>>
>>> >>> Thanks for bringing this up.
>>> >>> Prateek
>>> >>>
>>> >>> On Thu, Nov 20, 2025 at 8:16 AM Julien Le Dem <[email protected]>
>>> >>> wrote:
>>> >>>
>>> >>>> @PRATEEK GAUR <[email protected]>: Would you agree that we are past
>>> >>>> the DISCUSS step and into the DRAFT/POC phase according to the
>>> >>>> proposals process
>>> >>>> <https://github.com/apache/parquet-format/tree/master/proposals>?
>>> >>>> If yes, could you open a PR on this page to add this proposal to the
>>> >>>> list?
>>> >>>> https://github.com/apache/parquet-format/tree/master/proposals
>>> >>>> Thank you!
>>> >>>>
>>> >>>>
>>> >>>> On Thu, Oct 30, 2025 at 2:38 PM Andrew Lamb
>>> >>>> <[email protected]> wrote:
>>> >>>>
>>> >>>> > I have filed a ticket[1] in arrow-rs to track prototyping ALP in
>>> >>>> > the Rust Parquet reader, if anyone is interested.
>>> >>>> >
>>> >>>> > Andrew
>>> >>>> >
>>> >>>> > [1]:  https://github.com/apache/arrow-rs/issues/8748
>>> >>>> >
>>> >>>> > > On Wed, Oct 22, 2025 at 1:33 PM Micah Kornfield
>>> >>>> > > <[email protected]> wrote:
>>> >>>> >
>>> >>>> > > >
>>> >>>> > > > C++, Java and Rust support them for sure. I feel like we
>>> >>>> > > > should probably default to V2 at some point.
>>> >>>> > >
>>> >>>> > >
>>> >>>> > > I seem to recall that some of the vectorized Java readers
>>> >>>> > > (Iceberg, Spark) might not support V2 data pages (but I might be
>>> >>>> > > confusing this with encodings). But this is only a vague
>>> >>>> > > recollection.
>>> >>>> > >
>>> >>>> > >
>>> >>>> > >
>>> >>>> > > On Wed, Oct 22, 2025 at 6:38 AM Andrew Lamb
>>> >>>> > > <[email protected]> wrote:
>>> >>>> > >
>>> >>>> > > > > Someone has to add V2 data pages to
>>> >>>> > > > > https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
>>> >>>> > > > > :)
>>> >>>> > > >
>>> >>>> > > > Your wish is my command:
>>> >>>> > > > https://github.com/apache/parquet-site/pull/124
>>> >>>> > > >
>>> >>>> > > > As the format grows in popularity and momentum builds to evolve
>>> >>>> > > > it, I feel the content on the parquet.apache.org site could use
>>> >>>> > > > refreshing / updating. So, while I had the site open, I made
>>> >>>> > > > some other PRs to scratch various itches
>>> >>>> > > >
>>> >>>> > > > (I am absolutely 🎣 for someone to please review 🙏):
>>> >>>> > > >
>>> >>>> > > > 1. Add Variant/Geometry/Geography types to the implementation
>>> >>>> > > > status matrix: https://github.com/apache/parquet-site/pull/123
>>> >>>> > > > 2. Improve introduction / overview, add more links to spec and
>>> >>>> > > > implementation status:
>>> >>>> > > > https://github.com/apache/parquet-site/pull/125
>>> >>>> > > >
>>> >>>> > > >
>>> >>>> > > > Thanks,
>>> >>>> > > > Andrew
>>> >>>> > > >
>>> >>>> > > > On Wed, Oct 22, 2025 at 4:09 AM Antoine Pitrou
>>> >>>> > > > <[email protected]> wrote:
>>> >>>> > > >
>>> >>>> > > > >
>>> >>>> > > > > Hi Julien, hi all,
>>> >>>> > > > >
>>> >>>> > > > > On Mon, 20 Oct 2025 15:14:58 -0700
>>> >>>> > > > > Julien Le Dem <[email protected]> wrote:
>>> >>>> > > > > >
>>> >>>> > > > > > Another question from me:
>>> >>>> > > > > >
>>> >>>> > > > > > Since the goal is to not use compression at all in this case
>>> >>>> > > > > > (no ZSTD), I'm assuming we would be using either:
>>> >>>> > > > > > - the Data Page V1 with UNCOMPRESSED in the
>>> >>>> > > > > > ColumnMetaData.codec field
>>> >>>> > > > > > <https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L887>
>>> >>>> > > > > > - the Data Page V2 with false in the
>>> >>>> > > > > > DataPageHeaderV2.is_compressed field
>>> >>>> > > > > > <https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L746>
>>> >>>> > > > > > The second helps decide if we can selectively compress some
>>> >>>> > > > > > pages if they are less compressed by the encoding.
>>> >>>> > > > > > A few years ago there was a question on the support of
>>> >>>> > > > > > DATA_PAGE_V2, and I was curious to hear a refresh on how
>>> >>>> > > > > > that's generally supported in Parquet implementations. The
>>> >>>> > > > > > is_compressed field was exactly intended to avoid block
>>> >>>> > > > > > compression when the encoding itself is good enough.
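
[For illustration, a minimal writer-side sketch of the per-page decision that
DataPageHeaderV2.is_compressed enables. The helper and the 1.1 ratio
threshold are made up for this example; `zstandard` is the third-party Python
binding for ZSTD.]

    import zstandard  # third-party ZSTD binding: pip install zstandard

    def maybe_compress_page(encoded: bytes, min_ratio: float = 1.1):
        # Return (page_bytes, is_compressed) for one v2 data page: keep the
        # encoded bytes as-is when block compression barely helps.
        compressed = zstandard.ZstdCompressor(level=1).compress(encoded)
        if len(encoded) >= min_ratio * len(compressed):
            return compressed, True   # DataPageHeaderV2.is_compressed = true
        return encoded, False         # page stored uncompressed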
>>> >>>> > > > >
>>> >>>> > > > > Someone has to add V2 data pages to
>>> >>>> > > > > https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
>>> >>>> > > > > :)
>>> >>>> > > > >
>>> >>>> > > > > C++, Java and Rust support them for sure. I feel like we
>>> >>>> > > > > should probably default to V2 at some point.
>>> >>>> > > > >
>>> >>>> > > > > Also see https://github.com/apache/parquet-java/issues/3344
>>> >>>> > > > > for Java.
>>> >>>> > > > >
>>> >>>> > > > > Regards
>>> >>>> > > > >
>>> >>>> > > > > Antoine.
>>> >>>> > > > >
>>> >>>> > > > >
>>> >>>> > > > > >
>>> >>>> > > > > > Julien
>>> >>>> > > > > >
>>> >>>> > > > > > On Mon, Oct 20, 2025 at 11:57 AM Andrew Lamb
>>> >>>> > > > > > <[email protected]> wrote:
>>> >>>> > > > > >
>>> >>>> > > > > > > Thanks again Prateek and co for pushing this along!
>>> >>>> > > > > > >
>>> >>>> > > > > > >
>>> >>>> > > > > > > > 1. Design and write our own Parquet-ALP spec so that
>>> >>>> > > > > > > > implementations know exactly how to encode and represent
>>> >>>> > > > > > > > data
>>> >>>> > > > > > >
>>> >>>> > > > > > > 100% agree with this (similar to what was done for
>>> >>>> > > > > > > ParquetVariant).
>>> >>>> > > > > > >
>>> >>>> > > > > > > > 2. I may be missing something, but the paper doesn't seem
>>> >>>> > > > > > > > to mention non-finite values (such as +/-Inf and NaNs).
>>> >>>> > > > > > >
>>> >>>> > > > > > > I think they are handled via the "Exception" mechanism.
>>> >>>> > > > > > > Vortex's ALP implementation (below) does appear to handle
>>> >>>> > > > > > > non-finite numbers[2].
>>> >>>> > > > > > >
>>> >>>> > > > > > > > 3. It seems there is a single implementation, which is
>>> >>>> > > > > > > > the one published together with the paper. It is not
>>> >>>> > > > > > > > obvious that it will be maintained in the future, and
>>> >>>> > > > > > > > reusing it is probably not an option for non-C++ Parquet
>>> >>>> > > > > > > > implementations.
>>> >>>> > > > > > >
>>> >>>> > > > > > > My understanding from the call was that Prateek and team
>>> >>>> > > > > > > re-implemented ALP (did not use the implementation from
>>> >>>> > > > > > > CWI[3]), but that would be good to confirm.
>>> >>>> > > > > > >
>>> >>>> > > > > > > There is also a Rust implementation of ALP[1] that is part
>>> >>>> > > > > > > of the Vortex file format implementation. I have not
>>> >>>> > > > > > > reviewed it to see if it deviates from the algorithm
>>> >>>> > > > > > > presented in the paper.
>>> >>>> > > > > > >
>>> >>>> > > > > > > Andrew
>>> >>>> > > > > > >
>>> >>>> > > > > > > [1]:
>>> >>>> > > > > > > https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/lib.rs
>>> >>>> > > > > > > [2]:
>>> >>>> > > > > > > https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/alp/compress.rs#L266-L281
>>> >>>> > > > > > > [3]: https://github.com/cwida/ALP
>>> >>>> > > > > > >
>>> >>>> > > > > > >
>>> >>>> > > > > > > On Mon, Oct 20, 2025 at 4:47 AM Antoine Pitrou
>>> >>>> > > > > > > <[email protected]> wrote:
>>> >>>> > > > > > >
>>> >>>> > > > > > > >
>>> >>>> > > > > > > > Hello,
>>> >>>> > > > > > > >
>>> >>>> > > > > > > > Thanks for doing this, and I agree the numbers look
>>> >>>> > > > > > > > impressive.
>>> >>>> > > > > > > >
>>> >>>> > > > > > > > I would ask, if possible, for more data points:
>>> >>>> > > > > > > >
>>> >>>> > > > > > > > 1. More datasets: you could for example look at the
>>> >>>> > > > > > > > datasets that were used to originally evaluate
>>> >>>> > > > > > > > BYTE_STREAM_SPLIT (see
>>> >>>> > > > > > > > https://issues.apache.org/jira/browse/PARQUET-1622 and
>>> >>>> > > > > > > > specifically the Google Doc linked there)
>>> >>>> > > > > > > >
>>> >>>> > > > > > > > 2. Comparison to BYTE_STREAM_SPLIT + LZ4 and
>>> >>>> > > > > > > > BYTE_STREAM_SPLIT + ZSTD
>>> >>>> > > > > > > >
>>> >>>> > > > > > > > 3. Optionally, some perf numbers on x86 too, but I expect
>>> >>>> > > > > > > > that ALP will remain very good there as well
>>> >>>> > > > > > > >
>>> >>>> > > > > > > >
>>> >>>> > > > > > > > I also have the following reservations towards ALP:
>>> >>>> > > > > > > >
>>> >>>> > > > > > > > 1. There is no published official spec AFAICT, just a
>>> >>>> > > > > > > > research paper.
>>> >>>> > > > > > > >
>>> >>>> > > > > > > > 2. I may be missing something, but the paper doesn't seem
>>> >>>> > > > > > > > to mention non-finite values (such as +/-Inf and NaNs).
>>> >>>> > > > > > > >
>>> >>>> > > > > > > > 3. It seems there is a single implementation, which is
>>> >>>> > > > > > > > the one published together with the paper. It is not
>>> >>>> > > > > > > > obvious that it will be maintained in the future, and
>>> >>>> > > > > > > > reusing it is probably not an option for non-C++ Parquet
>>> >>>> > > > > > > > implementations.
>>> >>>> > > > > > > >
>>> >>>> > > > > > > > 4. The encoding itself is complex, since it involves a
>>> >>>> > > > > > > > fallback on another encoding if the primary encoding
>>> >>>> > > > > > > > (which constitutes the real innovation) doesn't work out
>>> >>>> > > > > > > > on a piece of data.
>>> >>>> > > > > > > >
>>> >>>> > > > > > > >
>>> >>>> > > > > > > > Based on this, I would say that if we think ALP is
>>> >>>> > > > > > > > attractive for us, we may want to incorporate our own
>>> >>>> > > > > > > > version of ALP with the following changes:
>>> >>>> > > > > > > >
>>> >>>> > > > > > > > 1. Design and write our own Parquet-ALP spec so that
>>> >>>> > > > > > > > implementations know exactly how to encode and represent
>>> >>>> > > > > > > > data
>>> >>>> > > > > > > >
>>> >>>> > > > > > > > 2. Do not include the ALPrd fallback, which is a homegrown
>>> >>>> > > > > > > > dictionary encoding without dictionary reuse across pages,
>>> >>>> > > > > > > > and instead rely on a well-known Parquet encoding (such as
>>> >>>> > > > > > > > BYTE_STREAM_SPLIT?)
>>> >>>> > > > > > > >
>>> >>>> > > > > > > > 3. Replace the FOR encoding inside ALP, which aims at
>>> >>>> > > > > > > > compressing integers efficiently, with our own
>>> >>>> > > > > > > > DELTA_BINARY_PACKED (which has the same qualities and is
>>> >>>> > > > > > > > already available in Parquet implementations)
>>> >>>> > > > > > > >
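
[For illustration, conceptual sketches of the transforms compared in points 2
and 3 above. Only the core ideas are shown; the real BYTE_STREAM_SPLIT and
DELTA_BINARY_PACKED layouts (page headers, miniblocks, bit-packing) are
defined by the Parquet spec.]

    import struct

    def byte_stream_split(values):
        # BYTE_STREAM_SPLIT: scatter the k-th byte of each double into
        # stream k, so bytes with similar statistics end up together.
        raw = [struct.pack("<d", v) for v in values]
        return [bytes(b[k] for b in raw) for k in range(8)]

    def frame_of_reference(ints):
        # FOR, as used inside ALP: subtract the block minimum so the
        # residuals need fewer bits when bit-packed (assumes a non-empty
        # block).
        base = min(ints)
        return base, [v - base for v in ints]

    def delta_transform(ints):
        # The transform behind DELTA_BINARY_PACKED: a first value plus
        # successive differences, which the real encoding then bit-packs
        # in miniblocks.
        return ints[0], [b - a for a, b in zip(ints, ints[1:])]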
>>> >>>> > > > > > > > Regards
>>> >>>> > > > > > > >
>>> >>>> > > > > > > > Antoine.
>>> >>>> > > > > > > >
>>> >>>> > > > > > > >
>>> >>>> > > > > > > >
>>> >>>> > > > > > > > On Thu, 16 Oct 2025 14:47:33 -0700
>>> >>>> > > > > > > > PRATEEK GAUR <[email protected]> wrote:
>>> >>>> > > > > > > > > Hi team,
>>> >>>> > > > > > > > >
>>> >>>> > > > > > > > > We spent some time evaluating ALP compression and
>>> >>>> > > > > > > > > decompression compared to other encoding alternatives
>>> >>>> > > > > > > > > like CHIMP/GORILLA and compression techniques like
>>> >>>> > > > > > > > > SNAPPY/LZ4/ZSTD. We presented these numbers to the
>>> >>>> > > > > > > > > community members on October 15th in the biweekly
>>> >>>> > > > > > > > > parquet meeting. (I can't seem to access the recording,
>>> >>>> > > > > > > > > so please let me know what access rights I need in order
>>> >>>> > > > > > > > > to view it.)
>>> >>>> > > > > > > > >
>>> >>>> > > > > > > > > We did this evaluation over some datasets pointed to by
>>> >>>> > > > > > > > > the ALP paper and some pointed to by the parquet
>>> >>>> > > > > > > > > community.
>>> >>>> > > > > > > > >
>>> >>>> > > > > > > > > The results are available in the following document:
>>> >>>> > > > > > > > > https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg/edit?tab=t.0
>>> >>>> > > > > > > > >
>>> >>>> > > > > > > > > Based on the numbers, we see:
>>> >>>> > > > > > > > >
>>> >>>> > > > > > > > >    - ALP is comparable to ZSTD(level=1) in terms of
>>> >>>> > > > > > > > >      compression ratio and much better than the other
>>> >>>> > > > > > > > >      schemes (numbers in the sheet are bytes needed to
>>> >>>> > > > > > > > >      encode each value).
>>> >>>> > > > > > > > >    - ALP does quite well in terms of decompression speed
>>> >>>> > > > > > > > >      (numbers in the sheet are bytes decompressed per
>>> >>>> > > > > > > > >      second).
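
[For illustration, how I read the two metrics in the sheet; this helper is
hypothetical and assumes "bytes decompressed per second" counts decoded
output bytes.]

    def sheet_metrics(encoded_bytes, decoded_bytes, n_values, decode_seconds):
        # Compression column: bytes needed to encode each value.
        bytes_per_value = encoded_bytes / n_values
        # Speed column: decoded output bytes produced per second.
        throughput = decoded_bytes / decode_seconds
        return bytes_per_value, throughput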
>>> >>>> > > > > > > > >
>>> >>>> > > > > > > > > As next steps we will:
>>> >>>> > > > > > > > >
>>> >>>> > > > > > > > >    - Get the numbers for compression on top of byte
>>> >>>> > > > > > > > >      stream split.
>>> >>>> > > > > > > > >    - Evaluate the algorithm over a few more datasets.
>>> >>>> > > > > > > > >    - Have an implementation in the arrow-parquet repo.
>>> >>>> > > > > > > > >
>>> >>>> > > > > > > > > Looking forward to feedback from the community.
>>> >>>> > > > > > > > >
>>> >>>> > > > > > > > > Best
>>> >>>> > > > > > > > > Prateek and Dhirhan
>>> >>>> > > > > > > > >
>>> >>>> > > > > > > >
>>> >>>> > > > > > > >
>>> >>>> > > > > > > >
>>> >>>> > > > > > > >
>>> >>>> > > > > > >
>>> >>>> > > > > >
>>> >>>> > > > >
>>> >>>> > > > >
>>> >>>> > > > >
>>> >>>> > > > >
>>> >>>> > > >
>>> >>>> > >
>>> >>>> >
>>> >>>>
>>> >>>
>>>
>>
