Re: [Parquet] ALP Encoding for Floating point data

Micah Kornfield Tue, 05 May 2026 13:50:07 -0700

Hi Antoine,

> Apologies if the question was already asked, but should we care about
> FLOAT16 for ALP? Can FLOAT + ALP be more efficient than FLOAT16 +
> BYTE_STREAM_SPLIT + LZ4 for example?


It was.  We thought we could defer it for the following reasons:
1.  It's not clear there are a lot of easy reference datasets to test its
effectiveness.  It does look like there might be one or two on huggingface
(e.g. https://huggingface.co/datasets/kikitora/curdie).
2.  It seemed likely that FLOAT16 is mostly used for values that do not
reduce cleanly to short decimal values.
3.  It could be added as an extension later if needed.
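
To illustrate point 2, here is a minimal sketch of the decimal round-trip
idea behind ALP. It is a simplification for illustration only, not the
algorithm from the spec or from any of the implementations: the helper names
`decimal_exponent` and `to_float16` are hypothetical, and real ALP also uses
a second factor, per-vector sampling, and exceptions. The sketch finds the
smallest power of ten that turns a value into an integer that decodes back
exactly, then does the same for that value after rounding it to half
precision.

```python
import struct


def decimal_exponent(value: float, max_exponent: int = 18):
    """Return the smallest e with round(value * 10**e) / 10**e == value.

    Returns None if no exponent up to max_exponent round-trips exactly.
    Simplified illustration of the ALP idea, not the spec algorithm.
    """
    for e in range(max_exponent + 1):
        scale = 10.0 ** e
        encoded = round(value * scale)  # candidate integer encoding
        if encoded / scale == value:    # must decode bit-exactly
            return e
    return None


def to_float16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


if __name__ == "__main__":
    for v in [0.1, 3.25, 12.5]:
        h = to_float16(v)
        print(f"{v!r}: exponent {decimal_exponent(v)}; "
              f"as float16 {h!r}: exponent {decimal_exponent(h)}")
```

A value such as 3.25 is exactly representable in half precision and keeps
its short decimal form, while 0.1 rounded to half precision needs many more
decimal digits, so the integer to store becomes wider than the original
16 bits.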

Cheers,
Micah


On Tue, May 5, 2026 at 1:40 PM Antoine Pitrou <[email protected]> wrote:

>
> Hello,
>
> Apologies if the question was already asked, but should we care about
> FLOAT16 for ALP? Can FLOAT + ALP be more efficient than FLOAT16 +
> BYTE_STREAM_SPLIT + LZ4 for example?
>
> Regards
>
> Antoine.
>
>
> Le 30/04/2026 à 01:10, PRATEEK GAUR a écrit :
> > Thanks Andrew and Micah for review feedback on the two PRs
> > 1) (c++ arrow repo) https://github.com/apache/arrow/pull/48345/changes
> > 2) (parquet-format repo) https://github.com/apache/parquet-format/pull/557
> >
> > I have addressed all (unless I missed something) comments on the two PRs.
> >
> > Best
> > Prateek
> >
> > On Sat, Apr 25, 2026 at 1:08 PM PRATEEK GAUR <[email protected]> wrote:
> >
> >> Thanks Andrew and Micah.
> >>
> >> `fair amount of feedback on at least the implementations`
> >> For the c++ I have already started addressing the feedback, I should be
> >> done with that Monday/Tuesday.
> >> I think Vinoo too has been making good progress on the Java implementation.
> >>
> >> Best
> >> Prateek
> >>
> >> On Sat, Apr 25, 2026 at 12:55 PM Andrew Lamb <[email protected]>
> >> wrote:
> >>
> >>> Got it. Thank you for the clarification -- I will try and look into the
> >>> spec and the Rust implementation[1] in this next week
> >>>
> >>> [1]: https://github.com/apache/arrow-rs/pull/9372
> >>>
> >>> On Sat, Apr 25, 2026 at 12:01 PM Micah Kornfield <[email protected]>
> >>> wrote:
> >>>
> >>>> Hi Andrew,
> >>>> I think there is a fair amount of feedback on at least the
> >>>> implementations, typically I think we've waited till they are close to
> >>>> mergeable before a final vote.  Otherwise I agree we are very close.
> >>>>
> >>>> -Micah
> >>>>
> >>>> On Saturday, April 25, 2026, Andrew Lamb <[email protected]> wrote:
> >>>>
> >>>>> Thanks Prateek,
> >>>>>
> >>>>> From this content it looks to me like we are ready to start a
> >>>>> vote to explicitly accept ALP into Parquet.
> >>>>>
> >>>>> Does anyone know of a reason we should postpone it for longer?
> >>>>> Perhaps someone needs some more time to review?
> >>>>>
> >>>>> Andrew
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, Apr 22, 2026 at 1:00 PM PRATEEK GAUR <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi team,
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Hope everyone is doing well. I got a chance to work through all the
> >>>>>> remaining feedback and update the spec doc. Here are the new artifacts
> >>>>>>
> >>>>>> 1) Spec document :
> >>>>>> https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit
> >>>>>>
> >>>>>> 2) Spec document in parquet format repo :
> >>>>>> https://github.com/apache/parquet-format/pull/557
> >>>>>>
> >>>>>> 3) Alp implementation in arrow c++ repo :
> >>>>>> https://github.com/apache/arrow/pull/48345/changes
> >>>>>>
> >>>>>> 4) Alp implementation in parquet-java repo : Work for Vinoo and Julien
> >>>>>> https://github.com/apache/parquet-java/pull/3397
> >>>>>>
> >>>>>> 5) PR with test and benchmarking artifacts in parquet-testing repo :
> >>>>>> https://github.com/apache/parquet-testing/pull/100
> >>>>>>
> >>>>>>
> >>>>>> And
> >>>>>>
> >>>>>>
> >>>>>>     - Go : Arnav just submitted an in progress implementation in Go.
> >>>>>>     https://github.com/apache/arrow-go/pull/704 (I haven't started
> >>>>>>     looking at it yet)
> >>>>>>     - Rust : I remember Andrew mentioned that this work is also in
> >>>>>>     progress (So 4 languages!)
> >>>>>>
> >>>>>>
> >>>>>> *Arrow C++ implementation *
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> The PR is out and was also used by Antoine to report the numbers as
> >>>>>> reported here. Micah and Konstantin have given one round of feedback
> >>>>>> and I'm addressing it today. Please note that the default
> >>>>>> optimization flag for compiling is O2 and not O3. I got around 70%
> >>>>>> performance improvement in the decoding speed when using the O3 flag.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> *Parquet-MR Java implementation (working with Vinoo and Julien) and
> >>>>>> Cross Language testing*
> >>>>>>
> >>>>>>
> >>>>>>     Let me know if you have any questions or feedback.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Now pasting some performance numbers
> >>>>>>
> >>>>>>
> >>>>>>    Table 1: C++ ALP Double Decode, Spotify Columns (Graviton 3, ARM
> >>>>>>    Neoverse V1)
> >>>>>>
> >>>>>>    ┌──────────────────┬──────────────┬──────────────┬─────────┐
> >>>>>>    │ Column           │  -O2 (MB/s)  │  -O3 (MB/s)  │ Speedup │
> >>>>>>    ├──────────────────┼──────────────┼──────────────┼─────────┤
> >>>>>>    │ valence          │     3,155    │     5,523    │  1.75x  │
> >>>>>>    │ danceability     │     3,233    │     5,685    │  1.76x  │
> >>>>>>    │ energy           │     3,197    │     5,652    │  1.77x  │
> >>>>>>    │ loudness         │     3,186    │     5,473    │  1.72x  │
> >>>>>>    └──────────────────┴──────────────┴──────────────┴─────────┘
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Feb 25, 2026 at 9:49 AM PRATEEK GAUR <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> @Micah Kornfield <[email protected]> : Got it.
> >>>>>>>
> >>>>>>> @Andrew Lamb <[email protected]>
> >>>>>>>
> >>>>>>>
> >>>>>>>> Do you think it would be good to start moving the spec development
> >>>>>>>> into markdown format, in preparation for finalizing it?
> >>>>>>>>
> >>>>>>>
> >>>>>>> Yes I'll update the numbers for some of the examples I have in the
> >>>>>>> spec based on the updated header size. Then we should be good to go
> >>>>>>> for the markdown format.
> >>>>>>>
> >>>>>>> Thanks everyone!
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Andrew
> >>>>>>>>
> >>>>>>>> On Tue, Feb 17, 2026 at 7:28 PM PRATEEK GAUR <[email protected]>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi team,
> >>>>>>>>>
> >>>>>>>>> 1) Andrew
> >>>>>>>>>
> >>>>>>>>>     - Thanks for working on test files. My PR did add all the test
> >>>>>>>>>     files I used to benchmark on datasets. Maybe we can club them
> >>>>>>>>>     together. That will also aid cross-language testing.
> >>>>>>>>>     - Kosta Tarasov is working on the Rust implementation. This is
> >>>>>>>>>     great. Thanks!
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> 2) Antoine
> >>>>>>>>>
> >>>>>>>>>     - Thanks a lot for reporting the numbers on AMD. Looks like you
> >>>>>>>>>     are getting 8X the decoding performance of BSS. This is amazing!
> >>>>>>>>>     - Thanks for acknowledging the sampling design.
> >>>>>>>>>     - I agree with you on FastLanes. In some crude experiments I
> >>>>>>>>>     didn't get a good perf benefit from it on Graviton3 (but maybe
> >>>>>>>>>     there was something wrong with my implementation).
> >>>>>>>>>     - Locking the 16-bit exception encoding for the spec in this
> >>>>>>>>>     case.
> >>>>>>>>>     - Awesome, I think we have solved all open questions minus the
> >>>>>>>>>     version byte :). (Will get back on this soon.)
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> 3) Micah
> >>>>>>>>>
> >>>>>>>>>     - FastLanes : The current spec does allow for using FastLanes
> >>>>>>>>>     with the configurable enum value for layout. We should be able
> >>>>>>>>>     to inject any layout in the current design.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Working on resolving all remaining open comments on the spec this
> >>>>>>>>> week.
> >>>>>>>>>
> >>>>>>>>> Best
> >>>>>>>>> Prateek
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, Feb 10, 2026 at 3:37 AM Steve Loughran <[email protected]>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> On Sun, 8 Feb 2026 at 18:12, Micah Kornfield <[email protected]>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> It looks like the actual issue described for ORC in the paper is
> >>>>>>>>>>> that it has multiple sub-encodings in a batch.  This is different
> >>>>>>>>>>> than the design proposed here, where there is still a fixed
> >>>>>>>>>>> encoding per page in Parquet.  Given reasonably sized pages I
> >>>>>>>>>>> don't think branch misprediction should be a big issue for new
> >>>>>>>>>>> encodings.  I agree that we should be conservative in general
> >>>>>>>>>>> about adding new encodings.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>> +1
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >
>
>
>
