Hello,

Apologies if this question was already asked, but should we care about FLOAT16 for ALP? Can FLOAT + ALP be more efficient than, for example, FLOAT16 + BYTE_STREAM_SPLIT + LZ4?
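For anyone weighing that trade-off: BYTE_STREAM_SPLIT does no compression by itself; it transposes the bytes of each value so that a downstream codec such as LZ4 sees more homogeneous streams. A minimal sketch of the transform for FLOAT16 values (plain Python; the function names are mine, not from any implementation in the thread):

```python
import struct

def byte_stream_split_f16(values):
    """Split little-endian FLOAT16 values into two byte streams:
    stream 0 holds every value's low byte, stream 1 every high byte."""
    raw = [struct.pack('<e', v) for v in values]  # '<e' = IEEE 754 half precision
    return bytes(b[0] for b in raw), bytes(b[1] for b in raw)

def byte_stream_merge_f16(s0, s1):
    """Inverse transform: re-interleave the two streams and decode."""
    return [struct.unpack('<e', bytes(pair))[0] for pair in zip(s0, s1)]
```

The high-byte stream (sign, exponent, leading mantissa bits) tends to be repetitive on real-world data, which is what makes the subsequent LZ4 pass effective; how that compares with FLOAT + ALP is exactly the open question above.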

Regards

Antoine.


On 30/04/2026 at 01:10, PRATEEK GAUR wrote:
Thanks Andrew and Micah for the review feedback on the two PRs:
1) (c++ arrow repo) https://github.com/apache/arrow/pull/48345/changes
2) (parquet-format repo) https://github.com/apache/parquet-format/pull/557

I have addressed all comments (unless I missed something) on the two PRs.

Best
Prateek

On Sat, Apr 25, 2026 at 1:08 PM PRATEEK GAUR <[email protected]> wrote:

Thanks Andrew and Micah.

`fair amount of feedback on at least the implementations`
For the C++ implementation I have already started addressing the feedback; I should be done with that by Monday/Tuesday.
I think Vinoo has also been making good progress on the Java implementation.

Best
Prateek

On Sat, Apr 25, 2026 at 12:55 PM Andrew Lamb <[email protected]>
wrote:

Got it. Thank you for the clarification -- I will try to look into the
spec and the Rust implementation [1] this coming week.

[1]: https://github.com/apache/arrow-rs/pull/9372

On Sat, Apr 25, 2026 at 12:01 PM Micah Kornfield <[email protected]>
wrote:

Hi Andrew,
I think there is a fair amount of feedback on at least the
implementations; typically we've waited until they are close to
mergeable before a final vote. Otherwise I agree we are very close.

-Micah

On Saturday, April 25, 2026, Andrew Lamb <[email protected]> wrote:

Thanks Prateek,

From this thread it looks to me like we are ready to start a
vote to explicitly accept ALP into Parquet.

Does anyone know of a reason we should postpone it for longer?
Perhaps someone needs some more time to review?

Andrew



On Wed, Apr 22, 2026 at 1:00 PM PRATEEK GAUR <[email protected]>
wrote:

Hi team,



Hope everyone is doing well. I got a chance to work through all the
remaining feedback and update the spec doc. Here are the new artifacts:

1) Spec document :
https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit

2) Spec document in parquet format repo :
https://github.com/apache/parquet-format/pull/557

3) ALP implementation in arrow c++ repo :
https://github.com/apache/arrow/pull/48345/changes

4) ALP implementation in parquet-java repo : Work for Vinoo and Julien
  https://github.com/apache/parquet-java/pull/3397

5) PR with test and benchmarking artifacts in parquet-testing repo :
https://github.com/apache/parquet-testing/pull/100


And


    - Go : Arnav just submitted an in-progress implementation in Go:
    https://github.com/apache/arrow-go/pull/704 (I haven't started
    looking at it yet)
    - Rust : I remember Andrew mentioned that this work is also in
    progress (so 4 languages!)


*Arrow C++ implementation*



The PR is out and was also used by Antoine for the numbers
reported here. Micah and Konstantin have given one round of feedback,
and I'm addressing it today. Please note that the default
optimization flag for compiling is -O2, not -O3. I got around a 70%
performance improvement in decoding speed when using the -O3 flag.
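For anyone trying to reproduce the -O2 vs -O3 comparison, assuming a standard out-of-source CMake build of Arrow C++ (the exact invocation below is my assumption, not taken from the PR):

```shell
# Assumed build commands: force -O3 instead of the default optimization level.
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-O3"
cmake --build .
```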



*Parquet-MR Java implementation (working with Vinoo and Julien) and Cross-Language testing*


    Let me know if you have any questions or feedback.



Now pasting some performance numbers:


   Table 1: C++ ALP Double Decode — Spotify Columns (Graviton 3, ARM Neoverse V1)

   ┌──────────────────┬──────────────┬──────────────┬─────────┐
   │ Column           │  -O2 (MB/s)  │  -O3 (MB/s)  │ Speedup │
   ├──────────────────┼──────────────┼──────────────┼─────────┤
   │ valence          │     3,155    │     5,523    │  1.75x  │
   │ danceability     │     3,233    │     5,685    │  1.76x  │
   │ energy           │     3,197    │     5,652    │  1.77x  │
   │ loudness         │     3,186    │     5,473    │  1.72x  │
   └──────────────────┴──────────────┴──────────────┴─────────┘




On Wed, Feb 25, 2026 at 9:49 AM PRATEEK GAUR <[email protected]>
wrote:

@Micah Kornfield <[email protected]> : Got it.

@Andrew Lamb <[email protected]>


Do you think it would be good to start moving the spec development
into markdown format, in preparation for finalizing it?


Yes, I'll update the numbers for some of the examples I have in the
spec based on the updated header size. Then we should be good to go
for the markdown format.

Thanks everyone!




On Tue, Feb 17, 2026 at 7:28 PM PRATEEK GAUR <[email protected]>
wrote:

Hi team,

1) Andrew

    - Thanks for working on test files. My PR did add all the test files
    I used to benchmark on datasets; maybe we can club them together.
    It will also aid cross-language testing.
    - Kosta Tarasov is working on the Rust implementation. This is great.
    Thanks!


2) Antoine

    - Thanks a lot for reporting the numbers on AMD. Looks like you are
    getting 8x the decoding performance of BSS. This is amazing!
    - Thanks for acknowledging the sampling design.
    - I agree with you on FastLanes. In some crude experiments I didn't
    get a good perf benefit from it on Graviton3 (but maybe there was
    something wrong with my implementation).
    - Locking in the 16-bit exception encoding for the spec in this case.
    - Awesome, I think we have solved all open questions minus the
    version byte :) (will get back on this soon).


3) Micah

    - FastLanes : The current spec does allow for using FastLanes via
    the configurable enum value for layout. We should be able to inject
    any layout in the current design.


Working on resolving all remaining open comments on the spec this week.

Best
Prateek


On Tue, Feb 10, 2026 at 3:37 AM Steve Loughran <[email protected]>
wrote:

On Sun, 8 Feb 2026 at 18:12, Micah Kornfield <[email protected]>
wrote:



It looks like the actual issue described for ORC in the paper is that
it has multiple sub-encodings in a batch. This is different than the
design proposed here, where there is still a fixed encoding per page
in Parquet. Given reasonably sized pages, I don't think branch
misprediction should be a big issue for new encodings. I agree that
we should be conservative in general when adding new encodings.


+1






