I started creating some test .parquet files using the C++ implementation
[1] to help other implementations test compatibility. I currently have a
really simple file here [2] if anyone wants to play around with it.

I plan to expand it over the next few days (e.g. to include exceptions,
f32, etc.).
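
If it helps, here is a minimal C++ sketch of how one might sanity-check the
file with the existing Arrow/Parquet reader (this assumes the file extracted
from [2] is named single_f64_ALP.parquet; error handling is kept to the bare
minimum):

#include <iostream>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/exception.h>

int main() {
  // Open the extracted test file (name assumed) and read it back via Arrow.
  std::shared_ptr<arrow::io::ReadableFile> infile;
  PARQUET_ASSIGN_OR_THROW(
      infile, arrow::io::ReadableFile::Open("single_f64_ALP.parquet"));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(
      infile, arrow::default_memory_pool(), &reader));

  std::shared_ptr<arrow::Table> table;
  PARQUET_THROW_NOT_OK(reader->ReadTable(&table));
  std::cout << "rows: " << table->num_rows() << "\n"
            << table->ToString() << std::endl;
  return 0;
}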

Finally, I am pretty stoked to report that Kosta Tarasov has started
working on a Rust implementation [3].

Andrew

[1]: https://github.com/apache/arrow/pull/49154
[2]: https://github.com/user-attachments/files/25097841/single_f64_ALP.zip
[3]: https://github.com/apache/arrow-rs/issues/8748

On Tue, Feb 3, 2026 at 6:03 PM Micah Kornfield <[email protected]>
wrote:

> Thanks Prateek,
>
>> *Alp Version in Header*
>> Micah's point
>> `I'd suggest modelling fundamentally different algorithms with the
>> top level encoding enum, and have versioning/control bits where we believe
>> we will likely want to iterate`
>> Yes, this is exactly what is happening here: an enum to add AlpRd (and more),
>> and version control to iterate on anything fundamental (like a layout change
>> of the metadata).
>
>
> I think there might still be some misalignment on which enum we are
> talking about.  I was referring to an enum in parquet.thrift for encoding
> <https://github.com/apache/parquet-format/blame/master/src/main/thrift/parquet.thrift#L630>
>  [1].
> Specifically, for the current ALP proposal that would be ALP_PSEUDODECIMAL.  For
> AlpRd we would add ALP_RD (or something similar).  If for some reason we need
> to evolve ALP_PSEUDODECIMAL, we can always add ALP_PSEUDO_DECIMAL_2.  Given the
> other extension points, I hope we wouldn't need ALP_PSEUDO_DECIMAL_2 (or at
> least we get a better name) for a long time.
>
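> Purely for illustration, here is a rough sketch of how that could surface on
> the C++ side, which roughly mirrors the thrift enum in parquet/types.h; the
> names and numeric ids below are placeholders I made up, nothing is decided:
>
> // Sketch only: hypothetical additions to the Encoding enum; existing values
> // elided, ALP names/ids below are illustrative placeholders.
> struct Encoding {
>   enum type {
>     PLAIN = 0,
>     // ... existing encodings ...
>     BYTE_STREAM_SPLIT = 9,
>     ALP_PSEUDODECIMAL = 12,  // proposed: the current ALP scheme
>     ALP_RD = 13,             // proposed: ALP for "real" doubles
>   };
> };
>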
> I think the main technical trade-off here is how the thrift parsers across
> bindings handle unknown enum values (e.g. if these encodings show up in the
> stats in the footer, would that make the whole file unreadable, and do we care).
>
> Cheers,
> Micah
>
>
> [1]
> https://github.com/apache/parquet-format/blame/master/src/main/thrift/parquet.thrift#L630
>
>
>
>
> On Tue, Feb 3, 2026 at 2:40 PM PRATEEK GAUR <[email protected]> wrote:
>
>> Hi Antoine and Micah,
>>
>> Apologies for getting back on this a little late.
>>
>> *Running Perf tests*
>> @Antoine Pitrou <[email protected]> were you able to figure out the
>> steps
>> to run the tests?
>>
>> *Sampling Frequency*
>> We want to pick the right parameters to encode the values with; that is
>> what the spec requires.
>> From the implementation perspective you raise a good point, which did cross
>> my mind, that 'practically we don't want to sample for every page' for
>> performance reasons. My thinking is that each engine is free to decide this:
>> 1) Do it at page level if data is changing often
>> 2) Provide fixed presets via config
>> 3) Do it once per encoder (per column, as Micah pointed out)
>> 4) Provide a fancy config (a rough sketch of such a knob is below).
>> I agree with Micah here: 'I think maybe we should clarify that the
>> encoding algorithm in a specification is a recommendation'.
>>
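>> Purely as an illustration of options 2 and 4 above (none of these names
>> exist anywhere today, they are made up for the sketch), the knob could look
>> something like this on the writer side:
>>
>> #include <cstdint>
>>
>> // Illustrative only: one way an engine could expose the sampling choice.
>> enum class AlpSamplingPolicy {
>>   kPerPage,         // resample for every data page
>>   kPerColumnChunk,  // sample once per column encoder, reuse across pages
>>   kFixedPreset,     // caller supplies exponent/factor directly, no sampling
>> };
>>
>> struct AlpWriterOptions {
>>   AlpSamplingPolicy policy = AlpSamplingPolicy::kPerPage;
>>   int32_t sample_size = 1024;  // values inspected per sampling run
>>   int8_t preset_exponent = 0;  // only used with kFixedPreset
>>   int8_t preset_factor = 0;    // only used with kFixedPreset
>> };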
>>
>> *Number of values to pick for sampling*
>> 'why does this have to be a constant'
>> You are right, it doesn't need to be a constant, hence the spec doesn't
>> say so.
>> Everything that is segregated out into the AlpConstants file (C++ impl) can
>> be changed by configuration.
>> (Did I get your question right @Antoine Pitrou <[email protected]> ?)
>>
>> *Alp Version in Header*
>> Micah's point
>> `I'd suggest modelling fundamentally different algorithms with the
>> top level encoding enum, and have versioning/control bits where we believe
>> we will likely want to iterate`
>> Yes, this is exactly what is happening here: an enum to add AlpRd (and more),
>> and version control to iterate on anything fundamental (like a layout change
>> of the metadata).
>>
>> Re-stating the points so that scrolling is not needed.
>> 1.  Change of integer encoding (see debate in this thread on FOR vs
>> Delta).  We also want to get fast lanes in at some point.  I think an
>> enum inside the page for versioning makes sense, as it allows for easier
>> composability.
>> 2.  Change in structure to exceptions (e.g. G-ALP).  G-ALP comes with some
>> trade-offs, so it is not clear if it is something everyone would want to
>> enable.
>> 3.  Offset indexes to vectors
>> 4.  Different floating point encoding algorithms  (e.g. AlpRd + AlpBSS)
>>
>> For example, at this point I do see that both bit-packing of exceptions, as
>> pointed out by Antoine, and plain ub2 encoding should work equally well. I
>> was running some benchmarks here and was getting read speeds of around
>> 20 GB/s for 10-bit packed values, which is quite good (Graviton 3 processor).
>> But for simplicity (and the fact that we won't get really large vectors), my
>> inclination is towards ub2 values; I just want to keep the path open to
>> possibly have bit-packing as an option as the workloads evolve to a level we
>> haven't thought about yet (a rough sketch of the ub2 layout is below). We can
>> always add a new encoding, but I don't see a path to having 20+ top-level
>> encodings. Again, I don't have a very strong bias towards keeping it or
>> removing it, but my thinking right now is let's have the flexibility and make
>> it easier for people to evolve this encoding behind a version byte.
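>>
>> To make the ub2 option concrete, here is a rough sketch of the exception
>> layout I have in mind (illustrative struct and function names only, not the
>> spec wording):
>>
>> #include <cstdint>
>> #include <vector>
>>
>> // Illustrative only: exceptions kept as plain 16-bit ("ub2") positions plus
>> // the original IEEE-754 values, patched back over the decoded vector.
>> struct AlpExceptions {
>>   std::vector<uint16_t> positions;  // index within the vector (<= 65535)
>>   std::vector<double> values;       // original values ALP could not encode
>> };
>>
>> inline void PatchExceptions(const AlpExceptions& exc, double* decoded) {
>>   for (size_t i = 0; i < exc.positions.size(); ++i) {
>>     decoded[exc.positions[i]] = exc.values[i];  // overwrite lossy slots
>>   }
>> }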
>>
>>
>> Best
>> Prateek
>> PS : I probably have addressed all open threads raised by Antoine and
>> Micah.
>> (but I may have missed something)
>>
>>
>>
>>
>>
>> On Thu, Jan 29, 2026 at 10:52 PM Micah Kornfield <[email protected]>
>> wrote:
>>
>> > Hi Antoine and Prateek,
>> >
>> > > > In Parquet C++, encoding happens at page level, and I would guess
>> other
>> > > > implementations do something similar. Sampling cannot reasonably be
>> > done
>> > > > at a higher level, that would require invasive architectural
>> changes.
>> >
>> >
>> > At least in C++ I believe we cache the encoder at a column level [1] (I
>> > believe the same is true for Java). I think this implies one could sample
>> > for the first page more or less, and then resample on some regular cadence
>> > (or if compression degrades too much)?  In general, the exact approach used
>> > in implementations can vary here, so I think maybe we should clarify that
>> > the encoding algorithm in a specification is a recommendation, and
>> > concentrate the implementation discussion in the individual language
>> > binding PRs.
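>> >
>> > To make that concrete, here is a rough sketch of what such a resampling
>> > guard might look like around a cached column encoder (none of these names
>> > exist in Parquet C++ today; purely illustrative):
>> >
>> > #include <cstdint>
>> >
>> > // Illustrative only: cache sampled ALP parameters per column writer and
>> > // resample on a cadence or when compression degrades, instead of per page.
>> > struct AlpParameters { int8_t exponent = 0; int8_t factor = 0; };
>> >
>> > struct AlpEncoderState {
>> >   AlpParameters params;                // chosen by the last sampling run
>> >   int pages_since_sample = 0;          // pages encoded with cached params
>> >   double last_compression_ratio = 1.0; // encoded size / raw size, last page
>> > };
>> >
>> > bool ShouldResample(const AlpEncoderState& s, int cadence, double max_ratio) {
>> >   return s.pages_since_sample >= cadence ||
>> >          s.last_compression_ratio > max_ratio;
>> > }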
>> >
>> > > 1) Addition of AlpRd (if that takes time to get in for real doubles). (This
>> > > is easily addable with the provided AlgorithmEnum)
>> > > 2) Addition of AlpRd + Modified BSS (suggested by Azim). (This is easily
>> > > addable with the provided AlgorithmEnum)
>> > > 3) Addition of a different encoding. (This is easily addable with the
>> > > provided AlgorithmEnum)
>> > > AlgorithmEnum).
>> >
>> >
>> > In short, I'd suggest modelling fundamentally different algorithms with
>> the
>> > top level encoding enum, and have versioning/control bits where we
>> believe
>> > we will likely want to iterate.
>> >
>> > Long version, based on my understanding of current open design points
>> the
>> > following extensions have been discussed:
>> >
>> > 1.  Change of integer encoding (see debate in this thread on FOR vs
>> > Delta).  We also want to get fastlanes in at some point.  I think an
>> > enum inside the page for versioning makes sense, as it allows for easier
>> > composability.
>> > 2.  Change in structure to exceptions (e.g. G-ALP).  G-ALP comes with
>> some
>> > trade-offs, so it is not clear if it is something everyone would want to
>> > enable.
>> > 3.  Offset indexes to vectors
>> > 4.  Different floating point encoding algorithms  (e.g. AlpRd + AlpBSS)
>> >
>> > 1 is almost definitely an extension point and I think it pays to version
>> > this within the page (if we decide to allow delta at some point, and then
>> > do fastlanes, we start having a lot of ALP-pseudodecimal enums, which might
>> > not add a lot of value).  This would get worse if we multiply this by any
>> > future exception layouts.  3 feels like we can just make a decision now on
>> > whether to add them; if not added now, they can probably be added if we get
>> > to cascaded encodings, or if really needed as a separate enum (but I'm open
>> > to a control bit here).
>> >
>> > For 4, these feel like they should be a top-level enum for versioning, as
>> > they are fundamentally different algorithms with different code to decode
>> > them.  The trade-off here is that the current writers need some more
>> > invasive changes to have better fallback or choice of initial encoder (but
>> > this needs to happen anyway).  We can easily have multiple page encodings
>> > within a row group from a reader's perspective.
>> >
>> > For any other future changes that don't fall into these buckets, a new
>> > top-level enum is always an escape hatch (and we can learn from our
>> > mistakes). Another take is that it is just one byte per page, which would
>> > take a while to add up even at Parquet's scale, but on balance I'd lean
>> > towards YAGNI.
>> >
>> > Cheers,
>> > Micah
>> >
>> > [1]
>> >
>> >
>> https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_writer.cc#L1290
>> >
>> >
>> > On Tue, Jan 27, 2026 at 9:47 AM PRATEEK GAUR <[email protected]>
>> wrote:
>> >
>> > > Hi Antoine,
>> > >
>> > > Thanks. Replying inline to the discussion threads.
>> > >
>> > >
>> > > > > https://github.com/apache/parquet-testing/pull/100/files.
>> > > > > You might have to checkout that branch to be able to run the
>> > > benchmarks.
>> > > >
>> > >
>> > > Yeah, just cherry-picking the branch should work. Let me know if it ends
>> > > up being some kind of permission issue.
>> > >
>> > >
>> > > >
>> > > > Hmm, I'll take a look at that later and come back.
>> > > >
>> > > > > * the encoding of integers uses a custom framing with
>> > > frame-of-reference
>> > > > >> encoding inside it, but Parquet implementations already implement
>> > > > >> DELTA_BINARY_PACKED which should have similar characteristics, so
>> > why
>> > > > >> not reuse that?
>> > > >
>> > > > Thanks. At worst it seems it would need one more bit per integer than
>> > > > the proposed FOR scheme (it might need less if delta encoding allows
>> > > > reducing entropy among the encoded integers). I'm not sure how that
>> > > > makes it "fail".
>> > > >
>> > >
>> > >
>> > > My bad. I wanted to add another point.
>> > > 1) So I looked at a few other examples and in most cases FOR ended up
>> > > using fewer bits per value. My thinking is that for 1M (1B values), it
>> > > will add up.
>> > > 2) Another major point was that the bit-unpacker + FOR was much, much
>> > > faster than the DeltaBitPack decoder. BitUnpacker + FOR was easily
>> > > SIMDable but DeltaBitUnpack not so much; I vaguely remember the difference
>> > > being around 2x (see the sketch below). I can try and compute the numbers
>> > > again, but today the entire bottleneck in the decoder shows up in
>> > > bit-unpacking.
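>> > >
>> > > For context on why the FOR path vectorizes so well, here is a minimal
>> > > sketch of the decode loop (illustrative names, not the actual decoder):
>> > > every output element only needs the shared base added back, with no
>> > > dependency on the previous element the way a delta decode has.
>> > >
>> > > #include <cstddef>
>> > > #include <cstdint>
>> > >
>> > > // Frame-of-reference decode after bit-unpacking: a dependency-free add
>> > > // loop that compilers auto-vectorize easily; delta decode needs out[i-1].
>> > > void ForDecode(const int64_t* unpacked, int64_t base, int64_t* out,
>> > >                size_t n) {
>> > >   for (size_t i = 0; i < n; ++i) {
>> > >     out[i] = unpacked[i] + base;
>> > >   }
>> > > }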
>> > >
>> > >
>> > > > >> * there are a lot of fields in the headers that look a bit
>> > superfluous
>> > > > >> (though of course those bits are relatively cheap); for example,
>> why
>> > > > >> have a format "version" while we could define a new encoding for
>> > > > >> incompatible evolutions?
>> > > > >
>> > > > > We discussed this point in the Spec
>> > > > > <
>> > > >
>> > >
>> >
>> https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit
>> >
>> > > > > document a lot and have gravitated
>> > > > > towards a versioning scheme for easier evolution.
>> > > >
>> > >
>> > > > * Meta-answer:
>> > > >
>> > > > I don't see any such discussion in the comments and the document
>> > doesn't
>> > > > state any rationale for it.
>> > > >
>> > > > As a reference, Python PEPs include a discussion of rejected
>> > > > alternatives so that people don't ask the same questions over and
>> over
>> > > > (see
>> https://peps.python.org/pep-0810/#alternate-implementation-ideas
>> > > > for an example).
>> > > >
>> > > > * Actual answer:
>> > > >
>> > > > My problem with this is that it will complicate understanding and
>> > > > communicating the feature set supported by each Parquet
>> implementation.
>> > > > "ALP" will not be a single spec but an evolving one with its own
>> > version
>> > > > numbers. I'm not sure why that is better than adding an "ALP2" if we
>> > > > even want to evolve the spec.
>> > > >
>> > > > It will also complicate APIs that currently accept encoding numbers,
>> > > > such as
>> > > >
>> > > >
>> > >
>> >
>> https://github.com/apache/arrow/blob/5272a68c134deea82040f2f29bb6257ad7b52be0/cpp/src/parquet/properties.h#L221
>> > > >
>> > > > We need a clear explanation of what makes ALP so special that it
>> > *needs*
>> > > > its own versioning scheme. Otherwise we should remove it IMHO.
>> > > >
>> > >
>> > > Based on the feedback I think it is more of a clarification/API
>> > > complication discussion.
>> > > I was thinking along these lines.
>> > >
>> > > Things that I thought would have to change, but then we added enough
>> > > flexibility in the metadata structs to allow for these:
>> > > 1) Addition of AlpRd (if that takes time to get in for real doubles). (This
>> > > is easily addable with the provided AlgorithmEnum)
>> > > 2) Addition of AlpRd + Modified BSS (suggested by Azim). (This is easily
>> > > addable with the provided AlgorithmEnum)
>> > > 3) Addition of a different encoding. (This is easily addable with the
>> > > provided AlgorithmEnum)
>> > >
>> > > Things that I think might need a version field:
>> > > 1) Right now the FOR and ALP structs, which are interleaved, store a
>> > > certain set of fields; given there have been lots of incoming suggestions,
>> > > my thinking was that having a version will allow us to easily change them
>> > > with minor changes to the code in all writers/readers (maybe that is a
>> > > very big task). A rough sketch of what I mean is below.
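>> > >
>> > > Just to illustrate what I mean by a version field (this is only a sketch
>> > > with made-up field names, not the proposed wire layout):
>> > >
>> > > #include <cstdint>
>> > >
>> > > // Illustrative only: a small header ahead of the ALP page metadata, so
>> > > // the interleaved FOR/ALP structs can change layout behind a version bump
>> > > // without minting a new top-level Encoding value each time.
>> > > struct AlpPageHeaderSketch {
>> > >   uint8_t version;       // bumped on incompatible metadata layout changes
>> > >   uint8_t algorithm;     // e.g. pseudodecimal vs. an AlpRd-style variant
>> > >   uint16_t num_vectors;  // vectors that follow in this page
>> > > };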
>> > >
>> > >
>> > > >
>> > > > >> * the "Total Encoded Element count" duplicates information
>> already
>> > in
>> > > > >> the page header, with risks of inconsistent values (including
>> > security
>> > > > >> risks that require specific care in implementations)
>> > > > >
>> > > > > 'num_elements' : let me re-read and get back on this.
>> > > >
>> > > > Ok, Micah's answer cleared that up.
>> > > >
>> > >
>> > > Cool.
>> > >
>> > >
>> > > >
>> > > > >> * what happens if the number of exceptions is above 65535? their
>> > > indices
>> > > > >> are coded as 16-bit uints. How about using the same encoding as
>> for
>> > > > >> bit-packed integers (e.g. DELTA_BINARY_PACKED), which will also
>> > remove
>> > > > >> the 65535 limitation.
>> > > > >
>> > > > > So, I don't see a need for a vector larger than 65535. With vectors
>> > > > > that large, the overhead of the metadata is small and you might as
>> > > > > well break it into multiple vectors. I'm gonna give it some more
>> > > > > thought and get back.
>> > > >
>> > > > Ah, yes, I think you're right. Let's forget this :-)
>> > > >
>> > >
>> > >
>> > > :).
>> > >
>> > >
>> > > >
>> > > > > The sampling process should be statistically significant. It should
>> > > > > pick enough values and not be biased towards just the values at the
>> > > > > start. The ALP algorithm ensures that, and tries to strike a balance
>> > > > > between the cycles spent on finding the right parameters and the risk
>> > > > > of picking incorrect parameters.
>> > > > >
>> > > > > For a very large row group we can change the constant and have it
>> > > > > select over a larger data set.
>> > > > > Or one can do it at page level too. Happy to discuss more on this.
>> > > >
>> > > > In Parquet C++, encoding happens at page level, and I would guess
>> other
>> > > > implementations do something similar. Sampling cannot reasonably be
>> > done
>> > > > at a higher level, that would require invasive architectural
>> changes.
>> > > >
>> > > > But this raises another question: why does this have to be a
>> constant?
>> > > > When encoding a page, you know the population size (i.e. the actual
>> > > > number of values that will be encoded). You don't need to estimate
>> it
>> > > > with a constant.
>> > > >
>> > >
>> > > Some engines might decide to write large pages (a large number of values
>> > > in a page), ~2B. One might not want to select all the values for the
>> > > sampling process, to keep sampling a slightly lightweight operation.
>> > > Also, as @Andrew Lamb <[email protected]> pointed out, at times engines
>> > > already know what the best values of the hyperparameters are and might
>> > > want to provide them as input via an API that sends in pre-populated
>> > > presets. This can happen across multiple pages or multiple row groups
>> > > after having calculated things for the first batch of data. This might
>> > > further make sampling more lightweight. I would gravitate towards
>> > > flexibility here :).
>> > >
>> > >
>> > > Thanks for this discussion
>> > > Prateek
>> > >
>> >
>>
>
