I started creating some test .parquet files using the C++ implementation [1] to help other implementations test compatibility. I currently have a really simple file here [2] if anyone wants to play around with it.
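If it helps anyone get started, here is a minimal sketch of a round-trip check using the Parquet C++ Arrow reader (the file name is assumed from the zip in [2], and an ALP-aware build is of course needed for the decode itself):

    #include <iostream>
    #include <memory>

    #include <arrow/io/file.h>
    #include <arrow/memory_pool.h>
    #include <arrow/table.h>
    #include <parquet/arrow/reader.h>
    #include <parquet/exception.h>

    int main() {
      // Open the test file (name assumed from the attached zip in [2]).
      std::shared_ptr<arrow::io::ReadableFile> infile =
          arrow::io::ReadableFile::Open("single_f64_ALP.parquet").ValueOrDie();

      // Decode through the Arrow-facing reader; an ALP-aware build would
      // transparently decode the ALP-encoded f64 column here.
      std::unique_ptr<parquet::arrow::FileReader> reader;
      PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(
          infile, arrow::default_memory_pool(), &reader));

      std::shared_ptr<arrow::Table> table;
      PARQUET_THROW_NOT_OK(reader->ReadTable(&table));

      // Print the decoded values for comparison against expected output.
      std::cout << table->ToString() << std::endl;
      return 0;
    }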
I plan to expand it over the next few days (e.g. to include exceptions, f32, etc.).

Finally, I am pretty stoked to report that Kosta Tarasov has started working on a Rust implementation [3].

Andrew

[1]: https://github.com/apache/arrow/pull/49154
[2]: https://github.com/user-attachments/files/25097841/single_f64_ALP.zip
[3]: https://github.com/apache/arrow-rs/issues/8748

On Tue, Feb 3, 2026 at 6:03 PM Micah Kornfield <[email protected]> wrote:

> Thanks Prateek,
>
>> *Alp Version in Header*
>> Micah's point:
>> `I'd suggest modelling fundamentally different algorithms with the
>> top level encoding enum, and have versioning/control bits where we
>> believe we will likely want to iterate`
>> Yes, this is exactly what is happening here: an enum to add AlpRd (and
>> more), and version control to iterate on anything fundamental (like a
>> layout change of the metadata).
>
> I think there might still be some misalignment on which enum we are
> talking about. I was referring to the encoding enum in parquet.thrift [1].
> Specifically, the current proposal for ALP would be ALP_PSEUDODECIMAL.
> For AlpRd we would add ALP_RD (or something similar). If for some reason
> we need to evolve ALP_PSEUDODECIMAL, we can always add
> ALP_PSEUDODECIMAL_2. Given the other extension points, I hope we wouldn't
> need ALP_PSEUDODECIMAL_2 (or at least we would get a better name) for a
> long time.
>
> I think the main technical trade-off here is how the thrift parsers
> across bindings handle unknown enum values (e.g. if these encodings show
> up in the footer in stats, would that make the whole file unreadable,
> and do we care?).
>
> Cheers,
> Micah
>
> [1]
> https://github.com/apache/parquet-format/blame/master/src/main/thrift/parquet.thrift#L630
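To make the unknown-enum trade-off concrete, here is a rough sketch of a reader that degrades gracefully instead of rejecting the file (the enum values are made up; the real ones would come from parquet.thrift [1]):

    #include <cstdint>

    // Hypothetical numeric values, for illustration only; the actual
    // values would be assigned in parquet.thrift when standardized.
    enum class Encoding : int32_t {
      PLAIN = 0,
      DELTA_BINARY_PACKED = 5,
      ALP_PSEUDODECIMAL = 16,  // made-up value
      ALP_RD = 17,             // made-up value
    };

    // Thrift deserializes enums as raw int32s, so a reader can choose to
    // treat an unknown value as "skip this stats field" rather than
    // "fail the whole file".
    bool IsKnownEncoding(int32_t raw) {
      switch (static_cast<Encoding>(raw)) {
        case Encoding::PLAIN:
        case Encoding::DELTA_BINARY_PACKED:
        case Encoding::ALP_PSEUDODECIMAL:
        case Encoding::ALP_RD:
          return true;
        default:
          return false;
      }
    }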
> On Tue, Feb 3, 2026 at 2:40 PM PRATEEK GAUR <[email protected]> wrote:
>
>> Hi Antoine and Micah,
>>
>> Apologies for getting back on this a little late.
>>
>> *Running Perf tests*
>> @Antoine Pitrou were you able to figure out the steps to run the tests?
>>
>> *Sampling Frequency*
>> We want to pick the right parameters to encode the values with; that is
>> what the spec requires. From the implementation perspective, you raise
>> a good point that did cross my mind: practically, we don't want to
>> sample for every page, for performance reasons. My thinking is that
>> each engine is free to decide this:
>> 1) Do it at page level if data is changing often
>> 2) Provide fixed presets via config
>> 3) Do it once per encoder (per column, as Micah pointed out)
>> 4) Provide a fancy config.
>> I agree with Micah here: "I think we should maybe clarify that the
>> encoding algorithm in a specification is a recommendation."
>>
>> *Number of values to pick for sampling*
>> "why does this have to be a constant"
>> You are right, it doesn't need to be a constant, hence the spec doesn't
>> say so. Everything that is segregated out in the AlpConstants file
>> (C++ impl) can be changed by configuration. (Did I get your question
>> right, @Antoine Pitrou?)
>>
>> *Alp Version in Header*
>> Micah's point:
>> `I'd suggest modelling fundamentally different algorithms with the
>> top level encoding enum, and have versioning/control bits where we
>> believe we will likely want to iterate`
>> Yes, this is exactly what is happening here: an enum to add AlpRd (and
>> more), and version control to iterate on anything fundamental (like a
>> layout change of the metadata).
>>
>> Re-stating the points so that scrolling is not needed:
>> 1. Change of integer encoding (see debate in this thread on FOR vs
>> Delta). We also want to get FastLanes in at some point. I think an
>> enum inside the page for versioning makes sense, as it allows for
>> easier composability.
>> 2. Change in the structure of exceptions (e.g. G-ALP). G-ALP comes with
>> some trade-offs, so it is not clear if it is something everyone would
>> want to enable.
>> 3. Offset indexes to vectors
>> 4. Different floating point encoding algorithms (e.g. AlpRd + AlpBSS)
>>
>> For example, at this point I do see that both bit-packing of
>> exceptions, as pointed out by Antoine, and plain ub2 encoding should
>> work equally well. I was running some benchmarks here and was getting
>> read speeds of around 20 GB/s for 10-bit packed values, which is quite
>> good (Graviton 3 processor). But for simplicity (and the fact that we
>> won't get really large vectors), my inclination is towards ub2 values;
>> still, I want to keep the path open to possibly have bit-packing as an
>> option as workloads evolve to a level we haven't thought about yet. We
>> can always add a new encoding, but I don't see a path to having 20+
>> top-level encodings. Again, I don't have a very strong bias towards
>> keeping it or removing it, but my thinking right now is: let's have the
>> flexibility and make it easier for people to evolve this encoding
>> behind a version byte.
>>
>> Best
>> Prateek
>> PS: I have probably addressed all open threads raised by Antoine and
>> Micah (but I may have missed something).
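As a concrete picture of the ub2 option discussed above, here is a minimal sketch of patching exceptions back into a decoded vector (the layout and names are illustrative, not the proposed spec):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Patch exceptions back into a decoded ALP vector. Positions are
    // stored as plain u16 ("ub2") values, which is what caps a vector at
    // 65535 elements in this discussion.
    void PatchExceptions(std::vector<double>* decoded,
                         const std::vector<uint16_t>& positions,
                         const std::vector<double>& raw_values) {
      // One raw IEEE-754 value per recorded position.
      for (size_t i = 0; i < positions.size(); ++i) {
        (*decoded)[positions[i]] = raw_values[i];
      }
    }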
>> On Thu, Jan 29, 2026 at 10:52 PM Micah Kornfield <[email protected]> wrote:
>>
>> > Hi Antoine and Prateek,
>> >
>> > > > In Parquet C++, encoding happens at page level, and I would guess
>> > > > other implementations do something similar. Sampling cannot
>> > > > reasonably be done at a higher level, that would require invasive
>> > > > architectural changes.
>> >
>> > At least in C++ I believe we cache the encoder at a column level [1]
>> > (I believe the same is true for Java). I think this implies one could
>> > sample for the first page more or less, and then resample on some
>> > regular cadence (or if compression degrades too much)? In general,
>> > the exact approach used in implementations can vary here, so I think
>> > we should maybe clarify that the encoding algorithm in a
>> > specification is a recommendation, and concentrate the discussion of
>> > implementation in the individual language binding PRs.
>> >
>> > > 1) Addition of AlpRd (if that takes time to get in for real
>> > > doubles). (This is easily addable with the provided AlgorithmEnum)
>> > > 2) Addition of AlpRd + Modified BSS (suggested by Azim). (This is
>> > > easily addable with the provided AlgorithmEnum)
>> > > 3) Addition of a different encoding. (This is easily addable with
>> > > the provided AlgorithmEnum)
>> >
>> > In short, I'd suggest modelling fundamentally different algorithms
>> > with the top level encoding enum, and have versioning/control bits
>> > where we believe we will likely want to iterate.
>> >
>> > Long version: based on my understanding of the current open design
>> > points, the following extensions have been discussed:
>> >
>> > 1. Change of integer encoding (see debate in this thread on FOR vs
>> > Delta). We also want to get FastLanes in at some point. I think an
>> > enum inside the page for versioning makes sense, as it allows for
>> > easier composability.
>> > 2. Change in the structure of exceptions (e.g. G-ALP). G-ALP comes
>> > with some trade-offs, so it is not clear if it is something everyone
>> > would want to enable.
>> > 3. Offset indexes to vectors
>> > 4. Different floating point encoding algorithms (e.g. AlpRd + AlpBSS)
>> >
>> > 1 is almost definitely an extension point and I think it pays to
>> > version this within the page (if we decide to allow delta at some
>> > point, and then do FastLanes, we start having a lot of
>> > ALP-pseudodecimal enums, which might not add a lot of value). This
>> > would get worse if we multiply this by any future exception layouts.
>> > 3 feels like we can just make a decision now on whether to add them;
>> > if not added now, they can probably be added if we get to cascaded
>> > encodings, or if really needed as a separate enum (but I'm open to a
>> > control bit here).
>> >
>> > For 4, these feel like they should be a top-level enum for
>> > versioning; they are fundamentally different algorithms with
>> > different code to decode them. The trade-off here is that current
>> > writers need some more invasive changes to have better fallback or
>> > choice of initial encoder (but this needs to happen anyway). We can
>> > easily have multiple page encodings within a row group from a
>> > reader's perspective.
>> >
>> > For any other future changes that don't fall into these buckets, a
>> > new top-level enum is always an escape hatch (and we can learn from
>> > our mistakes). Another take is that it is just one byte per page,
>> > which would take a while to add up even at Parquet's scale, but on
>> > balance I'd lean towards YAGNI.
>> >
>> > Cheers,
>> > Micah
>> >
>> > [1]
>> > https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_writer.cc#L1290
>> >
>> > On Tue, Jan 27, 2026 at 9:47 AM PRATEEK GAUR <[email protected]> wrote:
>> >
>> > > Hi Antoine,
>> > >
>> > > Thanks. Replying inline to the discussion threads.
>> > >
>> > > > > https://github.com/apache/parquet-testing/pull/100/files.
>> > > > > You might have to checkout that branch to be able to run the
>> > > > > benchmarks.
>> > >
>> > > Yeah, just cherry-picking the branch should work. Let me know if it
>> > > ends up being some kind of permission issue.
>> > >
>> > > > Hmm, I'll take a look at that later and come back.
>> > > >
>> > > > >> * the encoding of integers uses a custom framing with
>> > > > >> frame-of-reference encoding inside it, but Parquet
>> > > > >> implementations already implement DELTA_BINARY_PACKED which
>> > > > >> should have similar characteristics, so why not reuse that?
>> > > >
>> > > > Thanks. At worst it seems it would need one more bit per integer
>> > > > than the proposed FOR scheme (it might need less if delta
>> > > > encoding allows reducing entropy among the encoded integers). I'm
>> > > > not sure how that makes it "fail".
>> > >
>> > > My bad, I wanted to add another point.
>> > > 1) I looked at a few other examples, and in most cases FOR ended up
>> > > using fewer bits per value. My thinking is that for 1M-1B values,
>> > > it will add up.
>> > > 2) Another major point was that the bit-unpacker + FOR was much,
>> > > much faster than the DeltaBitPack decoder. BitUnpacker+FOR was
>> > > easily SIMDable, but DeltaBitUnpack not so much. I vaguely remember
>> > > the difference being around 2x. I can try and compute the numbers
>> > > again. But today the entire bottleneck in the decoder shows up in
>> > > bit-unpacking.
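To make the SIMD point above concrete, here is a rough sketch (not the actual decoder) of frame-of-reference reconstruction after bit-unpacking:

    #include <cstddef>
    #include <cstdint>

    // FOR reconstruction is one independent add per value, with no
    // loop-carried dependency, so compilers auto-vectorize it trivially.
    // A delta decoder needs a prefix sum (out[i] depends on out[i-1]),
    // which is why it is much harder to SIMD.
    void DecodeFor(const uint64_t* unpacked, size_t n,
                   uint64_t frame_of_reference, int64_t* out) {
      for (size_t i = 0; i < n; ++i) {
        out[i] = static_cast<int64_t>(unpacked[i] + frame_of_reference);
      }
    }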
>> > > > >> * there are a lot of fields in the headers that look a bit
>> > > > >> superfluous (though of course those bits are relatively
>> > > > >> cheap); for example, why have a format "version" while we
>> > > > >> could define a new encoding for incompatible evolutions?
>> > > > >
>> > > > > We discussed this point a lot in the spec document
>> > > > > <https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit>
>> > > > > and have gravitated towards a versioning scheme for easier
>> > > > > evolution.
>> > > >
>> > > > * Meta-answer:
>> > > >
>> > > > I don't see any such discussion in the comments, and the document
>> > > > doesn't state any rationale for it.
>> > > >
>> > > > As a reference, Python PEPs include a discussion of rejected
>> > > > alternatives so that people don't ask the same questions over and
>> > > > over (see
>> > > > https://peps.python.org/pep-0810/#alternate-implementation-ideas
>> > > > for an example).
>> > > >
>> > > > * Actual answer:
>> > > >
>> > > > My problem with this is that it will complicate understanding and
>> > > > communicating the feature set supported by each Parquet
>> > > > implementation. "ALP" will not be a single spec but an evolving
>> > > > one with its own version numbers. I'm not sure why that is better
>> > > > than adding an "ALP2" if we ever want to evolve the spec.
>> > > >
>> > > > It will also complicate APIs that currently accept encoding
>> > > > numbers, such as
>> > > > https://github.com/apache/arrow/blob/5272a68c134deea82040f2f29bb6257ad7b52be0/cpp/src/parquet/properties.h#L221
>> > > >
>> > > > We need a clear explanation of what makes ALP so special that it
>> > > > *needs* its own versioning scheme. Otherwise we should remove it,
>> > > > IMHO.
>> > >
>> > > Based on the feedback, I think this is more of a clarification/API
>> > > complication discussion. I was thinking along these lines.
>> > >
>> > > Things that I thought would have to change, but for which we added
>> > > enough flexibility in the metadata structs:
>> > > 1) Addition of AlpRd (if that takes time to get in for real
>> > > doubles). (This is easily addable with the provided AlgorithmEnum)
>> > > 2) Addition of AlpRd + Modified BSS (suggested by Azim). (This is
>> > > easily addable with the provided AlgorithmEnum)
>> > > 3) Addition of a different encoding. (This is easily addable with
>> > > the provided AlgorithmEnum)
>> > >
>> > > Things that I think might need a version field:
>> > > 1) Right now the interleaved FOR and ALP structs store a certain
>> > > set of fields. Given there have been lots of incoming suggestions,
>> > > my thinking was that having a version will allow us to easily
>> > > change them with minor changes to the code in all writers/readers
>> > > (maybe that is a very big task).
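For illustration only, the kind of versioned struct being discussed might look like this (every field name and width is invented; the actual proposed layout lives in the spec document):

    #include <cstdint>

    // Hypothetical per-vector metadata with a version byte up front, so
    // the layout behind it can evolve without a new top-level encoding.
    struct AlpVectorHeader {
      uint8_t version;             // bumped on incompatible layout changes
      uint8_t algorithm;           // AlgorithmEnum: pseudodecimal, AlpRd, ...
      uint8_t exponent;            // ALP 'e' parameter chosen by sampling
      uint8_t factor;              // ALP 'f' parameter chosen by sampling
      uint8_t bit_width;           // width of the FOR bit-packed integers
      uint64_t frame_of_reference; // FOR base subtracted before packing
      uint16_t exception_count;    // number of (u16 position, value) pairs
    };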
>> > > > >> * the "Total Encoded Element count" duplicates information
>> > > > >> already in the page header, with risks of inconsistent values
>> > > > >> (including security risks that require specific care in
>> > > > >> implementations)
>> > > > >
>> > > > > 'num_elements': let me re-read and get back on this.
>> > > >
>> > > > Ok, Micah's answer cleared that up.
>> > >
>> > > Cool.
>> > >
>> > > > >> * what happens if the number of exceptions is above 65535?
>> > > > >> Their indices are coded as 16-bit uints. How about using the
>> > > > >> same encoding as for bit-packed integers (e.g.
>> > > > >> DELTA_BINARY_PACKED), which will also remove the 65535
>> > > > >> limitation?
>> > > > >
>> > > > > So, I don't see a need for a vector larger than 65535. With
>> > > > > such large vectors the overhead of metadata is small, and you
>> > > > > might as well break it into multiple vectors. I'm gonna give it
>> > > > > some more thought and get back.
>> > > >
>> > > > Ah, yes, I think you're right. Let's forget this :-)
>> > >
>> > > :).
>> > >
>> > > > > The sampling process should be statistically significant. It
>> > > > > should pick enough values and not be biased towards just the
>> > > > > values at the start. The ALP algorithm ensures that and tries
>> > > > > to balance not spending too many cycles on getting the right
>> > > > > parameters against the risk of picking incorrect parameters.
>> > > > >
>> > > > > For a very large row group we can change the constant and have
>> > > > > it select over a larger data set. Or one can do it at page
>> > > > > level too. Happy to discuss more on this.
>> > > >
>> > > > In Parquet C++, encoding happens at page level, and I would guess
>> > > > other implementations do something similar. Sampling cannot
>> > > > reasonably be done at a higher level, that would require invasive
>> > > > architectural changes.
>> > > >
>> > > > But this raises another question: why does this have to be a
>> > > > constant? When encoding a page, you know the population size
>> > > > (i.e. the actual number of values that will be encoded). You
>> > > > don't need to estimate it with a constant.
>> > >
>> > > Some engines might decide to write large pages (a large number of
>> > > values in a page, ~2B). One might not want to use all the values in
>> > > the sampling process, to keep sampling a lightweight operation.
>> > > Also, as @Andrew Lamb pointed out, engines that already know the
>> > > best values of the hyperparameters might want to provide them as
>> > > input via an API that sends in pre-populated presets. This can
>> > > happen across multiple pages or multiple row groups, after having
>> > > calculated things for the first batch of data. This might make
>> > > sampling even more lightweight. I would gravitate towards
>> > > flexibility here :).
>> > >
>> > > Thanks for this discussion
>> > > Prateek
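As an illustration of the "engine decides" flexibility discussed above, here is a small sketch of a bounded, evenly spaced sampling pass (the parameter name is invented; it plays the role of the configurable constant):

    #include <cstddef>
    #include <vector>

    // Evenly spaced sampling over a page: bounded work even for huge
    // pages, and no bias towards values at the start of the page.
    std::vector<double> SampleForAlp(const double* values, size_t n,
                                     size_t max_samples) {
      // Small page: just use everything.
      if (n <= max_samples) return std::vector<double>(values, values + n);
      std::vector<double> sample;
      sample.reserve(max_samples);
      const size_t stride = n / max_samples;
      for (size_t i = 0; i < n && sample.size() < max_samples; i += stride) {
        sample.push_back(values[i]);
      }
      return sample;
    }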
