Hi Antoine and Micah,

Apologies for getting back to this a little late.
*Running Perf tests*

@Antoine Pitrou <[email protected]>, were you able to figure out the steps to run the tests?

*Sampling Frequency*

We want to pick the right parameters to encode the values with; that is what the spec requires. From the implementation perspective you raise a good point, which did cross my mind, that practically we don't want to sample for every page, for performance reasons. My thinking is that each engine is free to decide this:

1) Do it at page level if data is changing often
2) Provide fixed presets via config
3) Do it once per encoder (per column, as Micah pointed out)
4) Provide a fancy config (a rough sketch of such a config is at the end of this message)

I agree with Micah here: "I think maybe we should clarify that the encoding algorithm in a specification is a recommendation."

*Number of values to pick for sampling*

"Why does this have to be a constant?" You are right, it doesn't need to be a constant, hence the spec doesn't say so. Everything that is segregated out in the AlpConstants file (C++ impl) can be changed by configuration. (Did I get your question right, @Antoine Pitrou <[email protected]>?)

*ALP Version in Header*

Micah's point: "I'd suggest modelling fundamentally different algorithms with the top level encoding enum, and have versioning/control bits where we believe we will likely want to iterate."

Yes, this is exactly what is happening here: an enum to add AlpRd (and more), and version control to iterate on anything fundamental (like a layout change of the metadata). Re-stating the points so that scrolling is not needed:

1. Change of integer encoding (see the debate in this thread on FOR vs Delta). We also want to get FastLanes in at some point. I think an enum inside the page for versioning makes sense, as it allows for easier composability.
2. Change in the structure of exceptions (e.g. G-ALP). G-ALP comes with some trade-offs, so it is not clear if it is something everyone would want to enable.
3. Offset indexes to vectors.
4. Different floating-point encoding algorithms (e.g. AlpRd + AlpBSS).

For example, at this point I do see that both bit-packing of exceptions, as pointed out by Antoine, and plain ub2 encoding should work equally well. I was running some benchmarks here and was getting read speeds of around 20 GB/s for 10-bit packed values, which is good enough (Graviton 3 processor). For simplicity (and the fact that we won't get really large vectors) my inclination is towards ub2 values, but I want to keep the path open to possibly have bit-packing as an option as workloads evolve to a level we haven't thought about yet. We can always add a new encoding, but I don't see a path to having 20+ top-level encodings. Again, I don't have a very strong bias towards keeping it or removing it, but my thinking right now is: let's have the flexibility and make it easier for people to evolve this encoding behind a version byte (a rough sketch of what I mean is just below).
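To make that last point concrete, here is a rough C++-flavoured sketch of the kind of per-page metadata shape I have in mind. The names, field widths and comments are mine, purely for illustration; this is not the layout the spec text defines.

    #include <cstdint>
    #include <vector>

    // Illustrative only: one possible shape for the per-page ALP metadata.
    enum class AlpAlgorithm : std::uint8_t {
      kAlp = 0,    // pseudo-decimal ALP
      kAlpRd = 1,  // ALP-RD for "real doubles" (future addition)
      // further algorithms (e.g. AlpRd + modified BSS) would extend this enum
    };

    struct AlpPageHeader {
      AlpAlgorithm algorithm;  // top-level algorithm choice
      std::uint8_t version;    // bumped only for incompatible layout changes,
                               // e.g. FOR -> Delta for the integer stream or a
                               // different exception structure such as G-ALP
    };

    struct AlpVectorExceptions {
      // Plain "ub2" layout: one little-endian uint16 position per exception.
      // A later version could switch this to a bit-packed layout if much
      // larger vectors ever become a real workload.
      std::vector<std::uint16_t> positions;  // at most 65535 values per vector
      std::vector<double> values;            // the original, un-encodable doubles
    };

The point is simply that the algorithm choice and the version byte stay orthogonal: new algorithms extend the enum, while the version byte covers incompatible tweaks within one algorithm (FOR vs Delta, a different exception structure, and so on).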
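And for the *Sampling Frequency* options above, this is roughly the kind of writer-side knob I am imagining. Again, every name and default here is hypothetical, just to show that the choice can live entirely in engine configuration rather than in the spec.

    #include <cstdint>

    // Hypothetical writer-side sampling knobs mirroring options 1)-4) above.
    enum class AlpSamplingMode : std::uint8_t {
      kPerPage,        // re-sample the ALP parameters for every page
      kOncePerColumn,  // sample once when the column encoder is created
      kFixedPreset,    // skip sampling; the engine supplies the parameters
    };

    struct AlpSamplingConfig {
      AlpSamplingMode mode = AlpSamplingMode::kOncePerColumn;
      // Upper bound on values inspected per sampling run; a config knob,
      // not a spec-level constant (cf. the AlpConstants file in the C++ impl).
      std::int32_t max_sample_values = 1024;
      // Used only with kFixedPreset: exponent/factor the engine already knows,
      // e.g. computed from the first row group it wrote.
      std::uint8_t preset_exponent = 0;
      std::uint8_t preset_factor = 0;
    };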
Best,
Prateek

PS: I probably have addressed all the open threads raised by Antoine and Micah, but I may have missed something.

On Thu, Jan 29, 2026 at 10:52 PM Micah Kornfield <[email protected]> wrote:

> Hi Antoine and Prateek,
>
> > In Parquet C++, encoding happens at page level, and I would guess other
> > implementations do something similar. Sampling cannot reasonably be done
> > at a higher level, that would require invasive architectural changes.
>
> At least in C++ I believe we cache the encoder at a column level [1] (I
> believe the same is true for Java). I think this implies one could sample
> for the first page more or less, and then resample on some regular cadence
> (or if compression degrades too much)? In general, the exact approach used
> in implementations can vary here, so I think maybe we should clarify that
> the encoding algorithm in a specification is a recommendation, and we
> concentrate the discussion on implementation in the individual language
> binding PRs.
>
> > 1) Addition of AlpRd (if that takes time to get in for real doubles). (This
> > is easily addable with the provided AlgorithmEnum)
> > 2) Addition of AlpRd + Modified BSS (suggested by Azim) (This is easily
> > addable with the provided AlgorithmEnum)
> > 3) Addition of a different encoding (This is easily addable with the
> > provided AlgorithmEnum)
>
> In short, I'd suggest modelling fundamentally different algorithms with the
> top-level encoding enum, and have versioning/control bits where we believe
> we will likely want to iterate.
>
> Long version: based on my understanding of the current open design points,
> the following extensions have been discussed:
>
> 1. Change of integer encoding (see the debate in this thread on FOR vs
> Delta). We also want to get FastLanes in at some point. I think an
> enum inside the page for versioning makes sense, as it allows for easier
> composability.
> 2. Change in the structure of exceptions (e.g. G-ALP). G-ALP comes with
> some trade-offs, so it is not clear if it is something everyone would want
> to enable.
> 3. Offset indexes to vectors.
> 4. Different floating-point encoding algorithms (e.g. AlpRd + AlpBSS).
>
> 1 is almost definitely an extension point and I think it pays to version
> this within the page (if we decide to allow Delta at some point, and then
> do FastLanes, we start having a lot of ALP-pseudo-decimal enums, which
> might not add a lot of value). This would get worse if we multiply this by
> any future exception layouts. 3 feels like we can just make a decision now
> on whether to add them; if not added now, they can probably be added if we
> get to cascaded encodings, or if really needed as a separate enum (but I'm
> open to a control bit here).
>
> For 4, these feel like they should be a top-level enum for versioning; they
> are fundamentally different algorithms with different code to decode them.
> The trade-off here is that the current writers need some more invasive
> changes to have better fallback or choice of initial encoder (but this
> needs to happen anyway). We can easily have multiple page encodings within
> a row group from a reader's perspective.
>
> For any other future changes that don't fall into these buckets, a new
> top-level enum is always an escape hatch (and we can learn from our
> mistakes). Another take is that it is just one byte per page, which would
> take a while to add up even at Parquet's scale, but on balance I'd lean
> towards YAGNI.
>
> Cheers,
> Micah
>
> [1]
> https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_writer.cc#L1290
>
> On Tue, Jan 27, 2026 at 9:47 AM PRATEEK GAUR <[email protected]> wrote:
>
> > Hi Antoine,
> >
> > Thanks. Replying inline to the discussion threads.
> >
> > > > https://github.com/apache/parquet-testing/pull/100/files.
> > > > You might have to checkout that branch to be able to run the
> > > > benchmarks.
> >
> > Yeah, just cherry-picking the branch should work. Let me know if it ends
> > up being some kind of permission issue.
> >
> > > Hmm, I'll take a look at that later and come back.
> > > >> * the encoding of integers uses a custom framing with frame-of-reference
> > > >> encoding inside it, but Parquet implementations already implement
> > > >> DELTA_BINARY_PACKED which should have similar characteristics, so why
> > > >> not reuse that?
> >
> > > Thanks. At worst it seems it would need one more bit per integer than
> > > the proposed FOR scheme (it might need less if delta encoding allows
> > > reducing entropy among the encoded integers). I'm not sure how that
> > > makes it "fail".
> >
> > My bad, I wanted to add another point.
> > 1) I looked at a few other examples and in most cases FOR ended up using
> > fewer bits per value. My thinking is that for 1M (or 1B) values, it will
> > add up.
> > 2) Another major point was that the bit-unpacker + FOR was much faster
> > than the DeltaBitPack decoder. BitUnpacker+FOR is easily SIMDable but
> > DeltaBitUnpack is not so much; I vaguely remember the difference being
> > around 2x. I can try and compute the numbers again, but today the entire
> > bottleneck in the decoder shows up in bit-unpacking.
> >
> > > >> * there are a lot of fields in the headers that look a bit superfluous
> > > >> (though of course those bits are relatively cheap); for example, why
> > > >> have a format "version" while we could define a new encoding for
> > > >> incompatible evolutions?
> >
> > > > We discussed this point in the Spec document
> > > > <https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit>
> > > > a lot and have gravitated towards a versioning scheme for easier
> > > > evolution.
> >
> > > * Meta-answer:
> > >
> > > I don't see any such discussion in the comments and the document doesn't
> > > state any rationale for it.
> > >
> > > As a reference, Python PEPs include a discussion of rejected
> > > alternatives so that people don't ask the same questions over and over
> > > (see https://peps.python.org/pep-0810/#alternate-implementation-ideas
> > > for an example).
> > >
> > > * Actual answer:
> > >
> > > My problem with this is that it will complicate understanding and
> > > communicating the feature set supported by each Parquet implementation.
> > > "ALP" will not be a single spec but an evolving one with its own version
> > > numbers. I'm not sure why that is better than adding an "ALP2" if we even
> > > want to evolve the spec.
> > >
> > > It will also complicate APIs that currently accept encoding numbers,
> > > such as
> > > https://github.com/apache/arrow/blob/5272a68c134deea82040f2f29bb6257ad7b52be0/cpp/src/parquet/properties.h#L221
> > >
> > > We need a clear explanation of what makes ALP so special that it *needs*
> > > its own versioning scheme. Otherwise we should remove it IMHO.
> >
> > Based on the feedback I think it is more of a clarification/API
> > complication discussion. I was thinking along these lines.
> >
> > Things that I thought would have to change, but then we added enough
> > flexibility in the metadata structs to allow for these:
> > 1) Addition of AlpRd (if that takes time to get in for real doubles). (This
> > is easily addable with the provided AlgorithmEnum)
> > 2) Addition of AlpRd + Modified BSS (suggested by Azim) (This is easily
> > addable with the provided AlgorithmEnum)
> > 3) Addition of a different encoding (This is easily addable with the
> > provided AlgorithmEnum)
> >
> > Things that I think might need a version field:
> > 1) Right now the FOR and ALP structs, which are interleaved, store a
> > certain set of fields. Given there have been lots of incoming suggestions,
> > my thinking was that having a version will allow us to easily change them
> > with minor changes to the code in all writers/readers (maybe that is a
> > very big task).
> >
> > > >> * the "Total Encoded Element count" duplicates information already in
> > > >> the page header, with risks of inconsistent values (including security
> > > >> risks that require specific care in implementations)
> >
> > > > 'num_elements': let me re-read and get back on this.
> >
> > > Ok, Micah's answer cleared that up.
> >
> > Cool.
> >
> > > >> * what happens if the number of exceptions is above 65535? their indices
> > > >> are coded as 16-bit uints. How about using the same encoding as for
> > > >> bit-packed integers (e.g. DELTA_BINARY_PACKED), which will also remove
> > > >> the 65535 limitation.
> >
> > > > So, I don't see a need for a vector larger than 65535. With vectors
> > > > that large, the overhead of metadata is small and you might as well
> > > > break it into multiple vectors. I'm gonna give it some more thought
> > > > and get back.
> >
> > > Ah, yes, I think you're right. Let's forget this :-)
> >
> > :)
> >
> > > > The sampling process should be statistically significant. It should
> > > > pick enough values and not be biased towards just the values at the
> > > > start. The ALP algorithm ensures that, and tries to balance the cycles
> > > > spent on getting the right parameters against the risk of picking
> > > > incorrect parameters.
> > > >
> > > > For a very large row group we can change the constant and have it
> > > > select over a larger data set. Or one can do it at page level too.
> > > > Happy to discuss more on this.
> >
> > > In Parquet C++, encoding happens at page level, and I would guess other
> > > implementations do something similar. Sampling cannot reasonably be done
> > > at a higher level, that would require invasive architectural changes.
> > >
> > > But this raises another question: why does this have to be a constant?
> > > When encoding a page, you know the population size (i.e. the actual
> > > number of values that will be encoded). You don't need to estimate it
> > > with a constant.
> >
> > Some engines might decide to write large pages (a large number of values
> > in a page, ~2B). One might not want to use all the values for the sampling
> > process, to keep sampling a slightly lightweight operation. Also, as
> > @Andrew Lamb <[email protected]> pointed out, at times engines already
> > know the best hyperparameter values and might want to provide them as
> > input via an API that sends in pre-populated presets. This can happen
> > across multiple pages or multiple row groups after having calculated
> > things for the first batch of data. This might further make sampling more
> > lightweight. I would gravitate towards flexibility here :).
> >
> > Thanks for this discussion,
> > Prateek
