Hi Antoine and Prateek,

> > In Parquet C++, encoding happens at page level, and I would guess other
> > implementations do something similar. Sampling cannot reasonably be done
> > at a higher level, that would require invasive architectural changes.
At least in C++ I believe we cache the encoder at a column level [1] (I believe the same is true for Java). I think this implies one could sample for the first page more or less, and then resample on some regular cadence (or if compression degrades too much)? In general, the exact approach used in implementations can vary here, so I think maybe we should clarify that the encoding algorithm in the specification is a recommendation, and concentrate the discussion of implementation details in the individual language-binding PRs.

> 1) Addition of AlpRd (if that takes time to get in for read doubles). (This is easily addable with provided AlgorithmEnum)
> 2) Addition of AlpRd + Modified BSS (suggested by Azim) (This is easily addable with provided AlgorithmEnum)
> 3) Addition of different encoding (This is easily addable with provided AlgorithmEnum).

In short, I'd suggest modelling fundamentally different algorithms with the top-level encoding enum, and having versioning/control bits where we believe we will likely want to iterate.

Long version: based on my understanding of the current open design points, the following extensions have been discussed:

1. Change of integer encoding (see the debate in this thread on FOR vs Delta). We also want to get FastLanes in at some point. I think an enum inside the page for versioning makes sense, as it allows for easier composability.
2. Change in the structure of exceptions (e.g. G-ALP). G-ALP comes with some trade-offs, so it is not clear if it is something everyone would want to enable.
3. Offset indexes to vectors.
4. Different floating-point encoding algorithms (e.g. AlpRd + AlpBSS).

1 is almost definitely an extension point, and I think it pays to version this within the page (if we decide to allow delta at some point, and then do FastLanes, we start having a lot of ALP-pseudo-decimal enums that might not add a lot of value). This would get worse if we multiply this by any future exception layouts.

For 3, it feels like we can just make a decision now on whether to add them; if not added now, they can probably be added if we get to cascaded encodings, or if really needed as a separate enum (but I'm open to a control bit here).

For 4, these feel like they should be top-level enums: they are fundamentally different algorithms with different code to decode them. The trade-off here is that current writers need some more invasive changes to have better fallback or choice of initial encoder (but this needs to happen anyway). We can easily have multiple page encodings within a row group from a reader's perspective.

For any other future changes that don't fall into these buckets, a new top-level enum is always an escape hatch (and we can learn from our mistakes). Another take is that it is just one byte per page, which would take a while to add up even at Parquet's scale, but on balance I'd lean towards YAGNI.
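To make the column-level sampling point above a bit more concrete, here is a rough sketch (purely illustrative; the class and method names are made up and this is not the actual Parquet C++ encoder API) of sampling on the first page and then re-sampling on a cadence or when compression degrades:

```cpp
// Purely illustrative sketch (not the actual Parquet C++ API): a column-level
// encoder that samples ALP parameters from the first page it sees, then
// re-samples every k pages or when the achieved compression degrades.
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

struct AlpParams {
  uint8_t exponent = 0;
  uint8_t factor = 0;
};

class ColumnLevelAlpEncoder {
 public:
  explicit ColumnLevelAlpEncoder(size_t resample_every_n_pages = 16,
                                 double min_acceptable_ratio = 1.5)
      : resample_every_(resample_every_n_pages),
        min_ratio_(min_acceptable_ratio) {}

  // Called once per page by the column writer.
  void EncodePage(const std::vector<double>& values) {
    if (!params_ || pages_since_sample_ >= resample_every_ ||
        last_ratio_ < min_ratio_) {
      params_ = SampleParams(values);  // sample a subset of this page
      pages_since_sample_ = 0;
    }
    last_ratio_ = EncodeWithParams(values, *params_);
    ++pages_since_sample_;
  }

 private:
  // Placeholder: pick (exponent, factor) from a bounded sample of the page.
  AlpParams SampleParams(const std::vector<double>& /*values*/) { return {}; }
  // Placeholder: encode and return the achieved compression ratio.
  double EncodeWithParams(const std::vector<double>& /*values*/,
                          const AlpParams& /*p*/) { return 2.0; }

  size_t resample_every_;
  double min_ratio_;
  std::optional<AlpParams> params_;
  size_t pages_since_sample_ = 0;
  double last_ratio_ = 0.0;
};
```

The cadence and the degradation threshold would be purely implementation choices, in line with treating the spec's description of the algorithm as a recommendation.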
Cheers,
Micah

[1] https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_writer.cc#L1290

On Tue, Jan 27, 2026 at 9:47 AM PRATEEK GAUR <[email protected]> wrote:

> Hi Antoine,
>
> Thanks. Replying inline to the discussion threads.
>
> > > https://github.com/apache/parquet-testing/pull/100/files.
> > > You might have to checkout that branch to be able to run the benchmarks.
>
> Yeah just cherry-picking the branch should work. Let me know if it ends up
> being some kind of permission issue.
>
> > Hmm, I'll take a look at that later and come back.
>
> > >> * the encoding of integers uses a custom framing with frame-of-reference
> > >> encoding inside it, but Parquet implementations already implement
> > >> DELTA_BINARY_PACKED which should have similar characteristics, so why
> > >> not reuse that?
> >
> > Thanks. At worst it seems it would need one more bit per integer than
> > the proposed FOR scheme (it might need less if delta encoding allows
> > reducing entropy among the encoded integers). I'm not sure how that
> > makes it "fail".
>
> My bad. I wanted to add another point.
> 1) So I looked at a few other examples and in most cases FOR ended up using
> less bits per value. My thinking is that for 1M (1B values), it will add up.
> 2) Another major point was that the bit-unpacker + FOR was much, much faster
> than the DeltaBitPack decoder. BitUnpacker+FOR was easily SIMDable but
> DeltaBitUnpack not so much. I vaguely remember the difference being around
> 2x. I can try and compute the numbers again. But today the entire bottleneck
> in the decoder shows up in bit-unpacking.
>
> > >> * there are a lot of fields in the headers that look a bit superfluous
> > >> (though of course those bits are relatively cheap); for example, why
> > >> have a format "version" while we could define a new encoding for
> > >> incompatible evolutions?
> > >
> > > We discussed this point a lot in the Spec document
> > > <https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit>
> > > and have gravitated towards a versioning scheme for easier evolution.
> >
> > * Meta-answer:
> >
> > I don't see any such discussion in the comments and the document doesn't
> > state any rationale for it.
> >
> > As a reference, Python PEPs include a discussion of rejected
> > alternatives so that people don't ask the same questions over and over
> > (see https://peps.python.org/pep-0810/#alternate-implementation-ideas
> > for an example).
> >
> > * Actual answer:
> >
> > My problem with this is that it will complicate understanding and
> > communicating the feature set supported by each Parquet implementation.
> > "ALP" will not be a single spec but an evolving one with its own version
> > numbers. I'm not sure why that is better than adding an "ALP2" if we even
> > want to evolve the spec.
> >
> > It will also complicate APIs that currently accept encoding numbers,
> > such as
> > https://github.com/apache/arrow/blob/5272a68c134deea82040f2f29bb6257ad7b52be0/cpp/src/parquet/properties.h#L221
> >
> > We need a clear explanation of what makes ALP so special that it *needs*
> > its own versioning scheme. Otherwise we should remove it IMHO.
>
> Based on the feedback I think it is more of a clarification/API complication
> discussion. I was thinking along these lines.
>
> Things that I thought would have to change, but then we added enough
> flexibility in the metadata structs to allow for these:
> 1) Addition of AlpRd (if that takes time to get in for read doubles). (This
> is easily addable with provided AlgorithmEnum)
> 2) Addition of AlpRd + Modified BSS (suggested by Azim) (This is easily
> addable with provided AlgorithmEnum)
> 3) Addition of different encoding (This is easily addable with provided
> AlgorithmEnum).
>
> Things that I think might need a version field:
> 1) Right now the FOR and ALP structs, which are interleaved, store a certain
> set of fields. Given there have been lots of incoming suggestions, my
> thinking was that having a version will allow us to easily change them with
> minor changes to the code in all writers/readers (maybe that is a very big
> task).
>
> > >> * the "Total Encoded Element count" duplicates information already in
> > >> the page header, with risks of inconsistent values (including security
> > >> risks that require specific care in implementations)
> > >
> > > 'num_elements' : let me re-read and get back on this.
> >
> > Ok, Micah's answer cleared that up.
>
> Cool.
>
> > >> * what happens if the number of exceptions is above 65535? their indices
> > >> are coded as 16-bit uints. How about using the same encoding as for
> > >> bit-packed integers (e.g. DELTA_BINARY_PACKED), which will also remove
> > >> the 65535 limitation.
> > >
> > > So, I don't see a need for a vector larger than 65535. With vectors that
> > > large the overhead of metadata is small and you might as well break it
> > > into multiple vectors. I'm gonna give it some more thought and get back.
> >
> > Ah, yes, I think you're right. Let's forget this :-)
>
> :).
>
> > > The sampling process should be statistically significant. It should pick
> > > enough values and not have bias towards just the values towards the
> > > start. The ALP algorithm ensures that and tries to balance not spending
> > > too many cycles on getting the right parameters against the risk of
> > > picking incorrect parameters.
> > >
> > > For a very large row group we can change the constant and have it select
> > > over a larger data set. Or one can do it at page level too. Happy to
> > > discuss more on this.
> >
> > In Parquet C++, encoding happens at page level, and I would guess other
> > implementations do something similar. Sampling cannot reasonably be done
> > at a higher level, that would require invasive architectural changes.
> >
> > But this raises another question: why does this have to be a constant?
> > When encoding a page, you know the population size (i.e. the actual
> > number of values that will be encoded). You don't need to estimate it
> > with a constant.
>
> Some engines might decide to write large pages (a large number of values in
> a page), ~2B. One might not want to select all the values for the sampling
> process, to keep sampling a relatively lightweight operation. Also, as
> @Andrew Lamb <[email protected]> pointed out, at times engines already know
> what the best values of the hyperparameters are and might want to provide
> them as input via an API to send in pre-populated presets. This can happen
> across multiple pages or multiple row groups, after having calculated things
> for the first batch of data. This might further make sampling more
> lightweight. I would gravitate towards flexibility here :).
>
> Thanks for this discussion,
> Prateek
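P.S. To make the sampling-size point at the bottom concrete: a rough sketch (the cap, names, and preset mechanism here are made up, not from the ALP spec or any implementation) of keeping sampling bounded for very large pages while still allowing engine-provided presets:

```cpp
// Illustrative only: bounded sampling for ALP parameter selection, so the cost
// stays roughly constant even for very large pages, with an optional escape
// hatch for engines that already know good parameters.
#include <cstddef>
#include <optional>
#include <vector>

struct AlpParams {
  int exponent = 0;
  int factor = 0;
};

// Hypothetical cap: never look at more than this many values per page.
constexpr size_t kMaxSampleValues = 1024;

std::vector<double> SampleEvenly(const std::vector<double>& page_values) {
  std::vector<double> sample;
  const size_t n = page_values.size();
  // Ceiling division keeps the sample size at or below kMaxSampleValues.
  const size_t step = (n + kMaxSampleValues - 1) / kMaxSampleValues;
  for (size_t i = 0; i < n; i += step) {
    sample.push_back(page_values[i]);  // spread picks across the whole page
  }
  return sample;
}

AlpParams ChooseParams(const std::vector<double>& page_values,
                       const std::optional<AlpParams>& preset) {
  if (preset) {
    return *preset;  // engine-provided preset: skip sampling entirely
  }
  std::vector<double> sample = SampleEvenly(page_values);
  // ... run the ALP (exponent, factor) search over `sample` only ...
  (void)sample;
  return AlpParams{};
}
```

Something along these lines would keep the sampling cost roughly constant regardless of page size, which seems compatible with both the "use the known population size" and the "keep sampling lightweight" positions above.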
