Hi Antoine,

I think I can perhaps add some more details to Prateek's answer.

> > * the "Total Encoded Element count" duplicates information already in
> > the page header, with risks of inconsistent values (including security
> > risks that require specific care in implementations)
>
> 'num_elements' : let me re-read and get back on this.

This is confusing, but the data page header only contains the total number of
values "including null" [1], so it effectively holds the number of repetition
and definition levels, not the count of encoded values. We expect these to be
inconsistent (DataPageV2 and statistics have null counts, but generally I don't
think these are passed to decoders anyway). The only other way of retrieving
this count is to decode all repetition/definition levels up front to get the
true count. DELTA_BINARY_PACKED also stores the total number of values [2], and
BYTE_STREAM_SPLIT actually required a spec update [3] to state that no padding
is allowed, because otherwise there would be no other way to get this number.
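
To make that concrete, here is a rough sketch of that "only other way"
(illustrative only; `def_levels` and `max_def_level` are just placeholder names
for what a decoder has after reading the levels for a page):

```
# Illustrative only: the page header's num_values counts every logical
# value, nulls included, so it equals the number of definition levels.
# The count of physically encoded values has to be derived from the
# definition levels themselves.
def encoded_element_count(def_levels, max_def_level):
    # Only entries whose definition level equals the column's maximum
    # definition level have a value stored in the data page; anything
    # lower is a null (or empty list) represented purely by the levels.
    return sum(1 for d in def_levels if d == max_def_level)

# Example: 5 logical values, 2 nulls -> 3 encoded values.
assert encoded_element_count([1, 0, 1, 1, 0], max_def_level=1) == 3
```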

> > * the C++ implementation has a `kSamplerRowgroupSize` constant, which
> > worries me; row group size can vary *a lot* between workloads (from
> > thousands to millions of elements typically), the sampling process
> > should not depend on that.

The last time I reviewed the C++ implementation I think we were actually
recalculating these values per page, so I think this might just be a naming
issue (as long as the implementation doesn't change). Page sizes could get
larger than this constant, but probably not by too much?

Cheers,
Micah

[1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L679
[2] https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5
[3] https://github.com/apache/parquet-format/commit/230711fbfd8d3399cce935a4f39d1be7b6ad5ad5

On Mon, Jan 26, 2026 at 7:15 PM PRATEEK GAUR <[email protected]> wrote:

> Thanks Andrew for building momentum.
>
> Hi Antoine,
>
> Replies to your questions are inline.
>
> On Mon, Jan 26, 2026 at 2:45 AM Antoine Pitrou <[email protected]> wrote:
>
> > Hey all,
> >
> > Thanks Prateek and Dhirhan for submitting this as it's clear you've been
> > putting quite a bit of work into it. IMHO, the ALP encoding looks very
> > promising as an addition to Parquet format.
> >
> > That said, I have a few technical concerns:
> >
> > * I cannot seem to run the C++ benchmarks because of the git submodule
> > configuration. It may be easier to fix but I'm looking for guidance here
> > :-)
> >
> > ```
> > $ LANG=C git submodule update
> > fatal: transport 'file' not allowed
> > fatal: Fetched in submodule path 'submodules/parquet-testing', but it
> > did not contain 66dfde8b2a569e7cbc8e998153e8dd6f2b36f940. Direct
> > fetching of that commit failed.
> > ```
>
> I think that is because the dataset branch hasn't been merged in yet.
> The files are in this pull request:
> https://github.com/apache/parquet-testing/pull/100/files
> You might have to check out that branch to be able to run the benchmarks.
>
> > * the encoding of integers uses a custom framing with frame-of-reference
> > encoding inside it, but Parquet implementations already implement
> > DELTA_BINARY_PACKED which should have similar characteristics, so why
> > not reuse that?
>
> I did look at DELTA_BINARY_PACKED. Unless I understood it wrong, it didn't
> fit the needs.
>
> My understanding of DELTA_BINARY_PACKED is this:
>
>     delta[i] = value[i] - value[i-1]
>
> Pasting an example of why it may fail.
>
> Input: [19.99, 5.49, 149.00, 0.99, 299.99, ...]
> ALP encoded prices: [1999, 549, 14900, 99, 29999]
>
> ══════════════════════════════════════════════════════════════════
> DELTA_BINARY_PACKED (Two Levels)
> ══════════════════════════════════════════════════════════════════
>
> STEP 1: Adjacent differences
>   first_value = 1999 (stored in header)
>
>   delta[0] = 549   - 1999  = -1450
>   delta[1] = 14900 - 549   = +14351
>   delta[2] = 99    - 14900 = -14801
>   delta[3] = 29999 - 99    = +29900
>
>   Adjacent deltas = [-1450, +14351, -14801, +29900]
>
> STEP 2: Frame of reference on the deltas
>   min_delta = -14801 (stored as zigzag ULEB128)
>
>   adjusted[0] = -1450  - (-14801) = 13351
>   adjusted[1] = +14351 - (-14801) = 29152
>   adjusted[2] = -14801 - (-14801) = 0
>   adjusted[3] = +29900 - (-14801) = 44701
>
>   Adjusted deltas = [13351, 29152, 0, 44701]
>
>   Range: 0 to 44701 → bit_width = ceil(log2(44702)) = 16 bits
>
> ══════════════════════════════════════════════════════════════════
> FOR (Single Level)
> ══════════════════════════════════════════════════════════════════
>
>   min = 99 (stored as frame_of_reference)
>
>   delta[0] = 1999  - 99 = 1900
>   delta[1] = 549   - 99 = 450
>   delta[2] = 14900 - 99 = 14801
>   delta[3] = 99    - 99 = 0
>   delta[4] = 29999 - 99 = 29900
>
>   Deltas = [1900, 450, 14801, 0, 29900]
>
>   Range: 0 to 29900 → bit_width = ceil(log2(29901)) = 15 bits
>
> I think for floating point we need the min, not adjacent subtractions, which
> would also include negative values.
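>
> The same arithmetic as a quick Python sketch, purely illustrative (not taken
> from any implementation):
>
> ```
> import math
>
> def bit_width(max_offset):
>     # bits needed to represent values in [0, max_offset]
>     return max(1, math.ceil(math.log2(max_offset + 1)))
>
> values = [1999, 549, 14900, 99, 29999]  # ALP-encoded prices
>
> # DELTA_BINARY_PACKED style: adjacent deltas, then FOR on the deltas
> deltas = [b - a for a, b in zip(values, values[1:])]
> adjusted = [d - min(deltas) for d in deltas]
> print(bit_width(max(adjusted)))   # 16
>
> # Plain frame of reference on the values (what ALP does)
> offsets = [v - min(values) for v in values]
> print(bit_width(max(offsets)))    # 15
> ```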
>
> > * there are a lot of fields in the headers that look a bit superfluous
> > (though of course those bits are relatively cheap); for example, why
> > have a format "version" while we could define a new encoding for
> > incompatible evolutions?
>
> We discussed this point a lot in the Spec document
> <https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit>
> and have gravitated towards a versioning scheme for easier evolution.
>
> > * the "Total Encoded Element count" duplicates information already in
> > the page header, with risks of inconsistent values (including security
> > risks that require specific care in implementations)
>
> 'num_elements' : let me re-read and get back on this.
>
> > * what happens if the number of exceptions is above 65535? their indices
> > are coded as 16-bit uints. How about using the same encoding as for
> > bit-packed integers (e.g. DELTA_BINARY_PACKED), which will also remove
> > the 65535 limitation.
>
> So, I don't see a need for a vector larger than 65535. With vectors that
> large, the overhead of metadata is small and you might as well break it into
> multiple vectors. I'm going to give it some more thought and get back.
>
> > * the C++ implementation has a `kSamplerRowgroupSize` constant, which
> > worries me; row group size can vary *a lot* between workloads (from
> > thousands to millions of elements typically), the sampling process
> > should not depend on that.
>
> The sampling process should be statistically significant. It should pick
> enough values and not be biased towards just the values at the start. The ALP
> algorithm ensures that and tries to balance not spending too many cycles on
> parameter selection against picking incorrect parameters.
>
> For a very large row group we can change the constant and have it select
> over a larger data set. Or one can do it at the page level too. Happy to
> discuss more on this.
>
> > Regards
> >
> > Antoine.
> >
> > On 16/10/2025 at 23:47, PRATEEK GAUR wrote:
> > > Hi team,
> > >
> > > We spent some time evaluating ALP compression and decompression compared
> > > to other encoding alternatives like CHIMP/GORILLA and compression
> > > techniques like SNAPPY/LZ4/ZSTD. We presented these numbers to the
> > > community members on October 15th in the biweekly Parquet meeting. (I
> > > can't seem to access the recording, so please let me know what access
> > > rules I need to be able to view it.)
> > >
> > > We did this evaluation over some datasets pointed to by the ALP paper
> > > and some pointed to by the Parquet community.
> > >
> > > The results are available in the following document:
> > > https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg
> > >
> > > Based on the numbers we see:
> > >
> > > - ALP is comparable to ZSTD (level=1) in terms of compression ratio and
> > >   much better compared to other schemes (numbers in the sheet are bytes
> > >   needed to encode each value).
> > > - ALP does quite well in terms of decompression speed (numbers in the
> > >   sheet are bytes decompressed per second).
> > >
> > > As next steps we will:
> > >
> > > - Get the numbers for compression on top of byte stream split.
> > > - Evaluate the algorithm over a few more datasets.
> > > - Have an implementation in the arrow-parquet repo.
> > >
> > > Looking forward to feedback from the community.
> > >
> > > Best,
> > > Prateek and Dhirhan
