Thanks Andrew for building momentum. Hi Antoine,
Replies to your questions are inline.

On Mon, Jan 26, 2026 at 2:45 AM Antoine Pitrou <[email protected]> wrote:
>
> Hey all,
>
> Thanks Prateek and Dhirhan for submitting this as it's clear you've been
> putting quite a bit of work into it. IMHO, the ALP encoding looks very
> promising as an addition to Parquet format.
>
> That said, I have a few technical concerns:
>
> * I cannot seem to run the C++ benchmarks because of the git submodule
> configuration. It may be easier to fix but I'm looking for guidance here
> :-)
>
> ```
> $ LANG=C git submodule update
> fatal: transport 'file' not allowed
> fatal: Fetched in submodule path 'submodules/parquet-testing', but it
> did not contain 66dfde8b2a569e7cbc8e998153e8dd6f2b36f940. Direct
> fetching of that commit failed.
> ```
>

I think that is because the dataset branch hasn't been merged in yet. The
files are in this pull request:
https://github.com/apache/parquet-testing/pull/100/files. You might have to
check out that branch to be able to run the benchmarks.

> * the encoding of integers uses a custom framing with frame-of-reference
> encoding inside it, but Parquet implementations already implement
> DELTA_BINARY_PACKED which should have similar characteristics, so why
> not reuse that?
>

I did look at DELTA_BINARY_PACKED. Unless I understood it wrong, it didn't
fit the needs here. My understanding is that DELTA_BINARY_PACKED encodes
adjacent differences:

    delta[i] = value[i] - value[i-1]

Here is an example of why that can fall short for ALP-encoded integers (a
small code sketch of the same comparison follows further below, after my
reply on the version field):

Input prices:       [19.99, 5.49, 149.00, 0.99, 299.99, ...]
ALP encoded prices: [1999, 549, 14900, 99, 29999]

═══════════════════════════════════════════════════════════════════════════
DELTA_BINARY_PACKED (Two Levels)
═══════════════════════════════════════════════════════════════════════════

STEP 1: Adjacent differences

    first_value = 1999 (stored in header)
    delta[0] =   549 -  1999 =  -1450
    delta[1] = 14900 -   549 = +14351
    delta[2] =    99 - 14900 = -14801
    delta[3] = 29999 -    99 = +29900

    Adjacent deltas = [-1450, +14351, -14801, +29900]

STEP 2: Frame of reference on the deltas

    min_delta = -14801 (stored as zigzag ULEB128)
    adjusted[0] =  -1450 - (-14801) = 13351
    adjusted[1] = +14351 - (-14801) = 29152
    adjusted[2] = -14801 - (-14801) =     0
    adjusted[3] = +29900 - (-14801) = 44701

    Adjusted deltas = [13351, 29152, 0, 44701]
    Range: 0 to 44701 → bit_width = ceil(log2(44702)) = 16 bits

═══════════════════════════════════════════════════════════════════════════
FOR (Single Level)
═══════════════════════════════════════════════════════════════════════════

    min = 99 (stored as frame_of_reference)
    delta[0] =  1999 - 99 =  1900
    delta[1] =   549 - 99 =   450
    delta[2] = 14900 - 99 = 14801
    delta[3] =    99 - 99 =     0
    delta[4] = 29999 - 99 = 29900

    Deltas = [1900, 450, 14801, 0, 29900]
    Range: 0 to 29900 → bit_width = ceil(log2(29901)) = 15 bits

I think for floating-point data we want frame-of-reference against the
minimum, not adjacent subtractions, which also produce negative values.

> * there are a lot of fields in the headers that look a bit superfluous
> (though of course those bits are relatively cheap); for example, why
> have a format "version" while we could define a new encoding for
> incompatible evolutions?
>

We discussed this point a lot in the spec document
<https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit>
and have gravitated towards a versioning scheme for easier evolution.
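To make the DELTA_BINARY_PACKED vs. frame-of-reference comparison above easy
to reproduce, here is a minimal standalone sketch (my own illustration, not
code from the proposal or from parquet-cpp; BitWidth and the variable names
are made up for this example) that computes the bit widths both schemes
would need for the example values:

```
// Minimal sketch, not the proposed spec or parquet-cpp code: compare the
// bit width needed by a DELTA_BINARY_PACKED-style scheme (adjacent deltas,
// then frame-of-reference over the deltas) with plain frame-of-reference
// over the ALP-encoded integers from the example above.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Number of bits needed to represent values in [0, max_value].
int BitWidth(int64_t max_value) {
  int bits = 0;
  while (max_value > 0) {
    ++bits;
    max_value >>= 1;
  }
  return bits;
}

int main() {
  // ALP-encoded prices for [19.99, 5.49, 149.00, 0.99, 299.99] (scaled by 10^2).
  const std::vector<int64_t> values = {1999, 549, 14900, 99, 29999};

  // DELTA_BINARY_PACKED style: adjacent deltas, then subtract the minimum delta.
  std::vector<int64_t> deltas;
  for (std::size_t i = 1; i < values.size(); ++i) {
    deltas.push_back(values[i] - values[i - 1]);
  }
  const int64_t min_delta = *std::min_element(deltas.begin(), deltas.end());
  int64_t max_adjusted = 0;
  for (int64_t d : deltas) max_adjusted = std::max(max_adjusted, d - min_delta);

  // Plain frame-of-reference: subtract the minimum value.
  const int64_t min_value = *std::min_element(values.begin(), values.end());
  int64_t max_for = 0;
  for (int64_t v : values) max_for = std::max(max_for, v - min_value);

  std::cout << "delta + FOR bit width: " << BitWidth(max_adjusted) << "\n"  // 16
            << "plain FOR bit width:   " << BitWidth(max_for) << "\n";      // 15
  return 0;
}
```

On this data the two-level scheme ends up needing 16 bits per value versus
15 bits for plain frame-of-reference, which is the point of the worked
example above.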
> * the "Total Encoded Element count" duplicates information already in
> the page header, with risks of inconsistent values (including security
> risks that require specific care in implementations)
>

On 'num_elements': let me re-read the spec on this point and get back to you.

> * what happens if the number of exceptions is above 65535? their indices
> are coded as 16-bit uints. How about using the same encoding as for
> bit-packed integers (e.g. DELTA_BINARY_PACKED), which will also remove
> the 65535 limitation.
>

So far I don't see a need for a vector larger than 65535 values. With
vectors that large the metadata overhead is already small, so you might as
well break the data into multiple vectors. I'm going to give it some more
thought and get back to you.

> * the C++ implementation has a `kSamplerRowgroupSize` constant, which
> worries me; row group size can vary *a lot* between workloads (from
> thousands to millions of elements typically), the sampling process
> should not depend on that.
>

The sampling process should be statistically sound: it should pick enough
values and not be biased towards the values at the start. The ALP algorithm
ensures that, balancing the cycles spent on parameter selection against the
risk of picking incorrect parameters. For a very large row group we can
change the constant so that it samples over a larger data set, or the
sampling can be done at page level instead. Happy to discuss this more (a
rough sketch of what I mean is at the very bottom of this mail).

> Regards
>
> Antoine.
>
>
> Le 16/10/2025 à 23:47, PRATEEK GAUR a écrit :
> > Hi team,
> >
> > We spent some time evaluating ALP compression and decompression compared
> > to other encoding alternatives like CHIMP/GORILLA and compression
> > techniques like SNAPPY/LZ4/ZSTD. We presented these numbers to the
> > community members on October 15th in the biweekly parquet meeting. (I
> > can't seem to access the recording, so please let me know what access
> > rules I need to be able to view it.)
> >
> > We did this evaluation over some datasets pointed to by the ALP paper
> > and some pointed to by the parquet community.
> >
> > The results are available in the following document:
> > https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg/edit?tab=t.0
> >
> > Based on the numbers we see:
> >
> >   - ALP is comparable to ZSTD(level=1) in terms of compression ratio and
> >     much better compared to other schemes (numbers in the sheet are bytes
> >     needed to encode each value).
> >   - ALP is doing quite well in terms of decompression speed (numbers in
> >     the sheet are bytes decompressed per second).
> >
> > As next steps we will:
> >
> >   - Get the numbers for compression on top of byte stream split.
> >   - Evaluate the algorithm over a few more datasets.
> >   - Have an implementation in the arrow-parquet repo.
> >
> > Looking forward to feedback from the community.
> >
> > Best,
> > Prateek and Dhirhan
> >
>
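Coming back to the sampling point above: purely as an illustration of what I
mean by making the sample budget independent of row group size (this is not
the current implementation; kSampleCount and SampleForAlpParameters are
made-up names for this sketch), something along these lines would pick the
same number of evenly spaced values whether the row group has thousands or
millions of elements:

```
// Illustration only, not the current implementation: take a fixed-size,
// evenly spaced sample of a column chunk so that the cost of choosing ALP
// parameters does not grow with row group size and the sample is not
// biased towards the start of the data.
#include <cstddef>
#include <vector>

constexpr std::size_t kSampleCount = 1024;  // hypothetical fixed sample budget

std::vector<double> SampleForAlpParameters(const std::vector<double>& column) {
  if (column.size() <= kSampleCount) {
    return column;  // small inputs: just use everything
  }
  std::vector<double> sample;
  sample.reserve(kSampleCount);
  // Evenly spaced stride across the whole input, independent of its size.
  const double stride = static_cast<double>(column.size()) / kSampleCount;
  for (std::size_t i = 0; i < kSampleCount; ++i) {
    sample.push_back(column[static_cast<std::size_t>(i * stride)]);
  }
  return sample;
}
```

Whether that runs per row group or per page is then just a question of which
slice of the data gets passed in.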
