vinooganesh opened a new pull request, #3397: URL: https://github.com/apache/parquet-java/pull/3397
cc @julienledem @alamb

### Rationale for this change

Reworks the ALP encoding implementation to address emkornfield's architectural feedback on PR #3390. The original implementation buffered all values in memory and decoded eagerly. This PR makes the writer incremental (encoding per-vector as values arrive) and the reader lazy (decoding on demand), matching how other Parquet encodings work.

Builds on Julien Le Dem's original implementation (https://github.com/apache/parquet-java/pull/3390). The file structure, integration points, core math, and interop test infrastructure all come from his work; this rework focused on the internal writer/reader plumbing.

### What changes are included in this PR?

**Architecture (addressing review feedback):**
- Incremental writer: values buffer in a fixed-size vector, and each full vector is encoded and flushed immediately.
- Lazy reader: vectors decode on first access via an offset array; skip() is O(1).
- Interleaved page layout so each vector is self-contained.
- Extracted an AlpValuesReader abstract base class for shared logic.
- Preset caching: full parameter search for the first 8 vectors, with the top 5 combinations cached for the rest.

**Spec compliance:**
- Fixed the packed data size formula to ceil(n * bitWidth / 8).
- Fixed an unsigned delta comparison in the float writer.
- Explicit little-endian byte reads instead of relying on ByteBuffer order.
- Used parquet-encoding's BytePacker instead of custom bit-packing.
- Capped the maximum vector size at 32768 to prevent uint16 overflow in num_exceptions.

**Code quality:**
- Renamed bitWidth overloads to prevent silent type coercion.
- Package-private visibility for internals.
- Configurable vector size (default 1024).

**Integration:**
- Wired ALP into DefaultV2ValuesWriterFactory and ParquetProperties.

### Are these changes tested?

Yes. 105 tests across 3 test classes, all passing. The full parquet-column suite (677 tests) also passes. Key tests construct ALP page bytes directly according to the spec and feed them to the reader without going through the writer.
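The packed-size fix above can be illustrated in isolation. This is a standalone sketch, not code from the PR; the class and method names here are hypothetical, chosen only for the example:

```java
public class PackedSize {
    // Spec-compliant packed byte count for n bit-packed values of
    // bitWidth bits each: ceil(n * bitWidth / 8). Computing the total
    // bit count first (in long arithmetic, to avoid int overflow) and
    // then rounding up avoids the under-allocation that per-value or
    // truncating-division formulas produce.
    static int packedSizeBytes(int n, int bitWidth) {
        return (int) (((long) n * bitWidth + 7) / 8);
    }

    public static void main(String[] args) {
        // A full 1024-value vector at 3 bits packs into exactly 384 bytes.
        System.out.println(packedSizeBytes(1024, 3)); // 384
        // A partial vector of 5 values at 3 bits needs ceil(15 / 8) = 2 bytes,
        // where truncating integer division would wrongly give 1.
        System.out.println(packedSizeBytes(5, 3)); // 2
    }
}
```

The partial-vector case is why the rounding matters: the "every partial vector remainder mod 8" tests mentioned below exercise exactly the inputs where truncation and ceiling disagree.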
This verifies that the reader works independently and catches bugs where the writer and reader agree with each other but disagree with the spec. Tests also cover NaN bit pattern preservation, negative zero roundtrip, extreme values, every partial-vector remainder mod 8, skips across vector boundaries, and preset caching under distribution change.

### Are there any user-facing changes?

Users can enable ALP encoding for FLOAT and DOUBLE columns via ParquetProperties.withAlpEncoding(), globally or per-column.

Note: unless I'm missing something, ALP is not yet in the parquet-format Thrift spec (https://github.com/apache/parquet-format/issues/533), so writing ALP files through the full Hadoop pipeline will fail at metadata serialization until parquet.thrift is updated (parquet-format PR #548).
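Constructing page bytes by hand and decoding them with explicit little-endian reads (rather than trusting a ByteBuffer's configured order) can be sketched as follows. This is a minimal standalone illustration under my own naming, not the PR's actual reader code:

```java
public class LittleEndianReads {
    // Assemble a 32-bit int from four bytes in little-endian order.
    // Masking each byte with 0xFF prevents Java's sign extension of
    // byte values >= 0x80 from corrupting the result.
    static int readIntLE(byte[] buf, int off) {
        return (buf[off] & 0xFF)
                | (buf[off + 1] & 0xFF) << 8
                | (buf[off + 2] & 0xFF) << 16
                | (buf[off + 3] & 0xFF) << 24;
    }

    public static void main(String[] args) {
        // Hand-built little-endian bytes: the ints 1 and 65535 back to back.
        byte[] page = {0x01, 0x00, 0x00, 0x00, (byte) 0xFF, (byte) 0xFF, 0x00, 0x00};
        System.out.println(readIntLE(page, 0)); // 1
        System.out.println(readIntLE(page, 4)); // 65535
    }
}
```

Reading byte by byte like this makes the on-disk layout explicit in the code itself, which is what lets spec-constructed test bytes catch a reader and writer that merely agree with each other.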
