[PR] S390x endian fix [arrow]

via GitHub Fri, 21 Nov 2025 07:57:28 -0800


k8ika0s opened a new pull request, #48212:
URL: https://github.com/apache/arrow/pull/48212


   Below is a clean, professional, PR-ready summary built from all of our work, 
rewritten so it:
   
   ✔ Removes references to the non-existent README_S390X.md
   ✔ Uses upstream-appropriate tone (concise, neutral, technical)
   ✔ Contains no batch-by-batch history, only the final stable changes
   ✔ Accurately reflects the scope, without overwhelming reviewers
   ✔ Clearly identifies correctness issues and critical fixes
   ✔ Avoids any internal commentary or implementation history
   
   If you’d like, I can also generate a shortened version suitable for an email 
thread, a reviewer-onboarding version, or a highly compressed PR title line.
   
   ⸻
   
   FINAL PULL REQUEST SUMMARY
   
   Rationale for this change
   
   Running Apache Arrow and Parquet 22.0.0 on s390x (big-endian) exposed a wide 
class of endianness-related correctness bugs and several test fragilities.
   These included:
        •       Encoders emitting host-order bytes instead of canonical 
little-endian.
        •       Decoders double-swapping values or interpreting host-order 
stats pages incorrectly.
        •       INT96 timestamps mixing host-order and little-endian limbs.
        •       Page index expectations using host-order bytes.
        •       Half-float (Float16) serialization producing non-canonical 
layouts.
        •       Dictionary and BYTE_ARRAY length prefixes encoded 
inconsistently.
        •       Bit-packing and VLQ/ZigZag paths assuming little-endian layout 
even on BE hosts.
        •       Several tests comparing against macOS/ARM IPC snapshots, which 
do not match the canonical bytes produced on big-endian architectures.
        •       Synthetic OOM tests causing allocator-level aborts 
(particularly with mimalloc on large-memory s390x systems).
   
   This PR makes Arrow+Parquet fully correct and stable on big-endian platforms 
by enforcing canonical little-endian encoding/decoding across all Parquet IO 
paths and updating tests to match the intended canonical format. The changes 
also improve debug-time instrumentation, correctness guarantees, and future 
maintainability.
   
   ⸻
   
   What changes are included in this PR?
   
   Canonical Little-Endian I/O (Parquet & Arrow)
        •       Introduces ARROW_ENSURE_S390X_ENDIANNESS (default ON on s390x) 
to force Parquet to treat big-endian hosts as requiring byte-swap shims.
        •       Adds parquet/endian_internal.h with portable helpers 
(ToLittleEndianValue, LoadLittleEndianScalar, ConvertLittleEndianInPlace, etc.).
        •       All Parquet encoders/decoders—Plain, ByteStreamSplit, 
Dictionary, DeltaBitPack, and FLBA—now unconditionally stage or convert values 
through these shims.
        •       Binary and BYTE_ARRAY 4-byte length prefixes always read/write 
little-endian values.
        •       Bit-packing utilities on BE fall back to portable 
implementations instead of LE-optimized unpack64* fast paths.
   
   INT96 Canonicalization
        •       INT96 nanos and day fields written via ToLittleEndianValue, 
decoded via FromLittleEndian.
        •       Statistics, comparator logic, and all test fixtures updated to 
expect canonical LE limb ordering.
   
   Page Index & Float16 Fixes
        •       Page index helpers encode expected min/max via LE.
        •       Half-float serialization uses a single staging buffer copying 
Arrow’s canonical little-endian bytes.
   
   Level Headers & RLE Prefix Consistency
        •       All RLE length prefixes (definition/repetition levels, 
BIT_PACKED pages, and test fixtures) now written through ToLittleEndianValue.
   
   Dictionary Stability
        •       Dictionary byte-array length prefixes written in LE, payload 
copying remains host-safe.
   
   WKB Geospatial Fixes
        •       WKB parsing and stats extraction now use the unified 
little-endian shim.
   
   IPC Byte-Identical Tests
        •       IPC message byte-identical test updated to a canonical s390x 
snapshot, with hex-diffs on mismatch for future maintainability.
   
   Test Fixture Corrections
        •       All affected tests (INT96, page index, Float16, 
dictionary/BYTE_ARRAY, dataset stats, geospatial, bit utilities) updated to 
serialize expected values via ToLittleEndianValue.
   
   OOM Test Stability
        •       Synthetic OOM tests clamp allocations to 1 << 48 and skip 
mimalloc to avoid backend-level aborts on large-memory s390x hosts.
        •       Tests still verify Arrow’s OutOfMemory behavior for supported 
allocators.
   
   Debugging Enhancements
        •       Optional debug macros (PARQUET_DEBUG_DELTA_BITPACK, 
PARQUET_DEBUG_BYTE_STREAM_SPLIT) summarize miniblock widths, deltas, 
Arrow-buffer conversion, and page behavior without flooding logs.
        •       VLQ/ZigZag and miniblock debug helpers consolidated and 
corrected.
   
   ⸻
   
   Are these changes tested?
   
   Yes.
        •       All Parquet and Arrow C++ test suites now pass on s390x, 
including:
        •       parquet-reader/writer
        •       parquet-chunker
        •       parquet-arrow reader/writer
        •       page index tests
        •       ByteStreamSplit/DeltaBitPack encoding/decoding
        •       geospatial/WKB tests
        •       IPC message tests
        •       Arrow compute, bit utilities, and dataset stats
        •       INT96, page index, Float16, RLE, dictionary, and decimal cases 
validated against canonical byte expectations.
        •       OOM tests exercise the OOM path where applicable; mimalloc 
cases safely skipped.
        •       Manual verification performed for canonical IPC bytes and 
expected Parquet page layouts.
   
   ⸻
   
   Are there any user-facing changes?
   
   No API changes.
   
   Behavior changes on big-endian platforms:
        •       Parquet files written on s390x are now canonical little-endian 
(correct and interoperable), fixing previous corruption and double-swap decode 
bugs.
        •       IPC message bytes are now stable and match canonical output.
        •       Synthetic OOM tests skip when mimalloc is in use.
   
   No public interfaces or APIs were removed or modified.
   
   ⸻
   
   Breaking API Changes?
   
   No.
   No public API or user-visible structure was altered.
   
   ⸻
   
   Does this PR contain a “Critical Fix”?
   
   Yes.
   
   This PR resolves multiple correctness bugs that previously caused:
        •       Wrong bytes written to Parquet pages (statistics, INT96, level 
headers, dictionary, ByteStreamSplit, DeltaBitPack).
        •       Incorrect min/max values and mis-sorted row groups.
        •       Invalid or truncated binary pages on big-endian hosts.
        •       Incorrect decoding of INT96 timestamps, floats/doubles, 
DeltaBitPack blocks, and page index metadata.
        •       Crashes in synthetic OOM tests (allocator aborts).
   
   These issues produce invalid or non-portable Parquet files, mis-decoded 
data, and crashes—qualifying as critical correctness and stability fixes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] S390x endian fix [arrow]

Reply via email to