Vishwanatha-HD commented on PR #48203: URL: https://github.com/apache/arrow/pull/48203#issuecomment-3570805783
> @Vishwanatha-HD Hey! Really appreciate you taking on the encoder/decoder paths for s390x — these two files are where a lot of the subtle BE issues first show up.
>
> One thing I ran into on real s390x hardware is that Arrow’s array buffers already store their scalars in canonical little-endian format. Because of that, per-value swapping inside the Plain/Arrow fast paths can sometimes lead to an unintended double-swap, especially when mixing Arrow-originated inputs with non-Arrow callers (e.g., DeltaBitPack or ByteStreamSplit feeding into the same decode path).
>
> A couple of spots I’m curious about here:
>
> **• PlainDecoder → Arrow fast path** The `#if ARROW_LITTLE_ENDIAN` branches check out for host-native buffers, but Arrow itself always hands you LE data. Does this path avoid re-swapping values that are already canonical LE from Arrow? I’ve seen that cause subtle mismatches on BE when DeltaBitPack or BSS push Arrow arrays directly into the decoder.
>
> **• PlainEncoder (primitive path)** On BE, the per-value `ToLittleEndian` write works for correctness, though I’ve found cases where staging to a single LE scratch buffer helps avoid partial/mixed-endian outputs when builders and sinks run back-to-back.
>
> **• ByteStreamSplit** Here the code assumes `DoSplitStreams` handles endianness, but BSS usually expects inputs to already be in canonical LE order before the streams are interleaved. With native-order buffers on BE, the shuffle sometimes produces different stats/dictionary bytes across architectures. Curious if you’ve tested mixed Arrow/non-Arrow inputs through this path?
>
> None of this blocks the PR — just sharing things I hit in BE testing across the encode/decode → stats → page-index chain.

@k8ika0s Thanks for your review comments. While I agree with you that, from a structural point of view, things could be a lot better, please note that I have tested these changes on s390x systems and also on OpenShift AI workloads, and they work properly. Hence there is no concern with these changes.
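For reference, here is a minimal, self-contained sketch of the double-swap concern quoted above. `DecodePlainInt32` and the `input_is_canonical_le` flag are hypothetical names used only for illustration, not the actual Arrow/Parquet internals; the point is that on a big-endian host such as s390x the LE-to-native swap must happen exactly once.

```cpp
// Sketch of the double-swap hazard on a big-endian host.
// DecodePlainInt32 and input_is_canonical_le are hypothetical illustrations,
// not the actual Arrow/Parquet decoder code.
#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

namespace {

bool HostIsLittleEndian() {
  const uint16_t probe = 1;
  uint8_t first_byte;
  std::memcpy(&first_byte, &probe, 1);
  return first_byte == 1;
}

uint32_t ByteSwap32(uint32_t v) {
  return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
         ((v & 0x00FF0000u) >> 8) | ((v & 0xFF000000u) >> 24);
}

// Parquet PLAIN data is canonical little-endian on disk, so on a big-endian
// host the swap to native order is needed exactly once. If an upstream stage
// has already converted the buffer to native order, swapping here again
// ("double-swap") silently corrupts the values.
void DecodePlainInt32(const uint8_t* src, int64_t n, int32_t* dst,
                      bool input_is_canonical_le) {
  std::memcpy(dst, src, static_cast<size_t>(n) * sizeof(int32_t));
  if (!HostIsLittleEndian() && input_is_canonical_le) {
    for (int64_t i = 0; i < n; ++i) {
      uint32_t tmp;
      std::memcpy(&tmp, &dst[i], sizeof(tmp));
      tmp = ByteSwap32(tmp);
      std::memcpy(&dst[i], &tmp, sizeof(tmp));
    }
  }
}

}  // namespace

int main() {
  const std::vector<int32_t> values = {1, 2, 42};

  // Serialize to canonical little-endian bytes, independent of host order.
  std::vector<uint8_t> le_bytes(values.size() * sizeof(int32_t));
  for (size_t i = 0; i < values.size(); ++i) {
    uint32_t v = static_cast<uint32_t>(values[i]);
    if (!HostIsLittleEndian()) v = ByteSwap32(v);
    std::memcpy(&le_bytes[i * sizeof(int32_t)], &v, sizeof(v));
  }

  std::vector<int32_t> decoded(values.size());
  DecodePlainInt32(le_bytes.data(), static_cast<int64_t>(values.size()),
                   decoded.data(), /*input_is_canonical_le=*/true);
  std::cout << decoded[2] << "\n";  // Prints 42 on both LE and BE hosts.
  return 0;
}
```

If a caller passed `input_is_canonical_le=true` for a buffer that had already been converted to native order upstream, the values would be swapped a second time, which is the kind of mismatch described for DeltaBitPack/BSS feeding the same decode path.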

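A similar sketch for the ByteStreamSplit point: staging values in a canonical little-endian scratch buffer before the byte streams are interleaved keeps the encoded bytes identical on LE and BE hosts. `EncodeByteStreamSplitFloat` below is a hypothetical helper used only for illustration, not the actual `DoSplitStreams` implementation.

```cpp
// Sketch: ByteStreamSplit over canonical-LE bytes is architecture-independent.
// EncodeByteStreamSplitFloat is a hypothetical helper, not Arrow/Parquet code.
#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

namespace {

bool HostIsLittleEndian() {
  const uint16_t probe = 1;
  uint8_t first;
  std::memcpy(&first, &probe, 1);
  return first == 1;
}

// Stage values as canonical little-endian bytes, then split: output stream k
// holds byte k of every value. Because the split runs over the LE staging
// buffer, the encoded bytes are identical on LE and BE hosts; splitting
// host-native BE bytes directly would produce different output on s390x.
std::vector<uint8_t> EncodeByteStreamSplitFloat(const std::vector<float>& values) {
  constexpr size_t kWidth = sizeof(float);
  const size_t n = values.size();

  // 1) Stage to a canonical-LE scratch buffer.
  std::vector<uint8_t> le_scratch(n * kWidth);
  for (size_t i = 0; i < n; ++i) {
    uint8_t raw[kWidth];
    std::memcpy(raw, &values[i], kWidth);
    for (size_t b = 0; b < kWidth; ++b) {
      le_scratch[i * kWidth + b] =
          HostIsLittleEndian() ? raw[b] : raw[kWidth - 1 - b];
    }
  }

  // 2) Interleave the LE bytes into per-byte streams.
  std::vector<uint8_t> out(n * kWidth);
  for (size_t b = 0; b < kWidth; ++b) {
    for (size_t i = 0; i < n; ++i) {
      out[b * n + i] = le_scratch[i * kWidth + b];
    }
  }
  return out;
}

}  // namespace

int main() {
  const std::vector<float> values = {1.0f, 2.0f, 3.0f};
  const std::vector<uint8_t> encoded = EncodeByteStreamSplitFloat(values);
  // The printed byte sequence is the same on x86_64 and s390x.
  for (uint8_t byte : encoded) std::cout << static_cast<int>(byte) << " ";
  std::cout << "\n";
  return 0;
}
```

The same staging idea applies to the PlainEncoder remark above: writing values once into an LE scratch buffer and flushing it avoids partial or mixed-endian output when builders and sinks run back-to-back.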