Re: [PR] GH-48216: [C++][Parquet] Fix Util Byte Stream Split Internal logic to enable Parquet DB support on s390x [arrow]

via GitHub Mon, 08 Jun 2026 07:52:58 -0700


Vishwanatha-HD commented on code in PR #48217:
URL: https://github.com/apache/arrow/pull/48217#discussion_r3374178572



##########
cpp/src/arrow/util/byte_stream_split_internal.h:
##########
@@ -330,15 +330,20 @@ inline void DoSplitStreams(const uint8_t* src, int width, int64_t nvalues,
   while (nvalues >= kBlockSize) {
     for (int stream = 0; stream < width; ++stream) {
       uint8_t* dest = dest_streams[stream];
+#if ARROW_LITTLE_ENDIAN
+      const int src_stream = stream;
+#else
+      const int src_stream = width - 1 - stream;
+#endif

Review Comment:
   Hi @pitrou,
   This is an endianness correction for the Byte Stream Split encoding used by Arrow/Parquet.
   Suppose you have 4-byte values (e.g. float32):
   Value1 = [B0 B1 B2 B3]
   Value2 = [C0 C1 C2 C3]
   Value3 = [D0 D1 D2 D3]
   
   In memory (normal layout):
   B0 B1 B2 B3 C0 C1 C2 C3 D0 D1 D2 D3
   
   Byte Stream Split transforms this into:
   Stream0: B0 C0 D0 ...
   Stream1: B1 C1 D1 ...
   Stream2: B2 C2 D2 ...
   Stream3: B3 C3 D3 ...
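
   The transform above can be sketched as a plain scalar loop. This is only an illustration, not Arrow's actual implementation: `SplitStreamsScalar` is a hypothetical helper that returns one vector per stream and copies byte k of each value, as it sits in memory, into stream k.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Arrow): scatter byte k of every
// `width`-byte value in `src` into stream k, in memory order.
std::vector<std::vector<uint8_t>> SplitStreamsScalar(const uint8_t* src, int width,
                                                     int64_t nvalues) {
  assert(src != nullptr && width > 0 && nvalues >= 0);
  std::vector<std::vector<uint8_t>> streams(
      static_cast<std::size_t>(width),
      std::vector<uint8_t>(static_cast<std::size_t>(nvalues)));
  for (int64_t i = 0; i < nvalues; ++i) {
    for (int stream = 0; stream < width; ++stream) {
      // Value i starts at offset i * width; its byte `stream` goes to that stream.
      streams[static_cast<std::size_t>(stream)][static_cast<std::size_t>(i)] =
          src[i * width + stream];
    }
  }
  return streams;
}
```

   Note that this version uses the in-memory byte index directly, so the meaning of "stream 0" depends on the host's byte order.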
   
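   The original code copies byte k of each value, as laid out in memory, into stream k. That implicitly assumes:
   stream 0 = least significant byte
   stream 1 = next byte
   stream 2 = next byte
   stream 3 = most significant byte
   
   For example, take the float32 value 1.0, whose bit pattern is 0x3F800000. On a little-endian machine its bytes in memory are:
   
   00        00        80        3F
   ^                             ^
   stream 0                      stream 3
   
   So the assumption holds on little-endian machines.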
   
   On IBM Z (s390x, big-endian), the same float is stored as:
   3F        80        00        00
   ^                             ^
   byte 0 in memory              byte 3 in memory
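
   To make the two layouts concrete, here is a small sketch that computes the in-memory byte sequence of a 32-bit pattern under either byte order. `MemoryLayout` is a hypothetical helper written for this explanation; it simulates the layout with shifts rather than inspecting the host, so it gives the same answer on any machine.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: return the 4 bytes of `bits` in the order they would
// appear in memory on a little-endian or big-endian host.
std::array<uint8_t, 4> MemoryLayout(uint32_t bits, bool little_endian) {
  std::array<uint8_t, 4> out{};
  for (int i = 0; i < 4; ++i) {
    // Little-endian: byte 0 is the least significant byte.
    // Big-endian: byte 0 is the most significant byte.
    const int shift = little_endian ? 8 * i : 8 * (3 - i);
    out[static_cast<std::size_t>(i)] = static_cast<uint8_t>(bits >> shift);
  }
  return out;
}
```

   For 0x3F800000 this yields 00 00 80 3F for little-endian and 3F 80 00 00 for big-endian, matching the two diagrams.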
    
   However, the Byte Stream Split specification expects the same stream order regardless of the host's byte order:
   stream 0 = least significant byte
   stream 1 = next byte
   stream 2 = next byte
   stream 3 = most significant byte
   
   Hence, on big-endian hosts, my fix reverses the byte order when selecting the source byte for each stream. With the fix:
   Stream0 gets byte[3] = 00
   Stream1 gets byte[2] = 00
   Stream2 gets byte[1] = 80
   Stream3 gets byte[0] = 3F
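
   The index remapping can be sketched in isolation as follows. This is a simplified scalar illustration, not the code in the PR: `SplitStreamsEndianAware` is a hypothetical helper, and the `host_is_little_endian` parameter stands in for the compile-time `ARROW_LITTLE_ENDIAN` check so that both branches can be exercised on any machine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: like a plain byte stream split, but stream k always
// receives the k-th least significant byte, whatever the host byte order.
std::vector<std::vector<uint8_t>> SplitStreamsEndianAware(const uint8_t* src, int width,
                                                          int64_t nvalues,
                                                          bool host_is_little_endian) {
  assert(src != nullptr && width > 0 && nvalues >= 0);
  std::vector<std::vector<uint8_t>> streams(
      static_cast<std::size_t>(width),
      std::vector<uint8_t>(static_cast<std::size_t>(nvalues)));
  for (int stream = 0; stream < width; ++stream) {
    // Same remapping as in the diff: identity on little-endian hosts,
    // mirrored byte index on big-endian hosts.
    const int src_stream = host_is_little_endian ? stream : width - 1 - stream;
    for (int64_t i = 0; i < nvalues; ++i) {
      streams[static_cast<std::size_t>(stream)][static_cast<std::size_t>(i)] =
          src[i * width + src_stream];
    }
  }
  return streams;
}
```

   Feeding the big-endian bytes 3F 80 00 00 with `host_is_little_endian = false` and the little-endian bytes 00 00 80 3F with `host_is_little_endian = true` produces the same four streams, which is the property the fix is meant to restore.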
   
   I hope this makes sense. Please let me know if you need more details. Thanks.



