Vishwanatha-HD commented on code in PR #48217:
URL: https://github.com/apache/arrow/pull/48217#discussion_r3374178572
##########
cpp/src/arrow/util/byte_stream_split_internal.h:
##########
@@ -330,15 +330,20 @@ inline void DoSplitStreams(const uint8_t* src, int width,
int64_t nvalues,
while (nvalues >= kBlockSize) {
for (int stream = 0; stream < width; ++stream) {
uint8_t* dest = dest_streams[stream];
+#if ARROW_LITTLE_ENDIAN
+ const int src_stream = stream;
+#else
+ const int src_stream = width - 1 - stream;
+#endif
Review Comment:
Hi @pitrou,
This is a classic endianness correction for the Byte Stream Split encoding
used by Arrow/Parquet.
Suppose you have 4-byte values (e.g. float32):
Value1 = [B0 B1 B2 B3]
Value2 = [C0 C1 C2 C3]
Value3 = [D0 D1 D2 D3]
In memory (normal layout):
B0 B1 B2 B3 C0 C1 C2 C3 D0 D1 D2 D3
Byte Stream Split transforms this into:
Stream0: B0 C0 D0 ...
Stream1: B1 C1 D1 ...
Stream2: B2 C2 D2 ...
Stream3: B3 C3 D3 ...
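As a rough illustration of the transform above, here is a minimal scalar sketch. `SplitStreamsNaive` is a hypothetical helper written for this comment, not Arrow's `DoSplitStreams`, and it takes bytes strictly in memory order with no endianness handling:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not Arrow's API): splits `nvalues` values of `width`
// bytes each into `width` streams, taking byte k of every value, in memory
// order, as stream k.
std::vector<std::vector<uint8_t>> SplitStreamsNaive(const uint8_t* src, int width,
                                                    int64_t nvalues) {
  std::vector<std::vector<uint8_t>> streams(static_cast<size_t>(width));
  for (int stream = 0; stream < width; ++stream) {
    streams[stream].reserve(static_cast<size_t>(nvalues));
    for (int64_t i = 0; i < nvalues; ++i) {
      // Byte `stream` of value `i` in the interleaved source buffer.
      streams[stream].push_back(src[i * width + stream]);
    }
  }
  return streams;
}
```

For the three values above, stream 0 comes out as B0 C0 D0 and stream 3 as B3 C3 D3.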
The original code takes stream i from byte i in memory, which implicitly assumes:
stream 0 = least significant byte
stream 1 = next byte
stream 2 = next byte
stream 3 = most significant byte
For example, take the float32 value 1.0f, whose bit pattern is 0x3F800000.
On a little-endian machine it is stored as:
00 00 80 3F
^        ^
stream 0 stream 3
so the assumption above holds.
On IBM Z (big-endian), the same float is stored as:
3F 80 00 00
^        ^
byte 0   byte 3 (in memory)
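The in-memory layout can be checked on any host with a small sketch like the one below. `BytesInMemory` is a hypothetical helper, and the sketch assumes `float` is a 4-byte IEEE 754 binary32 value:

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns the bytes of `value` exactly as they sit in
// memory on the current host. Assumes a 4-byte IEEE 754 float.
std::array<uint8_t, 4> BytesInMemory(float value) {
  static_assert(sizeof(float) == 4, "this sketch assumes a 4-byte float");
  std::array<uint8_t, 4> bytes{};
  std::memcpy(bytes.data(), &value, sizeof(value));
  return bytes;
}
```

Calling `BytesInMemory(1.0f)` should give 00 00 80 3F on a little-endian host and 3F 80 00 00 on a big-endian one.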
However, the Byte Stream Split specification expects:
stream 0 = least significant byte
stream 1 = next byte
stream 2 = next byte
stream 3 = most significant byte
Hence my fix reverses the byte order on big-endian hosts. With the fix,
the streams are populated as follows:
Stream0 gets byte[3] = 00
Stream1 gets byte[2] = 00
Stream2 gets byte[1] = 80
Stream3 gets byte[0] = 3F
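The mapping above can be sketched in isolation as follows. Both `SourceByteForStream` and `SplitOneValue` are hypothetical names for illustration. The first mirrors the `#if ARROW_LITTLE_ENDIAN` / `#else` branches in the diff, but takes the endianness as a runtime parameter (an assumption made here so that both branches can be exercised on any host):

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: maps an output stream index to the in-memory byte
// index that feeds it. Little-endian hosts use the identity mapping;
// big-endian hosts use the reversed index.
constexpr int SourceByteForStream(int stream, int width, bool little_endian) {
  return little_endian ? stream : width - 1 - stream;
}

// Hypothetical helper: splits one 4-byte value into four single-byte
// streams, so that stream 0 always receives the least significant byte.
std::array<uint8_t, 4> SplitOneValue(const std::array<uint8_t, 4>& bytes,
                                     bool little_endian) {
  std::array<uint8_t, 4> streams{};
  for (int stream = 0; stream < 4; ++stream) {
    streams[stream] = bytes[SourceByteForStream(stream, 4, little_endian)];
  }
  return streams;
}
```

With this mapping, the big-endian bytes 3F 80 00 00 and the little-endian bytes 00 00 80 3F produce the same streams: 00, 00, 80, 3F.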
I hope this makes sense. Please let me know if you need more details.
Thanks.