alamb commented on code in PR #7875: URL: https://github.com/apache/arrow-rs/pull/7875#discussion_r2187650306
########## arrow-array/src/array/byte_view_array.rs: ########## @@ -1193,9 +1197,19 @@ mod tests { b"abcdefghi", b"abcdefghij", b"abcdefghijk", - b"abcdefghijkl", // 12 bytes, max inline - b"bar", - b"bar\0", // special case to test length tiebreaker + b"abcdefghijkl", + // ─────────────────────────────────────────────────────────────────────── + // This pair verifies that we didn’t accidentally reverse the inline bytes: + // without our fix, “backend one” would compare as if it were + // “eno dnekcab”, so “one” might end up sorting _after_ “two”. + b"backend one", // special case: tests byte-order reversal bug + b"backend two", Review Comment: Can we also please include strings that have the same exact inline prefix the length differ as well as content ? Something like * `LongerThan12Bytes` * `LongerThan12Bytez` * `LongerThan12Bytes\0` * `LongerThan12Byt` ########## arrow-array/src/array/byte_view_array.rs: ########## @@ -605,29 +605,33 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> { /// key("bar\0") = 0x0000000000000000000062617200000004 /// ⇒ key("bar") < key("bar\0") /// ``` + /// + /// ### Why the old code failed + /// Fixed in: <https://github.com/apache/arrow-rs/pull/7875> + /// In the previous implementation, we did: + /// ```ignore + /// let word_be = u128::from_le_bytes(raw.to_le_bytes()).to_be(); + /// ``` + /// That single `.to_be()` flips **all 16 bytes**, including the data region, effectively + /// reversing each string. As a result: + /// - `"backend one"` was compared as if it were `"eno dnekcab"`, inverting lexicographical order + /// - any multi‑byte string ordering was completely incorrect #[inline(always)] pub fn inline_key_fast(raw: u128) -> u128 { - // Convert the raw u128 (little-endian) into bytes for manipulation let raw_bytes = raw.to_le_bytes(); - // Extract the length (first 4 bytes), convert to big-endian u32, and promote to u128 - let len_le = &raw_bytes[0..4]; - let len_be = u32::from_le_bytes(len_le.try_into().unwrap()).to_be() as u128; - - // Extract the inline string bytes (next 12 bytes), place them into the lower 12 bytes of a 16-byte array, - // padding the upper 4 bytes with zero to form a little-endian u128 value - let mut inline_bytes = [0u8; 16]; - inline_bytes[4..16].copy_from_slice(&raw_bytes[4..16]); - - // Convert to big-endian to ensure correct lexical ordering - let inline_u128 = u128::from_le_bytes(inline_bytes).to_be(); + // Parse native length from LE, then write BE bytes + let length = u32::from_le_bytes(raw_bytes[0..4].try_into().unwrap()); Review Comment: I think most of the other code in this repo extracts the length from a view using `as u32`. I suggest we follow the same convention ```suggestion let length = raw as u32; ``` ########## arrow-array/src/array/byte_view_array.rs: ########## @@ -605,29 +605,33 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> { /// key("bar\0") = 0x0000000000000000000062617200000004 /// ⇒ key("bar") < key("bar\0") /// ``` + /// + /// ### Why the old code failed Review Comment: I think it would be more helpful to future readers to explain here to explain how the calculation works, rather than explaining why a previous attempt didnt' work 🤔 ########## arrow-array/src/array/byte_view_array.rs: ########## @@ -1193,9 +1197,19 @@ mod tests { b"abcdefghi", b"abcdefghij", b"abcdefghijk", - b"abcdefghijkl", // 12 bytes, max inline - b"bar", - b"bar\0", // special case to test length tiebreaker + b"abcdefghijkl", + // ─────────────────────────────────────────────────────────────────────── + // This pair verifies that we didn’t accidentally reverse the inline bytes: + // without our fix, “backend one” would compare as if it were + // “eno dnekcab”, so “one” might end up sorting _after_ “two”. + b"backend one", // special case: tests byte-order reversal bug Review Comment: I wonder why this wasn't this caught withthe `xyy` and `xyz` test case? ########## arrow-array/src/array/byte_view_array.rs: ########## @@ -1193,9 +1197,19 @@ mod tests { b"abcdefghi", b"abcdefghij", b"abcdefghijk", - b"abcdefghijkl", // 12 bytes, max inline - b"bar", - b"bar\0", // special case to test length tiebreaker + b"abcdefghijkl", + // ─────────────────────────────────────────────────────────────────────── + // This pair verifies that we didn’t accidentally reverse the inline bytes: + // without our fix, “backend one” would compare as if it were + // “eno dnekcab”, so “one” might end up sorting _after_ “two”. + b"backend one", // special case: tests byte-order reversal bug + b"backend two", Review Comment: Also, given the endian swapping going on, can we please also include a few strings that are more than 256 bytes long (so the length requires 2 bytes to store)? ########## arrow-array/src/array/byte_view_array.rs: ########## @@ -605,29 +605,33 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> { /// key("bar\0") = 0x0000000000000000000062617200000004 Review Comment: I think this diagram needs be updated as well -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org