alamb commented on code in PR #7875:
URL: https://github.com/apache/arrow-rs/pull/7875#discussion_r2187650306


##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -1193,9 +1197,19 @@ mod tests {
             b"abcdefghi",
             b"abcdefghij",
             b"abcdefghijk",
-            b"abcdefghijkl", // 12 bytes, max inline
-            b"bar",
-            b"bar\0", // special case to test length tiebreaker
+            b"abcdefghijkl",
+            // 
───────────────────────────────────────────────────────────────────────
+            // This pair verifies that we didn’t accidentally reverse the 
inline bytes:
+            // without our fix, “backend one” would compare as if it were
+            //    “eno dnekcab”, so “one” might end up sorting _after_ “two”.
+            b"backend one", // special case: tests byte-order reversal bug
+            b"backend two",

Review Comment:
   Can we also please include  strings that have the same exact inline prefix 
the length differ as well as content ?
   
   Something like 
   * `LongerThan12Bytes`
   * `LongerThan12Bytez`
   * `LongerThan12Bytes\0`
    * `LongerThan12Byt`
   
   



##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -605,29 +605,33 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
     /// key("bar\0") = 0x0000000000000000000062617200000004
     /// ⇒ key("bar") < key("bar\0")
     /// ```
+    ///
+    /// ### Why the old code failed
+    /// Fixed in: <https://github.com/apache/arrow-rs/pull/7875>
+    /// In the previous implementation, we did:
+    /// ```ignore
+    /// let word_be = u128::from_le_bytes(raw.to_le_bytes()).to_be();
+    /// ```
+    /// That single `.to_be()` flips **all 16 bytes**, including the data 
region, effectively
+    /// reversing each string. As a result:
+    /// - `"backend one"` was compared as if it were `"eno dnekcab"`, 
inverting lexicographical order
+    /// - any multi‑byte string ordering was completely incorrect
     #[inline(always)]
     pub fn inline_key_fast(raw: u128) -> u128 {
-        // Convert the raw u128 (little-endian) into bytes for manipulation
         let raw_bytes = raw.to_le_bytes();
 
-        // Extract the length (first 4 bytes), convert to big-endian u32, and 
promote to u128
-        let len_le = &raw_bytes[0..4];
-        let len_be = u32::from_le_bytes(len_le.try_into().unwrap()).to_be() as 
u128;
-
-        // Extract the inline string bytes (next 12 bytes), place them into 
the lower 12 bytes of a 16-byte array,
-        // padding the upper 4 bytes with zero to form a little-endian u128 
value
-        let mut inline_bytes = [0u8; 16];
-        inline_bytes[4..16].copy_from_slice(&raw_bytes[4..16]);
-
-        // Convert to big-endian to ensure correct lexical ordering
-        let inline_u128 = u128::from_le_bytes(inline_bytes).to_be();
+        // Parse native length from LE, then write BE bytes
+        let length = u32::from_le_bytes(raw_bytes[0..4].try_into().unwrap());

Review Comment:
   I think most of the other code in this repo extracts the length from a view 
using `as u32`. I suggest we follow the same convention
   
   ```suggestion
           let length = raw as u32;
   ```



##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -605,29 +605,33 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
     /// key("bar\0") = 0x0000000000000000000062617200000004
     /// ⇒ key("bar") < key("bar\0")
     /// ```
+    ///
+    /// ### Why the old code failed

Review Comment:
   I think it would be more helpful to future readers to explain here to 
explain how the calculation works, rather than explaining why a previous 
attempt didnt' work 🤔 



##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -1193,9 +1197,19 @@ mod tests {
             b"abcdefghi",
             b"abcdefghij",
             b"abcdefghijk",
-            b"abcdefghijkl", // 12 bytes, max inline
-            b"bar",
-            b"bar\0", // special case to test length tiebreaker
+            b"abcdefghijkl",
+            // 
───────────────────────────────────────────────────────────────────────
+            // This pair verifies that we didn’t accidentally reverse the 
inline bytes:
+            // without our fix, “backend one” would compare as if it were
+            //    “eno dnekcab”, so “one” might end up sorting _after_ “two”.
+            b"backend one", // special case: tests byte-order reversal bug

Review Comment:
   I wonder why this wasn't this caught withthe `xyy` and `xyz` test case?



##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -1193,9 +1197,19 @@ mod tests {
             b"abcdefghi",
             b"abcdefghij",
             b"abcdefghijk",
-            b"abcdefghijkl", // 12 bytes, max inline
-            b"bar",
-            b"bar\0", // special case to test length tiebreaker
+            b"abcdefghijkl",
+            // 
───────────────────────────────────────────────────────────────────────
+            // This pair verifies that we didn’t accidentally reverse the 
inline bytes:
+            // without our fix, “backend one” would compare as if it were
+            //    “eno dnekcab”, so “one” might end up sorting _after_ “two”.
+            b"backend one", // special case: tests byte-order reversal bug
+            b"backend two",

Review Comment:
   Also, given the endian swapping going on, can we please also include a few 
strings that are more than 256 bytes long (so the length requires 2 bytes to 
store)?



##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -605,29 +605,33 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
     /// key("bar\0") = 0x0000000000000000000062617200000004

Review Comment:
   I think this diagram needs be updated as well



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to