Re: [PR] Implement compare operations for view types [arrow-rs]

via GitHub Mon, 17 Jun 2024 10:35:27 -0700


alamb commented on code in PR #5900:
URL: https://github.com/apache/arrow-rs/pull/5900#discussion_r1643163466



##########
arrow-ord/src/cmp.rs:
##########
@@ -538,6 +540,135 @@ impl<'a, T: ByteArrayType> ArrayOrd for &'a 
GenericByteArray<T> {
     }
 }
 
+/// Comparing two ByteView types are non-trivial.
+/// It takes a bit of patience to understand why we don't just compare two 
&[u8] directly.
+///
+/// ByteView types give us the following two advantages, and we need to be 
careful not to lose them:
+/// (1) For string/byte smaller than 12 bytes, the entire data is inlined in 
the view.
+///     Meaning that reading one array element requires only one memory access
+///     (two memory access required for StringArray, one for offset buffer, 
the other for value buffer).
+///
+/// (2) For string/byte larger than 12 bytes, we can still be faster than (for 
certain operations) StringArray/ByteArray,
+///     thanks to the inlined 4 bytes.
+///     Consider equality check:
+///     If the first four bytes of the two strings are different, we can 
return false immediately (with just one memory access).
+///     If we are unlucky and the first four bytes are the same, we need to 
fallback to compare two full strings.
+///
+/// If we directly load the data using 
`GenericByteViewArray::value_unchecked`, we load the entire string and make 
multiple memory accesses;
+/// but most of the time (eq, ord), we only only need to look at the first 4 
bytes to know the answer, thus not utilizing the inlined data.
+impl<'a, T: ByteViewType> ArrayOrd for &'a GenericByteViewArray<T> {
+    type Item = (&'a GenericByteViewArray<T>, usize);
+
+    /// # Equality check flow
+    /// (1) if both string are smaller than 12 bytes, we can directly compare 
the data inlined to the view.
+    /// (2) if any of the string is larger than 12 bytes, we need to compare 
the full string.
+    ///     (2.1) if the inlined 4 bytes are different, we can return false 
immediately.
+    ///     (2.2) o.w., we need to compare the full string.
+    ///
+    /// # Safety
+    /// (1) Indexing. The Self::Item.1 encodes the index value, which is 
already checked in `value` function,
+    ///               so it is safe to index into the views.
+    /// (2) Slice data from view. We know the bytes 4-8 are inlined data (per 
spec), so it is safe to slice from the view.
+    fn is_eq(l: Self::Item, r: Self::Item) -> bool {
+        let l_view = unsafe { l.0.views().get_unchecked(l.1) };
+        let l_len = *l_view as u32;
+
+        let r_view = unsafe { r.0.views().get_unchecked(r.1) };
+        let r_len = *r_view as u32;
+
+        if l_len != r_len {
+            return false;
+        }
+
+        if l_len <= 12 {
+            let l_data = unsafe {
+                std::slice::from_raw_parts(

Review Comment:
   A casual reader of this code might find this construction non obvious .
   
   What do you think about putting this code (to get the inlined value as a 
string) into a (inlined) functions somewhere? 
   
   Maybe as a free function in GenericByteViewArray so this code could look like
   
   ```rust
   let l_data = unsafe { GenericByteViewArray::inline_value(&l_view, l_len) };
   let r_data = unsafe { GenericByteViewArray::inline_value(&r_view, r_len) };
   ```
   
   This definitely is not required and is just a suggestion for readability 
(and we could do it as a follow on PR or never)
   
   It might make the code beloe easier to follow



##########
arrow-ord/src/cmp.rs:
##########
@@ -538,6 +540,135 @@ impl<'a, T: ByteArrayType> ArrayOrd for &'a 
GenericByteArray<T> {
     }
 }
 
+/// Comparing two ByteView types are non-trivial.
+/// It takes a bit of patience to understand why we don't just compare two 
&[u8] directly.
+///
+/// ByteView types give us the following two advantages, and we need to be 
careful not to lose them:
+/// (1) For string/byte smaller than 12 bytes, the entire data is inlined in 
the view.
+///     Meaning that reading one array element requires only one memory access
+///     (two memory access required for StringArray, one for offset buffer, 
the other for value buffer).
+///
+/// (2) For string/byte larger than 12 bytes, we can still be faster than (for 
certain operations) StringArray/ByteArray,
+///     thanks to the inlined 4 bytes.
+///     Consider equality check:
+///     If the first four bytes of the two strings are different, we can 
return false immediately (with just one memory access).
+///     If we are unlucky and the first four bytes are the same, we need to 
fallback to compare two full strings.
+///
+/// If we directly load the data using 
`GenericByteViewArray::value_unchecked`, we load the entire string and make 
multiple memory accesses;
+/// but most of the time (eq, ord), we only only need to look at the first 4 
bytes to know the answer, thus not utilizing the inlined data.
+impl<'a, T: ByteViewType> ArrayOrd for &'a GenericByteViewArray<T> {
+    type Item = (&'a GenericByteViewArray<T>, usize);

Review Comment:
   Could you please document what the `usize` here is? 
   
   It took me quite a bit of time (I eventually resorted to the debugger) to 
figure out why this impl of `ArrayOrd` uses this tuple and is different than 
the others (and it is very clever) -- i think it is that it is a way to pass 
the `index` from `value_unchecked` into the is_eq without having to do the 
lookup for the string value
   
   This is very clever
   
   
   



##########
arrow-ord/src/comparison.rs:
##########
@@ -1189,20 +1206,60 @@ mod tests {
         };
     }
 
+    macro_rules! test_utf8_view_scalar {
+        ($test_name:ident, $left:expr, $right:expr, $op:expr, $expected:expr) 
=> {
+            #[test]
+            fn $test_name() {
+                let left = StringViewArray::from($left);
+                let right = StringViewArray::new_scalar($right);
+                let res = $op(&left, &right).unwrap();
+                let expected = $expected;
+                assert_eq!(expected.len(), res.len());
+                for i in 0..res.len() {
+                    let v = res.value(i);
+                    assert_eq!(
+                        v,
+                        expected[i],
+                        "unexpected result when comparing {} at position {} to 
{} ",
+                        left.value(i),
+                        i,
+                        $right
+                    );
+                }
+            }
+        };
+    }
+
+    const LARGE_STRING: &str = "larger than 12 bytes string";
+
     test_utf8!(
         test_utf8_array_eq,
         vec!["arrow", "arrow", "arrow", "arrow"],
         vec!["arrow", "parquet", "datafusion", "flight"],
         crate::cmp::eq,
         [true, false, false, false]
     );
+    test_utf8_view!(
+        test_utf8_view_array_eq,
+        vec![LARGE_STRING, "arrow", "arrow", "arrow"],
+        vec![LARGE_STRING, "parquet", "datafusion", "flight"],
+        crate::cmp::eq,
+        [true, false, false, false]
+    );
     test_utf8_scalar!(
         test_utf8_array_eq_scalar,
         vec!["arrow", "parquet", "datafusion", "flight"],
         "arrow",
         crate::cmp::eq,
         [true, false, false, false]
     );
+    test_utf8_view_scalar!(

Review Comment:
   Can we also please add a test for comparing a "small" string literal ?



##########
arrow-ord/src/cmp.rs:
##########
@@ -538,6 +540,135 @@ impl<'a, T: ByteArrayType> ArrayOrd for &'a 
GenericByteArray<T> {
     }
 }
 
+/// Comparing two ByteView types are non-trivial.
+/// It takes a bit of patience to understand why we don't just compare two 
&[u8] directly.
+///
+/// ByteView types give us the following two advantages, and we need to be 
careful not to lose them:
+/// (1) For string/byte smaller than 12 bytes, the entire data is inlined in 
the view.
+///     Meaning that reading one array element requires only one memory access
+///     (two memory access required for StringArray, one for offset buffer, 
the other for value buffer).
+///
+/// (2) For string/byte larger than 12 bytes, we can still be faster than (for 
certain operations) StringArray/ByteArray,
+///     thanks to the inlined 4 bytes.
+///     Consider equality check:
+///     If the first four bytes of the two strings are different, we can 
return false immediately (with just one memory access).
+///     If we are unlucky and the first four bytes are the same, we need to 
fallback to compare two full strings.
+///
+/// If we directly load the data using 
`GenericByteViewArray::value_unchecked`, we load the entire string and make 
multiple memory accesses;
+/// but most of the time (eq, ord), we only only need to look at the first 4 
bytes to know the answer, thus not utilizing the inlined data.
+impl<'a, T: ByteViewType> ArrayOrd for &'a GenericByteViewArray<T> {
+    type Item = (&'a GenericByteViewArray<T>, usize);
+
+    /// # Equality check flow
+    /// (1) if both string are smaller than 12 bytes, we can directly compare 
the data inlined to the view.
+    /// (2) if any of the string is larger than 12 bytes, we need to compare 
the full string.
+    ///     (2.1) if the inlined 4 bytes are different, we can return false 
immediately.
+    ///     (2.2) o.w., we need to compare the full string.
+    ///
+    /// # Safety
+    /// (1) Indexing. The Self::Item.1 encodes the index value, which is 
already checked in `value` function,
+    ///               so it is safe to index into the views.
+    /// (2) Slice data from view. We know the bytes 4-8 are inlined data (per 
spec), so it is safe to slice from the view.
+    fn is_eq(l: Self::Item, r: Self::Item) -> bool {
+        let l_view = unsafe { l.0.views().get_unchecked(l.1) };
+        let l_len = *l_view as u32;
+
+        let r_view = unsafe { r.0.views().get_unchecked(r.1) };
+        let r_len = *r_view as u32;
+
+        if l_len != r_len {
+            return false;
+        }
+
+        if l_len <= 12 {
+            let l_data = unsafe {
+                std::slice::from_raw_parts(
+                    (l_view as *const u128 as *const u8).wrapping_add(4),
+                    l_len as usize,
+                )
+            };
+            let r_data = unsafe {
+                std::slice::from_raw_parts(
+                    (r_view as *const u128 as *const u8).wrapping_add(4),
+                    r_len as usize,
+                )
+            };
+            l_data == r_data
+        } else {
+            let l_inlined_data = unsafe {
+                std::slice::from_raw_parts((l_view as *const u128 as *const 
u8).wrapping_add(4), 4)
+            };
+            let r_inlined_data = unsafe {
+                std::slice::from_raw_parts((r_view as *const u128 as *const 
u8).wrapping_add(4), 4)
+            };
+            if l_inlined_data != r_inlined_data {
+                return false;
+            }
+
+            let l_full_data: &[u8] = unsafe { 
l.0.value_unchecked(l.1).as_ref() };
+            let r_full_data: &[u8] = unsafe { 
r.0.value_unchecked(r.1).as_ref() };
+            l_full_data == r_full_data
+        }
+    }
+
+    /// # Ordering check flow
+    /// (1) if both string are smaller than 12 bytes, we can directly compare 
the data inlined to the view.
+    /// (2) if any of the string is larger than 12 bytes, we need to compare 
the full string.
+    ///     (2.1) if the inlined 4 bytes are different, we can return the 
result immediately.
+    ///     (2.2) o.w., we need to compare the full string.
+    ///
+    /// # Safety
+    /// (1) Indexing. The Self::Item.1 encodes the index value, which is 
already checked in `value` function,
+    ///              so it is safe to index into the views.
+    /// (2) Slice data from view. We know the bytes 4-8 are inlined data (per 
spec), so it is safe to slice from the view.
+    fn is_lt(l: Self::Item, r: Self::Item) -> bool {
+        let l_view = l.0.views().get(l.1).unwrap();
+        let l_len = *l_view as u32;
+
+        let r_view = r.0.views().get(r.1).unwrap();
+        let r_len = *r_view as u32;
+
+        if l_len <= 12 && r_len <= 12 {
+            let l_data = unsafe {
+                std::slice::from_raw_parts(
+                    (l_view as *const u128 as *const u8).wrapping_add(4),
+                    l_len as usize,
+                )
+            };
+            let r_data = unsafe {
+                std::slice::from_raw_parts(
+                    (r_view as *const u128 as *const u8).wrapping_add(4),
+                    r_len as usize,
+                )
+            };
+            return l_data < r_data;
+        }
+        // one of the string is larger than 12 bytes,
+        // we then try to compare the inlined data first

Review Comment:
   This likewise might be helpful to put in a `#inline(always)` function to 
make this easier for future readers to understand
   
   ```rust
   let l_inlined_data = GenericByteViewArray::inlined_prefix_value(&l_view);
   let r_inlined_data = GenericByteViewArray::inlined_prefix_value(&r_view);
   ```



##########
arrow-ord/src/cmp.rs:
##########
@@ -538,6 +540,135 @@ impl<'a, T: ByteArrayType> ArrayOrd for &'a 
GenericByteArray<T> {
     }
 }
 
+/// Comparing two ByteView types are non-trivial.
+/// It takes a bit of patience to understand why we don't just compare two 
&[u8] directly.

Review Comment:
   😮  🏆 🥇  for the comments



##########
arrow-ord/src/comparison.rs:
##########
@@ -1189,20 +1206,60 @@ mod tests {
         };
     }
 
+    macro_rules! test_utf8_view_scalar {
+        ($test_name:ident, $left:expr, $right:expr, $op:expr, $expected:expr) 
=> {
+            #[test]
+            fn $test_name() {
+                let left = StringViewArray::from($left);
+                let right = StringViewArray::new_scalar($right);
+                let res = $op(&left, &right).unwrap();
+                let expected = $expected;
+                assert_eq!(expected.len(), res.len());
+                for i in 0..res.len() {
+                    let v = res.value(i);
+                    assert_eq!(
+                        v,
+                        expected[i],
+                        "unexpected result when comparing {} at position {} to 
{} ",
+                        left.value(i),
+                        i,
+                        $right
+                    );
+                }
+            }
+        };
+    }
+
+    const LARGE_STRING: &str = "larger than 12 bytes string";
+
     test_utf8!(
         test_utf8_array_eq,
         vec!["arrow", "arrow", "arrow", "arrow"],
         vec!["arrow", "parquet", "datafusion", "flight"],
         crate::cmp::eq,
         [true, false, false, false]
     );
+    test_utf8_view!(
+        test_utf8_view_array_eq,
+        vec![LARGE_STRING, "arrow", "arrow", "arrow"],

Review Comment:
   Can we please add comparisons to all these tests for 
   1. 2 large strings that have the same first 4 bytes but are different in the 
remaining bytes
   2. 1 large and 1 small string that are the same in the first 4 bytes but 
different in the remaining btes
   3. 2 small strings that are the same in the first 4 byes (I realize this 
won't exercise a different code path but would be good to test)
   
   i think otherwise the tests cover all the important cases



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Implement compare operations for view types [arrow-rs]

Reply via email to