Re: [PR] Improve StringArray(Utf8) sort performance (~2-4x faster) [arrow-rs]

via GitHub Sat, 02 Aug 2025 01:18:26 -0700


zhuqi-lucas commented on code in PR #7860:
URL: https://github.com/apache/arrow-rs/pull/7860#discussion_r2249166605



##########
arrow-ord/src/sort.rs:
##########
@@ -295,12 +295,85 @@ fn sort_bytes<T: ByteArrayType>(
     options: SortOptions,
     limit: Option<usize>,
 ) -> UInt32Array {
-    let mut valids = value_indices
+    // Note: Why do we use 4‑byte prefix?
+    // Compute the 4‑byte prefix in BE order, or left‑pad if shorter.
+    // Most byte‐sequences differ in their first few bytes, so by
+    // comparing up to 4 bytes as a single u32 we avoid the overhead
+    // of a full lexicographical compare for the vast majority of cases.
+
+    // 1. Build a vector of (index, prefix, length) tuples
+    let mut valids: Vec<(u32, u32, u64)> = value_indices

Review Comment:
   Good point @alamb , i think it's safe for us to do the cut here because:
   
   In our sort_bytes implementation we always pull the first up to 4 bytes as 
raw &[u8] (via values.value(...).as_ref()), pack them into a u32, and compare 
at the byte level. That means we never try to decode those 4 bytes back into a 
&str, so there’s no risk of a UTF-8 panic in the hot path.
   
   And we don't need to use it, just to compare.
   
   I will also try to add this in fuzz testing, thanks!
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Improve StringArray(Utf8) sort performance (~2-4x faster) [arrow-rs]

Reply via email to