Re: [PR] perf: optimize regexp_count to avoid String allocation when start position is provided [datafusion]

via GitHub Tue, 30 Dec 2025 11:28:08 -0800


comphead commented on code in PR #19553:
URL: https://github.com/apache/datafusion/pull/19553#discussion_r2653715294



##########
datafusion/functions/src/regex/regexpcount.rs:
##########
@@ -569,8 +569,16 @@ fn count_matches(
             ));
         }
 
-        let find_slice = value.chars().skip(start as usize - 1).collect::<String>();
-        let count = pattern.find_iter(find_slice.as_str()).count();
+        // Find the byte offset for the start position (1-based character index)
+        let byte_offset = value
+            .char_indices()
+            .nth((start as usize).saturating_sub(1))
+            .map(|(idx, _)| idx)
+            .unwrap_or(value.len());
+
+        // Use string slicing instead of collecting chars into a new String
+        let find_slice = &value[byte_offset..];

Review Comment:
   Thanks @viirya. A quick question: since this PR now slices the string directly by byte offset, is that UTF-8 safe?
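
For context on the question above, here is a minimal standalone sketch, not the PR's actual code. The helper names `byte_offset_for_start`, `slice_from_start`, and `collect_from_start` are hypothetical and only mirror the two sides of the diff. It relies on the fact that `str::char_indices` yields the byte position where each character starts, so the offset it returns (or `value.len()` as the fallback) should always fall on a UTF-8 character boundary, and slicing there should not panic.

```rust
/// Hypothetical helper mirroring the new code in the diff: converts a
/// 1-based character position into a byte offset into `value`.
fn byte_offset_for_start(value: &str, start: usize) -> usize {
    value
        .char_indices()
        // `char_indices` yields (byte_index, char); the byte index is
        // always the first byte of a character.
        .nth(start.saturating_sub(1))
        .map(|(idx, _)| idx)
        // Past the end of the string: fall back to the end, which is
        // also a valid boundary and gives an empty slice.
        .unwrap_or(value.len())
}

/// New approach: borrow a sub-slice, no allocation.
fn slice_from_start(value: &str, start: usize) -> &str {
    &value[byte_offset_for_start(value, start)..]
}

/// Old approach from the removed lines: collect the remaining chars
/// into a freshly allocated String.
fn collect_from_start(value: &str, start: usize) -> String {
    value.chars().skip(start.saturating_sub(1)).collect()
}
```

On multi-byte input such as `"héllo wörld"` with `start = 3`, both helpers should produce the same text, since the slice begins at the first byte of the third character rather than in the middle of `é`.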



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

