findepi commented on code in PR #12444:
URL: https://github.com/apache/datafusion/pull/12444#discussion_r1759057882
##########
datafusion/functions/src/unicode/substr.rs:
##########
@@ -197,6 +213,29 @@ fn string_view_substr(
let start_array = as_int64_array(&args[0])?;
+ // Notes for ASCII-only optimization:
+ //
+ // String characters are variable length encoded in UTF-8, `substr()`
function's
+ // arguments are character-based, converting them into byte-based indices
+ // requires expensive decoding.
+ // However, checking if a string is ASCII-only is relatively cheap.
+ // If strings are ASCII only, use byte-based indices instead.
+ //
+ // A common pattern to call `substr()` is taking a small prefix of a long
+ // string, such as `substr(long_str_with_1k_chars, 1, 32)`.
+ // In such case the overhead of ASCII-validation may not be worth it, so
+ // skip the validation for long strings for now.
Review Comment:
Why not check only the requested string prefix for being ascii?
could `string_view_array.is_ascii` variant validate string prefixes of
given length why still being vectorized?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]