alamb commented on code in PR #12444:
URL: https://github.com/apache/datafusion/pull/12444#discussion_r1761725094
##########
datafusion/functions/src/unicode/substr.rs:
##########
@@ -186,6 +202,53 @@ fn make_and_append_view(
null_builder.append_non_null();
}
+// String characters are variable length encoded in UTF-8, `substr()` function's
+// arguments are character-based, converting them into byte-based indices
+// requires expensive decoding.
+// However, checking if a string is ASCII-only is relatively cheap.
+// If strings are ASCII only, use byte-based indices instead.
+//
+// A common pattern to call `substr()` is taking a small prefix of a long
Review Comment:
👍
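
For readers following along, the idea in the comment above can be sketched roughly as follows. This is a minimal standalone illustration, not the code in this PR: `substr_chars` is a hypothetical helper, it works on a single `&str` rather than an Arrow array, and it assumes `start >= 1` (SQL-style 1-based indexing) with a non-negative `count`.

```rust
/// Hypothetical helper (not part of DataFusion): character-based substring
/// with an ASCII fast path. `start` is 1-based; `count` is in characters.
fn substr_chars(s: &str, start: usize, count: usize) -> &str {
    // Convert the 1-based start into a number of characters to skip.
    let skip = start.saturating_sub(1);
    if s.is_ascii() {
        // Fast path: every character is one byte, so character offsets
        // can be used directly as byte offsets.
        let begin = skip.min(s.len());
        let end = begin.saturating_add(count).min(s.len());
        &s[begin..end]
    } else {
        // Slow path: walk the UTF-8 encoding to find the byte offset of
        // the character boundaries.
        let begin = s.char_indices().nth(skip).map_or(s.len(), |(i, _)| i);
        let rest = &s[begin..];
        let len = rest.char_indices().nth(count).map_or(rest.len(), |(i, _)| i);
        &rest[..len]
    }
}
```

The `is_ascii()` check scans the whole string, which is why it may not pay off when only a short prefix of a long string is requested.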
##########
datafusion/functions/src/unicode/substr.rs:
##########
@@ -186,6 +202,53 @@ fn make_and_append_view(
null_builder.append_non_null();
}
+// String characters are variable length encoded in UTF-8, `substr()` function's
+// arguments are character-based, converting them into byte-based indices
+// requires expensive decoding.
+// However, checking if a string is ASCII-only is relatively cheap.
+// If strings are ASCII only, use byte-based indices instead.
+//
+// A common pattern to call `substr()` is taking a small prefix of a long
+// string, such as `substr(long_str_with_1k_chars, 1, 32)`.
+// In such case the overhead of ASCII-validation may not be worth it, so
+// skip the validation for short prefix for now.
+fn enable_ascii_fast_path<'a, V: StringArrayType<'a>>(
+ string_array: &V,
+ start: &Int64Array,
+ count: Option<&Int64Array>,
+) -> bool {
+ let is_short_prefix = match count {
+ Some(count) => {
+ let short_prefix_threshold = 32.0;
+ let n_sample = 10;
+
+ // HACK: can be simplified if function has specialized
Review Comment:
It's a good point that this could be faster if it had a specialization for
`ScalarValue`.
Any chance you can file a ticket for this?
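
To make the suggestion concrete, here is a rough sketch of what such a specialization could buy. Everything in it is an assumption for illustration: `CountArg` is a hypothetical stand-in for "scalar or array" arguments (it is not a DataFusion type), and averaging the first few values is only a guess at how the sampled check in the diff works, since that part of the hunk is cut off.

```rust
/// Hypothetical stand-in for a `count` argument that is either one constant
/// for the whole batch or one value per row. Not a DataFusion type.
enum CountArg<'a> {
    Scalar(i64),
    Array(&'a [i64]),
}

// Values taken from the diff; how the PR uses them is an assumption here.
const SHORT_PREFIX_THRESHOLD: i64 = 32;
const N_SAMPLE: usize = 10;

/// Returns true when the requested substrings look like short prefixes.
fn is_short_prefix(count: &CountArg<'_>) -> bool {
    match count {
        // Scalar case: a single comparison, no sampling needed.
        CountArg::Scalar(n) => *n <= SHORT_PREFIX_THRESHOLD,
        // Array case: estimate from the average of the first few values.
        CountArg::Array(values) => {
            let sample = &values[..values.len().min(N_SAMPLE)];
            if sample.is_empty() {
                return false;
            }
            let avg = sample.iter().sum::<i64>() as f64 / sample.len() as f64;
            avg <= SHORT_PREFIX_THRESHOLD as f64
        }
    }
}
```

With a scalar `count`, such as in `substr(long_str_with_1k_chars, 1, 32)`, the decision becomes exact instead of a sampled estimate.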
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]