kumarUjjawal commented on code in PR #22887:
URL: https://github.com/apache/datafusion/pull/22887#discussion_r3397172088
##########
datafusion/physical-expr/src/expressions/binary/kernels.rs:
##########
@@ -273,12 +268,37 @@ macro_rules! regexp_is_match_flag {
}};
}
+/// Unpack a dictionary array to its value type, leaving other arrays as-is
+fn flatten_dictionary(array: &ArrayRef) -> Result<ArrayRef> {
+ match array.data_type() {
+ DataType::Dictionary(_, value_type) => Ok(cast(array, value_type)?),
+ _ => Ok(Arc::clone(array)),
+ }
+}
+
pub(crate) fn regex_match_dyn(
left: &ArrayRef,
right: &ArrayRef,
not_match: bool,
flag: bool,
) -> Result<ArrayRef> {
+ // Type coercion deliberately preserves `Dictionary(_, Utf8)` inputs so
+ // that literal patterns can use the dictionary fast path in
+ // `regex_match_dyn_scalar`. Here every row has its own pattern, so the
+ // dictionary encoding offers no shortcut: flatten the inputs, and if the
+ // dictionary's value type differs from the coerced pattern type (e.g.
+ // `Dictionary(Int32, Utf8)` matched against `Utf8View` patterns), cast to
+ // a common string type.
+ if matches!(left.data_type(), DataType::Dictionary(_, _))
+ || matches!(right.data_type(), DataType::Dictionary(_, _))
+ {
+ let left = flatten_dictionary(left)?;
+ let mut right = flatten_dictionary(right)?;
+ if right.data_type() != left.data_type() {
+ right = cast(&right, left.data_type())?;
Review Comment:
If the left side is `Dictionary(Int32, Utf8)` and the pattern is
`LargeUtf8`, type coercion should keep `LargeUtf8` as the common type. But here
we flatten the dictionary to `Utf8` and then cast the `LargeUtf8` pattern
back down to `Utf8`. That cast can fail for valid large strings, since `Utf8`
uses 32-bit offsets.
We could instead cast the dictionary side to the already-coerced pattern type,
or compute the common string type again.
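
To make the difference concrete, here is a rough, std-only sketch of the target-type choice. It does not use the real arrow `DataType`: `StrType`, `InputType`, `rank`, `target_as_written` and `common_string_type` are hypothetical names for illustration, and the widening order `Utf8` < `LargeUtf8` < `Utf8View` is an assumption about what coercion would pick, not something verified against the coercion code.

```rust
/// Simplified stand-in for the string-related arrow data types.
/// Hypothetical: this is not `arrow::datatypes::DataType`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StrType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// An input column: a plain string array or a dictionary over strings.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InputType {
    Plain(StrType),
    Dictionary(StrType),
}

impl InputType {
    /// The type left after flattening any dictionary encoding.
    fn value_type(self) -> StrType {
        match self {
            InputType::Plain(t) | InputType::Dictionary(t) => t,
        }
    }
}

/// Assumed widening order: Utf8 < LargeUtf8 < Utf8View.
fn rank(t: StrType) -> u8 {
    match t {
        StrType::Utf8 => 0,
        StrType::LargeUtf8 => 1,
        StrType::Utf8View => 2,
    }
}

/// Models the code as written in this hunk: the right side is always
/// cast to the flattened left type, even when that narrows it.
fn target_as_written(left: InputType, _right: InputType) -> StrType {
    left.value_type()
}

/// Models the second suggestion: recompute a common type after
/// flattening, so neither side is ever narrowed.
fn common_string_type(left: InputType, right: InputType) -> StrType {
    let (l, r) = (left.value_type(), right.value_type());
    if rank(l) >= rank(r) { l } else { r }
}
```

With `InputType::Dictionary(StrType::Utf8)` on the left and `InputType::Plain(StrType::LargeUtf8)` on the right, `target_as_written` picks `Utf8` (the narrowing cast I am worried about), while `common_string_type` picks `LargeUtf8`. In the real kernel the chosen type would then be applied to both sides with `cast`.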