Jefffrey commented on code in PR #19529:
URL: https://github.com/apache/datafusion/pull/19529#discussion_r2650075672
##########
datafusion/functions/src/string/contains.rs:
##########
@@ -89,51 +89,82 @@ impl ScalarUDFImpl for ContainsFunc {
}
fn invoke_with_args(&self, args: ScalarFunctionArgs) ->
Result<ColumnarValue> {
- make_scalar_function(contains, vec![])(&args.args)
+ make_scalar_function_columnar(contains)(&args.args)
}
fn documentation(&self) -> Option<&Documentation> {
self.doc()
}
}
+/// Converts a `ColumnarValue` to a value that implements `Datum` for use with
arrow kernels.
+/// If the value is a scalar, wraps the single-element array in `Scalar` to
signal to arrow
+/// that this is a scalar value (enabling optimized code paths).
+fn columnar_to_datum(value: &ColumnarValue) -> Result<(ArrayRef, bool)> {
Review Comment:
Docstring confusing here since we aren't doing the wrap in Scalar
##########
datafusion/functions/src/string/contains.rs:
##########
@@ -89,51 +89,82 @@ impl ScalarUDFImpl for ContainsFunc {
}
fn invoke_with_args(&self, args: ScalarFunctionArgs) ->
Result<ColumnarValue> {
- make_scalar_function(contains, vec![])(&args.args)
+ make_scalar_function_columnar(contains)(&args.args)
}
fn documentation(&self) -> Option<&Documentation> {
self.doc()
}
}
+/// Converts a `ColumnarValue` to a value that implements `Datum` for use with
arrow kernels.
+/// If the value is a scalar, wraps the single-element array in `Scalar` to
signal to arrow
+/// that this is a scalar value (enabling optimized code paths).
+fn columnar_to_datum(value: &ColumnarValue) -> Result<(ArrayRef, bool)> {
+ match value {
+ ColumnarValue::Array(array) => Ok((Arc::clone(array), false)),
+ ColumnarValue::Scalar(scalar) => Ok((scalar.to_array()?, true)),
+ }
+}
+
+/// Helper to call arrow_contains with proper Datum handling.
+/// When an argument is marked as scalar, we wrap it in `Scalar` to tell
arrow's
+/// kernel to use the optimized single-value code path instead of iterating.
+fn call_arrow_contains(
+ haystack: &ArrayRef,
+ haystack_is_scalar: bool,
+ needle: &ArrayRef,
+ needle_is_scalar: bool,
+) -> Result<ColumnarValue> {
+ // Arrow's Datum trait is implemented for ArrayRef, Arc<dyn Array>, and
Scalar<T>
+ // We pass ArrayRef directly when not scalar, or wrap in Scalar when it is
+ let result = match (haystack_is_scalar, needle_is_scalar) {
+ (false, false) => arrow_contains(haystack, needle)?,
+ (false, true) => arrow_contains(haystack,
&Scalar::new(Arc::clone(needle)))?,
+ (true, false) => arrow_contains(&Scalar::new(Arc::clone(haystack)),
needle)?,
+ (true, true) => arrow_contains(
+ &Scalar::new(Arc::clone(haystack)),
+ &Scalar::new(Arc::clone(needle)),
+ )?,
Review Comment:
I wonder if we could implement `Datum` on `ColumnerValue` (or at least on
`ScalarValue`), so we wouldn't need to do this check & wrapping logic in each
function we optimize 🤔
##########
datafusion/functions/src/utils.rs:
##########
@@ -24,6 +24,22 @@ use datafusion_expr::ColumnarValue;
use datafusion_expr::function::Hint;
use std::sync::Arc;
+/// Creates a scalar function implementation that receives `ColumnarValue`
directly.
+///
+/// Unlike `make_scalar_function`, this does NOT expand scalar arguments into
arrays.
+/// This allows the inner function to handle scalars more efficiently, for
example
+/// by using Arrow's `Datum` trait which has optimized paths for scalar
arguments.
+///
+/// * `inner` - the function to be executed, receives `ColumnarValue`
arguments directly
+pub fn make_scalar_function_columnar<F>(
Review Comment:
I don't see a value add to this function; the point of
`make_scalar_function` was to make it simpler for functions to implement based
on only arrays without considering columnarvalues; that way they can opt into a
manual implementation that does take into account columnarvalues.
`make_scalar_function_columnar` here just seems to be a thin wrapper that
doesn't do anything except call the passed in function? In which case it would
be better for the UDF (`contains`) to just put the implementing code inside
invoke itself (or have invoke call this passed in `inner` function)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]