LiaCastaneda commented on code in PR #22853:
URL: https://github.com/apache/datafusion/pull/22853#discussion_r3518559616
##########
datafusion/expr/src/higher_order_function.rs:
##########
@@ -239,6 +239,15 @@ pub struct LambdaArgument {
/// For example, for `array_transform([2], v -> -v)`,
/// this will be `vec![Field::new("v", DataType::Int32, true)]`
params: Vec<FieldRef>,
+ /// Indices into `params` of the parameters that are actually referenced
+ /// by `body` (taking nested-lambda shadowing into account).
+ ///
+ /// `None` means "no information, assume every declared parameter is used"
+ /// — that is the backwards-compatible behavior of [`Self::new`]. When set,
+ /// [`Self::evaluate`] skips evaluating and pushing the closures for the
+ /// parameters not listed here, so unused declared parameters do not shift
+ /// the columns the body's compressed indices expect.
+ used_param_indices: Option<Vec<usize>>,
Review Comment:
If we pass `None` in `new()` and assume all parameters are used, would it be
better to compute the indices directly from `params: Vec<FieldRef>`? That way
this field doesn't need to be an `Option` it always holds the used parameter
indices, which happen to be all of them when called via `new()`
##########
datafusion/expr/src/higher_order_function.rs:
##########
@@ -257,26 +266,64 @@ pub struct LambdaArgument {
}
impl LambdaArgument {
+ /// Build a [`LambdaArgument`] that treats every declared parameter as
+ /// used. This is the backwards-compatible behavior. Prefer
+ /// [`Self::new_with_used_params`] when the caller knows which subset of
+ /// the lambda's parameters the body actually references — otherwise the
+ /// merged batch will still contain columns for unused parameters.
pub fn new(
params: Vec<FieldRef>,
body: Arc<dyn PhysicalExpr>,
captures: Option<RecordBatch>,
) -> Self {
+ Self::new_with_used_params(params, body, captures, None)
+ }
+
+ /// Build a [`LambdaArgument`] knowing which subset of `params` (by name)
+ /// the lambda body actually references.
+ ///
+ /// When `used_params` is `Some(set)`, [`Self::evaluate`] only evaluates
+ /// and pushes the closures whose corresponding parameter name is in
+ /// `set`, in the original declaration order of `params`. Unused declared
+ /// parameters leave no slot in the merged batch, so the body's compressed
+ /// column indices line up directly. When `used_params` is `None`,
+ /// behavior is identical to [`Self::new`].
Review Comment:
Should we just require `used_params` to be set in this function? Otherwise
the behavior is the same as `new`, so I'm not sure why we'd need both for that
case
##########
datafusion/physical-expr/src/expressions/lambda.rs:
##########
@@ -129,6 +141,7 @@ impl LambdaExpr {
body,
projected_body,
projection,
+ used_params: used_param_names,
Review Comment:
Since `used_param_names` is already computed here during construction, could
the fix live entirely in `projected_body` instead? Rather than threading
`used_params` through to `LambdaArgument`, then LambdaArgument stays unchanged
and external callers don't need to think about used params at all
Left a rough idea here
https://github.com/LiaCastaneda/datafusion/tree/lia/lambda-body-centric-fix
##########
datafusion/expr/src/higher_order_function.rs:
##########
Review Comment:
Also, the body is dynamic, which technically means for higher order
functions indices positions can differ for each query no? for example if you
have a Higher Order function with parameters (x,y,x) for a given query you can
use x,y or y or all of them. This essentially means external callers of
`LambdaArgument::new_with_used_params` would have to walk the body themselves
to figure out which params are referenced.
##########
datafusion/expr/src/higher_order_function.rs:
##########
Review Comment:
Providing an api where we have to specify the indices of the parameters that
will be used in the body feels a bit unergonomic. Have you considered somehow
extracting this from the body and handling this internally so it's invisible to
the caller?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]