kosiew commented on code in PR #22239:
URL: https://github.com/apache/datafusion/pull/22239#discussion_r3292963939


##########
datafusion/functions/src/core/getfield.rs:
##########
@@ -828,4 +924,181 @@ mod tests {
         let args = vec![ExpressionPlacement::Literal, 
ExpressionPlacement::Literal];
         assert_eq!(func.placement(&args), ExpressionPlacement::KeepInPlace);
     }
+
+    // --- get_field over struct constructor simplification 
--------------------
+
+    use datafusion_common::Column;
+    use datafusion_expr::simplify::SimplifyContext;
+
+    /// A non-empty string literal expression.
+    fn lit_str(s: &str) -> Expr {
+        Expr::Literal(ScalarValue::Utf8(Some(s.to_string())), None)
+    }
+
+    /// A column reference expression.
+    fn col(name: &str) -> Expr {
+        Expr::Column(Column::from_name(name))
+    }
+
+    fn scalar_fn(udf: ScalarUDF, args: Vec<Expr>) -> Expr {
+        Expr::ScalarFunction(ScalarFunction::new_udf(Arc::new(udf), args))
+    }
+
+    /// `named_struct(name1, value1, name2, value2, ...)`.
+    fn named_struct(pairs: Vec<(&str, Expr)>) -> Expr {
+        let args = pairs
+            .into_iter()
+            .flat_map(|(name, value)| [lit_str(name), value])
+            .collect();
+        scalar_fn(ScalarUDF::new_from_impl(NamedStructFunc::new()), args)
+    }
+
+    /// `struct(value0, value1, ...)`.
+    fn struct_fn(values: Vec<Expr>) -> Expr {
+        scalar_fn(ScalarUDF::new_from_impl(StructFunc::new()), values)
+    }
+
+    /// `get_field(args...)`.
+    fn get_field(args: Vec<Expr>) -> Expr {
+        scalar_fn(ScalarUDF::new_from_impl(GetFieldFunc::new()), args)
+    }
+
+    /// Run `GetFieldFunc::simplify` once and return the rewritten expression,
+    /// panicking if the input was left unchanged.
+    fn simplified(args: Vec<Expr>) -> Expr {
+        match GetFieldFunc::new()
+            .simplify(args, &SimplifyContext::default())
+            .unwrap()
+        {
+            ExprSimplifyResult::Simplified(expr) => expr,
+            ExprSimplifyResult::Original(args) => {
+                panic!("expected the expression to be simplified, got 
{args:?}")
+            }
+        }
+    }
+
+    /// Assert that `GetFieldFunc::simplify` leaves the arguments unchanged.
+    fn assert_not_simplified(args: Vec<Expr>) {
+        match GetFieldFunc::new()
+            .simplify(args.clone(), &SimplifyContext::default())
+            .unwrap()
+        {
+            ExprSimplifyResult::Original(unchanged) => assert_eq!(unchanged, 
args),
+            ExprSimplifyResult::Simplified(expr) => {
+                panic!("expected no simplification, got {expr:?}")
+            }
+        }
+    }
+
+    #[test]
+    fn simplify_get_field_named_struct_returns_matching_value() {
+        // get_field(named_struct('min', a, 'max', b), 'max') => b
+        let expr = get_field(vec![
+            named_struct(vec![("min", col("a")), ("max", col("b"))]),
+            lit_str("max"),
+        ]);
+        let Expr::ScalarFunction(ScalarFunction { args, .. }) = expr else {
+            unreachable!()
+        };
+        assert_eq!(simplified(args), col("b"));
+    }
+

Review Comment:
   Small readability nit: this test reconstructs `get_field(...)`, immediately 
destructures it back into `args`, and then calls `simplified(args)`. The 
neighboring tests already pass `args` directly, which feels a bit easier to 
follow. It may be worth switching this one to the same style for consistency.



##########
datafusion/sqllogictest/test_files/order.slt:
##########
@@ -1705,15 +1705,16 @@ EXPLAIN SELECT named_struct('sum', a + b) AS s FROM 
ordered ORDER BY s['sum'];
 ----
 physical_plan DataSourceExec: file_groups={1 group: 
[[WORKSPACE_ROOT/datafusion/sqllogictest/data/composite_order.csv]]}, 
projection=[named_struct(sum, a@0 + b@1) as s], file_type=csv, has_header=true
 
-# Wrapping a non-ordered column into a struct — SortExec required
+# Wrapping a non-ordered column into a struct — SortExec required.
 # Reuses the `ordered` table above which has WITH ORDER (a + b).
+# The simplifier resolves `get_field(named_struct(...), 'a')` so the sort key
+# is not extracted into a separate scan projection column.
 query TT
 EXPLAIN SELECT named_struct('a', a, 'b', b) AS s FROM ordered ORDER BY s['a'];
 ----
 physical_plan
-01)ProjectionExec: expr=[s@0 as s]
-02)--SortExec: expr=[__datafusion_extracted_1@1 ASC NULLS LAST], 
preserve_partitioning=[false]
-03)----DataSourceExec: file_groups={1 group: 
[[WORKSPACE_ROOT/datafusion/sqllogictest/data/composite_order.csv]]}, 
projection=[named_struct(a, a@0, b, b@1) as s, get_field(named_struct(a, a@0, 
b, b@1), a) as __datafusion_extracted_1], file_type=csv, has_header=true
+01)SortExec: expr=[get_field(s@0, a) ASC NULLS LAST], 
preserve_partitioning=[false]
+02)--DataSourceExec: file_groups={1 group: 
[[WORKSPACE_ROOT/datafusion/sqllogictest/data/composite_order.csv]]}, 
projection=[named_struct(a, a@0, b, b@1) as s], file_type=csv, has_header=true
 
 # Simple column ordering tests using a table ordered by (a)
 statement ok

Review Comment:
   One thing that stood out in the updated plan is that the sort key now 
evaluates `get_field(s@0, a)` against the materialized struct instead of 
sorting directly on an extracted column. Functionally this looks correct, but 
it may introduce some extra per-row sort-key evaluation work compared with the 
previous shape. Might be worth tracking in a follow-up issue whether physical 
sort-key normalization could simplify `get_field(named_struct(...), f)` back 
into a direct column reference.



##########
datafusion/functions/src/core/getfield.rs:
##########
@@ -249,6 +252,96 @@ fn extract_single_field(base: ColumnarValue, name: 
ScalarValue) -> Result<Column
     }
 }
 
+/// Try to simplify a `get_field` call whose base is an inline struct
+/// constructor by resolving the field access at plan time.
+///
+/// Handles both struct constructors:
+/// * `named_struct('a', x, 'b', y)` — fields are looked up by name.
+/// * `struct(x, y)` — fields are positional and named `c0`, `c1`, ...
+///
+/// For example:
+/// * `get_field(named_struct('min', a, 'max', b), 'max')` => `b`
+/// * `get_field(struct(a, b), 'c1')` => `b`
+///
+/// `args` is the (already flattened) argument list of the `get_field` call:
+/// `[base, field_name, rest_of_path...]`. When extra path elements remain
+/// after resolving the first one (`get_field(named_struct('s', inner), 's', 
'k')`),
+/// the resolved value is re-wrapped in a `get_field` call for the remaining
+/// path so the simplifier can recurse into it on the next pass.
+///
+/// Returns `None` — leaving the expression untouched — whenever the rewrite
+/// cannot be proven safe, e.g. a non-literal field name, a `named_struct`
+/// with a non-literal field name (which might shadow the requested field at
+/// runtime), or a field the constructor does not produce.
+fn simplify_get_field_over_struct_constructor(args: &[Expr]) -> Option<Expr> {
+    let [base, field_name, rest @ ..] = args else {
+        return None;
+    };
+
+    // The accessed field name must be a non-empty string literal.
+    let Expr::Literal(field_name, _) = field_name else {
+        return None;
+    };
+    let field_name = field_name
+        .try_as_str()
+        .flatten()
+        .filter(|s| !s.is_empty())?;
+
+    let Expr::ScalarFunction(ScalarFunction {
+        func,
+        args: ctor_args,
+    }) = base
+    else {
+        return None;
+    };
+
+    let value = if func.inner().is::<NamedStructFunc>() {
+        // named_struct(name1, value1, name2, value2, ...)
+        if !ctor_args.len().is_multiple_of(2) {
+            return None;
+        }
+        let mut matched = None;
+        for pair in ctor_args.chunks_exact(2) {
+            // Every name must be a literal string: a non-literal name could
+            // evaluate to `field_name` at runtime and shadow a later match,
+            // so we cannot safely pick a field unless all names are known.
+            let Expr::Literal(name, _) = &pair[0] else {
+                return None;
+            };
+            let name = name.try_as_str().flatten()?;
+            // `column_by_name` resolves to the first match, so do the same.
+            if matched.is_none() && name == field_name {
+                matched = Some(&pair[1]);
+            }
+        }
+        matched?.clone()
+    } else if func.inner().is::<StructFunc>() {
+        // struct(value0, value1, ...) produces fields named c0, c1, ...
+        let index: usize = field_name.strip_prefix('c')?.parse().ok()?;
+        // Reject non-canonical spellings (e.g. "c01") that name no real field.
+        if format!("c{index}") != field_name {
+            return None;
+        }
+        ctor_args.get(index)?.clone()
+    } else {
+        return None;
+    };
+
+    if rest.is_empty() {
+        return Some(value);
+    }
+
+    // Remaining path elements: re-wrap as get_field(value, rest...) and let
+    // the simplifier resolve the rest on a subsequent pass.
+    let mut new_args = Vec::with_capacity(rest.len() + 1);
+    new_args.push(value);
+    new_args.extend_from_slice(rest);
+    Some(Expr::ScalarFunction(ScalarFunction::new_udf(
+        Arc::new(ScalarUDF::new_from_impl(GetFieldFunc::new())),
+        new_args,
+    )))
+}
+
 impl GetFieldFunc {
     pub fn new() -> Self {
         Self {

Review Comment:
   This simplification can eliminate evaluation of struct fields that are never 
accessed, for example `get_field(named_struct('a', 1/0, 'b', x), 'b')` 
simplifying down to `x`. That matches the existing immutable-expression 
optimizer contract, but I think a short doc comment here would help future 
readers understand that unused field expressions are intentionally not 
evaluated after simplification.



##########
datafusion/functions/src/core/getfield.rs:
##########
@@ -249,6 +252,96 @@ fn extract_single_field(base: ColumnarValue, name: 
ScalarValue) -> Result<Column
     }
 }
 
+/// Try to simplify a `get_field` call whose base is an inline struct
+/// constructor by resolving the field access at plan time.
+///
+/// Handles both struct constructors:
+/// * `named_struct('a', x, 'b', y)` — fields are looked up by name.
+/// * `struct(x, y)` — fields are positional and named `c0`, `c1`, ...
+///
+/// For example:
+/// * `get_field(named_struct('min', a, 'max', b), 'max')` => `b`
+/// * `get_field(struct(a, b), 'c1')` => `b`
+///
+/// `args` is the (already flattened) argument list of the `get_field` call:
+/// `[base, field_name, rest_of_path...]`. When extra path elements remain
+/// after resolving the first one (`get_field(named_struct('s', inner), 's', 
'k')`),
+/// the resolved value is re-wrapped in a `get_field` call for the remaining
+/// path so the simplifier can recurse into it on the next pass.
+///
+/// Returns `None` — leaving the expression untouched — whenever the rewrite
+/// cannot be proven safe, e.g. a non-literal field name, a `named_struct`
+/// with a non-literal field name (which might shadow the requested field at
+/// runtime), or a field the constructor does not produce.
+fn simplify_get_field_over_struct_constructor(args: &[Expr]) -> Option<Expr> {
+    let [base, field_name, rest @ ..] = args else {
+        return None;
+    };
+
+    // The accessed field name must be a non-empty string literal.
+    let Expr::Literal(field_name, _) = field_name else {
+        return None;
+    };
+    let field_name = field_name
+        .try_as_str()
+        .flatten()
+        .filter(|s| !s.is_empty())?;
+

Review Comment:
   `try_as_str().flatten()` correctly handles `ScalarValue::Utf8(None)` here, 
so the guard is safe. It could still be nice to add a dedicated test for a null 
literal field name, something like 
`simplify_get_field_null_field_name_left_alone`, just to document the invariant 
explicitly.



##########
datafusion/functions/src/core/getfield.rs:
##########
@@ -249,6 +252,96 @@ fn extract_single_field(base: ColumnarValue, name: 
ScalarValue) -> Result<Column
     }
 }
 
+/// Try to simplify a `get_field` call whose base is an inline struct
+/// constructor by resolving the field access at plan time.
+///
+/// Handles both struct constructors:
+/// * `named_struct('a', x, 'b', y)` — fields are looked up by name.
+/// * `struct(x, y)` — fields are positional and named `c0`, `c1`, ...
+///
+/// For example:
+/// * `get_field(named_struct('min', a, 'max', b), 'max')` => `b`
+/// * `get_field(struct(a, b), 'c1')` => `b`
+///
+/// `args` is the (already flattened) argument list of the `get_field` call:
+/// `[base, field_name, rest_of_path...]`. When extra path elements remain
+/// after resolving the first one (`get_field(named_struct('s', inner), 's', 
'k')`),
+/// the resolved value is re-wrapped in a `get_field` call for the remaining
+/// path so the simplifier can recurse into it on the next pass.
+///
+/// Returns `None` — leaving the expression untouched — whenever the rewrite
+/// cannot be proven safe, e.g. a non-literal field name, a `named_struct`
+/// with a non-literal field name (which might shadow the requested field at
+/// runtime), or a field the constructor does not produce.
+fn simplify_get_field_over_struct_constructor(args: &[Expr]) -> Option<Expr> {
+    let [base, field_name, rest @ ..] = args else {
+        return None;
+    };
+
+    // The accessed field name must be a non-empty string literal.
+    let Expr::Literal(field_name, _) = field_name else {
+        return None;
+    };
+    let field_name = field_name
+        .try_as_str()
+        .flatten()
+        .filter(|s| !s.is_empty())?;
+
+    let Expr::ScalarFunction(ScalarFunction {
+        func,
+        args: ctor_args,
+    }) = base
+    else {
+        return None;
+    };
+
+    let value = if func.inner().is::<NamedStructFunc>() {
+        // named_struct(name1, value1, name2, value2, ...)
+        if !ctor_args.len().is_multiple_of(2) {
+            return None;
+        }
+        let mut matched = None;
+        for pair in ctor_args.chunks_exact(2) {
+            // Every name must be a literal string: a non-literal name could
+            // evaluate to `field_name` at runtime and shadow a later match,
+            // so we cannot safely pick a field unless all names are known.
+            let Expr::Literal(name, _) = &pair[0] else {
+                return None;
+            };
+            let name = name.try_as_str().flatten()?;
+            // `column_by_name` resolves to the first match, so do the same.
+            if matched.is_none() && name == field_name {
+                matched = Some(&pair[1]);
+            }
+        }
+        matched?.clone()
+    } else if func.inner().is::<StructFunc>() {
+        // struct(value0, value1, ...) produces fields named c0, c1, ...
+        let index: usize = field_name.strip_prefix('c')?.parse().ok()?;
+        // Reject non-canonical spellings (e.g. "c01") that name no real field.
+        if format!("c{index}") != field_name {
+            return None;
+        }
+        ctor_args.get(index)?.clone()

Review Comment:
   The early `return None` on a dynamic field name makes sense because a later 
runtime name could shadow an earlier literal match. One subtle detail is that 
this also prevents simplification even when a matching literal field was 
already seen earlier in the loop. It may be worth calling out in the comment 
that this is a deliberate conservative approximation rather than an accidental 
missed optimization.



##########
datafusion/functions/src/core/getfield.rs:
##########
@@ -249,6 +252,96 @@ fn extract_single_field(base: ColumnarValue, name: 
ScalarValue) -> Result<Column
     }
 }
 
+/// Try to simplify a `get_field` call whose base is an inline struct
+/// constructor by resolving the field access at plan time.
+///
+/// Handles both struct constructors:
+/// * `named_struct('a', x, 'b', y)` — fields are looked up by name.
+/// * `struct(x, y)` — fields are positional and named `c0`, `c1`, ...
+///
+/// For example:
+/// * `get_field(named_struct('min', a, 'max', b), 'max')` => `b`
+/// * `get_field(struct(a, b), 'c1')` => `b`
+///
+/// `args` is the (already flattened) argument list of the `get_field` call:
+/// `[base, field_name, rest_of_path...]`. When extra path elements remain
+/// after resolving the first one (`get_field(named_struct('s', inner), 's', 
'k')`),
+/// the resolved value is re-wrapped in a `get_field` call for the remaining
+/// path so the simplifier can recurse into it on the next pass.
+///
+/// Returns `None` — leaving the expression untouched — whenever the rewrite
+/// cannot be proven safe, e.g. a non-literal field name, a `named_struct`
+/// with a non-literal field name (which might shadow the requested field at
+/// runtime), or a field the constructor does not produce.
+fn simplify_get_field_over_struct_constructor(args: &[Expr]) -> Option<Expr> {
+    let [base, field_name, rest @ ..] = args else {
+        return None;
+    };
+
+    // The accessed field name must be a non-empty string literal.
+    let Expr::Literal(field_name, _) = field_name else {
+        return None;
+    };
+    let field_name = field_name
+        .try_as_str()
+        .flatten()
+        .filter(|s| !s.is_empty())?;
+
+    let Expr::ScalarFunction(ScalarFunction {
+        func,
+        args: ctor_args,
+    }) = base
+    else {
+        return None;
+    };
+
+    let value = if func.inner().is::<NamedStructFunc>() {
+        // named_struct(name1, value1, name2, value2, ...)
+        if !ctor_args.len().is_multiple_of(2) {
+            return None;
+        }
+        let mut matched = None;
+        for pair in ctor_args.chunks_exact(2) {
+            // Every name must be a literal string: a non-literal name could
+            // evaluate to `field_name` at runtime and shadow a later match,
+            // so we cannot safely pick a field unless all names are known.
+            let Expr::Literal(name, _) = &pair[0] else {
+                return None;
+            };
+            let name = name.try_as_str().flatten()?;
+            // `column_by_name` resolves to the first match, so do the same.
+            if matched.is_none() && name == field_name {
+                matched = Some(&pair[1]);
+            }
+        }
+        matched?.clone()
+    } else if func.inner().is::<StructFunc>() {
+        // struct(value0, value1, ...) produces fields named c0, c1, ...
+        let index: usize = field_name.strip_prefix('c')?.parse().ok()?;
+        // Reject non-canonical spellings (e.g. "c01") that name no real field.
+        if format!("c{index}") != field_name {
+            return None;
+        }
+        ctor_args.get(index)?.clone()
+    } else {
+        return None;
+    };
+
+    if rest.is_empty() {
+        return Some(value);
+    }
+
+    // Remaining path elements: re-wrap as get_field(value, rest...) and let
+    // the simplifier resolve the rest on a subsequent pass.
+    let mut new_args = Vec::with_capacity(rest.len() + 1);
+    new_args.push(value);
+    new_args.extend_from_slice(rest);
+    Some(Expr::ScalarFunction(ScalarFunction::new_udf(
+        Arc::new(ScalarUDF::new_from_impl(GetFieldFunc::new())),
+        new_args,
+    )))
+}

Review Comment:
   The re-wrap path creates a fresh 
`Arc::new(ScalarUDF::new_from_impl(GetFieldFunc::new()))` each time it builds 
an intermediate `get_field` node. Since the same construction already appears 
elsewhere in `simplify`, it might be nice to centralize this behind a small 
helper like `get_field_udf()` or a shared `OnceLock`. Not a big deal 
performance-wise, but it would make the intent and reuse a bit clearer.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to