Re: [PR] [SPARK-57022][SQL] Support nested column pruning for transform over arrays of structs [spark]

via GitHub Tue, 26 May 2026 03:35:37 -0700


peter-toth commented on code in PR #56070:
URL: https://github.com/apache/spark/pull/56070#discussion_r3303126758



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala:
##########
@@ -140,6 +140,19 @@ object SchemaPruning extends SQLConfHelper {
    */
   private[catalyst] def getRootFields(expr: Expression): Seq[RootField] = {
     expr match {
+      case ArrayTransform(argument, lambda: LambdaFunction) =>

Review Comment:
   This case, and the matching `ArrayTransform` branch in 
`ProjectionOverSchema`, are the only two extension points for lambda-aware 
nested pruning, and they're hardcoded to `ArrayTransform`. The same shape 
applies to several other higher-order functions whose element is *consumed* 
rather than passed through to the output:
   
   - `ArrayExists`, `ArrayForAll` — predicate over the element; output is 
`Boolean`.
   - `ArrayAggregate` — aggregation; output is the merge type.
   - `MapFilter` — predicate over `(key, value)`; output type is the original 
map.
   
   For all of those, narrowing the input element struct based on what the 
lambda body accesses is sound (no downstream consumer sees the original element 
type). Each one currently requires duplicating both the `getRootFields` branch 
and the `ProjectionOverSchema` branch, including the lambda-variable rewrite 
mechanics.
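
   To make the "narrow by what the lambda body accesses" idea concrete, here is a minimal sketch on a toy expression tree. It deliberately does not use Catalyst classes: `Expr`, `LambdaVar`, `GetField`, `accessedFields`, and `prune` are hypothetical names for illustration only, and the real `collectLambdaVariableFields` in the PR may behave differently.

```scala
// Toy model only: these are NOT Spark's Catalyst expressions.
sealed trait Expr
case class LambdaVar(name: String) extends Expr
case class GetField(child: Expr, field: String) extends Expr
case class Literal(value: Any) extends Expr
case class BinaryOp(op: String, left: Expr, right: Expr) extends Expr

object LambdaFieldSketch {
  // Collects the top-level fields of lambda variable `v` that `body` reads.
  // Returns None when `v` is used as a whole value, in which case the
  // element struct cannot be narrowed.
  def accessedFields(body: Expr, v: String): Option[Set[String]] = body match {
    case GetField(LambdaVar(`v`), f) => Some(Set(f))
    case LambdaVar(`v`)              => None
    case LambdaVar(_) | Literal(_)   => Some(Set.empty)
    case GetField(child, _)          => accessedFields(child, v)
    case BinaryOp(_, l, r) =>
      for (a <- accessedFields(l, v); b <- accessedFields(r, v)) yield a ++ b
  }

  // Keeps only the named fields of a (name, type) schema, preserving order.
  def prune(schema: Seq[(String, String)], keep: Set[String]): Seq[(String, String)] =
    schema.filter { case (name, _) => keep.contains(name) }
}
```

   For a predicate body such as `x.a > 0`, `accessedFields` yields only `a`, so an element of `struct<a, b, c>` could be read as `struct<a>`. A body that uses `x` itself yields `None` and blocks pruning.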
   
   A generalized refactor could be a follow-up PR:
   
   - Lift `collectLambdaVariableFields`, `LambdaVariableField`, and the 
per-element pruning into a helper that takes any expression with a 
`LambdaFunction` child whose first argument binds an `array<struct<...>>` (or 
`map<k, struct<...>>`) element.
   - In `ProjectionOverSchema`, dispatch on `case h: HigherOrderFunction if 
eligible(h)` instead of by class, and rewrite via the same 
`ProjectionOverLambdaVariable` logic.
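
   The dispatch-by-eligibility idea might look roughly like the sketch below. It uses toy descriptors rather than Spark's `HigherOrderFunction` hierarchy: `HofInfo`, its fields, and the body of `eligible` are assumptions made for illustration, not an existing API.

```scala
// Toy descriptor only: NOT Spark's HigherOrderFunction classes.
final case class HofInfo(
    name: String,
    elementVars: Int,              // lambda variables bound to the element
    elementIsStruct: Boolean,      // element (or map value) is a struct
    passesElementThrough: Boolean) // element reaches the function's output

object EligibilitySketch {
  // Hypothetical rule: prune only when the element is a struct, is consumed
  // by the lambda, and is bound to exactly one lambda variable.
  def eligible(h: HofInfo): Boolean =
    h.elementIsStruct && !h.passesElementThrough && h.elementVars == 1

  // Dispatches on the predicate instead of on the concrete class.
  def plan(h: HofInfo): String = h match {
    case e if eligible(e) => s"${e.name}: prune element struct via shared helper"
    case other            => s"${other.name}: leave input schema untouched"
  }

  val exists = HofInfo("ArrayExists", 1, elementIsStruct = true, passesElementThrough = false)
  val filter = HofInfo("ArrayFilter", 1, elementIsStruct = true, passesElementThrough = true)
  val sort   = HofInfo("ArraySort", 2, elementIsStruct = true, passesElementThrough = true)
}
```

   Under these assumptions only `exists` is eligible. `filter` is rejected because the element reaches the output, and `sort` is rejected on both that count and its two element variables.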
   
   The counter-arguments are real but bounded:
   - `ArrayFilter` / `ArraySort` / `ZipWith` pass the element through to the 
output, so they can't reuse this without also tracking *output* consumers — 
leave them out of the generalized set.
   - `ArraySort` is harder still because its comparator lambda binds two 
element variables; that argues for limiting v1 of the generalization to HOFs 
whose lambda binds a single element variable.


