peter-toth commented on code in PR #56070:
URL: https://github.com/apache/spark/pull/56070#discussion_r3303126758
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala:
##########
@@ -140,6 +140,19 @@ object SchemaPruning extends SQLConfHelper {
    */
   private[catalyst] def getRootFields(expr: Expression): Seq[RootField] = {
     expr match {
+      case ArrayTransform(argument, lambda: LambdaFunction) =>
Review Comment:
This case, and the matching `ArrayTransform` branch in
`ProjectionOverSchema`, are the only two extension points for lambda-aware
nested pruning, and they're hardcoded to `ArrayTransform`. The same shape
applies to several other higher-order functions whose element is *consumed*
rather than passed through to the output:
- `ArrayExists`, `ArrayForAll` — predicate over the element; output is
`Boolean`.
- `ArrayAggregate` — aggregation; output is the merge type.
- `MapFilter` — predicate over `(key, value)`; output type is the original
map.
For all of those, narrowing the input element struct based on what the
lambda body accesses is sound (no downstream consumer sees the original element
type). Each one currently requires duplicating both the `getRootFields` branch
and the `ProjectionOverSchema` branch, including the lambda-variable rewrite
mechanics.
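
To illustrate the soundness argument outside Spark, here is a minimal plain-Scala sketch. `Item`, `PrunedItem`, and the sample data are hypothetical stand-ins for a full and a pruned element struct; none of them are Catalyst types, and the collection methods only approximate the SQL functions.

```scala
object ConsumedVsPassThrough {
  // Hypothetical full element struct and its pruned form
  // (the lambdas below only ever read `price`).
  final case class Item(id: Int, price: Double, payload: String)
  final case class PrunedItem(price: Double)

  val full: Seq[Item] = Seq(Item(1, 5.0, "a"), Item(2, 20.0, "b"))
  val pruned: Seq[PrunedItem] = full.map(i => PrunedItem(i.price))

  // Consuming functions (analogous to exists / aggregate): the result type
  // does not mention the element type, so full and pruned inputs agree.
  val existsFull: Boolean = full.exists(_.price > 10.0)
  val existsPruned: Boolean = pruned.exists(_.price > 10.0)
  val sumFull: Double = full.foldLeft(0.0)(_ + _.price)
  val sumPruned: Double = pruned.foldLeft(0.0)(_ + _.price)

  // Pass-through function (analogous to filter): the result still contains
  // elements, so pruning the input removes fields a consumer may need.
  val filteredFull: Seq[Item] = full.filter(_.price > 10.0)
  val filteredPruned: Seq[PrunedItem] = pruned.filter(_.price > 10.0)

  def main(args: Array[String]): Unit = {
    assert(existsFull == existsPruned)
    assert(sumFull == sumPruned)
    // `payload` is only reachable through the unpruned result.
    println(filteredFull.map(_.payload))
    println(filteredPruned)
  }
}
```

The `exists` and `foldLeft` results are identical for both inputs, while the two `filter` results have different element types, which is the distinction the eligibility rule has to capture.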
A generalized refactor could be a follow-up PR:
- Lift `collectLambdaVariableFields`, `LambdaVariableField`, and the
per-element pruning into a helper that takes any expression with a
`LambdaFunction` child whose first argument binds an `array<struct<...>>` (or
`map<k, struct<...>>`) element.
- In `ProjectionOverSchema`, dispatch on `case h: HigherOrderFunction if
eligible(h)` instead of by class, and rewrite via the same
`ProjectionOverLambdaVariable` logic.
The counter-arguments are real but bounded:
- `ArrayFilter` / `ArraySort` / `ZipWith` pass the element through to the
output, so they can't reuse this without also tracking *output* consumers —
leave them out of the generalized set.
- `ArraySort` is also a larger change because its comparator lambda binds two
element variables; that argues for limiting v1 of the generalization to HOFs
whose lambda binds a single element variable.
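
The proposed dispatch, with the restrictions listed above, might look roughly like the following. This is a self-contained toy model, not Catalyst code: `Expr`, `Column`, `LambdaVar`, `GetField`, `And`, `Lambda`, `HigherOrder`, the `elementReachesOutput` flag, and `prunableElementFields` are all hypothetical, and the `collectLambdaVariableFields` here is a simplified stand-in that only shares its name with the helper in the PR.

```scala
object LambdaPruningSketch {
  // Toy expression tree; these are NOT Spark classes.
  sealed trait Expr { def children: Seq[Expr] }
  final case class Column(name: String) extends Expr {
    val children: Seq[Expr] = Nil
  }
  final case class LambdaVar(name: String) extends Expr {
    val children: Seq[Expr] = Nil
  }
  final case class GetField(child: Expr, field: String) extends Expr {
    val children: Seq[Expr] = Seq(child)
  }
  final case class And(left: Expr, right: Expr) extends Expr {
    val children: Seq[Expr] = Seq(left, right)
  }
  final case class Lambda(body: Expr, args: Seq[LambdaVar]) extends Expr {
    val children: Seq[Expr] = body +: args
  }
  // `elementReachesOutput` is an assumed flag: true for filter-like functions.
  final case class HigherOrder(
      name: String,
      argument: Expr,
      function: Lambda,
      elementReachesOutput: Boolean) extends Expr {
    val children: Seq[Expr] = Seq(argument, function)
  }

  // Top-level fields of `v` read by `body`, or None if `v` is used whole
  // (in which case no narrowing is possible).
  def collectLambdaVariableFields(body: Expr, v: LambdaVar): Option[Set[String]] =
    body match {
      case GetField(`v`, field) => Some(Set(field))
      case `v` => None
      case other =>
        other.children.foldLeft(Option(Set.empty[String])) { (acc, child) =>
          for (a <- acc; c <- collectLambdaVariableFields(child, v)) yield a ++ c
        }
    }

  // v1 eligibility: the element is consumed and the lambda binds one variable.
  def eligible(h: HigherOrder): Boolean =
    !h.elementReachesOutput && h.function.args.size == 1

  // Dispatch on eligibility rather than on a concrete class.
  def prunableElementFields(e: Expr): Option[Set[String]] = e match {
    case h: HigherOrder if eligible(h) =>
      collectLambdaVariableFields(h.function.body, h.function.args.head)
    case _ => None
  }

  val x: LambdaVar = LambdaVar("x")
  val y: LambdaVar = LambdaVar("y")
  val readsAB: Expr = And(GetField(x, "a"), GetField(x, "b"))

  val existsLike: Expr =
    HigherOrder("exists", Column("items"), Lambda(readsAB, Seq(x)), elementReachesOutput = false)
  val filterLike: Expr =
    HigherOrder("filter", Column("items"), Lambda(readsAB, Seq(x)), elementReachesOutput = true)
  val sortLike: Expr =
    HigherOrder("sort", Column("items"), Lambda(readsAB, Seq(x, y)), elementReachesOutput = true)
}
```

In this model `prunableElementFields(existsLike)` yields the two accessed field names, while the filter-like and sort-like expressions are rejected by `eligible` and yield `None`. A real implementation would also have to handle variable shadowing in nested lambdas and map-valued inputs, which this sketch does not attempt.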
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]