andygrove opened a new issue, #3990:
URL: https://github.com/apache/datafusion-comet/issues/3990
### What is the problem the feature request solves?
## Describe the bug
Not a bug — a performance / correctness-of-decisions enhancement discovered
while doing the shuffle fallback cleanup (follow-up to PR
`refactor-shuffle-fallback-coordinator`).
## Describe the solution you'd like
Comet's plan rules (`CometScanRule`, `CometExecRule`) run at both initial
planning and on every AQE stage-prep pass. On each invocation the serde
framework redoes the same compatibility checks and tagging:
- `CometExecRule.isOperatorEnabled` calls `handler.getSupportLevel(op)`
per operator, per pass.
- `QueryPlanSerde.exprToProto` walks the expression tree and calls
`handler.getSupportLevel` + `handler.convert` on every sub-expression,
per pass.
When a node was already tagged as falling back in a prior pass (carries
reasons on `CometExplainInfo.EXTENSION_INFO` via `withInfo` / `withInfos`),
re-deriving the same decision is wasted work — and in some cases (#3949)
risks re-deriving against a reshaped subtree and flipping the answer.
## Additional context
The naive approach — `if (hasExplainInfo(op)) return None` at the top of
`convertToComet` / `exprToProto` — is too coarse:
1. **Multiple handlers per node.** A scan tagged by `CometScanRule`
("Comet Scan is not enabled") can still legitimately convert via
`CometSparkToColumnarExec` in `CometExecRule`. The fallback tag was
recorded by a *different* rule/handler and does not mean the current
handler should give up. `CometExpressionSuite."get_struct_field"`
breaks with a coarse short-circuit.
2. **Partial-path tagging.** Some serde code paths record fallback
reasons even when another path could still succeed. The shuffle
refactor fixes this for shuffle by only tagging on total failure;
other sites (expression rollups, etc.) would need the same discipline
before a global short-circuit is safe.
Suggested directions:
- Per-handler attempt record (`TreeNodeTag[Set[String]]` keyed on handler
name) so `convertToComet(op, handlerX)` can skip re-running
`handlerX.getSupportLevel(op)` / `handlerX.convert(op, ...)` when this
handler has already tried and failed, without blocking other handlers.
- Or: standardize all serde paths on "only tag on final fallback" (as the
shuffle coordinator now does) so `hasExplainInfo` becomes a reliable
sticky marker, then add the coarse short-circuit safely.
- For expressions: `exprToProto` is called many times per pass on the
same sub-expression trees (partitioning expressions in shuffle checks
are the hottest). Caching needs to handle `DecimalPrecision.promote`
possibly returning a fresh instance.
Hotspots to target first:
- Expression serialization inside shuffle/partitioning checks (many
`exprToProto` repeats).
- `CometExecRule.isOperatorEnabled` on every operator, per pass.
- Scan-level expression walks in `CometScanRule`.
## Component(s)
spark
### Describe the potential solution
_No response_
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]