andygrove opened a new issue, #3990:
URL: https://github.com/apache/datafusion-comet/issues/3990

   ### What is the problem the feature request solves?
   
   ## Describe the bug
   
   Not a bug — a performance / correctness-of-decisions enhancement discovered
   while doing the shuffle fallback cleanup (follow-up to PR
   `refactor-shuffle-fallback-coordinator`).
   
   ## Describe the solution you'd like
   
   Comet's plan rules (`CometScanRule`, `CometExecRule`) run at both initial
   planning and on every AQE stage-prep pass. On each invocation the serde
   framework redoes the same compatibility checks and tagging:
   
   - `CometExecRule.isOperatorEnabled` calls `handler.getSupportLevel(op)`
     per operator, per pass.
   - `QueryPlanSerde.exprToProto` walks the expression tree and calls
     `handler.getSupportLevel` + `handler.convert` on every sub-expression,
     per pass.
   
   When a node was already tagged as falling back in a prior pass (carries
   reasons on `CometExplainInfo.EXTENSION_INFO` via `withInfo` / `withInfos`),
   re-deriving the same decision is wasted work — and in some cases (#3949)
   risks re-deriving against a reshaped subtree and flipping the answer.
   
   ## Additional context
   
   The naive approach — `if (hasExplainInfo(op)) return None` at the top of
   `convertToComet` / `exprToProto` — is too coarse:
   
   1. **Multiple handlers per node.** A scan tagged by `CometScanRule`
      ("Comet Scan is not enabled") can still legitimately convert via
      `CometSparkToColumnarExec` in `CometExecRule`. The fallback tag was
      recorded by a *different* rule/handler and does not mean the current
      handler should give up. `CometExpressionSuite."get_struct_field"`
      breaks with a coarse short-circuit.
   
   2. **Partial-path tagging.** Some serde code paths record fallback
      reasons even when another path could still succeed. The shuffle
      refactor fixes this for shuffle by only tagging on total failure;
      other sites (expression rollups, etc.) would need the same discipline
      before a global short-circuit is safe.
   
   Suggested directions:
   
   - Per-handler attempt record (`TreeNodeTag[Set[String]]` keyed on handler
     name) so `convertToComet(op, handlerX)` can skip re-running
     `handlerX.getSupportLevel(op)` / `handlerX.convert(op, ...)` when this
     handler has already tried and failed, without blocking other handlers.
   - Or: standardize all serde paths on "only tag on final fallback" (as the
     shuffle coordinator now does) so `hasExplainInfo` becomes a reliable
     sticky marker, then add the coarse short-circuit safely.
   - For expressions: `exprToProto` is called many times per pass on the
     same sub-expression trees (partitioning expressions in shuffle checks
     are the hottest). Caching needs to handle `DecimalPrecision.promote`
     possibly returning a fresh instance.
   
   Hotspots to target first:
   
   - Expression serialization inside shuffle/partitioning checks (many
     `exprToProto` repeats).
   - `CometExecRule.isOperatorEnabled` on every operator, per pass.
   - Scan-level expression walks in `CometScanRule`.
   
   ## Component(s)
   
   spark
   
   
   ### Describe the potential solution
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to