andygrove opened a new pull request, #3977:
URL: https://github.com/apache/datafusion-comet/pull/3977

   ## Which issue does this PR close?
   
   Related to #3949.
   
   ## Rationale for this change
   
   Issue #3949 reports an intermittent `AssertionError` from 
`ColumnarToRowExec.<init>` in the awslabs TPC-DS benchmark (q14a, q14b, q31, 
q47, q57 on Spark 4.0). Prior attempts to reproduce it with small DPP queries 
failed, and 1TB local runs did not reproduce it either. This PR adds infrastructure to 
stress the Spark/Comet boundary and probe the canonicalization invariants the 
stack trace points at, so future regressions around mixed-mode plan shapes have 
automated coverage.
   
   The full stack trace shared on the issue shows the assertion fires from 
`QueryPlan.doCanonicalize` → `withNewChildren(canonicalizedChildren)` → 
`ColumnarToRowExec.copy(...)` inside `AdaptiveSparkPlanExec.createQueryStages`. 
That means `cometPlan.canonicalized.supportsColumnar` is `false` for some Comet 
plan produced during post-stage-materialization re-planning. None of the suites 
in this PR reproduce the bug yet — the bad canonical form does not appear in 
the initial physical plan — but the tools are useful regardless.
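   As a sketch of the invariant that stack trace implies (illustrative only — `canonicalFormStaysColumnar` is a hypothetical helper written for this description, not code from the PR): `ColumnarToRowExec` asserts `child.supportsColumnar` in its constructor, so canonicalization must not turn a columnar plan into a non-columnar one.

```scala
import org.apache.spark.sql.execution.SparkPlan

// Hypothetical check over any physical plan `p`. If `p` reports columnar
// output but its canonical form does not, then
// ColumnarToRowExec.copy(p.canonicalized) inside doCanonicalize will trip
// the constructor assertion seen in the #3949 trace.
def canonicalFormStaysColumnar(p: SparkPlan): Boolean =
  !p.supportsColumnar || p.canonicalized.supportsColumnar
```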
   
   ## What changes are included in this PR?
   
   - **`FuzzFallback` utility** 
(`spark/src/main/scala/org/apache/comet/FuzzFallback.scala`) — seeded, 
node-identity-keyed decision cache. Calling `shouldVetoShuffle(s)` or 
`shouldVetoExec(op)` twice for the same node returns the same answer, so 
`getSupportLevel` and `createExec` stay consistent.
   - **Injection points**:
     - `CometShuffleExchangeExec.nativeShuffleSupported` and 
`columnarShuffleSupported`
     - The generic operator branch of `CometExecRule.convertNode`
   - **Four new `testing`-category configs, all default-off**:
     - `spark.comet.fuzz.fallback.enabled`
     - `spark.comet.fuzz.fallback.seed`
     - `spark.comet.fuzz.fallback.shuffleVetoProbability` (default 0.5)
     - `spark.comet.fuzz.fallback.execVetoProbability` (default 0.0)
   - **`CometFuzzFallbackSuite`** — all 129 TPC-DS queries (v1.4 + v2.7) × N 
seeds, plan-only (no data required; uses `TPCDSBase` schemas).
   - **`CometFuzzDppSuite`** — small partitioned fact/dim data, actually 
executes DPP-flavored queries across SMJ/BHJ/coalesce variants with AQE on.
   - **`CometCanonicalizationSuite`** — targeted node-level canonicalization 
checks for scan/filter/project/aggregate/BHJ/DPP shapes, both directly 
(`p.canonicalized.supportsColumnar`) and wrapped 
(`ColumnarToRowExec(p).canonicalized` — mirrors the #3949 stack exactly).
   - **`CometCanonicalizationTpcdsSuite`** — same checks across every TPC-DS 
query plan.
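   The decision-cache idea behind `FuzzFallback` can be sketched as follows (a simplified standalone illustration, not the actual implementation; names like `DecisionCache` and `shouldVeto` are invented for this sketch). Keying by node identity means the two call sites (`getSupportLevel` during planning and `createExec` during conversion) see the same answer for the same node, and seeding makes a failing run replayable.

```scala
import java.util.IdentityHashMap
import scala.util.Random

// Seeded, node-identity-keyed decision cache: the first query for a node
// draws a random veto decision; every later query for that same node
// object returns the cached decision, so planning stays self-consistent.
class DecisionCache(seed: Long, vetoProbability: Double) {
  private val rng = new Random(seed)
  private val decisions = new IdentityHashMap[AnyRef, java.lang.Boolean]()

  def shouldVeto(node: AnyRef): Boolean = synchronized {
    val cached = decisions.get(node)
    if (cached != null) {
      cached.booleanValue()
    } else {
      val decision = rng.nextDouble() < vetoProbability
      decisions.put(node, decision)
      decision
    }
  }
}
```

One caveat worth noting for a sketch like this: determinism holds only if nodes are queried in the same order across runs with the same seed, since each cache miss consumes one draw from the shared RNG.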
   
   ## How are these changes tested?
   
   All four suites run green locally (Spark 4.0):
   - `CometFuzzFallbackSuite`: 129 tests × 2 seeds pass in ~20s.
   - `CometFuzzDppSuite`: 900 iterations (3 variants × 3 queries × 100 seeds) 
pass in ~2 min.
   - `CometCanonicalizationSuite`: 6 tests, ~10s.
   - `CometCanonicalizationTpcdsSuite`: 129 TPC-DS queries, ~8s.
   
   All new configs default to off; no existing behavior changes when the fuzz 
harness is disabled.
   
   This PR is a draft because the primary goal — reproducing #3949 — is not yet 
met by any of these suites. It will stay in draft until either (a) one of the suites catches 
the bug after further tuning, (b) a real failing plan arrives from awslabs so 
we can tighten the canonicalization tests, or (c) we agree to merge the 
infrastructure as-is as a regression guard even though it doesn't demonstrate 
the repro today.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
