yadavay-amzn opened a new pull request, #56079: URL: https://github.com/apache/spark/pull/56079
### What changes were proposed in this pull request?

Remove the test-mode assertion in `DataSourceV2Relation.computeStats()` that throws when stats are requested before scan pushdown has been applied.

### Why are the changes needed?

The `operatorOptimizationBatch` (containing `PushDownLeftSemiAntiJoin`) runs before `earlyScanPushDownRules` in the optimizer. When `PushDownLeftSemiAntiJoin` evaluates whether a join can be planned as a broadcast join, it calls `plan.stats` on DSv2 relations that have not yet had pushdown applied. The test-mode assertion then throws a `SparkException`, crashing the query.

Repro: any LEFT SEMI or LEFT ANTI join over an Aggregate on a DSv2 table triggers this path.

### Does this PR introduce _any_ user-facing change?

Yes. Queries with LEFT SEMI/ANTI joins on DSv2 tables no longer crash in test mode. In production mode the assertion was already inactive, but stats estimates were potentially inflated (pre-pushdown size). The method now consistently returns fallback stats from the catalog.

### How was this patch tested?

Added a test in `DataSourceV2SQLSuiteV2Filter` exercising a LEFT SEMI join over an Aggregate on a DSv2 table. It verifies correct query results via `checkAnswer`.

- Without fix: `SparkException: [INTERNAL_ERROR] BUG: computeStats called before pushdown`
- With fix: correct results returned
- Full DSv2 test suites pass (456 tests, no regressions)

### Was this patch authored or co-authored using generative AI tooling?

Yes.
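
The ordering problem described under "Why are the changes needed?" can be illustrated with a small toy model. This is a sketch in plain Python, not Spark code: `Relation`, `TEST_MODE`, `InternalError`, the two `stats_*` methods, and the broadcast threshold value are all hypothetical stand-ins chosen for illustration. Only the rule names and the error message come from the description above.

```python
from dataclasses import dataclass

# Assumption: stands in for Spark's test-mode flag; not a real Spark setting.
TEST_MODE = True


class InternalError(Exception):
    """Hypothetical stand-in for the SparkException raised by the assertion."""


@dataclass
class Relation:
    """Toy stand-in for a DSv2 relation; not the real DataSourceV2Relation."""
    name: str
    catalog_size_bytes: int
    pushdown_applied: bool = False

    def stats_with_assertion(self) -> int:
        # Behaviour before the change, as described in the PR:
        # in test mode, asking for stats before pushdown is treated as a bug.
        if TEST_MODE and not self.pushdown_applied:
            raise InternalError("BUG: computeStats called before pushdown")
        return self.catalog_size_bytes

    def stats_fallback(self) -> int:
        # Behaviour after the change: always return the catalog fallback stats.
        return self.catalog_size_bytes


def push_down_left_semi_anti_join(rel, stats_fn, threshold=10 * 1024 * 1024):
    # The rule needs a size estimate to decide whether a broadcast is possible.
    # The threshold is an arbitrary illustrative value.
    return stats_fn(rel) <= threshold


def early_scan_pushdown(rel):
    # Marks the relation as having had scan pushdown applied.
    rel.pushdown_applied = True


def optimize(rel, stats_fn):
    # The operator optimization step runs first and asks for stats ...
    can_broadcast = push_down_left_semi_anti_join(rel, stats_fn)
    # ... and scan pushdown only runs afterwards.
    early_scan_pushdown(rel)
    return can_broadcast
```

In this model, `optimize(Relation("t", 1024), Relation.stats_with_assertion)` raises `InternalError` because stats are requested while `pushdown_applied` is still `False`, whereas passing `Relation.stats_fallback` lets the same pipeline complete. The real optimizer is considerably more involved; the sketch only captures the ordering of the two steps.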
