HeartSaVioR opened a new pull request, #46820: URL: https://github.com/apache/spark/pull/46820
### What changes were proposed in this pull request?

This PR proposes to exclude streaming Datasets from the targets of OptimizeOneRowPlan.

### Why are the changes needed?

The rule should not be applied to a streaming source, since the number of rows it sees applies only to the current microbatch. It does not mean the streaming source will produce at most one row over the lifetime of the query.

Consider this case: batch 0 runs with no data in streaming source A, which triggers the rule on the Aggregate, and batch 1 runs with several rows in streaming source A, which no longer triggers the rule.

In this scenario the query could fail, because a stateful operator is expected to be planned for every batch, whereas here it is planned "selectively".

### Does this PR introduce _any_ user-facing change?

Yes. The previous behavior can be restored with a new config, `spark.sql.streaming.optimizeOneRowPlan.enabled`, although I expect cases where users need to turn it on to be very rare.

### How was this patch tested?

New UT.

### Was this patch authored or co-authored using generative AI tooling?

No.
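
The batch 0 / batch 1 scenario can be sketched with a toy model in plain Python. This is not Spark code and not the implementation in this PR. `Source`, `plan_operators`, `plans_per_batch`, and the `skip_streaming` flag are hypothetical names invented for illustration, and the "rule" is a simplified stand-in that swaps the stateful aggregate for a plain projection whenever at most one row is visible in the current microbatch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Hypothetical stand-in for a data source as seen by one microbatch."""
    is_streaming: bool
    rows_in_batch: int  # rows visible in the current microbatch only


def plan_operators(source: Source, skip_streaming: bool) -> list:
    """Return the operators a toy planner would choose for one microbatch."""
    at_most_one_row = source.rows_in_batch <= 1
    # Toy version of the one-row rule: it fires when at most one row is
    # visible, unless streaming sources are excluded from the rule.
    rule_fires = at_most_one_row and not (skip_streaming and source.is_streaming)
    if rule_fires:
        return ["Scan", "Project"]  # aggregate optimized away
    return ["Scan", "StatefulAggregate"]


def plans_per_batch(batch_sizes: list, skip_streaming: bool) -> list:
    """Plan every microbatch of a streaming source independently."""
    return [plan_operators(Source(True, n), skip_streaming) for n in batch_sizes]


# Batch 0 is empty, batch 1 has several rows.
batch_sizes = [0, 5]
without_fix = plans_per_batch(batch_sizes, skip_streaming=False)
with_fix = plans_per_batch(batch_sizes, skip_streaming=True)

print(without_fix)  # the stateful operator appears only in batch 1
print(with_fix)     # the stateful operator appears in every batch
```

In this sketch, when the rule is applied to the streaming source, the stateful operator is present in batch 1 but missing from batch 0, which is the "selective" planning described above. When streaming sources are skipped, both batches get the same operators.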
