HeartSaVioR opened a new pull request, #46820:
URL: https://github.com/apache/spark/pull/46820

   ### What changes were proposed in this pull request?
   
   This PR proposes to exclude streaming Datasets from the targets of the 
`OptimizeOneRowPlan` rule.
   
   ### Why are the changes needed?
   
   The rule should not be applied to a streaming source, since the number of rows 
it sees is only for the current microbatch. That does not mean the streaming 
source will produce at most one row over the lifetime of the query.
   
   Consider this case: batch 0 runs with no data in streaming source A, which 
triggers the rule on an Aggregate, and batch 1 runs with several rows in 
streaming source A, which no longer triggers the rule.
   
   In the above scenario, the query could fail, because a stateful operator is 
expected to be planned for every batch, whereas here it is planned 
"selectively".
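
   The scenario can be sketched with a toy model. This is not Spark code: 
`plan_microbatch`, `stateful_operator_is_stable`, the `exclude_streaming` flag 
and the operator labels are all hypothetical names made up for illustration, 
and the one-row check is a simplified stand-in for what the real rule does. 
It only shows why per-batch row counts make the stateful operator appear in 
some batches and not in others.

```python
from typing import List

STATEFUL_AGG = "StatefulAggregate"  # hypothetical operator label, not a Spark class


def plan_microbatch(rows_in_batch: int, exclude_streaming: bool) -> List[str]:
    """Return toy operator names planned for one microbatch of a streaming source."""
    # Simplified stand-in for the one-row rule: it fires when the current
    # batch is known to hold at most one row, unless streaming is excluded.
    rule_fires = rows_in_batch <= 1 and not exclude_streaming
    if rule_fires:
        # The aggregate is rewritten away, so no stateful operator is planned.
        return ["StreamingSource", "Project"]
    return ["StreamingSource", STATEFUL_AGG]


def stateful_operator_is_stable(batch_sizes: List[int], exclude_streaming: bool) -> bool:
    """True if every microbatch agrees on whether the stateful operator is planned."""
    planned = [STATEFUL_AGG in plan_microbatch(n, exclude_streaming) for n in batch_sizes]
    return len(set(planned)) <= 1


if __name__ == "__main__":
    batches = [0, 5]  # batch 0 is empty, batch 1 has several rows
    # Rule applied to streaming input: planning differs between batches.
    print(stateful_operator_is_stable(batches, exclude_streaming=False))
    # Streaming input excluded from the rule: planning is the same every batch.
    print(stateful_operator_is_stable(batches, exclude_streaming=True))
```

   In this model, excluding streaming input from the rule is what keeps the 
stateful operator in the plan for both the empty batch and the non-empty one.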
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, but the previous behavior can be restored with a new config, 
`spark.sql.streaming.optimizeOneRowPlan.enabled`, although I expect it to be 
really rare that users have to turn the config on.
   
   ### How was this patch tested?
   
   New UT.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

