HeartSaVioR opened a new pull request, #42823:
URL: https://github.com/apache/spark/pull/42823

   ### What changes were proposed in this pull request?
   
   This PR proposes to override `Scan.columnarSupportMode` for DSv2 streaming 
data sources. None of them support columnar reads.
   
   The rationale is explained in the next section.
   
   ### Why are the changes needed?
   
   The default value for `Scan.columnarSupportMode` is `PARTITION_DEFINED`, 
which requires `inputPartitions` to be called/evaluated. `inputPartitions` can 
be referenced multiple times during planning.
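
   A minimal, self-contained sketch of why the default mode forces partition 
planning while an explicit `UNSUPPORTED` does not. This is a toy model, not 
Spark code: `Mode`, `ToyStream`, `ToyScanNode` and `ColumnarModeDemo` are 
hypothetical stand-ins for `Scan.ColumnarSupportMode`, `MicroBatchStream` and 
the scan exec node, and the toy assumes that no partition is columnar.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only: these types are hypothetical stand-ins, not Spark classes.
enum Mode { PARTITION_DEFINED, SUPPORTED, UNSUPPORTED }

class ToyStream {
    final AtomicInteger planCalls = new AtomicInteger();

    // Stands in for a potentially expensive MicroBatchStream.planInputPartitions.
    List<String> planInputPartitions() {
        planCalls.incrementAndGet();
        return List.of("p0", "p1");
    }
}

class ToyScanNode {
    private final ToyStream stream;
    private final Mode mode;

    ToyScanNode(ToyStream stream, Mode mode) {
        this.stream = stream;
        this.mode = mode;
    }

    // Only PARTITION_DEFINED has to look at the partitions to answer.
    boolean supportsColumnar() {
        switch (mode) {
            case SUPPORTED:
                return true;
            case UNSUPPORTED:
                return false;
            default:
                // Assumption for the toy: no partition supports columnar reads.
                return stream.planInputPartitions().stream().anyMatch(p -> false);
        }
    }
}

class ColumnarModeDemo {
    // Returns how many times partitions were planned to answer supportsColumnar().
    static int planCallsFor(Mode mode) {
        ToyStream stream = new ToyStream();
        new ToyScanNode(stream, mode).supportsColumnar();
        return stream.planCalls.get();
    }

    public static void main(String[] args) {
        System.out.println("PARTITION_DEFINED: " + planCallsFor(Mode.PARTITION_DEFINED));
        System.out.println("UNSUPPORTED: " + planCallsFor(Mode.UNSUPPORTED));
    }
}
```

   In this model the partition-defined mode plans partitions once per query 
about columnar support, while the unsupported mode answers without touching 
the stream at all.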
   
   In `MicroBatchScanExec`, we define `inputPartitions` as a lazy val, so that 
`inputPartitions`, which calls `MicroBatchStream.planInputPartitions`, is not 
evaluated more than once. But a lazy val is evaluated once per instance, and we 
missed that there is no guarantee that the node will be instantiated only once 
(although the actual execution will happen once). For example, executedPlan 
clones the plan (internally we call the constructor to make a deep copy of the 
node), explain is called internally to build a SQL execution start event, etc.
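
   The per-instance nature of the lazy evaluation can be illustrated with a 
small toy model. It is a sketch under stated assumptions, not Spark code: 
`CountingStream`, `LazyScanNode` and `LazyCopyDemo` are hypothetical names, 
and the per-instance cache only imitates what a Scala lazy val does.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only: hypothetical stand-ins, not Spark classes.
class CountingStream {
    final AtomicInteger planCalls = new AtomicInteger();

    // Stands in for MicroBatchStream.planInputPartitions.
    List<String> planInputPartitions() {
        planCalls.incrementAndGet();
        return List.of("p0", "p1");
    }
}

class LazyScanNode {
    private final CountingStream stream;
    private List<String> cached; // per-instance cache, imitating a lazy val

    LazyScanNode(CountingStream stream) {
        this.stream = stream;
    }

    List<String> inputPartitions() {
        if (cached == null) {
            cached = stream.planInputPartitions();
        }
        return cached;
    }

    // Copying goes through the constructor, so the copy starts with an empty cache.
    LazyScanNode copy() {
        return new LazyScanNode(stream);
    }
}

class LazyCopyDemo {
    // Reads inputPartitions twice on the original node and twice on each copy.
    static int planCallsWithCopies(int copies) {
        CountingStream stream = new CountingStream();
        LazyScanNode node = new LazyScanNode(stream);
        node.inputPartitions();
        node.inputPartitions();
        for (int i = 0; i < copies; i++) {
            LazyScanNode c = node.copy();
            c.inputPartitions();
            c.inputPartitions();
        }
        return stream.planCalls.get();
    }

    public static void main(String[] args) {
        System.out.println("no copies: " + planCallsWithCopies(0));
        System.out.println("three copies: " + planCallsWithCopies(3));
    }
}
```

   Repeated reads on one instance cost a single planning call, but every copy 
of the node pays for planning again. The number of copies used here is an 
arbitrary choice for the toy, not a statement about which copies Spark makes.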
   
   I observed `MicroBatchStream.planInputPartitions` being called 4 times per 
microbatch, which is concerning when the overhead of `planInputPartitions` is 
non-trivial.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Existing UTs.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]