szehon-ho commented on PR #54330:
URL: https://github.com/apache/spark/pull/54330#issuecomment-3930717144

   > As far as I see you assume that the child is a DataSourceRDD, but the main
   > point of this change is to move the grouping logic to the new operator
   > (GroupPartitionsExec) so as to be able to insert it into those plans as well
   > where there is no BatchScanExec / DataSourceRDD, e.g. cached or checkpointed
   > plans.
   
   I see.
   
   Interesting, so you mean we are losing metrics. Should we at least add a
test? It may make sense to do that in a separate PR, but it depends on the final
approach. The approach does make sense, but I am a bit unsure whether ThreadLocal
is the best/safest way to avoid the memory leak; as you can see, it's a bit
tricky. That said, I am not so familiar with the DataSourceRDD code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

