Re: [PR] [SPARK-42040][SQL] SPJ: Introduce a new API for V2 input partition to report partition statistics [spark]

via GitHub Tue, 26 Mar 2024 11:49:36 -0700


szehon-ho commented on PR #45314:
URL: https://github.com/apache/spark/pull/45314#issuecomment-2021228813


   > cc @aokolnychyi @RussellSpitzer @rdblue do you think this could be useful 
for Iceberg to pass partition stats to Spark? SPJ could leverage this to make 
better decisions on how to combine partitions (like which side to choose during 
partially clustered distribution), but I'm not sure whether there are more use 
cases.
   
   @sunchao Aside from picking which side to use for partially clustered 
distribution, could we also use it to group smaller partitions? For example, 
if a table is partitioned by date and the older days have little data (on both 
sides), many of the older days could be grouped into the same task.
   
   This is similar to AQE's partition coalescing, but that appears to apply 
only after a shuffle, so it presumably doesn't apply to SPJ?
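
To make the grouping idea concrete, here is a minimal sketch in plain Python. It is not Spark code and does not use any Spark API: the function name, the per-partition byte counts, and the `target_bytes` threshold are all hypothetical stand-ins for whatever statistics the new API would report. It takes the larger of the two sides' sizes for each partition key and greedily packs consecutive keys into one group until the target would be exceeded, so both sides end up with the same grouping.

```python
from typing import Dict, List


def group_small_partitions(
    left_sizes: Dict[str, int],
    right_sizes: Dict[str, int],
    target_bytes: int,
) -> List[List[str]]:
    """Greedily pack partition keys into groups of roughly target_bytes.

    All inputs are hypothetical: sizes map a partition key (e.g. a date
    string) to the bytes reported for that partition on each join side.
    """
    # Consider every key present on either side; a missing side counts as 0.
    keys = sorted(set(left_sizes) | set(right_sizes))

    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for key in keys:
        # Use the larger side so neither side's task grows past the target.
        size = max(left_sizes.get(key, 0), right_sizes.get(key, 0))
        # Close the current group if adding this key would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

With small older days and large recent days, the older days share one group while each large day keeps its own. A partition larger than the target is never split here; it simply forms a group by itself.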


