gene-bordegaray commented on PR #23231:
URL: https://github.com/apache/datafusion/pull/23231#issuecomment-4843700583

   Thank you for the work, @saadtajwar. I think this will be very useful in 
upcoming efforts 😄 
   
   Before really diving into this, we should step back and plan how 
repartitioning will work at a high level before getting into the nitty 
gritty. Per the discussion in #23236, it seems we will be working toward 
deprecating `HashPartitioned` and moving to a `KeyPartitioned` distribution 
variant. 
   
   So essentially we are going to have operators that require a 
`KeyPartitioned` distribution, with two options to achieve it: repartition via 
`Hash` or repartition via `Range`. It is unclear to me what the best way to 
make this decision is, and whether / how we can recognize when to use one or 
the other. Should this be something that users specify as a config? Is there 
some way to detect it? Should we only repartition to range if the target is a 
superset of the current range partitioning (example: data partitioned on 
`day` -> repartition to `hour`)?
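
   To make the last question concrete, here is a minimal sketch of what such a 
check could look like. It assumes range partitionings are described by sorted 
split points; `RepartitionStrategy`, `refines`, and `choose_strategy` are 
hypothetical names for illustration, not existing DataFusion APIs.

```rust
/// Hypothetical strategy choice; illustrative only, not a DataFusion type.
#[derive(Debug, PartialEq)]
enum RepartitionStrategy {
    Hash,
    Range,
}

/// Returns true if every split point of `current` also appears in `target`,
/// i.e. `target` only subdivides the existing ranges (day -> hour).
/// Assumes `target` is sorted ascending.
fn refines(current: &[i64], target: &[i64]) -> bool {
    current.iter().all(|c| target.binary_search(c).is_ok())
}

/// Picks `Range` only when the input is already range partitioned and the
/// target split points refine it; otherwise falls back to `Hash`.
fn choose_strategy(current: Option<&[i64]>, target: &[i64]) -> RepartitionStrategy {
    match current {
        Some(cur) if refines(cur, target) => RepartitionStrategy::Range,
        _ => RepartitionStrategy::Hash,
    }
}
```

   With split points expressed in hours, day boundaries `[24, 48]` are refined 
by hourly boundaries `1..72`, so this sketch would pick `Range`. A target of 
`[12, 36]` does not contain the day boundaries, so it would fall back to `Hash`.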
   
   These are some things I would like to discuss with others before we decide 
to implement anything regarding repartitioning (as of now we just preserve the 
existing partitioning from a `DataSourceExec`).
   
   cc: @alamb @gabotechs @stuhood 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

