gene-bordegaray commented on PR #23231: URL: https://github.com/apache/datafusion/pull/23231#issuecomment-4843700583
Thank you for the work @saadtajwar, I think this will be very useful in upcoming efforts 😄

Before diving into the nitty-gritty, we should step back and plan how repartitioning will work at a high level. Per the discussion in #23236, it seems that we will be working toward deprecating `HashPartitioned` and moving to a `KeyPartitioned` distribution variant. So essentially we are going to have operators that require a `KeyPartitioned` distribution, with two options to achieve this: repartition via `Hash` or repartition via `Range`.

It is unclear to me what the best way to make this decision is, and if / how we can recognize when to use one or the other. Should this be something that users specify as a config? Is there some way to detect this? Should we only repartition to range if it is to a superset of the current range partitioning (example: data partitioned on `day` -> repartition to `hour`)?

These are some things I would like to discuss with others before we decide to implement anything regarding repartitioning (as of now we just preserve it from a `DataSourceExec`).

cc: @alamb @gabotechs @stuhood
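
To make the two options a bit more concrete, here is a rough standalone sketch. It is not DataFusion code, and every name in it (`hash_partition`, `range_partition`, `refines`) is hypothetical. It assumes integer keys, range partitions described by sorted split points, and one possible reading of "superset": the target split points contain every current split point, so each new partition falls entirely inside one existing partition.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: route a key to one of `n` partitions by hashing.
/// Equal keys always land in the same partition, but ordering is lost.
fn hash_partition(key: i64, n: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % n as u64) as usize
}

/// Hypothetical helper: route a key by sorted split points.
/// Keys below `bounds[0]` go to partition 0, keys in
/// `[bounds[0], bounds[1])` go to partition 1, and so on.
fn range_partition(key: i64, bounds: &[i64]) -> usize {
    // Index of the first split point that is strictly greater than `key`.
    bounds.partition_point(|b| *b <= key)
}

/// Hypothetical check for the "superset" case: returns true when every
/// split point in `coarser` also appears in `finer` (both sorted), so
/// moving from `coarser` to `finer` only subdivides existing partitions.
fn refines(finer: &[i64], coarser: &[i64]) -> bool {
    coarser.iter().all(|b| finer.binary_search(b).is_ok())
}
```

For the `day` -> `hour` example, with keys measured in hours, day split points such as `[24, 48]` are all contained in the hourly split points `1, 2, ..., 71`, so `refines(&hourly, &daily)` holds while the reverse does not. Both routing functions keep equal keys together, which is the property a `KeyPartitioned` requirement would presumably care about; how the planner should choose between them is exactly the open question above.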
