gabotechs commented on PR #18919: URL: https://github.com/apache/datafusion/pull/18919#issuecomment-3603268511
My main two concerns with this approach are: - It seems to be adding a fair level of complexity that could potentially be expressed by enhancing the existing `Partitioning::Hash` mechanism. Maybe a reference to how other systems have decided to implement (Trino? Spark?) this could serve as a reference for other maintainers to better judge the right approach. - `Partitioning::PartitionKey` is not going to respect the `execution.target_partitions` set by DataFusion, creating plans with an unbounded amount of partitions. This means that DataFusion would `tokio::spawn` as many tasks as number of distinct keys there are, which could be orders of magnitude greater than the intended `execution.target_partitions`. I think more experience people in DataFusion should chime in and give their opinion, so in the mean time, one thing that comes to mind, is that we can ship first a benchmark or test that would benefit from this so that we can potentially compare different approaches. WDYT? In the mean time, some people that come to mind whose opinion could be useful is @crepererum as a core contributor to the repartitioning mechanism, and @adriangb as potential interested in a feature like this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
