Re: [PR] feat: added parallelization on key partitioned data [datafusion]

via GitHub Tue, 02 Dec 2025 09:49:03 -0800


gabotechs commented on PR #18919:
URL: https://github.com/apache/datafusion/pull/18919#issuecomment-3603268511


   My main two concerns with this approach are:
   - It seems to be adding a fair level of complexity that could potentially be 
expressed by enhancing the existing `Partitioning::Hash` mechanism. Maybe a 
reference to how other systems have decided to implement (Trino? Spark?) this 
could serve as a reference for other maintainers to better judge the right 
approach.
   
   - `Partitioning::PartitionKey` is not going to respect the 
`execution.target_partitions` set by DataFusion, creating plans with an 
unbounded amount of partitions. This means that DataFusion would `tokio::spawn` 
as many tasks as number of distinct keys there are, which could be orders of 
magnitude greater than the intended `execution.target_partitions`.
   
   I think more experience people in DataFusion should chime in and give their 
opinion, so in the mean time, one thing that comes to mind, is that we can ship 
first a benchmark or test that would benefit from this so that we can 
potentially compare different approaches. WDYT?
   
   In the mean time, some people that come to mind whose opinion could be 
useful is @crepererum as a core contributor to the repartitioning mechanism, 
and @adriangb as potential interested in a feature like this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: added parallelization on key partitioned data [datafusion]

Reply via email to