Re: [PR] feat: added parallelization on key partitioned data [datafusion]

via GitHub Thu, 27 Nov 2025 08:47:43 -0800


fmonjalet commented on PR #18919:
URL: https://github.com/apache/datafusion/pull/18919#issuecomment-3586757308


   Thanks a lot for the description and companion doc, they are super useful.
   
   This work is super nice and is even crucial for distributed DataFusion. 
Reusing partitioning and avoiding repartitions can make a huge difference when 
the repartition is done on the network.
   
   I think I am still missing part of the point of `KeyPartitioned` vs reusing 
`Hash`. I'll explain what I understand and you can correct me:
   - Anything `KeyPartitioned` is `Hash` partitioned (but the opposite is not 
true) ==> is this correct?
   - `KeyPartitioned` means each key is in a distinct partition ==> is this 
correct?
   - If the above is correct (if it's not, my reasoning does not hold and you 
can ignore the rest of this comment), I am not sure how this applies to high 
cardinality keys, for example `date_bin(timestamp, 15m)` or id ranges (say you 
have a million files, each one having a distinct range). I imagine we'd want to 
be able to group multiple "keys" into the same processing partition, to avoid 
having thousands of partitions. My understanding is that DataFusion partitions 
will add overhead if there are too many (subsequent repartitions, coalesce, 
merge sort), but I may be mistaken.
   - Once we group `KeyPartitioned` partitions together, they become `Hash` 
partitions. ==> is this correct?
   - So in practice, it appears to me that we'll almost always need to resort 
to `Hash` partitions.
   - What we'd loose compared to `KeyPartition` is the `SortExec` elision when 
aggregating then sorting by the partition key, but I'd argue that if you had 
one group per partition, then probably the sorting is cheap enough. ==> Do we 
lose something else?
   (This point is not challenging the PR as a whole but just an implementation 
choice.)
   
   So my current understanding is: `KeyPartitioned` is indeed different from 
`Hash` (a specific case) but the ratio complexity / added value is not obvious. 
The reason we'd not take full advantage of `KeyPartitioned` may be that DF 
partitions are actually bound to processing units (~threads), and maybe there 
would be value in separating the notion of processing thread and the notion of 
data partition, where you could have N processing unit per partitions (with 
partial repartitions), or N partitions per thread. But this sounds like a 
completely different topic and I don't know how much it makes sense.
   
   Sorry for the wall of text, I am mostly trying to wrap my head around this, 
please correct anything I missed in here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: added parallelization on key partitioned data [datafusion]

Reply via email to