Re: [PR] feat: added parallelization on key partitioned data [datafusion]

via GitHub Mon, 01 Dec 2025 04:59:40 -0800


fmonjalet commented on PR #18919:
URL: https://github.com/apache/datafusion/pull/18919#issuecomment-3596432058


   Thanks a lot for the explanations @gene-bordegaray, I think I actually start 
to understand 💡
   
   When the partitioning is by one field, `KeyPartitioned` and `Hash` are the 
same (correct me if wrong)
   But what I understand is that the difference starts when you have two 
fields: `KeyPartitioned` is hierarchical.
   Taking a "practical" distribution: say you have an "order" data set that 
tracks orders of customers to providers. Files are partitioned by 
`(customer_id, provider_id)`, conceptually organized as follows:
   
   ```
   customer_id_hash_0/
       provider_id_hash_1.parquet
       provider_id_hash_2.parquet
       provider_id_hash_3.parquet
       provider_id_hash_4.parquet
   customer_id_hash_1/
       provider_id_hash_1.parquet
       provider_id_hash_2.parquet
       provider_id_hash_3.parquet
       provider_id_hash_4.parquet
   [... etc ...]
   ```
   
   Technically this layout satisfies all the following partitioning:
   - `KeyPartitioned(hash(customer_id), hash(provider_id))`
   - `KeyPartitioned(customer_id, provider_id)` (simpler expression of the 
above)
   - `Hash(customer_id, provider_id)`
   - `Hash(customer_id)`
   
   Now you want to compute:
   
   ```sql
   SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id
   ```
   
   - If the data partitioning is declared as `Hash(customer_id, provider_id)`, 
then you'd have to insert a repartition, because according to the partitioning, 
`(customer0, provider0)` and `(customer0, provider1)` may be in different 
partitions.
   - If the data partitioning is `KeyPartitioned(customer_id, provider_id)`, 
you can reuse the existing partitioning: `(customer0, provider0)` and 
`(customer0, provider1)` are in the same partition, this is what 
`KeyPartitioned` guarantees.
   - `Hash(customer_id)` also works, but we may lack the mechanism to ask the 
data source to say it satisfies this partitioning, and we lose information that 
can be useful.
   
   The following query (a bit artificial ,sorry):
   
   ```sql
   WITH max_order_per_provider AS (
     SELECT customer_id, provider_id, MAX(amount) AS max_amount FROM orders 
GROUP BY customer_id, provider_id
   )
   SELECT customer_id, MIN(max_amount) as min_max FROM max_order_per_provider 
GROUP BY customer_id
   ```
   
   - Can work with 0 repartitions with `KeyPartitioned(customer_id, 
provider_id)` (if the partitioning is propagated properly through the plan)
   - `Hash(customer_id)` could also worl. Knowing the data source is 
`KeyPartitioned` mostly gives us the information that allows to say it 
satisfies `Hash(customer_id)`.
   
   From there, I see `KeyPartitioned` as a device to avoid propagating 
partitionings from the top.
   In the latest example, partitioning by `customer_id` along the entire 
subplan would be ideal, but I don't think we have a mechanism to propagate this 
information in the plan. We currently cannot ask the leaf node "could you 
provide `Hash(customer_id)` instead of `Hash(customer_id, provider_id)`.
   
   I am now wondering about whether we should have `KeyPartitioned`, or a 
mechanism to propagate ideal partitioning down the stack (e.g. `Hash(a, b)` can 
be changed to `Hash(a)` to avoid a repartition later on).
   Gene, does this capture your thoughts? Do you see cases where 
`KeyPartitioned` adds value in between execution nodes that is not captured 
here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: added parallelization on key partitioned data [datafusion]

Reply via email to