fmonjalet commented on PR #18919:
URL: https://github.com/apache/datafusion/pull/18919#issuecomment-3596432058
Thanks a lot for the explanations @gene-bordegaray, I think I actually start
to understand 💡
When the partitioning is by one field, `KeyPartitioned` and `Hash` are the
same (correct me if wrong)
But what I understand is that the difference starts when you have two
fields: `KeyPartitioned` is hierarchical.
Taking a "practical" distribution: say you have an "order" data set that
tracks orders of customers to providers. Files are partitioned by
`(customer_id, provider_id)`, conceptually organized as follows:
```
customer_id_hash_0/
provider_id_hash_1.parquet
provider_id_hash_2.parquet
provider_id_hash_3.parquet
provider_id_hash_4.parquet
customer_id_hash_1/
provider_id_hash_1.parquet
provider_id_hash_2.parquet
provider_id_hash_3.parquet
provider_id_hash_4.parquet
[... etc ...]
```
Technically this layout satisfies all the following partitioning:
- `KeyPartitioned(hash(customer_id), hash(provider_id))`
- `KeyPartitioned(customer_id, provider_id)` (simpler expression of the
above)
- `Hash(customer_id, provider_id)`
- `Hash(customer_id)`
Now you want to compute:
```sql
SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id
```
- If the data partitioning is declared as `Hash(customer_id, provider_id)`,
then you'd have to insert a repartition, because according to the partitioning,
`(customer0, provider0)` and `(customer0, provider1)` may be in different
partitions.
- If the data partitioning is `KeyPartitioned(customer_id, provider_id)`,
you can reuse the existing partitioning: `(customer0, provider0)` and
`(customer0, provider1)` are in the same partition, this is what
`KeyPartitioned` guarantees.
- `Hash(customer_id)` also works, but we may lack the mechanism to ask the
data source to say it satisfies this partitioning, and we lose information that
can be useful.
The following query (a bit artificial ,sorry):
```sql
WITH max_order_per_provider AS (
SELECT customer_id, provider_id, MAX(amount) AS max_amount FROM orders
GROUP BY customer_id, provider_id
)
SELECT customer_id, MIN(max_amount) as min_max FROM max_order_per_provider
GROUP BY customer_id
```
- Can work with 0 repartitions with `KeyPartitioned(customer_id,
provider_id)` (if the partitioning is propagated properly through the plan)
- `Hash(customer_id)` could also worl. Knowing the data source is
`KeyPartitioned` mostly gives us the information that allows to say it
satisfies `Hash(customer_id)`.
From there, I see `KeyPartitioned` as a device to avoid propagating
partitionings from the top.
In the latest example, partitioning by `customer_id` along the entire
subplan would be ideal, but I don't think we have a mechanism to propagate this
information in the plan. We currently cannot ask the leaf node "could you
provide `Hash(customer_id)` instead of `Hash(customer_id, provider_id)`.
I am now wondering about whether we should have `KeyPartitioned`, or a
mechanism to propagate ideal partitioning down the stack (e.g. `Hash(a, b)` can
be changed to `Hash(a)` to avoid a repartition later on).
Gene, does this capture your thoughts? Do you see cases where
`KeyPartitioned` adds value in between execution nodes that is not captured
here?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]