gene-bordegaray commented on issue #23236: URL: https://github.com/apache/datafusion/issues/23236#issuecomment-4844480174
Yes of course @saadtajwar. Here is a brief summary of what is going on; please feel free to ask more.

There are two ideas: `Distribution` and `Partitioning`.

1. `Distribution` describes the general guarantees of how rows are split across partitions. For example, we have `UnspecifiedDistribution`, which means we cannot guarantee that a given row value will be in any particular partition. Then we have `SinglePartition`, which means DataFusion is executing only a single partition, so we know all rows are in that partition. Finally, we have/had `HashPartitioned`, which guarantees that for a given physical expression (usually a column, let's call it `key`) rows are split such that all occurrences of a value for `key` are within a single partition. So all of `key=A` is in partition 1, all of `key=B` is in partition 2, and so on.

2. `Partitioning` is a similar but slightly different idea. It describes exactly how a `Distribution` is achieved. The easiest way to understand this is to look at `Partitioning::Hash` and `Partitioning::Range`. Both of these partitioning types uphold the contract for `Distribution::HashPartitioned`, since both guarantee that all occurrences of a value for `key` are within a single partition, but the way they achieve this is different.

   - `Partitioning::Hash` does this by applying a hash function to each row's `key`, taking that hash modulo the number of available partitions, and sending the row to the resulting partition number (say we have 3 partitions: `key=A` -> `Hash(A) = 5` -> `5 % 3 = 2` -> route to partition 2).
   - `Partitioning::Range` does this by defining split points which act as boundaries for `key` on each partition:

     ```text
     split points = [A, B]
     partition bounds:
      - partition 1: key < A
      - partition 2: A <= key < B
      - partition 3: B <= key
     ```

     It then guarantees that all rows for a `key` value are located in the partition whose range contains that value. So for `key=A` we know this will be located in partition 2.
For `key=Z` we know this will be in partition 3.

So now we understand that there is `Distribution` and `Partitioning`, and that there are multiple `Partitioning`s which can satisfy a single `Distribution` (both `Partitioning::Hash` and `Partitioning::Range` satisfy `Distribution::HashPartitioned`).

Now the problem should be pretty clear. `Distribution::HashPartitioned` is not a great name for the distribution, since it implies that to satisfy it the data must specifically be `Partitioning::Hash`, and we know this isn't necessarily true since `Partitioning::Range` is also able to satisfy it. This was not a real issue before, because DataFusion had no alternative way to achieve this distribution until `Partitioning::Range` was introduced.

Thus a more accurate name for the distribution is `KeyPartitioned`, which correctly states that to satisfy this distribution, your partitioning must guarantee that your partitions are split by the provided `key`.
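
To make the distinction concrete, here is a minimal standalone sketch in plain Rust. It does not use DataFusion's API: `hash_partition`, `range_partition`, and `is_key_partitioned` are hypothetical helpers invented for illustration, partition indices are 0-based (so "partition 2" above is index 1), and the standard library's `DefaultHasher` is only a stand-in for whatever hash function is actually used. It shows two different routing strategies both satisfying the same "all rows for a key live in one partition" contract, while a round-robin split does not.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for hash routing: hash the key, then take the
/// hash modulo the partition count. Returns a 0-based partition index.
fn hash_partition(key: &str, num_partitions: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % num_partitions as u64) as usize
}

/// Hypothetical stand-in for range routing. `split_points` must be sorted
/// ascending; the result is the number of split points <= key, which is the
/// 0-based index of the range containing the key.
fn range_partition(key: &str, split_points: &[&str]) -> usize {
    split_points.partition_point(|split| *split <= key)
}

/// Checks the key-partitioned contract on (key, partition) assignments:
/// every occurrence of a key must be assigned to the same partition.
fn is_key_partitioned(assignments: &[(&str, usize)]) -> bool {
    let mut seen: HashMap<&str, usize> = HashMap::new();
    assignments
        .iter()
        .all(|&(key, partition)| *seen.entry(key).or_insert(partition) == partition)
}

fn main() {
    let keys = ["A", "B", "A", "Z", "B"];
    let n = 3;

    let by_hash: Vec<(&str, usize)> =
        keys.iter().map(|&k| (k, hash_partition(k, n))).collect();
    let by_range: Vec<(&str, usize)> =
        keys.iter().map(|&k| (k, range_partition(k, &["A", "B"]))).collect();
    // Round-robin routes by row position, ignoring the key entirely.
    let by_round_robin: Vec<(&str, usize)> =
        keys.iter().enumerate().map(|(i, &k)| (k, i % n)).collect();

    println!("hash:        {}", is_key_partitioned(&by_hash));
    println!("range:       {}", is_key_partitioned(&by_range));
    println!("round-robin: {}", is_key_partitioned(&by_round_robin));
}
```

Hash and range routing are both deterministic functions of `key`, which is why either one upholds the contract; round-robin depends on row position, so the two `A` rows end up in different partitions. That is the sense in which the requirement is about the key, not about hashing specifically.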
