gene-bordegaray commented on issue #23236:
URL: https://github.com/apache/datafusion/issues/23236#issuecomment-4844480174

   Yes of course @saadtajwar . Here is a brief summary of what is going on and 
please feel free to ask more.
   
   There are two ideas: `Distribution` and `Partitioning`. 
   1. `Distribution` describes the general guarantees of how rows are split 
across partitions. For example we have `UnspecifiedDistribution` which means we 
cannot guarantee a certain row value will be within any particular partition. 
Then we have `SinglePartition` which means DataFusion is only executing a 
single partition so we know all rows are going to be in that partition. Finally 
we have/had `HashPartitioned` which guarantees that for a given physical 
expression (usually a column, let's call it `key`) rows are split such that all 
occurrences of a value for `key` are all within a single partition. So all of 
`key=A` in partition 1, `key=B` located in partition 2 and so on.
   2. `Partitioning` is a similar but slightly different idea. This is 
describing exactly how a `Distribution` is achieved. The easiest way to 
understand this is to look at `Partitioning::Hash` and `Partitioning::Range`. 
Both of these partitioning types uphold the contract for 
`Distribution::HashPartitioned` since we guarantee all occurrences of a value 
for `key` are all within a single partition, but the way they achieve this is 
different. 
   - `Partitioning::Hash` does this by applying a hash function to each row, 
then modulo-ing that hashed result by the number of available partitions and 
sending the row to the partition number (say we have 3 partitions: `key=A` -> 
`Hash(A) = 5` -> `5 % 3 = 2` -> route to partition 2). 
   - `Partitioning::Range` does this by defining split points which act as 
boundaries for `key` on each partition:
   ```text
   split points = [A, B]
   parition bounds: `
   - partition 1: key < A
   - partition 2: A <= key < B
   - partition 3: B <= key
   ```
   then guarantees that all rows for a `key` value will be located in the 
partition whose range contains the key value. So for `key=A` we know this will 
be located in partition 2. For `key=Z` we know this will be in partition 3.
   
   So now we understand that there is `Distribution` and `Partitioning` and 
that there are multiple `Partitioning`s which can satisfy a single 
`Distribution` (both `Partititoning::Hash` and `Partitioning::Range` satisfy 
`Distribution::HashPartitioned`).
   
   Now the problem should be pretty clear. `Distribution::HashPartitioned` is 
not a great name for the distribution since it implies to satisfy it the data 
must specifically be `Partitioning::Hash` and we know that this isn't 
necessarily true singe `Partitioning::Range` is also able to satisfy this. This 
no longer used to be a real issue since DataFusion did not have any alternative 
way to achieve this distribution before `Partitioning::Range` was introduced. 
Thus a more accurate name for the distribution is `KeyPartitioned` which 
correctly defines that to satisfy this distribution, your partitioning must 
guarantee that your partitions are split by the provided `key`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to