gabotechs commented on issue #20176:
URL: https://github.com/apache/datafusion/issues/20176#issuecomment-3858719580

   > A long term fix is to introduce a new type of partitioning for the file 
partitioning to safely distinguish the two. Something like KeyPartitoned or 
ValuePartitioned is suiting.
   
   I think the problem goes beyond that. Even if both sides of a join report 
`Partitioning::Hash` because each has a `RepartitionExec` upstream, there is no 
guarantee that both used the same partitioning strategy. For example:
   - What if both sides of the join were manually repartitioned by the user 
with a custom rule, and the random seed used to build the hashes is different?
   - What if, in the future, we want a new algorithm for `RepartitionExec` that 
can adaptively increase or decrease the number of output partitions? Both 
sides would still need to match.
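   
   To make both cases concrete, here is a minimal sketch. It does not use 
DataFusion's actual hashing code: `HashScheme`, `partition_for` and 
`count_misaligned` are hypothetical names, and the standard library's 
`DefaultHasher` stands in for whatever hash function a repartitioner would 
use. It counts how many join keys the two sides would route to different 
partitions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::ops::Range;

/// Hypothetical description of how one side of a join was hash partitioned.
#[derive(Clone, Copy)]
struct HashScheme {
    seed: u64,
    num_partitions: usize,
}

impl HashScheme {
    /// Maps a join key to an output partition, mixing the seed into the hash.
    fn partition_for(&self, key: i64) -> usize {
        let mut hasher = DefaultHasher::new();
        self.seed.hash(&mut hasher);
        key.hash(&mut hasher);
        (hasher.finish() % self.num_partitions as u64) as usize
    }
}

/// Counts the keys that the two sides would send to different partitions.
fn count_misaligned(keys: Range<i64>, left: HashScheme, right: HashScheme) -> usize {
    keys.filter(|&k| left.partition_for(k) != right.partition_for(k))
        .count()
}

fn main() {
    let base = HashScheme { seed: 42, num_partitions: 8 };
    let other_seed = HashScheme { seed: 7, num_partitions: 8 };
    let other_count = HashScheme { seed: 42, num_partitions: 4 };

    // Identical schemes: every key lands in the same partition on both sides.
    println!("same scheme:     {}", count_misaligned(0..1000, base, base));
    // Both sides are "hash partitioned", yet matching keys end up apart.
    println!("different seed:  {}", count_misaligned(0..1000, base, other_seed));
    println!("different count: {}", count_misaligned(0..1000, base, other_count));
}
```

   In this sketch all three schemes would be described as hash partitioned on 
the same key, but only the first pairing keeps matching keys in the same 
partition, which is what a partitioned join relies on.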
   
   By the same reasoning that motivates introducing `KeyPartitoned` or 
similar, we could argue that even more partitioning modes would need to be 
added, even though all of these partitioning methods match the current 
definition of "Hash Partitioned" (a bit of an unfortunate name).
   
   > Oracle calls this 
[ListPartitioning](https://docs.oracle.com/en/database/oracle/oracle-database/26/cncpt/partitions-views-and-other-schema-objects.html)
   
   Note that this refers to how data is physically laid out in persistent 
storage. While the document you shared describes how to partition data 
storage, the problem here is how to partition compute resources when reading.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

