gabotechs commented on issue #20176: URL: https://github.com/apache/datafusion/issues/20176#issuecomment-3858719580
> A long term fix is to introduce a new type of partitioning for the file partitioning to safely distinguish the two. Something like KeyPartitoned or ValuePartitioned is suiting.

I think the problem goes beyond that. Even if both sides of a join are `Partitioning::Hash` because there was a `RepartitionExec` before them, there is no guarantee that the partitioning strategy was the same on both sides. For example:

- What if both sides of the join were manually repartitioned by the user with a custom rule, and the random seed used to build the hashes is different?
- What if in the future we want a new algorithm for `RepartitionExec` that is capable of adaptively increasing or decreasing the number of output partitions? Both sides would still need to match.

By the same reasoning that motivates introducing `KeyPartitoned` or similar, we could argue that even more partitioning modes would need to be added, even though all of these partitioning methods match the current definition of "Hash Partitioned" (a bit of an unfortunate name).

> Oracle calls this [ListPartitioning](https://docs.oracle.com/en/database/oracle/oracle-database/26/cncpt/partitions-views-and-other-schema-objects.html)

Note that this refers to how data is physically laid out in persistent storage. The document you shared describes how to partition data storage, whereas the problem here is how to partition compute resources for reads.
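
To make the seed-mismatch example above concrete, here is a minimal sketch using only the Rust standard library. It is not how `RepartitionExec` is implemented; `partition_for` and `count_mismatches` are hypothetical helpers, and mixing a seed into `DefaultHasher` is just an assumed stand-in for "a hash partitioner built with a given seed". It shows that two sides can both be "hash partitioned" on the same key and still route equal keys to different partitions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: route `key` to one of `num_partitions` buckets
/// using a hash that is perturbed by `seed`.
fn partition_for(key: u64, seed: u64, num_partitions: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    // Feeding the seed first changes the resulting hash for every key.
    seed.hash(&mut hasher);
    key.hash(&mut hasher);
    hasher.finish() % num_partitions
}

/// Hypothetical helper: count the keys in `0..num_keys` that the two
/// seeded partitioners send to different partition indices.
fn count_mismatches(left_seed: u64, right_seed: u64, num_keys: u64, num_partitions: u64) -> u64 {
    (0..num_keys)
        .filter(|&k| {
            partition_for(k, left_seed, num_partitions)
                != partition_for(k, right_seed, num_partitions)
        })
        .count() as u64
}

fn main() {
    // Same seed on both sides: every key lands in the same partition index.
    let same = count_mismatches(42, 42, 1_000, 8);
    // Different seeds: both sides are still "hash partitioned" on the key,
    // but partition i on the left no longer lines up with partition i on the right.
    let different = count_mismatches(42, 7, 1_000, 8);
    println!("same seed: {same} mismatched keys");
    println!("different seeds: {different} mismatched keys");
}
```

With identical seeds the mismatch count is zero. With different seeds, a partition-wise join that pairs partition `i` on the left with partition `i` on the right would miss matches for every mismatched key, even though both inputs report a hash partitioning on the same expression.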
