berkaysynnada commented on issue #12700: URL: https://github.com/apache/datafusion/issues/12700#issuecomment-2411134296
I have worked through [this comment](https://github.com/apache/datafusion/issues/12700#issuecomment-2389022925) in detail to determine whether there is an actual bug in the current implementation. Initially, my thoughts were similar to @akurmustafa's intuition, and I draft a few designs. However, I can now safely say that the current implementation is correct, and the tests are valid for both the `add_new_orderings()` and `satisfy()` APIs. Below are my brief findings: The example that triggered the initial bug was `[a, b, c, d]` and `[a, c, b, d]`. We thought it could be simplified to `[a, b, d]` and `[a, c, d]`, but this is a mistaken assumption. **Why?** Consider this table: | a | b | c | d | |---|---|---|---| | 1 | 1 | 1 | 1 | | 1 | 1 | 2 | 2 | | 1 | 2 | 3 | 3 | | 1 | 3 | 4 | 1 | This table is ordered by both `[a, b, c, d]` and `[a, c, b, d]`; also `[a, b, d]` and `[a, c, d]` are also valid orderings. However, when we add these orderings to the equivalence properties, can we make them stored as `[a, b, d]` and `[a, c, d]`? The answer is **no**. Consider this table: | a | b | c | d | |---|---|---|---| | 1 | 1 | 1 | 1 | | 1 | 1 | 2 | 2 | | 1 | 2 | 3 | 3 | | 1 | 3 | 3 | 1 | This table also satisfies both `[a, b, c, d]` and `[a, c, b, d]`, but we cannot simplify them to `[a, b, d]` and `[a, c, d]`. In this case, `[a, c, d]` is not valid. If someone needs to infer an ordering corresponding to the first table, it should be given as `[a, b, d]` and `[a, c, d]` because we can derive both `[a, b, c, d]` and `[a, c, b, d]` from these. However, the reverse is not true. **In short:** - `[a, c]` + `[b, c]` satisfies both `[a, b, c]` and `[b, a, c]`. (Global prefixes can be inserted anywhere in other orderings, and we can safely remove the duplicate element closer to the right-hand side.) - `[a, b, c]` + `[b, a, c]` does not satisfy `[a, c]` or `[b, c]`. (It actually satisfies one of them, but since we cannot know which one, we say it does not satisfy either.) However, if we know that `a` is unique, we can ensure that `[a, c]` is valid. Similarly, if `b` is unique, then `[b, c]` is valid. This is a task for another day. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
