peter-toth opened a new pull request, #10396: URL: https://github.com/apache/datafusion/pull/10396
## Which issue does this PR close? Part of https://github.com/apache/datafusion/issues/9873. ## Rationale for this change This PR started as part of https://github.com/apache/datafusion/issues/9873 to reduce number of `Expr` clones but after some investigation it shifted to be a fix for the rule's correctness issues. 1. The current `CommonSubexprEliminate` was refactored in https://github.com/apache/datafusion/pull/9871 to remove the `IdArray` cache and simplify the identifier of expresions. Unfortunately that change doesn't seem to be correct. The source of the issue is that an idenfifier needs to represent an expression subtreee and the newly chosen "stringified expr" as identifier doesn't seem to fulfill that purpose. E.g. an identifier shouldn't belong to 2 different expressions: ``` println!("{}", ExprSet::expr_identifier(&(col("a") + col("b")))); // a + b println!("{}", ExprSet::expr_identifier(&(col("a + b")))); // a + b ``` Sidenote: Actually I wanted to show that correctness issue of the current `CommonSubexprEliminate` in a test, but when I wrote a test with colliding column names I run into a different issue, that DataFusion resolves the `col("a") + col("b")` expression as if it was `col("a + b")` if an `a + b` field exists in the schema. This is a different issue (not related to `CommonSubexprEliminate` at all) and can be easily reproduced: ``` DataFusion CLI v37.1.0 > select a + b from (select 1 as a, 2 as b, 1 as "a + b"); +-------+ | a + b | +-------+ | 1 | <- Should be 3 +-------+ 1 row(s) fetched. Elapsed 0.009 seconds. ``` So in this the first commit of PR reverts https://github.com/apache/datafusion/issues/9873. 2. Then I investigated what is the actual purpose of `Identifier`s, why don't we use a simple `HashMap<Expr, (usize, DataType, Identifier)>` as `ExprSet`? It is clear that we need to generate a unique alias for the extracted common expressions, but why is the key of the map is an `Identifier` and not `&Expr` or `Expr` itself. And actually it turned out that the reasons are already explained in the comments. ``` /// An identifier should (ideally) be able to "hash", "accumulate", "equal" and "have no /// collision (as low as possible)" ``` If we used `Expr` as the key of the map computing the `hash()` of the keys would require traversing on the whole `Expr`, which can be very costly as `Expr`s contain indirections to subexpressions as `Box<Expr>` or `Vec<Expr>`. Using special identifiers to represent `Expr` trees and caching those identifiers by the preorder visit indexes in `IdArray` should actually significantly speed up the second top-down traversal that does the actual expression rewrite. Sidenote: the current long `String` identifiers are also not a good choice. We need to revisit this in a follow-up PR and choose something like `(usize, &Expr)` tuple as identifiers. The first element of a tuple is a pre-calculated `hash()` of an expression and the referece to expression is there to implement the `eq()`. 3. The second commit is a refactor and fix of the algorithm as reverting https://github.com/apache/datafusion/issues/9873 caused the https://github.com/apache/datafusion/issues/9870 issue to resurface. This is a major refactor but I think the code of `ExprIdentifierVisitor` and `CommonSubexprRewriter` became much cleaner. 4. The 3rd commit actually eliminates `Expr` clones in `ExprSet`s. 5. The 4th commit contains only renames and docs fixes. ## What changes are included in this PR? Please see above. ## Are these changes tested? Yes, with existing UTs. ## Are there any user-facing changes? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org