peter-toth opened a new pull request, #10396:
URL: https://github.com/apache/datafusion/pull/10396

   ## Which issue does this PR close?
   
   Part of https://github.com/apache/datafusion/issues/9873.
   
   ## Rationale for this change
   
   This PR started as part of https://github.com/apache/datafusion/issues/9873 
to reduce number of `Expr` clones but after some investigation it shifted to be 
a fix for the rule's correctness issues.
   
   1. The current `CommonSubexprEliminate` was refactored in 
https://github.com/apache/datafusion/pull/9871 to remove the `IdArray` cache 
and simplify the identifier of expresions. Unfortunately that change doesn't 
seem to be correct. The source of the issue is that an idenfifier needs to 
represent an expression subtreee and the newly chosen "stringified expr" as 
identifier doesn't seem to fulfill that purpose. E.g. an identifier shouldn't 
belong to 2 different expressions:
      ```
      println!("{}", ExprSet::expr_identifier(&(col("a") + col("b"))));    // a 
+ b
      println!("{}", ExprSet::expr_identifier(&(col("a + b"))));           // a 
+ b
      ```
      Sidenote:
      Actually I wanted to show that correctness issue of the current 
`CommonSubexprEliminate` in a test, but when I wrote a test with colliding 
column names I run into a different issue, that DataFusion resolves the 
`col("a") + col("b")` expression as if it was `col("a + b")` if an `a + b` 
field exists in the schema.
      This is a different issue (not related to `CommonSubexprEliminate` at 
all) and can be easily reproduced:
      ```
      DataFusion CLI v37.1.0
      > select a + b from (select 1 as a, 2 as b, 1 as "a + b");
      +-------+
      | a + b |
      +-------+
      | 1     | <- Should be 3
      +-------+
      1 row(s) fetched.
      Elapsed 0.009 seconds.
      ```
      So in this the first commit of PR reverts 
https://github.com/apache/datafusion/issues/9873.
   
   2. Then I investigated what is the actual purpose of `Identifier`s, why 
don't we use a simple `HashMap<Expr, (usize, DataType, Identifier)>` as 
`ExprSet`? It is clear that we need to generate a unique alias for the 
extracted common expressions, but why is the key of the map is an `Identifier` 
and not `&Expr` or `Expr` itself. And actually it turned out that the reasons 
are already explained in the comments.
      ```
      /// An identifier should (ideally) be able to "hash", "accumulate", 
"equal" and "have no
      /// collision (as low as possible)"
      ```
      If we used `Expr` as the key of the map computing the `hash()` of the 
keys would require traversing on the whole `Expr`, which can be very costly as 
`Expr`s contain indirections to subexpressions as `Box<Expr>` or `Vec<Expr>`.
   
      Using special identifiers to represent `Expr` trees and caching those 
identifiers by the preorder visit indexes in `IdArray` should actually 
significantly speed up the second top-down traversal that does the actual 
expression rewrite. 
   
      Sidenote: the current long `String` identifiers are also not a good 
choice. We need to revisit this in a follow-up PR and choose something like 
`(usize, &Expr)` tuple as identifiers. The first element of a tuple is a 
pre-calculated `hash()` of an expression and the referece to expression is 
there to implement the `eq()`.
   
   3. The second commit is a refactor and fix of the algorithm as reverting 
https://github.com/apache/datafusion/issues/9873 caused the 
https://github.com/apache/datafusion/issues/9870 issue to resurface. This is a 
major refactor but I think the code of `ExprIdentifierVisitor` and 
`CommonSubexprRewriter` became much cleaner.
   
   4. The 3rd commit actually eliminates `Expr` clones in `ExprSet`s.
   
   5. The 4th commit contains only renames and docs fixes.
   
   ## What changes are included in this PR?
   
   Please see above.
   
   ## Are these changes tested?
   
   Yes, with existing UTs.
   
   ## Are there any user-facing changes?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to