peter-toth commented on code in PR #10333:
URL: https://github.com/apache/datafusion/pull/10333#discussion_r1595242837


##########
datafusion/optimizer/src/common_subexpr_eliminate.rs:
##########
@@ -656,24 +656,16 @@ enum VisitRecord {
     EnterMark(usize),
     /// the node's children were skipped => jump to f_up on same node
     JumpMark,
-    /// Accumulated identifier of sub expression.
-    ExprItem(Identifier),
 }
 
 impl ExprIdentifierVisitor<'_> {
     /// Find the first `EnterMark` in the stack, and accumulates every `ExprItem`
     /// before it.
-    fn pop_enter_mark(&mut self) -> Option<(usize, Identifier)> {
-        let mut desc = String::new();
-
-        while let Some(item) = self.visit_stack.pop() {
+    fn pop_enter_mark(&mut self) -> Option<usize> {

Review Comment:
   We shouldn't change this part.
   The logic that builds up an identifier using `visit_stack` and the 3 kinds of
   `VisitRecord` is necessary, and it is actually a very clever way to build up an
   identifier from the current node and its sub-identifiers. (Making an identifier a
   `String` was not such a clever decision and will be fixed in
   https://github.com/apache/datafusion/issues/10426, but that's a different
   issue.)
   
   This PR shouldn't change what an identifier is or how it is built up,
   otherwise we end up with identifier collision bugs again. The `IdArray`,
   `ExprStats` and `CommonExprs` data structures require an identifier to represent
   a full expression subtree. This means that:
   ```
   fn expr_identifier(expr: &Expr) -> Identifier {
       format!("#{{{expr}}}")
   }
   ```
   would cause bugs as shown in 1. of 
https://github.com/apache/datafusion/pull/10396.
   
   I.e. if we encountered both `col("a") + col("b")` and `col("a + b")` in the
   expression list to be CSEd and used `"{expr}"` (the non-unique stringified
   representation) as identifiers, then the identical identifier (`"a + b"`) of
   those 2 different expressions would collide: we would count 2 occurrences for
   one of the 2 expressions and lose the other expression's count, resulting in
   wrong CSE.
   
   Please note that currently the identifier of `col("a") + col("b")` is `"{a + 
b|b|a}"` so it doesn't collide with `col("a + b")`'s identifier: `"{a + b}"`.
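   
   To make the collision concrete, here is a minimal standalone sketch. It does
   not use DataFusion: `Expr`, `display_identifier`, `subtree_identifier` and
   `count_by` are hypothetical stand-ins, and the `|`-joined format only mimics
   the identifiers quoted above (the exact format for nested expressions in the
   real visitor may differ).

   ```rust
   use std::collections::HashMap;
   use std::fmt;

   // Toy stand-in for DataFusion's `Expr` (hypothetical, illustration only).
   enum Expr {
       Column(String),
       Add(Box<Expr>, Box<Expr>),
   }

   impl fmt::Display for Expr {
       fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
           match self {
               Expr::Column(name) => write!(f, "{name}"),
               Expr::Add(l, r) => write!(f, "{l} + {r}"),
           }
       }
   }

   // Display-only identifier: ignores the shape of the subtree.
   fn display_identifier(expr: &Expr) -> String {
       format!("{{{expr}}}")
   }

   // Node display followed by the descriptions of its children, right child
   // first (assumed here to mirror the order in which a stack is popped).
   fn describe(expr: &Expr) -> String {
       match expr {
           Expr::Column(name) => name.clone(),
           Expr::Add(l, r) => format!("{expr}|{}|{}", describe(r), describe(l)),
       }
   }

   // Subtree-aware identifier: encodes the whole expression subtree.
   fn subtree_identifier(expr: &Expr) -> String {
       format!("{{{}}}", describe(expr))
   }

   // Count occurrences per identifier, as a CSE pass would before rewriting.
   fn count_by(exprs: &[Expr], id: impl Fn(&Expr) -> String) -> HashMap<String, usize> {
       let mut counts = HashMap::new();
       for e in exprs {
           *counts.entry(id(e)).or_insert(0) += 1;
       }
       counts
   }
   ```

   With the two expressions from the example, `count_by(&exprs, display_identifier)`
   yields a single entry with count 2, while `count_by(&exprs, subtree_identifier)`
   keeps two separate entries with count 1 each.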
   
   Again, this is hard to test now because of the resolution bug: 
https://github.com/apache/datafusion/issues/10413.
   I.e. if we wrote a test where we have
   ```
   select a + b, "a + b" from (
      select 1 as a, 2 as b, 1 as "a + b"
   )
   ```
   then currently it gets resolved as 
   ```
   select "a + b", "a + b" from (
      select 1 as a, 2 as b, 1 as "a + b"
   )
   ```
   and this prevents me from creating a test case for CSE identifier collision.
   (Please note that I'm simplifying the identifier collision example, as simple
   columns (`col("a + b")`) are not subject to CSE.)
   
   What this PR can do is change the aliases (use something other than
   identifiers) to make the plans more readable.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

