peter-toth commented on code in PR #10333:
URL: https://github.com/apache/datafusion/pull/10333#discussion_r1595242837

##########
datafusion/optimizer/src/common_subexpr_eliminate.rs:
##########
@@ -656,24 +656,16 @@ enum VisitRecord {
     EnterMark(usize),
     /// the node's children were skipped => jump to f_up on same node
     JumpMark,
-    /// Accumulated identifier of sub expression.
-    ExprItem(Identifier),
 }
 
 impl ExprIdentifierVisitor<'_> {
     /// Find the first `EnterMark` in the stack, and accumulates every `ExprItem`
     /// before it.
-    fn pop_enter_mark(&mut self) -> Option<(usize, Identifier)> {
-        let mut desc = String::new();
-
-        while let Some(item) = self.visit_stack.pop() {
+    fn pop_enter_mark(&mut self) -> Option<usize> {

Review Comment:
We shouldn't change this part. The logic that builds up an identifier using `visit_stack` and the 3 kinds of `VisitRecord` is necessary, and it is actually a very clever way to build up an identifier from the current node and its sub-identifiers. (Making an identifier a `String` was not that clever a decision and will be fixed in https://github.com/apache/datafusion/issues/10426, but that's a different issue.)

This PR shouldn't change what an identifier is or how it is built up, otherwise we end up with identifier collision bugs again. The `IdArray`, `ExprStats` and `CommonExprs` data structures require an identifier to represent a full expression subtree. This means that:
```
fn expr_identifier(expr: &Expr) -> Identifier {
    format!("#{{{expr}}}")
}
```
would cause bugs as shown in 1. of https://github.com/apache/datafusion/pull/10396. I.e. if we encountered both `col("a") + col("b")` and `col("a + b")` in the expression list to be CSEd and used `"{expr}"` (the non-unique stringified representation) as the identifier, then the equal identifiers (`"a + b"`) of those 2 different expressions would collide: we would count 2 occurrences for one of the 2 expressions and lose the other expression's count, resulting in wrong CSE.
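
To make the collision concrete, here is a minimal, self-contained sketch. It is not DataFusion code: the `Expr` enum, `display_identifier`, `subtree_identifier` and `count_occurrences` are hypothetical stand-ins, and the `|`-joined format only imitates the shape of the current identifiers rather than reproducing the real `visit_stack` logic.

```rust
use std::collections::HashMap;
use std::fmt;

// Hypothetical stand-in for DataFusion's `Expr`; it has only the two
// variants needed to reproduce the collision.
#[derive(Debug, Clone)]
enum Expr {
    Column(String),
    Add(Box<Expr>, Box<Expr>),
}

impl fmt::Display for Expr {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Expr::Column(name) => write!(f, "{name}"),
            Expr::Add(l, r) => write!(f, "{l} + {r}"),
        }
    }
}

// Identifier built from the display string only. Two different trees
// that print the same text get the same identifier.
fn display_identifier(expr: &Expr) -> String {
    format!("{{{expr}}}")
}

// Identifier built from the node and all of its descendants (assumed
// format: display strings joined with '|', right child before left).
fn subtree_identifier(expr: &Expr) -> String {
    fn visit(expr: &Expr, parts: &mut Vec<String>) {
        parts.push(expr.to_string());
        if let Expr::Add(l, r) = expr {
            visit(r, parts);
            visit(l, parts);
        }
    }
    let mut parts = Vec::new();
    visit(expr, &mut parts);
    format!("{{{}}}", parts.join("|"))
}

// Counts how often each identifier occurs, which is roughly what the
// expression statistics need in order to decide what to extract.
fn count_occurrences(exprs: &[Expr], id_of: fn(&Expr) -> String) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for expr in exprs {
        *counts.entry(id_of(expr)).or_insert(0) += 1;
    }
    counts
}
```

Counting `Add(Column("a"), Column("b"))` and `Column("a + b")` with `display_identifier` yields a single entry with count 2, so one expression looks like a repeated subexpression and the other disappears from the statistics. With `subtree_identifier` the two expressions keep separate entries, each with count 1.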

Please note that currently the identifier of `col("a") + col("b")` is `"{a + b|b|a}"`, so it doesn't collide with `col("a + b")`'s identifier, `"{a + b}"`.

Again, this is hard to test now because of the resolution bug: https://github.com/apache/datafusion/issues/10413. I.e. if we wrote a test with
```
select a + b, "a + b"
from (
  select 1 as a, 2 as b, 1 as "a + b"
)
```
then currently it gets resolved as
```
select "a + b", "a + b"
from (
  select 1 as a, 2 as b, 1 as "a + b"
)
```
and this prevents me from creating a test case for the CSE identifier collision. (Please note that I'm simplifying the identifier collision example, as simple columns (`col("a + b")`) are not subject to CSE.)

What this PR can do is change the aliases (use something other than identifiers) to make the plans more readable.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org