Re: [PR] Implement `DISTINCT ON` from Postgres [arrow-datafusion]

via GitHub Wed, 01 Nov 2023 14:50:49 -0700


gruuya commented on code in PR #7981:
URL: https://github.com/apache/arrow-datafusion/pull/7981#discussion_r1379324782



##########
datafusion/optimizer/src/common_subexpr_eliminate.rs:
##########
@@ -238,8 +238,17 @@ impl CommonSubexprEliminate {
         let rewritten = pop_expr(&mut rewritten)?;
 
         if affected_id.is_empty() {
+            // Alias aggregation epxressions if they have changed
+            // TODO: This should have really been identified above and handled 
in the `else` branch
+            let aggr_exprs = new_aggr_expr
+                .iter()
+                .zip(aggr_expr.iter())
+                .map(|(new_expr, old_expr)| {
+                    new_expr.clone().alias_if_changed(old_expr.display_name()?)
+                })
+                .collect::<Result<Vec<Expr>>>()?;
             // Since group_epxr changes, schema changes also. Use try_new 
method.
-            Aggregate::try_new(Arc::new(new_input), new_group_expr, 
new_aggr_expr)
+            Aggregate::try_new(Arc::new(new_input), new_group_expr, aggr_exprs)
                 .map(LogicalPlan::Aggregate)

Review Comment:
   Ok, I think I understand what is going on:
   - in general the root aggregation expressions need not be rewritten/aliased 
above since they can only appear once (i.e. they're not common), so this is ok
   - however, some of their child expressions may get rewritten; an example is 
the query `select distinct on (c1 + 1) c2 from test order by c1 + 1, c3`, where 
the initial aggregation of the sole projected field gets rewritten from 
`FIRST_VALUE(test.column2) ORDER BY [test.column1 + Int64(1) ASC NULLS LAST, 
test.column3 ASC NULLS LAST]` to `FIRST_VALUE(test.column2) ORDER BY 
[test.column1 + Int64(1)Int64(1)test.column1 AS test.column1 + Int64(1) ASC 
NULLS LAST, test.column3 ASC NULLS LAST]` (i.e. `c1 + 1` is a common expression 
of it and the `ON` expression so it gets rewritten)
   
   I think this explains the behavior I've encountered here, and that this 
behavior is not anomalous. The only question is whether there is a better way 
to align the new aggregation schema with the old one besides aliasing as above.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Implement `DISTINCT ON` from Postgres [arrow-datafusion]

Reply via email to