gruuya commented on code in PR #7981:
URL: https://github.com/apache/arrow-datafusion/pull/7981#discussion_r1377468064


##########
datafusion/sqllogictest/test_files/distinct_on.slt:
##########


Review Comment:
   All tests here were compared for accuracy against Postgres.
   
   Note that, while the 
[docs](https://www.postgresql.org/docs/current/sql-select.html#SQL-DISTINCT) 
state that in the absence of the ORDER BY clause the output is unpredictable, 
in practice I've noticed it is stable on Postgres for the table below, whereas 
that isn't the case for DataFusion, hence why I omitted those tests for now.
   
   Any ideas on what else can be tested here are welcome (cc @universalmind303).



##########
datafusion/expr/src/logical_plan/plan.rs:
##########
@@ -2132,9 +2181,96 @@ pub struct Limit {
 
 /// Removes duplicate rows from the input
 #[derive(Clone, PartialEq, Eq, Hash)]
-pub struct Distinct {
+pub enum Distinct {
+    /// Plain `DISTINCT` referencing all selection expressions
+    All(Arc<LogicalPlan>),
+    /// The `Postgres` addition, allowing separate control over DISTINCT'd and 
selected columns
+    On(DistinctOn),
+}
+
+/// Removes duplicate rows from the input
+#[derive(Clone, PartialEq, Eq, Hash)]
+pub struct DistinctOn {
+    /// The `DISTINCT ON` clause expression list
+    pub on_expr: Vec<Expr>,
+    /// The selected projection expression list
+    pub select_expr: Vec<Expr>,
+    /// The `ORDER BY` clause, whose initial expressions must match those of 
the `ON` clause
+    pub sort_expr: Option<Vec<Expr>>,
     /// The logical plan that is being DISTINCT'd
     pub input: Arc<LogicalPlan>,
+    /// The schema description of the DISTINCT ON output
+    pub schema: DFSchemaRef,
+}
+
+impl DistinctOn {
+    /// Create a new `DistintOn` struct.
+    pub fn try_new(
+        on_expr: Vec<Expr>,
+        select_expr: Vec<Expr>,
+        sort_expr: Option<Vec<Expr>>,
+        input: Arc<LogicalPlan>,
+    ) -> Result<Self> {
+        let on_expr = normalize_cols(on_expr, input.as_ref())?;
+
+        // Create fields with any qualifier stuffed in the name itself

Review Comment:
   This is sort of a hack, but the reason I was forced to do it is so that the 
plan schema remains unchanged after `replace_distinct_aggregate`. In 
particular, in that optimizer I alias the final projected expressions with the 
selection expression display name, but that always results in a schema without 
qualifiers (with any original qualifier crammed into the field name).
   
   I feel like this case should be handled by 
`Projection::try_new_with_schema`, but it seems that after #7919 the custom 
schema doesn't propagate through the optimization chain.



##########
datafusion/optimizer/src/common_subexpr_eliminate.rs:
##########
@@ -238,8 +238,17 @@ impl CommonSubexprEliminate {
         let rewritten = pop_expr(&mut rewritten)?;
 
         if affected_id.is_empty() {
+            // Alias aggregation epxressions if they have changed
+            // TODO: This should have really been identified above and handled 
in the `else` branch
+            let aggr_exprs = new_aggr_expr
+                .iter()
+                .zip(aggr_expr.iter())
+                .map(|(new_expr, old_expr)| {
+                    new_expr.clone().alias_if_changed(old_expr.display_name()?)
+                })
+                .collect::<Result<Vec<Expr>>>()?;
             // Since group_epxr changes, schema changes also. Use try_new 
method.
-            Aggregate::try_new(Arc::new(new_input), new_group_expr, 
new_aggr_expr)
+            Aggregate::try_new(Arc::new(new_input), new_group_expr, aggr_exprs)
                 .map(LogicalPlan::Aggregate)

Review Comment:
   I don't yet understand why this wasn't identified/aliased automatically, but 
I found this part to be necessary in order to get `DISTINCT ON` with complex 
expressions to work.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to