Re: [PR] Implement `DISTINCT ON` from Postgres [arrow-datafusion]

via GitHub Thu, 02 Nov 2023 23:53:28 -0700


gruuya commented on code in PR #7981:
URL: https://github.com/apache/arrow-datafusion/pull/7981#discussion_r1381211603



##########
datafusion/expr/src/logical_plan/plan.rs:
##########
@@ -2132,9 +2181,100 @@ pub struct Limit {
 
 /// Removes duplicate rows from the input
 #[derive(Clone, PartialEq, Eq, Hash)]
-pub struct Distinct {
+pub enum Distinct {
+    /// Plain `DISTINCT` referencing all selection expressions
+    All(Arc<LogicalPlan>),
+    /// The `Postgres` addition, allowing separate control over DISTINCT'd and 
selected columns
+    On(DistinctOn),
+}
+
+/// Removes duplicate rows from the input
+#[derive(Clone, PartialEq, Eq, Hash)]
+pub struct DistinctOn {
+    /// The `DISTINCT ON` clause expression list
+    pub on_expr: Vec<Expr>,
+    /// The selected projection expression list
+    pub select_expr: Vec<Expr>,
+    /// The `ORDER BY` clause, whose initial expressions must match those of 
the `ON` clause

Review Comment:
   Yeah, `on_expr` and `sort_expr` should have a (partial) overlap 
(`select_expr` and `sort_expr` are arbitrary and different in general).
   
   Condensing `on_expr` and `sort_expr` into one is doable, though it seems to 
me it would introduce more complexity, and make the logic harder to grasp. For 
starters `sort_expr` is not an exact super-set of `on_expr`, but instead wraps 
them inside `Expr::Sort`s to capture `asc` and `nulls_first`, but it can also 
be omitted (unlike `ON` exprs). On the other hand, these two are parsed at 
different places: first the `on_expr` is parsed in `select_to_plan`, and then 
`sort_expr` is extracted in `order_by`.
   
   So in order to keep them both in one vec we'd have to: a) add an additional 
field to `DistinctOn` to track the length of the `ON` expressions as you 
mention (e.g. `on_expr_count`), b) if there are no actual `sort_expr` provided 
keep the `ON` expressions in them, otherwise validate that the wrapped sorting 
expressions match the existing `ON` expressions and replace the vector with the 
sorting expressions, and then c) inside `replace_distinct_aggregate` 
deconstruct that. That would mean take `on_expr_count` first expressions from 
`sort_expr`, and unwrapping the underlying expressions from `Expr::Sort` in 
case sort expression length is > `sort_expr`. In case sort expression length is 
equal to `on_expr_count` it could mean no `ORDER BY` was provided (no need for 
unwrapping), or the `ORDER BY` expressions match the `ON` expressions modulo 
the aforementioned sorting specifiers (`asc` and `nulls_first`).
   
   This strikes me as less clear all in all, though if you'd prefer it I'll 
make that change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Implement `DISTINCT ON` from Postgres [arrow-datafusion]

Reply via email to