gruuya commented on code in PR #7981:
URL: https://github.com/apache/arrow-datafusion/pull/7981#discussion_r1381211603
##########
datafusion/expr/src/logical_plan/plan.rs:
##########
@@ -2132,9 +2181,100 @@ pub struct Limit {
/// Removes duplicate rows from the input
#[derive(Clone, PartialEq, Eq, Hash)]
-pub struct Distinct {
+pub enum Distinct {
+ /// Plain `DISTINCT` referencing all selection expressions
+ All(Arc<LogicalPlan>),
+ /// The `Postgres` addition, allowing separate control over DISTINCT'd and
selected columns
+ On(DistinctOn),
+}
+
+/// Removes duplicate rows from the input
+#[derive(Clone, PartialEq, Eq, Hash)]
+pub struct DistinctOn {
+ /// The `DISTINCT ON` clause expression list
+ pub on_expr: Vec<Expr>,
+ /// The selected projection expression list
+ pub select_expr: Vec<Expr>,
+ /// The `ORDER BY` clause, whose initial expressions must match those of
the `ON` clause
Review Comment:
Yeah, `on_expr` and `sort_expr` should have a (partial) overlap
(`select_expr` and `sort_expr` are arbitrary and different in general).
Condensing `on_expr` and `sort_expr` into one is doable, though it seems to
me it would introduce more complexity, and make the logic harder to grasp. For
starters `sort_expr` is not an exact super-set of `on_expr`, but instead wraps
them inside `Expr::Sort`s to capture `asc` and `nulls_first`, but it can also
be omitted (unlike `ON` exprs). On the other hand, these two are parsed at
different places: first the `on_expr` is parsed in `select_to_plan`, and then
`sort_expr` is extracted in `order_by`.
So in order to keep them both in one vec we'd have to: a) add an additional
field to `DistinctOn` to track the length of the `ON` expressions as you
mention (e.g. `on_expr_count`), b) if there are no actual `sort_expr` provided
keep the `ON` expressions in them, otherwise validate that the wrapped sorting
expressions match the existing `ON` expressions and replace the vector with the
sorting expressions, and then c) inside `replace_distinct_aggregate`
deconstruct that. That would mean take `on_expr_count` first expressions from
`sort_expr`, and unwrapping the underlying expressions from `Expr::Sort` in
case sort expression length is > `sort_expr`. In case sort expression length is
equal to `on_expr_count` it could mean no `ORDER BY` was provided (no need for
unwrapping), or the `ORDER BY` expressions match the `ON` expressions modulo
the aforementioned sorting specifiers (`asc` and `nulls_first`).
This strikes me as less clear all in all, though if you'd prefer it I'll
make that change.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]