[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6445: Add support for FIRST_VALUE, LAST_VALUE Aggregate Functions

via GitHub Fri, 26 May 2023 09:51:21 -0700


alamb commented on code in PR #6445:
URL: https://github.com/apache/arrow-datafusion/pull/6445#discussion_r1206907936



##########
datafusion/core/tests/dataframe_functions.rs:
##########
@@ -175,11 +175,11 @@ async fn test_fn_approx_percentile_cont() -> Result<()> {
     let expr = approx_percentile_cont(col("b"), lit(0.5));
 
     let expected = vec![
-        "+-------------------------------------------+",
-        "| APPROXPERCENTILECONT(test.b,Float64(0.5)) |",
-        "+-------------------------------------------+",
-        "| 10                                        |",
-        "+-------------------------------------------+",
+        "+---------------------------------------------+",

Review Comment:
   I think this is nice -- it makes the aggregates consistent with the (very) 
recent change from @2010YOUY01 for scalar functions: 
https://github.com/apache/arrow-datafusion/pull/6448



##########
datafusion/expr/src/aggregate_function.rs:
##########
@@ -79,7 +84,9 @@ pub enum AggregateFunction {
 impl fmt::Display for AggregateFunction {
     fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
         // uppercase of the debug.
-        write!(f, "{}", format!("{self:?}").to_uppercase())
+        // Convert Camel form to uppercase snake
+        // such as FirstValue => FIRST_VALUE
+        write!(f, "{}", convert_camel_to_upper_snake(format!("{self:?}")))
     }

Review Comment:
   I think this would be better if it were explicit rather than trying to 
reverse engineer the debug name. That way we wouldn't have to tie the rust 
variant name to the enum name. 
   
   For example
   
   ```rust
   impl AggregateFunction
     fn name(&self) -> &str {
       match self
         ...
         FirstValue => "first_value",
        ...
       }
     }
   }
   
   impl fmt::Display for AggregateFunction {
       fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "{}", self.name())
      }
   }
   ```
   
   What do you think? 
   
   We can do this in a follow on PR but it would be nice to avoid the 
`convert_camel_to_upper_snake` code



##########
datafusion/core/tests/sqllogictests/test_files/window.slt:
##########
@@ -3017,6 +3017,13 @@ SELECT a, b, c,
 0 0 3 11 96 11 2 10 36 10 36 11 5 11 9
 0 0 4 9 72 9 NULL 14 45 14 45 9 4 9 9
 
+#fn aggregate order by with window frame

Review Comment:
   👍 



##########
datafusion/core/tests/sqllogictests/test_files/window.slt:
##########
@@ -3017,6 +3017,13 @@ SELECT a, b, c,
 0 0 3 11 96 11 2 10 36 10 36 11 5 11 9
 0 0 4 9 72 9 NULL 14 45 14 45 9 4 9 9
 
+#fn aggregate order by with window frame
+# In window expressions, aggregate functions should not have an ordering 
requirement, such requirements
+# should be defined in the window frame. Therefore, the query below should 
generate an error. Note that
+# PostgreSQL also behaves this way.
+statement error DataFusion error: Error during planning: Aggregate ORDER BY is 
not implemented for window functions
+SELECT SUM(inc_col ORDER BY a DESC) OVER() as last
+       FROM annotated_data_infinite2

Review Comment:
   I recommend another negative test to make sure the window function and over 
clause can not be used in the same query. Something like
   
   ```sql
   EXPLAIN SELECT a, b, LAST_VALUE(c ORDER BY a DESC) OVER (order by a) as 
last_c
   ```



##########
docs/source/user-guide/sql/aggregate_functions.md:
##########
@@ -202,6 +204,32 @@ array_agg(expression [ORDER BY expression])
 
 #### Arguments
 
+- **expression**: Expression to operate on.

Review Comment:
   👍 



##########
datafusion/core/tests/sqllogictests/test_files/groupby.slt:
##########
@@ -2007,6 +2008,58 @@ SELECT a, d,
 1 4 913
 1 2 848
 
+# test_source_sorted_groupby3
+
+query TT
+EXPLAIN SELECT a, b, FIRST_VALUE(c ORDER BY a DESC) as first_c
+  FROM annotated_data_infinite2
+  GROUP BY a, b
+----
+logical_plan
+Projection: annotated_data_infinite2.a, annotated_data_infinite2.b, 
FIRST_VALUE(annotated_data_infinite2.c) ORDER BY [annotated_data_infinite2.a 
DESC NULLS FIRST] AS first_c
+--Aggregate: groupBy=[[annotated_data_infinite2.a, 
annotated_data_infinite2.b]], aggr=[[FIRST_VALUE(annotated_data_infinite2.c) 
ORDER BY [annotated_data_infinite2.a DESC NULLS FIRST]]]
+----TableScan: annotated_data_infinite2 projection=[a, b, c]
+physical_plan
+ProjectionExec: expr=[a@0 as a, b@1 as b, 
FIRST_VALUE(annotated_data_infinite2.c) ORDER BY [annotated_data_infinite2.a 
DESC NULLS FIRST]@2 as first_c]
+--AggregateExec: mode=Single, gby=[a@0 as a, b@1 as b], 
aggr=[FIRST_VALUE(annotated_data_infinite2.c)], ordering_mode=FullyOrdered
+----CsvExec: file_groups={1 group: 
[[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a, b, 
c], infinite_source=true, output_ordering=[a@0 ASC NULLS LAST, b@1 ASC NULLS 
LAST, c@2 ASC NULLS LAST], has_header=true
+
+query III
+SELECT a, b, FIRST_VALUE(c ORDER BY a DESC) as first_c
+  FROM annotated_data_infinite2
+  GROUP BY a, b
+----
+0 0 0
+0 1 25
+1 2 50
+1 3 75
+
+# test_source_sorted_groupby4
+
+query TT
+EXPLAIN SELECT a, b, LAST_VALUE(c ORDER BY a DESC) as last_c
+  FROM annotated_data_infinite2
+  GROUP BY a, b
+----
+logical_plan
+Projection: annotated_data_infinite2.a, annotated_data_infinite2.b, 
LAST_VALUE(annotated_data_infinite2.c) ORDER BY [annotated_data_infinite2.a 
DESC NULLS FIRST] AS last_c
+--Aggregate: groupBy=[[annotated_data_infinite2.a, 
annotated_data_infinite2.b]], aggr=[[LAST_VALUE(annotated_data_infinite2.c) 
ORDER BY [annotated_data_infinite2.a DESC NULLS FIRST]]]
+----TableScan: annotated_data_infinite2 projection=[a, b, c]
+physical_plan
+ProjectionExec: expr=[a@0 as a, b@1 as b, 
LAST_VALUE(annotated_data_infinite2.c) ORDER BY [annotated_data_infinite2.a 
DESC NULLS FIRST]@2 as last_c]
+--AggregateExec: mode=Single, gby=[a@0 as a, b@1 as b], 
aggr=[LAST_VALUE(annotated_data_infinite2.c)], ordering_mode=FullyOrdered
+----CsvExec: file_groups={1 group: 
[[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a, b, 
c], infinite_source=true, output_ordering=[a@0 ASC NULLS LAST, b@1 ASC NULLS 
LAST, c@2 ASC NULLS LAST], has_header=true
+
+query III

Review Comment:
   I wonder if it is worth adding some sort of test of the behavior when there 
is no order? Like `SELECT a, b, LAST_VALUE(c) as last_c` -- mostly to ensure it 
doesn't cause an error. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6445: Add support for FIRST_VALUE, LAST_VALUE Aggregate Functions

Reply via email to