[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6734: Add support for order-sensitive aggregation for multipartitions

via GitHub Thu, 22 Jun 2023 08:28:35 -0700


alamb commented on code in PR #6734:
URL: https://github.com/apache/arrow-datafusion/pull/6734#discussion_r1238698423



##########
datafusion/core/tests/sqllogictests/test_files/groupby.slt:
##########
@@ -2076,18 +2076,18 @@ Projection: annotated_data_infinite2.a, 
annotated_data_infinite2.b, FIRST_VALUE(
 ----TableScan: annotated_data_infinite2 projection=[a, b, c]
 physical_plan
 ProjectionExec: expr=[a@0 as a, b@1 as b, 
FIRST_VALUE(annotated_data_infinite2.c) ORDER BY [annotated_data_infinite2.a 
DESC NULLS FIRST]@2 as first_c]
---AggregateExec: mode=Single, gby=[a@0 as a, b@1 as b], 
aggr=[FIRST_VALUE(annotated_data_infinite2.c)], ordering_mode=FullyOrdered
+--AggregateExec: mode=Single, gby=[a@0 as a, b@1 as b], 
aggr=[LAST_VALUE(annotated_data_infinite2.c)], ordering_mode=FullyOrdered
 ----CsvExec: file_groups={1 group: 
[[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a, b, 
c], infinite_source=true, output_ordering=[a@0 ASC NULLS LAST, b@1 ASC NULLS 
LAST, c@2 ASC NULLS LAST], has_header=true
 
 query III
 SELECT a, b, FIRST_VALUE(c ORDER BY a DESC) as first_c
   FROM annotated_data_infinite2
   GROUP BY a, b
 ----
-0 0 0
-0 1 25
-1 2 50
-1 3 75
+0 0 24

Review Comment:
   I was just looking at the output (not the plan) -- I see now that the `ORDER 
BY` is on `a` but the value is `c`.
   
   Since the query groups by `a, b`, every row in a group that `FIRST_VALUE` is 
evaluated on has the same value of `a`. The `ORDER BY a` therefore cannot 
distinguish rows within a group, and thus `FIRST_VALUE(c)` is effectively 
arbitrary.
   
   When I printed out the values in `annotated_data_infinite2`, it became 
clearer to me that the output of this query is "undefined" in the sense that 
any value of `c` within a group is acceptable (I wonder if this test will 
therefore be unstable 🤔). Maybe we can somehow make the query more 
representative for the future.
   
   ```
   query III
   select a, b, c from annotated_data_infinite2 order by a, b, c;
   ----
   0 0 0
   0 0 1
   0 0 2
   0 0 3
   0 0 4
   0 0 5
   0 0 6
   0 0 7
   0 0 8
   0 0 9
   0 0 10
   0 0 11
   0 0 12
   0 0 13
   0 0 14
   0 0 15
   0 0 16
   0 0 17
   0 0 18
   0 0 19
   0 0 20
   0 0 21
   0 0 22
   0 0 23
   0 0 24
   0 1 25
   0 1 26
   0 1 27
   0 1 28
   0 1 29
   0 1 30
   0 1 31
   0 1 32
   0 1 33
   0 1 34
   0 1 35
   0 1 36
   0 1 37
   0 1 38
   0 1 39
   0 1 40
   0 1 41
   0 1 42
   0 1 43
   0 1 44
   0 1 45
   0 1 46
   0 1 47
   0 1 48
   0 1 49
   1 2 50
   1 2 51
   1 2 52
   1 2 53
   1 2 54
   1 2 55
   1 2 56
   1 2 57
   1 2 58
   1 2 59
   1 2 60
   1 2 61
   1 2 62
   1 2 63
   1 2 64
   1 2 65
   1 2 66
   1 2 67
   1 2 68
   1 2 69
   1 2 70
   1 2 71
   1 2 72
   1 2 73
   1 2 74
   1 3 75
   1 3 76
   1 3 77
   1 3 78
   1 3 79
   1 3 80
   1 3 81
   1 3 82
   1 3 83
   1 3 84
   1 3 85
   1 3 86
   1 3 87
   1 3 88
   1 3 89
   1 3 90
   1 3 91
   1 3 92
   1 3 93
   1 3 94
   1 3 95
   1 3 96
   1 3 97
   1 3 98
   1 3 99
   ``` 
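
The tie described above can be reproduced outside DataFusion. The following is a minimal sketch in plain Python, not DataFusion's implementation: the helper `first_value` and the four-row sample are hypothetical, and the sample is only shaped like the first two groups of `annotated_data_infinite2`. It assumes a stable sort, so rows with equal sort keys keep their input order.

```python
from collections import defaultdict


def first_value(rows, order_key, descending=False):
    """Return, per (a, b) group, the c of the first row after sorting by order_key.

    Hypothetical helper for illustration only; it models
    FIRST_VALUE(c ORDER BY ...) with a stable sort.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["a"], row["b"])].append(row)

    result = {}
    for key, members in groups.items():
        # sorted() is stable, also with reverse=True: tied rows keep input order.
        ordered = sorted(members, key=order_key, reverse=descending)
        result[key] = ordered[0]["c"]
    return result


# Hypothetical sample: two groups, each with its smallest and largest c.
rows = [
    {"a": 0, "b": 0, "c": 0},
    {"a": 0, "b": 0, "c": 24},
    {"a": 0, "b": 1, "c": 25},
    {"a": 0, "b": 1, "c": 49},
]
reversed_rows = list(reversed(rows))

# ORDER BY a DESC: every row in a group ties, so input order decides the result.
by_a_forward = first_value(rows, lambda r: r["a"], descending=True)
by_a_reversed = first_value(reversed_rows, lambda r: r["a"], descending=True)

# ORDER BY c DESC: c is unique within each group, so input order is irrelevant.
by_c_forward = first_value(rows, lambda r: r["c"], descending=True)
by_c_reversed = first_value(reversed_rows, lambda r: r["c"], descending=True)

print(by_a_forward, by_a_reversed)
print(by_c_forward, by_c_reversed)
```

In this sketch, ordering by `a` yields different answers for the two input orders, while ordering by `c` yields the same answer for both. This suggests that a test query ordering by a column that varies within each group would be less sensitive to how the engine processes its input.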



Review Comment:
   (this change now makes sense to me)


