zhuqi-lucas commented on code in PR #16943:
URL: https://github.com/apache/datafusion/pull/16943#discussion_r2235948057
##########
datafusion/sqllogictest/test_files/window.slt:
##########
@@ -5715,17 +5715,82 @@ EXPLAIN SELECT
RANGE BETWEEN INTERVAL '2 minutes' PRECEDING AND CURRENT ROW
) AS distinct_count
FROM table_test_distinct_count
-ODER BY k, time;
+ORDER BY k, time;
----
logical_plan
-01)Projection: oder.k, oder.time, count(oder.v) PARTITION BY [oder.k] ORDER BY [oder.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW AS normal_count, count(DISTINCT oder.v) PARTITION BY [oder.k] ORDER BY [oder.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW AS distinct_count
-02)--WindowAggr: windowExpr=[[count(oder.v) PARTITION BY [oder.k] ORDER BY [oder.time ASC NULLS LAST] RANGE BETWEEN IntervalMonthDayNano { months: 0, days: 0, nanoseconds: 120000000000 } PRECEDING AND CURRENT ROW AS count(oder.v) PARTITION BY [oder.k] ORDER BY [oder.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW, count(DISTINCT oder.v) PARTITION BY [oder.k] ORDER BY [oder.time ASC NULLS LAST] RANGE BETWEEN IntervalMonthDayNano { months: 0, days: 0, nanoseconds: 120000000000 } PRECEDING AND CURRENT ROW AS count(DISTINCT oder.v) PARTITION BY [oder.k] ORDER BY [oder.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW]]
-03)----SubqueryAlias: oder
+01)Sort: table_test_distinct_count.k ASC NULLS LAST, table_test_distinct_count.time ASC NULLS LAST
+02)--Projection: table_test_distinct_count.k, table_test_distinct_count.time, count(table_test_distinct_count.v) PARTITION BY [table_test_distinct_count.k] ORDER BY [table_test_distinct_count.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW AS normal_count, count(DISTINCT table_test_distinct_count.v) PARTITION BY [table_test_distinct_count.k] ORDER BY [table_test_distinct_count.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW AS distinct_count
+03)----WindowAggr: windowExpr=[[count(table_test_distinct_count.v) PARTITION BY [table_test_distinct_count.k] ORDER BY [table_test_distinct_count.time ASC NULLS LAST] RANGE BETWEEN IntervalMonthDayNano { months: 0, days: 0, nanoseconds: 120000000000 } PRECEDING AND CURRENT ROW AS count(table_test_distinct_count.v) PARTITION BY [table_test_distinct_count.k] ORDER BY [table_test_distinct_count.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW, count(DISTINCT table_test_distinct_count.v) PARTITION BY [table_test_distinct_count.k] ORDER BY [table_test_distinct_count.time ASC NULLS LAST] RANGE BETWEEN IntervalMonthDayNano { months: 0, days: 0, nanoseconds: 120000000000 } PRECEDING AND CURRENT ROW AS count(DISTINCT table_test_distinct_count.v) PARTITION BY [table_test_distinct_count.k] ORDER BY [table_test_distinct_count.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW]]
04)------TableScan: table_test_distinct_count projection=[k, v, time]
physical_plan
-01)ProjectionExec: expr=[k@0 as k, time@2 as time, count(oder.v) PARTITION BY [oder.k] ORDER BY [oder.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW@3 as normal_count, count(DISTINCT oder.v) PARTITION BY [oder.k] ORDER BY [oder.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW@4 as distinct_count]
-02)--BoundedWindowAggExec: wdw=[count(oder.v) PARTITION BY [oder.k] ORDER BY [oder.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW: Field { name: "count(oder.v) PARTITION BY [oder.k] ORDER BY [oder.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, frame: RANGE BETWEEN IntervalMonthDayNano { months: 0, days: 0, nanoseconds: 120000000000 } PRECEDING AND CURRENT ROW, count(DISTINCT oder.v) PARTITION BY [oder.k] ORDER BY [oder.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW: Field { name: "count(DISTINCT oder.v) PARTITION BY [oder.k] ORDER BY [oder.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, frame: RANGE BETWEEN IntervalMonthDayNano { months: 0, days: 0, nanoseconds: 120000000000 } PRECEDING AND CURRENT ROW], mode=[Sorted]
-03)----SortExec: expr=[k@0 ASC NULLS LAST, time@2 ASC NULLS LAST], preserve_partitioning=[true]
-04)------CoalesceBatchesExec: target_batch_size=1
-05)--------RepartitionExec: partitioning=Hash([k@0], 2), input_partitions=2
-06)----------DataSourceExec: partitions=2, partition_sizes=[5, 4]
+01)SortPreservingMergeExec: [k@0 ASC NULLS LAST, time@1 ASC NULLS LAST]
+02)--ProjectionExec: expr=[k@0 as k, time@2 as time, count(table_test_distinct_count.v) PARTITION BY [table_test_distinct_count.k] ORDER BY [table_test_distinct_count.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW@3 as normal_count, count(DISTINCT table_test_distinct_count.v) PARTITION BY [table_test_distinct_count.k] ORDER BY [table_test_distinct_count.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW@4 as distinct_count]
+03)----BoundedWindowAggExec: wdw=[count(table_test_distinct_count.v) PARTITION BY [table_test_distinct_count.k] ORDER BY [table_test_distinct_count.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW: Field { name: "count(table_test_distinct_count.v) PARTITION BY [table_test_distinct_count.k] ORDER BY [table_test_distinct_count.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, frame: RANGE BETWEEN IntervalMonthDayNano { months: 0, days: 0, nanoseconds: 120000000000 } PRECEDING AND CURRENT ROW, count(DISTINCT table_test_distinct_count.v) PARTITION BY [table_test_distinct_count.k] ORDER BY [table_test_distinct_count.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW: Field { name: "count(DISTINCT table_test_distinct_count.v) PARTITION BY [table_test_distinct_count.k] ORDER BY [table_test_distinct_count.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, frame: RANGE BETWEEN IntervalMonthDayNano { months: 0, days: 0, nanoseconds: 120000000000 } PRECEDING AND CURRENT ROW], mode=[Sorted]
+04)------SortExec: expr=[k@0 ASC NULLS LAST, time@2 ASC NULLS LAST], preserve_partitioning=[true]
+05)--------CoalesceBatchesExec: target_batch_size=1
+06)----------RepartitionExec: partitioning=Hash([k@0], 2), input_partitions=2
+07)------------DataSourceExec: partitions=2, partition_sizes=[5, 4]
+
+
+# Add testing for distinct sum
Review Comment:
This is the corresponding slt test coverage for this PR.
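
For readers following the diff, here is a minimal sketch of what the distinct-sum counterpart flagged by the `# Add testing for distinct sum` comment could look like, reusing the existing `table_test_distinct_count` table (columns k, v, time) from this file. This is an assumed illustration, not necessarily the exact statements added in the PR:

-- Hypothetical sketch only; the exact slt test in the PR may differ.
-- Mirrors the distinct-count query above, swapping count for sum over the
-- same per-key 2-minute sliding RANGE frame.
SELECT
  k,
  time,
  sum(v) OVER (
    PARTITION BY k
    ORDER BY time
    RANGE BETWEEN INTERVAL '2 minutes' PRECEDING AND CURRENT ROW
  ) AS normal_sum,
  sum(DISTINCT v) OVER (
    PARTITION BY k
    ORDER BY time
    RANGE BETWEEN INTERVAL '2 minutes' PRECEDING AND CURRENT ROW
  ) AS distinct_sum
FROM table_test_distinct_count
ORDER BY k, time;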