crepererum opened a new issue, #16887:
URL: https://github.com/apache/datafusion/issues/16887

   ### Describe the bug
   
   It seems that `COUNT(DISTINCT c) OVER ...` is treated as `COUNT(c) OVER 
...`, i.e. an ordinary count.
   
   ### To Reproduce
   
   Use the following sqllogictest:
   
   ```text
   statement ok
   CREATE TABLE table_test_distinct_count (
       k VARCHAR,
       v Int,
       time TIMESTAMP WITH TIME ZONE
   );
   
   statement ok
   INSERT INTO table_test_distinct_count (k, v, time) VALUES
       ('a', 1, '1970-01-01T00:01:00.00Z'),
       ('a', 1, '1970-01-01T00:02:00.00Z'),
       ('a', 1, '1970-01-01T00:03:00.00Z'),
       ('a', 2, '1970-01-01T00:03:00.00Z'),
       ('a', 1, '1970-01-01T00:04:00.00Z'),
       ('b', 3, '1970-01-01T00:01:00.00Z'),
       ('b', 3, '1970-01-01T00:02:00.00Z'),
       ('b', 4, '1970-01-01T00:03:00.00Z'),
       ('b', 4, '1970-01-01T00:03:00.00Z');
   
   query TPII
   SELECT
       k,
       time,
       COUNT(v) OVER (
           PARTITION BY k
           ORDER BY time
           RANGE BETWEEN INTERVAL '2 minutes' PRECEDING AND CURRENT ROW
       ) AS normal_count,
       COUNT(DISTINCT v) OVER (
           PARTITION BY k
           ORDER BY time
           RANGE BETWEEN INTERVAL '2 minutes' PRECEDING AND CURRENT ROW
       ) AS distinct_count
   FROM table_test_distinct_count
   ODER BY k, time;
   ----
   a 1970-01-01T00:01:00Z 1 1
   a 1970-01-01T00:02:00Z 2 2
   a 1970-01-01T00:03:00Z 4 4
   a 1970-01-01T00:03:00Z 4 4
   a 1970-01-01T00:04:00Z 4 4
   b 1970-01-01T00:01:00Z 1 1
   b 1970-01-01T00:02:00Z 2 2
   b 1970-01-01T00:03:00Z 4 4
   b 1970-01-01T00:03:00Z 4 4
   
   query TT
   EXPLAIN SELECT
       k,
       time,
       COUNT(v) OVER (
           PARTITION BY k
           ORDER BY time
           RANGE BETWEEN INTERVAL '2 minutes' PRECEDING AND CURRENT ROW
       ) AS normal_count,
       COUNT(DISTINCT v) OVER (
           PARTITION BY k
           ORDER BY time
           RANGE BETWEEN INTERVAL '2 minutes' PRECEDING AND CURRENT ROW
       ) AS distinct_count
   FROM table_test_distinct_count
   ODER BY k, time;
   ----
   logical_plan
   01)Projection: oder.k, oder.time, count(oder.v) PARTITION BY [oder.k] ORDER 
BY [oder.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW 
AS normal_count, count(oder.v) PARTITION BY [oder.k] ORDER BY [oder.time ASC 
NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW AS distinct_count
   02)--WindowAggr: windowExpr=[[count(oder.v) PARTITION BY [oder.k] ORDER BY 
[oder.time ASC NULLS LAST] RANGE BETWEEN IntervalMonthDayNano { months: 0, 
days: 0, nanoseconds: 120000000000 } PRECEDING AND CURRENT ROW AS count(oder.v) 
PARTITION BY [oder.k] ORDER BY [oder.time ASC NULLS LAST] RANGE BETWEEN 2 
minutes PRECEDING AND CURRENT ROW]]
   03)----SubqueryAlias: oder
   04)------TableScan: table_test_distinct_count projection=[k, v, time]
   physical_plan
   01)ProjectionExec: expr=[k@0 as k, time@2 as time, count(oder.v) PARTITION 
BY [oder.k] ORDER BY [oder.time ASC NULLS LAST] RANGE BETWEEN 2 minutes 
PRECEDING AND CURRENT ROW@3 as normal_count, count(oder.v) PARTITION BY 
[oder.k] ORDER BY [oder.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING 
AND CURRENT ROW@3 as distinct_count]
   02)--BoundedWindowAggExec: wdw=[count(oder.v) PARTITION BY [oder.k] ORDER BY 
[oder.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW: 
Field { name: "count(oder.v) PARTITION BY [oder.k] ORDER BY [oder.time ASC 
NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW", data_type: 
Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 
frame: RANGE BETWEEN IntervalMonthDayNano { months: 0, days: 0, nanoseconds: 
120000000000 } PRECEDING AND CURRENT ROW], mode=[Sorted]
   03)----SortExec: expr=[k@0 ASC NULLS LAST, time@2 ASC NULLS LAST], 
preserve_partitioning=[false]
   04)------DataSourceExec: partitions=1, partition_sizes=[1]
   ```
   
   ### Expected behavior
   
   `normal_count` and `distinct_count` should clearly not be the same.
   
   ### Additional context
   
   This reproduces on 3d4fdf24ec799b1a0d36f55bde3f3a527aabe327 .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to