NGA-TRAN opened a new issue, #9586:
URL: https://github.com/apache/arrow-datafusion/issues/9586

   ### Describe the bug
   
   Wrong result when there are 2 count distinct in the select clause, 
   
   ### To Reproduce
   
   I will share there file with @alamb because I cannot attached .parquet file 
here
   
   ### Bug
   
   ```SQL
   -- BUG
   SELECT  COUNT(DISTINCT host) AS servers_count, count(distinct pool) as 
pool_count, server_role, os, env, datacenter from '/tmp/file.parquet' WHERE 
time >= '2024-02-25T00:00:00Z' and time < '2024-02-25T00:00:01Z' and 
server_role = 'mesg' GROUP BY server_role, os, env, datacenter;
   
+---------------+------------+-------------+---------+------------+------------+
   | servers_count | pool_count | server_role | os      | env        | 
datacenter |
   
+---------------+------------+-------------+---------+------------+------------+
   | 1             | 1          | mesg        | windows | production | mn       
  |
   | 2             | 2          | mesg        | windows | production | va       
  |  -- should have 4 servers_count and 3 pool_count
   
+---------------+------------+-------------+---------+------------+------------+
   ```
   
   ### Here are other queries running in DF CLI that tell me the right results
   
   ```SQL
   
   SELECT DISTINCT host, server_role, os, env, datacenter from 
'/tmp/file.parquet' WHERE time >= '2024-02-25T00:00:00Z' and time < 
'2024-02-25T00:00:01Z' and server_role = 'mesg' order by datacenter;
   +---------------+-------------+---------+------------+------------+
   | host          | server_role | os      | env        | datacenter |
   +---------------+-------------+---------+------------+------------+
   | mpm-mesg1002c | mesg        | windows | production | mn         |
   | vpm-mesg1001c | mesg        | windows | production | va         |
   | vpm-mesg1005b | mesg        | windows | production | va         |
   | vpm-mesg1008d | mesg        | windows | production | va         |
   | vpm-mesg1007b | mesg        | windows | production | va         |
   +---------------+-------------+---------+------------+------------+
   
   SELECT DISTINCT pool, server_role, os, env, datacenter from 
'/tmp/file.parquet' WHERE time >= '2024-02-25T00:00:00Z' and time < 
'2024-02-25T00:00:01Z' and server_role = 'mesg' order by datacenter;
   +------+-------------+---------+------------+------------+
   | pool | server_role | os      | env        | datacenter |
   +------+-------------+---------+------------+------------+
   | c    | mesg        | windows | production | mn         |
   | b    | mesg        | windows | production | va         |
   | c    | mesg        | windows | production | va         |
   | d    | mesg        | windows | production | va         |
   +------+-------------+---------+------------+------------+
   
   SELECT  COUNT(DISTINCT host) AS servers_count, "server_role", "os", "env", 
"datacenter" from '/tmp/file.parquet' WHERE time >= '2024-02-25T00:00:00Z' and 
time < '2024-02-25T00:00:01Z' and server_role = 'mesg' GROUP BY "server_role", 
"os", "env", "datacenter";
   +---------------+-------------+---------+------------+------------+
   | servers_count | server_role | os      | env        | datacenter |
   +---------------+-------------+---------+------------+------------+
   | 4             | mesg        | windows | production | va         |
   | 1             | mesg        | windows | production | mn         |
   +---------------+-------------+---------+------------+------------+
   
   SELECT  count(distinct pool) as pool_count, "server_role", "os", "env", 
"datacenter" from '/tmp/file.parquet' WHERE time >= '2024-02-25T00:00:00Z' and 
time < '2024-02-25T00:00:01Z' and server_role = 'mesg' GROUP BY "server_role", 
"os", "env", "datacenter";
   +------------+-------------+---------+------------+------------+
   | pool_count | server_role | os      | env        | datacenter |
   +------------+-------------+---------+------------+------------+
   | 1          | mesg        | windows | production | mn         |
   | 3          | mesg        | windows | production | va         |
   +------------+-------------+---------+------------+------------+
   ```
   
   
   
   
   ### Expected behavior
   
   ```SQL
   SELECT  COUNT(DISTINCT host) AS servers_count, count(distinct pool) as 
pool_count, server_role, os, env, datacenter from '/tmp/file.parquet' WHERE 
time >= '2024-02-25T00:00:00Z' and time < '2024-02-25T00:00:01Z' and 
server_role = 'mesg' GROUP BY server_role, os, env, datacenter;
   
+---------------+------------+-------------+---------+------------+------------+
   | servers_count | pool_count | server_role | os      | env        | 
datacenter |
   
+---------------+------------+-------------+---------+------------+------------+
   | 1             | 1          | mesg        | windows | production | mn       
  |
   | 4             | 3          | mesg        | windows | production | va       
  |  
   
+---------------+------------+-------------+---------+------------+------------+
   ```
   ```
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to