Re: [I] Wrong result when there are 2 count(distinct) [arrow-datafusion]

via GitHub Fri, 15 Mar 2024 13:09:47 -0700


alamb commented on issue #9586:
URL: 
https://github.com/apache/arrow-datafusion/issues/9586#issuecomment-2000375649


   Here is a self contained reproducer with just datafusion-cli
   
   Step 1: Create data
   ```sql
   ------
   -- Create a table using UNION ALL to get 2 partitions (very important)
   ------
   create table test as
       select * from (values('foo', 'bar', 1))
       UNION ALL
       select * from (values('foo', 'baz', 1));
   
   ------
   -- Now, create a table with the same data, but column2 has type 
`Dictionary(Int32)` to trigger the fallback code
   -----
   create table test_dict as
     select
       column1,
       arrow_cast(column2, 'Dictionary(Int32, Utf8)') as "column2",
       column3
   from test;
   ```
   
   Now, we expect 
   
   ```sql
   -- there are two distinct values of column 2: 'bar' and 'baz'
   
   select * from test_dict;
   +---------+---------+---------+
   | column1 | column2 | column3 |
   +---------+---------+---------+
   | foo     | bar     | 1       |
   | foo     | baz     | 1       |
   +---------+---------+---------+
   2 rows in set. Query took 0.000 seconds.
   
   
   -- but this query says there is only 1 distinct value in column2
   select
     count(distinct column1),
     count(distinct column2)
   FROM test_dict
   GROUP BY column3;
   
   +-----------------------------------+-----------------------------------+
   | COUNT(DISTINCT test_dict.column1) | COUNT(DISTINCT test_dict.column2) |
   +-----------------------------------+-----------------------------------+
   | 1                                 | 1                                 |
   +-----------------------------------+-----------------------------------+
   1 row in set. Query took 0.002 seconds.
   ```
   
   
   
   Running the same query against the non dictionary encoded column in `test` 
produces the correct result:
   
   ```sql
   select
     count(distinct column1),
     count(distinct column2)
   FROM test_dict
   GROUP BY column3;
   
   +------------------------------+------------------------------+
   | COUNT(DISTINCT test.column1) | COUNT(DISTINCT test.column2) |
   +------------------------------+------------------------------+
   | 1                            | 2                            |
   +------------------------------+------------------------------+
   1 row in set. Query took 0.001 seconds.
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [I] Wrong result when there are 2 count(distinct) [arrow-datafusion]

Reply via email to