alamb commented on issue #9586:
URL:
https://github.com/apache/arrow-datafusion/issues/9586#issuecomment-2000375649
Here is a self contained reproducer with just datafusion-cli
Step 1: Create data
```sql
------
-- Create a table using UNION ALL to get 2 partitions (very important)
------
create table test as
select * from (values('foo', 'bar', 1))
UNION ALL
select * from (values('foo', 'baz', 1));
------
-- Now, create a table with the same data, but column2 has type
`Dictionary(Int32)` to trigger the fallback code
-----
create table test_dict as
select
column1,
arrow_cast(column2, 'Dictionary(Int32, Utf8)') as "column2",
column3
from test;
```
Now, we expect
```sql
-- there are two distinct values of column 2: 'bar' and 'baz'
select * from test_dict;
+---------+---------+---------+
| column1 | column2 | column3 |
+---------+---------+---------+
| foo | bar | 1 |
| foo | baz | 1 |
+---------+---------+---------+
2 rows in set. Query took 0.000 seconds.
-- but this query says there is only 1 distinct value in column2
select
count(distinct column1),
count(distinct column2)
FROM test_dict
GROUP BY column3;
+-----------------------------------+-----------------------------------+
| COUNT(DISTINCT test_dict.column1) | COUNT(DISTINCT test_dict.column2) |
+-----------------------------------+-----------------------------------+
| 1 | 1 |
+-----------------------------------+-----------------------------------+
1 row in set. Query took 0.002 seconds.
```
Running the same query against the non dictionary encoded column in `test`
produces the correct result:
```sql
select
count(distinct column1),
count(distinct column2)
FROM test_dict
GROUP BY column3;
+------------------------------+------------------------------+
| COUNT(DISTINCT test.column1) | COUNT(DISTINCT test.column2) |
+------------------------------+------------------------------+
| 1 | 2 |
+------------------------------+------------------------------+
1 row in set. Query took 0.001 seconds.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]