alamb commented on code in PR #9679:
URL: https://github.com/apache/arrow-datafusion/pull/9679#discussion_r1530372371
##########
datafusion/sqllogictest/test_files/dictionary.slt:
##########
@@ -280,3 +280,70 @@ ORDER BY
2023-12-20T01:20:00 1000 f2 foo
2023-12-20T01:30:00 1000 f1 32.0
2023-12-20T01:30:00 1000 f2 foo
+
+# Cleanup
+statement error DataFusion error: Execution error: Table 'm1' doesn't exist\.
+drop table m1;
+
+statement error DataFusion error: Execution error: Table 'm2' doesn't exist\.
+drop table m2;
+
+######
+# Create a table using UNION ALL to get 2 partitions (very important)
+######
+statement ok
+create table m3_source as
+ select * from (values('foo', 'bar', 1))
+ UNION ALL
+ select * from (values('foo', 'baz', 1));
+
+######
+# Now, create a table with the same data, but column2 has type
`Dictionary(Int32)` to trigger the fallback code
Review Comment:
> why does the cast to the dictionary trigger the fallback code?
The reason the dictionary triggers `merge` is that when grouping on strings
or primitive values, the `DistinctCountAccumulator` code path is not used.
Instead one of the specialized implementations (like
`BytesDistinctCountAccumulator`) is used instead, which use the
`GroupsAccumulator` interface.
Dictionary encoded columns run this path `DistinctCountAccumulator`
https://github.com/apache/arrow-datafusion/blob/b0b329ba39403b9e87156d6f9b8c5464dc6d2480/datafusion/physical-expr/src/aggregate/count_distinct/mod.rs#L160-L163
> specifically, why the key of dict is the sub group index after casting? 🤔
What is happening is that we are doing a two phase groupby (illustated here)
https://github.com/apache/arrow-datafusion/blob/b0b329ba39403b9e87156d6f9b8c5464dc6d2480/datafusion/expr/src/accumulator.rs#L99-L131
And so there are two different `Partial` group bys happening. Each
PartialGroupBy produces a a set of distinct values. Using your example, I think
it would be more like the following (where we have the same group in multiple
partial results):
group 1 (partial): "a", "b",
group 1 (partial): "c"
The merge is called to combine the results together with a two element array
```
("a, "b")
("c")
```
But I may be misunderstanding your question
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]