alamb commented on PR #8849:
URL:
https://github.com/apache/arrow-datafusion/pull/8849#issuecomment-1890482901
Thanks @jayzhan211 -- looks basically on the right track. Is there any
chance you can run some sort of benchmark on this code? My thinking is that we
should get benchmark results showing that the idea actually improves
performance before spending too much time polishing.
I looked at ClickBench and I don't actually think there are any queries that
end up doing `COUNT(DISTINCT <utf8>)`.
Q8 looks like it should be helped:
```sql
SELECT COUNT(DISTINCT "SearchPhrase") FROM hits;
```
However, I am pretty sure DataFusion rewrites this query to avoid the
distinct, using `SELECT COUNT(..) GROUP BY "SearchPhrase"`.
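For reference, here is a rough sketch of what I'd expect the rewritten form to look like. This assumes the rewrite comes from the single-distinct-to-group-by optimizer rule (`SingleDistinctToGroupBy`, if I remember the name right); the `alias1` and `deduplicated` names are only illustrative, and the actual plan may differ in detail:

```sql
-- Inner query: deduplicate the values with a plain GROUP BY.
-- Outer query: count the resulting groups; COUNT(alias1) skips the NULL
-- group, matching COUNT(DISTINCT ...) semantics.
SELECT COUNT(alias1)
FROM (
  SELECT "SearchPhrase" AS alias1
  FROM hits
  GROUP BY "SearchPhrase"
) AS deduplicated;
```

If the plan really looks like this, the distinct count accumulator is never invoked for Q8, so that query would not show any effect from this PR.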
Maybe you could try manually running a query that can't be rewritten (throw
in multiple `DISTINCT`s), such as:
```sql
❯ SELECT
COUNT(DISTINCT "SearchPhrase"),
COUNT(DISTINCT "MobilePhone"),
COUNT(DISTINCT "MobilePhoneModel")
FROM 'hits.parquet';
+-------------------------------------------+------------------------------------------+-----------------------------------------------+
| COUNT(DISTINCT hits.parquet.SearchPhrase) | COUNT(DISTINCT hits.parquet.MobilePhone) | COUNT(DISTINCT hits.parquet.MobilePhoneModel) |
+-------------------------------------------+------------------------------------------+-----------------------------------------------+
| 6019103                                   | 44                                       | 166                                           |
+-------------------------------------------+------------------------------------------+-----------------------------------------------+
```