alamb commented on PR #8849:
URL: 
https://github.com/apache/arrow-datafusion/pull/8849#issuecomment-1890482901

   Thanks @jayzhan211 -- this looks basically on the right track. Is there any 
chance you can run some sort of benchmark on this code? My thinking is that we 
should get benchmark results showing that the idea actually improves 
performance before spending too much time polishing.
   
   I looked at ClickBench and I don't actually think there are any queries that 
do `COUNT(DISTINCT <utf8>)`.
   
   Q8 looks like it should be helped:
   
   ```sql
   SELECT COUNT(DISTINCT "SearchPhrase") FROM hits;
   ```
   
   However, I am pretty sure DataFusion rewrites this query to avoid the 
distinct, using something like `SELECT COUNT(..) GROUP BY "SearchPhrase"`.
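   
   A rough sketch of how one could check whether the rewrite kicks in, 
assuming `hits.parquet` is in the working directory of `datafusion-cli`. The 
aliases `phrase` and `grouped` are made up for illustration, and the second 
query is only my hand-written approximation of the rewritten form, not the 
exact plan the optimizer produces:
   
   ```sql
   -- Show the plan; if the rewrite applies, it should contain a grouping
   -- aggregate on "SearchPhrase" rather than a COUNT(DISTINCT ...).
   EXPLAIN SELECT COUNT(DISTINCT "SearchPhrase") FROM 'hits.parquet';

   -- Hand-written approximation of the rewritten query (illustrative only):
   -- group first to deduplicate, then count the non-null groups.
   SELECT COUNT(phrase)
   FROM (
     SELECT "SearchPhrase" AS phrase
     FROM 'hits.parquet'
     GROUP BY "SearchPhrase"
   ) AS grouped;
   ```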
   
   Maybe you could try manually running a query that can't be rewritten (throw 
in multiple `DISTINCT`s), such as
   
   ```sql
   ❯ SELECT
     COUNT(DISTINCT "SearchPhrase"),
     COUNT(DISTINCT "MobilePhone"),
     COUNT(DISTINCT "MobilePhoneModel")
   FROM 'hits.parquet';
   
   +-------------------------------------------+------------------------------------------+-----------------------------------------------+
   | COUNT(DISTINCT hits.parquet.SearchPhrase) | COUNT(DISTINCT hits.parquet.MobilePhone) | COUNT(DISTINCT hits.parquet.MobilePhoneModel) |
   +-------------------------------------------+------------------------------------------+-----------------------------------------------+
   | 6019103                                   | 44                                       | 166                                           |
   +-------------------------------------------+------------------------------------------+-----------------------------------------------+
   ```
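   
   One possible way to get numbers for the comparison, sketched under the 
assumption that the same statement is run once on `main` and once on this 
branch: `EXPLAIN ANALYZE` executes the query and annotates each operator in the 
plan with runtime metrics, so the time spent in the aggregation can be compared 
directly.
   
   ```sql
   -- Runs the query and prints the plan with per-operator runtime metrics.
   EXPLAIN ANALYZE
   SELECT
     COUNT(DISTINCT "SearchPhrase"),
     COUNT(DISTINCT "MobilePhone"),
     COUNT(DISTINCT "MobilePhoneModel")
   FROM 'hits.parquet';
   ```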
   