alamb commented on issue #258:
URL: 
https://github.com/apache/arrow-datafusion/issues/258#issuecomment-1465171962

   I agree with @waynexia that this scenario is not covered by any existing 
DataFusion benchmark that I know of.
   
   ClickBench has several queries that include `COUNT(DISTINCT ...)` (see for 
example 
https://github.com/apache/arrow-datafusion/issues/5276#issuecomment-1432070491), 
but I am not sure whether the input is dictionary encoded.
   
   ```
   > CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION 'hits.parquet';
   
   > SELECT "RegionID", SUM("AdvEngineID"), COUNT(*) AS c, 
AVG("ResolutionWidth"), COUNT(DISTINCT "UserID") FROM hits GROUP BY "RegionID" 
ORDER BY c DESC LIMIT 10;
   ```
   
   However, I think with #5166 you could now create a dictionary-encoded 
version with a command like the following (untested, as I don't happen to have 
the data downloaded; the data is here: 
https://github.com/ClickHouse/ClickBench/tree/main#data-loading)
   
   ```sql
   CREATE TABLE hits_dictionary AS 
   SELECT 
     arrow_cast("RegionID", 'Dictionary(Int32, Utf8)') AS "RegionID",
     "ResolutionWidth",
     "UserID"
   FROM hits;
   ```
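   
   A possible follow-up, equally untested, would be to confirm the cast took 
effect and then rerun a reduced form of the query above against the new table. 
This is only a sketch: it assumes `hits_dictionary` was created as shown, and 
it drops `"AdvEngineID"` because that column was not selected into the new 
table.
   
   ```sql
   -- Check the column type; the expectation (not verified here) is a
   -- Dictionary(Int32, Utf8) type for "RegionID".
   SELECT arrow_typeof("RegionID") FROM hits_dictionary LIMIT 1;
   
   -- Same shape as the ClickBench query, limited to the columns that
   -- exist in hits_dictionary.
   SELECT "RegionID", COUNT(*) AS c, AVG("ResolutionWidth"),
          COUNT(DISTINCT "UserID")
   FROM hits_dictionary
   GROUP BY "RegionID"
   ORDER BY c DESC
   LIMIT 10;
   ```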

