alamb commented on issue #258: URL: https://github.com/apache/arrow-datafusion/issues/258#issuecomment-1465171962
I agree with @waynexia that this scenario is not covered by any existing DataFusion benchmark that I know of.

ClickBench has several queries that include count distinct (see for example https://github.com/apache/arrow-datafusion/issues/5276#issuecomment-1432070491), but I am not sure if the input is dictionary encoded.

```
> CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION 'hits.parquet';
> SELECT "RegionID", SUM("AdvEngineID"), COUNT(*) AS c, AVG("ResolutionWidth"), COUNT(DISTINCT "UserID") FROM hits GROUP BY "RegionID" ORDER BY c DESC LIMIT 10;
```

However, I think with #5166 you could now create a dictionary-encoded version with a command like the following (untested, as I don't happen to have the data downloaded; the data is here: https://github.com/ClickHouse/ClickBench/tree/main#data-loading)

```sql
CREATE TABLE hits_dictionary AS
SELECT
  arrow_cast("RegionID", 'Dictionary(Int32, Utf8)') AS "RegionID",
  "ResolutionWidth",
  "UserID"
FROM hits;
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
