alamb commented on issue #18411: URL: https://github.com/apache/datafusion/issues/18411#issuecomment-3675893106
Update: I tested locally with a combination of the following two PRs:
- https://github.com/apache/datafusion/pull/19413
- https://github.com/apache/datafusion/pull/19374

DataFusion (what will be release 52) comes in at about 898 ms (mean; best run 883.7 ms):

```shell
hyperfine --warmup 3 " ./datafusion-cli-the-works -c \"select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;\" "
Benchmark 1:  ./datafusion-cli-the-works -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"
  Time (mean ± σ):     897.6 ms ±  17.7 ms    [User: 10386.4 ms, System: 642.5 ms]
  Range (min … max):   883.7 ms … 942.0 ms    10 runs
```

DuckDB comes in at about 795 ms (mean; worst run 808.1 ms):

```shell
hyperfine --warmup 3 "duckdb -c \"select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;\" "
Benchmark 1: duckdb -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"
  Time (mean ± σ):     794.6 ms ±   6.7 ms    [User: 9802.8 ms, System: 470.0 ms]
  Range (min … max):   787.3 ms … 808.1 ms    10 runs
```

I think that to get it much faster we would have to really optimize for short strings, likely along the lines of @rluvaton's suggestion:

- For small number of groups we can also use a faster hash map that is optimized for small number of keys

In this case, I think we could keep a hash table of `u128`s (entirely inlined short strings) and then fall back to a hash table with buffers if/when we see long strings.
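To make the two-tier idea concrete (a hash table keyed by `u128` for inlined short strings, with a fallback table for long strings), here is a minimal sketch. It is not DataFusion code: `inline_key` and `GroupCounts` are hypothetical names, the 15-byte limit and the length-in-the-last-byte layout are assumptions chosen for illustration, and the standard library `HashMap` stands in for whatever faster small-key map would actually be used.

```rust
use std::collections::HashMap;

/// Hypothetical helper: pack a byte string of at most 15 bytes into a `u128`.
/// Bytes 0..15 hold the string (zero padded) and byte 15 holds its length,
/// so "a" and "a\0" produce different keys. Longer strings return `None`.
fn inline_key(s: &[u8]) -> Option<u128> {
    if s.len() > 15 {
        return None;
    }
    let mut buf = [0u8; 16];
    buf[..s.len()].copy_from_slice(s);
    buf[15] = s.len() as u8;
    Some(u128::from_le_bytes(buf))
}

/// Hypothetical two-tier `count(*)` accumulator for a single string group key.
#[derive(Default)]
struct GroupCounts {
    /// Fast path: short strings stored entirely inside the integer key.
    inlined: HashMap<u128, u64>,
    /// Fallback: long strings keyed by an owned byte buffer.
    spilled: HashMap<Vec<u8>, u64>,
}

impl GroupCounts {
    /// Record one row for `key`.
    fn add(&mut self, key: &[u8]) {
        match inline_key(key) {
            Some(k) => *self.inlined.entry(k).or_insert(0) += 1,
            None => *self.spilled.entry(key.to_vec()).or_insert(0) += 1,
        }
    }

    /// Number of rows recorded for `key` (0 if the key was never seen).
    fn count(&self, key: &[u8]) -> u64 {
        match inline_key(key) {
            Some(k) => self.inlined.get(&k).copied().unwrap_or(0),
            None => self.spilled.get(key).copied().unwrap_or(0),
        }
    }
}
```

With single-character keys such as `l_returnflag` and `l_linestatus`, every row would take the `inlined` path, so hashing and equality reduce to integer operations. A real implementation would also need to handle multiple group columns (for example by packing both values into one key) and nulls, which this sketch leaves out.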
