alamb commented on issue #18411:
URL: https://github.com/apache/datafusion/issues/18411#issuecomment-3675893106

   Update: When I tested locally with a combination of the following two PRs:
   - https://github.com/apache/datafusion/pull/19413
   - https://github.com/apache/datafusion/pull/19374
   
   DataFusion (will be 52) is at ~898ms mean (883ms best run)
   ```shell
   hyperfine --warmup 3 " ./datafusion-cli-the-works   -c \"select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;\" "
   
   Benchmark 1:  ./datafusion-cli-the-works   -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"
     Time (mean ± σ):     897.6 ms ±  17.7 ms    [User: 10386.4 ms, System: 642.5 ms]
     Range (min … max):   883.7 ms … 942.0 ms    10 runs
   ```
   
   DuckDB is at ~795ms mean (808ms worst run)
   ```shell
   hyperfine --warmup 3 "duckdb -c \"select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;\" "
   
   Benchmark 1: duckdb -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"
     Time (mean ± σ):     794.6 ms ±   6.7 ms    [User: 9802.8 ms, System: 470.0 ms]
     Range (min … max):   787.3 ms … 808.1 ms    10 runs
   ```
   
   I think to get it much faster we would have to really optimize for short 
strings, likely similar to @rluvaton's suggestion:
   - For small number of groups we can also use a faster hash map that is 
optimized for small number of keys
   
   In this case, I think we could keep a hash table of `u128`s (entirely 
inlined short strings) and then fall back to a hash table with buffers 
if/when we see long strings
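   
   A minimal sketch of that idea, not DataFusion code. Everything here is 
hypothetical: `pack_inline`, `GroupCounts`, and the 12-byte inline limit (4 
bytes of length plus 12 bytes of data, which I am assuming mirrors the Arrow 
string view layout). It also routes each key individually rather than switching 
tables once a long string shows up, and uses the standard `HashMap` as a 
stand-in for both tables:
   
   ```rust
   use std::collections::HashMap;
   
   /// Assumed inline limit: 16 bytes total, 4 of them reserved for the length.
   const MAX_INLINE_LEN: usize = 12;
   
   /// Pack a short string into a u128 (length in the first 4 bytes, data after).
   /// Returns None when the string is too long to inline.
   fn pack_inline(s: &str) -> Option<u128> {
       let bytes = s.as_bytes();
       if bytes.len() > MAX_INLINE_LEN {
           return None;
       }
       let mut buf = [0u8; 16];
       buf[..4].copy_from_slice(&(bytes.len() as u32).to_le_bytes());
       buf[4..4 + bytes.len()].copy_from_slice(bytes);
       Some(u128::from_le_bytes(buf))
   }
   
   /// Hypothetical group counter: short keys live in a u128-keyed table,
   /// long keys fall back to a table that owns string buffers.
   #[derive(Default)]
   struct GroupCounts {
       inline: HashMap<u128, u64>,
       spilled: HashMap<String, u64>,
   }
   
   impl GroupCounts {
       /// Count one occurrence of `key`, allocating only for long keys.
       fn add(&mut self, key: &str) {
           match pack_inline(key) {
               Some(packed) => *self.inline.entry(packed).or_insert(0) += 1,
               None => *self.spilled.entry(key.to_owned()).or_insert(0) += 1,
           }
       }
   
       /// Look up the current count for `key` (0 if never seen).
       fn count(&self, key: &str) -> u64 {
           match pack_inline(key) {
               Some(packed) => self.inline.get(&packed).copied().unwrap_or(0),
               None => self.spilled.get(key).copied().unwrap_or(0),
           }
       }
   }
   ```
   
   Storing the length alongside the zero-padded bytes keeps keys such as `"a"` 
and `"a\0"` distinct, so comparing the `u128` values is enough and the short 
path never touches a separate buffer.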
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

