andygrove opened a new issue, #809:
URL: https://github.com/apache/datafusion-comet/issues/809

   ### What is the problem the feature request solves?
   
   Microbenchmarks show that broadcast hash joins are currently not performing well in Comet.
   
   Benchmark query (note that this does not read any decimal columns).
   
   ```sql
   select ss_sold_date_sk, ss_sold_time_sk, ss_quantity, d_year, d_moy, d_dom
   from date_dim join store_sales on d_date_sk = ss_sold_date_sk
   where d_year = 2000;
   ```
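
   For reference, a minimal Spark sketch for reproducing this query with Comet enabled is shown below. The session setup and table registration are assumptions for illustration, not the benchmark harness that produced the numbers below; the `spark.plugins` / `spark.comet.*` settings are the standard Comet plugin configuration.
   
   ```scala
   // Minimal repro sketch, not the benchmark harness used for the numbers below.
   // Assumes the TPC-DS sf=100 parquet tables are already registered as
   // date_dim and store_sales in the default catalog.
   import org.apache.spark.sql.SparkSession
   
   object BroadcastJoinRepro {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .appName("comet-broadcast-join-repro")
         // Load Comet as a Spark plugin and enable native execution.
         .config("spark.plugins", "org.apache.spark.CometPlugin")
         .config("spark.comet.enabled", "true")
         .config("spark.comet.exec.enabled", "true")
         .getOrCreate()
   
       // date_dim is small (~73k rows), so Spark's default 10 MB broadcast
       // threshold should already plan this as a broadcast hash join.
       val df = spark.sql(
         """select ss_sold_date_sk, ss_sold_time_sk, ss_quantity, d_year, d_moy, d_dom
           |from date_dim join store_sales on d_date_sk = ss_sold_date_sk
           |where d_year = 2000""".stripMargin)
   
       println(df.queryExecution.executedPlan) // verify the broadcast hash join is present
       df.collect()                            // force execution
       spark.stop()
     }
   }
   ```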
   
   Benchmark results (running against the sf=100 dataset):
   
   ```
   AMD Ryzen 9 7950X3D 16-Core Processor
   TPCDS Micro Benchmarks:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   join_inner                                          515            527          10        559.0           1.8       1.0X
   join_inner: Comet (Scan)                            751            770          17        383.4           2.6       0.7X
   join_inner: Comet (Scan, Exec)                     1089           1099          14        264.6           3.8       0.5X
   ```
   
   Native plan with metrics from one of the tasks:
   
   ```
   ProjectionExec: expr=[col_0@4 as col_0, col_1@5 as col_1, col_2@6 as col_2, col_1@1 as col_3, col_2@2 as col_4, col_3@3 as col_5], metrics=[output_rows=403302, elapsed_compute=133.629µs]
     HashJoinExec: mode=Partitioned, join_type=Inner, on=[(col_0@0, col_0@0)], metrics=[output_rows=403302, output_batches=256, build_input_batches=1, input_batches=256, input_rows=2003266, build_input_rows=366, build_mem_used=15032, build_time=51.757µs, join_time=12.999015ms]
       CopyExec, metrics=[output_rows=366, elapsed_compute=10.26µs]
         ScanExec: source=[BroadcastExchange (unknown)], schema=[col_0: Int32, col_1: Int32, col_2: Int32, col_3: Int32], metrics=[output_rows=366, elapsed_compute=711ns]
       CopyExec, metrics=[output_rows=2003266, elapsed_compute=2.142776ms]
         FilterExec: col_0@0 IS NOT NULL, metrics=[output_rows=2003266, elapsed_compute=17.330129ms]
           ScanExec: source=[CometScan parquet spark_catalog.default.store_sales (unknown)], schema=[col_0: Int32, col_1: Int32, col_2: Int32], metrics=[output_rows=2097152, elapsed_compute=10.135894ms]
   ```
   
   
   ### Describe the potential solution
   
   _No response_
   
   ### Additional context
   
   _No response_

