[I] Proposal: Port most microbenchmarks to PySpark [datafusion-comet]

via GitHub Sat, 07 Feb 2026 11:59:42 -0800


andygrove opened a new issue, #3440:
URL: https://github.com/apache/datafusion-comet/issues/3440


   ### What is the problem the feature request solves?
   
   We have some very useful microbenchmarks implemented in Scala and tightly 
integrated with Comet. I would like to propose moving most of these to PySpark 
instead, making it easier to run them against clusters, and easier to run with 
different versions of Comet and with other Spark accelerators for comparison.
   
   It also removes the compile time from running the benchmarks. The 
accelerator jar is provided by `COMET_JAR` environment variable.
   
   I already prototyped this, using Claude Code to automate the conversion.
   
   Here are some sample outputs:
   
   ### Parquet Scans
   
   ```
   SQL Single DOUBLE Column Scan (13,421,772 rows)
   
---------------------------------------------------------------------------------------------------
   Case                           Best(ms)     Avg(ms)      Median      StdDev  
        Rate   vs base
   
---------------------------------------------------------------------------------------------------
   spark                              42.9        43.2        43.1         0.4  
    313.2M/s          
   gluten                             88.8        90.8        90.3         1.9  
    151.2M/s     0.48x
   comet_native_datafusion            35.1        36.9        37.2         1.1  
    382.4M/s     1.22x
   comet_native_iceberg_compat        36.5        37.5        37.6         0.7  
    367.9M/s     1.17x
   
---------------------------------------------------------------------------------------------------
   
   SQL Single Decimal(5,2) Column Scan (1,572,864 rows)
   
---------------------------------------------------------------------------------------------------
   Case                           Best(ms)     Avg(ms)      Median      StdDev  
        Rate   vs base
   
---------------------------------------------------------------------------------------------------
   spark                              19.0        21.8        21.1         2.1  
     82.8M/s          
   gluten                             66.8        70.4        69.6         3.1  
     23.5M/s     0.28x
   comet_native_datafusion            23.8        24.5        24.6         0.5  
     66.1M/s     0.80x
   comet_native_iceberg_compat        24.8        25.5        25.1         0.9  
     63.5M/s     0.77x
   
---------------------------------------------------------------------------------------------------
   ```
   
   ### Expressions
   
   ```
   contains (102 rows)
   
---------------------------------------------------------------------------------------------------
   Case                           Best(ms)     Avg(ms)      Median      StdDev  
        Rate   vs base
   
---------------------------------------------------------------------------------------------------
   spark                              15.2        17.9        17.0         3.2  
      6.7K/s          
   gluten                             36.6        39.2        38.8         2.0  
      2.8K/s     0.42x
   comet_native_datafusion            19.3        19.8        19.7         0.4  
      5.3K/s     0.79x
   comet_native_iceberg_compat        18.8        19.2        19.0         0.5  
      5.4K/s     0.81x
   
---------------------------------------------------------------------------------------------------
   
   endswith (102 rows)
   
---------------------------------------------------------------------------------------------------
   Case                           Best(ms)     Avg(ms)      Median      StdDev  
        Rate   vs base
   
---------------------------------------------------------------------------------------------------
   spark                              15.7        16.3        16.4         0.4  
      6.5K/s          
   gluten                             35.9        37.7        37.2         1.9  
      2.8K/s     0.44x
   comet_native_datafusion            17.9        18.6        18.4         0.6  
      5.7K/s     0.87x
   comet_native_iceberg_compat        18.2        20.6        19.5         3.7  
      5.6K/s     0.86x
   
---------------------------------------------------------------------------------------------------
   ```
   
   ### Joins
   
   ```
   Inner Join 20 cols (10,485 rows)
   
------------------------------------------------------------------------------------------
   Case                  Best(ms)     Avg(ms)      Median      StdDev          
Rate   vs base
   
------------------------------------------------------------------------------------------
   spark                    267.5       283.9       285.8        11.4       
39.2K/s          
   gluten                   297.8       315.7       318.9        15.4       
35.2K/s     0.90x
   gluten (hash join)       307.7       316.6       314.6         6.9       
34.1K/s     0.87x
   comet                    250.0       269.0       268.2        14.7       
41.9K/s     1.07x
   comet (hash join)        277.1       289.8       289.3        14.8       
37.8K/s     0.97x
   
------------------------------------------------------------------------------------------
   ```
   
   
   ### Describe the potential solution
   
   I already have this working, with most benchmarks ported over. If this 
approach is acceptable to the community then I can create a PR to add this and 
remove some of the existing Scala-based microbenchmarks, so we don't have 
duplicates to maintain.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Proposal: Port most microbenchmarks to PySpark [datafusion-comet]

Reply via email to