andygrove opened a new pull request, #3296: URL: https://github.com/apache/datafusion-comet/pull/3296
## Summary

This PR adds a microbenchmark for measuring the serialization/deserialization performance of Iceberg `FileScanTask` objects to protobuf. The benchmark:

- Creates a real Iceberg table with a configurable number of partitions (default: 30,000)
- Extracts actual `FileScanTask` objects through query planning
- Benchmarks conversion from `FileScanTask` to protobuf via `CometIcebergNativeScan.convert()`
- Benchmarks serialization to bytes and deserialization

A sketch of what each measurement case looks like is included at the end of this description.

### Usage

```bash
# Run with the default 30000 partitions
make benchmark-org.apache.spark.sql.benchmark.CometOperatorSerdeBenchmark

# Run with a custom partition count
make benchmark-org.apache.spark.sql.benchmark.CometOperatorSerdeBenchmark -- 1000
```

### Sample Results (1000 partitions)

```
IcebergScan serde (1000 partitions, 1000 tasks):      Best Time(ms)  Avg Time(ms)  Relative
-------------------------------------------------------------------------------------------
FileScanTask -> Protobuf (convert)                             1043          1058      1.0X
FileScanTask -> Protobuf -> bytes                              1126          1133      0.9X
bytes -> Protobuf (parseFrom)                                    10            11    107.7X
Full roundtrip (convert + serialize + deserialize)             1150          1159      0.9X
```

Key insight: the conversion from `FileScanTask` to protobuf dominates (~99% of the total time); protobuf parsing itself is extremely fast. Serialized size: 178.7 KB for 1000 tasks (~179 bytes/task).

## Test plan

- [x] Benchmark compiles and runs successfully
- [x] Results are consistent across multiple runs

🤖 Generated with [Claude Code](https://claude.ai/code)
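For reviewers who want a feel for the benchmark structure, here is a minimal sketch of the four measurement cases, assuming Spark's `org.apache.spark.benchmark.Benchmark` harness. `NativeScanProto` is a hypothetical placeholder for the generated protobuf message class, and the exact signature of `CometIcebergNativeScan.convert()` is assumed here; see the diff for the real types:

```scala
import org.apache.iceberg.FileScanTask
import org.apache.spark.benchmark.Benchmark

// Sketch only: `NativeScanProto` stands in for the generated protobuf
// message class, and convert() is assumed to take a single FileScanTask
// and return that message type.
def runSerdeBenchmark(tasks: Seq[FileScanTask]): Unit = {
  val benchmark =
    new Benchmark(s"IcebergScan serde (${tasks.size} tasks)", tasks.size)

  // Case 1: FileScanTask -> protobuf message (the dominant cost).
  benchmark.addCase("FileScanTask -> Protobuf (convert)") { _ =>
    tasks.foreach(CometIcebergNativeScan.convert)
  }

  // Case 2: conversion plus serialization to a byte array.
  benchmark.addCase("FileScanTask -> Protobuf -> bytes") { _ =>
    tasks.foreach(t => CometIcebergNativeScan.convert(t).toByteArray)
  }

  // Serialize once up front so case 3 measures only parseFrom().
  val serialized = tasks.map(t => CometIcebergNativeScan.convert(t).toByteArray)
  benchmark.addCase("bytes -> Protobuf (parseFrom)") { _ =>
    serialized.foreach(NativeScanProto.parseFrom)
  }

  // Case 4: full roundtrip, convert + serialize + deserialize.
  benchmark.addCase("Full roundtrip (convert + serialize + deserialize)") { _ =>
    tasks.foreach { t =>
      NativeScanProto.parseFrom(CometIcebergNativeScan.convert(t).toByteArray)
    }
  }

  benchmark.run()
}
```

Building the byte arrays once outside case 3 is what isolates `parseFrom()`, and is why deserialization shows up roughly 100x faster than the convert-bound cases in the sample results above.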
