andygrove opened a new pull request, #3296: URL: https://github.com/apache/datafusion-comet/pull/3296
## Summary

This PR adds a microbenchmark for measuring the serialization/deserialization performance of Iceberg `FileScanTask` objects to protobuf. The benchmark:

- Creates a real Iceberg table with a configurable number of partitions (default: 30,000)
- Extracts actual `FileScanTask` objects through query planning
- Benchmarks conversion from `FileScanTask` to protobuf via `CometIcebergNativeScan.convert()`
- Benchmarks serialization to bytes and deserialization

A sketch of what each measurement case looks like is included at the end of this description.

### Usage

```bash
# Run with the default 30000 partitions
make benchmark-org.apache.spark.sql.benchmark.CometOperatorSerdeBenchmark

# Run with a custom partition count
make benchmark-org.apache.spark.sql.benchmark.CometOperatorSerdeBenchmark -- 1000
```

### Sample Results (1000 partitions)

```
IcebergScan serde (1000 partitions, 1000 tasks):      Best Time(ms)  Avg Time(ms)  Relative
-------------------------------------------------------------------------------------------
FileScanTask -> Protobuf (convert)                             1043          1058      1.0X
FileScanTask -> Protobuf -> bytes                              1126          1133      0.9X
bytes -> Protobuf (parseFrom)                                    10            11    107.7X
Full roundtrip (convert + serialize + deserialize)             1150          1159      0.9X
```

Key insight: the conversion from `FileScanTask` to protobuf dominates (~99% of the total time); protobuf parsing itself is extremely fast. Serialized size: 178.7 KB for 1000 tasks (~179 bytes/task).

## Test plan

- [x] Benchmark compiles and runs successfully
- [x] Results are consistent across multiple runs

🤖 Generated with [Claude Code](https://claude.ai/code)
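For reviewers who want a feel for the benchmark structure, here is a minimal sketch of the four measurement cases, assuming Spark's `org.apache.spark.benchmark.Benchmark` harness. `NativeScanProto` is a hypothetical placeholder for the generated protobuf message class, and the exact signature of `CometIcebergNativeScan.convert()` is assumed here; see the diff for the real types:

```scala
import org.apache.iceberg.FileScanTask
import org.apache.spark.benchmark.Benchmark

// Sketch only: `NativeScanProto` stands in for the generated protobuf
// message class, and convert() is assumed to take a single FileScanTask
// and return that message type.
def runSerdeBenchmark(tasks: Seq[FileScanTask]): Unit = {
  val benchmark =
    new Benchmark(s"IcebergScan serde (${tasks.size} tasks)", tasks.size)

  // Case 1: FileScanTask -> protobuf message (the dominant cost).
  benchmark.addCase("FileScanTask -> Protobuf (convert)") { _ =>
    tasks.foreach(CometIcebergNativeScan.convert)
  }

  // Case 2: conversion plus serialization to a byte array.
  benchmark.addCase("FileScanTask -> Protobuf -> bytes") { _ =>
    tasks.foreach(t => CometIcebergNativeScan.convert(t).toByteArray)
  }

  // Serialize once up front so case 3 measures only parseFrom().
  val serialized = tasks.map(t => CometIcebergNativeScan.convert(t).toByteArray)
  benchmark.addCase("bytes -> Protobuf (parseFrom)") { _ =>
    serialized.foreach(NativeScanProto.parseFrom)
  }

  // Case 4: full roundtrip, convert + serialize + deserialize.
  benchmark.addCase("Full roundtrip (convert + serialize + deserialize)") { _ =>
    tasks.foreach { t =>
      NativeScanProto.parseFrom(CometIcebergNativeScan.convert(t).toByteArray)
    }
  }

  benchmark.run()
}
```

Building the byte arrays once outside case 3 is what isolates `parseFrom()`, and is why deserialization shows up roughly 100x faster than the convert-bound cases in the sample results above.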
