mbutrovich opened a new pull request, #3243: URL: https://github.com/apache/datafusion-comet/pull/3243
## Which issue does this PR close?

Closes #.

## Rationale for this change

During Iceberg workloads, profiling revealed high GC time caused by Java deserialization of `List$SerializationProxy` objects (22.1 MiB of allocations per task).

Root cause: `CometIcebergNativeScanMetadata` contains heavyweight Iceberg objects (the `FileScanTask` list, table, and schemas) with nested Scala collections. These were included in the `equals()`/`hashCode()` methods, causing the Java serializer to recursively traverse the entire object graph during task serialization and creating allocation pressure.

The metadata is only needed during query planning, to convert Iceberg metadata to protobuf. After conversion, all necessary information exists in the serialized `nativeOp`, making the metadata unnecessary on executors.

## What changes are included in this PR?

1. Mark `nativeIcebergScanMetadata` as `@transient` in `CometBatchScanExec` so it is not serialized to executors.
2. Remove `nativeIcebergScanMetadata` from `equals()`/`hashCode()` in both `CometBatchScanExec` and `CometIcebergNativeScanExec` to prevent object-graph traversal during plan comparison.
3. Set `nativeIcebergScanMetadata = None` in `CometBatchScanExec.doCanonicalize()` to free driver memory during plan optimization.

Equivalence checking remains correct because `wrappedScan.equals()` already compares the underlying `BatchScanExec`, including all of its Iceberg metadata.

## How are these changes tested?

Existing tests. I will also try to benchmark the change locally.
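The pattern behind changes 1 and 2 can be sketched in plain Java (Scala's `@transient` maps to Java's `transient` keyword). The class and field names below are illustrative stand-ins, not the actual Comet classes: a plan node marks its heavyweight, planning-only metadata `transient` so Java serialization skips it, and excludes that field from `equals()`/`hashCode()` so plan comparison never walks the large object graph.

```java
import java.io.*;
import java.util.Objects;

// Hypothetical sketch: "PlanNode" and "scanMetadata" are illustrative names,
// standing in for CometBatchScanExec and nativeIcebergScanMetadata.
class PlanNode implements Serializable {
    final String nativeOp;          // stand-in for the serialized protobuf plan
    transient byte[] scanMetadata;  // heavyweight, planning-time only; skipped
                                    // by Java serialization because of `transient`

    PlanNode(String nativeOp, byte[] scanMetadata) {
        this.nativeOp = nativeOp;
        this.scanMetadata = scanMetadata;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof PlanNode)) return false;
        // Compare only the lightweight field; scanMetadata is deliberately
        // excluded so equality checks never traverse the big object graph.
        return nativeOp.equals(((PlanNode) o).nativeOp);
    }

    @Override
    public int hashCode() {
        return Objects.hash(nativeOp); // scanMetadata excluded here as well
    }

    // Serialize and deserialize the node, as Spark does when shipping tasks.
    static PlanNode roundTrip(PlanNode p) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(p);
            }
            try (ObjectInputStream ois =
                     new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
                return (PlanNode) ois.readObject();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        PlanNode original = new PlanNode("iceberg-scan-op", new byte[1 << 20]);
        PlanNode copy = PlanNode.roundTrip(original);
        // The transient metadata does not survive serialization...
        System.out.println("metadata after round trip: " + copy.scanMetadata);
        // ...but the nodes still compare equal on the lightweight field.
        System.out.println("still equal: " + original.equals(copy));
    }
}
```

After a serialize/deserialize round trip the `transient` field comes back `null`, yet the two nodes still compare equal, which is the behavior this PR relies on for plan comparison.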
