mbutrovich opened a new pull request, #3243: URL: https://github.com/apache/datafusion-comet/pull/3243
## Which issue does this PR close?

Closes #.

## Rationale for this change

During Iceberg workloads, profiling revealed high GC time caused by Java deserialization of `List$SerializationProxy` objects (22.1 MiB of allocations per task).

Root cause: `CometIcebergNativeScanMetadata` contains heavyweight Iceberg objects (the `FileScanTask` list, table, and schemas) with nested Scala collections. These were included in the `equals()`/`hashCode()` methods, causing the Java serializer to recursively traverse the entire object graph during task serialization and creating allocation pressure.

The metadata is only needed during query planning, to convert Iceberg metadata to protobuf. After conversion, all necessary information exists in the serialized `nativeOp`, making the metadata unnecessary on executors.

## What changes are included in this PR?

1. Mark `nativeIcebergScanMetadata` as `@transient` in `CometBatchScanExec` so it is not serialized to executors.
2. Remove `nativeIcebergScanMetadata` from `equals()`/`hashCode()` in both `CometBatchScanExec` and `CometIcebergNativeScanExec` to prevent object-graph traversal during plan comparison.
3. Set `nativeIcebergScanMetadata = None` in `CometBatchScanExec.doCanonicalize()` to free driver memory during plan optimization.

Equivalence checking remains correct because `wrappedScan.equals()` already compares the underlying `BatchScanExec`, including all of its Iceberg metadata.

## How are these changes tested?

Existing tests. I will also try to benchmark the change locally.
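The pattern behind changes 1 and 2 can be sketched in plain Java (Scala's `@transient` maps to Java's `transient` keyword). The class and field names below are illustrative stand-ins, not the actual Comet classes: a plan node marks its heavyweight, planning-only metadata `transient` so Java serialization skips it, and excludes that field from `equals()`/`hashCode()` so plan comparison never walks the large object graph.

```java
import java.io.*;
import java.util.Objects;

// Hypothetical sketch: "PlanNode" and "scanMetadata" are illustrative names,
// standing in for CometBatchScanExec and nativeIcebergScanMetadata.
class PlanNode implements Serializable {
    final String nativeOp;          // stand-in for the serialized protobuf plan
    transient byte[] scanMetadata;  // heavyweight, planning-time only; skipped
                                    // by Java serialization because of `transient`

    PlanNode(String nativeOp, byte[] scanMetadata) {
        this.nativeOp = nativeOp;
        this.scanMetadata = scanMetadata;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof PlanNode)) return false;
        // Compare only the lightweight field; scanMetadata is deliberately
        // excluded so equality checks never traverse the big object graph.
        return nativeOp.equals(((PlanNode) o).nativeOp);
    }

    @Override
    public int hashCode() {
        return Objects.hash(nativeOp); // scanMetadata excluded here as well
    }

    // Serialize and deserialize the node, as Spark does when shipping tasks.
    static PlanNode roundTrip(PlanNode p) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(p);
            }
            try (ObjectInputStream ois =
                     new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
                return (PlanNode) ois.readObject();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        PlanNode original = new PlanNode("iceberg-scan-op", new byte[1 << 20]);
        PlanNode copy = PlanNode.roundTrip(original);
        // The transient metadata does not survive serialization...
        System.out.println("metadata after round trip: " + copy.scanMetadata);
        // ...but the nodes still compare equal on the lightweight field.
        System.out.println("still equal: " + original.equals(copy));
    }
}
```

After a serialize/deserialize round trip the `transient` field comes back `null`, yet the two nodes still compare equal, which is the behavior this PR relies on for plan comparison.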
