andygrove opened a new pull request, #3246:
URL: https://github.com/apache/datafusion-comet/pull/3246
## Which issue does this PR close?
N/A - Performance optimization
## Rationale for this change
The `CometExec.getCometIterator()` method serializes the protobuf query plan
to bytes every time it's called. Since this method is called inside
`mapPartitions*` for each partition, the same plan was being serialized N times
for a query with N partitions. This causes unnecessary CPU overhead and GC
pressure.
## What changes are included in this PR?
1. **operators.scala** - Added caching infrastructure:
- `serializeNativePlan(nativePlan: Operator): Array[Byte]` - helper to
serialize a plan once
- New `getCometIterator` overload accepting pre-serialized `Array[Byte]`
instead of `Operator`
- Refactored existing method to use the new helper
2. **CometExecUtils.scala** - Updated `getNativeLimitRDD`:
- Serialize the limit plan once before `mapPartitionsWithIndexInternal`
- Pass pre-serialized bytes to each partition
3. **CometTakeOrderedAndProjectExec.scala** - Updated `doExecuteColumnar()`:
- Serialize the topK plan once before the local topK mapping
- Serialize the topKAndProjection plan once before the final mapping
## How are these changes tested?
Existing tests pass:
- CometNativeSuite: 4 tests
- CometExecSuite: 89 tests (includes TakeOrderedAndProjectExec tests)
- CometShuffleSuite: 41 tests
🤖 Generated with [Claude Code](https://claude.ai/code)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]