[PR] perf: cache serialized query plans to avoid per-partition serialization [datafusion-comet]

via GitHub Thu, 22 Jan 2026 15:24:05 -0800


andygrove opened a new pull request, #3246:
URL: https://github.com/apache/datafusion-comet/pull/3246


   ## Which issue does this PR close?
   
   N/A - Performance optimization
   
   ## Rationale for this change
   
   The `CometExec.getCometIterator()` method serializes the protobuf query plan 
to bytes every time it's called. Since this method is called inside 
`mapPartitions*` for each partition, the same plan was being serialized N times 
for a query with N partitions. This causes unnecessary CPU overhead and GC 
pressure.
   
   ## What changes are included in this PR?
   
   1. **operators.scala** - Added caching infrastructure:
      - `serializeNativePlan(nativePlan: Operator): Array[Byte]` - helper to 
serialize a plan once
      - New `getCometIterator` overload accepting pre-serialized `Array[Byte]` 
instead of `Operator`
      - Refactored existing method to use the new helper
   
   2. **CometExecUtils.scala** - Updated `getNativeLimitRDD`:
      - Serialize the limit plan once before `mapPartitionsWithIndexInternal`
      - Pass pre-serialized bytes to each partition
   
   3. **CometTakeOrderedAndProjectExec.scala** - Updated `doExecuteColumnar()`:
      - Serialize the topK plan once before the local topK mapping
      - Serialize the topKAndProjection plan once before the final mapping
   
   ## How are these changes tested?
   
   Existing tests pass:
   - CometNativeSuite: 4 tests
   - CometExecSuite: 89 tests (includes TakeOrderedAndProjectExec tests)
   - CometShuffleSuite: 41 tests
   
   🤖 Generated with [Claude Code](https://claude.ai/code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] perf: cache serialized query plans to avoid per-partition serialization [datafusion-comet]

Reply via email to