andygrove opened a new issue, #3762: URL: https://github.com/apache/datafusion-comet/issues/3762
### What is the problem the feature request solves? ## Summary When a native operator reads from a broadcast exchange, the data currently passes through Arrow FFI via `ScanExec`. This is the same path that shuffle reads used before `ShuffleScanExec` was introduced in #3731. ## Current Path 1. `CometBroadcastExchangeExec` serializes data as coalesced, compressed Arrow IPC on the driver 2. On each executor, `CometBatchRDD.compute()` deserializes the IPC stream → `ColumnarBatch` objects on the JVM 3. A `CometBatchIterator` exposes those batches to native code 4. Native `ScanExec` reads via Arrow FFI (`CometBatchIterator.next()` exports via Arrow C Data Interface) ## Proposed Path Introduce a `BroadcastScanExec` native operator analogous to `ShuffleScanExec`: 1. Pass raw compressed Arrow IPC bytes from the JVM to native code (skip JVM-side deserialization) 2. Decompress and decode natively using `read_ipc_compressed()` 3. This avoids the JVM-side decompression + `ArrowStreamReader` deserialization + Arrow FFI export overhead ## Work Required - New JVM iterator class (e.g. `CometBroadcastBlockIterator`) that yields raw compressed IPC bytes, similar to `CometShuffleBlockIterator` - New protobuf op `BroadcastScan` alongside existing `ShuffleScan` - New Rust operator `BroadcastScanExec` in `native/core/src/execution/operators/` - Planner changes in `CometSink.scala` to emit `BroadcastScan` instead of `Scan` for broadcast exchange inputs (behind a config flag like `spark.comet.exec.broadcast.directRead.enabled`) ## Notes - Broadcast data is already marked `arrow_ffi_safe=true` since it comes from `ArrowStreamReader` with no mutable buffers, so the current FFI path is at least safe - Broadcast volumes are typically smaller than shuffle, so the impact may be less dramatic than #3731, but the overhead of JVM-side decompression and FFI export is still real ### Describe the potential solution _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
