Re: [PR] feat: Add Native Support for In-Memory Cache [datafusion-comet]

via GitHub Fri, 19 Jun 2026 14:51:24 -0700


andygrove commented on PR #4591:
URL: 
https://github.com/apache/datafusion-comet/pull/4591#issuecomment-4754903178


   ## Comparison of #4569 and #4591
   
   These two PRs both close #2391 and take the same fundamental approach, so 
cross-linking a comparison here for visibility.
   
   ### Shared goal and mechanics
   
   Both solve the same problem: Comet does not treat `InMemoryTableScanExec` as 
native, so it inserts a `CometSparkToColumnarExec` and pays a JVM-to-Arrow 
conversion on every cached read. Both fix this by storing the cache as 
compressed Arrow IPC once at build time, so repeated scans feed native 
execution directly.
   
   Both share the same core building blocks:
   
   - A custom `CachedBatchSerializer` that encodes batches with Comet's 
existing `serializeBatches` / `decodeBatches` (compressed Arrow IPC).
   - A `CometCachedBatch` payload holding the IPC bytes.
   - Installation of the serializer via `CometDriverPlugin` at startup.
   - A new config, off by default.
   - Decode back to `CometVector`-backed `ColumnarBatch` with column pruning, 
plus an `InternalRow` fallback for non-Comet consumers.
   - Roughly the same size and the same test layout (a serializer suite plus a 
plugin/exec suite).
   
   ### Key differences
   
   | Dimension | #4569 | #4591 |
   |---|---|---|
   | Serializer base class | `SimpleMetricsCachedBatchSerializer` | plain 
`CachedBatchSerializer` |
   | Batch-level pruning | Yes: stores a Spark-format per-column stats row 
(min/max/null/count), so `buildFilter` prunes batches | No: `stats = 
InternalRow.empty`, `buildFilter` is a pass-through no-op |
   | How the scan stays native | No new operator. Reuses 
`CometSparkToColumnarExec` with a passthrough fast-path: if columns are already 
`CometVector`, forward without re-copy (adds `numPassthroughBatches` metric) | 
New operator `CometInMemoryTableScanExec` (a `CometExec` / `LeafExecNode`) plus 
a `CometOperatorSerde`, wired into `CometExecRule`'s `nativeExecs` map |
   | Unsupported types | Delegates to `DefaultCachedBatchSerializer` 
(nested/complex) as an explicit drop-in | Delegates to 
`DefaultCachedBatchSerializer` by inspecting batch class in the convert methods 
|
   | Serializer install policy | Sets `spark.sql.cache.serializer` only when 
enabled, and never overrides a user-provided non-default serializer | Always 
registers the serializer; the serializer decides at runtime whether to use 
Comet format or delegate |
   | Config | `spark.comet.cache.serializer.enabled` | 
`spark.comet.exec.inMemoryCache.enabled` (EXEC category) |
   | Plan-rule changes | Minimal (works through the existing SparkToColumnar 
path) | Adds an explicit `InMemoryTableScanExec` case with detailed 
fallback-reason messages (disabled / wrong-batch-class / empty-buffer) |
   
   ### Architectural distinction
   
   The main difference is the integration strategy:
   
   - **#4569 is serializer-centric.** It does the minimum on the plan-rule side 
and leans on the existing `CometSparkToColumnarExec`, teaching it to skip the 
copy when batches are already Arrow. It also invests in stats so filter 
pushdown prunes cached batches, and is careful to respect a user-set serializer.
   - **#4591 is operator-centric.** It introduces a dedicated 
`CometInMemoryTableScanExec` node and serde, giving cached scans a first-class 
place in the Comet operator framework with explicit fallback reasons. It 
currently skips column statistics, so cached filters do not get batch pruning.
   
   ### Suggested path forward
   
   A strong combined outcome would pair #4591's dedicated scan operator and 
explicit fallback reasons with #4569's stats-based `buildFilter`, passthrough 
fast-path, and respect-user-serializer install logic.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: Add Native Support for In-Memory Cache [datafusion-comet]

Reply via email to