andygrove commented on PR #4591: URL: https://github.com/apache/datafusion-comet/pull/4591#issuecomment-4754903178
## Comparison of #4569 and #4591 These two PRs both close #2391 and take the same fundamental approach, so cross-linking a comparison here for visibility. ### Shared goal and mechanics Both solve the same problem: Comet does not treat `InMemoryTableScanExec` as native, so it inserts a `CometSparkToColumnarExec` and pays a JVM-to-Arrow conversion on every cached read. Both fix this by storing the cache as compressed Arrow IPC once at build time, so repeated scans feed native execution directly. Both share the same core building blocks: - A custom `CachedBatchSerializer` that encodes batches with Comet's existing `serializeBatches` / `decodeBatches` (compressed Arrow IPC). - A `CometCachedBatch` payload holding the IPC bytes. - Installation of the serializer via `CometDriverPlugin` at startup. - A new config, off by default. - Decode back to `CometVector`-backed `ColumnarBatch` with column pruning, plus an `InternalRow` fallback for non-Comet consumers. - Roughly the same size and the same test layout (a serializer suite plus a plugin/exec suite). ### Key differences | Dimension | #4569 | #4591 | |---|---|---| | Serializer base class | `SimpleMetricsCachedBatchSerializer` | plain `CachedBatchSerializer` | | Batch-level pruning | Yes: stores a Spark-format per-column stats row (min/max/null/count), so `buildFilter` prunes batches | No: `stats = InternalRow.empty`, `buildFilter` is a pass-through no-op | | How the scan stays native | No new operator. Reuses `CometSparkToColumnarExec` with a passthrough fast-path: if columns are already `CometVector`, forward without re-copy (adds `numPassthroughBatches` metric) | New operator `CometInMemoryTableScanExec` (a `CometExec` / `LeafExecNode`) plus a `CometOperatorSerde`, wired into `CometExecRule`'s `nativeExecs` map | | Unsupported types | Delegates to `DefaultCachedBatchSerializer` (nested/complex) as an explicit drop-in | Delegates to `DefaultCachedBatchSerializer` by inspecting batch class in the convert methods | | Serializer install policy | Sets `spark.sql.cache.serializer` only when enabled, and never overrides a user-provided non-default serializer | Always registers the serializer; the serializer decides at runtime whether to use Comet format or delegate | | Config | `spark.comet.cache.serializer.enabled` | `spark.comet.exec.inMemoryCache.enabled` (EXEC category) | | Plan-rule changes | Minimal (works through the existing SparkToColumnar path) | Adds an explicit `InMemoryTableScanExec` case with detailed fallback-reason messages (disabled / wrong-batch-class / empty-buffer) | ### Architectural distinction The main difference is the integration strategy: - **#4569 is serializer-centric.** It does the minimum on the plan-rule side and leans on the existing `CometSparkToColumnarExec`, teaching it to skip the copy when batches are already Arrow. It also invests in stats so filter pushdown prunes cached batches, and is careful to respect a user-set serializer. - **#4591 is operator-centric.** It introduces a dedicated `CometInMemoryTableScanExec` node and serde, giving cached scans a first-class place in the Comet operator framework with explicit fallback reasons. It currently skips column statistics, so cached filters do not get batch pruning. ### Suggested path forward A strong combined outcome would pair #4591's dedicated scan operator and explicit fallback reasons with #4569's stats-based `buildFilter`, passthrough fast-path, and respect-user-serializer install logic. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
