adityavaish opened a new pull request, #4669:
URL: https://github.com/apache/datafusion-comet/pull/4669

   ## Which issue does this PR close?
   
   Part of #174 (Explore integration with Delta Lake). It does **not** close 
#174 — that issue tracks broader Delta integration (writes, deletion vectors, 
column mapping, full native scan); this PR adds the minimal read-only piece.
   
   Prior art / related work:
   - #3035 ("Support basic Delta scans") proposed the same idea and was closed 
only as **stale**, not rejected. This PR revives that approach.
   - It is intentionally distinct from the heavier `delta-kernel-rs` native 
scan explored in #3932 (closed in favor of a contrib module) and #4366 (draft 
`contrib/` tracking PR). **This PR adds no native Rust code and no new runtime 
dependency** — it reuses Comet's existing native Parquet reader.
   
   ## Rationale for this change
   
   A Delta table that uses neither deletion vectors nor column mapping is just 
plain Parquet on disk, read through Spark's `FileSourceScanExec` with 
`DeltaParquetFileFormat` (a subclass of `ParquetFileFormat`). Today Comet 
rejects it because `CometScanExec.isFileFormatSupported` requires the **exact** 
`ParquetFileFormat` class, so these scans run entirely in Spark even though 
Comet's native Parquet reader could read the files unchanged.
   
   On a scan-heavy micro-benchmark (20M-row plain Delta table, 
`filter(...).agg(sum, sum, sum)`, 5 iterations after warmup, `local[5]`), 
enabling the native scan was **~12.5x faster than Spark** (~600 ms vs ~7.5 s 
end-to-end), and ~41x faster than the existing 
`spark.comet.convert.parquet.enabled` path (which adds per-row Arrow conversion 
and is actually slower than Spark for scan-bound queries). Numbers are from a 
single dev box and a favorable query shape — real-world gains will be more 
modest.
   
   ## What changes are included in this PR?
   
   - New config `spark.comet.scan.delta.enabled` (default `false`, 
experimental).
   - `CometScanExec.isFileFormatSupported` accepts `DeltaParquetFileFormat` 
(matched by class name, **no compile-time dependency on delta-spark**) when the 
flag is enabled.
   - `CometScanRule` routes a Delta `FileSourceScanExec` through the existing 
native Parquet scan path, with **conservative fallback guards**:
     - **Column mapping** — detected via 
`DeltaParquetFileFormat.columnMappingMode` (reflection). Delta strips 
column-mapping metadata from the schema it exposes to the scan, so the file 
format object is the only reliable signal.
     - **Deletion vectors** — detected via Delta's synthetic 
`__delta_internal_*` scan columns.
     - On any reflection failure the table is treated as unsupported (fall 
back, never guess).
   - `CometDeltaReadSuite` + a per-Spark-profile `delta` test dependency wired 
through `delta.version` / `delta.artifact.name` properties (`delta-spark` for 
Delta ≥ 3.0, `delta-core` for Delta < 3.0).
   
   Deletion vectors, column mapping, and native Delta **writes** are 
intentionally out of scope and continue to run in Spark.
   
   ## How are these changes tested?
   
   New `CometDeltaReadSuite`, validated on the default build (Spark 4.1.2 + 
Delta 4.1.0), all green:
   
   ```
   CometDeltaReadSuite:
     + tier0 plain plan nodes:        CometNativeColumnarToRowExec, 
CometNativeScanExec
     + tier0 partitioned plan nodes:  CometNativeColumnarToRowExec, 
CometProjectExec, CometNativeScanExec
     + tier0 deletion-vectors nodes:  ... FileSourceScanExec   (falls back to 
Spark)
     + tier0 column-mapping nodes:    ... FileSourceScanExec   (falls back to 
Spark)
   Tests: succeeded 9, failed 0, canceled 0, ignored 0, pending 0
   ```
   
   Coverage:
   - Plain and partitioned reads now use `CometNativeScanExec`; results 
verified equal to Spark via `checkSparkAnswer`.
   - Deletion-vector and column-mapping tables correctly fall back to Spark 
(asserted) and still return correct results.
   - Read-analysis cases (convert off/on, downstream aggregate) documenting how 
a Delta scan flows through Comet.
   - Feature-specific tests self-cancel on Delta versions that lack 
`columnMappingMode` / deletion vectors, so the suite stays green across the 
Spark/Delta build matrix.
   
   No impact on non-Delta scans — the new branch is gated on a default-`false` 
flag and only runs for Delta `FileSourceScanExec`. Regression checks:
   - `CometScanRuleSuite` + `CometExecRuleSuite`: 16 passed
   - `CometNativeReaderSuite`: 51 passed (1 canceled)
   - `cargo test` (native): 634 passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to