Re: [PR] feat: support native Comet scan of plain Delta Lake tables [datafusion-comet]

via GitHub Thu, 18 Jun 2026 07:21:14 -0700


andygrove commented on PR #4669:
URL: 
https://github.com/apache/datafusion-comet/pull/4669#issuecomment-4742904875


   Thanks for the careful writeup, the explicit framing of how this complements 
(rather than replaces) @schenksj's contrib work in #4366, and the AI-disclosure 
block. I appreciate the discipline of disclosing tooling per the ASF guidance 
and including the human-oversight statement.
   
   I'll defer to @schenksj on the strategic question of whether an in-core Tier 
0 path is the right shape alongside the contrib `delta-kernel-rs` work, since 
they have the most context. A few specific suggestions on the change itself:
   
   1. **Fragile DV detection.** The deletion-vector guard relies on the 
`__delta_internal` column-name prefix, which is a Delta implementation detail 
rather than a public contract. The Tier 0 DV test cancels itself when `DELETE` 
materializes as a copy-on-write rewrite instead of a deletion vector, so on 
some build-matrix slots there's no guaranteed coverage of the DV fallback path. 
Could you add a unit-level test that constructs a synthetic 
`FileSourceScanExec` with a `__delta_internal_row_index` column in its required 
schema and asserts the rule rejects it? That would pin the column-name contract 
independent of Delta version.
   
   2. **Type widening (Delta 3.2+).** A Delta table that has had a column 
widened stores the older type in Parquet and widens it on read. A native 
Parquet scan would return the narrower type, which would silently disagree with 
Spark. Worth either adding a guard or filing a tracking issue and noting it in 
the docstring's unsupported list. Same question for row tracking and the 
variant type in Delta 4.0+.
   
   3. **Reflection catch is too wide.** In `deltaColumnMappingEnabled`, `case 
_: Throwable => true` also swallows `Error` subclasses like `OutOfMemoryError` 
and `LinkageError`. The conservative-fallback policy is right, but `case _: 
Exception` (or `ReflectiveOperationException` then `Exception`) keeps that 
policy without masking VM-level errors.
   
   4. **Both flags on.** There's no test that pins down behavior when 
`spark.comet.scan.delta.enabled=true` and 
`spark.comet.convert.parquet.enabled=true` are both set. From reading the code 
the native scan rule runs first so the convert path doesn't kick in, but a 
one-line assertion would lock it down.
   
   5. **Older Delta versions.** The Tier 0 tests `assume` away builds where 
`columnMappingMode` reflection isn't available (Delta 2.x). If 
`isFileFormatSupported` refused to widen when that method can't be located, 
runtime behavior on older Delta would be unchanged and the test matrix wouldn't 
be doing the load-bearing version gating.
   
   I'll approve the CI workflows so the matrix actually runs across all the 
Delta versions you've wired in. The `cancelIfDeltaDmlSkew` carve-out 
specifically suggests the Spark 4.1.2 + Delta 4.1.0 slot may have an 
interesting DML-skew interaction worth eyeballing in the logs once they exist.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: support native Comet scan of plain Delta Lake tables [datafusion-comet]

Reply via email to