andygrove commented on PR #4669: URL: https://github.com/apache/datafusion-comet/pull/4669#issuecomment-4742904875
Thanks for the careful writeup, the explicit framing of how this complements (rather than replaces) @schenksj's contrib work in #4366, and the AI-disclosure block. I appreciate the discipline of disclosing tooling per the ASF guidance and including the human-oversight statement. I'll defer to @schenksj on the strategic question of whether an in-core Tier 0 path is the right shape alongside the contrib `delta-kernel-rs` work, since they have the most context. A few specific suggestions on the change itself: 1. **Fragile DV detection.** The deletion-vector guard relies on the `__delta_internal` column-name prefix, which is a Delta implementation detail rather than a public contract. The Tier 0 DV test cancels itself when `DELETE` materializes as a copy-on-write rewrite instead of a deletion vector, so on some build-matrix slots there's no guaranteed coverage of the DV fallback path. Could you add a unit-level test that constructs a synthetic `FileSourceScanExec` with a `__delta_internal_row_index` column in its required schema and asserts the rule rejects it? That would pin the column-name contract independent of Delta version. 2. **Type widening (Delta 3.2+).** A Delta table that has had a column widened stores the older type in Parquet and widens it on read. A native Parquet scan would return the narrower type, which would silently disagree with Spark. Worth either adding a guard or filing a tracking issue and noting it in the docstring's unsupported list. Same question for row tracking and the variant type in Delta 4.0+. 3. **Reflection catch is too wide.** In `deltaColumnMappingEnabled`, `case _: Throwable => true` also swallows `Error` subclasses like `OutOfMemoryError` and `LinkageError`. The conservative-fallback policy is right, but `case _: Exception` (or `ReflectiveOperationException` then `Exception`) keeps that policy without masking VM-level errors. 4. **Both flags on.** There's no test that pins down behavior when `spark.comet.scan.delta.enabled=true` and `spark.comet.convert.parquet.enabled=true` are both set. From reading the code the native scan rule runs first so the convert path doesn't kick in, but a one-line assertion would lock it down. 5. **Older Delta versions.** The Tier 0 tests `assume` away builds where `columnMappingMode` reflection isn't available (Delta 2.x). If `isFileFormatSupported` refused to widen when that method can't be located, runtime behavior on older Delta would be unchanged and the test matrix wouldn't be doing the load-bearing version gating. I'll approve the CI workflows so the matrix actually runs across all the Delta versions you've wired in. The `cancelIfDeltaDmlSkew` carve-out specifically suggests the Spark 4.1.2 + Delta 4.1.0 slot may have an interesting DML-skew interaction worth eyeballing in the logs once they exist. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
