schenksj opened a new pull request, #4366:
URL: https://github.com/apache/datafusion-comet/pull/4366
## Summary
Native Delta Lake scan integration for Comet, structured as an Iceberg-style
contrib (typed `OpStruct::DeltaScan` proto + a small set of feature-gated core
touchpoints + a separate `contrib/delta/` tree). Replaces the SPI / registry
design from #3932 per the feedback on that PR.
**This PR supersedes #3932**, which is being closed in favor of this one.
## What's in here
- **`contrib/delta/`** — full Delta integration (Scala + Rust + dev scripts).
Scan rule, native scan exec, plan-data injector, kernel-rs engine cache,
partition-value parsing, DV filtering, column-mapping, row-tracking, DPP
partition pruning, multi-task packing, `input_file_name()` support.
- **Typed proto variant** `delta_scan = 117` on `OpStruct` (no envelope op).
- **~40 lines of core touchpoints**, all feature-gated:
  - `DeltaIntegration` reflection bridge in `org.apache.comet.rules`
  - One arm in `CometScanRule.transformV1Scan`
  - One arm in `CometExecRule.transform` for the Delta scan marker
  - One trait method (`opStructCase`) on `PlanDataInjector`
  - Per-partition file-path plumbing through `CometExecRDD` →
    `CometExecIterator` so `wrapNativeParquetError` and
    `InputFileBlockHolder` get the right path (used by Delta's
    UPDATE/DELETE/MERGE `input_file_name()` flows)
- **Maven `-Pcontrib-delta` profile** in `spark/pom.xml` + a parallel Cargo
`contrib-delta` feature in the native crates
- **Standalone regression harness** at `contrib/delta/dev/run-regression.sh`
that runs the full Delta 4.1 Spark test suite against this branch
## Notable fixes that landed during validation
- `de9e0d3c` + `ed0d8acb` — `FAILED_READ_FILE.NO_HINT` wrapping for native
parquet errors, with file path threaded from scan partitions (fixes
`SnapshotManagementSuite` checkpoint-broken tests)
- `effe5f76` — filesystem scheme allowlist on V1 scans (fixes `fake://` test
fallback)
- `56c2b011` — decline `CreateArray` when children have mismatched data
types (fixes CDF `replaceWhere` panics; upstream issue filed as
[apache/datafusion#22366](https://github.com/apache/datafusion/issues/22366))
- `90969575` — engine cache by `(scheme, authority, config)` to bound
OS-thread churn under high scan rate
- `43768c1c` — review-fix bundle: missing `InputFileBlockHolder` hook, DV
filter ordering safeguards, tighter `is_not_found` matching, multi-line parquet
error regex
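
For reviewers who want a picture of the engine-cache idea in `90969575`, here is a minimal sketch, not the code in this PR. It assumes the cache is a mutex-guarded map keyed by `(scheme, authority, config)`. `Engine`, `EngineKey`, and `EngineCache` are hypothetical names chosen for illustration, and `Engine` is a stand-in for whatever expensive object (thread pool, object-store client) the real kernel-rs engine owns.

```rust
use std::collections::{BTreeMap, HashMap};
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for an expensive engine instance.
#[derive(Debug)]
pub struct Engine {
    pub id: usize,
}

/// Hypothetical cache key: (scheme, authority, config).
/// A BTreeMap keeps the config ordered, so equal configs hash identically.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct EngineKey {
    pub scheme: String,
    pub authority: String,
    pub config: BTreeMap<String, String>,
}

/// Process-wide cache: at most one engine per distinct key.
#[derive(Default)]
pub struct EngineCache {
    engines: Mutex<HashMap<EngineKey, Arc<Engine>>>,
}

impl EngineCache {
    /// Returns the cached engine for `key`, creating it on first use.
    pub fn get_or_create(&self, key: EngineKey) -> Arc<Engine> {
        let mut engines = self.engines.lock().unwrap();
        let next_id = engines.len();
        let engine = engines
            .entry(key)
            .or_insert_with(|| Arc::new(Engine { id: next_id }));
        Arc::clone(engine)
    }

    /// Number of distinct engines created so far.
    pub fn len(&self) -> usize {
        self.engines.lock().unwrap().len()
    }
}
```

Under these assumptions, repeated scans against the same store and configuration share one engine, so the number of engines (and any threads they own) grows with the number of distinct keys rather than with the scan rate.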
## Perf-sweep items addressed (vs. initial direct-port)
- `7e9249f6` — cache resolved `Method` handle in `DeltaIntegration`
- `fea28d7e` — `Map[OpStructCase, PlanDataInjector]` O(1) injector lookup
- `fea28d7e` — pre-parsed `SessionTimezone` (one parse per scan, not per row)
- `a805f813` — hoist `CometScanTypeChecker` out of per-scan loop
- `e3467761` — O(1) partition-value lookup in `build_delta_partitioned_files`
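
The partition-value item (`e3467761`) can be illustrated with a small sketch. This is an assumption about the general technique, not the body of `build_delta_partitioned_files`: index each file's partition values in a `HashMap` once, then resolve every partition column with a constant-time lookup instead of a linear scan. The function names and the `(String, Option<String>)` representation are hypothetical.

```rust
use std::collections::HashMap;

/// Builds a one-time index from partition column name to its (nullable) value.
pub fn index_partition_values(
    values: &[(String, Option<String>)],
) -> HashMap<&str, Option<&str>> {
    values
        .iter()
        .map(|(name, value)| (name.as_str(), value.as_deref()))
        .collect()
}

/// Resolves partition values in schema order using the index.
/// Assumption for this sketch: a column absent from `values` is treated as null.
pub fn project_partition_values(
    partition_columns: &[&str],
    values: &[(String, Option<String>)],
) -> Vec<Option<String>> {
    let index = index_partition_values(values);
    partition_columns
        .iter()
        .map(|col| index.get(*col).copied().flatten().map(str::to_owned))
        .collect()
}
```

With the index built once per file, the cost per file is proportional to the number of partition columns, rather than to the number of columns multiplied by the number of recorded values.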
## Test plan
- [x] Targeted retest of all surfaced failure clusters passes against
current branch:
  - `DescribeDeltaHistorySuite "replaceWhere on data column"` — 8/8 pass
  - `DeltaTableHadoopOptionsSuite "dropFeatureSupport - with filesystem options"` — 1/1 pass
  - `SnapshotManagementSuite "should not recover when the current checkpoint is broken..."` — 2/2 pass
- [ ] Full Delta 4.1 regression in progress (relaunched post review-fix
bundle)
- [x] Default builds (no `-Pcontrib-delta`) still build green: `mvn -pl
spark -am test-compile`
- [x] `-Pcontrib-delta` builds green
- [x] Engine-cache fix verified to bound OS thread count (previously hitting
`pthread_create EAGAIN`)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]