felipepessoto opened a new pull request, #12388:
URL: https://github.com/apache/gluten/pull/12388

   Fix https://github.com/apache/gluten/issues/9296.
   
   ## What changes are proposed in this pull request?
   Adds a CI pipeline that runs delta-io/delta's `spark` ScalaTest suite 
against the Gluten Velox bundle, so we can validate Gluten against a real Delta 
release and catch regressions over time.
   
   Running the Delta UTs on Gluten produces **many expected failures** (Gluten 
does not yet offload every Delta code path, and falls back or behaves 
differently in places). A plain "red on any failure" gate would be useless. 
Instead, the pipeline keeps a **committed baseline of known failures** and 
gates each run against it:
   
   - **regression** -- a test fails that is *not* in the baseline -> the shard fails.
   - **expected** -- a test fails that *is* in the baseline -> ignored.
   - **now-passing** -- a baseline test passes -> the shard fails (keeps the baseline honest), unless `fail_on_fixed=false`.
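
The three outcomes above can be sketched in a few lines of Python. This is an illustration only, not the actual `compare-test-results.py`: the names `GateResult` and `classify` are hypothetical, and counting a baseline test as now-passing only when it actually ran in the current shard is an assumption about how a sharded gate would have to behave.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    regressions: frozenset   # failed, not in baseline
    expected: frozenset      # failed, in baseline
    now_passing: frozenset   # in baseline, ran, did not fail

    def shard_fails(self, fail_on_fixed: bool = True) -> bool:
        # Regressions always fail the shard; now-passing tests fail it
        # only when fail_on_fixed is left enabled.
        return bool(self.regressions) or (fail_on_fixed and bool(self.now_passing))


def classify(failed, executed, baseline) -> GateResult:
    """Split one shard's results into the three gate categories."""
    failed, executed, baseline = set(failed), set(executed), set(baseline)
    regressions = failed - baseline
    expected = failed & baseline
    # Assumption: baseline entries that did not run in this shard are
    # left alone instead of being reported as fixed.
    now_passing = (baseline & executed) - failed
    return GateResult(frozenset(regressions), frozenset(expected), frozenset(now_passing))
```

With `fail_on_fixed=False`, a shard whose only deviation from the baseline is a newly passing test stays green, which matches the escape hatch described in the last bullet.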
   
   ### How it works
   
   1. Runs as a **reusable workflow** (`on: workflow_call`) invoked from 
`velox_backend_x86.yml`, so it **reuses the Velox native libs + Arrow jars that 
workflow already builds** instead of duplicating the expensive native C++ 
build. It then assembles the `gluten-velox-bundle` fat jar (Spark 4.1 + Scala 
2.13 + JDK 17, Delta profile). A `workflow_dispatch` trigger is kept for 
standalone manual runs (which build the native lib themselves).
   2. Clones delta-io/delta at a release tag (currently `v4.2.0`), drops the 
bundle onto the `spark` project's test classpath, patches `DeltaSQLCommandTest` 
to register `GlutenPlugin`, and cherry-picks two merged upstream Delta 
test-only fixes (delta-io/delta#7104 + #7105) that widen `FileSourceScanExec` 
checks to `FileSourceScanLike` so Gluten's transformed plan is recognized.
   3. Runs `sbt spark/test` **sharded by suite** across **4 shards (4 forked 
test JVMs each, ~16-way parallelism)**, with ScalaTest's JUnit XML reporter 
enabled, then gates each shard with `compare-test-results.py` against 
`known-failures.txt`. A final job aggregates all shards into a single 
ready-to-commit baseline and flags stale entries.
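
As a rough sketch of how step 3's gate could read ScalaTest's JUnit XML output, the following uses only the Python standard library. The function name `collect_results` and the `classname::name` test-id format are assumptions for illustration; the real script may identify tests differently.

```python
import xml.etree.ElementTree as ET


def collect_results(xml_text):
    """Return (executed, failed) sets of test ids from one JUnit XML report."""
    root = ET.fromstring(xml_text)
    executed, failed = set(), set()
    # iter() also visits nested elements, so this works whether the root
    # is <testsuite> or a <testsuites> wrapper.
    for case in root.iter("testcase"):
        if case.find("skipped") is not None:
            continue  # skipped/ignored tests count as neither pass nor fail
        # Assumed id format; not taken from the actual script.
        test_id = f"{case.get('classname')}::{case.get('name')}"
        executed.add(test_id)
        if case.find("failure") is not None or case.find("error") is not None:
            failed.add(test_id)
    return executed, failed
```

Treating both `<failure>` and `<error>` children as failures keeps aborted tests from slipping through as passes.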
   
   ### Files
   
   | File | Purpose |
   |---|---|
   | `.github/workflows/velox_backend_x86.yml` | Caller: builds the native lib once, uploads the native + Arrow artifacts, and invokes the reusable Delta workflow (reusing that build instead of duplicating it). |
   | `.github/workflows/delta_spark_ut.yml` | The reusable Delta workflow (build bundle -> shard tests -> gate). |
   | `.github/workflows/util/delta-spark-ut/setup-delta.sh` | Clones Delta, injects the Gluten bundle, patches `DeltaSQLCommandTest`, cherry-picks the upstream test fixes. |
   | `.github/workflows/util/delta-spark-ut/compare-test-results.py` | Parses JUnit XML and enforces / seeds / aggregates against the baseline (stdlib only). |
   | `.github/workflows/util/delta-spark-ut/known-failures.txt` | Committed baseline of currently-expected failures (`#` comments per line). |
   | `.github/workflows/util/delta-spark-ut/README.md` | Documents the gate, bootstrapping, and baseline refresh. |
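
For orientation, here is a hedged sketch of how a baseline file with `#` comments might be loaded and checked for stale entries. The exact file grammar is not spelled out in this description, so the conventions below (whole-line `#` comments, an optional trailing note after ` # `) and the definition of "stale" (a baseline entry that no shard executed) are assumptions, as are the function names.

```python
def load_baseline(text):
    """Parse baseline text into a set of test ids."""
    entries = set()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or whole-line comment
        # Assumed convention: " # " separates the test id from a trailing note.
        test_id = line.split(" # ", 1)[0].strip()
        entries.add(test_id)
    return entries


def stale_entries(baseline, executed_all_shards):
    """Baseline entries that no shard ran (for example renamed or removed tests)."""
    return sorted(set(baseline) - set(executed_all_shards))
```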
   
   ### Operational hardening
   
   - **JDK 17 + Arrow/Netty**: forked test JVMs get the `--add-opens` set plus 
`-Dio.netty.tryReflectionSetAccessible=true` (otherwise Arrow's allocator fails 
to initialize).
   - **Heap tuning**: forked-test heap and the sbt launcher's idle G1 behavior 
are tuned to keep the ~16 GB runner under the cgroup OOM threshold.
   - **Hang watchdog**: a per-shard watchdog dumps threads and kills a forked 
test JVM that has gone silent too long, so a wedged suite can't stall the whole 
job.
   - **DeletionVectorsSuite 2B-row tests**: two tests build/read/delete a 
2-billion-row table and balloon the fork to ~13 GB of native memory (Velox 
row-index materialization), OOM-killing it and hanging the shard. They are 
force-failed (with a clear message) rather than silently ignored, so the gap 
stays visible until the native memory blow-up is fixed.
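
The silence detection behind such a watchdog can be sketched as below. This is not the watchdog shipped in the workflow: the class name `SilenceWatchdog`, the injectable clock, and the timeout semantics are hypothetical, and the thread dump and kill steps are deliberately left to the caller.

```python
import time


class SilenceWatchdog:
    """Tracks how long a monitored process has produced no output."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock            # injectable so the logic is testable
        self._last_output = clock()

    def note_output(self):
        # Call whenever the forked test JVM writes a log line.
        self._last_output = self._clock()

    def silent_for(self):
        return self._clock() - self._last_output

    def expired(self):
        # True once the process has been silent for at least timeout_s;
        # the caller would then dump threads and kill the process.
        return self.silent_for() >= self.timeout_s
```

Using a monotonic clock avoids false alarms when the wall clock is adjusted during a long run.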
   
   ### Scope / known limitations
   
   - Velox backend, x86 only; Delta `v4.2.0` / Spark 4.1 / Scala 2.13 / JDK 17.
   - The baseline reflects the *current* set of known Delta-on-Gluten failures; 
refresh it via a `workflow_dispatch` run with `update_baseline=true`.
   - **Future work -- Delta 4.3.0**: attempted, but the bundle (compiled against Delta 4.1.0) hits a binary-incompatible Delta change (the first parameter of `IdentityColumn.logTableWrite` changed from `Snapshot` to `SnapshotDescriptor`), which causes a `NoSuchMethodError` on every write. Supporting 4.3.0 requires building the bundle against 4.3.0; tracked as follow-up.
   
   ## How was this patch tested?
   
   This change *is* CI. The Delta suite runs as part of `velox_backend_x86.yml` 
-- on every PR/trigger that touches Velox/core/cpp or the Delta CI files -- and 
via manual `workflow_dispatch`. In the latest runs all shards pass against the 
committed baseline (failures limited to known-failures entries; no regressions).
   
   19,073 Delta tests run (18,297 passed / 776 failed).
   
   ### Main failures (776 baseline)
   - 226 tests - Increment Metric: known issue https://github.com/apache/gluten/issues/9003. [Test with increment metric offload disabled](https://github.com/apache/gluten/actions/runs/28226442887/job/83623837492?pr=12380)
   - 99 tests - VariantType: `java.lang.UnsupportedOperationException: Unsupported data type: variant`, thrown by Arrow (`SparkArrowUtil.scala:60`)
   - ~47 tests - `ClassCastException` (`ProjectExec` -> `WholeStageTransformer`) in Delta stats. This will be addressed by https://github.com/apache/gluten/issues/11622#issuecomment-4421317668 (`timestamp -> timestamp_ntz`)
   
   **Fixed since the first draft (#12371):** the 187 `MatchError List()` 
DataSkipping-empty-stats failures (caused by a `FileSourceScanExec` match) were 
fixed by cherry-picking the merged Delta PRs 7104 + 7105 (`FileSourceScanExec` 
-> `FileSourceScanLike`) during test setup. That dropped the baseline from 963 
to **776** known failures (187 now-passing removed, 0 regressions).
   
   ## Delta Spark UT (Gluten) -- shard count vs test parallelism
   
   Sharding is by **suite** (`MurmurHash3(suiteName) % NUM_SHARDS`), so total 
test work is fixed (~1250 fork-minutes). The runners are 4-core / ~16 GB. The 
committed config is **4 shards x 4 forks**.
   
   | Config | Runner jobs | Forks/shard | Max shard | Wall-clock | Billed job-hrs* | Outcome |
   |---|---|---|---|---|---|---|
   | 16 shards x 1 fork | 16 | 1 | ~110 min | ~130 min | ~29 | green |
   | **4 shards x 4 forks** | **4** | **4** | **158 min** | **178 min** | **~10.5** | **green** |
   | 4 shards x 1 fork | 4 | 1 | 360 min (hit cap) | -- | -- | cancelled |
   ## Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: GitHub Copilot CLI

