felipepessoto opened a new pull request, #12371: URL: https://github.com/apache/gluten/pull/12371
Fix https://github.com/apache/gluten/issues/9296.

I wanted to create this PR to start discussing this, so we can get an idea of how it would work, whether it is worthwhile, etc.

## What changes are proposed in this pull request?

Adds a CI pipeline that runs delta-io/delta's `spark` ScalaTest suite against the Gluten Velox bundle, so we can validate Gluten against a real Delta release and catch regressions over time.

Running the Delta UTs on Gluten produces **many expected failures** (Gluten does not yet offload every Delta code path, and falls back or behaves differently in places). A plain "red on any failure" gate would be useless. Instead, the pipeline keeps a **committed baseline of known failures** and gates each run against it:

- **regression** -- a test fails that is *not* in the baseline -> the shard fails.
- **expected** -- a failing test that *is* in the baseline -> ignored.
- **now-passing** -- a baseline test that starts passing -> fails the shard (keeps the baseline honest), unless `fail_on_fixed=false`.

### How it works

1. Builds the Velox/Gluten native libs and assembles the `gluten-velox-bundle` fat jar (Spark 4.1 + Scala 2.13 + JDK 17, Delta profile).
2. Clones delta-io/delta at a release tag (currently `v4.2.0`), drops the bundle onto the `spark` project's test classpath, and patches `DeltaSQLCommandTest` to register `GlutenPlugin`.
3. Runs `sbt spark/test` **sharded by suite** across 16 shards, with ScalaTest's JUnit XML reporter enabled, then gates each shard with `compare-test-results.py` against `known-failures.txt`. A final job aggregates all shards into a single ready-to-commit baseline and flags stale entries.

### Files

| File | Purpose |
|---|---|
| `.github/workflows/delta_spark_ut.yml` | The workflow (build bundle -> shard tests -> gate). |
| `.github/workflows/util/delta-spark-ut/setup-delta.sh` | Clones Delta, injects the Gluten bundle, patches `DeltaSQLCommandTest`. |
| `.github/workflows/util/delta-spark-ut/compare-test-results.py` | Parses JUnit XML and enforces / seeds / aggregates against the baseline (stdlib only). |
| `.github/workflows/util/delta-spark-ut/known-failures.txt` | Committed baseline of currently-expected failures (`<suite>#<test>` per line). |
| `.github/workflows/util/delta-spark-ut/README.md` | Documents the gate, bootstrapping, and baseline refresh. |

### Operational hardening

- **JDK 17 + Arrow/Netty**: forked test JVMs get the `--add-opens` set plus `-Dio.netty.tryReflectionSetAccessible=true` (otherwise Arrow's allocator fails to initialize).
- **Heap tuning**: forked-test heap and the sbt launcher's idle G1 behavior are tuned to keep the ~16 GB runner under the cgroup OOM threshold.
- **Hang watchdog**: a per-shard watchdog dumps threads and kills a forked test JVM that has gone silent too long, so a wedged suite can't stall the whole job.
- **DeletionVectorsSuite 2B-row tests**: two tests build/read/delete a 2-billion-row table and balloon the fork to ~13 GB of native memory (Velox row-index materialization), OOM-killing it and hanging the shard. They are force-failed (with a clear message) rather than silently ignored, so the gap stays visible until the native memory blow-up is fixed.

### Scope / known limitations

- Velox backend, x86 only; Delta `v4.2.0` / Spark 4.1 / Scala 2.13 / JDK 17.
- The baseline reflects the *current* set of known Delta-on-Gluten failures; refresh it via a `workflow_dispatch` run with `update_baseline=true`.
- **Future work -- Delta 4.3.0**: attempted, but the bundle (compiled against Delta 4.1.0) hits a binary-incompatible Delta change (`IdentityColumn.logTableWrite` first param `Snapshot` -> `SnapshotDescriptor`), which throws `NoSuchMethodError` on every write. Supporting 4.3.0 needs the bundle built against 4.3.0; tracked as follow-up.

## How was this patch tested?

This change *is* CI. The workflow runs automatically on PRs that touch its files and via manual dispatch.
In the latest runs all 16 shards pass against the committed baseline (failures limited to known-failures entries; no regressions).

## Delta Spark UT (Gluten): shard count vs test parallelism

Sharding is by **suite** (`MurmurHash3(suiteName) % NUM_SHARDS`), so total test work is fixed (~1250 fork-minutes). The runners are 4-core / ~16 GB.

| Config | Runner jobs | Forks/shard | Max shard | Wall-clock | Billed job-hrs* | Outcome |
|---|---|---|---|---|---|---|
| 16 shards × 1 fork | 16 | 1 | ~110 min | ~130 min | ~29 | ✅ green |
| **4 shards × 4 forks** | **4** | **4** | **158 min** | **178 min** | **~10.5** | **✅ green** |
| 4 shards × 1 fork | 4 | 1 | 360 min (hit cap) | n/a | n/a | ❌ cancelled |

## Was this patch authored or co-authored using generative AI tooling?

Generated-by: GitHub Copilot CLI
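
For discussion, the three gate outcomes listed under "What changes are proposed" (regression, expected, now-passing) can be sketched with plain set operations. This is only an illustration and not the code in `compare-test-results.py`: the names `GateResult`, `classify`, and `shard_fails` are hypothetical, and the sketch assumes each test is already keyed as `<suite>#<test>`, the format used by `known-failures.txt`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    """Outcome of comparing one shard's results with the baseline (hypothetical type)."""
    regressions: frozenset   # failed in this run, not listed in the baseline
    expected: frozenset      # failed in this run, listed in the baseline
    now_passing: frozenset   # listed in the baseline, passed in this run

    def shard_fails(self, fail_on_fixed: bool = True) -> bool:
        # Any regression fails the shard. A fixed test fails it too, unless
        # the caller opts out (mirrors the fail_on_fixed=false input).
        return bool(self.regressions) or (fail_on_fixed and bool(self.now_passing))


def classify(failed, passed, baseline) -> GateResult:
    """Classify '<suite>#<test>' keys from one shard against the baseline."""
    failed, passed, baseline = frozenset(failed), frozenset(passed), frozenset(baseline)
    return GateResult(
        regressions=failed - baseline,
        expected=failed & baseline,
        # Only baseline entries that actually ran and passed in this shard
        # count; entries owned by other shards are simply absent here.
        now_passing=baseline & passed,
    )


# Made-up test keys, for illustration only.
result = classify(
    failed={"ASuite#writes", "ASuite#reads"},
    passed={"BSuite#merge"},
    baseline={"ASuite#writes", "BSuite#merge", "CSuite#other-shard"},
)
print(sorted(result.regressions), sorted(result.expected), sorted(result.now_passing))
```

Intersecting the baseline with the tests that passed, rather than subtracting the failures from the baseline, keeps a shard from flagging baseline entries that belong to suites it never ran.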

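The suite-level sharding rule (`MurmurHash3(suiteName) % NUM_SHARDS`) can be illustrated the same way. The sketch below is an assumption-laden stand-in: it uses `zlib.crc32` instead of MurmurHash3 purely because it ships with Python and is stable across processes, so the shard numbers it produces will not match the real workflow. The function names `shard_for` and `partition`, and every suite name except `DeletionVectorsSuite`, are hypothetical.

```python
import zlib
from collections import defaultdict


def shard_for(suite_name: str, num_shards: int) -> int:
    """Map a suite name to a shard index in [0, num_shards).

    zlib.crc32 stands in for MurmurHash3 here. Python's built-in hash() is
    unsuitable because it is randomized per process for strings.
    """
    return zlib.crc32(suite_name.encode("utf-8")) % num_shards


def partition(suites, num_shards: int) -> dict:
    """Group suite names by shard; every suite lands in exactly one shard."""
    shards = defaultdict(list)
    for name in suites:
        shards[shard_for(name, num_shards)].append(name)
    return dict(shards)


# Illustrative suite names; only DeletionVectorsSuite appears in the PR text.
suites = ["DeletionVectorsSuite", "ExampleSuiteA", "ExampleSuiteB", "ExampleSuiteC"]
for shard, names in sorted(partition(suites, num_shards=16).items()):
    print(shard, names)
```

Because the assignment depends only on the suite name, changing the shard count redistributes suites but leaves the total amount of test work unchanged, which is the premise of the shard-count comparison table.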