brijrajk opened a new issue, #12375: URL: https://github.com/apache/gluten/issues/12375
## Summary

`GlutenTPCHPlanStabilitySuite` → `tpch/q19` fails in `spark-test-spark40` CI for any PR that touches Velox backend Scala files. The failure is caused by a stale golden file combined with a known limitation in the ExprId normalizer.

## Affected check

`spark-test-spark40` (and `spark-test-spark41`)

## Root cause

`GlutenPlanStabilitySuite.glutenNormalizeIds()` uses the regex `(?<prefix>(?<!id=)#)\\d+L?` which matches **any** `#<number>` in the explain text — including TPC-H string literals.

The `p_brand` filter in q19 uses values `Brand#11`, `Brand#12`, `Brand#13` (actual TPC-H spec data values). These appear unquoted in the explain output:

```
EqualTo(p_brand, Brand#12)
```

The normalizer incorrectly treats `#12` as an ExprId and remaps it sequentially based on encounter order.

The suite code itself documents this limitation at line 67–68:

> *"Running all suites together in one JVM is recommended to avoid ExprId normalization issues where string constants (e.g., Brand#23 in TPCH q19) may collide with ExprId numbers."*

## How it manifests

The golden file was committed in #11805 (`c37fee4e5`, 2026-03-24). Over the 264 commits since then, new optimizer rules and expressions shifted the ExprId counter. `Brand#12` now normalizes to `Brand#6` and `_pre_1#14` shifts to `_pre_1#13`, causing a spurious mismatch.

Reproduced on `main` at commit `6097b59a6` (2026-06-25) without any pending PR:

```
Tests: succeeded 21, failed 1  ← tpch/q19
BUILD FAILURE
```

## PRs affected

- #12151 — [GLUTEN-12013][VL] Fix bloom-filter bytes corruption on whole-stage AQE fallback
- #12095 — [GLUTEN-12094][VL] Strip default comparator from array_sort for Velox offloading
- #12056 — [GLUTEN-11921] Enable Parquet read/write test for NullType

## Short-term fix

Refresh `q19/explain.txt` via `SPARK_GENERATE_GOLDEN_FILES=1` — tracked in #12374.

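
The root cause described in this issue can be reproduced outside the test suite with a few lines of Java, since the quoted regex is plain `java.util.regex` syntax. This is a minimal sketch, not the Gluten implementation: the `normalize` method is a hypothetical stand-in that assumes each distinct match is renumbered in order of first appearance, and the two plan strings are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NormalizeIdsDemo {
    // Regex quoted in this issue: any "#<number>" not preceded by "id=".
    static final Pattern EXPR_ID = Pattern.compile("(?<prefix>(?<!id=)#)\\d+L?");

    // Hypothetical stand-in for the normalizer (assumption, not Gluten code):
    // each distinct match becomes "#1", "#2", ... in order of first appearance.
    static String normalize(String plan) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        Matcher m = EXPR_ID.matcher(plan);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String key = m.group();
            if (!ids.containsKey(key)) {
                ids.put(key, ids.size() + 1);
            }
            m.appendReplacement(out, m.group("prefix") + ids.get(key));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Same logical plan twice; in the second run every real ExprId is shifted by one.
        String before = "Project [p_partkey#12], EqualTo(p_brand#20, Brand#12)";
        String after  = "Project [p_partkey#13], EqualTo(p_brand#21, Brand#12)";
        System.out.println(normalize(before)); // literal collides with ExprId #12
        System.out.println(normalize(after));  // literal gets its own number
    }
}
```

Under these assumptions the first plan normalizes to `... Brand#1)` and the second to `... Brand#3)`, so two semantically identical plans produce different text once the ExprId counter shifts, which is the kind of spurious mismatch reported for q19.
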
## Long-term fix

Make `glutenNormalizeIds` skip `#N` patterns that appear inside string literal contexts (i.e., where the `#` is preceded by non-whitespace word characters that are not a column/expression name). This would prevent TPC-H brand values like `Brand#12` from being incorrectly normalized.
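
One possible shape for such a fix is sketched below in Java. It is only an illustration under stated assumptions: it adds a second negative lookbehind that skips a `#` directly after the standalone word `Brand`, which is a hypothetical allowlist of one literal prefix rather than a general way to tell literals from column names. The `normalize` method is again a hypothetical stand-in that renumbers matches in order of first appearance.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LiteralSafeNormalizer {
    // Assumption: the extra lookbehind (?<!\bBrand) is NOT in Gluten; it leaves
    // "#N" untouched when it directly follows the standalone word "Brand".
    static final Pattern EXPR_ID =
        Pattern.compile("(?<prefix>(?<!id=)(?<!\\bBrand)#)\\d+L?");

    // Hypothetical stand-in: renumber each distinct match in encounter order.
    static String normalize(String plan) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        Matcher m = EXPR_ID.matcher(plan);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String key = m.group();
            if (!ids.containsKey(key)) {
                ids.put(key, ids.size() + 1);
            }
            m.appendReplacement(out, m.group("prefix") + ids.get(key));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up plans: identical except that real ExprIds are shifted by one.
        System.out.println(normalize("Project [p_partkey#12], EqualTo(p_brand#20, Brand#12)"));
        System.out.println(normalize("Project [p_partkey#13], EqualTo(p_brand#21, Brand#12)"));
    }
}
```

With this pattern both made-up plans normalize to the same text and `Brand#12` is preserved. The trade-off is that an attribute literally named `Brand` would no longer be normalized, so a real fix would likely need more context than a fixed prefix.
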
