yaooqinn opened a new pull request, #56096: URL: https://github.com/apache/spark/pull/56096
### What changes were proposed in this pull request?

Add a test-only visibility golden suite for ICU collation sort keys:

- New test: `sql/core/src/test/scala/org/apache/spark/sql/ICUCollationSortKeyGoldenSuite.scala`
- New golden: `sql/core/src/test/resources/collations/ICU-collations-sort-keys.md` (38 cells, ~1900 bytes)

The suite snapshots `(collation, input) -> hex(CollationKey)` for 14 dimensions covering the ICU surface Spark uses: UCA primary / tertiary case / secondary diacritic; NFC vs NFD canonical equivalence; combining-mark reorder visibility; SMP surrogate path; BMP precomposed Hangul; ASCII punct / space at primary; Turkish locale tailoring (en_USA + tr); CJK Han implicit weighting; empty string boundary; U+FFFD; C0 controls; variation selectors.

Each test asserts a contract on the recorded bytes: row existence, non-empty hex, level segmentation for NON_IGNORABLE alternate handling, prefix-share invariants for Turkish tailoring, and the ICU compressed-sortkey lead byte for CJK implicit weights. Drift-prone dims fire with named-condition messages if Spark's ICU configuration or library version changes the semantics; stable dims fire if a regression silently drops or folds a cell.

The pattern mirrors `ICUCollationsMapSuite` (which lists the ICU locale surface) and is scoped to ICU-backed collations only. `UTF8_LCASE` is out of scope -- it does not go through `com.ibm.icu.text.Collator.getCollationKey()` and is already covered by `CollationFactorySuite`.

### Why are the changes needed?

icu4j upgrades can silently change `ORDER BY ... COLLATE` semantics across Spark versions. Past upgrade PRs (e.g. SPARK-50189, SPARK-52038, SPARK-54447, SPARK-55308, SPARK-56397) touch only the dependency file and benchmark results -- they ship no byte-level regression test on sort output, so a CLDR re-weighting can land in master without any reviewer signal.

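For readers unfamiliar with what a recorded `(collation, input) -> hex(CollationKey)` cell looks like, here is a minimal sketch. It is an illustration only, not the suite's code: it uses the JDK's `java.text.Collator` as a stand-in for `com.ibm.icu.text.Collator` so that it runs without extra dependencies, which means its bytes will not match the golden file. The class name `SortKeyHexSketch` and the helpers `hexKey` and `collator` are hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

// Illustrative only: java.text.Collator stands in for com.ibm.icu.text.Collator,
// so the bytes printed here differ from what an ICU-backed golden file records.
final class SortKeyHexSketch {

    // Hypothetical helper: render the collation key of `input` as upper-case hex.
    static String hexKey(Collator collator, String input) {
        byte[] key = collator.getCollationKey(input).toByteArray();
        StringBuilder sb = new StringBuilder(key.length * 2);
        for (byte b : key) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // Hypothetical helper: a root-locale collator at the requested strength.
    static Collator collator(int strength) {
        Collator c = Collator.getInstance(Locale.ROOT);
        c.setStrength(strength);
        return c;
    }

    public static void main(String[] args) {
        Collator primary = collator(Collator.PRIMARY);
        Collator tertiary = collator(Collator.TERTIARY);
        // One "cell" per (strength, input) pair, analogous to (collation, input).
        for (String s : new String[] {"a", "A"}) {
            System.out.println("PRIMARY  '" + s + "' -> " + hexKey(primary, s));
            System.out.println("TERTIARY '" + s + "' -> " + hexKey(tertiary, s));
        }
    }
}
```

At primary strength `a` and `A` should share a key, while at tertiary strength the keys should differ; a library change that altered either property would surface as a changed hex string, which is the kind of diff the golden file is meant to expose.
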
This suite makes such drift visible during ICU upgrade review: any change to the recorded bytes shows up as a golden diff that a reviewer must explicitly accept. It is **not** a stability contract -- the disclaimer at golden line 1, the `GOLDEN_DISCLAIMER` constant (and the line-1 assert that pins it), and the suite scaladoc all state that downstream consumers MUST NOT rely on byte equality across Spark versions. The file is a review-trigger snapshot, nothing more.

Reviewer note: when this golden file changes on a PR that does not bump `icu4j`, please request a revert -- regeneration belongs in the ICU upgrade PR.

### Does this PR introduce _any_ user-facing change?

No. Test-only; no SQLConf, no public API, no production code path.

### How was this patch tested?

- New suite `ICUCollationSortKeyGoldenSuite` (16 tests). Locally, 16/16 pass deterministically across 2 rounds on master.
- Regenerate the golden with `SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.ICUCollationSortKeyGoldenSuite"`; the suite enforces idempotency and that on-disk bytes match the regen output.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7
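
The regenerate-or-compare flow mentioned under "How was this patch tested?" can be sketched roughly as follows. This is a hedged illustration of the general golden-file pattern, assuming Java 11+ for `Files.writeString` / `Files.readString`; it is not the suite's actual implementation, and `GoldenCheckSketch`, `check`, and the placeholder table content are hypothetical. Only the `SPARK_GENERATE_GOLDEN_FILES` variable name comes from the description above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of a regenerate-or-compare golden check; not the suite's code.
final class GoldenCheckSketch {

    // Returns true if the golden file was (re)written, false if it was only compared.
    // Throws AssertionError when comparing and the on-disk content differs.
    static boolean check(Path golden, String rendered, boolean regenerate) {
        try {
            if (regenerate) {
                Files.writeString(golden, rendered, StandardCharsets.UTF_8);
                return true;
            }
            String onDisk = Files.readString(golden, StandardCharsets.UTF_8);
            if (!onDisk.equals(rendered)) {
                throw new AssertionError("Golden drift detected in " + golden);
            }
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Same switch the regeneration command sets in the environment.
        boolean regenerate = "1".equals(System.getenv("SPARK_GENERATE_GOLDEN_FILES"));
        Path golden = Files.createTempFile("sort-keys", ".md");
        String rendered = "| collation | input | hex |\n"; // placeholder content
        check(golden, rendered, true);        // seed the temporary golden file
        check(golden, rendered, regenerate);  // rewrite or compare, depending on the switch
        System.out.println("golden check finished");
    }
}
```

Writing the same rendered content twice and then comparing mirrors the idempotency property described above: a second regeneration should leave the on-disk bytes unchanged, so any diff points at a real change in the generated keys.
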
