yaooqinn opened a new pull request, #56096:
URL: https://github.com/apache/spark/pull/56096

   ### What changes were proposed in this pull request?
   
   Add a test-only visibility golden suite for ICU collation sort keys:
   
   - New test: 
`sql/core/src/test/scala/org/apache/spark/sql/ICUCollationSortKeyGoldenSuite.scala`
   - New golden: 
`sql/core/src/test/resources/collations/ICU-collations-sort-keys.md` (38 cells, 
~1900 bytes)
   
   The suite snapshots `(collation, input) -> hex(CollationKey)` for 14 
dimensions covering the ICU surface Spark uses: UCA primary / tertiary case / 
secondary diacritic; NFC vs NFD canonical equivalence; combining-mark reorder 
visibility; SMP surrogate path; BMP precomposed Hangul; ASCII punct / space at 
primary; Turkish locale tailoring (en_USA + tr); CJK Han implicit weighting; 
empty string boundary; U+FFFD; C0 controls; variation selectors.
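
   A minimal sketch of how one such cell could be computed, assuming a plain ICU4J `Collator` built directly from a locale ID and icu4j on the classpath. The object name `SortKeyHex` and the locale strings are illustrative only; the suite may obtain its collator through Spark's own collation machinery with different settings, so the bytes printed here need not match the golden.

```scala
import com.ibm.icu.text.Collator
import com.ibm.icu.util.ULocale

// Hypothetical helper, not the suite's actual code.
object SortKeyHex {
  /** Upper-case hex of the ICU sort key for `input` under `localeId`. */
  def sortKeyHex(localeId: String, input: String): String = {
    // Assumption: default collator attributes for the locale.
    val collator = Collator.getInstance(new ULocale(localeId))
    collator
      .getCollationKey(input)
      .toByteArray
      .map(b => f"${b & 0xff}%02X") // mask to 0..255 before formatting
      .mkString
  }
}
```

   For example, `SortKeyHex.sortKeyHex("tr", "i")` and `SortKeyHex.sortKeyHex("en", "i")` can be compared side by side to see whether the Turkish tailoring changes the recorded bytes.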
   
   Each test asserts a contract on the recorded bytes: row existence, non-empty 
hex, level segmentation for NON_IGNORABLE alternate handling, prefix-share 
invariants for Turkish tailoring, and the ICU compressed-sortkey lead byte for 
CJK implicit weights. Drift-prone dims fire with named-condition messages if 
Spark's ICU configuration or library version changes the semantics; stable dims 
fire if a regression silently drops or folds a cell.
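
   The level-segmentation contract can be illustrated with a small sketch. It assumes the usual ICU sort-key layout, in which strength levels are separated by byte `0x01` and the key ends with a single `0x00` terminator; `SortKeyLevels` is a hypothetical name, and the hex string in the usage note is made up for illustration rather than taken from the golden.

```scala
// Hypothetical helper, not the suite's actual code.
object SortKeyLevels {
  /** Decodes an even-length hex string into bytes. */
  def hexToBytes(hex: String): Array[Byte] = {
    require(hex.length % 2 == 0, s"odd-length hex: $hex")
    hex.grouped(2).map(Integer.parseInt(_, 16).toByte).toArray
  }

  /** Returns one byte sequence per level, without separators or terminator. */
  def levels(hex: String): Seq[Seq[Byte]] = {
    val bytes = hexToBytes(hex)
    // Assumption: a trailing 0x00 is the key terminator, so drop it.
    val body = if (bytes.nonEmpty && bytes.last == 0) bytes.init else bytes
    // Assumption: 0x01 only ever appears as the level separator.
    body.foldLeft(Vector(Vector.empty[Byte])) { (acc, b) =>
      if (b == 1) acc :+ Vector.empty[Byte]
      else acc.init :+ (acc.last :+ b)
    }
  }
}
```

   With the made-up key `2A0105010500`, `SortKeyLevels.levels` yields three one-byte levels, which is the shape a primary / secondary / tertiary assertion would check.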
   
   The pattern mirrors `ICUCollationsMapSuite` (which lists the ICU locale 
surface) and is scoped to ICU-backed collations only. `UTF8_LCASE` is out of 
scope -- it does not go through `com.ibm.icu.text.Collator.getCollationKey()` 
and is already covered by `CollationFactorySuite`.
   
   ### Why are the changes needed?
   
   icu4j upgrades can silently change `ORDER BY ... COLLATE` semantics across Spark 
versions. Past upgrade PRs (e.g. SPARK-50189, SPARK-52038, SPARK-54447, 
SPARK-55308, SPARK-56397) touch only the dependency file and benchmark results 
-- they ship no byte-level regression test on sort output, so a CLDR re-weighting 
can land in master without any reviewer signal.
   
   This suite makes such drift visible during ICU upgrade review: any change to 
the recorded bytes shows up as a golden diff that a reviewer must explicitly 
accept. It is **not** a stability contract -- the disclaimer at golden line 1, 
the `GOLDEN_DISCLAIMER` constant (and the line-1 assert that pins it), and the 
suite scaladoc all state that downstream consumers MUST NOT rely on byte 
equality across Spark versions. The file is a review-trigger snapshot, nothing 
more.
   
   Reviewer note: when this golden file changes on a PR that does not bump 
`icu4j`, please request a revert -- regeneration belongs in the ICU upgrade PR.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. Test-only; no SQLConf, no public API, no production code path.
   
   ### How was this patch tested?
   
   - New suite `ICUCollationSortKeyGoldenSuite` (16 tests). All 16 pass locally on 
master, with identical results across two consecutive runs.
   - Regenerate the golden with `SPARK_GENERATE_GOLDEN_FILES=1 build/sbt 
"sql/testOnly org.apache.spark.sql.ICUCollationSortKeyGoldenSuite"`; the suite 
enforces idempotency and that on-disk bytes match the regen output.
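
   The regenerate-or-compare behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the suite's code: `GoldenCheck`, `checkOrRegenerate`, and the `render` callback are hypothetical names, and only the `SPARK_GENERATE_GOLDEN_FILES=1` switch comes from the command above.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path}

// Hypothetical helper, not the suite's actual code.
object GoldenCheck {
  /** True when the regeneration switch from the sbt command is set. */
  def regenerateRequested: Boolean =
    sys.env.get("SPARK_GENERATE_GOLDEN_FILES").contains("1")

  /** Rewrites the golden when asked, otherwise compares it to fresh output. */
  def checkOrRegenerate(golden: Path, render: () => String, regenerate: Boolean): Unit = {
    val first = render()
    // Idempotency: rendering twice must give identical text.
    assert(first == render(), "golden rendering is not deterministic")
    if (regenerate) {
      Files.write(golden, first.getBytes(UTF_8))
    } else {
      val onDisk = new String(Files.readAllBytes(golden), UTF_8)
      assert(onDisk == first, s"golden file $golden differs from regenerated output")
    }
  }
}
```

   A test would call `GoldenCheck.checkOrRegenerate(path, render, GoldenCheck.regenerateRequested)`, so a plain run fails on any byte difference and only an explicit regeneration updates the file.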
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Opus 4.7
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

