Kent Yao created SPARK-57029:
--------------------------------
Summary: Add byte-level golden file for ICU sort keys to detect
collation regressions on ICU upgrades
Key: SPARK-57029
URL: https://issues.apache.org/jira/browse/SPARK-57029
Project: Spark
Issue Type: Improvement
Components: SQL, Tests
Affects Versions: 5.0.0
Reporter: Kent Yao
Assignee: Kent Yao
h3. Motivation
ICU upgrade PRs (SPARK-50189 76.1 / SPARK-52038 77.1 / SPARK-54447 78.1
/ SPARK-55308 78.2 / SPARK-56397 78.3) currently only touch the
dependency file and {{ICUCollationsMapSuite}}. Because no snapshot flags
them for review, sort-key byte changes between ICU versions go undetected.
Spark relies on ICU4J's {{Collator.getCollationKey(...).toByteArray()}}
to compute binary-comparable sort keys for ICU collations. These byte
sequences are implicitly versioned by ICU: a library upgrade can change
the bytes for the same locale + input without any test catching it.
h3. Proposal
Add a small golden file that snapshots the byte-level ICU sort keys for
a representative matrix of (locale x case/accent sensitivity x Unicode
input). A new test suite reads the markdown file and asserts byte-equality
against ICU's runtime output. When ICU is upgraded, regenerating the
golden file will make the diff loudly visible in code review.
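A minimal sketch of the golden-file comparison described above, not the code from the actual PR. To run on a plain JDK it uses {{java.text.Collator}} as a stand-in for ICU4J's {{com.ibm.icu.text.Collator}}; both expose {{getCollationKey(...).toByteArray()}}. The class name, the helper names, and the {{| locale | input | hex |}} row format are assumptions made for illustration only.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch only: java.text.Collator stands in for ICU4J's com.ibm.icu.text.Collator
// so that the example runs on a plain JDK. The row format is hypothetical.
final class SortKeyGoldenSketch {

    // Renders one golden row: | locale | input | sort-key bytes as hex |
    static String goldenRow(String localeTag, String input) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag(localeTag));
        collator.setStrength(Collator.TERTIARY); // case- and accent-sensitive
        byte[] key = collator.getCollationKey(input).toByteArray();
        StringBuilder hex = new StringBuilder();
        for (byte b : key) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return "| " + localeTag + " | " + input + " | " + hex + " |";
    }

    // Recomputes every recorded row and returns those whose bytes no longer match.
    static List<String> findMismatches(List<String> recordedRows) {
        List<String> mismatches = new ArrayList<>();
        for (String row : recordedRows) {
            // cells[0] is the empty text before the first pipe.
            String[] cells = row.split("\\|");
            String actual = goldenRow(cells[1].trim(), cells[2].trim());
            if (!actual.equals(row)) {
                mismatches.add(row + " -> " + actual);
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // "Regenerate" two rows, then verify them against the same runtime.
        List<String> recorded = new ArrayList<>();
        recorded.add(goldenRow("en", "abc"));
        recorded.add(goldenRow("de", "Stra\u00dfe"));
        System.out.println(recorded.get(0));
        System.out.println(findMismatches(recorded).size() + " mismatching rows");
    }
}
```

Rows recorded under one collator version and checked under another would show up in the list returned by {{findMismatches}}, which is the divergence the proposal wants surfaced in review. The simple pipe-splitting parser assumes inputs contain no {{|}} and no leading or trailing spaces; a real matrix of Unicode inputs would likely need escaping.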
h3. Scope
Single PR. The skeleton commit (P1a) wires up the suite with a
disclaimer-only golden file. Follow-up commits in the same PR fill in:
(P1b) the actual cell matrix, (P1c) the regenerator, (P1d) the CI hook,
(P1e) the migration-guide note.
h3. Non-goals
* Not a stability contract — the golden file *surfaces* divergence;
reviewers still decide whether to accept the new bytes.
* No new SQLConf or user-visible runtime behavior.
* No change to CollationKey semantics or ICU version.
h3. Verification
* New test suite {{o.a.s.sql.ICUCollationSortKeyGoldenSuite}} under
{{sql/core/src/test/scala/}}.
* Passes locally (44 s with a warm sbt, 27 ms for the test itself).
* Tests-only change; zero production code touched.