[ 
https://issues.apache.org/jira/browse/SPARK-57029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-57029:
-----------------------------
    Labels: Correctness collation correctness pull-request-available testing  
(was: collation pull-request-available testing)

> Add byte-level golden file for ICU sort keys to detect collation regressions 
> on ICU upgrades
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-57029
>                 URL: https://issues.apache.org/jira/browse/SPARK-57029
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Tests
>    Affects Versions: 4.2.0
>            Reporter: Kent Yao
>            Assignee: Kent Yao
>            Priority: Blocker
>              Labels: Correctness, collation, correctness, 
> pull-request-available, testing
>
> h3. Motivation
> ICU upgrade PRs (SPARK-50189 76.1 / SPARK-52038 77.1 / SPARK-54447 78.1
> / SPARK-55308 78.2 / SPARK-56397 78.3) currently only touch the
> dependency file and {{ICUCollationsMapSuite}}. With no snapshot to
> trigger a review, sort-key byte changes between ICU versions go undetected.
> Spark relies on ICU4J's {{Collator.getCollationKey(...).toByteArray()}}
> to compute binary-comparable sort keys for ICU collations. These byte
> sequences are silently versioned by ICU: an ICU library upgrade can
> change the bytes for the same locale + input without any test catching it.
> h3. Proposal
> Add a small golden file that snapshots the byte-level ICU sort keys for
> a representative matrix of (locale x case/accent sensitivity x Unicode
> input). A new test suite reads the markdown file and asserts byte-equality
> against ICU's runtime output. When ICU is upgraded, regenerating the
> golden file will make the diff loudly visible in code review.
> h3. Scope
> Single PR. The skeleton commit (P1a) wires up the suite with a
> disclaimer-only golden file. Follow-up commits in the same PR fill in:
> (P1b) the actual cell matrix, (P1c) the regenerator, (P1d) the CI hook,
> (P1e) the migration-guide note.
> h3. Non-goals
> * Not a stability contract — the golden file *surfaces* divergence;
>   reviewers still decide whether to accept the new bytes.
> * No new SQLConf or user-visible runtime behavior.
> * No change to CollationKey semantics or ICU version.
> h3. Verification
> * New test suite {{o.a.s.sql.ICUCollationSortKeyGoldenSuite}} under
>   {{sql/core/src/test/scala/}}.
> * Passes locally (44s with sbt warm, 27ms for the test).
> * Tests-only change; zero production code touched.
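
The snapshot-and-compare idea in the proposal can be sketched roughly as below. This is only an illustration under stated assumptions, not the suite's actual implementation: it uses the JDK's {{java.text.Collator}} as a stand-in for ICU4J's {{com.ibm.icu.text.Collator}} (both expose {{getCollationKey(...).toByteArray()}}), and the class name, method names, and the {{locale|strength|input}} cell-key format are all hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: java.text.Collator stands in for ICU4J's Collator.
public class SortKeyGoldenSketch {

    /** Hex-encode sort-key bytes so they can be stored and diffed as text. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Compute the sort key for one (locale, strength, input) cell. */
    static String sortKeyHex(Locale locale, int strength, String input) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);
        return toHex(collator.getCollationKey(input).toByteArray());
    }

    /** Snapshot every cell of the matrix, keyed by "locale|strength|input". */
    static Map<String, String> snapshot(
            List<Locale> locales, int[] strengths, List<String> inputs) {
        Map<String, String> cells = new LinkedHashMap<>();
        for (Locale locale : locales) {
            for (int strength : strengths) {
                for (String input : inputs) {
                    String key = locale.toLanguageTag() + "|" + strength + "|" + input;
                    cells.put(key, sortKeyHex(locale, strength, input));
                }
            }
        }
        return cells;
    }

    /** List every golden cell whose runtime bytes differ or are missing. */
    static List<String> diff(Map<String, String> golden, Map<String, String> actual) {
        List<String> mismatches = new ArrayList<>();
        for (Map.Entry<String, String> e : golden.entrySet()) {
            String got = actual.get(e.getKey());
            if (!e.getValue().equals(got)) {
                mismatches.add(e.getKey() + ": expected " + e.getValue() + ", got " + got);
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        List<Locale> locales = Arrays.asList(Locale.ROOT, Locale.forLanguageTag("de"));
        int[] strengths = {Collator.PRIMARY, Collator.TERTIARY};
        List<String> inputs = Arrays.asList("a", "A", "\u00e4");

        // In a real suite the golden cells would be parsed from the checked-in
        // file; here the runtime snapshot is printed in place of that file.
        Map<String, String> runtime = snapshot(locales, strengths, inputs);
        runtime.forEach((cell, hex) -> System.out.println(cell + " -> " + hex));
    }
}
```

With this shape, a library upgrade that changes any sort key shows up as a non-empty {{diff}} result, and regenerating the stored snapshot turns that change into a reviewable text diff.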


