dejankrak-db opened a new pull request, #56236: URL: https://github.com/apache/spark/pull/56236
### What changes were proposed in this pull request?

Use thread-local `Collator` instances in `CollationSpecICU.buildCollation()` to eliminate lock contention on ICU's `RuleBasedCollator`.

A frozen `RuleBasedCollator` serializes all threads through a `ReentrantLock` on its internal collation buffer (used by `getCollationKey`/`compare`), which causes a significant loss of parallelism when many threads compare or hash collated strings concurrently. By creating independent per-thread instances via `Collator.getInstance()`, each thread operates on its own buffer without locking. Each instance is still frozen as a mutation guard.

The `Collation.getCollator()` accessor now returns the current thread's instance (or `null` for non-ICU collations).

### Why are the changes needed?

To remove a concurrency bottleneck when comparing or hashing collated columns under parallel access.

### Does this PR introduce _any_ user-facing change?

No. This is purely a concurrency optimization; collation results are identical.

### How was this patch tested?

Added a concurrent test in `CollationFactorySuite` that verifies `comparator`, `hashFunction`, and `getCollator()` produce consistent results under parallel access across `UNICODE`, `en`, `de`, `en_CI`, and `en_AI` collations. Existing `CollationFactorySuite` tests continue to pass.

### Was this patch authored or co-authored using generative AI tooling?

Yes, co-authored using Claude Code.
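
For illustration, the thread-local pattern described under "What changes were proposed" can be sketched roughly as follows. This is not the code in the patch. To stay runnable with only the JDK, it uses `java.text.Collator` as a stand-in for ICU's `com.ibm.icu.text.Collator`, so there is no `freeze()` call. The class name `ThreadLocalCollatorSketch` and all of its methods are hypothetical names chosen for this example.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: one collator per thread instead of one shared, locked instance.
final class ThreadLocalCollatorSketch {
    private final ThreadLocal<Collator> perThread;

    ThreadLocalCollatorSketch(Locale locale, int strength) {
        // The initializer runs once per thread, on that thread's first access.
        this.perThread = ThreadLocal.withInitial(() -> {
            Collator c = Collator.getInstance(locale);
            c.setStrength(strength);
            return c; // the ICU version in the PR would additionally freeze the instance here
        });
    }

    // Assumed stand-in for a case-insensitive English collation.
    static ThreadLocalCollatorSketch englishCaseInsensitive() {
        return new ThreadLocalCollatorSketch(Locale.ENGLISH, Collator.PRIMARY);
    }

    // Returns the calling thread's own instance.
    Collator getCollator() {
        return perThread.get();
    }

    int compare(String a, String b) {
        return perThread.get().compare(a, b);
    }

    // Hashes the collation key bytes, so strings that collate as equal hash alike.
    int hash(String s) {
        return Arrays.hashCode(perThread.get().getCollationKey(s).toByteArray());
    }

    // Fetches the collator from a fresh thread, to show it is a different object.
    Collator collatorSeenByNewThread() {
        Collator[] seen = new Collator[1];
        Thread t = new Thread(() -> seen[0] = getCollator());
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return seen[0];
    }

    // Checks that every worker thread agrees with the calling thread's results.
    boolean consistentAcrossThreads(int threads, String a, String b) {
        int expectedSign = Integer.signum(compare(a, b));
        int expectedHash = hash(a);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                results.add(pool.submit(() ->
                    Integer.signum(compare(a, b)) == expectedSign && hash(a) == expectedHash));
            }
            for (Future<Boolean> f : results) {
                if (!f.get()) {
                    return false;
                }
            }
            return true;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        ThreadLocalCollatorSketch sketch = englishCaseInsensitive();
        System.out.println("compare(abc, ABC) = " + sketch.compare("abc", "ABC"));
        System.out.println("consistent across 8 threads: "
            + sketch.consistentAcrossThreads(8, "apple", "Banana"));
    }
}
```

The trade-off in this kind of design is memory for parallelism: each thread that touches the collation holds its own collator, and in exchange no thread waits on a lock shared with the others.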
