[PR] [SPARK-57185][SQL] Use thread-local ICU collators to fix lock contention in CollationFactory [spark]

via GitHub Sun, 31 May 2026 13:43:06 -0700


dejankrak-db opened a new pull request, #56236:
URL: https://github.com/apache/spark/pull/56236


   ### What changes were proposed in this pull request?
   
   Use thread-local `Collator` instances in `CollationSpecICU.buildCollation()` 
to eliminate lock contention on ICU's `RuleBasedCollator`. A frozen 
`RuleBasedCollator` serializes all threads through a `ReentrantLock` that 
guards its internal collation buffer (used by `getCollationKey`/`compare`), 
which significantly reduces parallelism when many threads compare or hash 
collated strings concurrently.
   
   Creating independent per-thread instances via `Collator.getInstance()` lets 
each thread operate on its own buffer without contending with other threads 
for a shared lock. Each instance is still frozen as a guard against mutation. 
The `Collation.getCollator()` accessor now returns the current thread's 
instance (or `null` for non-ICU collations).
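
The per-thread pattern described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: it uses the JDK's `java.text.Collator` as a stand-in for ICU's `com.ibm.icu.text.Collator` so that it runs without extra dependencies, and the class name `ThreadLocalCollatorSketch` is hypothetical. The JDK collator has no `freeze()`, so the freezing step is noted only in a comment.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical holder class; not part of Spark or ICU.
class ThreadLocalCollatorSketch {
    // Each thread lazily builds its own Collator on first access, so no
    // internal collator state is shared between threads.
    private final ThreadLocal<Collator> perThread;

    ThreadLocalCollatorSketch(Locale locale, int strength) {
        this.perThread = ThreadLocal.withInitial(() -> {
            Collator c = Collator.getInstance(locale);
            c.setStrength(strength);
            // With ICU's Collator, the instance would additionally be frozen
            // here to guard against later mutation (assumption based on the
            // description above; java.text.Collator has no freeze()).
            return c;
        });
    }

    // Returns the calling thread's own instance.
    Collator getCollator() {
        return perThread.get();
    }

    int compare(String a, String b) {
        return getCollator().compare(a, b);
    }
}
```

Under these assumptions, repeated calls to `getCollator()` on one thread return the same object, while a different thread gets a separately constructed one.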
   
   ### Why are the changes needed?
   
   To remove a concurrency bottleneck when comparing or hashing collated 
columns under parallel access.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. This is purely a concurrency optimization; collation results are 
identical.
   
   ### How was this patch tested?
   
   Added a concurrent test in `CollationFactorySuite` that verifies 
`comparator`, `hashFunction`, and `getCollator()` produce consistent results 
under parallel access across `UNICODE`, `en`, `de`, `en_CI`, and `en_AI` 
collations. Existing `CollationFactorySuite` tests continue to pass.
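
A consistency check of this kind might look like the sketch below. It is not the test added to `CollationFactorySuite`: the class and method names are hypothetical, and `java.text.Collator` again stands in for the ICU collator so the example is self-contained. It computes a single-threaded baseline of comparison signs and collation-key hashes, then verifies that worker threads reproduce it.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical check; names are illustrative only.
class ConcurrentCollationCheck {
    // One collator per thread (JDK stand-in for the ICU collator).
    private static final ThreadLocal<Collator> COLLATOR =
        ThreadLocal.withInitial(() -> Collator.getInstance(Locale.ENGLISH));

    static int sign(String a, String b) {
        return Integer.signum(COLLATOR.get().compare(a, b));
    }

    static int hash(String s) {
        return COLLATOR.get().getCollationKey(s).hashCode();
    }

    static boolean consistentUnderParallelAccess(int threads, int iterations) {
        String[] words = {"apple", "Apple", "banana", "Banana", "cherry"};
        int n = words.length;
        // Single-threaded baseline to compare against.
        int[][] expectedSign = new int[n][n];
        int[] expectedHash = new int[n];
        for (int i = 0; i < n; i++) {
            expectedHash[i] = hash(words[i]);
            for (int j = 0; j < n; j++) {
                expectedSign[i][j] = sign(words[i], words[j]);
            }
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                results.add(pool.submit(() -> {
                    for (int k = 0; k < iterations; k++) {
                        for (int i = 0; i < n; i++) {
                            if (hash(words[i]) != expectedHash[i]) return false;
                            for (int j = 0; j < n; j++) {
                                if (sign(words[i], words[j]) != expectedSign[i][j]) return false;
                            }
                        }
                    }
                    return true;
                }));
            }
            // Every worker must have matched the baseline on every iteration.
            for (Future<Boolean> f : results) {
                if (!f.get()) return false;
            }
            return true;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling `ConcurrentCollationCheck.consistentUnderParallelAccess(8, 1000)` would exercise eight worker threads. A passing run does not prove the absence of races, it only fails to find one.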
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Yes, co-authored using Claude Code.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

