On 10/27/2025 10:22 PM, Thomas Munro wrote:
> Here's a very short patch to experiment with the idea of using
> Windows' native UTF-8 support when possible, ie when using
> "en-US.UTF-8" in a UTF-8 database.  Otherwise it continues to use the
> special Windows-only wchar_t conversion that allows for locales with
> non-matching encodings, ie the reason you're allowed to use
> "English_United States.1252" in a UTF-8 database on that OS, something
> we wouldn't allow on Unix.
> 
> As I understand it, that mechanism dates from the pre-Windows 10 era
> when it had no .UTF-8 locales but users wanted or needed to use UTF-8
> databases.  I think some locales used encodings that we don't even
> support as server encodings, eg SJIS in Japan, so that was a
> workaround.  I assume you could use "ja-JP.UTF-8" these days.
> 
> CI tells me it compiles and passes, but I am not a Windows person, I'm
> primarily interested in code cleanup and removing weird platform
> differences.  I wonder if someone directly interested in Windows would
> like to experiment with this and report whether (1) it works as
> expected and (2) "en-US.UTF-8" loses performance compared to "en-US"
> (which I guess uses WIN1252 encoding and triggers the conversion
> path?), and similarly for other locale pairs you might be interested
> in?

I wrote a standalone test to check this. Results on Windows 11 x64,
16 cores, ACP=1252.

(1) Correctness: PASS. strcoll_l() with UTF-8 locale matches wcscoll_l()
    for all 26 test cases (ASCII, accents, umlauts, ß, Greek, etc.).
    Sorting 38 German/French words with both methods produces identical
    order.
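
For reference, the heart of the check is roughly this (a trimmed
sketch; compare_paths() is an illustrative name, and the real program
adds error handling plus the sort test):

#include <windows.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

static int sign(int x) { return (x > 0) - (x < 0); }

/* Returns 1 if the proposed and current paths agree on a UTF-8 pair. */
static int
compare_paths(const char *a, const char *b)
{
    _locale_t loc = _create_locale(LC_COLLATE, "en-US.UTF-8");
    wchar_t   wa[256], wb[256];
    int       r_new, r_cur;

    /* Proposed patch: strcoll_l() directly on the UTF-8 bytes. */
    r_new = sign(_strcoll_l(a, b, loc));

    /* Current PostgreSQL: convert to wchar_t, then wcscoll_l(). */
    MultiByteToWideChar(CP_UTF8, 0, a, -1, wa, 256);
    MultiByteToWideChar(CP_UTF8, 0, b, -1, wb, 256);
    r_cur = sign(_wcscoll_l(wa, wb, loc));

    _free_locale(loc);
    return r_new == r_cur;
}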

(2) Performance vs WIN1252: It depends on the data.

Basic comparison with short real-world strings (1M iterations each):

  Test                   UTF8-new   UTF8-cur   WIN1252
  ----                   --------   --------   -------
  'hello' vs 'world'       82 ms     108 ms     76 ms
  'apple' vs 'banana'      85 ms     110 ms     77 ms
  'PostgreSQL' vs 'MySQL'  89 ms     113 ms     83 ms

UTF8-new = strcoll_l with UTF-8 locale (proposed patch)
UTF8-cur = wcscoll_l via conversion (current PostgreSQL)
WIN1252  = strcoll_l with legacy locale (baseline)
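
The timing harness is just a tight loop around the comparison (a
trimmed sketch; bench_ms() is an illustrative name).  For the
UTF8-cur numbers the MultiByteToWideChar() conversion sits inside the
loop, since that's the cost PostgreSQL pays on every comparison:

/* Time `iters` comparisons of a/b and return elapsed milliseconds. */
static double
bench_ms(const char *a, const char *b, _locale_t loc, long iters)
{
    LARGE_INTEGER freq, t0, t1;
    volatile int  sink = 0;   /* keep the calls from being optimized out */
    long          i;

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    for (i = 0; i < iters; i++)
        sink += _strcoll_l(a, b, loc);
    QueryPerformanceCounter(&t1);

    (void) sink;
    return (t1.QuadPart - t0.QuadPart) * 1000.0 / freq.QuadPart;
}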

For ASCII strings, there's a crossover around 15-20 characters.  In
the tables below, ratio = WIN1252 time / UTF8 time, so >1 means UTF-8
is faster (500K iterations each):

  Length   UTF8     WIN1252   Ratio
  ------   ----     -------   -----
       5   43 ms     40 ms    0.93x  (UTF8 7% slower)
      10   50 ms     48 ms    0.96x  (UTF8 4% slower)
      20   57 ms     65 ms    1.13x  (UTF8 13% faster)
      50  104 ms    122 ms    1.17x  (UTF8 17% faster)
     100  150 ms    195 ms    1.30x  (UTF8 30% faster)
     500  550 ms    783 ms    1.43x  (UTF8 43% faster)

For accented characters (á = 2 bytes in UTF-8, 1 byte in WIN1252),
UTF-8 is consistently ~2x slower, as expected from the doubled byte
count (500K iterations each):

  Chars    UTF8     WIN1252   Ratio
  -----    ----     -------   -----
      5    55 ms     42 ms    0.76x
     50   233 ms    117 ms    0.50x
    200   694 ms    342 ms    0.49x

With 200-char ASCII strings, UTF-8 beats WIN1252 even when the
difference is at position 0 (500K iterations each):

  Difference at    UTF8     WIN1252
  -------------    ----     -------
  Position 0       168 ms   260 ms
  Position 199     252 ms   342 ms

Note that even with the difference at position 0, both paths pay a
cost that grows with string length, and WIN1252 pays considerably
more.  This suggests WIN1252's strcoll_l() has poor scaling
characteristics that the UTF-8 implementation avoids; I don't have an
explanation for why.
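
For anyone who wants to reproduce the position test, the pairs are
built along these lines (sketch; make_pair() is an illustrative name):

/* Build two ASCII strings of length `len` that are identical except
 * at position `pos`.  Buffers must hold len + 1 bytes. */
static void
make_pair(char *a, char *b, int len, int pos)
{
    int i;

    for (i = 0; i < len; i++)
        a[i] = b[i] = 'a' + (i % 26);
    b[pos] = (a[pos] == 'z') ? 'y' : 'z';   /* guarantee a difference */
    a[len] = b[len] = '\0';
}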

The patch is correct, and the new strcoll_l() path is 10-25% faster
than the current wcscoll_l() conversion path.  Whether a UTF-8 locale
is faster or slower than WIN1252 depends on string length and content,
but users choosing UTF-8 locales presumably want Unicode support, not
WIN1252 compatibility.

I can test more if needed, and I can provide the full test program to
anyone who wants it.

-- 
Bryan Green
EDB: https://www.enterprisedb.com

