Hi,

While trying pg16beta1 libc collations on Windows, I noticed that UTF-8
text sometimes sorts differently across invocations with the same
locale, which is wrong since these collations are deterministic.

The OS is Windows 10 Home, version 10.0.19045 Build 19045,
self-built 16beta1 with VS Community 2022, without ICU, default
configuration in postgresql.conf.

It seems to occur more or less randomly with all libc locales except
C/POSIX. The probability of getting differences seems much higher
when the table has more rows and the data uses higher codepoints. For
example, if all characters are in [U+0001,U+0400], the sorts never
differ with 40k rows, but they do if there are many more rows or if
the range is [U+0001,U+2000].

Also, it does not occur at all if parallel scan is disabled.

I've come up with a self-contained script that generates random words
and repeatedly sorts them and feeds the result to md5sum. It takes the number of
rows and the highest Unicode codepoint as arguments, and shows when the
checksums differ across consecutive invocations.
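
For readers without the attachment, the comparison loop can be sketched
roughly as follows. This is an illustration only, not the attached script:
the function name checksum_runs is made up, and in the usage comment the
column name "w" and the collation name "mycoll" are assumptions (only the
table name random_words appears in the output below).

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the attached repro-coll-windows.sh.
# Runs a command repeatedly and reports whenever the md5 checksum of its
# output differs from the checksum of the previous run.
#
# Usage: checksum_runs ITERATIONS COMMAND [ARGS...]
# Example (column "w" and collation "mycoll" are assumed names):
#   checksum_runs 50 psql -XAtq -c \
#     'SELECT w FROM random_words ORDER BY w COLLATE "mycoll"'
checksum_runs() {
  local iterations=$1; shift
  local prev="" cur i
  for ((i = 1; i <= iterations; i++)); do
    printf '%d ' "$i"
    # md5sum prints "<hash>  -" for stdin; keep only the hash.
    cur=$("$@" | md5sum | cut -d' ' -f1)
    if [[ -n $prev && $cur != "$prev" ]]; then
      # A deterministic sort should never reach this branch.
      printf '\n%s -> %s\n(iter=%d)\n' "$prev" "$cur" "$i"
    fi
    prev=$cur
  done
  printf '\n'
}
```

With a deterministic collation, such a loop should only ever print the
iteration numbers.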

Here's a typical run showing how it goes wrong after the 14th sort:

$ bash repro-coll-windows.sh 40000 16383
NOTICE:  relation "random_words" already exists, skipping
CREATE TABLE
TRUNCATE TABLE
CREATE FUNCTION
DROP COLLATION
CREATE COLLATION
INSERT 0 40000
ANALYZE
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 
35050d858f4c590788132627e74f62c8 -> e746b626fcc848cbbc67570a7dde03bb
(iter=15)
16 
e746b626fcc848cbbc67570a7dde03bb -> 35050d858f4c590788132627e74f62c8
(iter=16)
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 
35050d858f4c590788132627e74f62c8 -> 6bf38563d1267339122154bd7d4fbfce
(iter=38)
39 
6bf38563d1267339122154bd7d4fbfce -> 35050d858f4c590788132627e74f62c8
(iter=39)
40 41 42 43 44 45 46 47 48 49 50 51 
35050d858f4c590788132627e74f62c8 -> 3d2072698054d0bd57beefea0248b7e6
(iter=51)
52 
3d2072698054d0bd57beefea0248b7e6 -> 35050d858f4c590788132627e74f62c8
(iter=52)
53 54 55 56 57 58 59 ^C

Would anyone be able to reproduce this? That might be a local problem
although there's nothing special installed AFAICS.
Initially I saw this with a larger dataset that I can't share, and the diffs
between outputs showed that only a few lines out of 2 million lines
were getting displaced across sorts.
It also happens on the same OS with Pg15.3 (EDB build) and the default
libc collation, so I would not immediately suspect new code in Pg16.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Attachment: repro-coll-windows.sh