Hi,

While trying pg16beta1 libc collations on Windows, I noticed that UTF-8 text sometimes sorts differently across invocations with the same locales, which is wrong since these collations are deterministic.
The OS is Windows 10 Home, version 10.0.19045 Build 19045, with a self-built 16beta1 (VS Community 2022, without ICU) and the default configuration in postgresql.conf.

It seems to occur more or less randomly with all libc locales except C/POSIX. The probability of getting differences seems much higher when the data has more rows and uses higher codepoints: if all characters are in [U+0001,U+0400], the sorts never differ with 40k rows, but they do if there are many more rows or if the range is [U+0001,U+2000]. Also, it does not occur at all if parallel scan is disabled.

I've come up with a self-contained script that generates random words, then repeatedly sorts them and feeds the result to md5sum. It takes the number of rows and the highest Unicode codepoint as arguments, and shows when the checksums differ across consecutive invocations.

Here's a typical run showing how it goes wrong after the 14th sort:

  $ bash repro-coll-windows.sh 40000 16383
  NOTICE: relation "random_words" already exists, skipping
  CREATE TABLE
  TRUNCATE TABLE
  CREATE FUNCTION
  DROP COLLATION
  CREATE COLLATION
  INSERT 0 40000
  ANALYZE
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
  35050d858f4c590788132627e74f62c8 -> e746b626fcc848cbbc67570a7dde03bb (iter=15)
  16
  e746b626fcc848cbbc67570a7dde03bb -> 35050d858f4c590788132627e74f62c8 (iter=16)
  17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
  35050d858f4c590788132627e74f62c8 -> 6bf38563d1267339122154bd7d4fbfce (iter=38)
  39
  6bf38563d1267339122154bd7d4fbfce -> 35050d858f4c590788132627e74f62c8 (iter=39)
  40 41 42 43 44 45 46 47 48 49 50 51
  35050d858f4c590788132627e74f62c8 -> 3d2072698054d0bd57beefea0248b7e6 (iter=51)
  52
  3d2072698054d0bd57beefea0248b7e6 -> 35050d858f4c590788132627e74f62c8 (iter=52)
  53 54 55 56 57 58 59
  ^C

Would anyone be able to reproduce this? It might be a local problem, although there's nothing special installed AFAICS.
Initially I saw this with a larger dataset that I can't share, and the diffs between outputs showed that only a few lines out of 2 million were getting displaced across sorts. It also happens on the same OS with Pg15.3 (EDB build) and the default libc collation, so I would not immediately suspect new code in Pg16.

Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
repro-coll-windows.sh
Description: Binary data
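
For readers without the attachment, the checking loop described in the message can be approximated as follows. This is only a rough sketch of the idea (run the same sort repeatedly, checksum the output, report when the checksum changes), not the attached script itself. The function name check_stable is hypothetical, and in the usage comment the column name w and the collation name mycoll are assumptions; only the table name random_words appears in the run shown in the message.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from repro-coll-windows.sh.
# Runs a command repeatedly and reports whenever the md5sum of its
# output differs from the previous iteration's.
check_stable() {
  local iterations=$1; shift
  local prev="" cur i
  for ((i = 1; i <= iterations; i++)); do
    echo "$i"
    # Checksum the full output of the command for this iteration.
    cur=$("$@" | md5sum | cut -d' ' -f1)
    # Report only when the checksum changed since the last run.
    if [[ -n $prev && $cur != "$prev" ]]; then
      echo "$prev -> $cur (iter=$i)"
    fi
    prev=$cur
  done
}

# Example invocation (column and collation names are assumed):
# check_stable 100 psql -XAtq -c \
#   'SELECT w FROM random_words ORDER BY w COLLATE "mycoll"'
```

With a deterministic collation and unchanged data, every iteration should yield the same checksum, so any "->" line in the output points at an unstable sort.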