On Wed, Mar 25, 2026 at 6:09 PM Masahiko Sawada <[email protected]> wrote: > > On Wed, Mar 25, 2026 at 5:35 PM Tom Lane <[email protected]> wrote: > > > > Tomas Vondra <[email protected]> writes: > > > On 3/26/26 00:40, Tom Lane wrote: > > >> I believe what's happening there is that in cs_CZ locale, > > >> "V" doesn't follow simple ASCII sort ordering. > > > > > With cs_CZ all letters sort *before* numbers, while in en_US it's the > > > other way around. V is not special in any way. > > > > Ah, sorry, I should have researched a bit instead of relying on > > fading memory. The quirk I was thinking of is that in cs_CZ, > > "ch" sorts after "h": > > > > u8=# select 'h' < 'ch'::text collate "en_US"; > > ?column? > > ---------- > > f > > (1 row) > > > > u8=# select 'h' < 'ch'::text collate "cs_CZ"; > > ?column? > > ---------- > > t > > (1 row) > > > > Regular hex encoding isn't bitten by that because it doesn't > > use 'h' in the text form ... but this base32hex thingie does. > > > > However, your point is also correct: > > > > u8=# select '0' < 'C'::text ; > > ?column? > > ---------- > > t > > (1 row) > > > > u8=# select '0' < 'C'::text collate "cs_CZ"; > > ?column? > > ---------- > > f > > (1 row) > > > > and that breaks "text ordering matches numeric ordering" > > for both traditional hex and base32hex. So maybe this > > is not as big a deal as I first thought. We need a fix > > for the new test though. Probably adding COLLATE "C" > > would be enough. > > Thank you for the report and the analysis. > > I've reproduced the issue with "cs_CZ" collation and adding COLLATE > "C" to the query resolves it. It seems also a good idea to add a note > in the documentation too as users might face the same issue. For > example, > > To maintain the lexicographical sort order of the encoded data, ensure > that the text is sorted using the C collation (e.g., using COLLATE > "C"). Natural language collations may sort characters differently and > break the ordering. >
Attached the patch doing the above idea. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
From a64f3f64a9f04c1f5da9a51fe760c40480585fd4 Mon Sep 17 00:00:00 2001 From: Masahiko Sawada <[email protected]> Date: Wed, 25 Mar 2026 18:07:11 -0700 Subject: [PATCH] Fix UUID sortability tests in base32hex encoding. The recently added test for base32hex encoding of UUIDs failed on buildfarm member hippopotamus using natural language locales (such as cs_CZ). This happened because those collations may sort characters differently, which breaks the strict byte-wise lexicographical ordering expected by base32hex encoding. This commit fixes the regression tests by explicitly using the C collation. Additionally, add a note to the documentation to warm users that they must use the C collation if they want to maintain the lexicographical sort order of the encoded data. Per buildfarm member hippopotamus. Analyzed-by: Tom Lane <[email protected]> Discussion: https://postgr.es/m/[email protected] --- doc/src/sgml/func/func-binarystring.sgml | 9 +++++++++ src/test/regress/expected/uuid.out | 7 +++++-- src/test/regress/sql/uuid.sql | 7 +++++-- 3 files changed, 19 insertions(+), 4 deletions(-) diff --git a/doc/src/sgml/func/func-binarystring.sgml b/doc/src/sgml/func/func-binarystring.sgml index 0aaf9bc68f1..2ad2cdbea82 100644 --- a/doc/src/sgml/func/func-binarystring.sgml +++ b/doc/src/sgml/func/func-binarystring.sgml @@ -790,6 +790,15 @@ produces a 26-character string compared to the standard 36-character UUID representation. </para> + + <note> + <para> + To maintain the lexicographical sort order of the encoded data, + ensure that the text is sorted using the C collation + (e.g., using <literal>COLLATE "C"</literal>). Natural language + collations may sort characters differently and break the ordering. + </para> + </note> </listitem> </varlistentry> diff --git a/src/test/regress/expected/uuid.out b/src/test/regress/expected/uuid.out index 142c529e693..9c5dda9e9ab 100644 --- a/src/test/regress/expected/uuid.out +++ b/src/test/regress/expected/uuid.out @@ -236,8 +236,11 @@ SELECT array_agg(id ORDER BY guid_field) FROM guid3; {1,2,3,4,5,6,7,8,9,10,11,12} (1 row) --- make sure base32hex encoding works with UUIDs and preserves ordering -SELECT array_agg(id ORDER BY guid_encoded) FROM guid3; +-- Test base32hex encoding of UUIDs and its lexicographical sorting property. +-- COLLATE "C" is required to prevent buildfarm failures in non-C locales +-- where natural language collations (such as cs_CZ) would break strict +-- byte-wise ordering. +SELECT array_agg(id ORDER BY guid_encoded COLLATE "C") FROM guid3; array_agg ------------------------------ {1,2,3,4,5,6,7,8,9,10,11,12} diff --git a/src/test/regress/sql/uuid.sql b/src/test/regress/sql/uuid.sql index f2ff00f5ddd..8cc2ad40614 100644 --- a/src/test/regress/sql/uuid.sql +++ b/src/test/regress/sql/uuid.sql @@ -122,8 +122,11 @@ INSERT INTO guid3 (guid_field) SELECT uuidv7() FROM generate_series(1, 10); INSERT INTO guid3 (guid_field) VALUES ('ffffffff-ffff-ffff-ffff-ffffffffffff'::uuid); SELECT array_agg(id ORDER BY guid_field) FROM guid3; --- make sure base32hex encoding works with UUIDs and preserves ordering -SELECT array_agg(id ORDER BY guid_encoded) FROM guid3; +-- Test base32hex encoding of UUIDs and its lexicographical sorting property. +-- COLLATE "C" is required to prevent buildfarm failures in non-C locales +-- where natural language collations (such as cs_CZ) would break strict +-- byte-wise ordering. +SELECT array_agg(id ORDER BY guid_encoded COLLATE "C") FROM guid3; -- Check the timestamp offsets for v7. -- -- 2.53.0
