On 11/11/24 01:27, Peter Eisentraut wrote:
Here is the patch to update the Unicode data to version 16.0.0.

Normally, this would have been routine, but a few months ago there was
some debate about how this should be handled. [0]  AFAICT, the consensus
was to go ahead with it, but I just wanted to mention it here to be clear.

[0]:
https://www.postgresql.org/message-id/flat/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel%40j-davis.com

I ran a check and found that this patch changes the uppercasing of some characters. Repro:

setup
8<-------------
wget https://joeconway.com/presentations/formated-unicode.txt
initdb
psql
CREATE DATABASE builtincoll
 LOCALE_PROVIDER builtin
 BUILTIN_LOCALE 'C.UTF-8'
 TEMPLATE template0;
\c builtincoll
CREATE TABLE unsorted_table(strings text);
\copy unsorted_table from formated-unicode.txt (format csv)
VACUUM FREEZE ANALYZE unsorted_table;
8<-------------


8<-------------
-- on master
builtincoll=# WITH t AS (SELECT lower(strings) AS s FROM unsorted_table ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
               md5
----------------------------------
 7ec7f5c2d8729ec960942942bb82aedd
(1 row)

builtincoll=# WITH t AS (SELECT upper(strings) AS s FROM unsorted_table ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
               md5
----------------------------------
 97f83a4d1937aa65bcf8be134bf7b0c4
(1 row)

builtincoll=# WITH t AS (SELECT initcap(strings) AS s FROM unsorted_table ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
               md5
----------------------------------
 8cf65a43affc221f3a20645ef402085e
(1 row)
8<-------------


8<-------------
-- master+patch
builtincoll=# WITH t AS (SELECT lower(strings) AS s FROM unsorted_table ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
               md5
----------------------------------
 7ec7f5c2d8729ec960942942bb82aedd
(1 row)

Time: 19858.981 ms (00:19.859)
builtincoll=# WITH t AS (SELECT upper(strings) AS s FROM unsorted_table ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
               md5
----------------------------------
 3055b3d5dff76c8c1250ef500c6ec13f
(1 row)

Time: 19774.467 ms (00:19.774)
builtincoll=# WITH t AS (SELECT initcap(strings) AS s FROM unsorted_table ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
               md5
----------------------------------
 9985acddf7902ea603897cdaccd02114
(1 row)
8<-------------

So lower() gives the same checksum with and without the patch, but both upper() and initcap() produce different results, unless I am missing something.
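
The checksums only show that something changed, not which rows. One possible way to narrow it down is to export the per-row results from each build and compare them in a single session. This is an untested sketch: the file names (case_master.csv, case_patched.csv) and table names (case_master, case_patched) are made up for illustration, and it assumes the values in unsorted_table.strings are unique.

8<-------------
-- run once on master (write case_master.csv) and once on master+patch
-- (change the file name to case_patched.csv); \copy must stay on one line
\copy (SELECT strings, upper(strings), initcap(strings) FROM unsorted_table) TO 'case_master.csv' (format csv)

-- then, in either build, load both exports side by side
CREATE TABLE case_master(strings text, up text, ic text);
CREATE TABLE case_patched(strings text, up text, ic text);
\copy case_master FROM 'case_master.csv' (format csv)
\copy case_patched FROM 'case_patched.csv' (format csv)

-- rows whose upper() or initcap() result differs between the two builds;
-- the join assumes strings is unique, otherwise rows would be multiplied
SELECT m.strings,
       m.up AS upper_master,   p.up AS upper_patched,
       m.ic AS initcap_master, p.ic AS initcap_patched
FROM case_master m
JOIN case_patched p USING (strings)
WHERE m.up IS DISTINCT FROM p.up
   OR m.ic IS DISTINCT FROM p.ic
ORDER BY 1;
8<-------------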

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

