Re: Built-in CTYPE provider

Daniel Verite Mon, 15 Jan 2024 06:30:36 -0800

Jeff Davis wrote:

> New version attached.


[v16]

The category_test target fails with versions of ICU whose Unicode
version is older than 15. The first failure I see on Ubuntu 22.04
(ICU 70.1) is:

category_test: Postgres Unicode version:        15.1
category_test: ICU Unicode version:             14.0
category_test: FAILURE for codepoint 0x000c04
category_test: Postgres property alphabetic/lowercase/uppercase/white_space/hex_digit/join_control: 1/0/0/0/0/0
category_test: ICU      property alphabetic/lowercase/uppercase/white_space/hex_digit/join_control: 0/0/0/0/0/0

U+0C04 is a codepoint added in Unicode 11.
https://en.wikipedia.org/wiki/Telugu_(Unicode_block)

In UnicodeData.txt:
0C04;TELUGU SIGN COMBINING ANUSVARA ABOVE;Mn;0;NSM;;;;;N;;;;;
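
For readers who don't have the file layout in mind, here is a minimal Python sketch that splits the record above into its leading fields. The class and function names are my own, not anything from the patch. The point is only that the general category is Mn (a nonspacing mark), so the character is not a letter and can count as alphabetic only through an extra property such as Other_Alphabetic.

```python
from typing import NamedTuple


class UnicodeDataRecord(NamedTuple):
    """First five fields of a UnicodeData.txt record (hypothetical helper type)."""
    codepoint: int
    name: str
    general_category: str
    combining_class: int
    bidi_class: str


def parse_unicodedata_line(line: str) -> UnicodeDataRecord:
    # UnicodeData.txt records are 15 semicolon-separated fields.
    fields = line.strip().split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    return UnicodeDataRecord(
        codepoint=int(fields[0], 16),
        name=fields[1],
        general_category=fields[2],
        combining_class=int(fields[3]),
        bidi_class=fields[4],
    )


record = parse_unicodedata_line(
    "0C04;TELUGU SIGN COMBINING ANUSVARA ABOVE;Mn;0;NSM;;;;;N;;;;;"
)
print(f"U+{record.codepoint:04X} {record.general_category}")  # prints: U+0C04 Mn
```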

In Unicode 15, it has been assigned Other_Alphabetic in PropList.txt:
$ grep 0C04 PropList.txt
0C04          ; Other_Alphabetic # Mn       TELUGU SIGN COMBINING ANUSVARA ABOVE

But in Unicode 14 it was not there.
As a result, its binary property UCHAR_ALPHABETIC changed from
false to true in ICU 72 (Unicode 15) compared with previous versions.
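
The same comparison can be done mechanically for any property by diffing two copies of PropList.txt. Below is a rough Python sketch of that idea, not something from the patch. The function name is made up, the Unicode 15 excerpt is just the grep output above, and the Unicode 14 excerpt is an empty stand-in for "no matching line" rather than real file contents.

```python
import re
from typing import Set

# Matches "XXXX ; Property" or "XXXX..YYYY ; Property" once the comment is stripped.
_ENTRY = re.compile(r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\w+)")


def codepoints_with_property(proplist_text: str, prop: str) -> Set[int]:
    """Collect the codepoints that PropList.txt-style text assigns to `prop`."""
    found: Set[int] = set()
    for raw in proplist_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop the trailing comment
        match = _ENTRY.match(line)
        if match is None or match.group(3) != prop:
            continue
        first = int(match.group(1), 16)
        last = int(match.group(2) or match.group(1), 16)
        found.update(range(first, last + 1))
    return found


# Excerpt taken from the grep output above.
UNICODE_15_EXCERPT = (
    "0C04          ; Other_Alphabetic # Mn       "
    "TELUGU SIGN COMBINING ANUSVARA ABOVE\n"
)
# Stand-in for Unicode 14: no line for 0C04 (assumption, not the real file).
UNICODE_14_EXCERPT = ""

newly_alphabetic = (
    codepoints_with_property(UNICODE_15_EXCERPT, "Other_Alphabetic")
    - codepoints_with_property(UNICODE_14_EXCERPT, "Other_Alphabetic")
)
print(sorted(f"U+{cp:04X}" for cp in newly_alphabetic))  # prints: ['U+0C04']
```

Run against the full files from two releases, the set difference would list every codepoint that gained the property between them.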

As I understand it, the stability policy allows for such changes.
From https://www.unicode.org/policies/stability_policy.html:

   Once a character is encoded, its properties may still be changed,
   but not in such a way as to change the fundamental identity of the
   character.

   The Consortium will endeavor to keep the values of the other
   properties as stable as possible, but some circumstances may arise
   that require changing them. Particularly in the situation where
   the Unicode Standard first encodes less well-documented characters
   and scripts, the exact character properties and behavior initially
   may not be well known.

   As more experience is gathered in implementing the characters,
   adjustments in the properties may become necessary. Examples of
   such properties include, but are not limited to, the following:

    - General_Category
    - Case mappings
    - Bidirectional properties
    [...]

I've commented out the exit(1) in category_test to collect all errors,
and built it with versions of ICU from 74 down to 60 (ICU 60 corresponds
to Unicode 10.0). Results are attached. As expected, the older the ICU
version, the more differences are found against Unicode 15.1.

I find these results interesting because they tell us which characters
can cause regexp-based check constraints to break on upgrades.
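
To make the risk concrete, here is a toy Python model of a hypothetical constraint equivalent to CHECK (col ~ '^[[:alpha:]]+$'). It does not call PostgreSQL or ICU. The two "alphabetic" tables are tiny hand-made sets that differ only in U+0C04, and U+0C05 (TELUGU LETTER A) is included as an ordinary letter.

```python
from typing import Set

# Toy property tables (assumption: only the two codepoints used below matter).
ALPHABETIC_OLD: Set[int] = {0x0C05}                  # U+0C04 not yet alphabetic
ALPHABETIC_NEW: Set[int] = ALPHABETIC_OLD | {0x0C04}  # U+0C04 gained the property


def passes_all_alpha_check(value: str, alphabetic: Set[int]) -> bool:
    """Model '^[[:alpha:]]+$': non-empty and every character is alphabetic."""
    return bool(value) and all(ord(ch) in alphabetic for ch in value)


value = "\u0C05\u0C04"  # TELUGU LETTER A followed by the combining sign

print(passes_all_alpha_check(value, ALPHABETIC_OLD))  # prints: False
print(passes_all_alpha_check(value, ALPHABETIC_NEW))  # prints: True
```

The same stored value gives opposite answers depending on the table in use. With the inverse kind of constraint (one that forbids alphabetic characters), a row accepted under the old table would violate the constraint under the new one.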

As for category_test as a pass-or-fail kind of test, it can only be
used when the Unicode version in ICU is the same as in Postgres.
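
One way to express that condition is a version gate in front of the strict comparison. The sketch below is Python rather than the C of category_test, and the function names are hypothetical. It only compares the two version strings that the test already prints.

```python
from typing import Tuple


def parse_unicode_version(text: str) -> Tuple[int, int]:
    """Turn a string such as '15.1' into (major, minor); minor defaults to 0."""
    parts = text.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor)


def strict_comparison_is_meaningful(pg_version: str, icu_version: str) -> bool:
    """A pass-or-fail comparison only makes sense when both sides match."""
    return parse_unicode_version(pg_version) == parse_unicode_version(icu_version)


# Versions reported in the failing run above.
print(strict_comparison_is_meaningful("15.1", "14.0"))  # prints: False
```

When the gate returns False, the test could skip, or report differences without failing. Which of the two is preferable is a separate question.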


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

results-category-tests-multiple-icu-versions.tar.bz2
Description: Binary data
