Jeff Davis wrote:

> New version attached.
[v16]

Concerning the target category_test, it produces failures with
versions of ICU with Unicode < 15. The first one I see with
Ubuntu 22.04 (ICU 70.1) is:

  category_test: Postgres Unicode version: 15.1
  category_test: ICU Unicode version: 14.0
  category_test: FAILURE for codepoint 0x000c04
  category_test: Postgres property alphabetic/lowercase/uppercase/white_space/hex_digit/join_control: 1/0/0/0/0/0
  category_test: ICU property alphabetic/lowercase/uppercase/white_space/hex_digit/join_control: 0/0/0/0/0/0

U+0C04 is a codepoint added in Unicode 11.
https://en.wikipedia.org/wiki/Telugu_(Unicode_block)

In UnicodeData.txt:

  0C04;TELUGU SIGN COMBINING ANUSVARA ABOVE;Mn;0;NSM;;;;;N;;;;;

In Unicode 15, it has been assigned Other_Alphabetic in PropList.txt:

  $ grep 0C04 PropList.txt
  0C04 ; Other_Alphabetic # Mn TELUGU SIGN COMBINING ANUSVARA ABOVE

But in Unicode 14 it was not there. As a result, its binary property
UCHAR_ALPHABETIC has changed from false to true in ICU 72 compared
with previous versions.

As I understand it, the stability policy says that such things happen.
From https://www.unicode.org/policies/stability_policy.html:

  Once a character is encoded, its properties may still be changed,
  but not in such a way as to change the fundamental identity of the
  character.

  The Consortium will endeavor to keep the values of the other
  properties as stable as possible, but some circumstances may arise
  that require changing them. Particularly in the situation where the
  Unicode Standard first encodes less well-documented characters and
  scripts, the exact character properties and behavior initially may
  not be well known.

  As more experience is gathered in implementing the characters,
  adjustments in the properties may become necessary. Examples of such
  properties include, but are not limited to, the following:

  - General_Category
  - Case mappings
  - Bidirectional properties
  [...]
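To make the PropList.txt change above concrete, here is a minimal Python sketch that parses lines in the PropList.txt format and reports which codepoints gained or lost a given property between two versions. It is only an illustration, not part of category_test: the names parse_proplist and changed_codepoints are made up for this example, the Unicode 15 sample contains just the single line quoted above, and the Unicode 14 sample is left empty because all we know from the grep is that the U+0C04 entry was absent.

```python
import re

# Sample input in the PropList.txt format:
#   <codepoint or first..last> ; <Property> # comment
# ASSUMPTION: these are tiny stand-ins for the real files. The Unicode 15
# sample holds only the line quoted in this email; the Unicode 14 sample is
# empty because the U+0C04 entry did not exist there.
PROPLIST_V15 = """
0C04          ; Other_Alphabetic # Mn       TELUGU SIGN COMBINING ANUSVARA ABOVE
"""
PROPLIST_V14 = ""

# Matches "0C04 ; Prop" as well as ranges such as "0041..005A ; Prop".
LINE_RE = re.compile(r"^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\w+)")


def parse_proplist(text):
    """Return a dict mapping each codepoint to the set of its property names."""
    props = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip anything that is not a property line
        first = int(m.group(1), 16)
        last = int(m.group(2) or m.group(1), 16)
        for cp in range(first, last + 1):
            props.setdefault(cp, set()).add(m.group(3))
    return props


def changed_codepoints(old, new, prop):
    """Codepoints that have `prop` in exactly one of the two versions."""
    old_set = {cp for cp, names in old.items() if prop in names}
    new_set = {cp for cp, names in new.items() if prop in names}
    return sorted(old_set ^ new_set)


if __name__ == "__main__":
    diff = changed_codepoints(parse_proplist(PROPLIST_V14),
                              parse_proplist(PROPLIST_V15),
                              "Other_Alphabetic")
    for cp in diff:
        print("U+%04X" % cp)
```

Run against the full PropList.txt files of two Unicode versions, the same functions would list every codepoint whose Other_Alphabetic assignment differs, which is one of the inputs to the derived Alphabetic property that ICU reports as UCHAR_ALPHABETIC.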
I've commented out the exit(1) in category_test to collect all errors,
and built it with versions of ICU from 74 down to 60 (that is,
Unicode 10.0). Results are attached. As expected, the older the ICU
version, the more differences are found against Unicode 15.1.

I find these results interesting because they tell us what contents
can break regexp-based check constraints on upgrades.

But as a pass-or-fail kind of test, category_test can only be used
when the Unicode version in ICU is the same as in Postgres.

Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
Attachment: results-category-tests-multiple-icu-versions.tar.bz2 (binary data)