On Mon, 2023-10-02 at 15:27 -0500, Nico Williams wrote:
> I think you misunderstand Unicode normalization and equivalence. There
> is no standard Unicode `normalize()` that would cause the above equality
> predicate to be true. If you normalize to NFD (normal form decomposed)
> then a _prefix_ of those two strings will be equal, but that's clearly
> not what you're looking for.

From [1]: "Unicode Normalization Forms are formally defined normalizations of Unicode strings which make it possible to determine whether any two Unicode strings are equivalent to each other. Depending on the particular Unicode Normalization Form, that equivalence can either be a canonical equivalence or a compatibility equivalence... A binary comparison of the transformed strings will then determine equivalence."

NFC and NFD are based on Canonical Equivalence: "Canonical equivalence is a fundamental equivalency between characters or sequences of characters which represent the same abstract character, and which when correctly displayed should always have the same visual appearance and behavior."

Can you explain why NFC (the default form of normalization used by the postgres normalize() function), followed by memcmp(), is not the right thing to use to determine Canonical Equivalence? Or are you saying that Canonical Equivalence is not a useful thing to test?

What do you mean about the "prefix"? In Postgres today:

  SELECT normalize(U&'\0061\0301', nfc)::bytea;  -- \xc3a1
  SELECT normalize(U&'\00E1', nfc)::bytea;       -- \xc3a1

  SELECT normalize(U&'\0061\0301', nfd)::bytea;  -- \x61cc81
  SELECT normalize(U&'\00E1', nfd)::bytea;       -- \x61cc81

which looks useful to me, but I assume you are saying that it doesn't generalize well to other cases?
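For instance, after normalization a plain byte-wise comparison is all that's needed. A minimal illustration, assuming a UTF-8 server encoding (which the U& escapes above already require), using only the normalize() and IS NORMALIZED features that already exist in Postgres:

  SELECT normalize(U&'\0061\0301', nfc) = normalize(U&'\00E1', nfc);
  -- t: the normalized forms are byte-for-byte identical

  SELECT U&'\0061\0301' IS NFC NORMALIZED;  -- f, the input is decomposed
  SELECT U&'\00E1' IS NFC NORMALIZED;       -- t, already composed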
[1] https://unicode.org/reports/tr15/

> There are two ways to write 'á' in Unicode: one is pre-composed (one
> codepoint) and the other is decomposed (two codepoints in this specific
> case), and it would be nice to be able to preserve input form when
> storing strings but then still be able to index and match them
> form-insensitively (in the case of 'á' both equivalent representations
> should be considered equal, and for UNIQUE indexes they should be
> considered the same).

Sometimes preserving input differences is a good thing, other times it's not, depending on the context. Almost any data type has some aspects of the input that might not be preserved -- leading zeros in a number, or whitespace in jsonb, etc.

If text is stored as normalized with NFC, it could be frustrating if the retrieved string has a different binary representation than the source data. But it could also be frustrating to look at two strings made up of ordinary characters that look identical and for the database to consider them unequal.

Regards,
	Jeff Davis
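P.S. Regarding the UNIQUE-index point: since normalize() is immutable (and so allowed in index expressions), a unique expression index can already enforce form-insensitive uniqueness while the stored column preserves the input spelling. A rough sketch (the table and column names here are hypothetical):

  CREATE TABLE t (s text);
  CREATE UNIQUE INDEX t_s_nfc_idx ON t ((normalize(s, nfc)));

  INSERT INTO t VALUES (U&'\00E1');       -- succeeds
  INSERT INTO t VALUES (U&'\0061\0301');  -- fails: duplicate key

The catch is that queries only get the index (and the form-insensitive matching) if they compare the same expression, e.g. WHERE normalize(s, nfc) = normalize($1, nfc).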