Looks great! Thanks for letting me know where the tests live. I’ll
try to get these tests in the official Unicode test suite, too. Should
help future implementors.

Thanks,
Diego

> On Jun 3, 2026, at 9:07 PM, Michael Paquier <[email protected]> wrote:
> 
> On Mon, Jun 01, 2026 at 11:38:32AM -0700, Diego Frias wrote:
>> In short, TCount actually counts 1 more than the number of T
>> syllables; this is so s % TCount == 0 implies that s has no T
>> syllable (because the 0th place represents the absence of a T
>> syllable), where s is the s-index of a precomposed Hangul
>> character. Anyway, since PostgreSQL recognizes 0x11A7 as a T
>> syllable, the composition algorithm yields a nonsense character when
>> 0x11A7 is put in the T position.
> 
> Oops.  Yes, including TBASE in the recomposition is incorrect, finding
> your quote here (TBase is set to one less..):
> https://unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G59688
> 
> The character gets eaten by the normalization.  Not great.
> 
>> Let me know if this patch needs anything else. I can write a test
>> for this, but it looks like the current testing setup in
>> src/common/norm_test.c only runs the Unicode test suite and isn’t
>> built for writing custom tests. If that is something of interest,
>> though, I’m happy to add that to this patch.
> 
> We have a set of tests in src/test/regress/sql/unicode.sql that would
> fit nicely with what you want to address here.  For this specific
> problem, this would work:
> SELECT normalize(U&'\AC00\11A7', NFC) = U&'\AC00\11A7';
> 
> How about adding more normalization check patterns, while on it?  I am
> finishing with the attached, all things combined.  Diego, what do you
> think?
> --
> Michael
> <0001-Fix-off-by-one-with-NFC-recomposition-for-Hangul-U-1.patch>
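
For anyone reading the thread later, the off-by-one discussed above can
be sketched roughly as follows. This is only an illustration in Python,
not the PostgreSQL code: the constants are the ones from the Hangul
section of the Unicode core spec, while the function names and the
include_tbase flag are hypothetical and exist only to contrast the two
range checks.

```python
# Constants from the Hangul syllable composition section of the Unicode spec.
S_BASE = 0xAC00   # first precomposed Hangul syllable
T_BASE = 0x11A7   # one less than the first trailing consonant (U+11A8)
T_COUNT = 28      # 27 trailing consonants + 1 slot meaning "no T"
S_COUNT = 11172   # total number of precomposed syllables


def is_lv_syllable(cp):
    """True if cp is a precomposed syllable with no trailing consonant."""
    s_index = cp - S_BASE
    return 0 <= s_index < S_COUNT and s_index % T_COUNT == 0


def compose_lv_t(lv, t, include_tbase=False):
    """Return the composed LVT code point, or None if the pair does not compose.

    include_tbase=True mimics the off-by-one discussed in this thread
    (hypothetical flag, for illustration only): U+11A7 is then accepted
    as a trailing consonant.
    """
    if not is_lv_syllable(lv):
        return None
    t_index = t - T_BASE
    lowest = 0 if include_tbase else 1  # t_index 0 means "no T", so it must be excluded
    if lowest <= t_index < T_COUNT:
        return lv + t_index
    return None
```

With the correct check, compose_lv_t(0xAC00, 0x11A8) gives 0xAC01 and
compose_lv_t(0xAC00, 0x11A7) gives None, so U+11A7 is left alone. With
include_tbase=True the second call returns 0xAC00 itself, which is the
"character gets eaten" behavior described above.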


