Hi hackers

I was browsing PostgreSQL’s Unicode normalization code and found an issue 
where the composition algorithm recognizes 0x11A7 as a T syllable and combines 
it with subsequent S and V syllables. Per the Unicode specification:

TBase is set to one less than the beginning of the range of trailing 
consonants, which starts at U+11A8. TCount is set to one more than the number 
of trailing consonants relevant to the decomposition algorithm: (11C2₁₆ - 
11A8₁₆ + 1) + 1.

In short, TCount is one more than the number of T syllables, so that 
s % TCount == 0 implies that s has no T syllable (because the 0th place 
represents the absence of a T syllable), where s is the s-index of a 
precomposed Hangul character. Anyway, since PostgreSQL recognizes 0x11A7 as a T 
syllable, the composition algorithm yields a nonsense character when 0x11A7 is 
put in the T position. See 
https://github.com/unicode-rs/unicode-normalization/blob/576ae0b1407dd14854876c93f1a348df0c19dffe/src/normalize.rs#L218
 for a comment on this bug in Rust’s unicode-rs, and 
https://github.com/JuliaStrings/utf8proc/commit/0260ba56c81e5ef6f06c0804034a36284bcb8710
 for a similar contribution I made to JuliaStrings/utf8proc a few months ago.
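
To make the boundary concrete, here is a minimal standalone sketch of the LV + T composition step with the strict lower bound. It is not the PostgreSQL code: compose_lv_t is a hypothetical helper name, and the constants are the ones given in the Unicode Standard's Hangul composition description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hangul constants as defined by the Unicode Standard */
#define SBASE  0xAC00   /* first precomposed syllable */
#define TBASE  0x11A7   /* one less than the first trailing consonant, U+11A8 */
#define TCOUNT 28       /* 27 trailing consonants + 1 slot meaning "no T" */
#define SCOUNT 11172    /* 19 * 21 * 28 precomposed syllables */

/*
 * Hypothetical helper for illustration only: try to compose an LV syllable
 * with a trailing consonant. Returns true and stores the LVT syllable in
 * *result on success; returns false and leaves *result untouched otherwise.
 */
static bool
compose_lv_t(uint32_t lv, uint32_t t, uint32_t *result)
{
    assert(result != NULL);

    /* lv must be a precomposed syllable whose T slot is empty (index 0) */
    if (lv < SBASE || lv >= SBASE + SCOUNT || (lv - SBASE) % TCOUNT != 0)
        return false;

    /* strict lower bound: TBASE itself (0x11A7) is not a trailing consonant */
    if (t <= TBASE || t >= TBASE + TCOUNT)
        return false;

    *result = lv + (t - TBASE);
    return true;
}
```

In this sketch, U+AC00 followed by U+11A8 composes to U+AC01, while U+AC00 followed by U+11A7 is rejected. With a non-strict comparison (t < TBASE instead of t <= TBASE), the sketch would accept 0x11A7 as T index 0.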

Let me know if this patch needs anything else. I can write a test for this, but 
it looks like the current testing setup in src/common/norm_test.c only runs the 
Unicode test suite and isn’t built for writing custom tests. If that is 
something of interest, though, I’m happy to add that to this patch.

Best,
Diego

Attachment: v1-0001-Fix-recognizing-0x11A7-as-a-Hangul-T-syllable-in-Uni.patch