Hi hackers,

I was browsing PostgreSQL's Unicode normalization code and found an issue: the composition algorithm recognizes 0x11A7 as a T syllable and combines it with subsequent S and V syllables. Per the Unicode specification:
"TBase is set to one less than the beginning of the range of trailing consonants, which starts at U+11A8. TCount is set to one more than the number of trailing consonants relevant to the decomposition algorithm: (11C2₁₆ - 11A8₁₆ + 1) + 1."

In short, TCount actually counts 1 more than the number of T syllables; this is so s % TCount == 0 implies that s has no T syllable (because the 0th place represents the absence of a T syllable), where s is the s-index of a precomposed Hangul character.

Anyway, since PostgreSQL recognizes 0x11A7 as a T syllable, the composition algorithm yields a nonsense character when 0x11A7 is put in the T position.

See https://github.com/unicode-rs/unicode-normalization/blob/576ae0b1407dd14854876c93f1a348df0c19dffe/src/normalize.rs#L218 for a comment on this bug in Rust's unicode-rs, and https://github.com/JuliaStrings/utf8proc/commit/0260ba56c81e5ef6f06c0804034a36284bcb8710 for a similar contribution I made to JuliaStrings/utf8proc a few months ago.

Let me know if this patch needs anything else. I can write a test for this, but it looks like the current testing setup in src/common/norm_test.c only runs the Unicode test suite and isn't built for writing custom tests. If that is something of interest, though, I'm happy to add that to this patch.

Best,
Diego
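P.S. For anyone who wants to see the off-by-one in isolation, here is a minimal standalone sketch of the LV + T composition range check. It is not copied from PostgreSQL's source: the function names compose_lvt_inclusive and compose_lvt_exclusive are hypothetical, and only the Hangul constants come from the Unicode specification. The first function treats TBASE itself as a valid T; the second requires the code point to be strictly greater than TBASE.

```c
#include <stdint.h>

/* Hangul composition constants from the Unicode specification. */
#define SBASE  0xAC00
#define TBASE  0x11A7   /* one LESS than the first trailing consonant, U+11A8 */
#define LCOUNT 19
#define VCOUNT 21
#define TCOUNT 28       /* 27 trailing consonants + 1 slot for "no T" */
#define NCOUNT (VCOUNT * TCOUNT)
#define SCOUNT (LCOUNT * NCOUNT)

/* True if 'start' is a precomposed LV syllable, i.e. it has no T part. */
static int
is_lv_syllable(uint32_t start)
{
    return start >= SBASE && start < SBASE + SCOUNT &&
           (start - SBASE) % TCOUNT == 0;
}

/*
 * Hypothetical illustration of the problematic check: the lower bound is
 * inclusive, so U+11A7 (TBASE itself) is accepted as a T syllable.
 * Returns 1 and stores the composed code point in *out on success.
 */
static int
compose_lvt_inclusive(uint32_t start, uint32_t code, uint32_t *out)
{
    if (is_lv_syllable(start) && code >= TBASE && code < TBASE + TCOUNT)
    {
        *out = start + (code - TBASE);
        return 1;
    }
    return 0;
}

/*
 * Hypothetical illustration of the corrected check: the lower bound is
 * exclusive, so only U+11A8..U+11C2 are accepted.
 */
static int
compose_lvt_exclusive(uint32_t start, uint32_t code, uint32_t *out)
{
    if (is_lv_syllable(start) && code > TBASE && code < TBASE + TCOUNT)
    {
        *out = start + (code - TBASE);
        return 1;
    }
    return 0;
}
```

With start = 0xAC00 and code = 0x11A7, the inclusive variant reports a successful composition with a T index of 0, while the exclusive variant declines to compose. Both variants agree on a real trailing consonant such as 0x11A8, which composes with 0xAC00 to 0xAC01.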
v1-0001-Fix-recognizing-0x11A7-as-a-Hangul-T-syllable-in-Uni.patch
