Peter Eisentraut wrote:

> Here is an updated patch set that now also implements the "quick check"
> algorithm from UTR #15 for making IS NORMALIZED very fast in many cases,
> which I had mentioned earlier in the thread.

I found a bug in unicode_is_normalized_quickcheck() which is
triggered when the last codepoint of the string is at or beyond
U+10000. On encountering it, the function does:

+		if (is_supplementary_codepoint(ch))
+			p++;

When ch is the last codepoint, this makes p point to the
terminating zero, but the subsequent p++ done by the for loop
steps past it, so the loop misses its exit condition and reads
beyond the end of the string.

But anyway, what's the reason for skipping the codepoint
following a codepoint outside of the BMP?
I've figured that it comes from porting the Java code in UAX #15:

public int quickCheck(String source) {
    short lastCanonicalClass = 0;
    int result = YES;
    for (int i = 0; i < source.length(); ++i) {
        int ch = source.codepointAt(i);
        if (Character.isSupplementaryCodePoint(ch)) ++i;
        short canonicalClass = getCanonicalClass(ch);
        if (lastCanonicalClass > canonicalClass && canonicalClass != 0) {
            return NO;
        }
        int check = isAllowed(ch);
        if (check == NO) return NO;
        if (check == MAYBE) result = MAYBE;
        lastCanonicalClass = canonicalClass;
    }
    return result;
}

source.length() is the length in UTF-16 code units, in which a
surrogate pair counts for 2. This would be why it does

    if (Character.isSupplementaryCodePoint(ch)) ++i;

It's meant to skip the second UTF-16 code unit of the pair.
As this does not apply to the 32-bit pg_wchar, where every codepoint
occupies exactly one array element, I think the two lines above
in the C implementation should just be removed.

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
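
P.S. To illustrate the point, here is a minimal standalone sketch, not the code from the patch. It only counts how many array elements each loop variant visits. The names codepoint_t, count_visited_with_skip, count_visited and the two sample arrays are hypothetical; codepoint_t stands in for the 32-bit pg_wchar. The loops are bounded by an explicit capacity so that the skipping variant can be run safely, which the real function does not do.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit code point type such as pg_wchar. */
typedef uint32_t codepoint_t;

/* A supplementary code point lies outside the BMP, i.e. above U+FFFF. */
static int
is_supplementary(codepoint_t ch)
{
	return ch > 0xFFFF;
}

/*
 * Variant that mirrors the Java port: after a supplementary code point
 * it advances one extra element, as if a second UTF-16 code unit had
 * to be skipped.  Returns the number of elements visited.
 */
static size_t
count_visited_with_skip(const codepoint_t *input, size_t capacity)
{
	size_t		visited = 0;

	for (size_t i = 0; i < capacity && input[i] != 0; i++)
	{
		visited++;
		if (is_supplementary(input[i]))
			i++;				/* wrong when one element is one code point */
	}
	return visited;
}

/* Variant without the extra skip: one element, one code point. */
static size_t
count_visited(const codepoint_t *input, size_t capacity)
{
	size_t		visited = 0;

	for (size_t i = 0; i < capacity && input[i] != 0; i++)
		visited++;
	return visited;
}

/*
 * Supplementary code point last, followed by the terminator and then
 * by padding that stands in for whatever memory follows the string.
 */
static const codepoint_t sample_tail[] = {0x61, 0x1F600, 0, 0x62, 0x63, 0};

/* Supplementary code point followed by a combining mark (U+0301). */
static const codepoint_t sample_mid[] = {0x1F600, 0x0301, 0};
```

With sample_tail and a capacity of 6, the skipping variant jumps over the terminator and visits 4 elements, while the plain loop stops after 2. With sample_mid and a capacity of 3, the skipping variant visits only 1 element and never looks at U+0301, so even when it does not over-read, it misses codepoints that the quick check ought to examine.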