Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Ok, I have studied some more and I think I understand what's going on.
> AIUI, we are switching from some expensive char-wise comparisons to
> cheap byte-wise comparisons in the UTF8 case because we know that in
> UTF8 the magic characters ('_', '%' and '\') aren't a part of any other
> character sequence. Is that putting it too mildly? Do we need stronger
> conditions than that? If it's correct, are there other MBCS for which
> this is true?
I don't think this is a correct analysis. If it were correct, then we
could use the optimization for all backend charsets, because none of
them allow MB characters to contain non-high-bit-set bytes. But it was
stated somewhere upthread that that doesn't actually work.

Clearly it's a necessary property that we not falsely detect the magic
pattern characters, but that's not sufficient. I think the real issue
is that UTF8 has disjoint representations for first bytes and
not-first bytes of MB characters, and thus it is impossible to make a
false match in which an MB pattern character is matched to the end of
one data character plus the start of another. In character sets
without that property, we have to use the slow way to ensure we don't
make out-of-sync matches.

			regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
       subscribe-nomail command to [EMAIL PROTECTED] so that your
       message can get through to the mailing list cleanly
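[Editor's note: the byte-class argument above can be sketched concretely.
This is illustrative Python, not PostgreSQL's actual matching code; the
function name `utf8_byte_kind` and the Shift-JIS (cp932) counterexample
are this sketch's own, chosen because cp932 is a well-known encoding
whose trail bytes overlap the ASCII range.]

```python
# Why byte-wise LIKE matching is safe for UTF-8 but not for every
# multibyte encoding (illustrative sketch only).

def utf8_byte_kind(b: int) -> str:
    """Classify a byte by the role it can play in UTF-8."""
    if b < 0x80:
        return "ascii"          # single-byte char, incl. '%', '_', '\'
    if b < 0xC0:
        return "continuation"   # 10xxxxxx: never starts a character
    return "lead"               # 11xxxxxx: always starts a character

# Property 1: the magic pattern bytes can never appear inside a
# multibyte UTF-8 character, so they are never falsely detected.
assert all(utf8_byte_kind(ord(c)) == "ascii" for c in "%_\\")

# Property 2: lead-byte and continuation-byte ranges are disjoint, so a
# byte-wise match of a multibyte pattern character cannot begin in the
# middle of a data character -- no out-of-sync match is possible.
leads = set(range(0xC0, 0x100))
conts = set(range(0x80, 0xC0))
assert leads.isdisjoint(conts)

# Counterexample: in Shift-JIS (cp932) the second byte of a character
# may fall in the ASCII range.  The character U+8868 encodes as
# b'\x95\x5c', whose trail byte 0x5C is '\' -- one of the magic
# characters -- so a naive byte scan would misfire mid-character.
sjis = "\u8868".encode("cp932")
assert sjis == b"\x95\x5c"
assert sjis[1] == ord("\\")
```

The same overlap is why encodings like Shift-JIS must fall back to the
slow character-wise path: a trail byte can masquerade as an ASCII magic
byte, or as the first byte of another character.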