Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Ok, I have studied some more and I think I understand what's going on.
> AIUI, we are switching from some expensive char-wise comparisons to
> cheap byte-wise comparisons in the UTF8 case because we know that in
> UTF8 the magic characters ('_', '%' and '\') aren't a part of any other
> character sequence. Is that putting it too mildly? Do we need stronger
> conditions than that? If it's correct, are there other MBCS for which
> this is true?
I don't think this is a correct analysis. If it were correct then we
could use the optimization for all backend charsets because none of them
allow MB characters to contain non-high-bit-set bytes. But it was
stated somewhere upthread that that doesn't actually work. Clearly
it's a necessary property that we not falsely detect the magic pattern
characters, but that's not sufficient.
I think the real issue is that UTF8 has disjoint representations for
first-bytes and not-first-bytes of MB characters, and thus it is
impossible to make a false match in which an MB pattern character is
matched to the end of one data character plus the start of another.
In character sets without that property, we have to use the slow way to
ensure we don't make out-of-sync matches.
regards, tom lane