Tom Lane wrote:

* At a pattern backslash, it applies CHAREQ() but then advances
byte-by-byte over the matched characters (implicitly assuming that none
of these bytes will look like the magic characters).  While that works
for backend-safe encodings, it seems a bit strange; you've already paid
the price of determining the character length once, not to mention
matching the bytes of the characters once, and then throw that knowledge
away.  I think BYTEEQ would make more sense in the backslash path.

Probably, although the present code already uses CHAREQ there.

Is it legal for the escape character to be followed by anything other
than _, % or the escape character itself?

So the actual optimization here is that we do bytewise comparison and
advancing, but only when we are either at the start of a character
(on both sides, and the pattern char is not wildcard) or we are in the
middle of a character (on both sides) and we've already proven that both
sides matched for the previous byte(s) of the character.

I think that's correct.
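
To make that invariant concrete, here is a minimal sketch of a bytewise matcher, assuming UTF-8 input that is already known to be valid. It is not the patch itself: like_bytewise, like_cstr and utf8_charlen are names made up for this example, and backslash is hard-wired as the escape character. Ordinary pattern bytes are compared and advanced one byte at a time; only _ and % need to know where a character ends.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte length of a UTF-8 character, judged from its
 * first byte only.  Assumes the input is valid UTF-8. */
static int
utf8_charlen(unsigned char first)
{
    if (first < 0x80)
        return 1;
    if ((first & 0xE0) == 0xC0)
        return 2;
    if ((first & 0xF0) == 0xE0)
        return 3;
    return 4;
}

/* Skip one whole character of the text, never running past its end. */
static void
skip_char(const char **t, size_t *tlen)
{
    size_t n = (size_t) utf8_charlen((unsigned char) **t);

    if (n > *tlen)
        n = *tlen;
    *t += n;
    *tlen -= n;
}

/* Returns 1 if text t matches pattern p, 0 otherwise. */
static int
like_bytewise(const char *t, size_t tlen, const char *p, size_t plen)
{
    while (tlen > 0 && plen > 0)
    {
        if (*p == '\\')
        {
            /* Escape: the next pattern byte is literal.  A dangling escape
             * is simply treated as "no match" in this sketch. */
            p++, plen--;
            if (plen == 0 || *p != *t)
                return 0;
            /* fall through to the one-byte advance below */
        }
        else if (*p == '%')
        {
            while (plen > 0 && *p == '%')
                p++, plen--;
            if (plen == 0)
                return 1;       /* trailing % matches the rest */
            /* Retry the rest of the pattern at each character boundary. */
            while (tlen > 0)
            {
                if (like_bytewise(t, tlen, p, plen))
                    return 1;
                skip_char(&t, &tlen);
            }
            return 0;
        }
        else if (*p == '_')
        {
            /* _ consumes one whole character of text, one pattern byte. */
            skip_char(&t, &tlen);
            p++, plen--;
            continue;
        }
        else if (*p != *t)
            return 0;           /* plain bytewise comparison */

        /* Bytes matched: advance both sides by a single byte. */
        t++, tlen--;
        p++, plen--;
    }

    if (tlen > 0)
        return 0;
    while (plen > 0 && *p == '%')
        p++, plen--;
    return plen == 0;
}

/* Convenience wrapper for NUL-terminated strings. */
static int
like_cstr(const char *t, const char *p)
{
    return like_bytewise(t, strlen(t), p, strlen(p));
}
```

The one-byte advance is safe here because the continuation bytes of a UTF-8 character are all >= 0x80 and so can never be mistaken for \, % or _. That is the same implicit assumption noted above for the backslash path.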

On the strength of this closer reading, I would say that the patch isn't
relying on UTF8's first-byte-vs-not-first-byte property after all.
All that it's relying on is that no MB character is a prefix of another
one, which seems like a necessary property for any sane encoding; plus
that characters are considered equal only if they're bytewise equal.
So are we sure it doesn't work for non-UTF8 encodings?  Maybe that
earlier conclusion was based on a misunderstanding of what the patch
really does.



One more thing - I'm thinking of rolling up the bytea matching routines
together with the text routines to eliminate all the duplicated logic.
I can do it with a little type casting from bytea* to text* and back
again or, if that's not acceptable, with some preprocessor magic. I think
the casting is likely to be safe enough in this case - I don't think a
null byte will hurt us anywhere in this code - and presumably the varlena
stuff is all the same. Does that sound reasonable?
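
To show roughly what I mean by the casting approach, here is a small self-contained sketch. Everything in it is a stand-in: demo_varlena, demo_text, demo_bytea and the two functions are hypothetical names, not the real backend definitions, and a simple prefix test takes the place of the matching logic. The point is only that when both types share one length-prefixed layout and the shared routine works from explicit lengths, an embedded null byte is just another byte.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for a length-prefixed value; NOT the real varlena struct. */
typedef struct demo_varlena
{
    uint32_t len;       /* number of payload bytes */
    char     data[];    /* payload, may contain NUL bytes */
} demo_varlena;

/* Both "types" are the same struct under different names (assumption). */
typedef demo_varlena demo_text;
typedef demo_varlena demo_bytea;

/* Build a value from a pointer and an explicit length; caller frees it. */
static demo_varlena *
demo_make(const char *bytes, uint32_t len)
{
    demo_varlena *v = malloc(sizeof(demo_varlena) + len);

    if (v == NULL)
        return NULL;
    v->len = len;
    memcpy(v->data, bytes, len);
    return v;
}

/* Shared routine, written once against the text type.  It relies only on
 * the stored lengths, never on NUL termination. */
static int
demo_text_starts_with(const demo_text *value, const demo_text *prefix)
{
    if (prefix->len > value->len)
        return 0;
    return memcmp(value->data, prefix->data, prefix->len) == 0;
}

/* bytea entry point: reuses the text routine through a cast. */
static int
demo_bytea_starts_with(const demo_bytea *value, const demo_bytea *prefix)
{
    return demo_text_starts_with((const demo_text *) value,
                                 (const demo_text *) prefix);
}
```

The preprocessor alternative would instead compile one body twice under different names, which avoids the casts at the cost of a less readable source file.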



