> Well, this is still working at the wrong level.  The code
> that's in pg_verifymbstr is mainly intended to enforce the
> *system wide* assumption that multibyte characters must have
> the high bit set in every byte.  (We do not support encodings
> without this property in the backend, because it breaks code
> that looks for ASCII characters ... such as the main
> parser/lexer ...)  It's not really intended to check that the
> multibyte character is actually legal in its encoding.
Ok, point taken.

> The "special UTF-8 check" was never more than a very
> quick-n-dirty hack that was in the wrong place to start with.
> We ought to be getting rid of it not institutionalizing it.
> If you want an exact encoding-specific check on the
> legitimacy of a multibyte sequence, I think the right way to
> do it is to add another function pointer to pg_wchar_table
> entries to let each encoding have its own check routine.
> Perhaps this could be defined so as to avoid a separate call
> to pg_mblen inside the loop, and thereby not add any new
> overhead.  I'm thinking about an API something like
>
>	int validate_mbchar(const unsigned char *str, int len)
>
> with result +N if a valid character N bytes long is present
> at *str, and -N if an invalid character is present at *str
> and it would be appropriate to display N bytes in the complaint.
> (N must be <= len in either case.)  This would reduce the
> main loop of pg_verifymbstr to a call of this function and an
> error-case-handling block.

Sounds like a plan...

> 			regards, tom lane

Regards,

John Hansen