Re: [HACKERS] Bug in UTF8-Validation Code?

Andrew Dunstan Sat, 17 Mar 2007 07:49:22 -0800


Jeff Davis wrote:

On Wed, 2007-03-14 at 01:29 -0600, Michael Fuhr wrote:

On Tue, Mar 13, 2007 at 04:42:35PM +0100, Mario Weilguni wrote:

On Tuesday, 13 March 2007 at 16:38, Joshua D. Drake wrote:
Is this any different than the issues of moving 8.0.x to 8.1 UTF8? Where
we had to use iconv?
What issues? I've upgraded several 8.0 databases to 8.1 without having
to use iconv. Did I miss something?

http://www.postgresql.org/docs/8.1/interactive/release-8-1.html

"Some users are having problems loading UTF-8 data into 8.1.X.  This
is because previous versions allowed invalid UTF-8 byte sequences
to be entered into the database, and this release properly accepts
only valid UTF-8 sequences. One way to correct a dumpfile is to run
the command iconv -c -f UTF-8 -t UTF-8 -o cleanfile.sql dumpfile.sql."


If the above quote were actually true, then Mario wouldn't be having a
problem. Instead, it's half-true: Invalid byte sequences are rejected in
some situations and accepted in others. If PostgreSQL consistently
rejected or consistently accepted invalid byte sequences, that would not
cause problems with COPY (meaning problems with pg_dump, Slony, etc.).

How can we fix this? Frankly, the statement in the docs warning about
making sure that escaped sequences are valid in the server encoding is a
cop-out. We don't accept invalid data elsewhere, and this should be no
different IMNSHO. I don't see why this should be any different from,
say, date or numeric data. For years people have sneered at MySQL
because it accepted dates like Feb 31st, and rightly so. But this seems
to me to be like our own version of the same problem.


Last year Jeff suggested adding something like:

   pg_verifymbstr(string,strlen(string),0);

to each relevant input routine. Would that be an acceptable solution? If
not, what would be?
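
To make concrete what such a check has to catch, here is a rough
standalone sketch of a UTF-8 well-formedness test. It is not the
server's pg_verifymbstr and makes no claim about how that function is
written; the name sketch_utf8_is_valid is hypothetical, and the sketch
assumes UTF-8 only (no other server encodings).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical illustration only, not PostgreSQL source.
 * Returns true if the len bytes starting at s are well-formed UTF-8.
 */
bool
sketch_utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);

    while (i < len)
    {
        unsigned char c = s[i];
        size_t        need;   /* continuation bytes expected */
        unsigned long cp;     /* code point being decoded */
        unsigned long min;    /* smallest code point legal at this length */

        if (c < 0x80)
        {
            i++;              /* plain ASCII */
            continue;
        }
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else
            return false;     /* stray continuation or invalid lead byte */

        if (len - i - 1 < need)
            return false;     /* sequence truncated at end of input */

        for (size_t k = 1; k <= need; k++)
        {
            unsigned char cc = s[i + k];

            if ((cc & 0xC0) != 0x80)
                return false; /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min)
            return false;     /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;     /* UTF-16 surrogate range */
        if (cp > 0x10FFFF)
            return false;     /* beyond the Unicode range */

        i += need + 1;
    }
    return true;
}
```

The point is that an input routine would run a test of this kind on the
bytes it ends up with after unescaping, and raise an error instead of
storing them if the test fails.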


cheers

andrew

