Re: [HACKERS] Errors in our encoding conversion tables

Albe Laurenz Fri, 27 Nov 2015 00:50:40 -0800

Tom Lane wrote:
> There's a discussion over at
> http://www.postgresql.org/message-id/flat/[email protected]
> of an apparent error in our WIN1250 -> LATIN2 conversion.  I looked into this
> and found that indeed, the code will happily translate certain characters
> for which there seems to be no justification.  I made up a quick script
> that would recompute the conversion tables in latin2_and_win1250.c from
> the Unicode mapping files in src/backend/utils/mb/Unicode, and what it
> computes is shown in the attached diff.  (Zeroes in the tables indicate
> codes with no translation, for which an error should be thrown.)
> 
> Having done that, I thought it would be a good idea to see if we had any
> other conversion tables that weren't directly based on the Unicode data.
> The only ones I could find were in cyrillic_and_mic.c, and those seem to
> be absolutely filled with errors, to the point where I wonder if they were
> made from the claimed encodings or some other ones.  The attached patch
> recomputes those from the Unicode data, too.
> 
> None of this data seems to have been touched since Tatsuo-san's original
> commit 969e0246, so it looks like we simply didn't vet that submission
> closely enough.
> 
> I have not attempted to reverify the files in utils/mb/Unicode against the
> original Unicode Consortium data, but maybe we ought to do that before
> taking any further steps here.
> 
> Anyway, what are we going to do about this?  I'm concerned that simply
> shoving in corrections may cause problems for users.  Almost certainly,
> we should not back-patch this kind of change.


Thanks for picking this up.

I agree with your proposed fix. The only thing that makes me uncomfortable
is that you get error messages like:
  ERROR:  character with byte sequence 0x96 in encoding "WIN1250" has no equivalent in encoding "MULE_INTERNAL"
which is a bit misleading.
But the main thing is that no corrupt data can be entered.
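
To illustrate the kind of check behind that error, here is a minimal sketch
of deriving a WIN1250 -> LATIN2 table by going through Unicode, with zeroes
for codes that have no translation.  It is not the script Tom used; it
assumes that Python's "cp1250" and "iso8859_2" codecs agree with the mapping
files in src/backend/utils/mb/Unicode, and build_table is a made-up name.

```python
# Sketch only: Python's codecs stand in for the Unicode mapping files.

def build_table(src="cp1250", dst="iso8859_2"):
    """Return 128 entries for source bytes 0x80..0xFF; 0 means no translation."""
    table = []
    for code in range(0x80, 0x100):
        try:
            # Source byte -> Unicode character -> target byte.
            out = bytes([code]).decode(src).encode(dst)
        except (UnicodeDecodeError, UnicodeEncodeError):
            # Undefined in the source, or absent from the target encoding.
            table.append(0)
            continue
        table.append(out[0])
    return table


table = build_table()
unmapped = [0x80 + i for i, v in enumerate(table) if v == 0]
print("WIN1250 bytes with no LATIN2 equivalent:",
      " ".join(f"0x{b:02X}" for b in unmapped))
```

Under those assumptions, 0x96 (an en dash in WIN1250) shows up in the
unmapped list, which matches the error above.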

I can understand the reluctance to back-patch; nobody likes his
application to suddenly fail after a minor database upgrade.

However, the people who would fail if this were back-patched are
people who will certainly run into trouble if they
a) upgrade to a release where this is fixed or
b) try to convert their database to, say, UTF8.

The least we should do is stick a fat warning into the release notes
of the first version where this is fixed, along with some guidelines on what
to do (though I am afraid that there is not much more helpful to say than
"If your database encoding is X and data have been entered with client_encoding Y,
fix your data in the old system").

But I think that this fix should be applied to 9.6.
PostgreSQL has a strong reputation for being strict about correct encoding
(not saying that everybody appreciates that), and I think we shouldn't mar
that reputation.

Yours,
Laurenz Albe
