Tom Lane wrote:
> There's a discussion over at
> of an apparent error in our WIN1250 -> LATIN2 conversion.  I looked into this
> and found that indeed, the code will happily translate certain characters
> for which there seems to be no justification.  I made up a quick script
> that would recompute the conversion tables in latin2_and_win1250.c from
> the Unicode mapping files in src/backend/utils/mb/Unicode, and what it
> computes is shown in the attached diff.  (Zeroes in the tables indicate
> codes with no translation, for which an error should be thrown.)
> Having done that, I thought it would be a good idea to see if we had any
> other conversion tables that weren't directly based on the Unicode data.
> The only ones I could find were in cyrillic_and_mic.c, and those seem to
> be absolutely filled with errors, to the point where I wonder if they were
> made from the claimed encodings or some other ones.  The attached patch
> recomputes those from the Unicode data, too.
> None of this data seems to have been touched since Tatsuo-san's original
> commit 969e0246, so it looks like we simply didn't vet that submission
> closely enough.
> I have not attempted to reverify the files in utils/mb/Unicode against the
> original Unicode Consortium data, but maybe we ought to do that before
> taking any further steps here.
> Anyway, what are we going to do about this?  I'm concerned that simply
> shoving in corrections may cause problems for users.  Almost certainly,
> we should not back-patch this kind of change.

Thanks for picking this up.

I agree with your proposed fix. The only thing that makes me uncomfortable
is that you get error messages like
  ERROR:  character with byte sequence 0x96 in encoding "WIN1250" has no equivalent in encoding "MULE_INTERNAL"
which is a bit misleading, since the conversion that was requested is to
LATIN2 and MULE_INTERNAL is only the intermediate step.
But the main thing is that no corrupt data can be entered.
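
For illustration, here is a rough sketch of how such a table can be derived
by going through Unicode. It is not the script Tom used; it relies on
Python's built-in "cp1250" and "iso8859_2" codecs, which I am assuming agree
with the mapping files in src/backend/utils/mb/Unicode. The function name
build_table is made up for this example. As in the patch, a zero means
"no translation".

```python
def build_table(src: str, dst: str) -> list[int]:
    """Map each high byte (0x80..0xFF) of single-byte encoding `src`
    to its byte in single-byte encoding `dst`, going through Unicode.
    An entry of 0 means there is no translation."""
    table = []
    for byte in range(0x80, 0x100):
        try:
            # Decode to a Unicode character, then re-encode in the target.
            converted = bytes([byte]).decode(src).encode(dst)
        except UnicodeError:
            # Either the byte is undefined in `src`, or the character
            # does not exist in `dst`.
            table.append(0)
        else:
            table.append(converted[0])
    return table


# Assumption: Python's codec names for WIN1250 and LATIN2.
table = build_table("cp1250", "iso8859_2")
for byte in (0x8A, 0x96, 0xE1):
    print(f"0x{byte:02X} -> 0x{table[byte - 0x80]:02X}")
```

With these codecs, 0x96 (an en dash in WIN1250) comes out as zero because
LATIN2 has no such character, which matches the error above.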

I can understand the reluctance to back-patch; nobody likes their
application suddenly failing after a minor database upgrade.

However, the people whose applications would fail if this were back-patched
are the same people who will certainly run into trouble anyway if they
a) upgrade to a release where this is fixed or
b) try to convert their database to, say, UTF8.

The least we should do is put a prominent warning into the release notes
of the first version where this is fixed, along with some guidelines on what
to do (though I am afraid that there is not much more helpful to say than
"If your database encoding is X and data have been entered with client_encoding
Y, fix your data in the old system").

But I think that this fix should be applied to 9.6.
PostgreSQL has a strong reputation for being strict about correct encoding
(not saying that everybody appreciates that), and I think we shouldn't mar
that reputation.

Laurenz Albe

Sent via pgsql-hackers mailing list