Re: [HACKERS] Errors in our encoding conversion tables

Tatsuo Ishii Fri, 27 Nov 2015 17:55:09 -0800

> I wrote:
>> I have not attempted to reverify the files in utils/mb/Unicode against the
>> original Unicode Consortium data, but maybe we ought to do that before
>> taking any further steps here.
> 
> I downloaded the mapping files from unicode.org and attempted to verify
> that the Unicode/*.map files could be reproduced from the stated sources.
> Most of them are okay, but I failed to verify these:
> 
> euc_cn_to_utf8.map    utf8_to_euc_cn.map
> 
> Could not find the reference file GB2312.TXT; it is not at unicode.org
> 
> gb18030_to_utf8.map   utf8_to_gb18030.map
> 
> Could not find the reference file gb-18030-2000.xml, whose origin is
> unstated anyway.
> 
> euc_jp_to_utf8.map    utf8_to_euc_jp.map
> euc_kr_to_utf8.map    utf8_to_euc_kr.map
> johab_to_utf8.map     utf8_to_johab.map
> uhc_to_utf8.map       utf8_to_uhc.map
> 
> These four all have minor to significant differences from what I got by
> running the generation scripts.  See attached diffs.
> 
> utf8_to_sjis.map
> 
> It's very disturbing that this fails to verify when its allegedly inverse
> file does verify; either the script is broken or somebody did sloppy
> manual editing.


Manual editing.

Let me explain why the manual editing is necessary.

One of the most famous problems with Unicode is the "wave dash"
(U+301C). According to the Unicode consortium's Unicode/SJIS map, it
corresponds to 0x8160 in Shift_JIS. Unfortunately this was a mistake
in Unicode: the Shift_JIS glyph and the Unicode glyph are slightly
different. The Unicode one looks like the wave dash used in vertical
writing, rotated by 90 degrees. (Probably they did not understand
Japanese vertical writing at that time.) So later on the Unicode
consortium decided to add another "wave dash" as U+FF5E, which has the
correct "wave dash" glyph. However, since Unicode had already decided
that U+301C corresponds to 0x8160 in Shift_JIS, there is no Shift_JIS
code corresponding to U+FF5E. Unlike Unicode's definition, Microsoft
defines 0x8160 (wave dash) as corresponding to U+FF5E, and this is
widely used in Japan. So I decided to adopt this for "wave dash", i.e.:

0x8160 -> U+FF5E (sjis_to_utf8.map)

U+301C -> 0x8160 (utf8_to_sjis.map)
U+FF5E -> 0x8160 (utf8_to_sjis.map)
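
To illustrate why these two files cannot verify as exact inverses of
each other, here is a minimal Python sketch. The dictionaries are
hypothetical in-memory stand-ins for the .map files and hold only the
wave dash entries listed above; the function name is made up for this
example and is not part of the generation scripts.

```python
# Hypothetical stand-ins for sjis_to_utf8.map and utf8_to_sjis.map.
# Only the wave dash entries from this mail are included.
SJIS_TO_UTF8 = {0x8160: 0xFF5E}
UTF8_TO_SJIS = {0x301C: 0x8160, 0xFF5E: 0x8160}


def non_inverse_entries(to_sjis, to_utf8):
    """Return (ucs, sjis, back) triples where decoding does not undo encoding."""
    mismatches = []
    for ucs, sjis in sorted(to_sjis.items()):
        back = to_utf8.get(sjis)  # None if the SJIS code has no reverse entry
        if back != ucs:
            mismatches.append((ucs, sjis, back))
    return mismatches


for ucs, sjis, back in non_inverse_entries(UTF8_TO_SJIS, SJIS_TO_UTF8):
    back_text = "unmapped" if back is None else f"U+{back:04X}"
    print(f"U+{ucs:04X} -> 0x{sjis:04X} -> {back_text}")
# prints: U+301C -> 0x8160 -> U+FF5E
```

U+301C is accepted on input but comes back as U+FF5E, so a check that
expects one file to be the strict inverse of the other will always
flag this entry.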

Another problem is vendor extensions.

There are several standards for SJIS and EUC_JP in Japan. There is a
standard "Shift_JIS" defined by the Japanese government (the Unicode
consortium's map is probably based on this, but I need to
verify). However, several major vendors, including IBM and NEC, added
their own additional characters to Shift_JIS, and these are widely
used in Japan. Unfortunately they are not compatible with each other.
So as a compromise, I and other developers decided to "merge" the NEC
and IBM extension parts and add them to Shift_JIS. The same can be
said of EUC_JP.
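
A rough Python sketch of what such a merge involves follows. It is
only an illustration under stated assumptions, not the procedure that
was actually used: the function names are invented, the three
extension entries are examples as I recall them from CP932 and should
be checked against the vendor tables, and the rule for choosing which
code wins in the reverse direction is an assumption.

```python
# Illustrative entries only (assumed values, to be checked against the
# vendor tables). Keys are SJIS codes, values are Unicode code points.
BASE = {0x8160: 0xFF5E}                      # wave dash, as mapped above
NEC_EXT = {0x8740: 0x2460, 0xEEEF: 0x2170}   # assumed NEC extension entries
IBM_EXT = {0xFA40: 0x2170}                   # assumed IBM extension entry


def merge_decode_tables(*tables):
    """Union SJIS->Unicode tables, refusing conflicting definitions."""
    merged = {}
    for table in tables:
        for sjis, ucs in table.items():
            if merged.setdefault(sjis, ucs) != ucs:
                raise ValueError(f"conflict at 0x{sjis:04X}")
    return merged


def build_encode_table(decode_table, preferred=()):
    """Invert the table; duplicates resolve to the lowest SJIS code
    unless a code listed in `preferred` overrides that choice."""
    encode = {}
    for sjis, ucs in sorted(decode_table.items()):
        encode.setdefault(ucs, sjis)
    for sjis in preferred:
        encode[decode_table[sjis]] = sjis
    return encode


decode = merge_decode_tables(BASE, NEC_EXT, IBM_EXT)
encode = build_encode_table(decode, preferred=(0xFA40,))
print(f"U+2170 -> 0x{encode[0x2170]:04X}")
```

Two SJIS codes decode to the same character here, so the reverse table
has to pick one of them. This is another reason the merged files
cannot be regenerated mechanically from a single published source.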

In short, there are a number of reasons we cannot simply import the
consortium's mappings for SJIS (and EUC_JP).

> Anyway, this seems to mean that it's okay to go ahead with fixing the
> encoding conversion discrepancies I complained of yesterday; the data
> those proposed diffs are based on is solid.  But we've evidently got
> a number of other issues with these Far Eastern encodings.
> 
>                       regards, tom lane

