Re: Unicode, character ambiguities

Glenn Maynard Thu, 10 Jan 2002 01:17:20 -0800

On Wed, Jan 09, 2002 at 04:57:29PM +0900, Tomohiro KUBOTA wrote:
> Saying about round-trip compatibility, yes, round-trip compatibility
> for EUC-JP, EUC-KR, Big5, GB2312, GBK are guaranteed, i.e., Unicode
> is a superset of these encodings (character sets).  However,
> (1) there are no authoritative mapping tables between these encodings
>     and Unicode and there are various private mapping tables.  This
>     can cause portability problems around round-trip compatibility.


How major a problem is this in practice, right now?

One temporary solution I could suggest is having specs (in this case,
Ogg tags) choose a specific vendor's translation tables for these, and
saying "until Unicode standardizes these tables, use these, not your
system's."  That would at least (try to) guarantee that until that
happens, if a user enters text on one system in SJIS, and moves it to
another via UTF-8, he'll get the same SJIS output.

The obvious problem is that these interim tables will inevitably stick
around a little while after the tables are standardized, even if the
system vendor
is quick and puts out an update in a week.  I think, however, that some
people just aren't going to update their system (and so will use the
obsolete vendor tables anyway), and the same people that wouldn't update
their system wouldn't update their editors.

When (hopefully not "if") the standardization happens, some users who
are locally using these other encodings (and only transparently using
UTF-8 in the file) will want the file updated, so that the JIS (etc.)
they see is the same as it was before.  With a single known interim
table, that becomes easier: add an "upgrade transcoding" option or
similar for the encoding that's being used.  It couldn't be done
automatically, unless the tags carried a flag saying that the temporary
translation table was in use, with that flag removed and deprecated once
the standard tables take over.
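A minimal sketch of what such an "upgrade transcoding" step might look
like, assuming the tags carry a flag.  Everything named here is
hypothetical: the X_INTERIM_TABLE field is invented for illustration, and
"cp932" / "shift_jis" are only stand-ins for the interim and the eventual
standard table.

```python
INTERIM = "cp932"           # stand-in for the interim vendor table
STANDARD = "shift_jis"      # stand-in for the eventual standard table
MARKER = "X_INTERIM_TABLE"  # hypothetical flag field, not a real tag

def upgrade_tags(tags):
    """Return a copy of tags re-expressed under the STANDARD table."""
    if tags.get(MARKER) != INTERIM:
        return dict(tags)   # no flag: nothing we can safely do
    upgraded = {}
    for key, value in tags.items():
        if key == MARKER:
            continue        # drop the flag once the upgrade is done
        try:
            # Recover the bytes the user originally typed, then map them
            # to Unicode again with the new table.
            upgraded[key] = value.encode(INTERIM).decode(STANDARD)
        except UnicodeError:
            upgraded[key] = value  # not representable: leave it alone
    return upgraded

old = {"TITLE": "A\uff5eB", MARKER: INTERIM}
new = upgrade_tags(old)
```

After the upgrade, the user's legacy view of the tag is byte-for-byte
what it was before, even though the UTF-8 in the file has changed.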

It would mean editors would have to have their own transcoder for these
encodings until this happens.  That could be provided.

I assume only one such table for any given language would be needed.
Presumably JIS<->EUC-JP is well-standardized, so if an interim
Unicode<->JIS is given, Unicode->JIS->EUC-JP could be used to get that,
for example.  What other encodings could be avoided like this?  (I don't
know anything about Chinese or Korean encodings.)
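For what it's worth, here is why only one table should be needed for
Japanese: for JIS X 0208 characters, SJIS and EUC-JP are two arithmetic
packings of the same row/cell number, so going between them needs no
table at all.  A sketch (sjis_to_eucjp is a made-up name; it covers
ASCII, half-width katakana and JIS X 0208 only, not vendor extensions or
JIS X 0212):

```python
def sjis_to_eucjp(data):
    """Convert SJIS bytes to EUC-JP bytes without going through Unicode."""
    out = bytearray()
    i = 0
    while i < len(data):
        b1 = data[i]
        if b1 < 0x80:                      # ASCII range: copied as-is
            out.append(b1)
            i += 1
        elif 0xA1 <= b1 <= 0xDF:           # half-width katakana
            out += bytes([0x8E, b1])       # EUC-JP prefixes it with SS2
            i += 1
        elif 0x81 <= b1 <= 0x9F or 0xE0 <= b1 <= 0xEF:
            if i + 1 >= len(data):
                raise ValueError("truncated double-byte character")
            b2 = data[i + 1]
            if not 0x40 <= b2 <= 0xFC or b2 == 0x7F:
                raise ValueError("invalid trail byte 0x%02X" % b2)
            # Undo the SJIS packing to get a 0-based row and cell.
            lead = b1 - (0xC1 if b1 >= 0xE0 else 0x81)
            trail = b2 - (0x41 if b2 >= 0x80 else 0x40)
            row = lead * 2 + trail // 94
            cell = trail % 94
            out += bytes([0xA1 + row, 0xA1 + cell])
            i += 2
        else:
            raise ValueError("unsupported lead byte 0x%02X" % b1)
    return bytes(out)

print(sjis_to_eucjp(b"\x82\xa0").hex())
```

So an editor would ship one Unicode<->JIS table and derive the EUC-JP
path from it, rather than carrying a second, possibly inconsistent,
Unicode<->EUC-JP table.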

Does anyone have any reasons why this would be a really bad idea?

If not, does anyone have any suggestions of tables to use for different
encodings?  It'd be nice to use ones that are likely to be as close as
possible to whatever becomes the eventual standard, but that might be
an impossible goal.

What other encodings (besides C, J and K ones) would need this?

-- 
Glenn Maynard
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
