On Sun, Jan 13, 2002 at 08:26:57PM -0600, David Starner wrote:
> Is ISO-8859-1 not portable because you can't round trip CP932 through
> it? Why does CP932's lack of definition make Unicode unportable? People
> already pound Unicode for compromises with older systems; one more won't
> make people love it.

Um, ISO-8859-1 is completely irrelevant.  It doesn't claim to be a charset
for Japanese users; Unicode does.  If I take a CP932 document, convert
it to Unicode, and then back to CP932, I'd better get exactly what I
started with, or we don't have round-trip compatibility.  That had
better work across systems, too. This has nothing to do with any compromise
on Unicode's part; it's merely a matter of defining a table and using it.
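
To make "round-trip compatibility" concrete, here's a quick sketch in
Python (the codec names are Python's, and round_trips is just an
illustrative helper, not anything standard):

    def round_trips(data: bytes, codec: str) -> bool:
        """True if bytes -> Unicode -> bytes reproduces the input exactly."""
        try:
            return data.decode(codec).encode(codec) == data
        except UnicodeError:
            return False

    print(round_trips("円相場".encode("cp932"), "cp932"))  # True: same table
    # 0x81 0x60 round-trips under either table, but the character you pass
    # through differs depending on which table did the decoding:
    print(round_trips(b"\x81\x60", "shift_jis"))  # True (via U+301C)
    print(round_trips(b"\x81\x60", "cp932"))      # True (via U+FF5E)

Same-table round trips are the easy part; the hard part is getting every
system that touches the text to pick the same table.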

(Incidentally, if programmers consistently distinguish CP932 from
Shift-JIS, this isn't a problem for that particular codeset; since it's
MS's charset, using MS's table is fine.  This is a problem for all of
the CJK encodings, not just CP932, however.  In practice, many Japanese
programmers may not know the difference and use a Shift-JIS translation.
Also, making sure all of the original CCS mappings line up is probably
more important, so that going from CP932 to Unicode to EUC-JP and back
to CP932 gives you back exactly the bytes you started with.)
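
A concrete case of tables not lining up: the two bytes 0x81 0x60 (the
JIS X 0208 wave dash) decode to U+301C under the JIS-based table but to
U+FF5E under MS's table, so any chain that mixes the two tables loses
the round trip.  With Python's stock codecs (other conversion libraries
may behave differently):

    b = b"\x81\x60"                         # JIS X 0208 wave dash
    print(hex(ord(b.decode("shift_jis"))))  # 0x301c  WAVE DASH
    print(hex(ord(b.decode("cp932"))))      # 0xff5e  FULLWIDTH TILDE

    # Mixing the tables breaks the round trip outright:
    for src, dst in [("shift_jis", "cp932"), ("cp932", "euc_jp")]:
        try:
            b.decode(src).encode(dst)
        except UnicodeEncodeError:
            print(src, "-> Unicode ->", dst, "has no mapping; trip broken")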

> People are going to use whatever translation tables their system happens
> to use. Some systems are going to translate all strings to UTF-8 as
> standard practice - Java based systems, for example, and Gnome looks
> like it's heading that way. Others just aren't going to be interested in
> messing around with it - ANSIToUnicode, or iconv, or whatever the
> library call is already does it, why are they going to rewrite the
> wheel? 

The threat is that, if portable round-trip conversions aren't available,
some users (programmers) who value round-trip compatibility more than
Unicode will break the spec and dump native charsets into the files.
(This *did* happen with ID3 tags; this isn't a made-up threat.)  That's
probably the single worst-case scenario, and it must be avoided.

> What was your solution? I got that you expected systems to display the
> backslash as the yen sign under certain conditions. Right?

At one point; that doesn't really do anything to help the conversion
problems, though.  I've yet to see a reasonable solution that does.

Luckily, this doesn't affect Ogg, nor does it affect any file format or
protocol that doesn't treat \ as special; map 0x5C to U+00A5 and be done
with it.
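
For such formats the rule is a one-line post-processing step.  A sketch,
assuming CP932 input in which a lone 0x5C always means the yen sign (the
helper names are mine):

    def decode_cp932_yen(data: bytes) -> str:
        # Decode first: a 0x5C trail byte inside a double-byte character
        # is consumed by the codec, so any U+005C left in the result came
        # from a lone 0x5C byte.  Then apply the 0x5C -> U+00A5 rule.
        return data.decode("cp932").replace("\u005c", "\u00a5")

    def encode_cp932_yen(text: str) -> bytes:
        # Reverse the substitution before encoding; CP932's own table has
        # no mapping for U+00A5.
        return text.replace("\u00a5", "\u005c").encode("cp932")

    data = b"\x5c1,000"                     # yen sign, then "1,000"
    assert encode_cp932_yen(decode_cp932_yen(data)) == data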

> > It doesn't matter who's "fault" it is 
> 
> Actually, it does. Part of Unicode's success is that it's a simpler

It doesn't matter whose (oops) fault it is.  Whether it was MS's fault,
Unicode's fault, JIS X 0201 Roman's fault or Santa's fault, the end
result is the same, and it still needs a solution.

> solution than dealing with dozens of charsets. If you import the bugs of
> dozens of charsets into Unicode, it loses part of that. 
> 
> Yes, Unicode should offer a unified translation table. Barring that, the
> tables available at http://www.w3.org/TR/japanese-xml/ could be
> referenced - accepting that some systems won't or can't follow the
> recommendations. But importing the quirks and problems of other charsets
> (separate from those inherent in the script) into Unicode won't help
> things in the long run.

Like I said, I'd definitely suggest using an existing table rather than
making one up from scratch; the latter *would* exacerbate the problem.

Thanks for the link, by the way.  (Unfortunately, it leaves a lot of
things undefined; it lists ambiguities but doesn't seem to suggest
solutions.)

-- 
Glenn Maynard