I'm not even certain where the conversation is now; there are two
distinct issues: (1) handling of CP932 0x5C, and (2) portable
translation tables.  (These only partially overlap.)  Since one of your
mail readers doesn't honor References, the threads get broken and are
much harder to follow.  So if I mix responses to these issues, let me
know.

On Sun, Jan 13, 2002 at 06:06:11PM -0600, David Starner wrote:
> > Because that's not portable.  Read
> > http://www.debian.or.jp/~kubota/unicode-symbols.html.
> 
> I know the problem. It still doesn't mean that every file format that
> includes Unicode should define its own solution.

So we should sit back, accept Unicode as nonportable, and provide things
like RFC2047 so people can use other encodings?  No thanks.

And if we simply say "use UTF-8", and people use whatever translation
tables their system happens to provide, then it's a lot harder to fix
things if and when Unicode standardizes the mappings.  If the file
format mandates a specific set of translation tables, then as long as
you can tell that a file was written with the old tables, you can
convert it to the new ones automatically.  If it doesn't do that, the
file might have been converted with *any* table, and it's effectively
impossible to fix existing data.

And file formats aren't going to sit unused until Unicode fixes the
portability problems, especially since it's not even clear that Unicode
intends to fix them at all.

> Yes? The main difference I see between my solution and yours is that
> yours introduces "intelligent" parsers into every Unicode system,
> whereas mine deals with it in one place, where the conversion from
> CP932 happens.

I'm not advocating "intelligent" parsers at all.  (In fact, all of the
suggested solutions have their problems; I believe this particular
suggestion has by far the most.)

> Every application has to special-case it under your solution, too.
> Under mine, only systems that plan to deal with CP932 have to
> special-case it, and that code will eventually be removable.

Nope.  Using a specific set of translation tables merely means changing
your iconv() call to one that uses those tables.  Using "intelligent
parsers" means you need a different parser for each data type, so you
can't use a simple interface like that.
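
To put that in code: the table choice lives entirely in one call.  A
rough sketch (untested; whether a given iconv's CP932 table sends 0x5C
to U+005C or to U+00A5 varies by implementation, which is precisely
what a format-mandated table would pin down):

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char in[] = "C:\x5c";    /* CP932; 0x5C is the contested byte */
        char out[16];
        char *inp = in, *outp = out;
        size_t inleft = strlen(in), outleft = sizeof(out);

        /* Swapping in a different table set means opening a different
           converter here - no new parser anywhere. */
        iconv_t cd = iconv_open("UTF-8", "CP932");
        if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
            perror("iconv");
        iconv_close(cd);

        for (char *p = out; p < outp; p++)
            printf("%02x ", (unsigned char)*p);
        printf("\n");       /* "43 3a 5c" or "43 3a c2 a5", by table */
        return 0;
    }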

> Apparently they have a hard time coexisting - poor semantics that are
> CP932's fault, not Unicode's. I don't see how transferring that bug to
> Unicode will help things in the long run.

It doesn't matter whose "fault" it is (I believe it would be JIS X 0201
Roman's, which Tomohiro said is where CP932 got its 0x5C).  CP932 is in
heavy use, and it needs to be dealt with.
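
For anyone who hasn't run into it: the trouble is that the one byte
plays two roles, so either Unicode target mangles one of its uses.  A
trivial illustration (the strings are made up, but the byte values are
real):

    #include <stdio.h>

    int main(void)
    {
        /* CP932 0x5C is the path separator... */
        const char *path  = "C:\x5cWINDOWS";
        /* ...but in Japanese fonts the same byte displays as the yen
           sign, e.g. in a price: */
        const char *price = "\x5c" "100";   /* "(yen)100" */
        printf("%s\n%s\n", path, price);
        return 0;
    }

Map it to U+005C and prices render with a backslash on non-Japanese
systems; map it to U+00A5 and the paths break.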

> ISO646-DE users did it. So did ISO646-DK, ISO646-ES and all the rest of
> the 7-bit codes. Why is it so different for CP932?

Considering that ISO646-DE puts a character at 0x5C that is used as
part of words (unlike CP932), I'd suspect the situation is different.
(It's one thing to be unable to use yen symbols in filenames on
Windows; it's quite another to lose a letter out of ordinary words.)  I
don't know anything about their use; perhaps someone who does can
enlighten us.

-- 
Glenn Maynard
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
