Re: Unicode, character ambiguities

Tomohiro KUBOTA Fri, 11 Jan 2002 23:48:54 -0800

Hi,

At Sat, 12 Jan 2002 02:10:59 -0500,
Glenn Maynard wrote:


> At least I now have an idea of *why* their Unicode fonts are like this.
> (Previously, I had no idea at all.)  Since CP932 users expect 0x5C to be
> a yen symbol, and Windows tables map 0x5C to U+005C, U+005C needed to be
> a yen symbol, too, since their font system probably converts everything
> displayed to Unicode.  (Not that this is an excuse; but it's nice to
> know *why* people do really annoying things.)

Mostly right.  You assume that the CP932 table was defined first
and the Windows U+005C glyph was decided afterwards.  I imagine that
is not true.  CP932 is Microsoft's name for Shift_JIS, and Microsoft
could decide _both_ the CP932 mapping table _and_ the U+005C glyph
in Windows.


> A couple tables map 0x81 0x5F to halfwidth backslash.  Are those the
> coding systems that don't have a halfwidth backslash?

Please be careful about the difference between a coded character set
and an encoding.  Shift_JIS and EUC-JP are encodings, while JIS X 0201
and JIS X 0208 are coded character sets.  What we actually use is an
encoding, and an encoding consists of one or more coded character sets.

EUC-JP = ASCII + JIS X 0208 ( + JIS X 0201 Kana + JIS X 0212)
Shift_JIS = JIS X 0201 Roman + JIS X 0201 Kana + JIS X 0208
CP932 = JIS X 0201 Roman + JIS X 0201 Kana + JIS X 0208 
         + Microsoft private extension Kanji.
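
To make the composition concrete, here is a minimal sketch that splits
a plain Shift_JIS byte string by the coded character set each byte (or
byte pair) belongs to.  The function name classify_shift_jis is my own,
the split is by byte range only (it does not check whether a code is
actually assigned), and the CP932 extensions are deliberately left out.

```python
def classify_shift_jis(data):
    """Split Shift_JIS bytes into (coded character set, raw bytes) pairs.

    Classification is by byte range only; CP932 extensions are not handled.
    """
    result = []
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F:
            # Single byte, JIS X 0201 Roman (0x5C is the yen sign here).
            result.append(("JIS X 0201 Roman", data[i:i + 1]))
            i += 1
        elif 0xA1 <= b <= 0xDF:
            # Single byte, JIS X 0201 Kana (halfwidth katakana).
            result.append(("JIS X 0201 Kana", data[i:i + 1]))
            i += 1
        elif 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF:
            # Lead byte of a two-byte JIS X 0208 character.
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte sequence")
            t = data[i + 1]
            if not (0x40 <= t <= 0x7E or 0x80 <= t <= 0xFC):
                raise ValueError("invalid trail byte 0x%02X" % t)
            result.append(("JIS X 0208", data[i:i + 2]))
            i += 2
        else:
            raise ValueError("byte 0x%02X is outside plain Shift_JIS" % b)
    return result


if __name__ == "__main__":
    for charset, raw in classify_shift_jis(b"A\x5c\xb1\x81\x5f"):
        print(charset, raw.hex())
```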

JIS X 0201 Roman is the Japanese version of ISO 646, which is almost
the same as ASCII except that 0x5C is the yen sign.

Thus, Shift_JIS doesn't have a halfwidth backslash, and neither does
CP932.  (So people who use only Japanese Windows won't mind that
Windows doesn't have any halfwidth backslash glyph.)
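
As a rough illustration of the ambiguity, the sketch below reads the
same bytes in two ways.  decode_as_jisx0201_roman is a helper I made up
for this mail; it follows JIS X 0201 Roman (0x5C is the yen sign, and
0x7E is the overline).  The second reading uses Python's built-in
"cp932" codec, which sends 0x5C to U+005C like the Windows table you
mentioned.

```python
def decode_as_jisx0201_roman(data):
    """Decode single bytes as JIS X 0201 Roman (hypothetical helper)."""
    chars = []
    for b in data:
        if b > 0x7F:
            raise ValueError("byte 0x%02X is not JIS X 0201 Roman" % b)
        if b == 0x5C:
            chars.append("\u00a5")  # YEN SIGN
        elif b == 0x7E:
            chars.append("\u203e")  # OVERLINE
        else:
            chars.append(chr(b))    # identical to ASCII elsewhere
    return "".join(chars)


path = b"C:\x5cWINDOWS"

# Reading 1: the coded character set says 0x5C is a yen sign.
as_jis = decode_as_jisx0201_roman(path)

# Reading 2: the Windows-style table keeps 0x5C at U+005C.
as_windows = path.decode("cp932")

if __name__ == "__main__":
    print([hex(ord(c)) for c in as_jis])
    print([hex(ord(c)) for c in as_windows])
```

The bytes alone do not say which reading the writer intended, so a
converter has to pick one of the two for every 0x5C it sees.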


> I think the only solution I've seen that can *work* for everybody, and
> doesn't have any showstoppers (that I can see), is your own suggestion
> of giving up and making backslash and yen two glyphs of U+005C.  I can
> see a few problems with that, but they're all within the bounds of
> compromise.  (And the bounds for this particular problem are very large ...)

Do you mean using a Variation Selector?  I think it is an
interesting suggestion and a good compromise.  However,
(1) the problem that a Windows CP932 text file cannot be
    transcoded into Unicode automatically is not solved.
(2) I imagine a Variation Selector would always be needed to
    get U+005C as a yen sign.  I don't think Microsoft will
    accept this.
(3) a glyph-selecting mechanism has to be implemented.  However,
    I think a mechanism for selecting a glyph for one codepoint
    is needed anyway, to specify a particular glyph of a Han
    ideograph.  If this yen-sign problem urges developers to
    implement the mechanism, CJK people will be happy to be
    able to use glyph selection for Han ideographs.
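
To show what such a mechanism would have to do, here is a minimal
sketch.  Everything in it is an assumption for illustration: as far as
I know no variation sequence is defined for U+005C, so the pairing of
U+005C with VARIATION SELECTOR-1 (U+FE00), the GLYPHS table, and the
select_glyphs function are all hypothetical.

```python
VS1 = "\ufe00"  # VARIATION SELECTOR-1

# Hypothetical glyph table: bare U+005C is a backslash, and
# U+005C followed by VS1 asks for the yen-sign glyph.
GLYPHS = {
    ("\\", None): "backslash",
    ("\\", VS1): "yen sign",
}


def select_glyphs(text):
    """Return one glyph name per base character, honouring selectors."""
    glyphs = []
    i = 0
    while i < len(text):
        base = text[i]
        selector = None
        # A variation selector (U+FE00..U+FE0F) applies to the
        # character immediately before it.
        if i + 1 < len(text) and "\ufe00" <= text[i + 1] <= "\ufe0f":
            selector = text[i + 1]
            i += 1
        i += 1
        # Unknown sequences fall back to the default glyph, and
        # characters without an entry are returned unchanged.
        default = GLYPHS.get((base, None), base)
        glyphs.append(GLYPHS.get((base, selector), default))
    return glyphs


if __name__ == "__main__":
    print(select_glyphs("C:\\" + VS1 + "x"))
    print(select_glyphs("C:\\x"))
```

In this sketch the yen sign appears only when the selector is present,
which is the situation described in (2).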

Note that the existence of problems doesn't mean the idea is bad,
because no idea is free of problems.  We have to seek a better
compromise and a smaller nightmare, not a perfect solution, which
cannot exist.


> By the way, you might want to update the links on
> http://www.debian.or.jp/~kubota/unicode-symbols.html.  While the nature
> of the problems you list is different, with Unicode obsoleting their own
> tables, it's still very useful information.

Yes, I think the mapping tables are useful and the Unicode Consortium
should not obsolete them without defining a new authoritative mapping
table, just as I wrote in the document.

---
Tomohiro KUBOTA <[EMAIL PROTECTED]>
http://www.debian.or.jp/~kubota/
"Introduction to I18N"  http://www.debian.org/doc/manuals/intro-i18n/
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
