> -----Original Message-----
> From: Tomohiro KUBOTA [mailto:[EMAIL PROTECTED]]
> Sent: Saturday, February 03, 2001 2:19 AM
> To: [EMAIL PROTECTED]
> Subject: RE: XTerm and ISO-2022
> 
> 
> Hi,
> 
> At Fri, 2 Feb 2001 18:09:10 +0100 ,
> Karlsson Kent - keka <[EMAIL PROTECTED]> wrote:
> 
> >>'mojibake' - broken characters caused by an encoding mismatch.
> >
> > I'm not sure what you mean.  Most of the "broken characters"
> > in 10646/Unicode come from East Asian legacy encodings, and
> > there are even more in Unicode 3.1 because of recent additions
> > of Han compatibility characters going into 10646-2.
> 
> Imagine you invoke xterm in UTF-8 mode and run an 8-bit application.
> This is mojibake.  It is not related to a character shortage;
> it is, as I wrote, an encoding mismatch.

This has nothing to do with Unicode per se.  It is impossible for
Unicode to share an encoding with every other character set, since
the existing encodings already conflict over how to interpret a given
byte or byte sequence.  Your problem here is not related to Unicode at all.
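
To make the mismatch concrete, here is a minimal sketch (Python 3 is
used purely for illustration; the effect is the same for any UTF-8
terminal that receives legacy 8-bit output):

    # Mojibake sketch: bytes written under one encoding, read under another.
    text = "\u65e5\u672c\u8a9e"                   # "Japanese language", in Japanese
    raw = text.encode("euc_jp")                   # what an 8-bit EUC-JP application writes
    print(raw.decode("utf-8", errors="replace"))  # what a UTF-8 terminal shows: mojibake
    print(raw.decode("latin-1"))                  # what a Latin-1 terminal shows: different mojibake

The bytes are identical in all three cases; only the interpretation
differs.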


> > Originally, Unicode was intended only for characters in active use
> > today, but the scope has since been extended to cover historic
> > characters as well.  That is why about 43 000 new Han characters are
> > going into Unicode/10646, still under the same unification principles
> > as for the BMP.  Most are collected from various dictionaries;
> > only a smaller number are compatibility Han (Kanji) characters
> > (insisted on by Japan).
> 
> You misunderstand what I mean.  That is natural, since you don't
> know the CJK languages.  I am not talking about historic ideographs.
> I have to explain more.

No, that's not it.  What I was saying was that the reason for the
unification was NOT that everything had to "fit in a 16-bit encoding",
but rather that the unification had (and still has) merits in itself.

> In Unicode, CJK characters with the same meaning and a similar shape
> are unified.  For example, U+9AA8 (ideograph 'bone') unifies 0x3947 from
> GB2312 (Mainland China), 0x586C from CNS11643-1 (Taiwan), 0x397C from
> JISX0208 (Japan), and 0x4D69 from KSX1001 (Korea).  However, though
> these characters share a common origin, today they have different
> shapes, and CJK readers cannot tolerate the difference.  Note that none
> of these characters is historic; all are in daily use.  Also note that
> no future extension can fix this problem, because codepoints already
> assigned in Unicode will not be changed.  (And if they were changed,
> confusion would result.)

The unification rules and the actual unifications have been developed
by experts on Han ideographs, experts from Japan, China, Taiwan,
and Korea.
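
Incidentally, the mapping data quoted above is easy to check with any
Unicode-aware toolkit.  A minimal Python sketch follows; note that
Python's codecs produce the EUC interchange forms, which set the high
bit on each byte of the GB/JIS/KS code values cited above, and that
there is no CNS 11643 codec, so Big5 stands in for the Taiwanese set
here:

    # Sketch: one unified codepoint, four East Asian legacy encodings.
    bone = "\u9aa8"                     # CJK unified ideograph 'bone'
    for codec in ("gb2312", "euc_jp", "euc_kr", "big5"):
        print(codec, bone.encode(codec).hex())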

----------------

> At Sat, 03 Feb 2001 16:22:17 +0000,
> Markus Kuhn <[EMAIL PROTECTED]> wrote:
> 
> > The ISO 10646-1:2000 standard prints the entire CJK collection in 5
> > different ideographic fonts and comes with an appendix that documents
> > the principle and rationale behind the CJK unification in detail.
> 
> And so what?  If I read it, will every Japanese person then be able
> to read the Chinese shape of U+76F4?  Will Japanese elementary schools
> change their policy to also teach the Chinese and Korean forms?

Nobody has requested that.  What makes you think there has been
such a request?  Nobody is asking schools to start using
Fraktur letters in their teaching either.  Yet many of the Latin
characters in Unicode can be used to encode text to be typeset
in Fraktur (which I admit I cannot read, regardless of language).

> It is an
> undeniable fact that a part of Han Unification is unacceptable for
> average Japanese people, regardless of the principle and rationale
> of Han Unification.

So, you are saying that no matter what anyone tells you, you are
dead set on the unfounded opinion that ISO 2022 is better than
10646/Unicode?


----------------

> > I write in the same document (my thesis) English text in a Roman font,
> > Latin insets in italic, and computer source code in a Courier font.
> > I do this multilingual processing in ASCII, and this is really
> > *EXACTLY* the same as a Japanese/Chinese multilingual text.
> 
> Again and again I have to say that the difference between Chinese and
> Japanese characters is not like the difference between Roman and italic
> glyphs, but like the difference between Latin and Greek characters.

Obviously, experts on the Han ideographic writing system disagree
with you.  Experts from Japan, China, ...

> And it is not geeks but all average Japanese people who cannot read
> the Chinese version of U+76F4.  CJK Han Unification is a problem for
> average Japanese people, who don't know about encodings.  They
> just read Japanese or Chinese texts, just feel the characters look
> funny, and just don't buy such products.  It is also a problem for
> foreigners who want to study the Japanese language: they may learn
> the wrong character shapes.
> 
> Furthermore, the complaint about CJK Han Unification is NOT related
> to ISO-2022.  Unicode could have avoided unifying so many CJK
> ideographs simply by assigning a larger code space when the initial
> Unicode was developed.

As I said, the size of the code space had nothing to do with what
happened.  Rather, it was the other way around ("oh look, it still
fits!" ;-).

> Yes, I insisted that ISO-2022 is a solution to CJK Han Unification, but
> it is a fact that we need a solution even if we don't use ISO-2022.

For plain text, you should select the font of your preference.
(I, for one, would opt to NOT use a Fraktur font for plain text...)

For marked-up text (which has language tagging; see the HTML 4
specification), either use a set of fonts per your preference or
let the system choose among "Chinese" fonts for "Chinese" subtexts,
"Japanese" fonts for "Japanese" subtexts, etc.  Later versions of both
Internet Explorer and Netscape 6 attempt such "intelligent" font
switching (maybe not on Linux...).  I don't know how good they are at
actually matching the reader's expectations.
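
As a concrete sketch of the marked-up case: the lang attribute is
plain HTML 4, while the Python wrapper and the file name are only
there to produce a small test page, and the actual font choice of
course depends on the browser:

    # Sketch: the same codepoint, U+76F4, tagged as Japanese and as
    # Chinese, so that a language-aware browser may pick different fonts.
    page = """<html><body>
    <p lang="ja">&#x76F4; (tagged as Japanese)</p>
    <p lang="zh">&#x76F4; (tagged as Chinese)</p>
    </body></html>"""
    with open("u76f4-test.html", "w") as f:   # hypothetical file name
        f.write(page)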

Note that ISO-2022 does NOT distinguish between the same character
encoded in different coding systems; it just allows you to switch
between the coding systems.  It does not allow you to switch between
fonts (though some use the choice of coding system as a mock language
tag when proper language tagging is missing for a web page, even
though a page can carry only one IANA-registered encoding, and then
let that mock language tag affect the font selection).
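
For the curious, that switching is visible in the byte stream itself.
A minimal Python sketch using the iso2022_jp codec (the designator
escape sequences bracket the JIS X 0208 portion; no font or language
information is carried):

    # Sketch: ISO-2022-JP announces each coding system with an escape sequence.
    s = "A\u9aa8B"                  # ASCII, one JIS X 0208 ideograph, ASCII again
    raw = s.encode("iso2022_jp")
    print(raw)                      # ESC $ B ... ESC ( B around the ideograph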


        /kent k
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
