There are, of course, problems in Unicode; that is unavoidable in a project this complex. Some were introduced for compatibility with the host of legacy code pages in the world (by our count in ICU, well over 700 unique code pages in current use); others could have been avoided had we known then what we know now.
Unification is not one of them, however. It is not so much a feature of Unicode as a feature of human writing. We could have chosen to deunify every language's characters, even every dialect's: a Swedish 'a' would be different from a French 'a', different from an English 'a', even different from a Yorkshire 'a' or a New York 'a'. After all, that would allow one to detect different languages, and to sort or match differently on that basis. For that matter, we could have chosen to deunify fonts and styles: bold 'u' from italic, Helvetica from Times New Roman. But what a huge mess it would be. Rather than a relatively small number of confusable characters, we would have essentially all characters be confusable. Applications would have to deal with a tremendous increase in the number of characters, drastically increasing the storage needed for the necessary character properties, and the visual confusion among so many characters would cause an incredible number of problems for users. That is much too high a price to pay, on balance, compared with handling matching issues in a simpler representation.

The TC/SC problem can be dealt with by registering a small number of additional names, little different in kind from registering both theatre.com and theater.com, or aarborg.com and a<ring>rborg.com. While in theory someone could have a 5-character name with 32 possible TC/SC combinations (two choices per character, so 2^5 = 32; see the sketch at the end of this message), in practice nobody has shown this to be a real problem, or even provided any evidence that clients will in fact be confused.

Mark

—————
πόλλ' ἠπίστατο ἔργα, κακῶς δ' ἠπίστατο πάντα — Ὁμήρου Μαργίτῃ
["He knew many things, but he knew them all badly." (Homer, Margites)]
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
http://www.macchiato.com

----- Original Message -----
From: "Patrik Fältström" <[EMAIL PROTECTED]>
To: "YangWoo Ko" <[EMAIL PROTECTED]>; "IETF-IDN" <[EMAIL PROTECTED]>
Sent: Wednesday, January 23, 2002 05:18
Subject: Re: [idn] Prohibit CDN code points

> --On 2002-01-23 21.47 +0900 YangWoo Ko <[EMAIL PROTECTED]> wrote:
>
> > Your last statement does not exactly describe the TC/SC issue. The
> > following may explain the TC/SC issue better:
> >
> > "If one enters a string in Unicode, one may or may not know whether TC
> > or SC was used. It depends both on the language one has in mind when
> > entering that string and on one's knowledge of the characters."
>
> Correct.
>
> > Dear all members,
> >
> > What about having additional prefix(es) for extensions like the TC/SC
> > issue? For example, az-- for normal IDNA and bz-- for Chinese-extension
> > IDNA, and so forth. It could serve as context information or a language
> > tag.
>
> How do you match between one string which uses az--<foo>.com and one
> which uses bz--<foo>.com, where "<foo>" stands for the term "foo" in
> encoded form?
>
> And yes, you can do this, but as I have pointed out before, it means
> that every server needs to know about all matching algorithms.
>
> I.e., if you open the box of "problems" with Unicode, you will find that
> the SC/TC problem is only one of them. Only one. I would guess we have
> some 20-30 other problems which are similar to SC/TC, i.e. problems
> caused by unification or non-unification in Unicode.
>
> So you will see an explosion of matching rules.
>
> The reason we see this about SC/TC is that it happens to be the problem
> space we are discussing at the moment. We could just as well discuss the
> problems with a-diaeresis in the countries and languages which use it.
>
> My conclusion is the same: every server needs to have knowledge of how
> to handle all encodings.
>
> paf
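The sketch referenced above: each character that has a TC/SC variant doubles the number of possible spellings, so a five-character label can have up to 2**5 = 32 spellings. This is only a toy illustration, not anything from the thread itself: the variant table is a hand-picked stand-in (real TC/SC data lives in sources such as the Unihan database, where some mappings are one-to-many), and the function tcsc_variants is invented for this example.

    from itertools import product

    # Toy table: traditional character -> simplified counterpart.
    # Real data (e.g. Unihan) is far larger and not always one-to-one.
    TC_TO_SC = {"書": "书", "龍": "龙", "門": "门", "馬": "马", "雲": "云"}

    def tcsc_variants(label):
        # Collect the possible spellings of each character (itself, plus
        # its simplified form if one exists), then take the Cartesian
        # product to enumerate every TC/SC mixture of the whole label.
        choices = [(ch, TC_TO_SC[ch]) if ch in TC_TO_SC else (ch,)
                   for ch in label]
        return ["".join(combo) for combo in product(*choices)]

    variants = tcsc_variants("書龍門馬雲")  # every character has a variant
    print(len(variants))  # 2**5 = 32

In practice only a few of these 32 mixtures are plausible registrations (typically the all-TC and all-SC forms), which is the sense in which "a small number of additional names" suffices.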
