Tomohiro KUBOTA:

> Note that browsers cannot be free from "state" even if they use Unicode.

Not entirely, but Unicode makes it a whole lot easier, at least.

> Thus, though it is true ISO-2022 is very complex, please note Unicode
> is not so simple. If Unicode were less simpler than human natural
> languages, it means that Unicode has defects.

You are comparing apples and oranges here. The display issues you describe for Unicode you (can) get with ISO-2022 as well. The problem here is the recognition code. In ISO-2022 it is very hard to re-sync if you get lost (transmission failure or broken pages); with UTF-8 it is very easy.

> Never. Before appearance of Unicode, these encodings were identical,

Well, there is an (almost) 1-to-1 mapping between Shift-JIS, ISO-2022-JP and EUC-JP, but they all rely on partly undefined parts of the JIS X 0208 standard, and on the fonts that displayed the text having exactly those 0208 extensions that the text was using. You have the IBM extensions, the NEC extensions and some other, incompatible, extensions. For your text to be transmitted correctly you needed to make sure your recipient not only understood your encoding (Shift-JIS) but also had a 0208 font with the correct extensions. This causes a lot of problems.

> For example, Shift_JIS and CP932 is identical if we don't think about
> conversion to/from Unicode.

You're comparing apples and oranges again. Shift-JIS is an encoding of JIS X 0201 and JIS X 0208, which has been (mis)used to encode the vendor extensions, which are not necessarily compatible between computers. CP932 defines another character set (which is based on JIS X 0201 and JIS X 0208, but is not identical to them), so there you do not have the problem, because the extensions are already predefined.

> Most Japanese people even don't know the name of "CP932" and they
> think they are using Shift_JIS.

This is very similar to the western European problem where people say that they are using ISO 8859-1, but are actually using the Windows extensions in CP1252.
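
To make the Shift_JIS/CP932 and ISO 8859-1/CP1252 point concrete, here is a minimal Python sketch. It assumes the codecs that ship with CPython (`shift_jis`, `cp932`, `latin-1`, `cp1252`) as stand-ins for "strict JIS X 0208 Shift-JIS" and the Windows code pages; other implementations may draw the lines differently. The helper names `decode_or_none` and `compare` are made up for this illustration.

```python
def decode_or_none(data, codec):
    """Return the decoded text, or None if the bytes are invalid in that codec."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


def compare(data, codecs):
    """Decode the same bytes under several codecs and collect the results."""
    return {codec: decode_or_none(data, codec) for codec in codecs}


# 0x87 0x40 lies in an NEC vendor-extension row: defined in CP932,
# but outside the JIS X 0208 repertoire that Python's shift_jis codec covers.
nec_circled_one = compare(b"\x87\x40", ["shift_jis", "cp932"])

# 0x81 0x60 is valid in both, but maps to different Unicode code points.
wave_dash = compare(b"\x81\x60", ["shift_jis", "cp932"])

# 0x80 is a C1 control character in ISO 8859-1, but the euro sign in CP1252.
euro = compare(b"\x80", ["latin-1", "cp1252"])

for label, result in [("0x8740", nec_circled_one),
                      ("0x8160", wave_dash),
                      ("0x80", euro)]:
    for codec, text in result.items():
        shown = "undecodable" if text is None else "U+%04X" % ord(text)
        print(label, codec, shown)
```

The same byte sequence is either rejected or silently mapped to a different character depending on which "Shift-JIS" the receiver assumes, which is exactly why the label alone is not enough to convert the text reliably.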

The problem is that since "Shift-JIS" has been changed from being an encoding of 0201 and 0208, it is no longer guaranteed which underlying character set it encodes. That makes it *very* hard to figure out what to map it to when you convert to another character encoding, and you get into trouble (except with EUC-JP and ISO-2022-JP, of course, since they can also be used to encode this "unspecified" character set). This is not a problem of Unicode, because Unicode is well-defined, but of the variants of Shift-JIS in use, because they are very badly defined. Even if you add all the extensions that are documented anywhere to your JIS X 0208, there is always "one more character" in use somewhere, just because a popular Japanese font manufacturer added it to their fonts and claimed it to be a 0208 font.

Yes, I have seen these problems, and they are a real mess to implement. This is why I like to move to Unicode, because it is clearly defined what it contains.

If you want the background to why I know so much about Japanese encodings (and Chinese, etc.), consider that I have been working with Opera and its Unicode adaptation for over a year, a year that coincided with our delivery of Opera for QNX for IBM, a delivery that was targeted at Japan and China. So I have had my share of headaches with this. Fortunately, we decided from the start to go with Unicode only (and convert anything that comes in or goes out); I can only imagine the problems we would have had if we hadn't.

-- 
\\// peter - http://www.softwolves.pp.se/

I do not read or respond to mail with HTML attachments.
Statement concerning unsolicited e-mail according to Swedish law:
http://www.softwolves.pp.se/peter/reklampost.html

