On Sat, Jan 12, 2002 at 02:51:39PM +0900, Tomohiro KUBOTA wrote:
> > You have to assume that most Japanese systems will display \ as a Yen symbol,
> > because they will.
>
> Japanese Windows system always displays \ (0x5c) (in CP932,
> or, almost people call this as "Shift JIS") and U+005C with
> Yen Symbol.  However, most Linux/BSD/UNIX systems display
> \ (0x5c) (in EUC-JP, which is the most popular encoding for
> Linux/BSD/UNIX system) and U+005C in backslash even in Japan.

Right.  Unicode U+005C is only a problem on Windows systems, and I would
go so far as to say it should be ignored; it's extremely inconvenient for
an application programmer to fix it at all, since MS's Japanese fonts
don't have a halfwidth backslash at all.  (This isn't the real problem,
it's just a side effect.)

At least I now have an idea of *why* their Unicode fonts are like this.
(Previously, I had no idea at all.)  Since CP932 users expect 0x5C to be
a yen symbol, and Windows tables map 0x5C to U+005C, U+005C needed to be
a yen symbol, too, since their font system probably converts everything
displayed to Unicode.  (Not that this is an excuse; but it's nice to know
*why* people do really annoying things.)

> > Now, translation tables for CP932 on these systems could translate
> > backslash and the yen symbol both to the yen symbol;
>
> What is "both"?  I think you are talking about both of backslash and
> yen symbol.  However, what do you think is the codepoints for them
> in CP932?  Answer: CP932 has the following yen sign and backslash
>
>
> CP932 (Shift JIS)                 Unicode (mapped by CP932 table)
> ------------------------------    -------------------------------
> 0x5C (yen sign)                   U+005C (yen sign glyph in Windows)
> 0x81 0x5F (fullwidth backslash)   U+FF3C (fullwidth backslash)
> 0x81 0x8F (fullwidth yen sign)    U+FFE5 (fullwidth yen sign)

Right; I had originally mixed this up (clarified in the later post.)

0x81 0x8F doesn't seem to be a problem; almost everyone agrees that it
maps to U+FFE5.  A couple tables map 0x81 0x5F to halfwidth backslash.
Are those the coding systems that don't have a halfwidth backslash?

> note that CP932 0x5C (yen sign) is derived from JIS X 0201 and
> CP932 0x81 0x5F and CP932 0x81 0x8F are derived from JIS X 0208.
>
> thus, if you modify CP932 table 0x5C -> U+00A5, it doesn't mean
> breaking round-trip compatibility with CP932.

Right.

> In case of Ogg, I think this can be a solution, because the
> strings are never parsed as filenames.
> However, this cannot
> be a general solution.

Right.  I think the only solution I've seen that can *work* for everybody,
and doesn't have any showstoppers (that I can see), is your own suggestion
of giving up and making backslash and yen two glyphs of U+005C.  I can see
a few problems with that, but they're all within the bounds of compromise.
(And the bounds for this particular problem are very large ...)

By the way, you might want to update the links on
http://www.debian.or.jp/~kubota/unicode-symbols.html.  While the nature of
the problems you list is different, with Unicode obsoleting their own
tables, it's still very useful information.

--
Glenn Maynard

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
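
For reference, the three CP932 mappings in the table quoted in this message can be checked with a short Python sketch. It assumes Python's built-in `cp932` codec follows the Microsoft table (0x5C decoding to U+005C). The names `cp932_samples`, `decoded` and `remapped` are illustrative only and do not come from the thread. The last step merely models the proposed 0x5C -> U+00A5 change on this three-entry table; it is not a real codec.

```python
# Byte sequences from the quoted CP932 table, paired with the code points
# the table says they map to.
cp932_samples = {
    b"\x5c": 0x005C,      # halfwidth: yen sign glyph on Windows, backslash elsewhere
    b"\x81\x5f": 0xFF3C,  # fullwidth backslash
    b"\x81\x8f": 0xFFE5,  # fullwidth yen sign
}

# Decode each sequence with Python's cp932 codec and record the code point.
decoded = {raw: ord(raw.decode("cp932")) for raw in cp932_samples}

for raw, codepoint in decoded.items():
    print(raw.hex(), "-> U+%04X" % codepoint)

# Model the proposed change: 0x5C -> U+00A5. Only the single-byte entry
# (from JIS X 0201) moves; the two double-byte entries (from JIS X 0208)
# are untouched, so the three code points stay distinct.
remapped = {raw: (0x00A5 if raw == b"\x5c" else cp) for raw, cp in decoded.items()}
```

Because the three byte sequences still map to three different code points after the change, the modified table can be inverted without ambiguity, which is the round-trip point made in the quoted text.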
