* Lars Marius Garshol
|
| I've just discovered that it seems that Shift-JIS encodes a number
| of User-Defined Characters in the 0xF040 to 0xFCFC range, and that
| these

* Markus Scherer
|
| Yes, and every implementor may assign characters to them as they see
| fit.

I know. My problem is that people use them in web pages, and I need to
display them the way those people expect.

| The problem being that most likely they are all tagged as
| charset="Shift_JIS", without distinguishing the variant of what's in
| the Shift-JIS encoding. Unreliable tagging is very common. That's
| one good reason why we all advocate Unicode...

Sure, but none of that helps me in any way. People are publishing
these pages and I need to support them "correctly".

| Given how many Windows machines there are, and given that Shift-JIS
| seems to be more popular on Windows than on Unixes, let's look at
| the Shift-JIS<->Unicode mapping table for windows-932:
|
|http://oss.software.ibm.com/cvs/icu/charset/data/xml/windows-932-2000.xml?rev=1.1&content-type=text/x-cvsweb-markup
| (From our collection of mapping tables at
| http://oss.software.ibm.com/icu/charset/)
|
| Shift-JIS F040..F9FC appears to be contiguously and linearly mapped
| to U+E000..U+E757.

I'm afraid that's not what users expect to see. I know it's the right
solution in the general case, but it seems that I need to do whatever
MSIE does, since that is in effect what users expect to see.

| Other Shift-JIS variants from different platforms will use a
| different assignment, but I would try the Windows variant first for
| whatever web page you are looking at. As a receiver, maybe you can
| figure out which platform generated the file, from a <meta> tag or
| an http server identification.

In general that's impossible. The pages will generally not be labeled
at all, and if they are labeled they will be labeled with anything
from "shift-jis" to "x-sjis" or even "iso-8859-1". Some pages even do
"helpful" things like sticking comments with Shift-JIS-typical byte
signatures near the top of the page to help auto-detection, rather
than actually revealing what charset they used.

So while there are many different Shift-JISes, my chances of finding
out which one was used in each case are essentially nil. I need to
find the most common one and support that.

| As a recommendation, if you _have_ to _generate_ Shift-JIS web
| pages, you should avoid UDCs and instead use NCRs (with Unicode
| non-PUA[!] code points).

I'm not generating these pages, I'm trying to display them.

| The W3C has a page about the problems with Japanese charset
| identifiers and mapping tables.

That was a good lead. They give four different tables for Shift-JIS:

 - x-sjis-unicode-0.9
   this is the one I use already

 - x-sjis-jisx0221-1995
   this one I haven't been able to find

 - x-sjis-cp932
   the unicode.org version has the 0xFA40 - 0xFC4B range, an ICU
   version seems to cover the same range, with mapping to the PUA for
   the rest

 - x-sjis-jdk1.1.7
   I found a JDK 1.3 version of this, but it had nothing

I now have some more pieces of the puzzle, but I still don't have all
of them. Or do I? Is it just the 0xFA40 - 0xFC4B range that has real
characters in it?

--Lars M.
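
For reference, the "contiguous and linear" windows-932 mapping quoted
above (Shift-JIS F040..F9FC to U+E000..U+E757) can be sketched as plain
arithmetic. This is only an illustration, not taken from the ICU table
itself: the function name is made up, and it assumes the usual Shift-JIS
trail-byte ranges 0x40..0x7E and 0x80..0xFC (188 trail bytes per lead
byte, skipping 0x7F). Under that assumption, 10 lead bytes times 188
gives 1880 code points, which matches the quoted end point U+E757.

```python
TRAILS_PER_LEAD = 188  # 0x40..0x7E (63 bytes) plus 0x80..0xFC (125 bytes)


def sjis_udc_to_pua(lead: int, trail: int) -> int:
    """Map a Shift-JIS user-defined byte pair to a PUA code point.

    Hypothetical helper: assumes the linear F040..F9FC -> U+E000..U+E757
    layout described in the thread, with 0x7F excluded as a trail byte.
    """
    if not 0xF0 <= lead <= 0xF9:
        raise ValueError(f"lead byte 0x{lead:02X} is outside F0..F9")
    if 0x40 <= trail <= 0x7E:
        offset = trail - 0x40
    elif 0x80 <= trail <= 0xFC:
        offset = trail - 0x41  # close the one-byte gap left by 0x7F
    else:
        raise ValueError(f"trail byte 0x{trail:02X} is not a valid trail byte")
    return 0xE000 + (lead - 0xF0) * TRAILS_PER_LEAD + offset


if __name__ == "__main__":
    for lead, trail in [(0xF0, 0x40), (0xF0, 0x80), (0xF9, 0xFC)]:
        code_point = sjis_udc_to_pua(lead, trail)
        print(f"{lead:02X}{trail:02X} -> U+{code_point:04X}")
```

Note that this covers only the F040..F9FC block. The 0xFA40 - 0xFC4B
range discussed above holds real characters in the cp932 table and
would need an actual lookup table rather than arithmetic.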