Jungshik Shin wrote:
> It's impossible to infer the document encoding from 'lang' tag.
Indeed, yes. I fed the URL posted by jmaiorana to the W3C HTML validator and it could not make any sense of it. Still, when I set Mozilla to 'autodetect Japanese' it correctly found the page to be shift-jis. So it is possible "in a way"; after all, there are many text utilities (for Japanese only) that can guess (or autodetect) encodings.

Aahh.. something now dawns on me: perhaps charset applies to the WHOLE document and must be determined before any processing is done, while lang can apply to individual sections? That would be why Mozilla does not 'trust' lang for determining/autodetecting the encoding: it will (and can) autodetect, but only when told to do so by the user, not by the document. So probably jmaiorana (who said the page displayed correctly) had 'autodetect Japanese' switched ON.

> The value of 'lang' plays a role ONLY after the identity of
> characters in documents are determined. See below.

Right. Yes, this is quite clear to me now (finally!). The Mozilla algorithm is:

1. determine the encoding (for the whole document) from the 'charset' attribute, or by auto-detection as specified by the user;
2. determine the font (for the section concerned, which may be the whole "body") from the 'lang' attribute.

If the attributes are missing there are several fallback options and defaults, but this is the rule in principle. One default seems to be 'the language group is Western'. (A rough model of these two steps is sketched at the end of this message.)

I can put two fragments of Russian in a UTF-8 document, one with no special marking and one with lang=ru. The Western font is Times New Roman (which includes Cyrillic characters), and I set the 'Cyrillic' font to MS Comic (which also has Cyrillic). Then only the lang=ru marked fragment is displayed in MS Comic, the other one in Times New Roman. More or less the same effect as in the second URL posted by jmaiorana. I must still do a few more experiments to find out what the rule is when no lang is specified but the UTF-8 character does not occur in the Western font (and also what rules Xprint uses..).

> BTW, as you know, GB18030 is another UTF so that even without
> resorting to NCRs (&#xhhhh(hh); or &#dddd..;) it can cover the
> full range of Unicode.

No, I did not know this; I had assumed it was one of those Chinese legacy things like eten or big5. Now I Googled a bit and found that it is a Chinese government Unicode standard. What was wrong with UTF-8, one wonders (rhetorical question, I don't really want to know the answer because it is probably very complicated).
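For what it is worth, the "another UTF" claim is easy to check with Python's codecs (just an illustration, nothing specific to Mozilla or to any particular page): every Unicode character, including non-BMP ones, round-trips through GB18030 without any NCRs.

    # GB18030, like UTF-8, can encode any Unicode code point directly,
    # so no numeric character references (&#xhhhh;) are needed.
    for ch in ("я", "\u65e5", "\U0001d11e"):   # Cyrillic, CJK, and a non-BMP character
        gb = ch.encode("gb18030")
        utf8 = ch.encode("utf-8")
        assert gb.decode("gb18030") == ch      # round-trips losslessly
        print(f"U+{ord(ch):04X}  gb18030={gb.hex()}  utf-8={utf8.hex()}")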
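And coming back to the two-step algorithm above, here is the rough model I have in mind, written out as Python. The names, the candidate list and the fallbacks are my own guesses for illustration, not Mozilla's actual internals:

    # Step 1: one encoding for the WHOLE document.
    CANDIDATES_JA = ["shift_jis", "euc_jp", "iso2022_jp", "utf-8"]

    def resolve_encoding(raw: bytes, declared_charset=None, autodetect_japanese=False):
        if declared_charset:                  # a charset declaration always wins
            return declared_charset
        if autodetect_japanese:               # only because the *user* asked for it
            for enc in CANDIDATES_JA:
                try:
                    raw.decode(enc)
                    return enc
                except UnicodeDecodeError:
                    continue
        return "iso-8859-1"                   # some Western default

    # Step 2: a font per *section*, chosen from its lang attribute.
    FONT_BY_LANG = {"ru": "MS Comic"}         # my 'Cyrillic' font from the experiment
    DEFAULT_WESTERN_FONT = "Times New Roman"

    def resolve_font(lang=None):
        return FONT_BY_LANG.get(lang, DEFAULT_WESTERN_FONT)

    # resolve_font("ru")  -> 'MS Comic'
    # resolve_font(None)  -> 'Times New Roman'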
Regards, Jan

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/