Jungshik Shin wrote:

> It's impossible to infer the document encoding from 'lang' tag.

Indeed, yes, I presented the URL inserted by jmaiorana to the W3C
HTML validator and it could not make any sense out of it. Still,
when I set Mozilla to 'autodetect Japanese' it correctly found it
to be shift-jis. So it is possible "in a way"; after all, there
are many text utilities (for Japanese only) that can guess (or
autodetect) encodings.
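
To illustrate what such a guess can look like, here is a minimal
sketch that simply tries a few candidate encodings in strict mode
and keeps the first one that decodes without error. This is an
assumption for illustration only: it is not how Mozilla's detector
works (real detectors also use byte statistics), and the function
name and the candidate list are made up.

```python
# Naive trial-decoding sketch; hypothetical, not Mozilla's detector.
CANDIDATES = ("utf-8", "euc_jp", "shift_jis")  # assumed candidate order


def guess_encoding(raw, candidates=CANDIDATES):
    """Return the first candidate that decodes 'raw' without error, else None."""
    for name in candidates:
        try:
            raw.decode(name)  # strict mode: invalid byte sequences raise
        except UnicodeDecodeError:
            continue
        return name
    return None
```

For short or ambiguous input several candidates may decode cleanly,
so the order of the list decides; that is one reason a guess like
this is only a guess.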

Aahh.. something now dawns on me: perhaps charset applies to the
WHOLE document and must be determined before any processing is
done, while lang can apply to individual sections? That is why
Mozilla does not 'trust' lang for determining/autodetecting the
encoding? It will (and can) autodetect, but only when told to do
so by the user, not by the document. So probably jmaiorana (who
said the page displayed correctly) had autodetect Japanese ON.

> The value of 'lang' plays a role ONLY after the identity of 
> characters in documents are determined. See below.

Right. Yes, this is quite clear to me now (finally!). The Mozilla
algorithm is:

1. determine the encoding (for the whole document) from the
   'charset' attribute, or by auto-detection as specified by the
   user.
2. determine the font (for the section concerned, which may be the
   whole "body") from the 'lang' attribute.

If the attributes are missing, there are several fallback options
and defaults, but this is the rule in principle. One default seems
to be 'the language group is Western'. I can put two fragments of
Russian in a UTF-8 document, one with no special marking and one
with lang=ru. The Western font is Times New Roman (which includes
Cyrillic characters). I set the 'Cyrillic' font to MS Comic (which
also has Cyrillic). Then only the 'lang=ru' marked fragment is
displayed in MS Comic, the other one in T.N.Roman. More or less
the same effect as in the second URL posted by jmaiorana.
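
The two steps and the Western default can be sketched roughly as
follows. This is only my reading of the behaviour, not Mozilla's
actual code: the function names, the font table, the lang-to-group
mapping and the latin-1 fallback are all assumptions made up for
the illustration.

```python
# Hypothetical sketch of the two-step rule; not Mozilla's real implementation.

# Assumed per-language-group font preferences, mirroring the experiment.
FONT_PREFS = {
    "western": "Times New Roman",
    "cyrillic": "MS Comic",
}

# Assumed mapping from 'lang' values to language groups.
LANG_GROUPS = {"en": "western", "ru": "cyrillic"}


def decode_document(raw, charset=None, autodetect=None):
    """Step 1: settle one encoding for the whole document."""
    if charset is not None:
        return raw.decode(charset)
    if autodetect is not None:
        # 'autodetect' stands in for a detector the user has switched on.
        return raw.decode(autodetect(raw))
    # Assumed fallback when neither source is available.
    return raw.decode("latin-1")


def font_for(lang=None):
    """Step 2: pick the font for one section from its 'lang' attribute."""
    group = LANG_GROUPS.get(lang, "western")  # default group: Western
    return FONT_PREFS[group]
```

With these tables, font_for("ru") gives the Cyrillic font while an
unmarked fragment falls back to the Western one, even though both
fragments came out of the same decoded UTF-8 text.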

I must still do a few more experiments to find out what the rule
is when no lang is specified but a character in the UTF-8 document
does not occur in the Western font (and also which rules Xprint
uses).

> BTW, as you know, GB18030 is another UTF so that even without
> resorting to NCRs (&#xhhhh(hh); or &#dddd..;) it can cover the
> full range of Unicode.

No, I did not know this; I had assumed it was one of those Chinese
legacy things like eten or big5. Now I Googled a bit and found
that it is a Chinese government Unicode standard. What was wrong
with UTF-8 one wonders (rhetorical question, don't really want to
know the answer because it is probably very complicated).
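
The "full range of Unicode" claim is easy to check with Python's
built-in gb18030 codec. A small sketch (the helper name and the
sample string are my own choices): it round-trips text that mixes
ASCII, Cyrillic, a CJK character and a character outside the Basic
Multilingual Plane, with no NCRs involved.

```python
# Round-trip check using Python's standard 'gb18030' codec.

def roundtrips(text, encoding="gb18030"):
    """True if 'text' survives encode/decode in 'encoding' unchanged."""
    return text.encode(encoding).decode(encoding) == text


# ASCII, Cyrillic, CJK, and U+20000 (outside the BMP).
sample = "A\u0436\u65e5\U00020000"
ok = roundtrips(sample)

# Characters outside the BMP take four bytes in GB18030.
non_bmp_len = len("\U00020000".encode("gb18030"))
```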

Regards, Jan

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
