Re: [Fonts]Automatic 'lang' determination

Pablo Saratxaga Sat, 29 Jun 2002 16:25:08 -0700

Kaixo!

On Sat, Jun 29, 2002 at 01:20:34PM -0700, Keith Packard wrote:
 
> > A font is suited for a given language when it covers *ALL* of the codepoints
> > needed for that language.
> 
> Yes, that's obviously true, but the problem is that I don't have tables for
> each language indicating the required codepoints, all I have are tables
> listing Unicode values in encodings traditionally used for each language.
> These tables almost always include a few (1-5) glyphs which many fonts are
> missing.


What are those glyphs?
(I'm quite surprised; I would have expected the opposite: fonts generally
have more glyphs than the standard encodings of the iso-8859 family,
for example.)

> > So, the tests for CJK languages and for other languages are clearly different,
> > only CJK languages can go with testing only a "significant fraction",
> > for all other languages all chars must be tested.
> 
> Yes, the tolerance value given for the Han languages is 500 codepoints 
> while the value for non-Han languages is two orders of magnitude smaller.

No, the tolerance for missing glyphs in CJK tests should be the
same or even smaller.
The difference is that there is no need to test all the glyphs for CJK
coverage; testing only a set of 256 chosen glyphs would be enough
(if they are correctly chosen, testing that those 256 glyphs are present in
a font is enough to ensure, with 99.99% confidence, that it covers
a given CJK language).

That cannot be done for the 8-bit Latin/Cyrillic encodings because
there is too much overlap between them (in the case of
iso-8859-1/iso-8859-15 the overlap is 97%, for example).
While there is also a lot of overlap between CJK encodings, there
are large ranges of non-overlapping chars: chars that appear only in
the Japanese encoding, or only in gb2312, or only in big5, etc. (By
"only" I mean "not in any other widely used legacy encoding", so explicitly
excluding Unicode, which of course includes them all.) As those "exclusive"
chars are numerous enough, it is possible to test for the presence of
some of them in a font and determine a language coverage from there.
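
To make the contrast concrete, here is a rough Python sketch of both halves
of the argument. It is only an illustration under stated assumptions: the
codec names are Python's own ("gb2312", "big5", "shift_jis"), the language
tags and function names are made up for this example, and the random
sampling is a placeholder for a careful choice of probe glyphs. A font is
modelled simply as a set of Unicode code points.

```python
import random

# Python codec names standing in for the "widely used legacy encodings";
# the language tags are illustrative, not taken from any real tool.
CJK_CODECS = {"zh-cn": "gb2312", "zh-tw": "big5", "ja": "shift_jis"}


def shared_fraction(codec_a, codec_b):
    """Fraction of the 256 byte values that two 8-bit codecs decode identically."""
    same = sum(
        bytes([b]).decode(codec_a) == bytes([b]).decode(codec_b)
        for b in range(256)
    )
    return same / 256


def han_repertoire(codec):
    """Code points in U+4E00..U+9FFF that the codec can encode."""
    found = set()
    for cp in range(0x4E00, 0xA000):
        try:
            chr(cp).encode(codec)
        except UnicodeEncodeError:
            continue
        found.add(cp)
    return found


def exclusive_probes(codecs=CJK_CODECS, size=256, seed=0):
    """For each language, pick up to `size` ideographs found in its codec only."""
    repertoires = {lang: han_repertoire(c) for lang, c in codecs.items()}
    rng = random.Random(seed)
    probes = {}
    for lang, own in repertoires.items():
        others = set().union(*(r for l, r in repertoires.items() if l != lang))
        exclusive = sorted(own - others)
        # Random sampling is a placeholder for a deliberate choice of glyphs.
        probes[lang] = set(rng.sample(exclusive, min(size, len(exclusive))))
    return probes


def guess_cjk_langs(font_codepoints, probes):
    """Languages whose whole (non-empty) probe set is present in the font."""
    return [lang for lang, probe in probes.items()
            if probe and probe <= font_codepoints]


if __name__ == "__main__":
    print(f"{shared_fraction('iso-8859-1', 'iso-8859-15'):.1%}")
    probes = exclusive_probes()
    fake_font = han_repertoire("gb2312")  # stand-in for a real font's coverage
    print(guess_cjk_langs(fake_font, probes))
```

iso-8859-1 and iso-8859-15 differ at only eight byte positions, which is
where the figure of roughly 97% comes from, so no small probe set can tell
them apart reliably. For the CJK codecs, an encoding with fewer than 256
exclusive ideographs simply yields a smaller probe set in this sketch.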

Of course, complete checking can also be done, but I wonder if it is
actually useful (I mean, is there a font suitable for Simplified Chinese
out there that doesn't encode all the characters of gb2312? It would be
like a font for English that is missing the letter "r").
"Big5" is a bit more problematic, as there is no such thing as a
well-defined "Big5" encoding but rather, in the pure Microsoftian tradition
(big5 comes after all from that side), a number of revisions all named
the same, each adding some characters, so an older font can miss some
chars that a newer one has (according to a newer definition of "big5").

But to handle such cases, I think it would be better to choose a given
definition of "big5" (or several of them) and stick to it, rather than
allowing such a tremendously big hole as 500 possibly missing chars.
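
As a rough sketch of what "choose a definition and stick to it" could look
like, the Python below pins each name to one fixed repertoire and then does a
complete check with zero tolerance. Treating Python's "big5" and "cp950"
codecs as two revisions of Big5 is an assumption made for illustration, as
are all the function names; a font is again modelled as a set of code points.

```python
def han_repertoire(codec):
    """Code points in U+4E00..U+9FFF that the codec can encode."""
    found = set()
    for cp in range(0x4E00, 0xA000):
        try:
            chr(cp).encode(codec)
        except UnicodeEncodeError:
            continue
        found.add(cp)
    return found


# Pinned definitions: each name maps to one fixed repertoire. Using Python's
# "big5" and "cp950" codecs as the revisions is an assumption for illustration.
BIG5_DEFINITIONS = {name: han_repertoire(name) for name in ("big5", "cp950")}


def missing(font_codepoints, required):
    """Required code points that the font does not provide."""
    return required - font_codepoints


def covers(font_codepoints, required, tolerance=0):
    """True if at most `tolerance` required code points are missing."""
    return len(missing(font_codepoints, required)) <= tolerance


def matching_definitions(font_codepoints, definitions=BIG5_DEFINITIONS):
    """Names of the pinned definitions that the font covers completely."""
    return [name for name, required in definitions.items()
            if covers(font_codepoints, required)]
```

With a tolerance of 500, a font lacking 300 of the required ideographs would
still pass `covers`; with the default of zero it is rejected, and an older
font is matched against an older pinned definition instead.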

-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://chanae.stben.be/pablo/           PGP Key available, key ID: 0xD9B85466
[you can write me in Walloon, Spanish, French, English, Italian or Portuguese]
_______________________________________________
Fonts mailing list
[EMAIL PROTECTED]
http://XFree86.Org/mailman/listinfo/fonts
