Re: [Fonts]Automatic 'lang' determination

Pablo Saratxaga Sun, 30 Jun 2002 01:05:07 -0700

Kaixo!

On Sat, Jun 29, 2002 at 05:17:04PM -0700, Keith Packard wrote:
 
> > What are those glyphs? (I'm quite surprised, I would have expected the
> > opposite: fonts generally have more glyphs than the standard encodings of
> > the iso-8859 family for example)
> 
> My definition of language tag is coloured by the OS/2 table codePageRange 
> bits from which it was originally defined in fontconfig.  Those bits are 
> defined to map to specific Windows code pages; the Latin-1 case doesn't 
> map to ISO 8859-1, but rather to code page 1252 for which many fonts are 
> missing a few random entries.


But what characters are those?
Is it possible that they are the ones that have been added to cp1252
and that didn't exist some years ago?
I think the matching should be done against the lowest common denominator
and be strict; or else give different weights to missing *letters*
versus other symbols (it may be more or less acceptable to get quotation
marks from another font; bUt lEttErs frOm A dIffErEnt fOnts Is vErY UglY).
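
A rough sketch of the weighting idea, in Python. The function name and the
weights (10 for a missing letter, 1 for anything else) are invented for
illustration; they are assumptions, not anything fontconfig does.

```python
import unicodedata

def weighted_missing(required, font_chars, letter_weight=10.0, other_weight=1.0):
    """Penalty for characters in `required` that are absent from `font_chars`.

    Hypothetical scoring: a missing letter costs more than a missing symbol.
    """
    penalty = 0.0
    for ch in set(required) - set(font_chars):
        # Unicode general categories starting with "L" are letters.
        is_letter = unicodedata.category(ch).startswith("L")
        penalty += letter_weight if is_letter else other_weight
    return penalty

# Printable part of ISO 8859-1: U+0020-U+007E and U+00A0-U+00FF.
latin1 = [chr(c) for c in range(0x20, 0x7F)] + [chr(c) for c in range(0xA0, 0x100)]

font_a = set(latin1) - {"\u00ab", "\u00bb"}  # lacks the two guillemets
font_b = set(latin1) - {"\u00e9"}            # lacks one letter, e with acute

print(weighted_missing(latin1, font_a))
print(weighted_missing(latin1, font_b))
```

With such a score, a font missing two quotation marks ranks better than a
font missing a single letter, even though it lacks more glyphs.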

> > No, the tolerance for missing glyphs in CJK tests should be the same or
> > even smaller. The difference is that it isn't needed to test all the glyphs
> > for CJK coverage; testing only a set of 256 chosen glyphs would be enough
> > (if they are correctly chosen, testing that 256 glyphs are present in a
> > font is enough to assure, with 99.99% of confidence, that it covers a given
> > CJK language).
> 
> I'm not confident enough of this approach; I fear that any set of 256 
> glyphs that must appear in a simplified Chinese font may well appear in 
> many traditional Chinese (or even Japanese) fonts.

Most do, of course, but there are a lot that don't.
I have only dealt with ~10-15 TTF CJK fonts, but never had false positives
using that method.
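
One way such a probe set could be built, sketched in Python. It assumes
Python's "gb2312", "big5" and "euc_jp" codecs are reasonable stand-ins for
the respective repertoires, and it simply takes the first 256 candidates by
code point; a real list would be picked by hand or by frequency. The names
ideographs(), probes and covers() are hypothetical.

```python
def ideographs(codec):
    """CJK unified ideographs (U+4E00-U+9FFF) that `codec` can encode."""
    found = set()
    for cp in range(0x4E00, 0xA000):
        try:
            chr(cp).encode(codec)
        except UnicodeEncodeError:
            continue
        found.add(chr(cp))
    return found

simplified = ideographs("gb2312")
others = ideographs("big5") | ideographs("euc_jp")

# Probes: ideographs in GB2312 that neither Big5 nor EUC-JP can encode.
# Taking the first 256 by code point is an arbitrary choice.
probes = sorted(simplified - others)[:256]

def covers(font_chars, probes, tolerance=0):
    """True if at most `tolerance` probe characters are absent from the font."""
    missing = sum(1 for ch in probes if ch not in font_chars)
    return missing <= tolerance

# A font holding exactly the Big5 ideographs fails the simplified Chinese test.
print(covers(simplified, probes), covers(ideographs("big5"), probes))
```

Because the probes are chosen from characters absent from the other
standards, a font built only for traditional Chinese or Japanese cannot pass
by accident; a font that deliberately covers several standards still would.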

>> out there that doesn't encode all the characters of gb2312?
> 
> It seems that this must be the case -- I set the '500' number so high 
> because all of the fonts which I have that advertise support for 
> simplified Chinese are missing over 200 glyphs from GB2312.  I got
> similar results for Japanese fonts, Korean Wansung fonts and traditional 
> Chinese fonts.

But which characters are missing?
Could it be that those are semi-graphic ones, or scripts used by other
languages (e.g. Cyrillic, Greek, Japanese kana in a Chinese font, etc.)?
Here too, different weights should be used: it is not a big problem if
a CJK font is missing Cyrillic, since a font designed for Russian will be a
much better choice to render Cyrillic anyway; but it may be a big problem if
some needed characters are missing.
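
To see which kind of characters are missing, the misses can be grouped by
script. A small Python sketch follows; it uses the first word of the Unicode
character name as a crude script label, and the set of "ignorable" scripts
is an assumption made up for this example.

```python
import unicodedata
from collections import Counter

def missing_by_script(required, font_chars):
    """Count missing characters, keyed by the first word of their Unicode name."""
    counts = Counter()
    for ch in set(required) - set(font_chars):
        name = unicodedata.name(ch, "UNNAMED")
        counts[name.split()[0]] += 1
    return counts

# Hypothetical: scripts a Chinese font may lack without much harm.
IGNORABLE = {"CYRILLIC", "GREEK", "HIRAGANA", "KATAKANA", "BOX"}

def only_ignorable_missing(required, font_chars):
    """True if every missing character belongs to an ignorable script."""
    return set(missing_by_script(required, font_chars)) <= IGNORABLE

# Two ideographs, two Cyrillic letters, Greek alpha, one hiragana,
# one katakana and one box-drawing character.
required = set("\u4e2d\u6587\u0410\u0411\u03b1\u3042\u30a2\u2500")
font = set("\u4e2d\u6587")  # the font only has the two ideographs

print(missing_by_script(required, font))
print(only_ignorable_missing(required, font))
```

Here six characters are missing, yet none of them is an ideograph, so the
font would still be acceptable for Chinese under this rule.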

And I'm really surprised by such a high number as 200.
Are you sure you tested against gb2312 and not against the Microsoft
codepage based on it (which surely adds several extra characters)?
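
The size of the difference can be checked quickly in Python, assuming its
"gb2312" and "gbk" codecs are faithful enough stand-ins for the standard and
for the Microsoft code page (cp936) respectively. repertoire() is a
hypothetical helper.

```python
def repertoire(codec):
    """Non-ASCII BMP characters that `codec` can encode."""
    chars = set()
    for cp in range(0x80, 0x10000):
        if 0xD800 <= cp <= 0xDFFF:  # skip surrogates, they are not characters
            continue
        try:
            chr(cp).encode(codec)
        except UnicodeEncodeError:
            continue
        chars.add(chr(cp))
    return chars

gb2312 = repertoire("gb2312")
gbk = repertoire("gbk")

# Characters a font would be reported as "missing" if it covered GB2312
# fully but was tested against the larger code page instead.
extra = gbk - gb2312

print(len(gb2312), len(gbk), len(extra))
```

If the test table was derived from the code page, a font that covers gb2312
completely would still show a large count of missing glyphs.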

>> But to handle such case, I think it would be better to choose a given
>> definition of "big5" (or several of them) and stick to it, rather than
>> allowing a so tremendously big hole as 500 possible missing chars.
> 
> Missing 500 from a repertoire of nearly 20000 doesn't seem to render most 
> of these fonts unusable.

It could; it depends on which glyphs are missing.


-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://chanae.stben.be/pablo/           PGP Key available, key ID: 0xD9B85466
[you can write me in Walloon, Spanish, French, English, Italian or Portuguese]
_______________________________________________
Fonts mailing list
[EMAIL PROTECTED]
http://XFree86.Org/mailman/listinfo/fonts
