Re: metric for block coverage

Frédéric Grosshans via Unicode Thu, 08 Mar 2018 06:22:34 -0800


Hi!

I’ll just add two points to those raised in the previous conversation about block coverage:



On 17/02/2018 at 23:18, Adam Borowski via Unicode wrote:

Hi!
As a part of Debian fonts team work, we're trying to improve fonts review:
ways to organize them, add metadata, pick which fonts are installed by
default and/or recommended to users, etc.

I'm looking for a way to determine a font's coverage of available scripts.
It's probably reasonable to do this per Unicode block.  [...]

A naïve way would be to count codepoints present in the font vs the number
of all codepoints in the block.  Alas, there's way too much chaff for such
an approach to be reasonable: þ or ą count the same as LATIN TURNED CAPITAL
LETTER SAMPI WITH HORNS AND TAIL WITH SMALL LETTER X WITH CARON.

A slightly less naïve way would be to take into account when the code points were added to Unicode, with the rough idea that the most widely used characters were added first. It also has the nice feature that this metric is less ambiguous for blocks which are not yet complete.


For example, if you have 100% coverage of Armenian for Unicode 10.0 (which I’ll call Armenian10.0 for short), it only implies a coverage of 89/91 = 97.8% of Armenian11.0, which will see the addition of two characters used in Armenian dialectology (ARMENIAN SMALL LETTER TURNED AYB and YI WITH STROKE).

If you look at the history of the Armenian block (e.g. here: https://en.wikipedia.org/wiki/Armenian_(Unicode_block)), most (84) characters were added in 1.0, a ligature was added in 1.1, ARMENIAN HYPHEN was added in 3.0, a currency symbol in 6.1, two decorative symbols in 7.0, and two characters used in dialectology are planned for 11.0. I guess this roughly corresponds to a ranking of the characters from the most used to the least used.
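A minimal Python sketch of this version-aware metric might look like the following. It assumes you already have a table mapping each code point of a block to the Unicode version that introduced it (for instance derived from the UCD's DerivedAge.txt); the function name and the toy table below are hypothetical and only illustrate the bookkeeping, they are not real Armenian data.

```python
def version_key(version):
    """Turn '6.1' into (6, 1) so that versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def cumulative_coverage(font_cps, age_of):
    """For each version V, count the code points added up to V.

    font_cps: set of code points present in the font.
    age_of:   dict {code point: version string} for one block.
    Returns {version: (covered, total)}, in increasing version order.
    """
    versions = sorted(set(age_of.values()), key=version_key)
    result = {}
    for v in versions:
        in_scope = {cp for cp, age in age_of.items()
                    if version_key(age) <= version_key(v)}
        result[v] = (len(in_scope & font_cps), len(in_scope))
    return result


# Toy block (hypothetical code points and ages, for illustration only).
toy_ages = {0x10: "1.1", 0x11: "1.1", 0x12: "1.1", 0x13: "3.0", 0x14: "11.0"}
toy_font = {0x10, 0x11, 0x12, 0x13}

for v, (covered, total) in cumulative_coverage(toy_font, toy_ages).items():
    print(f"up to {v}: {covered}/{total} = {covered / total:.1%}")
```

With the Armenian figures above, a font covering all 89 characters of Armenian10.0 would be reported as 89/89 up to 10.0 and 89/91 up to 11.0.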

To take your examples, both þ and ą have been in Unicode since 1.1 (and, I guess, 1.0), while LATIN TURNED CAPITAL LETTER SAMPI WITH HORNS AND TAIL WITH SMALL LETTER X WITH CARON is not encoded yet, so they are not the same according to this metric... To see what this means for other Latin examples, you can look at the Latin Extended-D block (history here: https://en.wikipedia.org/wiki/Latin_Extended-D), with new characters in 5.0, 5.1, 6.1, 7.0, 8.0, 9.0, some accepted for 11.0 (SMALL CAPITAL Q, CAPITAL/SMALL LETTER U WITH STROKE), and more later (15, for Egyptology, Assyriology, medieval English and historical Pinyin).

Of course, this measure is only rough. A counterexample is in the currency symbols block, where € U+20AC EURO SIGN (in Unicode since 2.1) is much more used than ₣ U+20A3 FRENCH FRANC SIGN, encoded since Unicode 1.1 (1.0?), which I have never seen in use despite living in France for more than four decades.

[...]

I don't think I'm the first to have this question.  Any suggestions?

For the Han (CJK) script, the IRG (Ideographic Rapporteur Group) defined a set of fewer than 10k essential Han characters, IICore (International Ideographs Core, https://en.wikipedia.org/wiki/International_Ideographs_Core). This is described in the Unihan database, in the kIICore field of the Unihan_IRGSources.txt file (https://www.unicode.org/reports/tr38/#kIICore). This field also includes a letter (A, B or C) indicating a priority value, and some regional information. For Unicode 10.0, a simple grep shows that there are 9810 IICore characters, 7772 of which are priority A, 417 priority B and 1621 priority C.
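The grep-style count can be sketched in Python as follows. This assumes the usual Unihan layout of tab-separated lines (code point, field name, value) with the priority letter as the first character of the kIICore value; please check against TR38 for the version you use. The function name is hypothetical, and the sample lines are made up for illustration rather than copied from the real file.

```python
from collections import Counter


def count_iicore_priorities(lines):
    """Count kIICore entries per priority letter (A, B or C).

    lines: any iterable of text lines, e.g. an open file object.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) == 3 and fields[1] == "kIICore" and fields[2]:
            # Assumption: the first letter of the value is the priority,
            # the remaining letters are region tags.
            counts[fields[2][0]] += 1
    return counts


# Made-up sample lines in the assumed format (not real Unihan data).
sample = [
    "# comment line",
    "U+4E00\tkIICore\tAGTJHKMP",
    "U+4E00\tkIRG_GSource\tG0-523B",
    "U+4E01\tkIICore\tAGTJHKMP",
    "U+4E02\tkIICore\tCH",
]

counts = count_iicore_priorities(sample)
print(dict(counts), "total:", sum(counts.values()))
```

On the real file, one would pass an open file instead of the sample list, e.g. `with open("Unihan_IRGSources.txt", encoding="utf-8") as f: count_iicore_priorities(f)`. The per-priority counts quoted above for Unicode 10.0 do add up: 7772 + 417 + 1621 = 9810.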

Note that IICore has been stable (as version 2.2) since 2004. Ken Lunde, from Adobe, has recently proposed an update to it (https://www.unicode.org/L2/L2018/18066-iicore-changes.pdf), but only to the region tags, not to the priorities or the list of characters. However, reading Ken Lunde's associated blog post, it seems a few characters could be added to IICore in the future.


   Cheers,

            French
