Hi!

I'll just add two points to those already raised in the previous conversation about block coverage:


On 17/02/2018 at 23:18, Adam Borowski via Unicode wrote:
> Hi!
> As a part of Debian fonts team work, we're trying to improve fonts review:
> ways to organize them, add metadata, pick which fonts are installed by
> default and/or recommended to users, etc.
>
> I'm looking for a way to determine a font's coverage of available scripts.
> It's probably reasonable to do this per Unicode block.  [...]
>
> A naïve way would be to count codepoints present in the font vs the number
> of all codepoints in the block.  Alas, there's way too much chaff for such
> an approach to be reasonable: þ or ą count the same as LATIN TURNED CAPITAL
> LETTER SAMPI WITH HORNS AND TAIL WITH SMALL LETTER X WITH CARON.

A slightly less naïve way would be to take into account when the code points were added to Unicode, with the rough idea that the most widely used characters were added first. It also has the nice property that the metric is less ambiguous for blocks which are not yet complete.

For example, if you have 100% coverage of Armenian for Unicode 10.0 (which I'll call Armenian10.0 for short), this only implies a coverage of 89/91 = 97.8% of Armenian11.0, which will see the addition of two characters used in Armenian dialectology (ARMENIAN SMALL LETTER TURNED AYB and YI WITH STROKE). If you look at the history of the Armenian block (e.g. here: https://en.wikipedia.org/wiki/Armenian_(Unicode_block)), most (84) characters were added in 1.0, a ligature was added in 1.0, ARMENIAN HYPHEN was added in 3.0, a currency symbol in 6.1, two decorative symbols in 7.0, and two characters used in dialectology are planned for 11.0. I guess this roughly corresponds to a ranking of the characters from the most used to the least used.
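
To make the arithmetic concrete, here is a minimal Python sketch of that version-aware coverage. The names (ARMENIAN_AGE, coverage_at) are mine, and the age table is simply transcribed from the history above rather than read from the Unicode data files, so treat it as an assumption; a real tool would presumably load DerivedAge.txt instead.

```python
from fractions import Fraction

def expand(ranges):
    """Expand inclusive (first, last) pairs into a set of code points."""
    return {cp for first, last in ranges for cp in range(first, last + 1)}

# Assumption: ages transcribed from the block history quoted in this mail,
# not parsed from DerivedAge.txt. Versions are (major, minor) tuples.
ARMENIAN_HISTORY = [
    ((1, 0), [(0x0531, 0x0556), (0x0559, 0x055F), (0x0561, 0x0587), (0x0589, 0x0589)]),
    ((3, 0), [(0x058A, 0x058A)]),                    # ARMENIAN HYPHEN
    ((6, 1), [(0x058F, 0x058F)]),                    # currency symbol
    ((7, 0), [(0x058D, 0x058E)]),                    # two decorative symbols
    ((11, 0), [(0x0560, 0x0560), (0x0588, 0x0588)]), # dialectology letters
]
ARMENIAN_AGE = {cp: version
                for version, ranges in ARMENIAN_HISTORY
                for cp in expand(ranges)}

def coverage_at(font_cps, age_table, version):
    """Share of the block, as it stood at `version`, present in the font."""
    block = {cp for cp, age in age_table.items() if age <= version}
    return Fraction(len(block & font_cps), len(block))

# A hypothetical font that covers everything encoded up to Unicode 10.0.
font = {cp for cp, age in ARMENIAN_AGE.items() if age <= (10, 0)}

print(coverage_at(font, ARMENIAN_AGE, (10, 0)))  # 1
print(coverage_at(font, ARMENIAN_AGE, (11, 0)))  # 89/91
```

Reporting the coverage for each version separately, rather than one number per block, keeps the old core of a block from being diluted by later additions.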


To take your examples, both þ and ą have been in Unicode since 1.1 (and, I guess, 1.0), while LATIN TURNED CAPITAL LETTER SAMPI WITH HORNS AND TAIL WITH SMALL LETTER X WITH CARON is not encoded yet, so they are not the same according to this metric... To see what this means for another Latin example, you can look at the Latin Extended-D block (history here: https://en.wikipedia.org/wiki/Latin_Extended-D ), with new characters in 5.0, 5.1, 6.1, 7.0, 8.0 and 9.0, some accepted for 11.0 (SMALL CAPITAL Q, CAPITAL/SMALL LETTER U WITH STROKE), and more later (15, for Egyptology, Assyriology, medieval English and historical Pinyin).

Of course, this measure is only rough. A counterexample is in the Currency Symbols block, where € U+20AC EURO SIGN (in Unicode since 2.1) is much more widely used than ₣ U+20A3 FRENCH FRANC SIGN, which has been encoded since Unicode 1.1 (1.0?) but which I have never seen, despite living in France for more than four decades.
> [...]
>
> I don't think I'm the first to have this question.  Any suggestions?

For the Han (CJK) script, the IRG (Ideographic Rapporteur Group) defined a set of fewer than 10k essential Han characters, IICore (International Ideographs Core, https://en.wikipedia.org/wiki/International_Ideographs_Core). This is described in the Unihan database, in the kIICore field of the Unihan_IRGSources.txt file (https://www.unicode.org/reports/tr38/#kIICore ). This field also includes a letter (A, B or C) indicating a priority value, plus some regional information. For Unicode 10.0, a simple grep shows that there are 9810 IICore characters, of which 7772 are priority A, 417 priority B and 1621 priority C.
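
If you would rather script the count than grep it, here is a rough Python sketch. It assumes the usual Unihan layout of three tab-separated fields per line (code point, field name, value) and that the first letter of the kIICore value is the priority; the function name and the sample lines are made up for illustration and are not copied from the database.

```python
from collections import Counter

def iicore_priorities(lines):
    """Count kIICore entries per priority letter (A, B or C)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) == 3 and fields[1] == "kIICore":
            counts[fields[2][0]] += 1  # first letter is the priority
    return counts

# Illustrative lines only (hypothetical values, same shape as the real file).
sample = [
    "# comment line",
    "U+4E00\tkIICore\tAGTJHKMP",
    "U+4E01\tkIICore\tAGTJHKMP",
    "U+4E02\tkIRG_GSource\tG3-3021",
    "U+4E0F\tkIICore\tCH",
    "U+4E10\tkIICore\tBTH",
]

counts = iicore_priorities(sample)
print(sum(counts.values()), dict(counts))
```

On the real data you would pass the open file instead of the sample list, for example `iicore_priorities(open("Unihan_IRGSources.txt", encoding="utf-8"))`, and compare the totals with the figures above.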

Note that IICore has been stable (as version 2.2) since 2004. Ken Lunde, from Adobe, has recently proposed an update to it (https://www.unicode.org/L2/L2018/18066-iicore-changes.pdf), but it only touches the region tags, not the priorities or the list of characters. However, reading Ken Lunde's associated blog post, it seems a few characters could be added to IICore in the future.

   Cheers,

            French
