For my own project, I needed to implement a mapping from IETF language tags to OpenType language system tags. I ended up writing some code to generate the mapping and then comparing the results with HarfBuzz. For each discrepancy, I did enough research to convince myself of the right result. The HB source refers to a recent Microsoft draft, from which some entries have been added; I skipped these entries (which I assume are similar to the ones in the ISO 3rd ed WD 5, which I found here: http://mpeg.chiariglione.org/standards/mpeg-4/open-font-format/text-wd-isoiec-14496-22-3rd-edition ).

I documented the research here:

https://github.com/jclark/lang-ietf-opentype/blob/master/doc/notes.md

As a result I have a lot of comments about HarfBuzz's implementation.

First, some stuff that is just typos. "ber" should be mapped to BBR, not BER. There's a duplicate entry for "hz", which is not in sort order. The entries for "sck", "vls" and "wo" are not in sort order. The tag for "tmh" is in lower case instead of upper case.

Some tags are missing a final zero. The ISO WD adds some 4-character tags whose last character is a zero. There are five cases where these have been added, but the final zero was incorrectly omitted:

kab -> KAB0
ksh -> KSH0
kg  -> KON0
pap -> PAP0
sn  -> SNA0

The following entries appear in the spec but are missing from HarfBuzz, and they seem uncontroversial to me:

wlc  CMR  Mwali Comorian
wni  CMR  Ndzwani Comorian
zdj  CMR  Ngazidja Comorian
caf  CRR  Southern Carrier
co   COS  Corsican

The last is probably missing because it was omitted from the ISO WD; I suspect this is a bug in the ISO WD.

HarfBuzz (and the OT spec) are inconsistent in their handling of macrolanguages. Sometimes, when an IETF macrolanguage is mapped to an OT language system tag, they also map the individual languages encompassed by the macrolanguage to that OT tag, and sometimes they don't. I would suggest that the consistent and reasonable policy is always to map the individual languages to the same OT tag as the macrolanguage, unless the individual language is separately mapped to a more specific OT tag. I created a file with the additional entries that would be needed to implement this policy in HarfBuzz:

https://github.com/jclark/lang-ietf-opentype/blob/master/gen/hb-macrolang-expand.txt

The rest of my comments are not self-evident. You will need to refer to the notes I linked to above for my reasoning.

My first set of removals/additions is in accordance with the ISO 639 codes in the spec.
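
To make the macrolanguage policy suggested above concrete, here is a minimal sketch in C. It is only an illustration under stated assumptions: the two tables, their types and the function lookup_ot_tag are hypothetical (they are not HarfBuzz's actual data structures or API), and the few Arabic entries are an illustrative subset that should be checked against ISO 639-3 and the OT spec. The lookup tries a direct mapping first and only then falls back to the tag of the encompassing macrolanguage.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table types for illustration; not HarfBuzz's structures. */
typedef struct { const char *lang; const char *ot_tag; } lang_map_t;
typedef struct { const char *lang; const char *macro; } macro_map_t;

/* Direct language -> OT tag entries (illustrative subset, to be verified). */
static const lang_map_t direct_map[] = {
    {"ar",  "ARA "},  /* Arabic (macrolanguage) */
    {"ary", "MOR "},  /* Moroccan Arabic has its own, more specific tag */
};

/* Individual language -> encompassing macrolanguage (illustrative subset). */
static const macro_map_t macro_map[] = {
    {"ary", "ar"},
    {"arz", "ar"},
};

/* Return the directly mapped OT tag for lang, or NULL if there is none. */
static const char *lookup_direct(const char *lang)
{
    for (size_t i = 0; i < sizeof direct_map / sizeof direct_map[0]; i++)
        if (strcmp(direct_map[i].lang, lang) == 0)
            return direct_map[i].ot_tag;
    return NULL;
}

/* Apply the suggested policy: a specific mapping wins; otherwise an
 * individual language inherits the tag of its macrolanguage, if any. */
const char *lookup_ot_tag(const char *lang)
{
    const char *tag = lookup_direct(lang);
    if (tag != NULL)
        return tag;
    for (size_t i = 0; i < sizeof macro_map / sizeof macro_map[0]; i++)
        if (strcmp(macro_map[i].lang, lang) == 0)
            return lookup_direct(macro_map[i].macro);
    return NULL;
}
```

With these tables, "arz" has no entry of its own and so resolves to the tag of "ar", while "ary" keeps its more specific tag.
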
I suggest removing these mappings:

eot  BTI  Beti (Côte d'Ivoire)
kvd  KUI  Kui (Indonesia)
mdc  MLE  Male (Papua New Guinea)
mlq  MNK  Western Maninkakan
nco  SIB  Sibe
ril  RIA  Riang (India)
xom  KMO  Komo (Sudan)
yso  NIS  Nisi (China)

and adding these:

sjo  SIB  Xibe
pro  PRO  Old Provencal
rmz  ARK  Marma

The next set is not in the spec. Remove:

xst  SIG  (not an IETF tag, was Silt'e in ISO 639-2 before it was retired)

and add:

njz  NIS  Nyishi
tgj  NIS  Tagin
beb  BTI  Bebele
bum  BTI  Bulu (Cameroon)
bxp  BTI  Bebil
eto  BTI  Eton (Cameroon)
ewo  BTI  Ewondo
fan  BTI  Fang (Equatorial Guinea)
mct  BTI  Mengisa

Finally, I have suggestions for the commented-out entries in the source. Each entry is followed by my suggested mapping:

/*{"ahg/awn/xan?", HB_TAG('A','G','W',' ')},*/ /* Agaw */
    "ahg", "awn"
/*{"gsw?/gsw-FR?", HB_TAG('A','L','S',' ')},*/ /* Alsatian */
    "gsw"
/*{"krc", HB_TAG('B','A','L',' ')},*/ /* Balkar */
    Leave unmapped
/*{"??", HB_TAG('B','C','R',' ')},*/ /* Bible Cree */
    Leave unmapped
/*{"zh?", HB_TAG('C','H','N',' ')},*/ /* Chinese (seen in Microsoft fonts) */
    ???
/*{"acf/gcf?", HB_TAG('F','A','N',' ')},*/ /* French Antillean */
    "acf", "gcf"
/*{"enf?/yrk?", HB_TAG('F','N','E',' ')},*/ /* Forest Nenets */
    Leave unmapped
/*{"fuf?", HB_TAG('F','T','A',' ')},*/ /* Futa */
    "fuf"
/*{"ar-Syrc?", HB_TAG('G','A','R',' ')},*/ /* Garshuni */
    "ar-Syrc"
/*{"cfm/rnl?", HB_TAG('H','A','L',' ')},*/ /* Halam */
    "cfm"
/*{"fonipa", HB_TAG('I','P','P','H')},*/ /* Phonetic transcription—IPA conventions */
    "und-fonipa", or better map anything with a variant of "fonipa"
/*{"ga-Latg?/Latg?", HB_TAG('I','R','T',' ')},*/ /* Irish Traditional */
    "ga-Latg"
/*{"krc", HB_TAG('K','A','R',' ')},*/ /* Karachay */
    "krc"
/*{"alw?/ktb?", HB_TAG('K','E','B',' ')},*/ /* Kebena */
    "alw"
/*{"Geok", HB_TAG('K','G','E',' ')},*/ /* Khutsuri Georgian */
    "ka-Geok" (Georgian written with the Khutsuri script)
/*{"kca", HB_TAG('K','H','K',' ')},*/ /* Khanty-Kazim */
    "kca"
/*{"kca", HB_TAG('K','H','S',' ')},*/ /* Khanty-Shurishkar */
    Leave unmapped
/*{"kca", HB_TAG('K','H','V',' ')},*/ /* Khanty-Vakhi */
    Leave unmapped
/*{"guz?/kqs?/kss?", HB_TAG('K','I','S',' ')},*/ /* Kisii */
    "guz"
/*{"kfa/kfi?/kpb?/xua?/xuj?", HB_TAG('K','O','D',' ')},*/ /* Kodagu */
    "kfa"
/*{"okm?/oko?", HB_TAG('K','O','H',' ')},*/ /* Korean Old Hangul */
    "okm"
/*{"kon?/ktu?/...", HB_TAG('K','O','N',' ')},*/ /* Kikongo */
    "ktu"
/*{"kfx?", HB_TAG('K','U','L',' ')},*/ /* Kulvi */
    "kfx"
/*{"??", HB_TAG('L','A','H',' ')},*/ /* Lahuli */
    "lbf", "lae", "bfu"
/*{"??", HB_TAG('L','C','R',' ')},*/ /* L-Cree */
    Leave unmapped
/*{"??", HB_TAG('M','A','L',' ')},*/ /* Malayalam Traditional */
    Leave unmapped
/*{"mnk?/mlq?/...", HB_TAG('M','L','N',' ')},*/ /* Malinke */
    "mlq"
/*{"??", HB_TAG('N','C','R',' ')},*/ /* N-Cree */
    "csw"
/*{"??", HB_TAG('N','H','C',' ')},*/ /* Norway House Cree */
    Leave unmapped
/*{"jpa?/sam?", HB_TAG('P','A','A',' ')},*/ /* Palestinian Aramaic */
    "jpa", "sam"
/*{"polyton", HB_TAG('P','G','R',' ')},*/ /* Polytonic Greek */
    "el-polyton"
/*{"??", HB_TAG('Q','I','N',' ')},*/ /* Asho Chin */
    "tbq" (The spec says Chin, not Asho Chin.)
/*{"??", HB_TAG('R','C','R',' ')},*/ /* R-Cree */
    "atj"
/*{"chp?", HB_TAG('S','A','Y',' ')},*/ /* Sayisi */
    Leave unmapped
/*{"xan?", HB_TAG('S','E','K',' ')},*/ /* Sekota */
    "xan"
/*{"ngo?", HB_TAG('S','X','T',' ')},*/ /* Sutu */
    Leave unmapped
/*{"??", HB_TAG('T','C','R',' ')},*/ /* TH-Cree */
    Leave unmapped
/*{"tnz?/tog?/toi?", HB_TAG('T','N','G',' ')},*/ /* Tonga */
    "toi"
/*{"enh?/yrk?", HB_TAG('T','N','E',' ')},*/ /* Tundra Nenets */
    "yrk"
/*{"??", HB_TAG('W','C','R',' ')},*/ /* West-Cree */
    Leave unmapped
/*{"cre?", HB_TAG('Y','C','R',' ')},*/ /* Y-Cree */
    "crk"
/*{"??", HB_TAG('Y','I','C',' ')},*/ /* Yi Classic */
    Leave unmapped
/*{"ii?/Yiii?", HB_TAG('Y','I','M',' ')},*/ /* Yi Modern */
    "ii"
    It would also be desirable to map otherwise unmapped languages in the Yi script (i.e. with a script code of Yiii) to YIM.
/*{"??", HB_TAG('Z','H','P',' ')},*/ /* Chinese Phonetic */
    "zh-Latn"

I'll have some more general comments later.

James
_______________________________________________
HarfBuzz mailing list
http://lists.freedesktop.org/mailman/listinfo/harfbuzz
