Bumping this old thread. Let's merge James's work with the new OpenType 1.7 investigation and update the mapping.
Thanks everyone!

behdad

On 14-03-08 03:17 AM, James Clark wrote:
> For my own project, I needed to implement mapping from IETF language tags to
> OpenType language system tags. I ended up writing some code to generate the
> mapping and then comparing the results with HarfBuzz. For each case where
> there was a discrepancy, I did enough research to convince myself of the right
> result. The HB source refers to a recent Microsoft draft, from which some
> entries have been added; I skipped these entries (which I assume are similar
> to the ones in the ISO 3rd ed WD 5, which I found here
> http://mpeg.chiariglione.org/standards/mpeg-4/open-font-format/text-wd-isoiec-14496-22-3rd-edition).
>
> I documented the research here
>
> https://github.com/jclark/lang-ietf-opentype/blob/master/doc/notes.md
>
> As a result I have a lot of comments about HarfBuzz's implementation.
>
> First some stuff that is just typos.
>
> "ber" should be mapped to BBR not BER.
>
> There's a duplicate entry for "hz" not in sort order.
>
> The entries for "sck", "vls", "wo" are not in sort order.
>
> The tag for "tmh" is in lower case instead of upper case.
>
> Some tags are missing a final zero. The ISO WD adds some 4-character tags,
> whose last character is a zero. There are five cases where these have been
> added, but the final zero was incorrectly omitted: kab -> KAB0, ksh -> KSH0,
> kg -> KON0, pap -> PAP0, sn -> SNA0.
>
> The following entries appear in the spec, but are missing from HarfBuzz, and
> they seem uncontroversial to me.
>
> wlc  CMR  Mwali Comorian
> wni  CMR  Ndzwani Comorian
> zdj  CMR  Ngazidja Comorian
> caf  CRR  Southern Carrier
> co   COS  Corsican
>
> The last is probably missing because it was omitted from the ISO WD; I suspect
> this is a bug in the ISO WD.
>
> HarfBuzz (and the OT spec) are inconsistent in their handling of
> macrolanguages. Sometimes when an IETF macrolanguage is mapped to an OT lang,
> they also map the individual languages encompassed by the macrolanguage to
> that OT tag and sometimes they don't. I would suggest that the consistent and
> reasonable policy is always to map the individual languages to the same OT tag
> as the macrolanguage, unless the individual language is separately mapped to a
> more specific OT tag. I created a file with the additional entries that would
> be needed to implement this policy in HarfBuzz:
>
> https://github.com/jclark/lang-ietf-opentype/blob/master/gen/hb-macrolang-expand.txt
>
> The rest of my comments are not self-evident. You will need to refer to the
> notes I linked to above for my reasoning.
>
> My first set of removals/additions is in accordance with the ISO 639 codes in
> the spec. I suggest removing these mappings:
>
> eot  BTI  Beti (Côte d'Ivoire)
> kvd  KUI  Kui (Indonesia)
> mdc  MLE  Male (Papua New Guinea)
> mlq  MNK  Western Maninkakan
> nco  SIB  Sibe
> ril  RIA  Riang (India)
> xom  KMO  Komo (Sudan)
> yso  NIS  Nisi (China)
>
> and adding these:
>
> sjo  SIB  Xibe
> pro  PRO  Old Provencal
> rmz  ARK  Marma
>
> The next set is not in the spec. Remove:
>
> xst  SIG  (not an IETF tag, was Silt'e in ISO 639-2 before it was retired)
>
> and add:
>
> njz  NIS  Nyishi
> tgj  NIS  Tagin
> beb  BTI  Bebele
> bum  BTI  Bulu (Cameroon)
> bxp  BTI  Bebil
> eto  BTI  Eton (Cameroon)
> ewo  BTI  Ewondo
> fan  BTI  Fang (Equatorial Guinea)
> mct  BTI  Mengisa
>
> Finally, I have suggestions for the commented-out entries in the source:
>
> /*{"ahg/awn/xan?",HB_TAG('A','G','W',' ')},*//* Agaw */
>
> "ahg", "awn"
>
> /*{"gsw?/gsw-FR?",HB_TAG('A','L','S',' ')},*//* Alsatian */
>
> "gsw"
>
> /*{"krc",HB_TAG('B','A','L',' ')},*//* Balkar */
>
> Leave unmapped
>
> /*{"??",HB_TAG('B','C','R',' ')},*//* Bible Cree */
>
> Leave unmapped
>
> /*{"zh?",HB_TAG('C','H','N',' ')},*//* Chinese (seen in Microsoft fonts) */
>
> ???
>
> /*{"acf/gcf?",HB_TAG('F','A','N',' ')},*//* French Antillean */
>
> "acf", "gcf"
>
> /*{"enf?/yrk?",HB_TAG('F','N','E',' ')},*//* Forest Nenets */
>
> Leave unmapped
>
> /*{"fuf?",HB_TAG('F','T','A',' ')},*//* Futa */
>
> "fuf"
>
> /*{"ar-Syrc?",HB_TAG('G','A','R',' ')},*//* Garshuni */
>
> "ar-Syrc"
>
> /*{"cfm/rnl?",HB_TAG('H','A','L',' ')},*//* Halam */
>
> "cfm"
>
> /*{"fonipa",HB_TAG('I','P','P','H')},*//* Phonetic transcription—IPA conventions */
>
> "und-fonipa", or better map anything with a variant of "fonipa"
>
> /*{"ga-Latg?/Latg?",HB_TAG('I','R','T',' ')},*//* Irish Traditional */
>
> "ga-Latg"
>
> /*{"krc",HB_TAG('K','A','R',' ')},*//* Karachay */
>
> "krc"
>
> /*{"alw?/ktb?",HB_TAG('K','E','B',' ')},*//* Kebena */
>
> "alw"
>
> /*{"Geok",HB_TAG('K','G','E',' ')},*//* Khutsuri Georgian */
>
> "ka-Geok" (Georgian written with the Khutsuri script)
>
> /*{"kca",HB_TAG('K','H','K',' ')},*//* Khanty-Kazim */
>
> "kca"
>
> /*{"kca",HB_TAG('K','H','S',' ')},*//* Khanty-Shurishkar */
>
> Leave unmapped
>
> /*{"kca",HB_TAG('K','H','V',' ')},*//* Khanty-Vakhi */
>
> Leave unmapped
>
> /*{"guz?/kqs?/kss?",HB_TAG('K','I','S',' ')},*//* Kisii */
>
> "guz"
>
> /*{"kfa/kfi?/kpb?/xua?/xuj?",HB_TAG('K','O','D',' ')},*//* Kodagu */
>
> "kfa"
>
> /*{"okm?/oko?",HB_TAG('K','O','H',' ')},*//* Korean Old Hangul */
>
> "okm"
>
> /*{"kon?/ktu?/...",HB_TAG('K','O','N',' ')},*//* Kikongo */
>
> "ktu"
>
> /*{"kfx?",HB_TAG('K','U','L',' ')},*//* Kulvi */
>
> "kfx"
>
> /*{"??",HB_TAG('L','A','H',' ')},*//* Lahuli */
>
> "lbf", "lae", "bfu"
>
> /*{"??",HB_TAG('L','C','R',' ')},*//* L-Cree */
>
> Leave unmapped
>
> /*{"??",HB_TAG('M','A','L',' ')},*//* Malayalam Traditional */
>
> Leave unmapped
>
> /*{"mnk?/mlq?/...",HB_TAG('M','L','N',' ')},*//* Malinke */
>
> "mlq"
>
> /*{"??",HB_TAG('N','C','R',' ')},*//* N-Cree */
>
> "csw"
>
> /*{"??",HB_TAG('N','H','C',' ')},*//* Norway House Cree */
>
> Leave unmapped
>
> /*{"jpa?/sam?",HB_TAG('P','A','A',' ')},*//* Palestinian Aramaic */
>
> "jpa", "sam"
>
> /*{"polyton",HB_TAG('P','G','R',' ')},*//* Polytonic Greek */
>
> "el-polyton"
>
> /*{"??",HB_TAG('Q','I','N',' ')},*//* Asho Chin */
>
> "tbq"
>
> (The spec says Chin not Asho Chin.)
>
> /*{"??",HB_TAG('R','C','R',' ')},*//* R-Cree */
>
> "atj"
>
> /*{"chp?",HB_TAG('S','A','Y',' ')},*//* Sayisi */
>
> Leave unmapped
>
> /*{"xan?",HB_TAG('S','E','K',' ')},*//* Sekota */
>
> "xan"
>
> /*{"ngo?",HB_TAG('S','X','T',' ')},*//* Sutu */
>
> Leave unmapped
>
> /*{"??",HB_TAG('T','C','R',' ')},*//* TH-Cree */
>
> Leave unmapped
>
> /*{"tnz?/tog?/toi?",HB_TAG('T','N','G',' ')},*//* Tonga */
>
> "toi"
>
> /*{"enh?/yrk?",HB_TAG('T','N','E',' ')},*//* Tundra Nenets */
>
> "yrk"
>
> /*{"??",HB_TAG('W','C','R',' ')},*//* West-Cree */
>
> Leave unmapped
>
> /*{"cre?",HB_TAG('Y','C','R',' ')},*//* Y-Cree */
>
> "crk"
>
> /*{"??",HB_TAG('Y','I','C',' ')},*//* Yi Classic */
>
> Leave unmapped
>
> /*{"ii?/Yiii?",HB_TAG('Y','I','M',' ')},*//* Yi Modern */
>
> "ii"
>
> It would also be desirable to map otherwise unmapped languages in the
> Yi script (i.e. with a script code of Yiii) to YIM.
>
> /*{"??",HB_TAG('Z','H','P',' ')},*//* Chinese Phonetic */
>
> "zh-Latn"
>
> I'll have some more general comments later.
>
> James
>
> _______________________________________________
> HarfBuzz mailing list
> [email protected]
> http://lists.freedesktop.org/mailman/listinfo/harfbuzz
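
For anyone picking this up, here is a rough sketch of the macrolanguage policy proposed in the quoted message: look a language up directly first, and only if it has no mapping of its own fall back to the OT tag of its macrolanguage. Everything in it is an assumption made for illustration: the struct names, the function `ot_tag_for_lang`, and the tiny sample tables are hypothetical and are not HarfBuzz's actual data or API. The "crk" and "csw" entries reuse the suggestions from the message; "cr" -> CRE and "cwd" as a Cree member with no specific mapping are assumed sample data.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical types and sample data for illustration only;
 * these are not HarfBuzz's real tables or API. */
typedef struct { const char *lang; const char *ot_tag; } lang_map_t;
typedef struct { const char *member; const char *macro; } macro_member_t;

/* Direct language -> OT tag entries (small assumed sample). */
static const lang_map_t direct_map[] = {
  {"cr",  "CRE "},  /* macrolanguage entry (assumed) */
  {"crk", "YCR "},  /* more specific tag, as suggested in the message */
  {"csw", "NCR "},  /* more specific tag, as suggested in the message */
};

/* Individual language -> enclosing macrolanguage (small assumed sample). */
static const macro_member_t macro_members[] = {
  {"crk", "cr"},
  {"csw", "cr"},
  {"cwd", "cr"},
};

#define ARRAY_LEN(a) (sizeof (a) / sizeof ((a)[0]))

/* Return the OT tag mapped directly to lang, or NULL if there is none. */
static const char *lookup_direct (const char *lang)
{
  size_t i;
  for (i = 0; i < ARRAY_LEN (direct_map); i++)
    if (strcmp (direct_map[i].lang, lang) == 0)
      return direct_map[i].ot_tag;
  return NULL;
}

/* Policy: a specific mapping wins; otherwise inherit the macrolanguage's tag. */
const char *ot_tag_for_lang (const char *lang)
{
  size_t i;
  const char *tag = lookup_direct (lang);
  if (tag)
    return tag;
  for (i = 0; i < ARRAY_LEN (macro_members); i++)
    if (strcmp (macro_members[i].member, lang) == 0)
      return lookup_direct (macro_members[i].macro);
  return NULL;
}
```

With this sample data, "csw" resolves through its own entry, while "cwd" has no entry and inherits the tag mapped to "cr". A real implementation would more likely pre-expand the table at generation time, as the hb-macrolang-expand.txt file linked in the message does, rather than do a second lookup at run time.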
