[luatex] glyph to Unicode code point

maxwell Wed, 28 Mar 2018 16:31:18 -0700

I'm using a version of the code in the answer here
    https://tex.stackexchange.com/questions/228312/

to convert a LuaTeX node structure into a list of characters in that structure.

My code at present traverses a Node, recursing on nodes that are hlists (or vlists); when it comes to a node which is a glyph (node.id() = 29 in the table on p99 of the LuaTeX reference manual, version 1.0.4 of April 2017), it converts the node.char to what is (hopefully) a Unicode code point:

    unicode.utf8.char(node.char)

I say hopefully, because this conversion relies on the glyph being assigned a slot in a particular font that has the same number as the Unicode code point for that character. This works OK for many simple fonts, particularly some Latin fonts. It also works for some simple Arabic script fonts which encode the various glyph variants (like initial, final, medial and isolated) as the Arabic Presentation Form code points (which can easily be converted to code points in the normal Arabic block). Unfortunately, it does not work for glyphs that are assigned to other slots in a font, corresponding to a Private Use Area code point in Unicode, or sometimes not corresponding to a valid Unicode code point at all.

The cmap table in an OpenType font (or a TrueType font) provides a mapping between Unicode code points and glyph slots. Somewhere under the hood, LuaTeX is presumably using this table to choose an appropriate glyph. It seems like it should be possible to do the reverse mapping, i.e. to map from a particular glyph to the corresponding Unicode code point. (In the case of ligatures, this will be a one-to-many mapping.) The LuaTeX reference appears to discuss this on p68-69; if I have a character's hash (from a font table), I can apparently extract the 'tounicode' value, which IIUC is the Unicode code point I'm looking for.
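To make the one-to-many case concrete, here is a minimal sketch of decoding a 'tounicode' value. It assumes (from my reading of the reference manual, not verified against every font loader) that the value is a string of UTF-16BE code units written as hexadecimal digits, so a ligature carries several units in a row. The function name decode_tounicode is made up for illustration.

```lua
-- Sketch: turn a tounicode string such as "00660069" into a list of
-- Unicode code points. ASSUMPTION: the string is UTF-16BE in hex,
-- four hex digits per code unit.
local function decode_tounicode(hex)
  local codepoints = {}
  local pending_high = nil          -- high surrogate waiting for its partner
  for unit_hex in hex:gmatch("%x%x%x%x") do
    local unit = tonumber(unit_hex, 16)
    if unit >= 0xD800 and unit <= 0xDBFF then
      -- High surrogate: remember it and wait for the low surrogate.
      pending_high = unit
    elseif unit >= 0xDC00 and unit <= 0xDFFF and pending_high then
      -- Low surrogate: combine the pair into one code point above U+FFFF.
      codepoints[#codepoints + 1] =
        0x10000 + (pending_high - 0xD800) * 0x400 + (unit - 0xDC00)
      pending_high = nil
    else
      -- Ordinary BMP code unit: it is the code point itself.
      codepoints[#codepoints + 1] = unit
      pending_high = nil
    end
  end
  return codepoints
end

-- An "fi" ligature would decode to two code points (f and i).
local fi = decode_tounicode("00660069")
print(#fi, fi[1], fi[2])
```

The result is a list of numbers, so each entry can be passed to unicode.utf8.char in the same way as node.char above.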

My problem is that I don't know how to go from the glyph's slot number (which is apparently what node.char is giving me, for nodes that represent a glyph) to the character hash in the font table, or even how to find the font table from the font number. The node.char elements are numbers like 1583 (which appears to be 0x62F, and makes sense as the code point for Arabic Dal) and 983159 (0xF0077, which lies in the Supplementary Private Use Area-A and so has no standard Unicode meaning, but might be a glyph in some font).
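As a quick check on those two numbers, the following sketch prints a slot number in hexadecimal and reports whether it falls in one of the Unicode private-use ranges (U+E000..U+F8FF, U+F0000..U+FFFFD, U+100000..U+10FFFD). The function name classify_slot and the labels it returns are invented for this example.

```lua
-- Sketch: classify a node.char value by Unicode range.
-- classify_slot and its return strings are hypothetical helpers.
local function classify_slot(cp)
  if cp < 0 or cp > 0x10FFFF then
    return "outside Unicode"
  elseif cp >= 0xD800 and cp <= 0xDFFF then
    return "surrogate"
  elseif cp >= 0xE000 and cp <= 0xF8FF then
    return "private use (BMP)"
  elseif cp >= 0xF0000 and cp <= 0xFFFFD then
    return "private use (plane 15)"
  elseif cp >= 0x100000 and cp <= 0x10FFFD then
    return "private use (plane 16)"
  end
  return "not private use"
end

for _, slot in ipairs({1583, 983159}) do
  print(string.format("U+%04X", slot), classify_slot(slot))
end
```

A slot reported as private use is a sign that the plain unicode.utf8.char(node.char) conversion will not yield a meaningful character for that glyph.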

How do I go from these node.char numbers and a node.font number (a number like 29, which apparently points to a font) to a character hash in the font's table? I'm guessing I need a function that maps from the node.font to a font table, and then a function that maps the number from node.char plus a font table to a character hash. Something like

   if node.id == 29 then
      UnicodeChar = unicode.utf8.char(Node2CharHash(node.char, Node2FontTable(node.font)).tounicode)
   end

where Node2CharHash and Node2FontTable are the functions I'm looking for, if my guess is right. (My syntax is probably wrong, I'm used to Python...)
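One possible shape for that lookup, as a sketch only: the reference manual documents font.getfont(n) as returning the font table for a font number, and a font table as having a characters table indexed by slot. Whether the tounicode field is actually populated depends on the font loader, which is an assumption here. The name glyph_tounicode and the optional getfont argument are made up so that the function can be tried outside LuaTeX with a stand-in table.

```lua
-- Sketch: fetch the tounicode entry for a glyph node.
-- ASSUMPTIONS: font.getfont(id) returns the font table, and
-- fonttable.characters[slot].tounicode is filled in by the loader.
-- The getfont parameter is a hypothetical hook for testing.
local function glyph_tounicode(glyph, getfont)
  -- Inside LuaTeX the global `font` exists; elsewhere it is nil.
  getfont = getfont or (font and font.getfont)
  local fontdata = getfont and getfont(glyph.font)
  local chardata = fontdata and fontdata.characters
                   and fontdata.characters[glyph.char]
  -- Returns nil when the font or character has no stored mapping.
  return chardata and chardata.tounicode
end

-- Stand-in font table, only to show the shape of the data.
local fake_fonts = {
  [29] = { characters = { [983159] = { tounicode = "062F" } } },
}
local fake_glyph = { font = 29, char = 983159 }
print(glyph_tounicode(fake_glyph, function(id) return fake_fonts[id] end))
```

In a real document the glyph argument would be the node itself, and the test `node.id == 29` could be written as a comparison against node.id("glyph") so that the number is not hard-coded. A nil result would mean falling back to node.char.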


   Mike Maxwell
