I'm using a version of the code in the answer here
    https://tex.stackexchange.com/questions/228312/
to convert a LuaTeX node structure into a list of characters in that structure.

My code at present traverses a node list, recursing into nodes that are hlists or vlists. When it comes to a glyph node (node.id == 29 in the table on p. 99 of the LuaTeX reference manual, version 1.0.4 of April 2017), it converts node.char to what is (hopefully) a Unicode code point:
    unicode.utf8.char(node.char)
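
For concreteness, the traversal has roughly the following shape. This is a simplified sketch rather than the code from the linked answer: collect_chars is a name made up for illustration, the hlist and vlist ids (0 and 1) are assumed from the same table in the manual, and the numeric ids are used only as fallbacks when the node library is not available.

```lua
-- Node type ids. Inside LuaTeX, ask the node library; the numeric
-- fallbacks (0, 1, 29) are assumed from the manual's table for 1.0.4.
local HLIST = node and node.id("hlist") or 0
local VLIST = node and node.id("vlist") or 1
local GLYPH = node and node.id("glyph") or 29

-- Walk a node list and append the char field of every glyph to `out`,
-- descending into hlists and vlists through their `head` field.
local function collect_chars(head, out)
  out = out or {}
  local n = head
  while n do
    if n.id == GLYPH then
      out[#out + 1] = n.char
    elseif n.id == HLIST or n.id == VLIST then
      collect_chars(n.head, out)
    end
    n = n.next
  end
  return out
end
```

The numbers returned are the raw node.char values, which is what then gets passed to unicode.utf8.char.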

I say hopefully, because this conversion relies on the glyph being assigned a slot in a particular font that has the same number as the Unicode code point for that character. This works for many simple fonts, particularly some Latin fonts. It also works for some simple Arabic script fonts which encode the various glyph variants (initial, medial, final and isolated) at the Arabic Presentation Forms code points (which can easily be converted to code points in the normal Arabic block). Unfortunately, it does not work for glyphs that are assigned to other slots in a font, corresponding to a Private Use Area code point in Unicode, or sometimes not corresponding to a valid Unicode code point at all.

The cmap table in an OpenType font (or a TrueType font) provides a mapping between Unicode code points and glyph slots. Somewhere under the hood, LuaTeX is presumably using this table to choose an appropriate glyph. It seems like it should be possible to do the reverse mapping, i.e. to map from a particular glyph to the corresponding Unicode code point. (In the case of ligatures, this will be a one-to-many mapping.) The LuaTeX reference appears to discuss this on pp. 68-69; if I have a character's hash (from a font table), I can apparently extract the 'tounicode' value, which IIUC is the Unicode code point I'm looking for.

My problem is that I don't know how to go from the glyph's slot number (which is apparently what node.char is giving me, for nodes that represent a glyph) to the character hash in the font table, or even how to find the font table from the font number. The node.char values are numbers like 1583 (which is 0x62F, and makes sense as the code point for Arabic Dal) and 983159 (0xF0077, which falls in Unicode's Supplementary Private Use Area-A and so has no standard meaning, but is presumably a glyph slot in some font).
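
A quick way to check which of these values can be taken at face value is to test them against Unicode's three Private Use ranges. The sketch below is only an illustration; is_private_use is a made-up helper name, not anything provided by LuaTeX.

```lua
-- True if `cp` lies in one of Unicode's three Private Use ranges:
-- U+E000..U+F8FF, U+F0000..U+FFFFD, U+100000..U+10FFFD.
local function is_private_use(cp)
  return (cp >= 0xE000 and cp <= 0xF8FF)
      or (cp >= 0xF0000 and cp <= 0xFFFFD)
      or (cp >= 0x100000 and cp <= 0x10FFFD)
end

-- The two sample values from above, shown in U+XXXX notation.
for _, cp in ipairs({ 1583, 983159 }) do
  print(string.format("U+%04X", cp), is_private_use(cp))
end
```

This reports 1583 as U+062F, outside the private-use ranges, and 983159 as U+F0077, inside them, so only the second needs some other route back to a real code point.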

How do I go from these node.char numbers and a node.font number (a number like 29, which apparently points to a font) to a character hash in the font's table? I'm guessing I need a function that maps from the node.font to a font table, and then a function that maps the number from node.char plus a font table to a character hash. Something like
    if node.id == 29 then
        UnicodeChar = unicode.utf8.char(Node2CharHash(node.char, Node2FontTable(node.font)).tounicode)
    end
where Node2CharHash and Node2FontTable are the functions I'm looking for, if my guess is right. (My syntax is probably wrong; I'm used to Python...)
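
For what it's worth, here is a rough sketch of the shape such a lookup might take. It rests on several assumptions: that font.getfont(id) returns the font table for a font number, that the table's characters field is indexed by slot number, and that an entry's tounicode field, when present, is a string of UTF-16BE hex digits as the manual describes. Whether tounicode is actually populated may depend on how the font was loaded. The names decode_tounicode and glyph_to_codepoints are invented for illustration, and the input is assumed to be well-formed hex.

```lua
-- Decode a tounicode string (UTF-16BE written as hex digits, e.g. "062F",
-- or "00660069" for an f-i ligature) into a list of code points.
local function decode_tounicode(hex)
  local cps, i = {}, 1
  while i + 3 <= #hex do
    local unit = tonumber(hex:sub(i, i + 3), 16)
    i = i + 4
    if unit >= 0xD800 and unit <= 0xDBFF and i + 3 <= #hex then
      -- High surrogate: combine it with the low surrogate that follows.
      local low = tonumber(hex:sub(i, i + 3), 16)
      i = i + 4
      unit = 0x10000 + (unit - 0xD800) * 0x400 + (low - 0xDC00)
    end
    cps[#cps + 1] = unit
  end
  return cps
end

-- Map a glyph's slot number and font number to a list of code points.
-- `getfont` defaults to font.getfont (assumed to return the font table);
-- if no tounicode string is found, the slot number itself is returned.
local function glyph_to_codepoints(char, fontid, getfont)
  getfont = getfont or font.getfont
  local f = getfont(fontid)
  local info = f and f.characters and f.characters[char]
  if info and type(info.tounicode) == "string" then
    return decode_tounicode(info.tounicode)
  end
  return { char }
end
```

In the traversal this would be called as glyph_to_codepoints(n.char, n.font) for each glyph node n, with every returned code point then passed to unicode.utf8.char. Returning a list rather than a single number allows for the one-to-many ligature case.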

   Mike Maxwell
