Hi Peter,

At 2022-09-17T17:35:02-0400, Peter Schaffter wrote:
> Source documents written in Cyrillic and processed with mom/pdfmom
> break the PDF outline:
>
> 1. The text of titles and headings is not displayed in the outline.
[...]
> At a guess, it looks as if gropdf or pdfmark isn't recognizing
> Cyrillic characters as valid input for creating pdf bookmarks.  I'm
> at a loss as to how to overcome this.  Ideas?
I have a hunch that this is our old friend "can't output node in
transparent throughput".  Five years into my tenure as a groff
developer, I think I finally understand what this inscrutable error
message is on about.  However, I recently disabled these diagnostics
by default in groff Git.  Try regenerating the document with
GROFF_ENABLE_TRANSPARENCY_WARNINGS=1 (actually, you can set the
variable to anything) in your environment.

The problem, I think, is that PDF bookmark generation, like some other
advanced features, relies upon use of device control escape
sequences--'\X' stuff.  In device-independent output ("grout", as I
term it), these become "x X" commands, and the arguments to the escape
sequence are, you'd think, passed through as-is.  The trouble comes
with the assumption people make about what "as-is" means: what if we
want to represent a non-ASCII character in the device control escape
sequence?  groff's device-independent output is, up to a point,
strictly ISO basic Latin, a property we inherited from AT&T troff.

We have the same problem with the requests that write to the standard
error stream, like `tm`.  I'm not sure that problem is worth solving;
groff's own diagnostic messages are not i18n-ed.  Even if it is worth
solving, teaching device control commands how to interpret more kinds
of "node" seems like a higher priority.

We don't have any infrastructure for handling any character encoding
but the default for input.  That's ISO Latin-1 for most platforms, but
IBM code page 1047 for OS/390 Unix (I think--no one who runs groff on
such a machine has ever spoken with me of their experiences).  And in
practice GNU troff doesn't, as far as I can tell, ever write anything
but the 94 graphical code points of ASCII, spaces, and newlines to its
output.

I imagine a lot of people's first instinct to fix this is to say,
"just give groff full Unicode support and enable input and output of
UTF-8!"  That's a huge ask.
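To make the underlying constraint concrete, here is a Python sketch
(an illustration of mine, not groff code): a byte-oriented stream
interpreted as Latin-1 can smuggle a French accented letter through as
a single byte with the eighth bit set, but it has no way at all to
carry Cyrillic.

```python
# Illustration only (not groff code): Latin-1 can carry U+00E9 ('é')
# as the single byte 0xE9, but Cyrillic letters simply have no
# Latin-1 representation.
french = "Cicéron"
cyrillic = "Таблица"  # Russian for "table"

print(french.encode("latin-1"))  # b'Cic\xe9ron'

try:
    cyrillic.encode("latin-1")
except UnicodeEncodeError as err:
    # Every Cyrillic code point is outside Latin-1's range(256).
    print("no Latin-1 encoding:", err.reason)
```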
A shorter pole might be to establish a protocol for communication of
Unicode code points within device control commands.  Portability isn't
much of an issue here: as far as I know there has been no effort to
achieve interoperation of device control escape sequences among
troffs.  That convention even _could_ be UTF-8, but my initial
instinct is _not_ to go that way.  I like the 7-bit cleanliness of GNU
troff output, and when I've mused about solving The Big Unicode
Problem I have given strong consideration to preserving it, or
enabling tricked-out UTF-8 "grout" only via an option for the kids who
really like to watch their chrome rims spin.  I realize that Heirloom
and neatroff can both boast of this, but how many people _really_ look
at device-independent troff output?  A few curious people, and the
poor saps who are stuck developing and debugging the implementations,
like me.  For the latter community, a modest and well-behaved format
saves a lot of time.

Concretely, when I run the following command:

  GROFF_ENABLE_TRANSPARENCY_WARNINGS=1 ./test-groff -Z -mom -Tpdf -pet \
    -Kutf8 ../contrib/mom/examples/mon_premier_doc.mom

I get the following diagnostics, familiar to all who have built groff
1.22.4 from source.

  troff:../contrib/mom/examples/mon_premier_doc.mom:30: error: can't translate character code 233 to special character ''e' in transparent throughput
  troff:../contrib/mom/examples/mon_premier_doc.mom:108: error: can't translate character code 233 to special character ''e' in transparent throughput
  troff:../contrib/mom/examples/mon_premier_doc.mom:136: error: can't translate character code 232 to special character '`e' in transparent throughput

More tellingly, if I page the foregoing output with "less -R", I see
non-ASCII code points screaming out their rage in reverse video.

  x X ps:exec [/Author (Cic<E9>ron,) /DOCINFO pdfmark
  x X ps:exec [/Dest /pdf:bm4 /Title (1. Les diff<E9>rentes versions) /Level 2 /OUT pdfmark
  x X ps:exec [/Dest /evolution /Title (2. Les <E9>volutions du Lorem) /Level 2 /OUT pdfmark
  x X ps:exec [/Dest /pdf:bm8 /Title (Table des mati<E8>res) /Level 1 /OUT pdfmark

It therefore appears to me that the pdfmark extension to PostScript,
or PostScript itself, happily eats Latin-1...but that means that it
eats _only_ Latin-1, which forecloses the use of Cyrillic code points.

I'm a little concerned that we're blindly _feeding_ the device control
commands characters with the eighth bit set.  It's obviously a useful
expedient for documents like mon_premier_doc.mom.  I am curious to
know why, instead of getting no text for headings and titles in the
Cyrillic PDF outline, you didn't get horrendous mojibake--plainly
Latin-1 garbage at that.

Anyway, some type of mode switching or alternative notation within the
PostScript command stream is required for us to be able to encode
Cyrillic code points.  And once we've figured out what that is, maybe
we can teach GNU troff something about it.  The answer might be to do
just whatever works for PostScript and PDF, since I assume this
problem has been solved already, but it also might mean having our own
escaping protocol, which the output drivers then interpret.

Regards,
Branden
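P.S.  On the "whatever works for PostScript and PDF" side: if I'm
reading the PDF spec right, PDF text strings (outline titles included)
may be written as UTF-16BE with a leading FE FF byte order mark, so
the format itself already has an answer.  On the "our own escaping
protocol" side, here is a Python sketch of the kind of 7-bit-clean
convention I have in mind--my invention for illustration, not an
existing grout protocol; the \[uXXXX] notation is borrowed from
groff's special character syntax.

```python
# Sketch (illustrative, not an existing grout convention): the
# formatter writes each non-ASCII character in a device control
# argument as a \[uXXXX] entity; the output driver decodes the
# entities back into characters before building the PDF bookmark.
import re

def encode_arg(s):
    """Formatter side: keep ASCII as-is, escape everything else."""
    return "".join(c if ord(c) < 0x80 else "\\[u%04X]" % ord(c)
                   for c in s)

def decode_arg(s):
    """Driver side: turn \\[uXXXX] entities back into characters."""
    return re.sub(r"\\\[u([0-9A-Fa-f]{4,6})\]",
                  lambda m: chr(int(m.group(1), 16)), s)

title = "Таблица"
wire = encode_arg(title)
print(wire)                        # ASCII-only: \[u0422]\[u0430]...
assert wire.encode("ascii")        # "grout" stays 7-bit clean
assert decode_arg(wire) == title   # and the title round-trips
```

The attraction, to my eye, is that the device-independent output stays
readable in a plain pager while still carrying the full code point.
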