Hi Peter,

At 2022-09-17T17:35:02-0400, Peter Schaffter wrote:
> Source documents written in Cyrillic and processed with mom/pdfmom
> break the PDF outline:
>
> 1. The text of titles and headings is not displayed in the outline.
[...]
> At a guess, it looks as if gropdf or pdfmark isn't recognizing
> Cyrillic characters as valid input for creating pdf bookmarks.  I'm
> at a loss as to how to overcome this.  Ideas?
I have a hunch that this is our old friend "can't output node in
transparent throughput".  Five years into my tenure as a groff
developer, I think I finally understand what this inscrutable error
message is on about.  However, I recently disabled these diagnostics
by default in groff Git.  Try regenerating the document with
GROFF_ENABLE_TRANSPARENCY_WARNINGS=1 (actually, you can set the
variable to anything) in your environment.

The problem, I think, is that PDF bookmark generation, like some other
advanced features, relies upon use of device control escape
sequences--'\X' stuff.  In device-independent output ("grout", as I
term it), these become "x X" commands, and the arguments to the escape
sequence are, you'd think, passed through as-is.  The trouble comes
with the assumption people make about what "as-is" means: what if we
want to represent a non-ASCII character in the device control escape
sequence?  groff's device-independent output is, up to a point,
strictly ISO basic Latin, a property we inherited from AT&T troff.

We have the same problem with the requests that write to the standard
error stream, like `tm`.  I'm not sure that problem is worth solving;
groff's own diagnostic messages are not i18n-ed.  Even if it is worth
solving, teaching device control commands how to interpret more kinds
of "node" seems like a higher priority.

We don't have any infrastructure for handling any character encoding
but the default for input.  That's ISO Latin-1 for most platforms, but
IBM code page 1047 for OS/390 Unix (I think--no one who runs groff on
such a machine has ever spoken with me of their experiences).  And in
practice GNU troff doesn't, as far as I can tell, ever write anything
but the 94 graphical code points of ASCII, spaces, and newlines to its
output.

I imagine a lot of people's first instinct to fix this is to say,
"just give groff full Unicode support and enable input and output of
UTF-8!"  That's a huge ask.
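To make the underlying constraint concrete, here is a Python sketch
(an illustration of mine, not groff code): a byte-oriented stream
interpreted as Latin-1 can smuggle a French accented letter through as
a single byte with the eighth bit set, but it has no way at all to
carry Cyrillic.

```python
# Illustration only (not groff code): Latin-1 can carry U+00E9 ('é')
# as the single byte 0xE9, but Cyrillic letters simply have no
# Latin-1 representation.
french = "Cicéron"
cyrillic = "Таблица"  # Russian for "table"

print(french.encode("latin-1"))  # b'Cic\xe9ron'

try:
    cyrillic.encode("latin-1")
except UnicodeEncodeError as err:
    # Every Cyrillic code point is outside Latin-1's range(256).
    print("no Latin-1 encoding:", err.reason)
```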
A shorter pole might be to establish a protocol for communication of
Unicode code points within device control commands.  Portability isn't
much of an issue here: as far as I know there has been no effort to
achieve interoperation of device control escape sequences among
troffs.  That convention even _could_ be UTF-8, but my initial
instinct is _not_ to go that way.  I like the 7-bit cleanliness of GNU
troff output, and when I've mused about solving The Big Unicode
Problem I have given strong consideration to preserving it, or
enabling tricked-out UTF-8 "grout" only via an option for the kids who
really like to watch their chrome rims spin.  I realize that Heirloom
and neatroff can both boast of this, but how many people _really_ look
at device-independent troff output?  A few curious people, and the
poor saps who are stuck developing and debugging the implementations,
like me.  For the latter community, a modest and well-behaved format
saves a lot of time.

Concretely, when I run the following command:

  GROFF_ENABLE_TRANSPARENCY_WARNINGS=1 ./test-groff -Z -mom -Tpdf -pet \
    -Kutf8 ../contrib/mom/examples/mon_premier_doc.mom

I get the following diagnostics, familiar to all who have built groff
1.22.4 from source.

  troff:../contrib/mom/examples/mon_premier_doc.mom:30: error: can't translate character code 233 to special character ''e' in transparent throughput
  troff:../contrib/mom/examples/mon_premier_doc.mom:108: error: can't translate character code 233 to special character ''e' in transparent throughput
  troff:../contrib/mom/examples/mon_premier_doc.mom:136: error: can't translate character code 232 to special character '`e' in transparent throughput

More tellingly, if I page the foregoing output with "less -R", I see
non-ASCII code points screaming out their rage in reverse video.

  x X ps:exec [/Author (Cic<E9>ron,) /DOCINFO pdfmark
  x X ps:exec [/Dest /pdf:bm4 /Title (1. Les diff<E9>rentes versions) /Level 2 /OUT pdfmark
  x X ps:exec [/Dest /evolution /Title (2. Les <E9>volutions du Lorem) /Level 2 /OUT pdfmark
  x X ps:exec [/Dest /pdf:bm8 /Title (Table des mati<E8>res) /Level 1 /OUT pdfmark

It therefore appears to me that the pdfmark extension to PostScript,
or PostScript itself, happily eats Latin-1...but that means that it
eats _only_ Latin-1, which forecloses the use of Cyrillic code points.

I'm a little concerned that we're blindly _feeding_ the device control
commands characters with the eighth bit set.  It's obviously a useful
expedient for documents like mon_premier_doc.mom.  I am curious to
know why, instead of getting no text for headings and titles in the
Cyrillic PDF outline, you didn't get horrendous mojibake--plainly
Latin-1 garbage at that.

Anyway, some type of mode switching or alternative notation within the
PostScript command stream is required for us to be able to encode
Cyrillic code points.  And once we've figured out what that is, maybe
we can teach GNU troff something about it.  The answer might be to do
just whatever works for PostScript and PDF, since I assume this
problem has been solved already, but it also might mean having our own
escaping protocol, which the output drivers then interpret.

Regards,
Branden
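P.S.  On the "whatever works for PostScript and PDF" side: if I'm
reading the PDF spec right, PDF text strings (outline titles included)
may be written as UTF-16BE with a leading FE FF byte order mark, so
the format itself already has an answer.  On the "our own escaping
protocol" side, here is a Python sketch of the kind of 7-bit-clean
convention I have in mind--my invention for illustration, not an
existing grout protocol; the \[uXXXX] notation is borrowed from
groff's special character syntax.

```python
# Sketch (illustrative, not an existing grout convention): the
# formatter writes each non-ASCII character in a device control
# argument as a \[uXXXX] entity; the output driver decodes the
# entities back into characters before building the PDF bookmark.
import re

def encode_arg(s):
    """Formatter side: keep ASCII as-is, escape everything else."""
    return "".join(c if ord(c) < 0x80 else "\\[u%04X]" % ord(c)
                   for c in s)

def decode_arg(s):
    """Driver side: turn \\[uXXXX] entities back into characters."""
    return re.sub(r"\\\[u([0-9A-Fa-f]{4,6})\]",
                  lambda m: chr(int(m.group(1), 16)), s)

title = "Таблица"
wire = encode_arg(title)
print(wire)                        # ASCII-only: \[u0422]\[u0430]...
assert wire.encode("ascii")        # "grout" stays 7-bit clean
assert decode_arg(wire) == title   # and the title round-trips
```

The attraction, to my eye, is that the device-independent output stays
readable in a plain pager while still carrying the full code point.
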