[much context preserved because this is such a sprawling argument]

At 2026-02-05T11:46:29-0500, Deri James wrote:
> Follow-up Comment #36, bug #64360 (group groff):
>
> On Thursday, 5 February 2026 02:55:02 GMT G. Branden Robinson wrote:
> > Follow-up Comment #35, bug #64360 (group groff):
> >
> > [comment #34 comment #34:]
> >
> >> Discussion about this issue has temporarily jumped ship to bug
> >> #67992.  Hopefully it finds its way back here, as it has fsck-all
> >> to do with #67992's topic.
> >
> > Not for very long.  It pretty much jumped back.  Bug #67992 still
> > mostly has to do with the issue I just posted to the _groff_ list
> > about.
> >
> > https://lists.gnu.org/archive/html/groff/2026-02/msg00025.html
> >
> > As noted there, it's now a documentation issue.
> >
> > _This_ issue is still awaiting Deri's feedback.
> >
> > Here's the part that jumped (from comment 15 to bug #67992).  The
> > material not prefixed with ">" is me.
> >
> >> Follow-up Comment #14, bug #67992 (group groff):
> >>
> >> On Sunday, 1 February 2026 04:51:11 GMT G. Branden Robinson wrote:
> >>> Follow-up Comment #11, bug #67992 (group groff):
> >> [...]
> >>
> >>> Deri, for example, is rigid in his expectations of GNU troff's
> >>> output format ("grout", as I term it).  Where documentation
> >>> doesn't support his rigidity, he points to the implementation as
> >>> a specification from which no deviation should be permitted; see
> >>> bug #63544.
> >>
> >> I provided a one-line perler which achieved your desire for more
> >> readable grout (the "wish" of that bug):-
> >>
> >> perl -pe 's/(.)(.*)/$1\n$2/ if m/^w/; s/^(.)(\S.*)/$1 $2/mg' zfile
> >>
> >> I hope you are still using it.
> >
> > I am not.  I do not see the value in filtering GNU troff's output
> > to pre-digest it into a form gropdf requires but which no other
> > groff output drivers do.

> I see your misunderstanding of the purpose of the perl one-liner.
> The wish bug was to produce grout more readable by humans,

Yes.

> which is not a benefit to most users,

Agreed; most users won't ever inspect "grout".

> but may be a help for developers searching for bugs.

Yes.

> So I wrote the perl one-liner so that a human would see the grout as
> you would like to see it without actually changing the current
> format of the grout.

That's an anti-goal, in my opinion.  If most users aren't looking, but
developers _are_, then groff's default should be to produce grout that
is more readable _by developers_, especially if doing so does not
significantly impact CPU load, memory use, or I/O throughput.
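To make that concrete: fed an invented, grout-shaped fragment (the
commands here are illustrative, not captured from a real run), the
one-liner rewrites

  s10000
  f5
  V12000
  thello,
  wh2500
  tworld
  n12000 0

as

  s 10000
  f 5
  V 12000
  t hello,
  w
  h 2500
  t world
  n 12000 0

that is, it splits the "w" (word space) command onto its own line and
separates every other command letter from its argument.  My point is
that grout should leave the formatter looking like the latter, not
acquire that shape in a postprocessing pass.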
Kernighan's original "trout" format (my term) was even more terse than
one might expect for anything emerging from a workplace that included
Ken Thompson.  That's because the impact on I/O throughput to the
typesetter device and memory usage upon it _was_ significant!  CSTR
#97 explains this clearly, in my opinion.

> It is not intended as a filter for gropdf,

Right--I don't see how it could be, since gropdf as currently written
would be unable to interpret it.

> merely a simple way of achieving the purported reason for the
> ticket.  I was hoping that would be sufficient to close the ticket.

No.  I don't want a tool that can make the wrong thing look like the
right thing when I can get the right thing in the first place.

(Much as I like to hack at the clay feet of idolized Unix personages,
in my opinion "go fmt" was and is an excellent initiative.  So props
to Griesemer, Pike, and Thompson for that!)

> Since it is difficult to compare versions of pdfs,

That would be a defect in the file format.

> when a visual difference is spotted, the easiest way of spotting
> changes between the versions is to compare the two grout files.

...and those differences will be more easily spotted with a cosmetic
arrangement of grout that is less challenging to interpret.

itsmoreefficienttowriteenglishsentenceslikethisbutmostreaderswouldprefer
thatisticktoidiomaticpractices

> If the format of the grout file is altered between versions it would
> make comparison more difficult.

These scenarios are necessarily a subset of all inspections of grout.
_That subset_ is when you bust out the Perl one-liner to "normalize"
the grout.

> Once the two grouts are confirmed compatible by diffing them, any
> visual difference in the pdf must be due to changes I have
> introduced in gropdf.

This technique can be expected to be fruitful for detecting bugs in
the formatter and macro files--the driver support file "pdf.tmac" is
likely of most interest here.
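If I understand the procedure correctly, it amounts to something like
this (the paths and file names are invented for illustration):

  # builds of the two versions, installed side by side
  old/bin/groff -Tpdf -Z doc.ms > old.grout
  new/bin/groff -Tpdf -Z doc.ms > new.grout
  diff old.grout new.grout

If the grout files match, any visual difference between the two PDFs
must have been introduced by gropdf.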
What it _can't_ do is help you detect or correct bugs in gropdf, the
Perl program, itself.

I've started looking (only superficially, so far) at qpdf to see if it
might be something we can use to write automated tests for gropdf.

> > See:
> > https://cgit.git. ... s.sh?h=1.24.0.rc2

I went to the Savannah ticket tracker to see what that URL was, and
wow!  Its URL abbreviator is now so brutal that it trashes the
hyperlink--even on its own web pages!--rather than just abbreviating
the link text.  Not useful.

> > (Whoops--stale comment in there.  I'll fix that.)
>
> I'm confused,

I was saying that:

  "# Ensure that characters are mapped to glyphs normatively."

...was irrelevant to the test script
"src/devices/grotty/tests/h-option-works.sh".

> >> I am expecting changes to grout and groff fonts (v2) when you
> >> complete full utf8 throughput.
> >
> > I don't plan "full UTF-8 throughput".
>
> I was being too succinct to be understandable!
>
> Input is UTF8

That's the plan, yes.

> (and may include groff named characters, e.g., \[cq], \[u03B9]).

Already supported for many years, yes.  The former since day one, or
as close as the historical record permits us to infer.

> Each character is then converted to its Unicode Code Point (UCP)
> char32_t (good choice)

Not _every_ character.  We're helpless with \[ru] and \[bs], which
have no corresponding Unicode assignments.  But almost all, yes.
Imagine a Dirichlet indicator function joke here.

> and becomes a text node

Not the correct term, but I think I get what you're saying.  (And I'm
liable to rename some stuff in this area anyway.  "composite_node" is
high on my (s)hit list.)

> (I'd also store the actual input utf8 or groff char name as a
> wchar_t string in the text node as well - makes asciify a doddle).

I thought about that too.  I was a little flabbergasted to observe
that this was overlooked in the first place when I realized that we
had a whole struct backing up every single glyph _already_.

> This also means that only text nodes carry document stream text.

This may be an overstatement.  But maybe it could become true with
some revision to the class hierarchy.

Our "friend" `composite_node` rears its head again.  Not only does it
have nothing to do with composite special character escape
sequences--like `\[e aa]`, which is a more scrupulous way of saying
\['e] or é--but the layer or stage at which it should undergo
transformation to text has proved to be a fraught question.[1]  Dave
and I went ten rounds over that in some Savannah ticket I can't find
right this second.

> > I don't plan to use UTF-8 for GNU troff's internal storage of
> > formattable character objects (rather, I plan to use `char32_t`),
> > and moreover, even if I did use UTF-8 internally, the input would
> > still have to undergo some form of Unicode normalization; probably
> > Normalization Form D given the program's orientation toward
> > typesetting and support for glyph composition by overstriking.
> > groff_char(7) has cautioned the reader to prepare for
> > normalization of input for several years.
>
> NFD is used for searching and sorting INPUT characters, it
> specifically has nothing to do with typesetting.  This ticket:-
>
> https://savannah.gnu.org/bugs/?67244
>
> (which I can't access at the moment, so I can't check the current
> state of play).

Within the past hour or so I got an email that says your problem
should be resolved--you had guessed the cause correctly.

Anyway, are you sure?  Is it the case that font designers
overwhelmingly provide precomposed glyphs?  I thought we had complex
systems like HarfBuzz, and font file formats more advanced than Type
1, to solve challenging problems in this domain.

As I understand it, these handle multiple diacritics or
context-dependent glyph shapes (initial, medial, and final forms, for
instance), because fonts, especially post-ISO 8859 ones with huge
coverage, need to be highly adaptable and have logic in them to
attractively compose glyphs.  If you want to stack a dot and a hacek
above an "s", you can, and you don't have to worry about that composed
glyph being present in the font as a single indexed entity.

If I'm right, then we _do_ want to apply NFD, or something like it, to
UTF-8 sequences on the input stream.
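For concreteness, here is the decomposition under discussion, sketched
with Perl's core Unicode::Normalize module (whether the formatter
would do this with its own tables, with a library, or not at all is
exactly what's unsettled; this only illustrates the transformation):

  use Unicode::Normalize qw(NFD);
  my $nfd = NFD("\x{00E9}");  # "e" with acute accent, precomposed
  printf "%v04X\n", $nfd;     # prints 0065.0301: "e" + combining acute

A formatter holding the NFD form can treat the combining accent as an
overstriking problem whether or not the font offers a precomposed
glyph.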
> Documents an issue with using any unicode normalisation forms.  The
> problem, in this case, involves characters which have different
> forms depending on context.  This is a bit like ligatures: 'f' has a
> UCP, but when it is followed by another 'f' a different UCP is
> appropriate IF the font being used supports both UCPs (f=0066,
> ff=FB00).  If you are searching text for 'ff' you would hope to find
> both 'ff' and 'ﬀ'.  The same is true for the iota character: it has
> different forms depending on context, but for searching/sorting,
> which normalisation is concerned with, both forms mean iota, so they
> are marked as equivalent input characters.  But groff is concerned
> with the output typesetting (for grops/gropdf at least), and the
> difference in output form is lost.

Isn't this the purpose, or one of them, of PDF's CMap feature?

I have _no_ objection to adding support for a datum inside the
formatter that keeps track of the input UTF-8 sequence prior to
decomposition, so that this information can be smuggled to the output
device via a device extension.  We've already got a flexible
"container" system in our node class hierarchy.  Ironically,
"composite node" applies better, I think, to this phenomenon than to
"user-defined characters", which is its current association.

What I'm not convinced of is that we can get along without doing
decomposition at all.
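For what it's worth, the ligature case cuts differently from the
accent case: canonical decomposition leaves U+FB00 alone, and only the
compatibility normalization forms fold it into two letters.  A quick
check with Perl's core Unicode::Normalize module:

  use Unicode::Normalize qw(NFD NFKD);
  my $lig = "\x{FB00}";           # LATIN SMALL LIGATURE FF
  print length(NFD($lig)), "\n";  # 1: canonically unchanged
  print NFKD($lig), "\n";         # "ff": compatibility decomposition

So an NFD pass on input would not, by itself, erase the distinction at
issue here; the compatibility forms would.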
Anyway, much of the second half of this discussion needs to land in
Savannah #40720.

And neither of these issues ("grout" readability and decomposition of
Unicode input) is a 1.24.0 gate.

Regards,
Branden

[1] And neither of these is synonymous with "composite character
mappings" configured with the `composite` request, and (as of groff
1.24.0) reported with `pcomposite`.  (Composite character mappings
_are_ implicated in composite special character escape sequences,
however.)

We express our admiration for Dennis Ritchie's genius by imitating his
achievement with C's `static` keyword, which variously implicates
symbol visibility (linkage) and storage allocation depending on what
it's applied to.  Remember, language designers everywhere--once you've
found a keyword you like, use it for as many different purposes as
possible.  Only the truly elite hackers who become your most feared
and admired exponents will never be confused--or, rather, they will
succeed best at fooling people into thinking they never are, which is
as good or better.