[bug #67372] [troff] \A and \w behavior change from 1.22.4 -> 1.23.0

G. Branden Robinson Tue, 29 Jul 2025 08:38:27 -0700

Follow-up Comment #15, bug #67372 (group groff):

At 2025-07-29T09:29:06-0400, Ingo Schwarze wrote:
> [comment #11 comment #11:]
>
>> Here's a proof-of-concept patch, fixing up only the `\A` escape
>> sequence.
>
> Your patch is absolutely not OK.
> In fact, it significantly exacerbates the regression.


I reject your characterization that it's a regression.  As I see it, the
input you were feeding to GNU troff was syntactically invalid and your
test cases dependent on *unspecified* formatter behavior.

> Here is what i tested.
> 1. I replaced s/want_att_compat/compatible_flag/ in your patch.
> 2. I added the resulting modified patch to the OpenBSD port.
> 3. The patch, configure, build, fake, and package stages all went
> through with no suspicious output.
> 4. I installed the modified port.
> 5. Nothing changes in the rest of the test suite, which is good as
> far as that goes.
>
> But while the roff/esc/A1 test exposes two regressions in \A handling
> in 1.23, with \A otherwise still in working order, with your patch
>
> 1. These two regressions become significantly worse:
> With your patch, groff now prints
> string expansion: 0xwordxu
> register expansion: 0121u
> instead of the expected
> string expansion: 1u
> register expansion: 1u
> which is much worse than without the patch because it now leaks \A
> arguments to the output
> rather than only losing a single suffix character.
>
> 2. The scope of the regression expands massively.
> In addition to not fixing the problem (and making it worse),
> the \A sequence is almost entirely broken with the patch.
> With one exception (the final "unterminated sequence" test),
> the patch causes every single one of the "Argument delimiter"
> tests to fail very badly:
> most of them switching value from 1 to 0 *and* leaking arguments
> to the output.
>
> 3. The regression of the first test, the \A\r test, is particularly
> dire.  That test now causes this output (observe the almost blank
> line with the "u", and the ruined section title):
> u
> Argument delimitersword
> unsupported \r: 0
> instead of the expected
>
> Argument delimiters
> unsupported \r: 1u

I don't know that your tests are sound given the grammar of GNU _troff_.

First I should quote the manual again, though.


   The following escape sequences don't take arguments and thus are
allowed as delimiters: '\<SPC>', '\%', '\|', '\^', '\{', '\}', '\'',
'\`', '\-', '\_', '\!', '\?', '\)', '\/', '\,', '\&', '\:', '\~', '\0',
'\a', '\c', '\d', '\e', '\E', '\p', '\r', '\t', and '\u'.  However,
using them this way is discouraged; they can make the input confusing to
read.


And lest you suspect that the foregoing language is some contrivance of
mine, here's the _groff_ 1.22.3 Texinfo manual:


...but it is better not to use this feature to avoid confusion.

   The following escapes sequences (which are handled similarly to
characters since they don't take a parameter) are also allowed as
delimiters: '\%', '\ ', '\|', '\^', '\{', '\}', '\'', '\`', '\-', '\_',
'\!', '\?', '\)', '\/', '\,', '\&', '\:', '\~', '\0', '\a', '\c', '\d',
'\e', '\E', '\p', '\r', '\t', and '\u'.  Again, don't use these if
possible.


How old _is_ this language?  A few "git blames" later, I see:


a7cac213d3 doc/groff.texinfo (Werner LEMBERG      2000-03-28 09:26:07 +0000
5870) The following escapes sequences (which are handled similarly to
a7cac213d3 doc/groff.texinfo (Werner LEMBERG      2000-03-28 09:26:07 +0000
5871) characters since they don't take a parameter) are also allowed as
a7cac213d3 doc/groff.texinfo (Werner LEMBERG      2000-03-28 09:26:07 +0000
5872) delimiters: @code{\%}, @w{@samp{\ }}, @code{\|}, @code{\^}, @code{\@{},
a7cac213d3 doc/groff.texinfo (Werner LEMBERG      2000-03-28 09:26:07 +0000
5873) @code{\@}}, @code{\'}, @code{\`}, @code{\-}, @code{\_}, @code{\!},
6603276718 doc/groff.texinfo (Werner LEMBERG      2007-09-15 08:10:46 +0000
5874) @code{\?}, @code{\)}, @code{\/}, @code{\,}, @code{\&}, @code{\:},
6603276718 doc/groff.texinfo (Werner LEMBERG      2007-09-15 08:10:46 +0000
5875) @code{\~}, @code{\0}, @code{\a}, @code{\c}, @code{\d}, @code{\e},
7a267a0726 doc/groff.texinfo (Werner LEMBERG      2008-09-25 07:48:15 +0000
5876) @code{\E}, @code{\p}, @code{\r}, @code{\t}, and @code{\u}.  Again, don't
7a267a0726 doc/groff.texinfo (Werner LEMBERG      2008-09-25 07:48:15 +0000
5877) use these if possible.


Notice that the foregoing necessarily implies (by set complementation)
that *escape sequences that take arguments aren't allowed as
delimiters*.

Given that, let's have a look.  I'll annotate your cases with roff-style
comments.


$ cat ~/src/CVS/mandoc/regress/roff/esc/A1.in
.\" $OpenBSD: A1.in,v 1.1 2022/06/08 13:08:00 schwarze Exp $
.Dd $Mdocdate: June 8 2022 $
.Dt ESC-A1 1
.Os
.Sh NAME
.Nm esc-A1
.Nd the roff escape A sequence: identifier syntax validation
.Sh DESCRIPTION
empty: \A'' \" valid as input, delimited contents not valid as an identifier
.br
letters: \A'word' \" valid as input, delimited contents also valid as
identifier
.br
blank: \A'two words' \" valid as input, delimited contents not valid as an
identifier
.br
ASCII non-letters: \AA!"#$%&'()*+,-./0123456789:;<=>?@[\\]^_`{|}~A \" valid as
input, delimited contents also valid as an identifier
.br
invalid escapes: \A'\\\.\G' \" valid as input, delimited contents not valid as
an identifier
.Ss Argument delimiters
unsupported \er: \A\rword\ru \" valid as input, delimited contents also valid
as identifier
.br
ignored \e&: \A\&word\&u \" valid as input, delimited contents also valid as
identifier
.br
useless \e.: \A\.word.u \" UNSPECIFIED: `\.` is not, technically, an escape
sequence
.br
invalid \eG: \A\GwordGu \" UNSPECIFIED: `\G` is not a defined escape sequence
.br
special \e-: \A\-word\-u \" valid as input, delimited contents also valid as
identifier
.br
break \ep: \A\pword\pu \" valid as input, delimited contents also valid as
identifier
.br
nospace \ec: \A\cword\cu \" valid as input, delimited contents also valid as
identifier
.\".br
.\"skipchar \ez: \A\zword\zu \" ERROR: escape sequences that take arguments
are not valid as escape sequence delimiters
.br
.ds mystr xwordxu
string expansion: \A\*[mystr] \" ERROR: escape sequences that take arguments
are not valid as escape sequence delimiters
.br
.nr myreg 121
register expansion: \A\n[myreg]u \" ERROR: escape sequences that take
arguments are not valid as escape sequence delimiters
.br
ignored \eON: \A\O1word\O2u \" ERROR: escape sequences that take arguments are
not valid as escape sequence delimiters
.br
special character: \A\(hyword\(hyu \" ERROR: escape sequences that take
arguments are not valid as escape sequence delimiters
.br
ignored \eZ\(aqstr\(aq: \A\Z'foo'word\Z'bar'u \" ERROR: escape sequences that
take arguments are not valid as escape sequence delimiters
.br
horizontal motion: \A\h'1'word\h'3'u \" ERROR: escape sequences that take
arguments are not valid as escape sequence delimiters
.br
horizontal line: \A\l'4'word\l'2'u \" ERROR: escape sequences that take
arguments are not valid as escape sequence delimiters
.br
overstrike: \A\o'ab'word\o'cd'u \" ERROR: escape sequences that take arguments
are not valid as escape sequence delimiters
.br
unterminated: \A'word \" ERROR: unterminated delimited argument[*]


[*] GNU _troff_ has, from 1.22.3 to Git HEAD at least, merely warned
about this input; what usually happens is that the remainder of the
input line becomes the argument to the escape sequence, and the parser
state resets upon encountering a newline (or end of input).  As far as I
know, it's unspecified whether input line continuation is honored in
this context, and I can imagine a case being made either way.

I advise you to fix your test cases.

I've opened bug #67375 to address the underspecification problem.

> I have no idea why you are bringing compat mode into the discussion.

**BECAUSE GNU TROFF CHANGED THE WAY INTERPOLATIONS WERE HANDLED COMPARED
TO AT&T TROFF STARTING OVER 35 YEARS AGO, AND YOUR TEST CASES ARE
EXPECTING AT&T-LIKE HANDLING OF THE INPUT.**

Unfortunately GNU _troff_'s change to the grammar appears not to have
been comprehensively applied.  You have found some corner cases and
elevated the formatter's treatment of, per its manual, *prohibited
input* to the status of a specification, such that you term
nonconformance with your expectations as "regression".

> This regression has absolutely nothing to do with compat mode.

It does, inasmuch as you happen to have crafted your tests to expect
AT&T _troff_-like interpolation behavior.

> You cannot be suggesting that manual pages should be rendered in
> compat mode, or that the mandoc test suite should run in compat mode?

Not necessarily.  But I will *have* to suggest that remedy if you refuse
to come to grips with GNU _troff_'s "interpolation depth" concept.

And I kind of do expect you to refuse that, because I fear you'll have
to change _mandoc_'s parser in ways you don't want to if you accept that
concept.

Or, you could absorb what the GNU _troff_ manual has said for 17-25
years and stop employing grammatical constructs that _groff_'s Texinfo
manual has, since before _mdocml_[*] was written, said are not accepted.
Or you could not be so particular about how the formatter recovers from
syntactically invalid input.  You might run those cases through `groff
-Wall -wdelim` and look for non-empty output to the standard error
stream.

[*] _mandoc_'s previous name, for those reading this dispute for
(dubious) entertainment.

> I have never used compat mode for anything and i think its very
> existence is a terrible idea.

You can propose its removal to the _groff_ mailing list, of course.

> Software always changes, and the idea that this continuous evolution
> can be captured by a single bit of information is delusional.  Which
> point in time does compat mode target for compatibility?

That's a fair point, but given how imprecisely specified many aspects of
*roff are in the first place, a "compatibility mode" that satisfactorily
renders historical AT&T documents according to veteran practitioners is,
probably, "good enough".

> So i had to look up how to even enable compat mode because i never
> used or tested it...  Putting ".cp 1" as the first line into the test
> file changes absolutely nothing.  Neither does setting the -C command
> line option.
>
> Putting ".cp 1" after the mdoc(7) preamble (right before the ".Sh
> NAME" line results in this printout on standard output, *instead* of
> the NAME header:
> $] .tm Usage: .Sh section_name ... (#0.c]) c-macro-name Sh.
> and then two blank lines,
> but all parts of the (exacerbated) regression still occur with no
> changes.
>
> Putting ".cp 1" after the ".Ss Argument delimiters" section header
> only prevents the *new* regressions that your patch causes,
> but the *original* regression that this ticket is about still occurs.
>
> To summarize, your patch is ineffective for the purposes of this
> ticket and causes new regressions that are even worse.

It sounds to me that, like Bjarni's obsession with compiler warnings and
similar, you're refusing to consider the _context_ of what it is you're
asking the formatter to do because you're regarding an automated test
failure as inherently defective, as opposed to being a diagnostic tool.

> I think your development methodology is not quite right.
>
> It started off well enough:
>
> 1. A regression is reported.
> 2. You identified the commit that caused it (thanks for that!).
> (I did not yet check whether reverting that commit, or the relevent
> part(s) of it, indeed fixes the regression.)
> 3. That commit was *not* intended to change behaviour.
> Indeed, the commit message says "refactor".
>
> At this point, the logical next step is to _just fix the bloody
> regression_

No, the logical next step is to _understand the behavior change_.  That
includes a consideration of whether the input was well-formed in the
first place.

> by reverting the defective commit, or the defective parts of it,
> then consider what can be done to reduce the probability of future
> refactoring commits causing regressions in a similar way.

I explained at length and with citations to the GNU _troff_ manual why
your test cases are invalid inputs.

> Instead you chose to
>
> 4. Research historic behaviour ab initio.

Yup.  No *roff in the world has ever been written for the purpose of
passing _mandoc_'s test suite.

> 5. The result was (at least for \w; i don't think discussing \b for
> nroff(1) makes much sense,

\b is perfectly well defined for nroff mode.  It stacks glyphs
vertically as in troff mode (with coarser device resolution).  You're
flaunting your ignorance with this statement.  (That said, it _is_ a
weird feature.)

> and i didn't check what's going on for \o yet since the roff/esc/o
> test does not indicate any regression) that \w behaviour has been
> consistent from AT&T all the way through to groff-1.22.4, which
> impressively confirms what we already knew after steps 2 & 3, i.e.
> that there is a regression.

It does nothing of the sort.

> Now instead of fixing the regression, you take off on a breathtaking
> tangent about an unrelated topic, namely, compatibility mode.  That
> doesn't make sense to me.

I'm talking about compatibility mode because what you seem to desire and
expect is AT&T _troff_-style interpolation semantics, not GNU
_troff_-style.

But you don't seem to understand that, because your own oxygen-hungry
tangent fails to mention, let alone grapple with, the concept of "input
level" (a.k.a "interpolation depth") which is fundamental to the
operation of GNU _troff_'s parser.  The term is right there in the code
you're wanting to patch (`input_stack::get_level()`).

GNU _troff_ has long had defects in the sense that it hasn't
aggressively validated its input stream.  Not a surprise--C and C++
developers are not Haskell programmers.  Over the years, I've tightened
up some of the gaps.  But I might not ever make it perfectly rigorous.

When not in compatibility mode, the character that follows an escape
function selector (that accepts a delimiter) MUST be a valid delimiter
_prior_ to application of any interpolation operations to the stream, or
the input is invalid.

When in compatibility mode, the character that follows the escape
function selector can be anything, and will be greedily, recursively
interpolated.  If what ends up after the escape function selector is a
valid delimiter character, the formatter marches merrily on.

> Sure, i'm not saying we cannot discuss compatibility mode (even though
> i consider compat mode mostly a waste of time and am not particularly
> keen on even considering it at all), but when a regression is found,
> normal development methodology is:
>
> 1. Fix the regression by reverting the defective change, or the
> defective part of it.

Show me a document--a practical document, not a test case--that fails to
render correctly due to this behavior change.  There's a reason we've
gone more than two years without anyone noticing it.

> 2. Consider which safeguards are needed to prevent repeat incidents,
> in particular in case a regression not only got committed, but
> released.

Test cases will do.  I'm happy to write them in conjunction with this
bug fix.  Or, rather, a bug fix as _I_ conceive it, which I expect to be
more long the lines of my patch in comment #11 than yours in
comment #13.

> 3. Only after that can other changes to the code in question be
> considered, and if those other changes change behaviour (like the
> patch you posted here does) that needs a full rationale and a
> discussion completely separate from the regression fix.

I see that you're a big fan of pipeline stall.

> When a regression is found, it is certainly not acceptable to pile
> more commits on top without fixing the regression first, and it is
> methodologically also not acceptable to discuss other changes in the
> context of the regression fix.

Your test case(s) exhibited proscribed input.  Get off your soapbox.

> But can we please focus this ticket exclusively on the regression fix
> and discuss any other matters elsewhere?

No, because what you identified wasn't a regression.

Where did you come up with these test cases?  Did you excerpt them from
real-world documents?  Where are these documents?

They look contrived to me--which can be fine, if done with a language
grammar ready at hand.  Unfortunately, *roff may not even be a decidable
language, so no BNF-style grammar for it has ever existed, to the best
of my knowledge.[*]  Instead we have a manual in English, which you
evidently did not consult.

[*] I don't actually know for a fact that all grammars expressible in
    BNF are decidable.

> Reacting to regressions by piling more changes on top instead of just
> reverting and learning from the mishap is *the* most effective way to
> make software unstable, accumulate bugs, and lose control of what one
> is doing.

You're preaching now.  That time would be better spent reading the GNU
_troff_ manual to develop a better understanding of the language.



    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?67372>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/

signature.asc
Description: PGP signature

[bug #67372] [troff] \A and \w behavior change from 1.22.4 -> 1.23.0

Reply via email to