Hi Deri,

[somewhat rearranged]

At 2026-01-28T20:49:25+0000, Deri wrote:
> I agree with Bruno, the fix can wait.  I'm not sure about this one
> though,
>
> [derij@pip build (master)]$ echo "\X'pdf: xrev'"|groff -Tpdf -ms -Z
> x T pdf
> x res 72000 1 1
> x init
> p1
> troff: src/roff/troff/input.cpp:3107: const char*
> token::description(): Assertion `0 == "unhandled case of `type`
> (token)"' failed.
> groff: error: troff: Aborted (core dumped)

Good catch--I wasn't aware of this.

> it seems to be only in current groff:-

That much is not a surprise.  Here's the commit (ec856178ff) that added
the assertion.

diff --git a/ChangeLog b/ChangeLog
index 383f3263b..e86c69426 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,12 @@
+2025-11-28  G. Branden Robinson <[email protected]>
+
+	* src/roff/troff/input.cpp (token::description): Add assertion;
+	every token type should have a human-readable description.  In
+	the event that's not the case and `NDEBUG` is defined, describe
+	the anomalous token as "an undescribed token" rather than "a
+	magic token", to make it clearer that the problem results from
+	developer oversight.
+
 2025-11-28  G. Branden Robinson <[email protected]>
 
 	* src/roff/troff/token.h: Add new inline member function

diff --git a/src/roff/troff/input.cpp b/src/roff/troff/input.cpp
index 0ff52efd1..35224c502 100644
--- a/src/roff/troff/input.cpp
+++ b/src/roff/troff/input.cpp
@@ -3031,9 +3031,9 @@ const char *token::description()
   case TOKEN_EOF:
     return "end of input";
   default:
-    break;
+    assert(0 == "unhandled case of `type` (token)");
+    return "an undescribed token";
   }
-  return "a magic token";
 }
 
 void skip_line()

This assertion is tripping when `token::description()`, a member
function that is called only by diagnostic routines to tell the user
(on the standard error stream) that something has gone wrong, hits a
"this should never happen" situation.
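In case the hunk reads as line noise: `assert(0 == "some message")` is
the venerable trick for smuggling a message into an assertion
failure--the string literal decays to a non-null pointer, so the
comparison is always false--and the `return` after it exists solely for
builds compiled with `NDEBUG` defined, in which `assert()` expands to
nothing.  A distilled sketch of the idiom (the `describe()` function is
invented for illustration; it is not the real groff member function):

  #include <cassert>

  // Stand-in for token::description(); not the actual groff code.
  const char *describe(int type)
  {
    switch (type) {
    case 0:
      return "end of input";
    default:
      // A string literal is never a null pointer, so this comparison
      // is always false, and the assertion fires--in a debug build.
      assert(0 == "unhandled case of `type` (token)");
      // Compiled with -DNDEBUG, assert() expands to nothing, and we
      // return a fallback description instead of aborting.
      return "an undescribed token";
    }
  }

The debug build dies loudly so the developer notices; the release build
degrades gracefully so the user isn't punished for our oversight.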
Civilized languages like Haskell and (in this respect) Rust force the
programmer to consider every possibility in switch/case-style control
flow.  Another approach, long seen in Pascal and Ada, is to have
honest-to-God real enumerated types that cannot take on undefined
values.[1]  This being C/C++, an `enum` is mostly superfluous window
dressing around a machine word, which is the only data type Real
Programmers care about.
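To watch the window dressing flap in the breeze, consider this minimal
sketch--`TokenType` and `describe()` are invented for illustration and
have nothing to do with groff's actual token class:

  #include <iostream>

  enum class TokenType { Empty, Backspace, Eof };

  const char *describe(TokenType t)
  {
    switch (t) {
    case TokenType::Empty:
      return "an empty token";
    case TokenType::Backspace:
      return "a backspace character";
    case TokenType::Eof:
      return "end of input";
    }
    // Every enumerator is handled above, yet control can still arrive
    // here...
    return "an undescribed token";
  }

  int main()
  {
    // ...because this is well-defined C++: 42 is representable in the
    // enumeration's underlying type (int), so the cast is legal even
    // though no enumerator has that value.
    TokenType t = static_cast<TokenType>(42);
    std::cout << describe(t) << '\n';  // prints "an undescribed token"
    return 0;
  }

Rust's `match`, by contrast, refuses to compile unless every variant is
covered, and safe code has no way to conjure an enum value outside the
declared variants.  That's the difference between a type and a machine
word in a costume.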
Anyway, my dissatisfaction with C/C++'s proud tradition of slovenly
data typing aside, let us continue by tracing the provenance of "a
magic token".

$ git blame ec856178ff^ -- src/roff/troff/input.cpp \
    | grep -C3 '"a magic token"'
^351da0dcd troff/input.c (James Clark 1991-06-02 04:20:34 -0500 3033)   default:
^351da0dcd troff/input.c (James Clark 1991-06-02 04:20:34 -0500 3034)     break;
^351da0dcd troff/input.c (James Clark 1991-06-02 04:20:34 -0500 3035)   }
^351da0dcd troff/input.c (James Clark 1991-06-02 04:20:34 -0500 3036)   return "a magic token";
^351da0dcd troff/input.c (James Clark 1991-06-02 04:20:34 -0500 3037) }
^351da0dcd troff/input.c (James Clark 1991-06-02 04:20:34 -0500 3038)
^351da0dcd troff/input.c (James Clark 1991-06-02 04:20:34 -0500 3039) void skip_line()

Well, that didn't take long.  The assertion tripping is my doing, but
it's also the sort of thing I _wanted_ to catch.  Or thought I did.
What does groff 1.23.0 do?

$ echo "\X'pdf: xrev'"|~/groff-1.23.0/bin/groff -Tpdf -ms -Z
x T pdf
x res 72000 1 1
x init
p1
V84000
H72000
x font 5 TR
f5
s10000
V84000
H72000
md
DFd
x X pdf: xrev
n12000 0
V768000
H540000
n12000 0
x trailer
V792000
x stop

The foregoing seems okay.  There is therefore a mystery here, and I
will dig into it.  Thanks for the report.

> It only dumps if the -ms is included.  It does not matter what text
> appears in the \X command.

Those two facts make this behavior _extra_ mysterious to me.  There's
no mechanism for redefining an escape sequence, so WTF?

I must love a challenge.

...one brief GDB session later:

(gdb) list token::description
...
3000	  static char buf[bufsz];
3001	  (void) memset(buf, 0, bufsz);
3002	  switch (type) {
3003	  case TOKEN_EMPTY:
3004	    return "an indeterminate token (at start of input?)";
3005	  case TOKEN_BACKSPACE:
3006	    return "a backspace character";
3007	  case TOKEN_CHAR:
3008	    if (INPUT_DELETE == c)
3009	      return "a delete character";
(gdb) p type
$1 = token::TOKEN_BEGIN_TRAP

Hmmmmmm!  That would explain why loading the (full-service) macro
package provoked the problem; it set up (proper) traps (cf. the
"implicit page trap").

Hypothesis: the input stream pointer is beyond where I thought it was.
I beat the ever-living heck out of `\X` escape sequence handling, at
the lexical level, for this release cycle, as recorded in the epic bug
#63074.  So it's highly plausible that I goofed here.

Will investigate and advise.

Regards,
Branden

Please find below my irregularly scheduled sarcastic jeremiad against
brogrammers, past and present.  (And, implicitly, their
"velocity"-obsessed managers.)

[1] Pascal somewhat notoriously had its compilers inject bounds checks
upon, reputedly, _every_ assignment to a subrange type,[2] which folks
like Kernighan seized upon as potentially wasteful.  Kernighan's
admirers eagerly latched onto his criticism, repeating it by rote and
typically without ever bothering to perform any empirical measurement
of the impact themselves.  (If they did, they somehow never remembered
to cite any.)

Ada had seen this tradeoff coming at least as far back as the late
1970s, and mandated that the compiler undertake static analysis and
inject bounds checks only where it could not prove that out-of-range
values were impossible in the first place.  The aforementioned
Kernighan admirers responded by (a) proclaiming that it was too hard to
write an Ada compiler (anything approaching formal methods being too
difficult for Unix nerds); and (b) studiously ignoring Ada except
whenever an opportunity arose to denigrate it as bloated DoD-ware for
people who wore crew cuts.  Nowadays, both GCC and LLVM have multiple
large, sophisticated systems for doing semantic, control flow, data
flow, and memory-safety analysis, and are even considered sexy for this
reason (except LLVM is sexier because it's not copylefted).  But Ada is
still wrong and stupid and irrelevant for having these things too soon,
when they weren't cool.

[2] In Pascal, enumerated and subrange types looked respectively as
follows.

  type Day = (Sunday, Monday, Tuesday, Wednesday, Thursday, Friday,
    Saturday);
  type Weekday = Monday .. Friday;

As I understand it, even terrible old Pascal never needed bounds checks
on assignments to a variable of an enumerated type, because you could
statically check for a valid assignment.  Subrange types _were_
idiomatically used for array bounds.

  type MinesweeperField = array [1 .. 20, 1 .. 20] of Boolean;

The foregoing could be handled statically too, because the valid array
indices came from a static range, but the following could not be, and I
think it's what aggrieved Kernighan, who badly wanted variable-length
strings; Pascal's single biggest blunder was not having a good story
for them.  Wirth's examples, even in the "Report" version of Pascal
that formed the basis of ISO 7185 Standard Pascal, were pretty cringe,
clearly stuck in the fixed-form tradition of punched card-based records
(as was FORTRAN 77).

  Read(Inputfile, N);
  type UserName = array [1 .. N] of Char;

(I'm not sure the foregoing is a conforming [piece of a] Standard
Pascal program; in its official form, Pascal was even stricter than
traditional and ANSI C about the lexical organization of blocks.  You
had to define your constants first, then types, then variables; then
_declare_ any procedures and functions referenced within the block; and
only then could you write statements.)

ISO C struggles to this day with variable-length arrays and flexible
array members, which should suggest to C partisans that they don't
completely have their story straight in this department.  But it
doesn't.

The advantage of Pascal's run-time bounds checks was that they
prevented entire classes of undefined behavior.
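Modern C++ will happily sell you the same product; you just have to ask
for it by name.  A minimal illustration (standard library only, nothing
groff-specific):

  #include <iostream>
  #include <stdexcept>
  #include <vector>

  int main()
  {
    std::vector<int> v = {1, 2, 3};

    // v[3] is undefined behavior: the implementation owes you
    // nothing, and "nothing" has historically included remotely
    // exploitable memory corruption.

    // v.at(3) is the Pascal-style purchase: a run-time bounds check
    // that converts the error into well-defined behavior.
    try {
      std::cout << v.at(3) << '\n';
    } catch (const std::out_of_range &e) {
      std::cerr << "caught: " << e.what() << '\n';
    }
    return 0;
  }

The checked spelling has been in the standard library since C++98.  The
unchecked one remains the default--and the idiom.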
In the 1980s, C hackers circulated copies of Kernighan's "Why Pascal Is
Not My Favorite Programming Language" like samizdat, and referred to
it--without necessarily having read it--as an authoritative case
_against performing run-time bounds checks at all in any context_.  I
don't think Kernighan would have approved of this reckless and gigantic
generalization of his point, but I also don't think it would have
mattered if he had objected vociferously.  A Real Programmer cites
authorities when they support whatever it is one wanted to do in the
first place, and ignores them otherwise.

Thus did an entire sector of the software industry, centered on Unix
and C, gleefully introduce countless vectors for security exploits.
Their code ran faster!

I suppose the NSA loves C and Unix because it can easily penetrate any
system employing them.  No wonder Bob Morris was hired straight out of
Bell Labs in 1986 to become its chief scientist.  He had seen what the
burgeoning field was doing for intelligence and counter-intelligence
work.  Remember, now, Real Programmers aren't _reflexively_ opposed to
the U.S. federal government: NSA good--DoD bad.  The NSA won't make you
cut your hair or your beard.  One just has to pass an FBI background
check, sign away one's freedom of speech for the rest of one's life,
and, equipped with classified knowledge that one encourages people to
infer is immensely valuable, walk around acting superior to everyone.
But one was already well-practiced at that, no?

In case it need be said: when you apply run-time bounds checks
intelligently, you are, as when Jules Winnfield handed over the
contents of his wallet to the gun-toting "Ringo", _buying_ something
with your money--in this case, a degree of protection from undefined
behavior and security vulnerabilities.  Is the benefit worth the cost?
To answer that, one sometimes needs to undertake empirical analysis.

Ain't no Real Programmer got time for that.
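For the record, a crude version of that empirical analysis fits in a
couple dozen lines.  (A sketch only, with the usual micro-benchmark
caveats: a real measurement would have to control for the optimizer
hoisting the bounds check out of the loop entirely.)

  #include <chrono>
  #include <cstddef>
  #include <iostream>
  #include <vector>

  // Time a summation over `v` using the given indexing function.
  template <typename Index>
  static double time_sum(const std::vector<int> &v, Index idx)
  {
    auto start = std::chrono::steady_clock::now();
    long long sum = 0;
    for (std::size_t i = 0; i < v.size(); i++)
      sum += idx(v, i);
    auto stop = std::chrono::steady_clock::now();
    std::cout << "sum = " << sum << '\n';  // defeat dead-code elimination
    return std::chrono::duration<double>(stop - start).count();
  }

  int main()
  {
    std::vector<int> v(100000000, 1);
    double unchecked = time_sum(v,
      [](const std::vector<int> &w, std::size_t i) { return w[i]; });
    double checked = time_sum(v,
      [](const std::vector<int> &w, std::size_t i) { return w.at(i); });
    std::cout << "operator[]: " << unchecked << " s\n"
              << "at():       " << checked << " s\n";
    return 0;
  }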
