Re: Need background on design of groff character classes

G. Branden Robinson Sat, 10 Jan 2026 13:52:45 -0800

Hi Dave,

At 2026-01-04T04:35:20-0600, Dave Kemper wrote:
> I think most of the points here are academic now, but I do want to
> respond to a few.

Good show.

> On Tue, Dec 2, 2025 at 12:45 AM G. Branden Robinson
> <[email protected]> wrote:
> > Not at all.  To set a new character flag on an arbitrary character
> > x, one just does it.  If one really means for `\[em]` to both end a
> > sentence (if followed by appropriate whitespace) and for the line to
> > be eligible for breaking after it, one says so.
> 
> Sloppy wording on my part; I should have said "set a new flag without
> clobbering existing ones."  OR-ing, mathematically.

Okay.  I don't see the necessity for this feature.

Also I'm not sure the semantics of it have been completely considered in
conjunction with the possibility of character or class redefinition or
removal.[1]

> > Also, if one wants to know what flags a character possesses, one can
> > ask with the new `pchar` request.  Not programmatically, no--there's
> > no mechanism for returning a character's assigned flags in a
> > register, say--but this is a more easily obtained insight than has
> > been available heretofore.
> 
> Granted, but not being able to do it programmatically still puts
> needless burden on the document author.  (This point I think is
> academic because we now seem to agree that character classes are
> designed to OR in new flags without affecting existing ones.)

Yes, but I'd be happy to remove that property.

> > Do people really not know some of the "character flag" properties
> > they desire for a special character, but not others?  I think most
> > groff users don't mess with character flags at all.
> 
> (I presume one of those "not"s was a typo.)

Yes, thanks.

> I think the roff language should be designed, as far as is practical,
> with power users in mind as much as ordinary ones.

I agree.

> "Most users don't do this" should not imply "therefore we can make it
> needlessly difficult for those who do."

I _do_ want to make it difficult, or even impossible, for power users to
twist the formatter into an undefined state.

> > All of these save two entail breaking or sentence termination
> > properties.  I assert that these are mutually entailing because
> > breaking implies (potential) hyphenation, and hyphenation is never
> > applicable after the end of a sentence.
> 
> Breaking does not necessarily imply hyphenation.

...which is why I said "potential".

> Consider the standard example I've been using, the em dash.  In
> American English, it often appears between words without surrounding
> space—like this—and can traditionally have a break point after it
> (character flag 4) where a hyphen is never added.

Yes.

> However, a character, especially a mark of punctuation, can have
> different requirements in different contexts.  Using em dashes in a
> text as above does not preclude also using them in sentence-ending
> contexts (character flag 1), for example—  Well, I can't think of an
> example right now.  But you get the idea.

(That instance was so knowingly self-negating it's almost a koan.
Props!)

We can call them Emily Dickinson dashes.[2]

> So flags 1 and 4 defy your assertion of mutual entailiness.

I reject your conclusion, because I _did_ say "potentially".
Automatic hyphenation can always be shut off entirely.  I'm acutely
aware of this fact due to the large number of man(7) (and mdoc(7))
document readers who insist that this be the case, to the point of
seeking to apply their preference globally to everyone else.

> My point has nothing to do with a new release of groff, which is the
> only time a NEWS file comes into play.  The situation can arise even
> for a user who installs groff 1.24 the day of release and will never
> upgrade.  It has to do with maintenance of that user's own documents,
> not of adapting to a potentially changing groff.  I'm not sure there's
> much point to clarifying what I meant here, though, since there is no
> longer a proposal to change groff 1.24 from its predecessors in this
> regard.

No, but it may come back around once I have time to tackle Savannah
#67703.[1]

> > $ git grep -w cflags contrib tmac
> > contrib/mom/ChangeLog:  o Added .cflags 4 /\(en -- was driving me nuts that lines wouldn't
> > contrib/mom/NEWS:Added .cflags 4 /\(em to om.tmac.  By default, mom now obligingly
> > contrib/mom/om.tmac:.cflags 4 /\[en]      \" So slash and en-dashes get broken
> 
> This does highlight a typo in mom's NEWS file ("em" where it should
> say "en").

That's one of the files I don't touch--my understanding is that it's
behind Peter's fence.

> > So our churn rate for cflags changes for a given macro package is
> > 1-3 per somewhere between 10 years and never.[...]
> 
> We should avoid the hubris of assuming macro packages distributed with
> groff are the only macro packages in the world.

We should also avoid being unrealistic about the breadth of groff's
deployment for any purpose but rendering man pages.  I'm not happy about
that disproportion, but I'll celebrate it if it gives us more latitude
to make improvements to the formatter's language such that it will
become a more appealing vehicle for typesetting in general.  A small
(applicable) userbase gives a developer flexibility that they're wise to
exercise, as Stuart Feldman of Bell Labs learned after he'd lost it.[3]

I'm curious to experimentally revert (in my working copy) Werner's
optimization commit on `cflags` and see if, on modern systems, any
measurable delay remains when loading the `ja` or `zh` localization
macro files.  At some point when the groff 1.25 development cycle opens,
I'm eager to give that a try.  I'm also curious to see if the
optimization created any (presumably unintentional) behavior changes.

Your case against changing anything about `cflags`'s interaction with
`class` leans heavily on inferences about the intentions of people who
are, in 2 out of 3 cases, apparently deliberately NOT showing up to shed
light on the matter.  I was a little worried that something had happened
to Werner, but he's still active,[4] as is Colin Watson.[5]  The third
contributor of/to that feature, Daiki Ueno, hasn't been seen on this
list in over ten years.[6]

(I assume Werner and Colin haven't spoken up because they can't dredge
up any memories of decisions taken on these points a decade ago.  I
doubt I would do any better, which is one reason I try to explain myself
in commit messages.  As with code comments, statements of rationale are
for my own benefit as much as anyone else's.)

To summarize:

The rationale for the `class` request was to facilitate groff usage by
those composing documents in Chinese or Japanese.  To do so well means
applying character flags to the characters used by these languages'
scripts.  Such characters are numerous.  Here's a sample.

tmac/zh.tmac:
    .\" Chinese glyphs.
    .class [CJKnormal] \
      \[u4E00]-\[u9FFF]

    .cflags 512 \C'[CJKnormal]'

That's--let's see...[7]

$ echo 'ibase=16;(FFF+1)*(9-5+1)+2*(FF+1)' | bc
20992

(Derivation available on request.)

...almost 21,000 characters affected for Chinese.

Nobody wants a zh.tmac macro file that contains 21,000 requests.  Nobody
wants to pay for the disk space or, more importantly, the maintenance
cost.  (Yes, you probably only programmatically construct this once and
then forget about it forever, or at least double-check the ranges only
at Unicode revisions, but it still makes the _file_ an eyeball-burning
blast that savages any who dare to look at it.)

We don't need any fancy OR-ing of flags to satisfy the motivating case.

All we need is to interpret \[u4E00]-\[u9FFF] as a range of groff
special characters, walk the items in that range, and update the
properties of each character.

Done.
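
To make that walk concrete, here is a toy model in Python.  It is not
groff's implementation (which is C++ and organized differently); the
names `parse_range` and `apply_cflags` and the dictionary used as a flag
table are all invented for this sketch.  It assumes the simplest
semantics, a plain overwrite of each character's flags.

```python
import re

# Matches a range spelled like \[u4E00]-\[u9FFF] (hypothetical parser;
# groff's real input handling is more general than this).
_RANGE = re.compile(r"\\\[u([0-9A-F]{4,6})\]-\\\[u([0-9A-F]{4,6})\]")


def parse_range(spec):
    # Return the (first, last) code points named by a range string.
    m = _RANGE.fullmatch(spec)
    if m is None:
        raise ValueError(f"not a special-character range: {spec!r}")
    first, last = int(m.group(1), 16), int(m.group(2), 16)
    if first > last:
        raise ValueError("range endpoints are reversed")
    return first, last


def apply_cflags(table, flags, spec):
    # Walk every code point in the range and overwrite its flags.
    first, last = parse_range(spec)
    for cp in range(first, last + 1):
        table[cp] = flags


# Hypothetical flag table: code point -> character flags value.
table = {}
apply_cflags(table, 512, r"\[u4E00]-\[u9FFF]")
print(table[0x4E00], 0x4DFF in table)
```

In this model, the OR-ing behavior under discussion would be a one-line
change in the loop, `table[cp] = table.get(cp, 0) | flags`, which is
exactly the part the motivating case does not need.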

Anything further is, I submit, a case of you, uniquely as far as the
record shows, arguing in defense of Hyrum's Law.[8]

Maybe `class` was overdesigned, and all we needed in the first place was
a "rangey" version of the `cflags` request.

Regards,
Branden

[1] https://savannah.gnu.org/bugs/?67703
[2] https://www.emilydickinsonmuseum.org/emily-dickinson/poetry/tips-for-reading/major-characteristics-of-dickinsons-poetry/

[3] "After getting myself snarled up with my first stab at Lex, I just
    did something simple with the pattern newline-tab. It worked, it
    stayed.  And then a few weeks later I had a user population of about
    a dozen, most of them friends, and I didn't want to screw up my
    embedded base.  The rest, sadly, is history." -- Stuart Feldman

[4] https://lists.gnu.org/archive/html/lilypond-devel/2026-01/msg00006.html
[5] https://www.chiark.greenend.org.uk/~cjwatson/blog/
[6] https://lists.gnu.org/archive/html/groff/2015-04/msg00011.html

[7] I admit, I was tempted to perform this calculation in dc(1) instead.
    But (a) I'm not au fait enough with it to casually toss off an RPN
    version of that algebra--I'd need to check the man page to see how
    to change the input base, which may be the 'i' command but maybe
    not--and (b) this discussion is probably already esoteric enough
    without such a gratuitous display.

[8] https://www.hyrumslaw.com/
