Hi Morten,

I'll need a little more time to get back to your previous reply, but
this one looks easier.  :)

At 2026-02-22T11:00:56+0100, Morten Bo Johansen wrote:
> On 2026-02-21 Morten Bo Johansen wrote:
> > Yes, that is a possibility. However, if I use word boundaries, it
> > would match only macros that are written without the leading dot and
> > then we're back to square one: In groff_mm I would get the matches I
> > want, in groff_ms not. Remember I call the manual page with the
> > search expression from a function, so it must work with a syntax I
> > have decided. Having to type in the expression afterwards in the
> > pager at least partially defeats the purpose of convenience.
> 
> I should add that pagers use different regexp engines, in fact less(1)
> alone uses different regexp libraries.

Yes.  My understanding is that the plethora of regex languages we face
in Unix-like environments arises originally from a combination of two
factors, one theoretical and one practical.

1.  Regex engine implementation was still an area of active research in
    the 1970s.  What we now call "basic" regexes (BREs) could be
    implemented quickly and without excessive memory consumption.  They
    weren't as general as "extended" regular expressions (EREs) but they
    were still pretty powerful.

    https://en.wikipedia.org/wiki/Thompson%27s_construction

2.  Thompson put his aforementioned "construction" to work in the ed(1)
    text editor, a flagship and extremely heavily used application for
    as long as typewriters were in wide use as computer terminals.  When
    good and fast ERE implementations came along, it seems (a surmise of
    mine--I've never seen even a whisper of this on the record) he
    refused to update ed(1) to use them.  This in turn could have been
    for at least two reasons, both understandable up to a point.

    First, at one time the ed editor had many users and just as in the
    later vi and emacs, regex-based search-and-replace operations were a
    killer feature that had inculcated corresponding muscle memory that
    Thompson (and anyone tasked with "maintaining" ed) might have been
    loath to break.  Worse (but also a good feature), ed can be employed
    as a script processor for automated editing.  Nowadays sed(1) has
    largely displaced it in that role, but it's still there and still
    works for this purpose.  Without any features for "API level" or
    syntax dialect negotiation, changing the regex syntax would mean
    breaking compatibility for more than just human muscle memory.

    Second, since Thompson had published his innovation in the
    literature, one can understand some personal reluctance to
    subsequently evict it from its best-known practical deployment.
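The dialect split is easy to demonstrate with POSIX grep, where "-E"
selects EREs (a small illustration of my own, not tied to any tool
discussed above):

```shell
# BRE: the interval operator must be written with backslashes.
printf 'aa\nab\n' | grep 'a\{2\}'       # prints "aa"

# ERE: the same operator is unescaped, and alternation is available.
printf 'aa\nab\n' | grep -E 'a{2}'      # prints "aa"
printf 'aa\nab\n' | grep -cE 'aa|ab'    # prints "2" (both lines match)
```

The same pattern text thus means different things depending on which
dialect the consuming tool speaks, which is exactly the muscle-memory
hazard described above.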

Unfortunately the persistence of these two similar but distinct regex
syntaxes has added some drag to the momentum and learnability of Unix
systems.  My view is that we should be using EREs all the time
everywhere.  Let old tools like ed(1) and sed(1) grow command-line
options and/or environment variable recognition to support this, and let
them furthermore grow a command to assert an interface identity.  (Once
you've introduced a command to the editors' language for this, it can
accept and verify whatever arguments it needs to, aborting
interpretation if it sees anything it doesn't expect, and falling back
to "legacy" behavior if the new command, command-line option, and
environment variable are all absent.)

The most hazardous APIs are those we fail to recognize as such as soon
as they're born.  Regex dialects are an example.

All of that said, Perl probably would have done its thing--once you have
backtracking in your regexes, you have fallen from the RE pure faith
into pushdown automaton damnation, so why not go hog wild with things
like negative lookahead assertions?--even if Bell Labs had cut the
Gordian Knot and settled on one regex dialect prior to the release of
Seventh Edition Unix in 1979.  Thus in retrospect, the counterfactual
battle over that wouldn't have been worth fighting anyway.

> Yours, Branden, uses posix, mine uses pcre2 and some incarnations of
> less(1) are built with no regexp capabilities at all.  And then there
> is more(1) and most(1) ...
> 
> Apropos lack of consistency!

Yeah, I don't have a solution for this, but since pagers aren't
scriptable (I pray), and "less" at least permits reconfiguration of its
key bindings, this seems a more manageable issue than groff
documentation, which gets presented to all readers pretty much the same.

> Therefore having the leading dot in the macro descriptions in all the
> manual pages would do away with the need for regular expressions.

...but prompts the question of why registers aren't documented

\n[like-this]

and strings

\*[like-this].

groff_ms(7) actually does that and I'm not happy with it.

For an alternative approach, see Deri's recent contribution to the
gropdf(1) man page in the table in its "Parameters" subsection.  (You
need a recent release candidate for this--1.24.0.rc{3,4} I think.)

> Now, of course this should not be done just to cater to me and my
> little editor extension which I am probably the only person on the
> planet using, anyway, but just for the sake of consistency.

A man(7) improvement I'd like to land sometime after groff 1.24 is
"automatic tagging" of tagged paragraph labels.

What does that mean?  It means that every man(7) document that uses a
tagged paragraph...

.TH foo 1 2026-02-22 "groff test suite"
.\" ...
.TP
bazqux
Such as this.

...would automatically generate a hyperlink anchor or PDF bookmark with
a name computed from the man page's name, section, and the text of the
paragraph tag.  So for the foregoing, the anchor/bookmark ID would be
something like "foo/1/bazqux".  (The precise syntax for separating
components of the constructed ID needs careful consideration, which is
one reason I haven't done this work yet.  Man page identifiers and
section "numbers" are _usually_ pretty well behaved,[1] but paragraph
tags could be just about anything, including arbitrary punctuation.)
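For what it's worth, the construction might be sketched like this (a
hypothetical illustration in Python, not a design commitment;
"make_anchor_id" and the percent-escaping choice are my inventions for
the example, and the escaping scheme is precisely the open question
noted above):

```python
from urllib.parse import quote

def make_anchor_id(title: str, section: str, tag: str) -> str:
    """Build a hypothetical anchor ID like "foo/1/bazqux".

    Paragraph tags can contain arbitrary punctuation, so
    percent-escape everything outside a conservative safe set,
    including the "/" separator itself, to keep components
    unambiguous.
    """
    def esc(component: str) -> str:
        return quote(component, safe="")  # escape even "/" in components
    return "/".join(esc(c) for c in (title, section, tag))

print(make_anchor_id("foo", "1", "bazqux"))     # foo/1/bazqux
print(make_anchor_id("foo", "1", "-x/--long"))  # foo/1/-x%2F--long
```

A tag containing a slash survives round-trip intact because the only
unescaped slashes are the two separators.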

The upshot of this improvement would be that, with unique anchors for
every tagged paragraph in a (collection of) man page document(s)
available, one can unambiguously navigate to them.  Instead of searching
by a regex match on text--which, while it's not applicable to your
scenario, can be defeated by a frustratingly placed line break--you
could search based on "tag IDs".  The tool/pager/browser you're using
could even interactively present you with a menu of multiple matches if
your search query produced multiple results.

But for the simplest (and, I predict, most common) case of searching a
single man page for a definition of one macro, C function, command-line
option, environment variable, etc., there would almost always be only
one match, and a tag-based search would take you straight to it.

To recall my earlier example, instead of doing a regex search in less(1)
of groff_mm(7)'s text for "\<LB\>" and navigating to the second match, I
could do a tag search for "LB" and I'd be taken to the correct place on
the first attempt.  This would prove to be a superior solution to
sticking "." in front of "LB" in the event we ever had examples of "LB"
macro usage in that same man page.  That problem is an active hazard in
groff_man_style(7) today.

Regards,
Branden

[1] I seem to remember a conversation involving Colin Watson (man-db
    maintainer) and Ingo Schwarze (mandoc(1) maintainer) that discussed
    deprecation of a "DWIM" form of the man(1) command wherein, if a
    slash occurred in an argument (and no "-l" option was specified),
    then the program worked as if "-l" were present anyway, and bypassed
    a man page database lookup in favor of treating the argument as a
    file specification ("path name").  I _think_ Ingo found an example
    of some obscure man page using a slash in its identifier.  It was,
    structurally, something like this:

    .TH foo/bar 1 2026-02-22 "groff test suite"

    ...or, equivalently:

    .Dd 2026-02-22
    .Dt foo/bar 1
    .Os "groff test suite"

    ...which broke the "DWIM" feature, and which is now slated for
    removal if not already killed off.

    This sort of unwelcome surprise is why automatic construction of
    hyperlink tag IDs for (typographically) tagged paragraphs demands
    careful consideration.  I dislike URL-style translation of
    "forbidden" characters into hexadecimal "escapes" like "%7E", but a
    conversion along those lines might prove...inescapable, as it were.
    There's no Unicode Basic Latin character that can't validly appear
    in a paragraph tag.
