Bug#1041731: Hyphens in man pages

2023-10-16 Thread Colin Watson
My plan, as indicated in
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1041731#62, had been
to leave things much as they are for most of the period while trixie is
in development, and then put the ".char - \-" etc. workarounds back in
place for nroff output for trixie's release; this would have meant a
higher chance of more manual page authoring tools being updated to
handle the groff input language more strictly (although this isn't
always easy, as Russ has indicated, since sometimes the input languages
of those tools are less rich than groff).

However, after wading through an enormous amount of inordinately verbose
stuff in my inbox about this, I'm afraid I've now lost patience with the
whole thing and am definitely not willing to put up with it for another
year or more, so I'm putting the workaround back in place now.  Sorry to
anyone who will end up dissatisfied by non-terminal printed output as a
result.

  https://salsa.debian.org/debian/groff/-/commit/d5394c68d7

It is still true that being strict about the use of the "\-", "\[aq]",
"\[ga]", "\[ha]", and "\[ti]" escape sequences (as opposed to "-", "'",
"`", "^", and "~" respectively) will produce better printed output.

-- 
Colin Watson (he/him)  [cjwat...@debian.org]



Bug#1041731: Hyphens in man pages

2023-10-15 Thread Trent W. Buck
On Sun 15 Oct 2023 17:33:07 +0200, Iustin Pop wrote:
> At least you're not lazy. I am, so what I did many times is add a
> build-depends on pandoc, and write the man page in rst or md. I think
> that's a worse solution (pandoc is really heavy), but at least, I don't
> have to go back to *roff.

FWIW, there are lighter alternatives than pandoc:

pandoc:                            After this operation, 174 MB of additional disk space will be used.
sphinx-doc (sphinx-build -b man):  After this operation, 140 MB of additional disk space will be used.
rst2man (python3-docutils):        After this operation, 37.6 MB of additional disk space will be used.
pod2man (perl):                    perl is already the newest version (5.36.0-9).

I'm not going to bother measuring docbook ;-)

If you are writing manpages by hand, this is an excellent overview:

https://manpages.debian.org/bookworm/manpages/man.7.en.html

See also:

https://www.oreilly.com/library/view/mastering-perl/9780596527242/ch15.html (POD)
https://www.docutils.org/docs/user/manpage.html#todo-open-issues




Bug#1041731: Hyphens in man pages

2023-10-15 Thread Russ Allbery
"G. Branden Robinson"  writes:

> How about this?

>  \- Minus sign.  \- produces the basic Latin hyphen‐minus
> specifying Unix command‐line options and frequently used in
> file names.  “-” is a hyphen in roff; some output devices
> replace it with U+2010 (hyphen) or similar.

Sorry for my original message, which was very poorly worded and probably
incredibly confusing.  Let me try to make less of a hash of it.  I think
what I'm proposing is something like:

\-   Basic Latin hyphen‐minus (U+002D) or ASCII hyphen.  This is the
     character used for Unix command‐line options and frequently in file
     names.  It is non-breaking; roff will not wrap lines at this
     character.  "-" (without the "\") is a true hyphen in roff, which is
     a different character; some output devices replace it with U+2010
     (hyphen) or similar.

What I was trying to get at but didn't express very well was to include
the specific Unicode code point and to avoid the term "minus sign" because
this character is not a minus sign in typography at all (although it is
used that way in code).  A minus sign is U+2212 and looks substantially
different because it is designed to match the appearance of the plus sign.
(For example, the line is often at a different height.)  I don't know if
*roff has a way of producing that character apart from providing it as
Unicode.

The above also explicitly says that it's non-breaking (I believe that's
the case, although please tell me if I got that wrong) and is more
(perhaps excessively) explicit about distinguishing it from "-" because of
all the confusion about this.

-- 
Russ Allbery (r...@debian.org)  



Bug#1041731: Hyphens in man pages

2023-10-15 Thread G. Branden Robinson
Hi Russ,

At 2023-10-15T12:06:14-0700, Russ Allbery wrote:
> Minor point, but since you posted it

No worries!

> "G. Branden Robinson"  writes:
> 
> > ...
> 
> >  \- Minus sign or basic Latin hyphen‐minus.  \- produces the
> > Unix command‐line option dash in the output.  “-” is a
> > hyphen in the roff language; some output devices replace it
> > with U+2010 (hyphen) or similar.
> 
> The official name of "the Unix command-line option dash" is the
> hyphen-minus character (U+002D).  Given how much confusion there is
> about this, and particularly given how ambiguous the word "dash" is in
> typography (the hyphen-minus is one of 25 dashes in Unicode), you may
> want to say that explicitly in addition to saying that it's the
> character used in UNIX command-line options (and, arguably as
> importantly, in UNIX command names).

How about this?

 \- Minus sign.  \- produces the basic Latin hyphen‐minus
specifying Unix command‐line options and frequently used in
file names.  “-” is a hyphen in roff; some output devices
replace it with U+2010 (hyphen) or similar.

Regards,
Branden




Bug#1041731: Hyphens in man pages

2023-10-15 Thread G. Branden Robinson
At 2023-10-15T10:01:20-0700, Russ Allbery wrote:
> I think my position at this point as pod2man maintainer (not yet
> implemented in podlators) is that every occurrence of - in POD source
> will be translated into \-, rather than using the current heuristics,
> and people who meant to use ‐ should type it directly in the POD
> source.  pod2man now supports Unicode fairly well and will pass that
> along to *roff, which presumably will do the right thing with it after
> character set translation.

It will, as long as something (like preconv(1)) translates the UTF-8
into something GNU troff can understand.  One of the most painful
decisions James Clark made was to follow AT&T's example and use "char"
as the fundamental character type, instead of throwing his elbows with
an "int" (or better yet, an int-sized C++ type, since C++ had real type
checking in 1989, while K&R C was still in vogue and scoffed at such
gratuities).[1]  I took a stab at changing this about 3 years ago but
it was too big a bite.  I didn't know enough yet about how the formatter
worked.  If I have n months to set aside I suspect I can get it done on
a second attempt.

Anyway, to illustrate.  (UTF-8 follows.)

$ for n in $(seq 8); do printf 'abc\\[u2010]defgh '; done | nroff | cat -s
abc‐defgh  abc‐defgh abc‐defgh abc‐defgh abc‐defgh abc‐defgh abc‐
defgh abc‐defgh


> Currently, pod2man uses an extensive set of heuristics, but I think
> this is a lost cause.  I cannot think of any heuristic that will
> understand that the - in apt-get should be U+002D (so that one can
> search for the command as it is typed), but the - in apt-like should
> be apt‐like, since this is an English hyphenated expression talking
> about programs that are similar to apt.  This is simply not
> information that POD has available to it unless the user writing the
> document uses Unicode hyphens.

Yes.  This is the same point I was trying to make with my mg(1) man
page example.

> I believe the primary formatting degradation will be for very long
> hyphenated phrases like super-long-adjectival-phrase-intended-as-a-
> joke, because *roff will now not break on those hyphens that have been
> turned into \-.  People will have to rewrite them using proper Unicode
> hyphens to get proper formatting.

Even that can be overcome.  You can tell groff that a line can be broken
after a minus sign.  But I'm going to stone-facedly require people to
RTFM for that.  The character remapping in the PROBLEMS file is the
prescribed band-aid for those who can't or don't care to fix bad
typography in man pages, and I'd prefer not to see additional cargo cult
techniques piled on top of it.

https://git.savannah.gnu.org/cgit/groff.git/tree/PROBLEMS?h=1.23.0#n82

Regards,
Branden

[1] Just like the omission of bounds checks on array types.  What a
brilliant efficiency that was.  Jean Ichbiah saw Dennis Ritchie
coming a mile away in the 1970s, and Ada 83 did the right thing--in
countless respects.  Compiler authors squealed like pigs in hot oil
at the idea of doing any amount of static analysis of input--this is
back when compilers would not _automatically_ pass anything in
registers at all (_everything_ hit the stack) and common
subexpression elimination was regarded as a state-of-the-art
optimization--and spent over a decade slandering Ada's name in every
forum available to them.  Nowadays, static analysis is cool and
compiler engineers make big, big bucks developing its techniques
professionally.  And I'll bet you those who have even heard of Ada
still turn their noses up at it.

Stick around, and I'll share the secret legacy of the hated IA-64...




Bug#1041731: Hyphens in man pages

2023-10-15 Thread Russ Allbery
Minor point, but since you posted it

"G. Branden Robinson"  writes:

> ...

>  \- Minus sign or basic Latin hyphen‐minus.  \- produces the
> Unix command‐line option dash in the output.  “-” is a
> hyphen in the roff language; some output devices replace it
> with U+2010 (hyphen) or similar.

The official name of "the Unix command-line option dash" is the
hyphen-minus character (U+002D).  Given how much confusion there is about
this, and particularly given how ambiguous the word "dash" is in
typography (the hyphen-minus is one of 25 dashes in Unicode), you may want
to say that explicitly in addition to saying that it's the character used
in UNIX command-line options (and, arguably as importantly, in UNIX
command names).

-- 
Russ Allbery (r...@debian.org)  



Bug#1041731: Hyphens in man pages

2023-10-15 Thread G. Branden Robinson
Hi Wookey,

At 2023-10-15T16:08:32+0100, Wookey wrote:
> OK. So I read all that, and learned a whole load of stuff I was quite
> happy not knowing about.
> 
> However despite reading it all, and especially this bit:
> > Whenever I've maintained man pages in roff I tend to be precise in
> > the usage of - and \-, but TBH this has seemed like a lost battle,
> 
> I was left not actually knowing what - and \- represent, nor which one I
> _should_ be using in my man pages. And that seems to be the one thing
> we should be telling the 'average maintainer'.
> 
> I think you can consider me representative of the typical maintainer
> whose interaction with *roff languages almost entirely takes the
> form: 'Oh bloody hell I really ought to write a man page for this
> because upstream is too youthful to have done so - now how the hell
> does roff/nroff/groff work again' (no I'm not sure which it is I'm
> actually using, nor how any of this machinery really works, nor where
> to look for good practice, so I mostly copy existing stuff and DDG for
> answers, which is less than ideal when it comes to details like this).
> 
> So this message is mostly a reminder that most people have not been
> following along at all, so just referring people to bugs like this,
> which discuss the issue in some detail, is not sufficient for
> maintainers to stop doing unhelpful things.
> 
> (Yes I realise I could look it up, but I get the impression that
> everyone involved in this discussion assumes people know what '-' and
> '\-' are so if they are just told to 'use the right one' will do so,
> and I thought it worth pointing out that that's not correct). Info for
> your average maintainer needs to go one step back and say "use stringA
> in this circumstance and stringB in this circumstance. <what they represent>. The reason why it matters is: stuff about hyphen
> and minus being different and minus being used in commands and
> cut+pasting being important"

Yes, I appreciate your popping of the context stack.

Andreas and Russ provided good, quick answers.  One can reasonably
wonder where to find the same answer in groff's documentation.

Subsection "Fundamental character set" of the groff_char(7) man page
covers the matter, but like the bug report we've Cced, it goes into
great detail.

Subsection "Portability" of groff_man_style(7) (or groff_man(7) in groff
1.22.4) covers the subject in a more practical, how-to manner.

[UTF-8 follows.]

groff_man_style(7):
 ... Some escape sequences
 are however required for correct typesetting even in man pages and
 usually do not cause portability problems.

 Several of these render glyphs corresponding to punctuation code
 points in the Unicode basic Latin range (U+0000–U+007F) that are
 handled specially in roff input; the escape sequences below must be
 used to render them correctly and portably when documenting
 material that uses them syntactically—namely, any of the set ' - \
 ^ ` ~ (apostrophe, dash or minus, backslash, caret, grave accent,
 tilde).

...

 \- Minus sign or basic Latin hyphen‐minus.  \- produces the
Unix command‐line option dash in the output.  “-” is a
hyphen in the roff language; some output devices replace it
with U+2010 (hyphen) or similar.

 \(aq   Basic Latin neutral apostrophe.  Some output devices format
“'” as a right single quotation mark.

...

 \(ga   Basic Latin grave accent.  Some output devices format “`” as
a left single quotation mark.

 \(ha   Basic Latin circumflex accent (“hat”).  Some output devices
format “^” as U+02C6 (modifier letter circumflex accent) or
similar.

 \(rs   Reverse solidus (backslash).  The backslash is the default
escape character in the roff language, so it does not
represent itself in output.  Also see \e above.

 \(ti   Basic Latin tilde.  Some output devices format “~” as U+02DC
(small tilde) or similar.
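
The table above can be read as a character-for-character mapping. A minimal Python sketch of that mapping follows, for text that is meant to be typed or pasted verbatim (option names, commands, code). The helper name roff_literal is hypothetical and not part of groff or any packaging tool; it should not be applied to running prose, where "-" is meant to remain a hyphen.

```python
# Escapes taken from the groff_man_style(7) excerpt quoted above.
ROFF_ESCAPES = {
    "\\": r"\(rs",  # reverse solidus (backslash)
    "-": r"\-",     # hyphen-minus, as used in options and command names
    "'": r"\(aq",   # neutral apostrophe
    "`": r"\(ga",   # grave accent
    "^": r"\(ha",   # circumflex accent ("hat")
    "~": r"\(ti",   # tilde
}


def roff_literal(text: str) -> str:
    """Escape literal (copy-and-paste) text for man page source.

    Hypothetical helper: each input character is mapped exactly once,
    so backslashes introduced by the escapes are never re-escaped.
    """
    return "".join(ROFF_ESCAPES.get(ch, ch) for ch in text)


print(roff_literal("dpkg-source --help"))
```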

> Hope that's helpful.

I hope this message goes some way toward relieving your frustration.

Regards,
Branden




Bug#1041731: Hyphens in man pages

2023-10-15 Thread Russ Allbery
Wookey  writes:

> I was left not actually know what - and \- represent, nor which one I
> _should_ be using in my man pages. And that seems to be the one thing we
> should be telling the 'average maintainer'.

- turns into a real hyphen (‐, U+2010).  \- turns into the ASCII
hyphen-minus that we use for options, programming, and so forth (U+002D).

I think my position at this point as pod2man maintainer (not yet
implemented in podlators) is that every occurrence of - in POD source will
be translated into \-, rather than using the current heuristics, and
people who meant to use ‐ should type it directly in the POD source.
pod2man now supports Unicode fairly well and will pass that along to
*roff, which presumably will do the right thing with it after character
set translation.

Currently, pod2man uses an extensive set of heuristics, but I think this
is a lost cause.  I cannot think of any heuristic that will understand
that the - in apt-get should be U+002D (so that one can search for the
command as it is typed), but the - in apt-like should be apt‐like, since
this is an English hyphenated expression talking about programs that are
similar to apt.  This is simply not information that POD has available to
it unless the user writing the document uses Unicode hyphens.

I believe the primary formatting degradation will be for very long
hyphenated phrases like super-long-adjectival-phrase-intended-as-a-joke,
because *roff will now not break on those hyphens that have been turned
into \-.  People will have to rewrite them using proper Unicode hyphens to
get proper formatting.
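
The proposed rule is simple enough to sketch in a few lines. The Python below is only an illustration of the rule as described here, not the podlators implementation (which is written in Perl and, per the message above, does not do this yet); the function name is hypothetical.

```python
def pod_hyphens_to_roff(text: str) -> str:
    """Illustrative sketch of the proposed rule, not pod2man's actual code.

    Every ASCII hyphen-minus (U+002D) becomes the roff escape \\-.
    A real hyphen (U+2010) typed by the author is passed through
    untouched, to be handled later by character set translation.
    """
    return text.replace("-", r"\-")


print(pod_hyphens_to_roff("apt-get is an apt\u2010like tool"))
```

Under this rule the author, not a heuristic, decides which character is meant: "apt-get" stays searchable and pasteable, and "apt‐like" keeps a true hyphen only if it was typed as one.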

-- 
Russ Allbery (r...@debian.org)  



Bug#1041731: Hyphens in man pages

2023-10-15 Thread Andreas Metzler
On 2023-10-15 Wookey  wrote:
[...]
> OK. So I read all that, and learned a whole load of stuff I was quite
> happy not knowing about.

> However despite reading it all, and especially this bit:
> "Whenever I've maintained man pages in roff I tend to be precise in
> > the usage of - and \-, but TBH this has seemed like a lost battle,"

> I was left not actually knowing what - and \- represent, nor which one I
> _should_ be using in my man pages. And that seems to be the one thing
> we should be telling the 'average maintainer'.
[...]

Hello,

a pretty good guidance[1] is to

use "\-" whenever it refers to an option ("-h", "--help"), an argument ("find
-mmin -2") or something else that is not natural language but something
that might be pasted, like a command-name ("ssh-add" or "dpkg-source")

and "-" everywhere else.

cu Andreas

[1] Well it is "guidance": pasting will work, but there might still be
places in the prose where a dash would be typographically correct. Some
of these typographical conventions are language-specific. All this is
familiar to LaTeX users.
-- 
`What a good friend you are to him, Dr. Maturin. His other friends are
so grateful to you.'
`I sew his ears on from time to time, sure'



Bug#1041731: Hyphens in man pages

2023-10-15 Thread Iustin Pop
On 2023-10-15 16:08:32, Wookey wrote:
> I think you can consider me representative of the typical maintainer
> whose interaction with *roff languages almost entirely takes the
> form: 'Oh bloody hell I really ought to write a man page for this
> because upstream is too youthful to have done so - now how the hell
> does roff/nroff/groff work again' (no I'm not sure which it is I'm
> actually using, nor how any of this machinery really works, nor where
> to look for good practice, so I mostly copy existing stuff and DDG for
> answers, which is less than ideal when it comes to details like this).

At least you're not lazy. I am, so what I did many times is add a
build-depends on pandoc, and write the man page in rst or md. I think
that's a worse solution (pandoc is really heavy), but at least, I don't
have to go back to *roff.

regards,
iustin



Bug#1041731: Hyphens in man pages

2023-10-15 Thread Wookey
On 2023-10-15 01:30 -0500, G. Branden Robinson wrote:
> At 2023-10-14T20:51:27-0600, Antonio Russo wrote:
> 
> Quick background: in the context of Unix usage as documented by
> nroff/troff, the dash used at the shell prompt, in text editors, and in
> programming language source code is a "minus sign".  troff has an em
> dash special character as well since the mid-1970s; groff adds an en
> dash as well, and furthermore supports user definition of characters
> providing access to any other sort of dash that comes down the Unicode
> pike.  (Not that doing so is a good idea in a man page; see below
> regarding a "restricted dialect" of man(7).)
> 
> > Now, depending on your email client and settings, the above will
> > appear to be the ravings of an unhinged lunatic who wrote the same
> > thing twice, or an unhinged lunatic who slammed their fists onto the
> > keyboard.
> 
> This issue does indeed have a history of provoking unhinged lunacy.
> 
> Before we proceed, you might wish to be aware of
>  and its
> proposed remedy.

OK. So I read all that, and learned a whole load of stuff I was quite
happy not knowing about.

However despite reading it all, and especially this bit:
"Whenever I've maintained man pages in roff I tend to be precise in
> the usage of - and \-, but TBH this has seemed like a lost battle,"

I was left not actually knowing what - and \- represent, nor which one I
_should_ be using in my man pages. And that seems to be the one thing
we should be telling the 'average maintainer'.

I think you can consider me representative of the typical maintainer
whose interaction with *roff languages almost entirely takes the
form: 'Oh bloody hell I really ought to write a man page for this
because upstream is too youthful to have done so - now how the hell
does roff/nroff/groff work again' (no I'm not sure which it is I'm
actually using, nor how any of this machinery really works, nor where
to look for good practice, so I mostly copy existing stuff and DDG for
answers, which is less than ideal when it comes to details like this).

So this message is mostly a reminder that most people have not been
following along at all, so just referring people to bugs like this,
which discuss the issue in some detail, is not sufficient for
maintainers to stop doing unhelpful things.

(Yes I realise I could look it up, but I get the impression that
everyone involved in this discussion assumes people know what '-' and
'\-' are so if they are just told to 'use the right one' will do so,
and I thought it worth pointing out that that's not correct). Info for
your average maintainer needs to go one step back and say "use stringA
in this circumstance and stringB in this circumstance. <what they represent>. The reason why it matters is: stuff about hyphen
and minus being different and minus being used in commands and
cut+pasting being important"

Hope that's helpful.

Wookey
-- 
Principal hats:  Debian, Wookware, ARM
http://wookware.org/




Bug#1041731: Hyphens in man pages

2023-10-15 Thread G. Branden Robinson
At 2023-10-14T20:51:27-0600, Antonio Russo wrote:
> I discovered a new pet peeve today: if you search for a command in a
> manual page, say -e in man 1 zgrep, it's a crapshoot whether just
> searching for '-e' will find the command or not.  The reason is that
> "-" may have been accidentally encoded as ‐ instead of -.

You can blame me for this.

https://git.savannah.gnu.org/cgit/groff.git/tree/NEWS?h=1.23.0#n206

...me, and man page authors who don't think about whether they intend
a hyphen or a minus sign when they strike the '-' key...

Quick background: in the context of Unix usage as documented by
nroff/troff, the dash used at the shell prompt, in text editors, and in
programming language source code is a "minus sign".  troff has an em
dash special character as well since the mid-1970s; groff adds an en
dash as well, and furthermore supports user definition of characters
providing access to any other sort of dash that comes down the Unicode
pike.  (Not that doing so is a good idea in a man page; see below
regarding a "restricted dialect" of man(7).)

> Now, depending on your email client and settings, the above will
> appear to be the ravings of an unhinged lunatic who wrote the same
> thing twice, or an unhinged lunatic who slammed their fists onto the
> keyboard.

This issue does indeed have a history of provoking unhinged lunacy.

Before we proceed, you might wish to be aware of
 and its
proposed remedy.

> The reason is that man(1) converts bare dashes (0x2D) to hyphens
> (U+2010).  These are not the same symbol: searching for one does not
> find the other without some kind of normalization, pasting commands
> with one vs. the other does different things.  New users who do not
> understand this will be discouraged trying to read manual pages.
> Chances are, they will fill forums with mundane questions that could
> and should have been addressed by a simple search of a manual page.

I run into this problem, too, since I dogfood my own changes.  When
irritated by this, I try the search again, replacing '-' with '.', which
has yet to fail me (and produces false positives surprisingly rarely).
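
The same search trick can be sketched outside the pager. The Python below is a rough illustration, assuming the rendered page is available as a string; dash_tolerant is a hypothetical name. The loose form mirrors the '.' substitution described above, and the strict form instead accepts only a small set of dash characters (including U+2212 in that set is an assumption about what a renderer might emit).

```python
import re

# Dash-like characters accepted by the strict form (an assumed set):
# hyphen-minus U+002D, hyphen U+2010, minus sign U+2212.
DASHES = "-\u2010\u2212"


def dash_tolerant(query: str, strict: bool = False) -> re.Pattern:
    """Build a regex in which each '-' of the query matches loosely.

    Loose form: '-' becomes '.', as in the pager trick.
    Strict form: '-' becomes a character class of known dashes.
    """
    wildcard = f"[{DASHES}]" if strict else "."
    return re.compile(wildcard.join(re.escape(p) for p in query.split("-")))


rendered = "  \u2010e PATTERN   use PATTERN for matching"
print(bool(dash_tolerant("-e").search(rendered)))
print(bool(dash_tolerant("-e", strict=True).search(rendered)))
```

The loose form can produce false positives (".e" also matches "se" in "see"); the strict form trades that for a fixed list of characters that may miss a dash the renderer chose.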

For example, I've recently been playing with the mg(1) editor, and
observed extremely poor discipline in this area.  So I forked it on
GitHub and have been preparing a bunch of revisions.  I wrote a sed
script to fix its numerous hyphen/dash problems.[1]

> I recently fixed a ton of these in another upstream package with this
> vim "one-liner":
> 
> :%s/--\([a-z]\+\)\(-[a-z]\+\)*/\=substitute(submatch(0), '-', '\\-', 'g')/g

My Vimscript is not very sophisticated, but it looks like you're
replacing only hyphens that appear in long option names here.  That's
good, as you're unlikely to clobber any hyphens that should _not_ become
minus signs.
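
For readers who do not use Vim, the substitution can be approximated in Python. This is a sketch of the same idea, not a tested replacement for the one-liner: it matches "--" followed by lowercase words joined by single hyphens, and escapes only the hyphens inside each match. The function name is hypothetical.

```python
import re

# "--" followed by lowercase words joined by single hyphens,
# e.g. --help or --dry-run.  Short options like -e are not matched.
LONG_OPTION = re.compile(r"--[a-z]+(?:-[a-z]+)*")


def escape_long_options(text: str) -> str:
    """Replace '-' with '\\-' inside long option names only."""
    return LONG_OPTION.sub(lambda m: m.group(0).replace("-", r"\-"), text)


print(escape_long_options("use --dry-run to study long-term effects"))
```

As with the original, the output still needs manual review: hyphens in prose such as "long-term" are left alone, and so are short options.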

Such discernment is important.  Many people who want to "solve" this
issue forget (or ignore) that not every '-' is a minus sign.  Some are
actual hyphens, as in "long-term effects" and "word-aligned struct
members".  Trying to infer a distinction from white space adjacency also
won't work.  Consider the phrases "word- or byte-sized caching" and
"object-based vs. -oriented programming".  While sophistication with
compound hyphenated affixes is seldom seen in man pages, we most often
find it where a man page author has taken considerable care with their
technical writing.  Such pages are less likely than most to require
revision with blunt instruments like regular expression-based global
search and replace operations.
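
The white space point can be made concrete with a deliberately naive sketch. The rule below (a "-" that follows white space and precedes a letter is an option dash) is a strawman written for this illustration, not something any tool in this thread is known to implement; it handles "ls -l" but also rewrites the suspended hyphen in one of the phrases above.

```python
import re

# Strawman rule: a '-' after white space and before a letter is
# assumed to introduce a command-line option.
NAIVE_OPTION = re.compile(r"(?<=\s)-(?=[A-Za-z])")


def naive_escape(text: str) -> str:
    """Apply the strawman rule; shown only to expose its failure mode."""
    return NAIVE_OPTION.sub(lambda m: r"\-", text)


print(naive_escape("run ls -l"))
print(naive_escape("object-based vs. -oriented programming"))
```

The second call turns "-oriented" into "\-oriented", which is exactly the kind of misclassification that adjacency-based inference cannot avoid.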

> However, this requires manual review

Surprisingly often, the composition of high-quality technical
documentation requires the engagement of a human brain.

> and does not fix the '-e' example from zgrep.

Mapping all hyphens and minus signs to a single character, as people
whose blood pressure spikes over this issue tend to promote as a first
resort, is an ineluctably information-discarding operation.  In my
opinion, man page source documents are not the correct place to discard
that information.

(I acknowledge that you didn't propose such a crude remedy; I write to
anticipate the inevitable follow-ups from people who will.)

Doing so at rendering time is much more defensible, and happens anyway
on devices that do not distinguish these characters in the first place.

> There are also a whole host of this kind of problem, e.g., dashes in
> URLs that get naively pasted into man pages (another live example I
> just addressed).

Yes, people commonly type URLs and email addresses into man page sources
as they would into an MUA or browser navigation bar.  Since U+2010 is
difficult to encode in such things, the man(7) package could help by
performing an automatic character translation in this area.  However,
(1) no one's actually asked for this and (2) it would address only a
tiny part of the problem.  The means of "help" I have in mind is
employment of the groff man(7) extension macros `UR`/`UE` and