At 2023-10-15T10:01:20-0700, Russ Allbery wrote:
> I think my position at this point as pod2man maintainer (not yet
> implemented in podlators) is that every occurrence of - in POD source
> will be translated into \-, rather than using the current heuristics,
> and people who meant to use ‐ should type it directly in the POD
> source.  pod2man now supports Unicode fairly well and will pass that
> along to *roff, which presumably will do the right thing with it after
> character set translation.

It will, as long as something (like preconv(1)) translates the UTF-8
into something GNU troff can understand.  One of the most painful
decisions James Clark made was to follow AT&T's example and use "char"
as the fundamental character type, instead of throwing his elbows with
an "int" (or better yet, an int-sized C++ type, since C++ had real type
checking in 1989, while K&R C was still in vogue and scoffed at such
gratuities).[1]  I took a stab at changing this about 3 years ago but
it was too big a bite.  I didn't know enough yet about how the formatter
worked.  If I have n months to set aside I suspect I can get it done on
a second attempt.

Anyway, to illustrate.  (UTF-8 follows.)

$ for n in $(seq 8); do printf 'abc\\[u2010]defgh '; done | nroff | cat -s
abc‐defgh  abc‐defgh abc‐defgh abc‐defgh abc‐defgh abc‐defgh abc‐
defgh abc‐defgh


> Currently, pod2man uses an extensive set of heuristics, but I think
> this is a lost cause.  I cannot think of any heuristic that will
> understand that the - in apt-get should be U+002D (so that one can
> search for the command as it is typed), but the - in apt-like should
> be apt­like, since this is an English hyphenated expression talking
> about programs that are similar to apt.  This is simply not
> information that POD has available to it unless the user writing the
> document uses Unicode hyphens.

Yes.  This is the same point I was trying to make with my mg(1) man
page example.

> I believe the primary formatting degredation will be for very long
> hyphenated phrases like super-long-adjectival-phrase-intended-as-a-
> joke, because *roff will now not break on those hyphens that have been
> turned into \-.  People will have to rewrite them using proper Unicode
> hyphens to get proper formatting.

Even that can be overcome.  You can tell groff that a line can be broken
after a minus sign.  But I'm going to stone-facedly require people to
RTFM for that.  The character remapping in the PROBLEMS file is the
prescribed band-aid for those who can't or don't care to fix bad
typography in man pages, and I'd prefer not to see additional cargo cult
techniques piled on top of it.

https://git.savannah.gnu.org/cgit/groff.git/tree/PROBLEMS?h=1.23.0#n82

Regards,
Branden

[1] Just like the omission of bounds checks on array types.  What a
    brilliant efficiency that was.  Jean Ichbiah saw Dennis Ritchie
    coming a mile away in the 1970s, and Ada 83 did the right thing--in
    countless respects.  Compiler authors squealed like pigs in hot oil
    at the idea of doing any amount of static analysis of input--this is
    back when compilers would not _automatically_ pass anything in
    registers at all (_everything_ hit the stack) and common
    subexpression elimination was regarded as a state-of-the-art
    optimization--and spent over a decade slandering Ada's name in every
    forum available to them.  Nowadays, static analysis is cool and
    compiler engineers make big, big bucks developing its techniques
    professionally.  And I'll bet you those who have even heard of Ada
    still turn their noses up at it.

    Stick around, and I'll share the secret legacy of the hated IA-64...

Attachment: signature.asc
Description: PGP signature

Reply via email to