Follow-up Comment #29, bug #66919 (group groff):

[comment #27:]
> [comment #26:]
>> Here's another clash of terminology we have.  I assert that:
>> 
>> .hcode \[~o] õ
>> 
>> ...*is not a reflexive hyphenation code assignment*.
> 
> This, I believe, is the clash at the root of this ticket.
> 
>> It can't be, because the meaning of "õ" depends on the
>> character encoding in a way that "\[~o]" does not.
> 
> Clearly, "can't" overstates this, because it _was_ regarded as one in past
> groffs.

...unless you were on an EBCDIC system, or loaded "latin2.tmac".  I know the
latter group of people exists because they periodically write to the _groff_
list and other forums looking for better font coverage and support for Czech,
Polish, and so on. 

> Additionally, in my example input, at the time that .hcode is run, the
> character encoding is known,

Not to the formatter.  The character set localization files are simply piles
of `trin` and `hcode` requests.  Nothing tells the formatter "hey, you're
using Latin-9".

Yes, a user-defined register or string could be created to track this
information, but as a rule, the formatter, by which I mean the GNU _troff_
executable, does not alter its behavior based on the values of registers or
strings that the user (a term I define here to include startup files) defines.
 
> so it _could_ still be interpreted as reflexive.

I think that's a big stretch in this scenario.  We seem to be concerned about
different minority groups.

1.  You're concerned about regressing (or altering) the hyphenation behavior
of Latin-1/English documents that are sensitive to automatic hyphenation of
words that use letters _not occurring in the English language_.  I admit I'm
pretty skeptical of this usage, and wonder why the authors of documents that
sensitive haven't been using `hw` all along.

2.  I'm concerned about unjustifiably extending assumptions that are valid
(only) in Latin-1 and English to other character encodings and other
languages.  We know these people exist because they've gone to the trouble of
submitting macro files preparing GNU _troff_ to correctly handle these
encodings and languages.

> These facts suggest two possible ways forward.
> * Use character-encoding knowledge, if available at the time .hcode is run,
> to interpret any 8-bit arguments; or

I don't think this information is available to the formatter, per the above.

> * Decide this is a back-compatibility-breaking change.  This should then
> require two other things, and suggest an optional third:
> ** A justification for what larger purpose is being served by the behavior
> change;

See item 2 above.

> ** Clear notice in NEWS that the behavior has changed;

I invite you to draft this notice, scoped carefully to the impact of the
change.

> ** Optionally, a warning when the user invokes an .hcode that previously did
> one thing and now does another.

Again, the formatter doesn't have this information.  All it sees is, say,
character code 245.  It doesn't know what that _means_.


.hcode \[~o] õ \" Latin-1
.hcode \[~o] 5 \" EBCDIC
.hcode \[~o] ő \" Latin-2


The "source" arguments are all encoded identically on disk and in memory.
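To see the same point outside of roff, here's a minimal sketch in Python
(using Python's own codec names; "cp037" is one of several EBCDIC code pages,
chosen here purely for illustration).  One byte, three meanings:

```python
# The byte 0xF5 (decimal 245) is identical on disk and in memory,
# but what character it denotes depends entirely on the encoding
# used to interpret it.
byte = b"\xf5"

print(byte.decode("latin-1"))    # õ  (LATIN SMALL LETTER O WITH TILDE)
print(byte.decode("iso8859-2"))  # ő  (LATIN SMALL LETTER O WITH DOUBLE ACUTE)
print(byte.decode("cp037"))      # 5  (an EBCDIC code page: the digit five)
```

Nothing in the byte itself tells the decoder which of these three readings is
intended, which is exactly the position the formatter is in.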
 
> I get the resistance to warn about something that's not actually incorrect.
> (Though, arguably, that's exactly what a "warning" is.)  But in this
> instance, the effect of an .hcode behavior change will typically only be
> visible in a document at some remove from its cause.

Fair point.

> Users who might otherwise spend significant and frustrating time tracking
> down the source of a change could instead be given a helpful arrow pointing
> directly to it.

Yes.  I don't disagree with documenting the change in NEWS, but I'd like your
help in constructing it.  Collaborating on this language will also give me the
opportunity to correct lingering misunderstandings on both our parts.

> For hyphenation in particular, even the smallest changes (in wording, in
> paper or margin size, in any number of other things) can affect where words
> are broken throughout a document.  So it's just as plausible that after a
> groff upgrade, a user will see no change at all to their document, even
> though its interpretation of some .hcode calls does something different,
> simply because of where word breaks happen to fall.  The change in behavior
> may only occur during a later, possibly quite minor, revision of the
> document.  This is why, for this change in particular, it's especially useful
> for groff to emit notice that .hcode is doing something new.

This argument makes sense to me.

[comment #28:]
> Also, this older remark has bearing on my last comment.
> 
> 
> [comment #19:]
>> But generally in language development, if you break someone's
>> reliance on undefined behavior, you don't owe them an apology
>> and possibly not even notice.
> 
> True, and I argued exactly that in my last reply to Deri in bug #50770.  But
> we need to be scrupulous in how we define "undefined."
> 
> A formal language specification tells you exactly what the defined behavior
> is, and therefore, by implication, what is undefined.  The roff language has
> never had such a spec.
> 
> Even absent that, documentation might specifically call out certain
> constructions as having undefined behavior.
> 
> But undocumented does not necessarily imply undefined (unless the
> documentation so states); it could merely mean no one's written it down yet.
> And if that's the case, it's perfectly reasonable for users to determine the
> behavior experimentally.  (In many cases, you've done that in order _to_
> write something down.)  And when that experimentally determined behavior
> persists across multiple releases across multiple decades, it becomes harder
> to justify changing it with impunity just because no one's gotten around to
> writing down what it did across all those releases and all those decades.

I think the magnitude of "harder" is a pretty wild variable.

Consider this NEWS item for 1.24:


*  GNU troff no longer accepts a newline as a delimiter for the
   parameterized escape sequences `\A`, `\b`, `\o`, `\w`, `\X`, and
   `\Z`.


No deprecation period, no nothing.  ("No warning, no second chance!")

But nobody has yet seen fit to defend the practice being foreclosed by the
above change.

That's why I want your input on a NEWS item for this ticket.  As the voice of
caution, I can trust you to paint the change as dramatically as possible, and
if I catch you in overstatement, I can file that down.  What remains should
suffice to coach users who bother to read the "NEWS" file at all to review
their use of the affected features.

I think the number of people using Latin-1 input and setting up hyphenation
codes for accented Latin letters not found in English words is small.  But
they could exist; as I noted, people might be writing fantasy novels in
English using the occasional funny letter.  I don't mind warning them to
double-check their documents against `hcode`'s changed behavior.

So: please propose a "NEWS" item for this change.


    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?66919>
