[bug #67734] [troff] ban C0 controls and Latin-1 supplement characters from use in identifiers

G. Branden Robinson Fri, 28 Nov 2025 18:09:15 -0800

Follow-up Comment #3, bug #67734 (group groff):

Hi Deri,

At 2025-11-26T12:05:51-0500, Deri James wrote:
> Follow-up Comment #1, bug #67734 (group groff):
>
> It would be nice (for non-english users) to allow \[uXXXX]
> "characters" to be used in identifiers, since currently they are
> rejected.

I'm not crazy about this idea, but I think I can propose something
you'll like better.  See below.

> Also, on systems which use a non UTF locale (such as ISO-8859-11) all
> the Thai characters are A0-FB, so could no longer be used in
> identifiers.

Unfortunately I have no evidence that any Thai language writers use
groff.  I wish I did; I'd solicit them for localization files for the
ISO Latin/Thai encoding (ISO 8859-11) and the Thai language.

(Ironically enough, James Clark likely is such a person.  I gather he
relocated permanently to Thailand decades ago.)

Even with the ISO Latin-2 character encoding supported, I couldn't get
any of the Hungarians on the groff mailing list to contribute a
"hu.tmac" file.

So I feel it's worth rolling the dice here; any users of ISO 8859-11
would have to migrate (or use preconv, which they may already be doing)
anyway when we slay the dragon named bug #40720.

> Of course if the UPGRADE in bug #40720 takes into account locale -
> changing the characters to their Unicode Code Point internally for
> TOKEN_CHARs, but until then!!

Here's the idea I think you'll like better.

Once we've widened GNU _troff_'s internal character data type to 32
bits, there won't be any reason to represent inbound UTF-8 sequences
with special character tokens (`TOKEN_SPECIAL_CHAR`).  We can use
ordinary character tokens for them (`TOKEN_CHAR`).
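
As a rough illustration of why the width matters, here is a minimal
sketch in Python. It is not GNU _troff_'s C++ code; the function name
`code_points` is hypothetical and only shows that a multi-byte UTF-8
sequence decodes to a single value that fits in a 32-bit cell but not in
an 8-bit one.

```python
def code_points(utf8_bytes):
    """Decode UTF-8 input into a list of integer code points."""
    return [ord(ch) for ch in utf8_bytes.decode("utf-8")]


# Thai letter KO KAI (U+0E01) arrives as three bytes of UTF-8 ...
thai_bytes = "\u0e01".encode("utf-8")

# ... but decodes to exactly one code point.
thai = code_points(thai_bytes)

# That value does not fit in 8 bits, but it does fit in 32.
fits_in_8_bits = all(cp < 2**8 for cp in thai)
fits_in_32_bits = all(cp < 2**32 for cp in thai)

# The largest Unicode code point, U+10FFFF, needs 21 bits.
max_code_point_bits = (0x10FFFF).bit_length()
```

With every input character held as one such value, the reader would
have no need for a separate token kind just to carry non-ASCII input.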

At that point, we should get support for identifiers containing Thai
language characters, Greek, Russian, and whatnot pretty much for free.
That includes being able to use them in automatically generated string
identifiers for PDF bookmarks.

I expect to still ban any C0 controls that aren't already unsupported,
and the delete character (ASCII 127 decimal).  In identifiers, that
is--I've learned the hard way that sometimes ASCII 127 is handy as a
_delimiter_ in "break glass in emergency" situations.  :)

But I expect that "😀😃😄😁😆😅" will be a valid identifier, stored
internally as a list of six "ordinary characters".
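
The rule described above can be sketched as follows. This is a Python
illustration under stated assumptions, not the planned implementation:
the names `is_c0_control` and `is_valid_identifier` are hypothetical,
and the check covers only the C0 controls and the delete character named
in this message, not any other restriction GNU _troff_ places on
identifiers.

```python
DELETE = 0x7F  # ASCII 127 decimal


def is_c0_control(cp):
    """Return True for the C0 control range U+0000..U+001F."""
    return 0x00 <= cp <= 0x1F


def is_valid_identifier(name):
    """Accept any non-empty name free of C0 controls and DEL.

    Assumption: each element of `name` is one decoded code point,
    i.e. one "ordinary character" in the message's terms.
    """
    if not name:
        return False
    return not any(
        is_c0_control(ord(ch)) or ord(ch) == DELETE for ch in name
    )


# The six emoji from the message, written as escapes so that the
# code point count is unambiguous.
emoji = "\U0001F600\U0001F603\U0001F604\U0001F601\U0001F606\U0001F605"

emoji_ok = is_valid_identifier(emoji)        # six ordinary characters
delete_ok = is_valid_identifier("a\x7fb")    # contains ASCII 127
tab_ok = is_valid_identifier("a\tb")         # contains a C0 control
```

Under these assumptions the emoji name passes, while names containing a
tab or ASCII 127 are rejected.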

    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?67734>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/
