[bug #68129] [troff] change internal character representation to a wider data type

G. Branden Robinson Sat, 07 Mar 2026 15:13:50 -0800

Follow-up Comment #1, bug #68129 (group groff):

I'm recording this comment here so that it lives _somewhere_, since it
occurred to me while reviewing "input.cpp" yesterday.


We have to decide where the rubber meets the road in terms of conversion of
8-bit characters read from input streams into internal wide characters.

GNU _troff_ takes an OO approach to reading input, with "iterators" of "file"
and "string" varieties that each have "fill" member functions, each of which
returns an `int` that is either an `unsigned char` or EOF, pretty close to a
classic pattern in C handling of standard I/O streams.
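
As a rough illustration of that return convention, here is a minimal sketch.
The class `byte_reader` and its members are hypothetical stand-ins, not the
iterator classes in "input.cpp"; only the "unsigned char value or EOF" contract
is taken from the description above.

```cpp
#include <cassert>
#include <cstdio>   // EOF
#include <string>

// Hypothetical stand-in for an input iterator; not groff's actual class.
class byte_reader {
public:
  explicit byte_reader(const std::string &s) : data(s), pos(0) {}

  // Return the next byte as a value in 0..255, or EOF when exhausted.
  int fill()
  {
    if (pos >= data.size())
      return EOF;
    int c = static_cast<unsigned char>(data[pos++]);
    assert(c != EOF); // EOF is negative, so it cannot collide with a byte
    return c;
  }

private:
  std::string data;
  std::string::size_type pos;
};
```

The cast through `unsigned char` is what keeps bytes above 0x7F from turning
negative on platforms where plain `char` is signed.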

We might have to distinguish these "fill" readers.  Because file iterators
read from _files_, they'll be reading _bytes_, and so will have to resolve
incomplete or invalid UTF-8 sequences.
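
To make "resolve" concrete, the following sketch decodes a byte buffer and
substitutes U+FFFD for anything malformed.  The function name `decode_utf8` and
the choice of substituting U+FFFD are assumptions made for illustration; the
bug report does not say how the formatter should react to bad input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// U+FFFD REPLACEMENT CHARACTER; using it here is an assumption of this
// sketch, not a decision recorded in the bug report.
const char32_t replacement = 0xFFFD;

// Decode UTF-8 bytes into code points.  Hypothetical helper.
std::vector<char32_t> decode_utf8(const std::string &bytes)
{
  std::vector<char32_t> out;
  std::size_t i = 0;
  while (i < bytes.size()) {
    unsigned char b = static_cast<unsigned char>(bytes[i++]);
    if (b < 0x80) {               // ASCII passes through unchanged
      out.push_back(b);
      continue;
    }
    int need;                     // continuation bytes still expected
    char32_t cp;                  // code point being assembled
    char32_t min;                 // smallest non-overlong value
    if (b >= 0xC2 && b <= 0xDF) { need = 1; cp = b & 0x1F; min = 0x80; }
    else if (b >= 0xE0 && b <= 0xEF) { need = 2; cp = b & 0x0F; min = 0x800; }
    else if (b >= 0xF0 && b <= 0xF4) { need = 3; cp = b & 0x07; min = 0x10000; }
    else {                        // stray continuation or invalid lead byte
      out.push_back(replacement);
      continue;
    }
    bool ok = true;
    for (; need > 0; need--) {
      if (i >= bytes.size()) { ok = false; break; }   // input ended early
      unsigned char c = static_cast<unsigned char>(bytes[i]);
      if ((c & 0xC0) != 0x80) { ok = false; break; }  // leave c for next pass
      cp = (cp << 6) | (c & 0x3F);
      i++;
    }
    // Reject overlong forms, surrogates, and values past U+10FFFF.
    if (ok && (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
      ok = false;
    if (!ok) {
      out.push_back(replacement);
      continue;
    }
    assert(cp <= 0x10FFFF);
    out.push_back(cp);
  }
  return out;
}
```

A real file iterator would also have to carry a partial sequence across buffer
refills, which this whole-buffer sketch sidesteps.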

String iterators should never have to do that.  The "strings" in question are
sequences of _groff_ characters that have already undergone intake from a
standard I/O stream.

This won't necessarily change the interface of `::fill()`.  These can still
return `int`, since a `char32_t` should be interconvertible with `int`.  (I
need to double-check that.)  But we'll still need to handle EOF; the formatter
is designed to apply the end-of-file idiom even to reads from internal data
sources.  (GNU _troff_ uses EOF the way Python uses `StopIteration`.)

A better, or at least more stylish, solution would be an option type, but that
might be too modern a concept for the C++98 we're still chained to.  A more
C++98-ish approach would be to raise and handle an exception to indicate the
end of an iterator (which again is more Pythonish).
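
For comparison, a hand-rolled option type needs nothing beyond C++98.  The
sketch below is hypothetical; the name `maybe` and its members are invented
here and correspond to nothing in the groff sources.

```cpp
#include <cassert>

// Hypothetical minimal option type written in C++98 style.
template <class T>
class maybe {
public:
  maybe() : present(false), value() {}                  // "nothing"
  explicit maybe(const T &v) : present(true), value(v) {}

  bool has_value() const { return present; }

  // Reading an empty maybe is a programming error.
  const T &get() const
  {
    assert(present);
    return value;
  }

private:
  bool present;
  T value;
};
```

A reader returning `maybe<int>` would signal the end of input with an empty
object instead of a reserved integer, at the cost of touching every call site
that currently compares against EOF.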

But the fastest way to get where we want to go (bug #40720) might be just to
leave the special-casing of EOF in place.  If that is the case we'll need to
be careful not to overload its meaning (integer value).  This question will
come up when we decide where in `char32_t` space to relocate GNU _troff_'s
special input character codes.

See
https://cgit.git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.h?h=1.24.0
.

I had considered using the Unicode Private Use Area (PUA) but also
contemplated using positive values outside the valid Unicode code point
range--it's a 20.1-bit code, and we have 31 bits, not counting the sign bit.
Negative character codes are another possibility, but that's where EOF lives
so maybe it's better to stay out of there.
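
One way to picture the options is as disjoint ranges of `int`.  In the sketch
below every name, and the choice of `0x110000` as the base for relocated
special codes, is hypothetical; it assumes a C++11 compiler and an `int` of at
least 32 bits.

```cpp
#include <cassert>
#include <cstdio>   // EOF

const int unicode_max = 0x10FFFF;          // last Unicode code point
// Hypothetical base for relocated special codes, just past Unicode.
const int special_base = unicode_max + 1;

static_assert(sizeof(int) >= 4, "sketch assumes int of at least 32 bits");
static_assert(EOF < 0, "EOF must stay clear of nonnegative codes");

// True for any value in the Unicode code point range (surrogates included).
bool in_unicode_range(int c) { return c >= 0 && c <= unicode_max; }

// True for the BMP Private Use Area and the private-use planes 15 and 16.
bool is_private_use(int c)
{
  return (c >= 0xE000 && c <= 0xF8FF)
         || (c >= 0xF0000 && c <= 0xFFFFD)
         || (c >= 0x100000 && c <= 0x10FFFD);
}

// True for positive values beyond Unicode, where special codes could live.
bool is_special(int c) { return c >= special_base; }

// Map the nth special code (n >= 0) into the region above Unicode.
int make_special(int n)
{
  assert(n >= 0);
  return special_base + n;
}
```

Under this layout EOF, Unicode text, and the special codes can never be
confused with one another, whereas codes placed in the PUA would collide with
any private-use characters that appear in a document.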


    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?68129>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/
