Follow-up Comment #1, bug #68129 (group groff) <https://savannah.gnu.org/bugs/?68129>:

A comment I'm recording here so as to record it _somewhere_, since it occurred to me while reviewing "input.cpp" yesterday.
We have to decide where the rubber meets the road in terms of converting 8-bit characters read from input streams into internal wide characters.

GNU _troff_ takes an OO approach to reading input, with "iterators" of "file" and "string" varieties that each have "fill" member functions, each of which returns an `int` that is either an `unsigned char` value or EOF, pretty close to the classic pattern in C's handling of standard I/O streams.

We might have to distinguish these "fill" readers. Because file iterators read from _files_, they'll be reading _bytes_, and so will have to resolve incomplete or invalid UTF-8 sequences. String iterators should never have to do that: the "strings" in question are sequences of _groff_ characters that have already undergone intake from a standard I/O stream.

This won't necessarily change the interface of `::fill()`. These functions can still return `int`, since a `char32_t` should be interconvertible with `int`. (I need to double-check that.) But we'll still need to handle EOF; the formatter is designed to apply the end-of-file idiom even to reads from internal data sources. (GNU _troff_ uses EOF the way Python uses `StopIteration`.)

A better, or at least more stylish, solution would be an option type, but that might be too modern a concept for the C++98 we're still chained to. A more C++98-ish approach would be to raise and handle an exception to indicate the end of an iterator (which again is more Pythonish). But the fastest way to get where we want to go (bug #40720) might be just to leave the special-casing of EOF in place. If that is the case we'll need to be careful not to overload its meaning (integer value).

This question will come up when we decide where in `char32_t` space to relocate GNU _troff_'s special input character codes to. See <https://cgit.git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.h?h=1.24.0>.
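To make the file/string distinction concrete, here is a minimal C++98-compatible sketch, not groff's actual code. The names `byte_source`, `char_source`, `next()`, and `REPLACEMENT_CHAR` are hypothetical stand-ins for the real iterator classes and their "fill" functions. It assumes an `int` of at least 32 bits, and that malformed or truncated UTF-8 is replaced with U+FFFD, which is only one of several policies the formatter could choose.

```cpp
#include <cstdio>   // EOF
#include <string>
#include <vector>

// Hypothetical substitute for malformed input: U+FFFD REPLACEMENT CHARACTER.
const int REPLACEMENT_CHAR = 0xFFFD;

// Stand-in for a file-backed reader: it sees raw bytes, so it must
// decode UTF-8 and cope with invalid or incomplete sequences.
class byte_source {
  std::string bytes;
  std::string::size_type pos;
public:
  explicit byte_source(const std::string &s) : bytes(s), pos(0) {}

  // Return the next code point as a nonnegative int, or EOF.
  int next()
  {
    if (pos >= bytes.size())
      return EOF;
    unsigned char lead = static_cast<unsigned char>(bytes[pos++]);
    if (lead < 0x80)
      return lead;                  // ASCII passes through unchanged
    int need;                       // continuation bytes expected
    int cp;                         // code point under construction
    int lowest;                     // smallest value legal at this length
    if (lead >= 0xC2 && lead <= 0xDF) {
      need = 1; cp = lead & 0x1F; lowest = 0x80;
    } else if (lead >= 0xE0 && lead <= 0xEF) {
      need = 2; cp = lead & 0x0F; lowest = 0x800;
    } else if (lead >= 0xF0 && lead <= 0xF4) {
      need = 3; cp = lead & 0x07; lowest = 0x10000;
    } else
      return REPLACEMENT_CHAR;      // stray continuation or invalid lead byte
    for (int i = 0; i < need; i++) {
      if (pos >= bytes.size())
        return REPLACEMENT_CHAR;    // input ended in mid-sequence
      unsigned char b = static_cast<unsigned char>(bytes[pos]);
      if ((b & 0xC0) != 0x80)
        return REPLACEMENT_CHAR;    // leave b to be read by the next call
      cp = (cp << 6) | (b & 0x3F);
      pos++;
    }
    // Reject overlong forms, surrogates, and values past U+10FFFF.
    if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      return REPLACEMENT_CHAR;
    return cp;
  }
};

// Stand-in for a string-backed reader: its contents were decoded on
// intake, so it only hands them back and signals the end with EOF.
class char_source {
  std::vector<int> chars;
  std::vector<int>::size_type pos;
public:
  explicit char_source(const std::vector<int> &v) : chars(v), pos(0) {}
  int next() { return pos < chars.size() ? chars[pos++] : EOF; }
};
```

Both readers keep the same `int`-or-EOF return convention, so a caller looping until EOF need not know which kind it is reading from; only the byte-oriented one carries any decoding logic.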
I had considered using the Unicode Private Use Area (PUA) but also contemplated using positive values outside the valid Unicode code point range--Unicode is a 20.1-bit code, and we have 31 bits, not counting the sign bit. Negative character codes are another possibility, but that's where EOF lives, so maybe it's better to stay out of there.
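The candidate regions for relocated special codes can be sketched as a classifier over `int` values. This is an illustration only: `code_region`, `classify()`, and `is_private_use()` are hypothetical names, and the code assumes an `int` of at least 32 bits. The Private Use ranges are the three Unicode defines (U+E000..U+F8FF in the BMP, plus planes 15 and 16).

```cpp
#include <cstdio>   // EOF

// Hypothetical labels for the regions of int space under discussion.
enum code_region {
  REGION_EOF,             // the end-of-input sentinel itself
  REGION_NEGATIVE,        // other negative values, EOF's neighborhood
  REGION_UNICODE,         // U+0000..U+10FFFF, excluding Private Use
  REGION_PRIVATE_USE,     // Unicode Private Use Areas
  REGION_BEYOND_UNICODE   // positive values above U+10FFFF
};

// True if c lies in one of the three Private Use ranges.
inline bool is_private_use(int c)
{
  return (c >= 0xE000 && c <= 0xF8FF)
      || (c >= 0xF0000 && c <= 0xFFFFD)
      || (c >= 0x100000 && c <= 0x10FFFD);
}

// Classify a value as returned by an int-or-EOF reader.
inline code_region classify(int c)
{
  if (c == EOF)
    return REGION_EOF;
  if (c < 0)
    return REGION_NEGATIVE;
  if (c > 0x10FFFF)
    return REGION_BEYOND_UNICODE;
  if (is_private_use(c))
    return REGION_PRIVATE_USE;
  return REGION_UNICODE;
}
```

Under these assumptions, codes placed at 0x110000 and above cannot collide with any code point a document might legitimately contain, whereas codes placed in the PUA could collide with characters a font or user has assigned there.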
