[bug #40720] [UPGRADE] improve Unicode support

Ingo Schwarze Thu, 14 Jul 2022 04:15:08 -0700

Follow-up Comment #4, bug #40720 (project groff):

[comment #3 comment #3:]
> Back in '04, Werner posted an overview of how to start tackling this:
> http://lists.gnu.org/r/groff/2004-05/msg00074.html


That's not really an overview but merely a single, partial idea with no
context.
Essentially, Werner wrote nothing but: "Widen the internal input character
support to 32bit."
Besides, that idea is likely controversial.
While i do not deny that it is usually *possible* to change a codebase from
using char * strings internally to using wchar_t * strings internally, with
MB-to-wide string conversion functions on input and wide-to-MB string
conversion functions on output, this kind of change is about as disruptive as
anything can be in a codebase.  You usually end up needing code changes in
*almost everything*, because few codebases contain much code that neither
inputs nor processes nor outputs strings.  Certainly groff does not.  So
given that most files and most functions in the groff code base will likely
have to be changed, Werner's dismissive statement "It's not very complicated"
feels badly misleading to me.  Werner's cautionary parenthetic remark "(at
least at the beginning)" does not prevent his statement from being misleading:
changing the input character type to wchar_t gets maximally intrusive *right
away*, not at some later point.  You have to uproot the whole codebase
*before* you can even start doing anything productive with those massive
changes.
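
To make the boundary conversions mentioned above concrete, here is a minimal
sketch in C.  The helper names mb_to_wide() and wide_to_mb() are hypothetical
and do not exist in groff; the sketch relies on the POSIX behaviour of
mbstowcs(3) and wcstombs(3) returning the required length when passed a null
destination.  Every input path would need something like the first function,
every output path something like the second, and everything in between would
have to be switched from char * to wchar_t *.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert a multibyte string to a newly
 * allocated wide string according to the current LC_CTYPE locale.
 * Returns NULL on an invalid byte sequence or allocation failure.
 */
wchar_t *
mb_to_wide(const char *mb)
{
	size_t n;
	wchar_t *w;

	assert(mb != NULL);
	/* POSIX: a null destination yields the required length. */
	n = mbstowcs(NULL, mb, 0);
	if (n == (size_t)-1)
		return NULL;
	w = calloc(n + 1, sizeof(*w));
	if (w == NULL)
		return NULL;
	if (mbstowcs(w, mb, n + 1) == (size_t)-1) {
		free(w);
		return NULL;
	}
	return w;
}

/*
 * Hypothetical helper: the reverse conversion for the output side.
 * Returns NULL if a wide character cannot be represented in the
 * current locale, or on allocation failure.
 */
char *
wide_to_mb(const wchar_t *w)
{
	size_t n;
	char *mb;

	assert(w != NULL);
	n = wcstombs(NULL, w, 0);
	if (n == (size_t)-1)
		return NULL;
	mb = malloc(n + 1);
	if (mb == NULL)
		return NULL;
	if (wcstombs(mb, w, n + 1) == (size_t)-1) {
		free(mb);
		return NULL;
	}
	return mb;
}
```

Note that both functions depend on the locale: without a prior call to
setlocale(LC_CTYPE, "") only ASCII input is guaranteed to convert, and both
can fail at run time, so every caller needs an error path as well.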

The idea Werner attributes to Bernd (which should already give us pause: Bernd
is not exactly known for good software design) is even worse: "Create a new
type for input character codes."  That is not even controversial; it is an
obviously terrible idea.  You do not create a new type when a type for exactly
that purpose already exists in the C standard.  I admit that many people do
just that for a variety of reasons - sometimes because they are simply
unfamiliar with the C and POSIX standards, sometimes in misguided attempts at
portability, sometimes out of sheer NIH syndrome.  So *if* you want to change
the input character type to a wide character type (which i wouldn't want to),
you should at the very least use a C standard type.

This ticket is throwing around suggestions of massively intrusive changes
proposed almost two decades ago without bothering to say - even in the vaguest
terms - what the problem really is.  Let me claim that it has already been
mostly solved during the last two decades, and that the very old mail you are
quoting is simply outdated.  For example, just today, a user asked on our
mailing list how to use emoji characters in groff, and a working solution was
readily proposed to him.  Yes, i know the Unicode standard contains lots of
advanced features and some of them may be non-trivial to implement with our
current preconv(1)-based scheme.  But even if we linked groff to some massive
icu4c-style library, which would be an even heavier and more intrusive change
than Werner's almost two-decade-old and likely outdated suggestion, *complete*
Unicode support in groff would likely still be more work than rewriting groff
completely from scratch.  So throwing around suggestions for massive changes
serves little purpose until we have a conversation about exactly which
features we want, why the proposed massive changes are the most reasonable way
to implement these specific features, and who is willing to develop both the
fundamental rewrite of the code base and the target changes on top of the new
code base.

Finally, let me point out that how groff currently handles wide characters -
supporting wide characters on both the input and output side while keeping the
code simple by mostly using plain char[] strings internally - is actually one
good way to keep wide character support simple in some circumstances.  For
details, see my presentation at the 2016 EuroBSDCon:
"Why and how you ought to keep multibyte character support simple"
https://www.openbsd.org/papers/eurobsdcon2016-utf8.pdf
https://www.openbsd.org/papers/eurobsdcon2016-utf8.roff
The parts most relevant in this context are pages 1-5 and 19-39.
I'm not saying groff needs the *specific* techniques presented there - groff
almost certainly is more complicated than the programs discussed in my talk -
but rather that the existing preconv(1) approach, with its simplicity and
modularity, has striking similarities to what is discussed there and likely is
a good approach, whereas the full mbstowcs(3)->wchar_t->wcstombs(3) dance is
much harder than most people think and causes lots of little-known trouble in
practice.
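
As an illustration of the "plain char[] internally" idea, the following C
sketch turns UTF-8 input into ASCII-only text with \[uXXXX] escapes, which is
the general kind of work a preprocessor in the style of preconv(1) does before
the formatter ever sees the input.  It is a simplified sketch and not
preconv's actual code: the function names utf8_decode() and utf8_to_groff()
are hypothetical, only UTF-8 is handled, and the exact escape format is an
assumption.  The point is that the result remains an ordinary char string, so
nothing downstream needs to learn about wchar_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: decode one UTF-8 sequence starting at s.
 * Stores the code point in *cp and returns the number of bytes
 * consumed, or 0 if the sequence is invalid, truncated, overlong,
 * a surrogate, or beyond U+10FFFF.
 */
static size_t
utf8_decode(const unsigned char *s, unsigned long *cp)
{
	size_t len, i;
	unsigned long min;

	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xe0) == 0xc0) {
		len = 2; *cp = s[0] & 0x1f; min = 0x80;
	} else if ((s[0] & 0xf0) == 0xe0) {
		len = 3; *cp = s[0] & 0x0f; min = 0x800;
	} else if ((s[0] & 0xf8) == 0xf0) {
		len = 4; *cp = s[0] & 0x07; min = 0x10000;
	} else
		return 0;
	for (i = 1; i < len; i++) {
		/* A terminating NUL fails this test, too. */
		if ((s[i] & 0xc0) != 0x80)
			return 0;
		*cp = (*cp << 6) | (s[i] & 0x3f);
	}
	if (*cp < min || *cp > 0x10ffff ||
	    (*cp >= 0xd800 && *cp <= 0xdfff))
		return 0;
	return len;
}

/*
 * Hypothetical helper: copy the UTF-8 string in to out, replacing
 * each non-ASCII character with a \[uXXXX] escape.  Returns the
 * length of the result, or -1 if the input is invalid or out is
 * too small.
 */
int
utf8_to_groff(const char *in, char *out, size_t outsz)
{
	const unsigned char *s = (const unsigned char *)in;
	size_t used = 0, n;
	unsigned long cp;
	int w;

	assert(in != NULL && out != NULL);
	while (*s != '\0') {
		n = utf8_decode(s, &cp);
		if (n == 0)
			return -1;
		if (cp < 0x80) {
			if (used + 1 >= outsz)
				return -1;
			out[used++] = (char)cp;
		} else {
			w = snprintf(out + used, outsz - used,
			    "\\[u%04lX]", cp);
			if (w < 0 || (size_t)w >= outsz - used)
				return -1;
			used += (size_t)w;
		}
		s += n;
	}
	if (used >= outsz)
		return -1;
	out[used] = '\0';
	return (int)used;
}
```

The decoder does not consult the locale at all, and only this one boundary
function knows about multibyte sequences; the rest of a program built this way
keeps operating on bytes.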



    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?40720>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/
