Follow-up Comment #4, bug #40720 (project groff):

[comment #3:]
> Back in '04, Werner posted an overview of how to start tackling this:
> http://lists.gnu.org/r/groff/2004-05/msg00074.html

That's not really an overview but merely a single, partial idea with no context. Essentially, Werner wrote nothing but: "Widen the internal input character support to 32bit."

Besides, that idea is likely controversial. While i do not deny that it is usually *possible* to change a code base from using char * strings internally to using wchar_t * strings internally, to use MB-to-wide string conversion functions on input, and to use wide-to-MB string conversion functions on output, this kind of change is about as disruptive as anything can be in a code base: the result usually is that you need code changes in *almost everything* because few code bases have much code that neither inputs nor processes nor outputs strings. Certainly not groff.

So given that most files and most functions in the groff code base will likely have to be changed, Werner's dismissive statement "It's not very complicated" feels badly misleading to me. Werner's cautionary parenthetic remark "(at least at the beginning)" does not prevent his statement from being misleading: changing the input character type to wchar_t gets maximally intrusive *right away*, not at some later point. You have to uproot the whole code base *before* you can even start doing anything productive with those massive changes.

The idea Werner attributes to Bernd (which should already give us pause: Bernd is not exactly known for good software design) is even worse: "Create a new type for input character codes." That is not even controversial, it is an obviously terrible idea. You do not create a new type when a type for exactly that purpose already exists in the C standard. I admit that many people do just that for a variety of reasons - sometimes because they are simply unfamiliar with the C and POSIX standards, sometimes in misguided attempts at portability, sometimes out of sheer NIH syndrome.
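
For concreteness, the boundary conversions mentioned above (MB-to-wide on input, wide-to-MB on output) can be sketched in C roughly as follows. This is only an illustration using the standard mbstowcs(3) and wcstombs(3) functions; the helper names mb_to_wide() and wide_to_mb() are made up for this example and do not exist in groff. The point is not these two helpers, which are short, but that every function between them would have to switch from char * to wchar_t *.

```c
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical input-side helper: convert a multibyte string in the
 * current locale's encoding to a newly allocated wide string.
 * Returns NULL on an invalid byte sequence or allocation failure.
 */
wchar_t *
mb_to_wide(const char *mbs)
{
	wchar_t	*ws;
	size_t	 n;

	/* With a NULL destination, POSIX returns the required length. */
	n = mbstowcs(NULL, mbs, 0);
	if (n == (size_t)-1)
		return NULL;
	ws = calloc(n + 1, sizeof(*ws));
	if (ws == NULL)
		return NULL;
	mbstowcs(ws, mbs, n + 1);
	return ws;
}

/*
 * Hypothetical output-side helper: convert a wide string back to a
 * newly allocated multibyte string in the current locale's encoding.
 * Returns NULL if a character cannot be represented or on
 * allocation failure.
 */
char *
wide_to_mb(const wchar_t *ws)
{
	char	*mbs;
	size_t	 n;

	n = wcstombs(NULL, ws, 0);
	if (n == (size_t)-1)
		return NULL;
	mbs = malloc(n + 1);
	if (mbs == NULL)
		return NULL;
	wcstombs(mbs, ws, n + 1);
	return mbs;
}
```

Note that both directions depend on the locale selected with setlocale(3), and both can fail in the middle of otherwise valid data, so every caller needs an error path.
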
So *if* you want to change the input character type to a wide character type (which i wouldn't want to), you should at the very least use a C standard type.

This ticket is throwing around suggestions of massively intrusive changes proposed almost two decades ago without bothering to say - even in the vaguest terms - what the problem really is. Let me claim that this has already been mostly solved during the last two decades, and that the very old mail you are quoting is simply completely outdated. For example, just today, a user asked on our mailing list how to use emoji characters in groff, and a working solution was readily proposed to him.

Yes, i know the Unicode standard contains lots of advanced features and some of them may be non-trivial to implement with our current preconv(1)-based scheme. But even if we linked groff to some massive icu4c-style library, which would be an even heavier and more intrusive change than Werner's decades-old and likely outdated suggestion, *complete* Unicode support in groff would likely still be more work than rewriting groff completely from scratch.

So throwing around suggestions for massive changes serves little purpose until we have a conversation about exactly which features we want, why the proposed massive changes are the most reasonable way to implement these specific features, and who is willing to develop both the fundamental rewrite of the code base and the target changes on top of the new code base.

Finally, let me point out that how groff currently handles wide characters - supporting wide characters on both the input and the output side while keeping the code simple by mostly using plain char[] strings internally - is actually one good way of keeping wide character support simple in some circumstances.
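
As a rough illustration of that kind of scheme, here is a minimal C sketch of the idea behind a preconv(1)-style front end: non-ASCII input characters are rewritten into \[uXXXX] escapes up front, so that everything after that point can keep working on plain char[] strings. This is not code from preconv(1) or groff; the function name utf8_to_groff() is made up for this example, and it assumes UTF-8 input only, whereas the real program supports other input encodings as well.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from preconv(1): copy the UTF-8
 * string "in" to "out", replacing each non-ASCII character with a
 * \[uXXXX] escape.  Returns 0 on success, or -1 on invalid UTF-8
 * or if "out" (of size outsz) is too small.
 */
int
utf8_to_groff(const char *in, char *out, size_t outsz)
{
	/* Smallest code point allowed for each sequence length. */
	static const unsigned long min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
	const unsigned char *p = (const unsigned char *)in;
	size_t used = 0;

	if (outsz == 0)
		return -1;
	out[0] = '\0';

	while (*p != '\0') {
		unsigned long cp;
		int len, i, n;

		/* The lead byte determines the sequence length. */
		if (*p < 0x80) {
			cp = *p; len = 1;
		} else if ((*p & 0xe0) == 0xc0) {
			cp = *p & 0x1f; len = 2;
		} else if ((*p & 0xf0) == 0xe0) {
			cp = *p & 0x0f; len = 3;
		} else if ((*p & 0xf8) == 0xf0) {
			cp = *p & 0x07; len = 4;
		} else
			return -1;	/* stray continuation or invalid byte */

		/* Each continuation byte contributes six bits. */
		for (i = 1; i < len; i++) {
			if ((p[i] & 0xc0) != 0x80)
				return -1;	/* truncated sequence */
			cp = cp << 6 | (p[i] & 0x3f);
		}

		/* Reject overlong forms, surrogates, and out-of-range values. */
		if (cp < min_cp[len] || cp > 0x10ffff ||
		    (cp >= 0xd800 && cp <= 0xdfff))
			return -1;

		if (len == 1)
			n = snprintf(out + used, outsz - used, "%c", (int)cp);
		else
			n = snprintf(out + used, outsz - used,
			    "\\[u%04lX]", cp);
		if (n < 0 || (size_t)n >= outsz - used)
			return -1;	/* output buffer too small */
		used += n;
		p += len;
	}
	return 0;
}
```

All knowledge about the input encoding lives in this one function; nothing downstream needs wchar_t, a locale, or a changed string type.
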
For details, see my presentation at the 2016 EuroBSDCon, "Why and how you ought to keep multibyte character support simple":

  https://www.openbsd.org/papers/eurobsdcon2016-utf8.pdf
  https://www.openbsd.org/papers/eurobsdcon2016-utf8.roff

The parts most relevant in this context are pages 1-5 and 19-39. I'm not saying groff needs the *specific* techniques presented there - groff almost certainly is more complicated than the programs discussed in my talk - but rather that the existing preconv(1) approach, with its simplicity and modularity, has striking similarities to what is discussed there, and likely is a good approach, whereas the full mbstowcs(3)->wchar_t->wcstombs(3) dance is much harder than most people think and causes lots of little-known trouble in practice.

    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?40720>
