Follow-up Comment #3, bug #58796 (project groff):

Thanks for the comments, Ingo. I understand and support the Unix philosophy, but I disagree with some of your underlying assumptions.
If you developed a brand-new tool to do some text-processing task, something designed to be used in pipelines with other tools, you could choose to specify that:

a) the input character set of your tool be a Unicode encoding, or

b) the tool only take some subset of Unicode as input, and require another tool to pipe in translations for the rest of Unicode, using a syntax invented specifically for these tools and not standardized anywhere else.

If you chose (b) on the grounds that "pipelines are more Unixy," this would not be a popular choice. Requiring helper applications to understand modern character sets is not inherently "the Unix way." It's a stopgap used for historical applications whose cores do not (yet) speak Unicode.

Groff is a historical application. It will always support \['e] because it must always be able to process historical documents that used such character representations. But \['e] should in no way be considered the canonical way to represent the Unicode character LATIN SMALL LETTER E WITH ACUTE. Unicode gives us the canonical representation; \['e] and \[u00E9] are merely additional, roff-specific ways to represent this character.

The "roff-specific" part is important: the entire Unix philosophy of pipelines requires that all I/O be in as general a form as possible, so that a program can interact with as wide a range of other programs as possible. groff and preconv, by contrast, communicate in a secret code that no other tool uses. That's not the Unix way; that's a band-aid covering up something Werner identified, back in 2013, as one of the four major areas of groff needing modernization. The need has not lessened in the intervening years.

That groff is a historical package does not absolve it from modern best practices in software design. Looking to the long term, this is what we should be striving for.
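To make that "secret code" concrete, here is a minimal Python sketch of the kind of translation being discussed: non-ASCII characters in UTF-8 input rewritten as groff's \[uXXXX] special-character escapes. This is only an illustration of the idea, not preconv's actual implementation (real preconv also deals with encoding detection, composite characters, and other details).

```python
# Sketch only: rewrite non-ASCII characters as groff \[uXXXX] escapes,
# roughly the translation preconv supplies for groff's benefit.
# Not preconv's real algorithm; composition, encoding detection, etc.
# are ignored here.

def to_groff_escapes(text: str) -> str:
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)                      # ASCII passes through
        else:
            out.append('\\[u%04X]' % ord(ch))   # e.g. é -> \[u00E9]
    return ''.join(out)

print(to_groff_escapes('café'))  # caf\[u00E9]
```

The point of the example is precisely the complaint above: the output syntax is meaningful only to the roff family of tools, not to any other program in a pipeline.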
preconv is a very useful bridge in the meantime; I believe you that the task of converting the historical C++ code to natively handle UTF-8 input is big and messy.* Nonetheless, native Unicode I/O should be considered groff's ultimate goal.

* I'm currently going through a similar process--on a much smaller scale--with some Perl code. And Perl actually handles automatically a lot of the logic that a C program would have to implement manually. I don't know what C++'s facilities are like, but I do know that no matter how good the language's design, you'll run into stupid problems <http://www.perlmonks.org/?node_id=11119633> that will derail you for a few hours.

[comment #2:]
> I would hate it if groff would start requiring iconv.

It's far better to leverage existing code that does what you need than to re-implement the same logic in your own code. The principle "solve one task only, but solve it well" ought to free the groff package from implementing its own conversions between character encodings and let it instead focus on its primary task.

Anyway, if groff handled Unicode I/O natively (and thus also ASCII, a subset thereof), I wouldn't expect iconv to become an installation requirement; it would be a run-time requirement only for those users who need to feed in documents in other character encodings.

> it's much better to encode all non-ASCII characters and not force users
> to adopt an obsolete locale.

Good points here; I agree. I fell into the trap of looking at the encoding groff currently handles natively, and not at the big picture.

_______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?58796>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/
