[EMAIL PROTECTED] wrote on 2002-12-09 19:31 UTC:

> What I have now is: Man somehow "knows" (e.g. from the file .charset
> near the man page) what character encoding a man page is in, and also
> (e.g. from the users locale) what character encoding is desired.
> It does an iconv from man encoding to desired encoding and feeds
> that to nroff.

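For readers who want the quoted pipeline spelled out, here is a rough sketch of the recoding step that man would perform before handing the page to nroff. The function name recode and the sample page are made up for illustration; Python's built-in codecs stand in for iconv, and nothing here is taken from the actual man sources.

```python
def recode(page_bytes: bytes, page_encoding: str, locale_encoding: str) -> bytes:
    """Hypothetical stand-in for the iconv step run by man before nroff."""
    text = page_bytes.decode(page_encoding)
    # errors="replace" substitutes '?' for characters the target cannot represent.
    return text.encode(locale_encoding, errors="replace")


# A page stored in ISO 8859-1, viewed by a user with a UTF-8 locale.
page = "Größe".encode("iso-8859-1")
utf8_page = recode(page, "iso-8859-1", "utf-8")

# The same text for a user whose locale can only display ASCII.
ascii_page = recode(utf8_page, "utf-8", "ascii")
```

Note that the byte count changes during recoding (5 bytes in, 7 bytes out for the UTF-8 case), so whatever consumes the result can no longer treat one byte as one character.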
*roff always needs to be aware of its input encoding, otherwise it has no chance of predicting character display width (think of multi-byte characters, double-width ideograms, combining characters).

> - Man should never convert the encoding of man pages, because
> where two programs in a pipeline recode characters, this promises to
> hide and obscure problems later in difficult to understand ways.
>
> It is clear that there must be a recode somewhere.
> It is not clear to me that it would be preferable to do this in nroff.

I think there can be no doubt that a formatter needs to be aware of what characters it processes. Formatters need to know not only the width of characters, but might also apply transforms such as converting to uppercase letters, all of which are encoding dependent. The ugly hack of keeping the formatter unaware and in some transparent mode really only works for a few very primitive 8-bit encodings (such as most of ISO 8859), but it sometimes breaks even there, and it certainly breaks for any multi-byte encoding.

> One of the advantages of iconv in man is that it works today, also
> with old *roff. I am unhappy with recode in groff.
> Always when programs fiddle with one's bits one has to struggle
> to tell them to keep their hands off.

So please keep your hands off the encoding in man. The problem is that *roff does not at present have a mechanism to tell it what its input encoding is, therefore this needs to be fixed in *roff.

By the way, Brian Kernighan converted the real troff to UTF-8 already back in late 1992 (see USENIX Winter 1993 proceedings, page 50)! Considering this historic precedent for the original AT&T troff, I think it is time that groff should also be made able to digest UTF-8 man pages, which would finally open the way to converting all non-ASCII man pages to UTF-8, such that man can forget completely about the encoding issue. The fewer configuration mechanisms, the better.

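To make the point about width and case conversion concrete, here is a small sketch. The helper names display_width and upper_bytes are invented for this illustration, and the width rule is deliberately simplified (it ignores control characters and other zero-width code points); it is not how any *roff implementation works.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate number of terminal columns (simplified illustration)."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks sit on the previous character
        # Wide and fullwidth East Asian characters take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def upper_bytes(data: bytes, encoding: str) -> bytes:
    """Uppercase a byte string; only possible once the encoding is known."""
    return data.decode(encoding).upper().encode(encoding)


ideograms = "漢字".encode("utf-8")     # 6 bytes, 4 columns
accented = "e\u0301".encode("utf-8")   # 3 bytes, 1 column

# Case mapping depends on the encoding: in ISO 8859-1 the byte 0xE4 is a
# lowercase letter whose uppercase is 0xC4, while in KOI8-U it is the
# other way round (0xC4 is lowercase, 0xE4 is uppercase).
latin_upper = upper_bytes(b"\xe4", "iso-8859-1")  # b"\xc4"
koi8_upper = upper_bytes(b"\xc4", "koi8-u")       # b"\xe4"
```

A formatter that only sees bytes cannot get either the column count or the case mapping right in these cases.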
> I like a groff that has by default output in the same character set
> as input. Of course it needs to know whether the input is in an 8-bit
> encoding or something more complicated, but in the common case of
> 8-bit encoding and plain text output it may not even be necessary
> to know anything about the character set. Thus, things would
> "just work" with ISO 8859-2 or KOI-8U even when the user does not
> set any locale.
>
> The system you propose sounds more fragile.

The system I proposed is engineered to work reliably for any encoding, including UTF-8. The hack you propose might work by accident for ISO 8859-2 or KOI-8U much of the time.

I think solving the problem of non-ASCII man pages is in the hands of the groff maintainer, not (at least not initially) in the hands of the maintainers of the various wrappers that call groff.

Markus

--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
