Re: groff, man and Unicode

Markus Kuhn Mon, 09 Dec 2002 12:32:23 -0800

[EMAIL PROTECTED] wrote on 2002-12-09 19:31 UTC:
> What I have now is: Man somehow "knows" (e.g. from the file .charset
> near the man page) what character encoding a man page is in, and also
> (e.g. from the users locale) what character encoding is desired.
> It does an iconv from man encoding to desired encoding and feeds
> that to nroff.


*roff always needs to be aware of its input encoding, otherwise it has
no chance of predicting character display width (think of multi-byte
characters, double-width ideograms, combining characters).
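To illustrate why byte counts cannot stand in for display width, here is
a minimal Python sketch. The helper display_width is hypothetical and
deliberately simplified: it only accounts for combining marks and East
Asian wide characters, and ignores control and other zero-width
characters that a real formatter would also have to handle.

```python
import unicodedata

def display_width(text: str) -> int:
    """Approximate the number of terminal columns needed for text."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Combining marks are drawn over the previous character.
            continue
        # East Asian Wide (W) and Fullwidth (F) characters take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

samples = [
    "man",                   # plain ASCII
    "e\u0301",               # 'e' followed by COMBINING ACUTE ACCENT
    "\u65e5\u672c\u8a9e",    # three double-width ideographs
]

for s in samples:
    raw = s.encode("utf-8")
    print(repr(s), "UTF-8 bytes:", len(raw), "columns:", display_width(s))
```

Only for the ASCII sample do the byte count and the column count agree,
which is the case an encoding-unaware formatter silently assumes.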

>       - Man should never convert the encoding of man pages, because
>         where two programs in a pipeline recode characters, this promises to
>         hide and obscure problems later in difficult to understand ways.
> 
> It is clear that there must be a recode somewhere.
> It is not clear to me that it would be preferable to do this in nroff.

I think there can be no doubt that a formatter needs to be aware of what
characters it processes. Formatters need to know not only the width of
characters, but might also apply transforms such as converting to
uppercase letters, all of which are encoding dependent. An ugly hack of
keeping the formatter unaware and in some transparent mode really only
works for a few very primitive 8-bit encodings (such as most of the ISO
8859 family), but sometimes breaks even there, and it certainly breaks
for any multi-byte encoding.
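The uppercase case can be sketched in Python. Both function names below
are hypothetical, and the "transparent" variant is only my assumption of
how an encoding-unaware formatter would behave (it touches ASCII bytes
and passes everything else through); it is not taken from any *roff
implementation.

```python
def upper_transparent(raw: bytes) -> bytes:
    """Uppercase without knowing the encoding: only ASCII a-z is changed."""
    return raw.upper()

def upper_aware(raw: bytes, encoding: str) -> bytes:
    """Uppercase with knowledge of the encoding: decode, transform, re-encode."""
    return raw.decode(encoding).upper().encode(encoding)

word = "\u017eena"  # U+017E LATIN SMALL LETTER Z WITH CARON, then "ena"

for encoding in ("iso-8859-2", "utf-8"):
    raw = word.encode(encoding)
    print(encoding,
          "transparent:", upper_transparent(raw).decode(encoding),
          "aware:", upper_aware(raw, encoding).decode(encoding))
```

In both encodings the transparent variant leaves the first letter in
lower case, because its byte value (0xBE in ISO 8859-2, the pair 0xC5
0xBE in UTF-8) lies outside ASCII.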

> One of the advantages of iconv in man is that it works today, also
> with old *roff. I am unhappy with recode in groff.
> Always when programs fiddle with one's bits one has to struggle
> to tell them to keep their hands off.

So please keep man's hands off the encoding. The problem is that
*roff does not at present have a mechanism to tell it what its input
encoding is; therefore this needs to be fixed in *roff.

By the way, Brian Kernighan converted the real troff to UTF-8 already
back in late 1992 (see USENIX Winter 1993 proceedings, page 50)!

Considering this historic precedent for the original AT&T troff, I think
it is time that groff should also be made able to digest UTF-8 man
pages, which would finally open the way to converting all non-ASCII man
pages to UTF-8, such that man can forget completely about the encoding
issue. The fewer configuration mechanisms, the better.

> I like a groff that has by default output in the same character set
> as input. Of course it needs to know whether the input is in an 8-bit
> encoding or something more complicated, but in the common case of
> 8-bit encoding and plain text output it may not even be necessary
> to know anything about the character set. Thus, things would
> "just work" with ISO 8859-2 or KOI-8U even when the user does not
> set any locale.
> 
> The system you propose sounds more fragile.

The system I proposed is engineered to work reliably for any encoding,
including UTF-8. The hack you propose might work by accident for ISO
8859-2 or KOI-8U much of the time.
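A small Python sketch of why the transparent approach depends on luck:
the same bytes mean different things under different 8-bit encodings,
and may not be valid UTF-8 at all. The helper name interpretations is
hypothetical; the codec names are those of Python's standard library.

```python
def interpretations(raw: bytes, encodings):
    """Decode the same bytes under each encoding, recording failures."""
    result = {}
    for enc in encodings:
        try:
            result[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            result[enc] = None  # not a valid byte sequence in this encoding
    return result

# U+017E (z with caron) followed by "ena", as stored in an ISO 8859-2 page.
raw = "\u017eena".encode("iso-8859-2")

seen = interpretations(raw, ["iso-8859-2", "koi8_u", "utf-8"])
for enc, text in seen.items():
    print(enc, "->", repr(text))
```

Passing the bytes through untouched only gives the right result when the
terminal happens to use the same encoding as the page; a formatter that
is told the input encoding can convert to whatever the output needs.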

I think solving the problem of non-ASCII man pages is in the hands of
the groff maintainer, not (at least not initially) in the hands of the
maintainers of the various wrappers that call groff.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
