[EMAIL PROTECTED] wrote on 2002-12-09 16:29 UTC:
> Yesterday I made a new man and played a bit with it.
> I can get man pages in foreign character sets to format
> correctly, but things require a lot of fiddling, where
> long ago things worked fine without problem.
>
> Long ago symbols were just passed through. These days
> groff knows too much, and that causes problems.
>
> I think you once asked me to remove the -Tlatin1 from nroff
> but I find that I need to add it to make nroff work well
> with utf-8 or iso 8859-2. The -Tlatin1 acts as a "pass through"
> flag, while without it the utf8 is converted once more to
> utf8 (as if it were latin1 to start with) yielding unreadable
> garbage. It looks like groff assumes by definition that its
> input is latin1, so that -Tlatin1 becomes "no conversion needed".
>
> Man now throws iconv into the chain.

Hm, this sounds like dangerous hacking to me.
You really need to sync up with Werner here, otherwise we end up with
lots of conversions being added instead of (and that's what we
ultimately want) them being removed.
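
The garbage described in the quoted text is what you get when UTF-8 bytes are taken for Latin-1 and converted to UTF-8 a second time. A small Python sketch of that effect follows; the sample string is made up for illustration, and this only imitates the behaviour, it is not groff's actual code path:

```python
# Illustration only: imitates the double conversion, not groff internals.
text = "M\u00fcller"                      # made-up sample with one non-ASCII letter

utf8_bytes = text.encode("utf-8")         # bytes as stored in a UTF-8 man page
misread = utf8_bytes.decode("latin-1")    # a tool assumes its input is Latin-1
double_encoded = misread.encode("utf-8")  # and converts "to UTF-8" once more

# Each non-ASCII character has turned into two unrelated characters.
print(double_encoded.decode("utf-8"))     # prints "MÃ¼ller"

# Treating input and output both as Latin-1 leaves the bytes untouched,
# which is why -Tlatin1 appears to act as a pass-through flag.
passthrough = utf8_bytes.decode("latin-1").encode("latin-1")
print(passthrough == utf8_bytes)          # prints "True"
```

Every further conversion stage in the pipeline is another place where this kind of wrong assumption can slip in.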

My personal opinion:

- There are basically two options for determining the input encoding
of groff, and they are not mutually exclusive:
a) Man somehow "knows" (e.g. from a config file that lists the character
encoding on a per-subdirectory basis) which character encoding each
man page is in and simply tells groff what its input character
encoding is via a (to be added) command-line option like "-eUTF-8".
b) All man pages are tagged with a character encoding name
and groff therefore figures out the input encoding itself
directly from reading the man page.
- Man should never convert the encoding of man pages itself: when two
programs in a pipeline both recode characters, problems get hidden and
surface later in ways that are difficult to understand.
- groff really should scrap the character encoding variants
(ascii, ascii8, latin1, utf8, cp1047, nippon, etc.) from the -T
option. The -T option should switch between ps, dvi, ..., html and text.
The new "text" option outputs plaintext (so far called ascii), and the
locale setting (or if really necessary a new command line option
"-EISO-8859-1" or so to override the locale) defines the encoding
of this plaintext output. The output format (ps, text, html) and the
encoding used must be handled completely orthogonally (i.e., use
different command line options), because both the text and the html
output formats can be produced in several different encodings. You can
keep "-Tlatin1" as a backwards-compatible hack for
"-Ttext -EISO-8859-1", etc. of course.
- The output of nroff should normally be determined by the locale,
and man should simply pass that on transparently to the terminal.
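
To make the orthogonality concrete, here is a rough Python sketch of how an output format and an output encoding could be resolved separately. The function name `resolve_output`, the `LEGACY_DEVICES` table and the rule that an explicit -E beats a legacy device name are assumptions of this sketch; "-Ttext" and "-E" are the proposals from this mail, not existing groff options:

```python
import locale

# Hypothetical table: old device names expressed as (format, encoding).
LEGACY_DEVICES = {
    "ascii": ("text", "US-ASCII"),
    "latin1": ("text", "ISO-8859-1"),
    "utf8": ("text", "UTF-8"),
}

def resolve_output(device, encoding_override=None, locale_codeset=None):
    """Return (format, encoding) for a -T value and an optional -E value."""
    # A legacy name such as "latin1" implies both a format and an encoding;
    # any other name (ps, dvi, html, text) selects the format only.
    fmt, implied = LEGACY_DEVICES.get(device, (device, None))
    encoding = encoding_override or implied or locale_codeset
    if encoding is None:
        # Fall back to the codeset of the current locale (POSIX systems).
        encoding = locale.nl_langinfo(locale.CODESET)
    return fmt, encoding

print(resolve_output("latin1", locale_codeset="UTF-8"))
print(resolve_output("html", "ISO-8859-2", locale_codeset="UTF-8"))
```

The point of the design is that the format table and the encoding lookup never need to know about each other, apart from the small compatibility table.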

Suggested TODO:

groff:
G1) The default input and output encoding of groff shall be
nl_langinfo(CODESET)
G2) Add a new command (".IE <mime-charset-name>" for input encoding?)
to the groff input format that can be used to override the default
input encoding in the man page file. The Emacs tagging convention
could also be used in a comment of course.
G3) Add two new command line options:
-e<mime-charset-name> override default input encoding
-E<mime-charset-name> override default output encoding
G4) Add checks that abort with an error message if the encodings
specified in .IE and -e disagree.
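
Items G1 to G4 could fit together roughly as in the Python sketch below. The ".IE" request and the "-e" option are only proposals from this mail, and the function names, the regular expression and the use of Python's codec registry to compare charset names are assumptions made for illustration:

```python
import codecs
import re

# Hypothetical ".IE <mime-charset-name>" request on a line of its own.
# The page is scanned as bytes because its encoding is not yet known.
IE_REQUEST = re.compile(rb"^\.IE[ \t]+([A-Za-z0-9._:+-]+)[ \t]*$", re.MULTILINE)

def canonical(name):
    """Normalise charset aliases, e.g. 'latin1' and 'ISO-8859-1'."""
    return codecs.lookup(name).name

def resolve_input_encoding(page, locale_codeset, cli_encoding=None):
    """Pick the input encoding from .IE, -e, or the locale default."""
    match = IE_REQUEST.search(page)
    tagged = match.group(1).decode("ascii") if match else None
    # G4: refuse to guess when the page tag and the -e option disagree.
    if tagged and cli_encoding and canonical(tagged) != canonical(cli_encoding):
        raise ValueError(
            f".IE says {tagged!r} but -e says {cli_encoding!r}")
    # G2 and G3 override the G1 default taken from the locale.
    return canonical(tagged or cli_encoding or locale_codeset)

page = b".TH FOO 1\n.IE ISO-8859-2\n.SH NAME\n"
print(resolve_input_encoding(page, "UTF-8"))
```

In a real caller the `locale_codeset` argument would come from nl_langinfo(CODESET), as in G1.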
man:
M1) Add a config file mechanism that sets the -e option of groff
depending on the source file's path. This could either be in
/etc/man.config or perhaps even better in each $MANPATH/.encoding
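
For M1, a minimal Python sketch of a per-tree ".encoding" lookup might look like this. The file format (one "<subdirectory> <charset>" pair per line, "." for the tree default, "#" for comments) and the function names are invented here for illustration; the mail does not specify them:

```python
from pathlib import Path

def load_encoding_map(man_root):
    """Read the hypothetical <man_root>/.encoding file into a dict."""
    mapping = {}
    config = Path(man_root) / ".encoding"
    if config.is_file():
        for line in config.read_text(encoding="ascii").splitlines():
            line = line.split("#", 1)[0].strip()   # drop comments and blanks
            if not line:
                continue
            subdir, charset = line.split()
            mapping[Path(subdir)] = charset
    return mapping

def encoding_for_page(page_path, man_root):
    """Return the charset of the deepest matching subdirectory, or None."""
    rel = Path(page_path).relative_to(man_root)
    best = None
    for subdir, charset in load_encoding_map(man_root).items():
        # rel.parents includes Path("."), so a "." entry matches every page.
        if subdir in rel.parents:
            if best is None or len(subdir.parts) > len(best[0].parts):
                best = (subdir, charset)
    return best[1] if best else None
```

Man would then hand the result to groff via the proposed -e option and do no conversion of its own.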

What do you think?

Markus

--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>