Re: groff, man and Unicode

Markus Kuhn Mon, 09 Dec 2002 09:49:38 -0800

[EMAIL PROTECTED] wrote on 2002-12-09 16:29 UTC:
> Yesterday I made a new man and played a bit with it.
> I can get man pages in foreign character sets to format
> correctly, but things require a lot of fiddling, where
> long ago things worked fine without problem.
> 
> Long ago symbols were just passed through. These days
> groff knows too much, and that causes problems.
> 
> I think you once asked me to remove the -Tlatin1 from nroff
> but I find that I need to add it to make nroff work well
> with utf-8 or iso 8859-2. The -Tlatin1 acts as a "pass through"
> flag, while without it the utf8 is converted once more to
> utf8 (as if it were latin1 to start with) yielding unreadable
> garbage. It looks like groff assumes by definition that its
> input is latin1, so that -Tlatin1 becomes "no conversion needed".
> 
> Man now throws iconv into the chain.


Hm, this sounds like dangerous hacking to me.

You really need to sync up with Werner here, otherwise we end up with
lots of conversions being added instead of (and that's what we
ultimately want) them being removed.
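
The double conversion described in the quoted message is easy to
reproduce outside of groff. The following is only a minimal sketch in
Python, not groff code: the function name double_encode and the sample
string are made up for illustration. It shows what happens when a stage
that assumes Latin-1 input re-encodes bytes that were already UTF-8.

```python
# Illustration only: a stage that assumes Latin-1 input and emits UTF-8,
# fed with bytes that are already UTF-8.

def double_encode(text):
    """Encode text as UTF-8, misread it as Latin-1, encode as UTF-8 again."""
    utf8_bytes = text.encode("utf-8")
    # Wrong assumption: every input byte is one Latin-1 character.
    misread = utf8_bytes.decode("latin-1")
    # Those characters are then converted to UTF-8 a second time.
    return misread.encode("utf-8")

sample = "café"

print(sample.encode("utf-8"))   # b'caf\xc3\xa9'         (correct UTF-8)
print(double_encode(sample))    # b'caf\xc3\x83\xc2\xa9' (garbage on a UTF-8 terminal)
```

Pure ASCII text survives this unchanged, which is why the problem only
shows up with non-English man pages.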

My personal opinion:

  - There are basically two options for determining the input encoding
    of groff, and they are not mutually exclusive:

    a) Man somehow "knows" (e.g. from a config file that lists the
       character encoding on a per-subdirectory basis) which character
       encoding each man page is in and simply tells groff its input
       encoding via a (to be added) command-line option like "-eUTF-8".

    b) All man pages are tagged with a character encoding name,
       and groff therefore figures out the input encoding itself
       directly from reading the man page.

  - Man should never convert the encoding of man pages, because
    when two programs in a pipeline both recode characters, problems
    get hidden and resurface later in difficult-to-understand ways.

  - groff really should scrap the character encoding variants 
    (ascii, ascii8, latin1, utf8, cp1047, nippon, etc.) from the -T
    option. The -T option should switch between ps, dvi, ..., html and text.
    The new "text" option outputs plaintext (so far called ascii), and the
    locale setting (or if really necessary a new command line option
    "-EISO-8859-1" or so to override the locale) defines the encoding
    of this plaintext output. The output format (ps, text, html) and the
    encoding used must be handled completely orthogonally (i.e., use
    different command line options), because the text and html output
    formats can each be produced in several encodings. You can of course
    keep "-Tlatin1" as a backwards compatible hack for
    "-Ttext -EISO-8859-1", etc.

  - The output encoding of nroff should normally be determined by the
    locale, and man should simply pass that output on transparently
    to the terminal.
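
To make the "orthogonal options" point concrete, here is a rough sketch
in Python of how a -T value and an optional -E value could be resolved
into a (format, encoding) pair. Everything in it is hypothetical: the
names LEGACY_DEVICES, ENCODED_FORMATS and resolve_output do not exist in
groff, the legacy table covers only three of the old device names, and
letting an explicit -E win over a legacy device name is my assumption.

```python
# Hypothetical sketch: none of these names are part of groff.

# Old encoding-flavoured -T values, kept as aliases for "-Ttext -E<charset>".
LEGACY_DEVICES = {
    "ascii": ("text", "US-ASCII"),
    "latin1": ("text", "ISO-8859-1"),
    "utf8": ("text", "UTF-8"),
}

# Output formats for which a character encoding is meaningful.
ENCODED_FORMATS = {"text", "html"}

def resolve_output(device, encoding_override=None, locale_codeset="UTF-8"):
    """Return (output_format, output_encoding) for a -T value and optional -E value."""
    if device in LEGACY_DEVICES:
        output_format, legacy_encoding = LEGACY_DEVICES[device]
        # Assumption: an explicit -E overrides the encoding implied by the alias.
        return output_format, encoding_override or legacy_encoding
    if device not in ENCODED_FORMATS:
        # ps, dvi, ...: no plaintext encoding to choose.
        return device, None
    # text or html: -E if given, otherwise the locale's codeset.
    return device, encoding_override or locale_codeset

print(resolve_output("latin1"))                              # ('text', 'ISO-8859-1')
print(resolve_output("text", locale_codeset="ISO-8859-2"))   # ('text', 'ISO-8859-2')
print(resolve_output("html", encoding_override="UTF-8"))     # ('html', 'UTF-8')
```

The point of the table is that the old names become pure aliases, so
the format and the encoding never have to be encoded in one option again.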

Suggested TODO:

groff:

  G1) The default input and output encoding of groff shall be
      nl_langinfo(CODESET)

  G2) Add a new command (".IE <mime-charset-name>" for input encoding?)
      to the groff input format that can be used to override the default
      input encoding in the man page file. The Emacs tagging convention
      could also be used in a comment of course.

  G3) Add two new command line options:

         -e<mime-charset-name>       override default input encoding
         -E<mime-charset-name>       override default output encoding

  G4) Add checks that abort with an error message if the encodings
      specified in .IE and -e disagree.
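
As a rough illustration of how G1 to G4 fit together, here is a sketch
in Python rather than in groff's own source language. The ".IE" request
and the "-e" option are only proposals from this message, and the names
find_ie_tag, resolve_input_encoding and EncodingConflict are invented
for the example. Charset names are compared through Python's codec
registry, so "utf8" and "UTF-8" count as the same encoding; a real
implementation would need its own alias table.

```python
import codecs
import locale

class EncodingConflict(Exception):
    """Raised when -e and .IE name different encodings (G4)."""

def locale_codeset():
    """G1: the default encoding is nl_langinfo(CODESET) of the user's locale (Unix only)."""
    try:
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        pass  # unsupported locale in the environment: stay in the "C" locale
    return locale.nl_langinfo(locale.CODESET)

def find_ie_tag(roff_source):
    """G2: return the charset named by a hypothetical '.IE <name>' line, or None.

    roff_source is bytes; the tag line itself is assumed to be plain ASCII.
    """
    for line in roff_source.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == b".IE":
            return parts[1].decode("ascii")
    return None

def same_charset(a, b):
    """Compare two charset names, treating registered aliases as equal."""
    return codecs.lookup(a).name == codecs.lookup(b).name

def resolve_input_encoding(option_e=None, tag_ie=None, default="UTF-8"):
    """G3/G4: pick the input encoding, aborting if -e and .IE disagree."""
    if option_e and tag_ie and not same_charset(option_e, tag_ie):
        raise EncodingConflict(f"-e{option_e} conflicts with .IE {tag_ie}")
    return option_e or tag_ie or default

page = b".IE ISO-8859-2\n.TH FOO 1\n"
print(resolve_input_encoding(tag_ie=find_ie_tag(page)))   # ISO-8859-2
```

In real use the default argument would come from locale_codeset(); it
is passed explicitly here so that the example does not depend on the
environment it runs in.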

man:

  M1) Add a config file mechanism that sets the -e option of groff
      depending on the source file's path. This could either be in
      /etc/man.config or perhaps even better in each $MANPATH/.encoding
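
The $MANPATH/.encoding variant of M1 could look roughly like the sketch
below (Python, for illustration only). The file format is an assumption
on my part: a single line holding a MIME charset name. The function
names encoding_for_page and groff_encoding_args are invented, and "-e"
is the proposed groff option from G3, not an existing one.

```python
import tempfile
from pathlib import Path

def encoding_for_page(page, manpath_roots, default=None):
    """Return the charset from the .encoding file of the root containing page."""
    page = Path(page).resolve()
    for root in manpath_roots:
        root = Path(root).resolve()
        if root in page.parents:
            marker = root / ".encoding"
            if marker.is_file():
                # Assumed format: one line with a MIME charset name.
                return marker.read_text(encoding="ascii").strip()
    return default

def groff_encoding_args(page, manpath_roots):
    """Build the (proposed) -e argument for groff, or nothing if unknown."""
    encoding = encoding_for_page(page, manpath_roots)
    return ["-e" + encoding] if encoding else []

# Usage: a throwaway man tree whose root is tagged as ISO-8859-2.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / ".encoding").write_text("ISO-8859-2\n", encoding="ascii")
    page = root / "man1" / "foo.1"
    page.parent.mkdir()
    page.write_text(".TH FOO 1\n", encoding="ascii")
    demo_args = groff_encoding_args(page, [root])

print(demo_args)   # ['-eISO-8859-2']
```

Note that man only reads the tag and passes it on; it never touches
the bytes of the page itself.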

What do you think?

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
