Re: Unicode and man/groff/less problems

Markus Kuhn Fri, 02 Mar 2001 01:43:02 -0800
Tomohiro KUBOTA wrote on 2001-03-02 04:34 UTC:
>  - Groff is widely used to format manual pages.  Manual pages are
>    installed into the system and have their own encodings regardless
>    of the users' locale.  Thus, some mechanism for manual pages to
>    specify their own encodings must be added.  The "mechanism" will
>    be common to Mule's way, i.e., -*-coding:foobar;-*- at the first
>    line.  This mechanism will enable manpages in both UTF-8 and
>    "legacy" encodings to co-exist.

There are two options:

a) add a character set tagging mechanism

b) simply agree that man pages should only be in ASCII or UTF-8

I think that b) is both feasible and simpler. Reasons:

  - Non-English man pages usually come as a single big package and the
    documentation says what encoding is used for the entire man package. It
    is nearly trivial for distribution makers to simply send all of that
    through iconv before putting it into their man RPMs.

  - Practically all downloadable applications that users might want to
    download and install themselves, rather than from the distribution,
    are written in English and use only ASCII. I can count the
    counterexamples on the fingers of one hand.

  - Man page maintainers do not need to use a UTF-8 editor. They can
    keep things in their traditional encoding and just add to their
    Makefiles an option to apply iconv at installation time.

  - On Linux distributions, there is usually only a single application
    (groff) reading man pages, and only very few applications
    (man, xman, etc.) call groff to do that.
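
The install-time conversion mentioned above (sending a page through
iconv) could look roughly like the following. This is only a sketch in
Python rather than a Makefile rule; the function name convert_to_utf8
and its parameters are made up for illustration, and the source
encoding is assumed to be known from the package documentation.

```python
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path, source_encoding: str) -> None:
    """Re-encode one man page from source_encoding to UTF-8."""
    # Strict decoding raises UnicodeDecodeError on byte sequences that are
    # invalid in source_encoding. It cannot catch every mislabelled file:
    # any byte string is valid ISO 8859-1, for example.
    text = src.read_bytes().decode(source_encoding)
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_bytes(text.encode("utf-8"))
```

The maintainer's source file stays in its traditional encoding; only
the installed copy is UTF-8.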

All that has to be added is UTF-8 input support for groff and a
compile-time option for applications that call groff, such as man and
xman, to invoke it in a way that makes it always interpret input files
as UTF-8. The groff/man plaintext output should ideally be in the
multi-byte character set of the current locale. Groff options such as
-Tlatin1 should be considered deprecated on platforms like Linux with
proper multi-byte locale support. It would be nice to have groff
support, say:

   -Tplaintext   Plain text (charset according to locale)
   -Tsgrtext     Plain text with added ISO 6429 SGR (ESC [ ... m) emphasis
                 (charset according to locale)
   -Tbstext      Plain text with added backspace emphasis
                 (bold and underline only, charset according to locale)
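
To illustrate "charset according to locale": a minimal sketch, assuming
Python's locale.getpreferredencoding() as a stand-in for what a
formatter would get from nl_langinfo(CODESET). The function name
encode_for_locale is hypothetical, and the "?" replacement is only a
placeholder for real transliteration.

```python
import locale
from typing import Optional


def encode_for_locale(text: str, encoding: Optional[str] = None) -> bytes:
    """Encode already formatted text in the locale's character set."""
    if encoding is None:
        # Ask the C library which charset the current locale uses.
        encoding = locale.getpreferredencoding(False)
    # Characters missing from the target charset become "?" here.
    return text.encode(encoding, errors="replace")
```

The same UTF-8 source then yields Latin-1 bytes in a Latin-1 locale and
UTF-8 bytes in a UTF-8 locale, without a per-encoding device name.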

> Current way to specify encoding as a "device" must be modified,
> because the encoding of the source of manual pages and users'
> environment can be different.  Werner and I also agreed that
> all "latin1", "ascii", "ascii8", "nippon", and "utf8" devices
> will be abolished and will have single "tty" (whatever name can
> be ok) device.

See above. I agree that the device name should not reflect the encoding
(we have locales to set this in a more convenient and standardized way),
but there are still various options for how to add style information to
the plain text and users should be able to choose between these:

  - none
  - ISO 6429 SGR (ESC [ ... m)  (directly understood by all terminal
                                 emulators and many printers)
  - "a\ba" for bold and "_\ba"  (ugly restricted hack, only understood
    for underline                by less, mixed-to-bad results with most
                                 printer drivers, ignored by terminal
                                 emulators, no support for colours/inverse/
                                 italics)
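
The three emphasis styles listed above can be sketched as follows. The
function name emphasize and the mode names are invented for this
example; the SGR parameters 1 (bold), 4 (underline) and 0 (reset) are
the usual ISO 6429 values.

```python
ESC = "\x1b"
SGR_CODES = {"bold": "1", "underline": "4"}


def emphasize(text: str, style: str, mode: str = "none") -> str:
    """Mark up text as 'bold' or 'underline' in the chosen output mode."""
    if style not in SGR_CODES:
        raise ValueError(f"unknown style: {style!r}")
    if mode == "none":
        return text
    if mode == "sgr":
        # ESC [ <n> m switches the attribute on, ESC [ 0 m resets all.
        return f"{ESC}[{SGR_CODES[style]}m{text}{ESC}[0m"
    if mode == "backspace":
        if style == "bold":
            # Overstrike: each character, a backspace, the character again.
            return "".join(f"{c}\b{c}" for c in text)
        # Underline: an underscore, a backspace, then the character.
        return "".join(f"_\b{c}" for c in text)
    raise ValueError(f"unknown mode: {mode!r}")
```

Note that the backspace form has no way to express anything beyond
these two styles, whereas the SGR table could simply grow more entries.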

> Range of valid characters is different from encoding to encoding.
> For example, in ISO-8859-1 environment, U+00a9 can be used for
> copyright mark (for "\(co" in manpage sources) and it will be 
> converted into 0xa9 by the postprocessor.  However, there are 
> encodings which don't have a character which corresponds to U+00a9.
> The postprocessor should convert it into "(C)"?  No.  It will break
> the typesetting.  Thus, some method is needed to let troff know
> the range of valid characters.  "tty-char"-like macro can be used
> for this purpose.

Transliteration is indeed a problem. We do have a (semi-broken)
transliteration mechanism in glibc 2.2 that would in principle be good
for this. I suggested that, if transliteration is performed, glibc's
wcwidth() should return to applications the wcswidth() value of the
transliteration string. This would ensure that wcwidth() fulfills its
purpose of telling applications how far the cursor will advance on the
terminal when a wide character is sent to it, even in the context of
transliteration. But this is not available, and it was my impression
that Ulrich Drepper didn't like the idea for unclear reasons. (There is
the issue that wcwidth() becomes inappropriate for a terminal emulator
deciding whether a character is normal or wide if it also covers the
effects of transliteration. However, it is probably a good idea, and
current practice anyway, to have a separate hardwired wcwidth() in the
terminal emulator, so that should not be an obstacle.)
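
The width rule suggested above can be sketched like this. It is not
glibc's behaviour: the TRANSLIT table is a made-up two-entry example,
and char_width() only approximates wcwidth() using Python's unicodedata
module.

```python
import unicodedata

# Hypothetical transliteration table, for illustration only.
TRANSLIT = {"\u00a9": "(C)", "\u2122": "(TM)"}


def char_width(ch: str) -> int:
    """Rough stand-in for wcwidth(): 0, 1 or 2 terminal columns."""
    if unicodedata.combining(ch):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def string_width(s: str) -> int:
    """Rough stand-in for wcswidth(): sum of the character widths."""
    return sum(char_width(c) for c in s)


def output_width(ch: str, encoding: str) -> int:
    """Columns ch occupies once written in the given output charset."""
    try:
        ch.encode(encoding)
    except UnicodeEncodeError:
        # Not representable: report the width of the transliteration
        # string, falling back to a single "?".
        return string_width(TRANSLIT.get(ch, "?"))
    return char_width(ch)
```

With this rule a formatter that asks for the width of the copyright
sign gets 1 in an ISO 8859-1 locale and 3 in an ASCII locale, so line
filling stays correct in both.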

> For manpage writers: I think non-English pages may use non-ASCII
> characters which native speakers can accept.  However, English manpages
> should be written within ASCII characters, not in ISO-8859-1.  This
> is because English manpages are for all people over the world, while
> non-English ones are for native speakers.

I hope you are not suggesting this as the permanent situation for the
long-term future. English is not appropriately covered by either ASCII
or ISO 8859-1. CP1252 and the PostScript standard encoding are probably
the smallest widely used coded character sets covering the needs of the
English language reasonably well, unless you restrict yourself to a
typewriter style of writing. Fortunately, groff (like TeX) provides
ASCII mnemonics for the non-ASCII characters needed and widely used for
English, but I hope we can eventually avoid these and type all needed
English characters (curly quotation marks and apostrophe, copyright and
trademark sign, en/em dashes, minus and other mathematical symbols,
etc.) directly in UTF-8 source text.
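
As an illustration of the ASCII mnemonics in question, here is a small
sketch that replaces a few two-letter groff glyph escapes with the
characters they stand for. The table covers only a handful of names
from groff_char(7), the function name expand_glyph_escapes is invented,
and the regular expression ignores the \[...] form and escaped
backslashes.

```python
import re

# A few two-letter groff glyph names and their Unicode characters.
GLYPHS = {
    "co": "\u00a9",  # copyright sign
    "tm": "\u2122",  # trademark sign
    "em": "\u2014",  # em dash
    "en": "\u2013",  # en dash
    "lq": "\u201c",  # left double quotation mark
    "rq": "\u201d",  # right double quotation mark
    "oq": "\u2018",  # left single quotation mark
    "cq": "\u2019",  # right single quotation mark / apostrophe
    "mi": "\u2212",  # minus sign
}

# Matches a backslash, an opening parenthesis and a two-character name.
_ESCAPE = re.compile(r"\\\((..)")


def expand_glyph_escapes(line: str) -> str:
    """Replace known \\(xx escapes; leave unknown ones untouched."""
    return _ESCAPE.sub(lambda m: GLYPHS.get(m.group(1), m.group(0)), line)
```

A source line such as "Copyright \(co 2001" would then carry the
copyright sign itself, encoded in UTF-8.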

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/