[I forward here some important comments made by Mike Roe.]

----------------------------------------------------------------
One of the biggest problems with UTF-8 locales under Linux
is that the man(1) pages don't display properly. Many man
pages contain some iso-latin-1 characters with the top bit set
e.g. copyright symbols. Then there are the man pages for
the ru_RU locale, which are encoded in KOI8-R.

The obvious thing to do is to convert the groff source for the
man pages into UTF-8. But:

(a) groff can't cope with UTF-8 encoded input
(b) groff underlines text by using backspace and underscore as
    if it were printing to a teletype. (Yuck!) "less" decodes
    these backspaces and underscores and uses them to select
    a different colour. I'm not convinced that the current
    implementation gets this right in a UTF-8 locale!

I wonder if anyone has a workaround for this? I've seen a few comments
made about Unicode and groff, but nothing definite.
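
A minimal sketch of the decoding step described in (b), written in
Python for illustration only. It assumes the usual teletype
conventions ("_ BS X" for underlined X, "X BS X" for bold X) and
that the input encoding is known. The function name and its
parameters are made up for this example and are not taken from
less or groff. The point it tries to show is that the bytes should
be decoded into characters before the backspaces are paired up. A
decoder that pairs single bytes would, for a Cyrillic letter such as
U+0444 (bytes D1 84 in UTF-8), attach the underline to the first
byte only. Whether that is what actually goes wrong in less is not
established here.

```python
# Terminal escape sequences (ESC [ ... m) used to render the attributes.
SGR_BOLD = "\x1b[1m"
SGR_UNDERLINE = "\x1b[4m"
SGR_RESET = "\x1b[0m"


def overstrike_to_sgr(data: bytes, encoding: str = "utf-8") -> str:
    """Replace teletype-style overstrikes with escape sequences.

    Hypothetical helper: decodes first, so that a backspace always
    pairs two whole characters rather than two bytes.
    """
    text = data.decode(encoding)
    out = []
    i = 0
    while i < len(text):
        # Look for the three-character pattern: first, BS, second.
        if i + 2 < len(text) and text[i + 1] == "\b":
            first, second = text[i], text[i + 2]
            if first == "_":
                # "_ BS X" means underlined X.
                out.append(SGR_UNDERLINE + second + SGR_RESET)
                i += 3
                continue
            if first == second:
                # "X BS X" means bold X.
                out.append(SGR_BOLD + second + SGR_RESET)
                i += 3
                continue
        out.append(text[i])
        i += 1
    return "".join(out)
```

The same function handles KOI8-R input if called with
encoding="koi8_r", since the pairing is done on decoded characters.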

Fun things to try:

1. xterm -u8
   export LANG=ru_RU.utf8
   man cp | iconv -f koi8-r -t utf8 | more

    (With RedHat 7.0 + upgraded libc, observe the interesting effect
     caused by backspace-underline of cyrillic characters)

2. export LANG=ga_IE
   cal 3 2001
     (Segmentation faults)
     I don't suppose that there are that many Gaelic-speaking Linux users,
     but there are problems in other locales too. I don't think that this
     problem is Unicode-related.

Mike
----------------------------------------------------------------

Some remarks on that from me:

a) I think I mentioned problems with the bold/underline-via-BS code
   in less in UTF-8 mode to Robert a long time ago, but I never
   followed up on whether his patch found its way into the distributions.

b) It would be nice to move both groff and less (at least as an
   additional option) from the backspace hack to the ISO 6429 SGR
   sequence (ESC [ ... m). Advantages:

     - SGR provides far more functionality (colours, inverse, italics)

     - SGR works on any terminal emulator, even without more/less
       intervening, whereas the BS trick requires an overstriking
       line printer or a pager such as less

c) There should indeed be a designated encoding for the man pages,
   and I think it is better to convert all man pages to ASCII/UTF-8
   than to hope that the man pages on the disk (a resource shared by
   all users) happen to be in the encoding of the user's current
   locale.

d) As a quick measure, at least the maintainers and distributors of
   English man page packages should check regularly that there are
   no non-ASCII characters in them. UTF-8 can be allowed later, as
   soon as man/groff can handle UTF-8 man pages. At that stage, the
   non-English man pages should all be converted to UTF-8
   (just like the gettext message files).
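
The check in (d) and the conversion in (c) can be sketched roughly
as follows, in Python for illustration only. Both function names are
hypothetical. The conversion assumes that the source encoding of
each page is already known (for example ISO 8859-1 for a stray
copyright sign, KOI8-R for the ru_RU pages), because it cannot be
guessed reliably from the bytes alone.

```python
def non_ascii_lines(data: bytes):
    """Return (line_number, raw_line) pairs for every line of a man
    page source that contains a byte with the top bit set."""
    hits = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        if any(byte >= 0x80 for byte in line):
            hits.append((number, line))
    return hits


def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode a man page source from a known legacy encoding
    into UTF-8. Raises UnicodeDecodeError if the bytes are not valid
    in the stated source encoding."""
    return data.decode(source_encoding).encode("utf-8")
```

A pure-ASCII page yields an empty list from non_ascii_lines() and is
left unchanged by to_utf8(), since ASCII is a subset of UTF-8.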

Markus


-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
