Re: Man pages and UTF-8

Adam Borowski Tue, 14 Aug 2007 15:51:08 -0700

(Colin, CC-ing you as I'm not sure if you're aware of this long thread,
and both man-db and groff are your territory...)

On Tue, Aug 14, 2007 at 05:25:27PM +0200, Nicolas François wrote: 
> I proposed Colin to work on it during Debconf, but still had no time to do
> it.

Could you tell us if anything came of it?

> Interested peoples should read #196762

An interesting read.

> I tested a CVS snapshot of groff

On the other hand, I investigated what the headgear guys produced.  I just
compiled the package on Debian instead of using a real Red Hat system, so
due to misconfiguration things may be better than I'm reporting here.

UTF-8 input:
  works perfectly.

dev utf8
  works almost perfectly; we can change FFFF to 10FFFF to get full Unicode
  coverage and mark u20000..u2FFFF as doublewidth to get rare Kanji
  formatted right.  Neither of these is supported by euc_JP or any other
  legacy character set anyway.

dev html
  works well for Latin1, Japanese and basic Chinese -- supports only
  characters it knows.  Trivially fixable by outputting either UTF-8, or,
  trading massive space waste for content-type independence, numbered HTML
  entities.  Doesn't work for Latin2, Cyrillic, Greek, Arabic, etc without
  fixing.

dev ps
  seems to be broken.  My misconfiguration or real brokenness?

dev dvi
  ditto

dev latin1, dev nippon, etc
  supposed to be replaced by dev utf8.

So with tty and HTML support working, it looks like input is just fine.
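
To illustrate the numbered-entity fallback mentioned for dev html, here is a
minimal Python sketch (not groff code; the function name is made up for this
example).  It relies only on the standard "xmlcharrefreplace" codec error
handler, and it does not escape "&", "<" or ">", which real HTML output
would also need.

```python
def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal HTML reference."""
    # "xmlcharrefreplace" is a standard Python codec error handler.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    sample = "абвг"
    as_entities = to_numeric_entities(sample)
    print(as_entities)
    # Size comparison: entity form versus plain UTF-8 bytes.
    print(len(as_entities), len(sample.encode("utf-8")))
```

The size comparison is the "massive space waste" trade-off: each Cyrillic
letter takes two bytes in UTF-8 but seven as a numbered entity.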

> [CVS groff]

> There is at least one remaining issue, which is that it does not recognize
> family of glyphs. Thus all glyphs are considered of the same size (we
> won't provide a font description file with the list of every UTF
> characters), and thus the output of groff is ugly.

Any such description file would work only as long as you hard-code any
fonts, and somehow provide them for any potential reader.  Without this,
wcwidth() is as good as you can get for fixed-width fonts.  For comparison,
Red Hat makes a wild assumption that everything u0800..uFFFF is doublewide.
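
A rough sketch of the difference, in Python rather than C: approx_width()
below is only a stand-in for wcwidth(), built on the unicodedata module, and
range_rule_width() encodes the blanket u0800..uFFFF rule as I described it.
Both function names are invented for this example, and neither is taken from
the groff or Red Hat sources.

```python
import unicodedata


def approx_width(ch: str) -> int:
    """Approximate wcwidth(): 0 for combining marks, 2 for wide, else 1."""
    if unicodedata.combining(ch):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def range_rule_width(ch: str) -> int:
    """Blanket rule: everything in U+0800..U+FFFF is double width."""
    return 2 if 0x0800 <= ord(ch) <= 0xFFFF else 1


if __name__ == "__main__":
    # Latin, CJK, box drawing, Devanagari, and a character beyond the BMP.
    for ch in ["a", "亘", "─", "अ", chr(0x20000)]:
        print(f"U+{ord(ch):04X}", approx_width(ch), range_rule_width(ch))
```

The two disagree on line-drawing characters and Indic letters, which the
range rule widens, and on u20000 and above, which it leaves narrow.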

> (Except for this issue, I could display nicely French, English, Japanese
> and Vietnamese UTF-8 manpages)

Cool, and what about Cyrillic, Arabic, Indic, etc.?

> I will port the part of ENABLE_MULTIBYTE which permits to specify ranges
> of characters, and see if it looks OK.

Do you want to assume that all characters within a family have the same
width? Then it's better to give wcwidth() a try.

> The CVS version introduced a -K option to specify the encoding
> of the input file to groff. This should help to plan a transition for UTF-8
> manpages by using this option in man-db.

Wouldn't it be easier and less prone to breakages if we simply hard-coded
the encoding as UTF-8 and did all the processing in man-db?  A versioned
dependency or conflict would be enough, and the code would be much simpler.

> Slowly moving files from man/ to man.UTF-8/ while still supporting the
> legacy encoding in man/ would be a simple transition plan.

I'm afraid that's not an option.  So far I found 807 UTF-8 man pages, and
only 21 of them were marked as such.  But fear not, it appears I've got a
solution working, just let me download the rest of the archive to check it.
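
One plausible way to spot unmarked UTF-8 pages is to check whether the bytes
decode cleanly.  The Python sketch below shows that general heuristic only;
it is an assumption for illustration, not necessarily the solution referred
to above, and looks_like_utf8() is a hypothetical name.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if data has non-ASCII bytes and decodes cleanly as UTF-8."""
    if data.isascii():
        # Pure ASCII: the encoding question does not arise.
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(looks_like_utf8("Gdańsk".encode("utf-8")))
    print(looks_like_utf8("Gdańsk".encode("iso-8859-2")))
```

Legacy 8-bit text rarely forms valid UTF-8 sequences by accident, but it can
happen, so a check like this is a guess rather than a proof.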

> Note: the only real issue with lack of UTF-8 support for manpages in
> Debian is that it is not possible to provide manpages translated in
> languages whose only valid encoding is UTF (e.g. Vietnamese).
> Otherwise our man-db/groff combination works really nicely and permits to
> display manpages with very little annoyances (i.e. I don't consider having
> to drop my cedilla in manpages to be a real issue).

I do consider it to be one -- if you go carelessly in one place, you'll
likely screw the characters where it really matters.  Having "ń" in the name
of my town, I'm tired of having my billing address rejected because the bank
expects "ń" but a form won't allow it; and while I can have stuff delivered
to "Starogard Gda&#0324;ski", I pity the poor Russian schmuck who tries to
mail-order something.

Another real issue is the inability to talk about any non-Latin1 character
in manpages.  What about mans dealing with i18n stuff?

But that is still not the biggest issue here.

Due to Red Hat and probably other dists using UTF-8 already, plenty of man
pages are in UTF-8 when our groff still can't parse them.  Having gone
through 2/3 of the archive, I got 807 such pages so far.  And every single
one displays lovely "Ã¤" or similar instead.  That's 9% of all mans with
non-ASCII characters in the corpus.
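
The garbage is easy to reproduce: it is what you get when UTF-8 bytes are
read as Latin1.  A small Python illustration (the function name is made up
for this example):

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode the bytes as Latin1."""
    return text.encode("utf-8").decode("latin-1")


if __name__ == "__main__":
    garbled = misread_as_latin1("ä")
    print(garbled)
    # The damage is reversible as long as no bytes were dropped on the way.
    print(garbled.encode("latin-1").decode("utf-8"))
```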

> UTF-8 is supported on output, so it is really transparent for users.

If you consider having all unsupported characters silently dropped as being
transparent.  

I'll try to do what I can, but with my knowledge of groff being slim, I
doubt I can even touch stuff like ps or dvi output.

I attached my test manpage.  It still lacks test cases for combining chars
and Indic scripts.

Cheers and schtuff,
-- 
1KB             // Microsoft corollary to Hanlon's razor:
                //      Never attribute to stupidity what can be
                //      adequately explained by malice.

.TH UTF8test 7 "2007-08-14"
.SH ASCII
abc def ghi 123
.SH LATIN1
ÃÄÅ äåæç ÿúû ø
.SH LATIN2
Ąą Ćć Ĉĉ
.SH CYRILLIC
абвг АБВГ
.SH GREEK
αβγδ ΑΩ
.SH ARABIC
شفغى
.SH JAPANESE
づあエふぜ
.SH CHINESE
亘丸仟仱井
.SH "NON-BMP STUFF (code2001 cuneiforms)"
󰎯󰎠󰎻󰎺󰎾󰎢󰏁󰏄󰎥󰏁󰎠󰎺󰎭󰎡󰎺
.SH "LINE DRAWING"
╟─╢░▒▓
