(Colin, CC-ing you as I'm not sure if you're aware of this long thread,
and both man-db and groff are your territory...)

On Tue, Aug 14, 2007 at 05:25:27PM +0200, Nicolas François wrote:
> I proposed Colin to work on it during Debconf, but still had no time to do
> it.

Could you tell us if anything was born?

> Interested peoples should read #196762

An interesting read.

> I tested a CVS snapshot of groff

On the other hand, I investigated what the headgear guys produced.  I just
compiled the package on Debian instead of using a real Red Hat system, so
due to misconfiguration things may be better than I'm reporting here.

UTF-8 input: works perfectly.

dev utf8 works almost perfectly; we can change FFFF to 10FFFF to get full
Unicode coverage and mark u20000..u2FFFF as doublewidth to get rare Kanji
formatted right.  Neither of these is supported by euc_JP or any other
legacy character set anyway.

dev html works well for Latin1, Japanese and basic Chinese -- it supports
only the characters it knows.  Trivially fixable by outputting either
UTF-8, or, trading massive space waste for content-type independence,
numbered HTML entities.  Doesn't work for Latin2, Cyrillic, Greek, Arabic,
etc. without fixing.

dev ps seems to be broken.  My misconfiguration or real brokenness?

dev dvi: ditto.

dev latin1, dev nippon, etc. are supposed to be replaced by dev utf8.

So with tty and HTML support working, it looks like input is just fine.

> [CVS groff]
> There is at least one remaining issue, which is that it does not recognize
> family of glyphs. Thus all glyphs are considered of the same size (we
> won't provide a font description file with the list of every UTF
> characters), and thus the output of groff is ugly.

Any such description file would work only as long as you hard-code the
fonts, and somehow provide them to every potential reader.  Without this,
wcwidth() is as good as you can get for fixed-width fonts.

For comparison, Red Hat makes a wild assumption that everything in
u0800..uFFFF is doublewide.
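
To make the width comparison above concrete, here is a rough sketch in Python.  It is not groff or Red Hat code: `cell_width` only approximates what wcwidth() would report, using the Unicode East Asian Width property from the standard `unicodedata` module, and `redhat_width` is my reading of the blanket u0800..uFFFF rule described above.  All function names are made up for illustration.

```python
import unicodedata


def cell_width(ch: str) -> int:
    """Approximate wcwidth() for one character (an assumption, not glibc's table).

    Non-spacing and enclosing marks take 0 cells, East Asian Wide and
    Fullwidth characters take 2, everything else takes 1.
    """
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def redhat_width(ch: str) -> int:
    """The blanket rule described above: U+0800..U+FFFF is double width."""
    return 2 if 0x0800 <= ord(ch) <= 0xFFFF else 1


def text_width(text: str, width_of=cell_width) -> int:
    """Total number of terminal cells for a string under a given width rule."""
    return sum(width_of(ch) for ch in text)


if __name__ == "__main__":
    for sample in ("abc", "абвг", "亘丸", "╟─╢"):
        print(sample, text_width(sample), text_width(sample, redhat_width))
```

The two rules agree on ASCII, Cyrillic and common Kanji, but the blanket rule doubles the line-drawing characters (which sit in u2500..u257F) and leaves anything in u20000..u2FFFF at single width.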

> (Except for this issue, I could display nicely French, English, Japanese
> and Vietnamese UTF-8 manpages)

Cool, and what about Cyrillic, Arabic, Indic, etc?

> I will port the part of ENABLE_MULTIBYTE which permits to specify ranges
> of characters, and see if it looks OK.

Do you want to assume that all characters within a family have the same
width?  Then it's better to give wcwidth() a try.

> The CVS version introduced a -K option to specify the encoding
> of the input file to groff. This should help to plan a transition for UTF-8
> manpages by using this option in man-db.

Wouldn't it be easier and less prone to breakage if we simply hard-coded
the encoding as UTF-8 and did all the processing in man-db?  A versioned
dependency or conflict would be enough, and the code would be much simpler.

> Slowly moving files from man/ to man.UTF-8/ while still supporting the
> legacy encoding in man/ would be a simple transition plan.

I'm afraid that's not an option.  So far I have found 807 UTF-8 man pages,
and only 21 of them were marked as such.  But fear not, it appears I've got
a solution working, just let me download the rest of the archive to check
it.

> Note: the only real issue with lack of UTF-8 support for manpages in
> Debian is that it is not possible to provide manpages translated in
> languages whose only valid encoding is UTF (e.g. Vietnamese).
> Otherwise our man-db/groff combination works really nicely and permits to
> display manpages with very little annoyances (i.e. I don't consider having
> to drop my cedilla in manpages to be a real issue).

I do consider it to be one -- if you are careless in one place, you'll
likely screw up the characters where it really matters.  Having "ń" in the
name of my town, I'm tired of having my billing address rejected because
the bank expects "ń" but a form won't allow it; and while I can have stuff
delivered to "Starogard Gdański", I pity the poor Russian schmuck who tries
to mail-order something.
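
A side note on the gap between pages that are UTF-8 and pages marked as UTF-8: the encoding can be guessed from the bytes alone, since text in a legacy 8-bit encoding is almost never valid UTF-8 by accident.  The sketch below is only an illustration of that idea in Python, not the solution mentioned above; `classify_page` is a hypothetical name, and gzipped pages would need to be decompressed first.

```python
def classify_page(data: bytes) -> str:
    """Guess the encoding class of a man page from its raw bytes.

    Returns "ascii" for pure 7-bit content, "utf-8" if the non-ASCII
    bytes form valid UTF-8, and "legacy" for everything else
    (Latin1, Latin2, EUC-JP, ...; telling those apart needs other hints).
    """
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "legacy"


if __name__ == "__main__":
    print(classify_page(b".TH UTF8test 7"))
    print(classify_page("Starogard Gdański".encode("utf-8")))
    print(classify_page("Gdański".encode("iso-8859-2")))
```

A check like this looks at the page itself, so it does not depend on whether the page sits in man/ or man.UTF-8/.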

Another real issue is the inability to talk about any non-Latin1 character
in manpages.  What about man pages dealing with i18n stuff?

But that is still not the biggest issue here.  Due to Red Hat and probably
other dists using UTF-8 already, plenty of man pages are in UTF-8 while our
groff still can't parse them.  Having gone through 2/3 of the archive, I
got 807 such pages so far.  And every single one displays a lovely "Ã¤" or
similar instead.  That's 9% of all man pages with non-ASCII characters in
the corpus.

> UTF-8 is supported on output, so it is really transparent for users.

If you consider having all unsupported characters silently dropped as being
transparent.

I'll try to do what I can, but with my knowledge of groff being slim, I
doubt I can even touch stuff like ps or dvi output.

I attached my test manpage.  It lacks test cases for combining chars and
Indic scripts yet.

Cheers and schtuff,
--
1KB		// Microsoft corollary to Hanlon's razor:
		//	Never attribute to stupidity what can be
		//	adequately explained by malice.

.TH UTF8test 7 "2007-08-14"
.SH ASCII
abc def ghi 123
.SH LATIN1
ÃÄÅ äåæç ÿúû ø
.SH LATIN2
Ąą Ćć Ĉĉ
.SH CYRILLIC
абвг АБВГ
.SH GREEK
αβγδ ΑΩ
.SH ARABIC
شفغى
.SH JAPANESE
づあエふぜ
.SH CHINESE
亘丸仟仱井
.SH "NON-BMP STUFF (code2001 cuneiforms)"
.SH "LINE DRAWING"
╟─╢░▒▓