-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Adam Borowski wrote:
[...]
> Due to Red Hat and probably other dists using UTF-8 already, plenty of man
> pages are in UTF-8 when our groff still can't parse them. Having gone
> through 2/3 of the archive, I got 807 such pages so far. And every single
> one displays lovely "ä" or similar instead. That's 9% of all mans with
> non-ASCII characters in the corpus.

By that you mean they're encoded as UTF-8 where man-db expects them in
whatever encoding is in its hard-coded table, correct? How are you detecting
them?

[...]
>> UTF-8 is supported on output, so it is really transparent for users.
>
> If you consider having all unsupported characters silently dropped as being
> transparent.

This may not be as bad as all that, actually. Currently man-db will cope fine
with UTF-8 man pages (if it's expecting them) and will output UTF-8. Of
course, it'll lose all characters not in ISO-8859-1, but that's a man-db bug.

This means that, assuming the legacy pages all actually *are* in ISO-8859-1,
we should be able to transcode all such man pages to UTF-8, update man-db's
table so it expects them, and not lose any functionality.

So, without having to wait for the technology, we can do this:

- transcode all man pages currently in ISO-8859-1 into UTF-8
- move all non-ISO-8859-1 man pages into directories with explicit encodings

et voilà (which will soon be able to be reliably spelt voilà), we have now
achieved total UTF-8 dominance. Admittedly, because we're not handling
non-ISO-8859-1 characters, it's mere buzzword compliance, but that is now a
perfectly manageable bug in man-db and groff. It means that by making one
small change to man-db we can start the policy change and the technology
change *in parallel*, which ought to save loads of time.

Also, because man pages are now either in UTF-8 or in a directory with an
explicit encoding in the name, it ought to be easy to change linda and
lintian to check for invalid UTF-8 in the man pages, which should help with
the cat-herding aspects of the problem.

- -- 
┌── dg@cowlark.com ─── http://www.cowlark.com ───────────────────
│
│ "There does not now, nor will there ever, exist a programming language in
│ which it is the least bit hard to write bad programs." --- Flon's Axiom
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGwkG7f9E0noFvlzgRAuefAKDaMn2noIGKL88qav+aaIb+4tEPGwCgi4kk
9wqG7+J19tOflGdaQIs/LqI=
=ZivR
-----END PGP SIGNATURE-----
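
The detection and transcoding steps discussed in the message could be sketched roughly as follows. This is only an illustration, not what man-db, linda or lintian actually do: the function names `classify` and `to_utf8` are hypothetical, and the fallback encoding is assumed to be ISO-8859-1, which the message itself flags as an assumption. The check is a heuristic, since a legacy page whose bytes happen to form valid UTF-8 would be classified as UTF-8.

```python
def classify(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'legacy' for the raw bytes of a man page.

    Hypothetical helper: 'legacy' only means "not valid UTF-8"; it does not
    prove which 8-bit encoding the page really uses.
    """
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "legacy"


def to_utf8(data: bytes, legacy_encoding: str = "iso-8859-1") -> bytes:
    """Transcode a legacy page to UTF-8; leave ASCII and UTF-8 pages alone.

    Assumption: every page that fails UTF-8 validation is in legacy_encoding.
    """
    if classify(data) == "legacy":
        return data.decode(legacy_encoding).encode("utf-8")
    return data
```

A lint-style check would report any page under a UTF-8 directory for which `classify` returns `'legacy'`. Because pages that are already ASCII or UTF-8 pass through unchanged, running `to_utf8` twice on the same page is harmless.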

