Re: [gentoo-dev] UTF-8 encoding and file format of manuals
On Thursday 01 June 2006 14:30, Josh Saddler wrote:
> Jan Kundrát wrote:
> > Wiktor Wandachowicz wrote:
> >>Summing up:
> >>* UTF-8 manuals: good or bad?
> >
> > The Only Way To Go (tm), IMHO. Let's let the legacy encodings die in
> > peace.
>
> Agreed. I'd like to see much more extensive use of Unicode throughout my
> system by default. Unicode man pages are a good idea.

wrong

this is why we have USE=unicode
-mike
Re: [gentoo-dev] UTF-8 encoding and file format of manuals
Paul de Vrieze wrote:
> Would it be possible to do automatic detection and unicode conversion in the
> portage install stage? I think that would probably be the best option. At a
> later stage a simple detection and warning might be sufficient.

Tricky. You can parse a file and check whether it's valid UTF-8, but the
problem is that you can't be sure it isn't (e.g.) just an ISO8859-2 encoded
file that happens to contain an interesting sequence of characters.

That said, there are some tools that try to perform some magic (statistical
data analysis etc.) and guess the correct encoding. [1]

[1] app-i18n/enca, http://trific.ath.cx/software/enca/

HTH,
-jkt

--
cd /local/pub && more beer > /dev/mouth
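The ambiguity described in the reply above can be shown in a few lines. This is a minimal Python sketch, not something from the thread: the helper name is_valid_utf8 and the sample strings are invented for illustration. It shows that a UTF-8 validity check rejects most Latin-2 text, but that some perfectly legal Latin-2 byte sequences also pass as UTF-8.

```python
# Illustrative sketch only: helper name and sample strings are assumptions.

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A Polish word encoded as ISO-8859-2 fails the check, as one would hope.
polish_latin2 = "żółw".encode("iso8859_2")
print(is_valid_utf8(polish_latin2))

# But the Latin-2 text "Ăł" is the byte pair C3 B3, which is also the
# UTF-8 encoding of "ó". The check passes although the file is Latin-2.
tricky_latin2 = "Ăł".encode("iso8859_2")
print(is_valid_utf8(tricky_latin2))
print(tricky_latin2.decode("utf-8"))
```

A validity check is therefore strong evidence for real-world text, but not proof, which is why statistical guessers such as enca exist.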
Re: [gentoo-dev] UTF-8 encoding and file format of manuals
On Thu, 1 Jun 2006 21:08:54 +0200
Paul de Vrieze <[EMAIL PROTECTED]> wrote:

> On Thursday 01 June 2006 20:19, Jan Kundrát wrote:
> > Wiktor Wandachowicz wrote:
> > > Summing up:
> > > * UTF-8 manuals: good or bad?
> >
> > The Only Way To Go (tm), IMHO. Let's let the legacy encodings die
> > in peace.
>
> Would it be possible to do automatic detection and unicode conversion
> in the portage install stage? I think that would probably be the best
> option. At a later stage a simple detection and warning might be
> sufficient.
>

I'd imagine that glep31check could be easily adapted to do this.

--
Tom Martin, http://dev.gentoo.org/~slarti
AMD64, net-mail, shell-tools, vim, recruiters
Gentoo Linux
--
gentoo-dev@gentoo.org mailing list
Re: [gentoo-dev] UTF-8 encoding and file format of manuals
On Thu, Jun 01, 2006 at 02:41:27PM +, Wiktor Wandachowicz wrote:
> Respectful Gentoo developers,
>
> I would like to ask what you think about UTF-8 encoded manual pages.
> I mean the files like ls.1.gz, which are used by the honorable "man" program.
> Recently I attacked the problem a little, and before submitting any
> patches/proposals to Gentoo bugzilla I'd like to know your opinions first.
>
> Disclaimer: for daily use I have LANG="pl_PL.UTF-8" and LC_ALL="pl_PL.UTF-8",
> but the original issue is of a more universal nature.
>
> Back on subject. ISO-8859-* 8-bit encodings are fine and most localized
> manuals use them. However, there are some examples where UTF-8 manuals are
> installed as well. Namely, the newest portage uses "linguas_pl" in this way:
>
> $ emerge -pv portage
> [ebuild R ] sys-apps/portage-2.1_rc3-r3 USE="-build -doc" LINGUAS="pl"
>
> In effect, translated manual pages are added to the system. The problem
> is that they use UTF-8 encoding. Having both man-pages-pl and this version
> of portage installed gives unexpected results. This way "man ls" prints all
> the letters with correct encoding, but "man emerge" does not. On the other
> hand, if "man" is configured to display UTF-8 encoded manuals correctly,
> all the other manuals print funny characters instead of the desired output.
>
> I wrote a simple script [1] which checks all installed Polish manuals by
> using the "file" program. For the "pl" locale it currently produces about
> 70kB of text, and for the default locale about 458kB. After grepping for all
> occurrences of "UTF" I've found out that only the newest portage's manuals
> are in UTF-8 ("pl"), plus: flow.1, gnome-keyring-manager.1, ImageMagick.1,
> Encode::Unicode::UTF7.3pm (but I think they are false positives, anyway).
>
> While it's easy to contact the Polish translators of portage's manuals so
> they could correct them, the problem will have to be solved sooner or later.
> UTF-8 encoded manuals will probably occur with higher frequency, and some
> general resolution should be made.
>
> After some discussion on the Polish forum [2] I've learnt about groff
> deficiencies with UTF-8 handling. However, a wrapper exists [3] that helps
> somewhat in that matter. But it also requires that all manuals be unified
> wrt. encoding: *all* ISO-8859-* or *all* UTF-8, no compromise.
> So I don't know what course to take.
>
> Summing up:
> * UTF-8 manuals: good or bad?

Bad if they're the only option. It means manpages will no longer be
available for non-UTF-8 users. Also, forcing everything in
/usr/share/man/pl to be UTF-8 will require users to emerge -e world.

> * how to handle mixed encodings of manuals?

The same way it's done now: install latin2 pl manpages in
/usr/share/man/pl and utf8 pl manpages in /usr/share/man/pl.UTF-8

If anything installs utf8 manpages in /usr/share/man/pl, fix the ebuild.

> * should man and/or groff handle UTF-8 better?

Yes, but it's not required to get this problem sorted out.

> * should an eclass function be created to aid in correcting the encoding
> of manual pages while installing them?

Maybe, but it's not required to get this problem sorted out.
Re: [gentoo-dev] UTF-8 encoding and file format of manuals
Paul de Vrieze wrote:
> On Thursday 01 June 2006 20:19, Jan Kundrát wrote:
>> Wiktor Wandachowicz wrote:
>>> Summing up:
>>> * UTF-8 manuals: good or bad?
>> The Only Way To Go (tm), IMHO. Let's let the legacy encodings die in peace.
>
> Would it be possible to do automatic detection and unicode conversion in the
> portage install stage? I think that would probably be the best option. At a
> later stage a simple detection and warning might be sufficient.
>
> Paul
>

I'd agree. Forcing UTF-8/unicode on those of us who don't want the extra
bloat is a "bad thing".
Re: [gentoo-dev] UTF-8 encoding and file format of manuals
On Thursday 01 June 2006 20:19, Jan Kundrát wrote:
> Wiktor Wandachowicz wrote:
> > Summing up:
> > * UTF-8 manuals: good or bad?
>
> The Only Way To Go (tm), IMHO. Let's let the legacy encodings die in peace.

Would it be possible to do automatic detection and unicode conversion in the
portage install stage? I think that would probably be the best option. At a
later stage a simple detection and warning might be sufficient.

Paul

--
Paul de Vrieze
Gentoo Developer
Mail: [EMAIL PROTECTED]
Homepage: http://www.devrieze.net
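One way to picture the "detect and convert at install time" idea is the Python sketch below. It is only an illustration under stated assumptions, not portage code: the function name to_utf8 is hypothetical, and the legacy encoding is passed in by the caller (for example ISO-8859-2 for Polish pages) rather than detected, because a byte-level check alone cannot tell which 8-bit encoding a file uses.

```python
# Hypothetical helper for illustration; not part of portage or any eclass.
from typing import Tuple


def to_utf8(data: bytes, legacy_encoding: str = "iso8859_2") -> Tuple[bytes, bool]:
    """Return (utf8_bytes, converted).

    Data that already decodes as UTF-8 (including plain ASCII) is returned
    unchanged. Anything else is assumed to be in legacy_encoding and is
    re-encoded as UTF-8.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        text = data.decode(legacy_encoding)
        return text.encode("utf-8"), True
    return data, False


page = "Pomoc: żółw\n".encode("iso8859_2")
converted_page, was_converted = to_utf8(page)
print(was_converted, converted_page.decode("utf-8"))
```

The later "detection and warning" stage would use the same check but report the file instead of rewriting it.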
Re: [gentoo-dev] UTF-8 encoding and file format of manuals
Jan Kundrát wrote:
> Wiktor Wandachowicz wrote:
>
>>Summing up:
>>* UTF-8 manuals: good or bad?
>
>
> The Only Way To Go (tm), IMHO. Let's let the legacy encodings die in peace.

Agreed. I'd like to see much more extensive use of Unicode throughout my
system by default. Unicode man pages are a good idea.
Re: [gentoo-dev] UTF-8 encoding and file format of manuals
Wiktor Wandachowicz wrote:
> Summing up:
> * UTF-8 manuals: good or bad?

The Only Way To Go (tm), IMHO. Let's let the legacy encodings die in peace.

> Any constructive comments are more than welcome!

The very same problem exists with man-pages-cs (which are outdated as a
bonus).

Blésmrt,
-jkt

--
cd /local/pub && more beer > /dev/mouth
[gentoo-dev] UTF-8 encoding and file format of manuals
Respectful Gentoo developers,

I would like to ask what you think about UTF-8 encoded manual pages.
I mean the files like ls.1.gz, which are used by the honorable "man" program.
Recently I attacked the problem a little, and before submitting any
patches/proposals to Gentoo bugzilla I'd like to know your opinions first.

Disclaimer: for daily use I have LANG="pl_PL.UTF-8" and LC_ALL="pl_PL.UTF-8",
but the original issue is of a more universal nature.

Back on subject. ISO-8859-* 8-bit encodings are fine and most localized
manuals use them. However, there are some examples where UTF-8 manuals are
installed as well. Namely, the newest portage uses "linguas_pl" in this way:

$ emerge -pv portage
[ebuild R ] sys-apps/portage-2.1_rc3-r3 USE="-build -doc" LINGUAS="pl"

In effect, translated manual pages are added to the system. The problem
is that they use UTF-8 encoding. Having both man-pages-pl and this version
of portage installed gives unexpected results. This way "man ls" prints all
the letters with correct encoding, but "man emerge" does not. On the other
hand, if "man" is configured to display UTF-8 encoded manuals correctly,
all the other manuals print funny characters instead of the desired output.

I wrote a simple script [1] which checks all installed Polish manuals by
using the "file" program. For the "pl" locale it currently produces about
70kB of text, and for the default locale about 458kB. After grepping for all
occurrences of "UTF" I've found out that only the newest portage's manuals
are in UTF-8 ("pl"), plus: flow.1, gnome-keyring-manager.1, ImageMagick.1,
Encode::Unicode::UTF7.3pm (but I think they are false positives, anyway).

While it's easy to contact the Polish translators of portage's manuals so
they could correct them, the problem will have to be solved sooner or later.
UTF-8 encoded manuals will probably occur with higher frequency, and some
general resolution should be made.

After some discussion on the Polish forum [2] I've learnt about groff
deficiencies with UTF-8 handling. However, a wrapper exists [3] that helps
somewhat in that matter. But it also requires that all manuals be unified
wrt. encoding: *all* ISO-8859-* or *all* UTF-8, no compromise.
So I don't know what course to take.

Summing up:
* UTF-8 manuals: good or bad?
* how to handle mixed encodings of manuals?
* should man and/or groff handle UTF-8 better?
* should an eclass function be created to aid in correcting the encoding
  of manual pages while installing them?

Any constructive comments are more than welcome!

Best regards,
Wiktor Wandachowicz (SirYes)

[1] http://ics.p.lodz.pl/~wiktorw/gentoo/checkman
[2] http://forums.gentoo.org/viewtopic-p-3352287.html
[3] http://hoth.amu.edu.pl/~d_szeluga/groff-utf8.tar.bz2
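For readers who want to repeat the kind of survey described in the message above, here is a rough Python sketch. It is not the checkman script from [1], whose contents are not shown in the thread. It inspects the bytes of each page directly instead of grepping the output of "file", and the names classify, read_page and scan, as well as the three labels, are assumptions made for this example.

```python
# Illustrative survey of man page encodings; names and labels are assumptions.
import gzip
from pathlib import Path
from typing import Dict


def classify(data: bytes) -> str:
    """Label page content as 'ascii', 'utf-8' or 'legacy-8bit'."""
    if data.isascii():
        return "ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "legacy-8bit"
    return "utf-8"


def read_page(path: Path) -> bytes:
    """Read a man page, transparently decompressing *.gz files."""
    if path.suffix == ".gz":
        with gzip.open(path, "rb") as fh:
            return fh.read()
    return path.read_bytes()


def scan(man_dir: Path) -> Dict[str, str]:
    """Map every regular file below man_dir to its encoding label."""
    results = {}
    for path in sorted(man_dir.rglob("*")):
        if path.is_file():
            results[str(path)] = classify(read_page(path))
    return results
```

Calling scan(Path("/usr/share/man/pl")) on a system with Polish pages installed would list which files are UTF-8 and which are in an 8-bit encoding. Pages compressed with anything other than gzip are read as raw bytes in this sketch and would need extra handling.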