Re: [gentoo-dev] UTF-8 encoding and file format of manuals

2006-06-04 Thread Mike Frysinger
On Thursday 01 June 2006 14:30, Josh Saddler wrote:
> Jan Kundrát wrote:
> > Wiktor Wandachowicz wrote:
> >>Summing up:
> >>* UTF-8 manuals: good or bad?
> >
> > The Only Way To Go (tm), IMHO. Let's let the legacy encodings die in
> > peace.
>
> Agreed. I'd like to see much more extensive use of Unicode throughout my
> system by default. Unicode man pages are a good idea.

wrong

this is why we have USE=unicode
-mike




Re: [gentoo-dev] UTF-8 encoding and file format of manuals

2006-06-02 Thread Jan Kundrát
Paul de Vrieze wrote:
> Would it be possible to do automatic detection and unicode conversion in the 
> portage install stage? I think that would probably be the best option. At a 
> later stage a simple detection and warning might be sufficient.

Tricky. You can parse a file and check whether it's valid UTF-8, but the
problem is that you can't be sure it isn't (e.g.) just an ISO-8859-2
encoded one that happens to contain an interesting sequence of bytes.
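
A minimal Python sketch of the ambiguity described above. The helper name
is_valid_utf8 and the sample strings are illustrative assumptions, not part
of any existing tool.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Typical Polish text in ISO-8859-2 is rejected by the UTF-8 check,
# because its high bytes do not form valid UTF-8 sequences.
latin2_text = "żółć".encode("iso-8859-2")
print(is_valid_utf8(latin2_text))

# These two bytes are valid UTF-8 (one character) and also valid
# ISO-8859-2 (two characters), so validity alone does not settle it.
ambiguous = b"\xc3\xa9"
print(is_valid_utf8(ambiguous))
print(len(ambiguous.decode("utf-8")), len(ambiguous.decode("iso-8859-2")))
```

In practice such collisions are rare in longer texts, which is why a
validity check is usually treated as a strong hint rather than a proof.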

That said, there are some tools that try to perform some magic
(statistical data analysis etc.) and guess the correct encoding. [1]

[1] app-i18n/enca, http://trific.ath.cx/software/enca/

HTH,
-jkt

-- 
cd /local/pub && more beer > /dev/mouth




Re: [gentoo-dev] UTF-8 encoding and file format of manuals

2006-06-02 Thread Tom Martin
On Thu, 1 Jun 2006 21:08:54 +0200
Paul de Vrieze wrote:

> On Thursday 01 June 2006 20:19, Jan Kundrát wrote:
> > Wiktor Wandachowicz wrote:
> > > Summing up:
> > > * UTF-8 manuals: good or bad?
> >
> > The Only Way To Go (tm), IMHO. Let's let the legacy encodings die
> > in peace.
> 
> Would it be possible to do automatic detection and unicode conversion
> in the portage install stage? I think that would probably be the best
> option. At a later stage a simple detection and warning might be
> sufficient.
> 

I'd imagine that glep31check could be easily adapted to do this.

-- 
Tom Martin, http://dev.gentoo.org/~slarti
AMD64, net-mail, shell-tools, vim, recruiters
Gentoo Linux
-- 
gentoo-dev@gentoo.org mailing list



Re: [gentoo-dev] UTF-8 encoding and file format of manuals

2006-06-01 Thread Harald van Dijk
On Thu, Jun 01, 2006 at 02:41:27PM +, Wiktor Wandachowicz wrote:
> Respectful Gentoo developers,
> 
> I would like to ask what you think about UTF-8 encoded manual pages.
> I mean the files like ls.1.gz, which are used by the honorable "man" program.
> Recently I attacked the problem a little and before submitting any
> patches/proposals to Gentoo bugzilla I'd like to know your opinions first.
> 
> Disclaimer: for daily use I have LANG="pl_PL.UTF-8" and LC_ALL="pl_PL.UTF-8",
> but the original issue is of a more universal nature.
> 
> Back on subject. ISO-8859-* 8-bit encodings are fine and most localized
> manuals use them. However, there are some examples where UTF-8 manuals are
> installed as well. Namely, the newest portage uses "linguas_pl" like this:
> 
> $ emerge -pv portage
> [ebuild   R   ] sys-apps/portage-2.1_rc3-r3  USE="-build -doc" LINGUAS="pl"
> 
> As a result, translated manual pages are added to the system. The problem
> is that they use UTF-8 encoding. Having both man-pages-pl and this version
> of portage installed gives unexpected results. This way "man ls" prints all
> the letters with correct encoding, but "man emerge" does not. On the other
> hand, if "man" is configured to display UTF-8 encoded manuals correctly,
> all the other manuals print funny characters instead of desired output.
> 
> I wrote a simple script [1] which checks all installed Polish manuals by
> using the "file" program. For the "pl" locale it currently produces about
> 70kB of text, and for the default locale it's about 458kB. After grepping for
> all occurrences of "UTF" I've found out that only the newest portage's manuals
> are in UTF-8 ("pl"), plus: flow.1, gnome-keyring-manager.1, ImageMagick.1,
> Encode::Unicode::UTF7.3pm (but I think they are false positives, anyway).
> 
> While it's easy to contact Polish translators of the portage's manuals so
> they could correct them, the problem will have to be solved sooner or later.
> UTF-8 encoded manuals will probably occur with higher frequency, and some
> general resolution should be made.
> 
> After some discussion on the Polish forum [2] I've learnt about groff
> deficiencies with UTF-8 handling. However, a wrapper exists [3] that helps
> somewhat in that matter. But it also requires that all manuals be unified
> wrt. encoding: *all* ISO-8859-* or *all* UTF-8, no compromise.
> So I don't know what course to take.
> 
> Summing up:
> * UTF-8 manuals: good or bad?

Bad if they're the only option. It means manpages will no longer be
available for non-UTF-8 users. Also, forcing everything in
/usr/share/man/pl to be UTF-8 will require users to emerge -e world.

> * how to handle mixed encodings of manuals?

The same way it's done now: install latin2 pl manpages in
 /usr/share/man/pl
and utf8 pl manpages in
 /usr/share/man/pl.UTF-8
If anything installs utf8 manpages in /usr/share/man/pl, fix the ebuild.
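
A minimal Python sketch of that directory rule, for illustration only. The
function name man_install_dir and its parameters are hypothetical; the
man<section> subdirectory follows the usual /usr/share/man layout.

```python
from pathlib import PurePosixPath

MAN_ROOT = PurePosixPath("/usr/share/man")


def man_install_dir(lang: str, section: str, is_utf8: bool) -> PurePosixPath:
    """Pick the install directory for a translated manual page.

    Legacy-encoded pages go to /usr/share/man/<lang>, UTF-8 pages go to
    /usr/share/man/<lang>.UTF-8, as described in the message above.
    """
    locale_dir = f"{lang}.UTF-8" if is_utf8 else lang
    return MAN_ROOT / locale_dir / f"man{section}"


print(man_install_dir("pl", "1", is_utf8=False))
print(man_install_dir("pl", "1", is_utf8=True))
```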

> * should man and/or groff handle UTF-8 better?

Yes, but it's not required to get this problem sorted out.

> * should an eclass function be created to aid in correcting the encoding
>   of manual pages while installing them?

Maybe, but it's not required to get this problem sorted out.



Re: [gentoo-dev] UTF-8 encoding and file format of manuals

2006-06-01 Thread Mike Doty
Paul de Vrieze wrote:
> On Thursday 01 June 2006 20:19, Jan Kundrát wrote:
>> Wiktor Wandachowicz wrote:
>>> Summing up:
>>> * UTF-8 manuals: good or bad?
>> The Only Way To Go (tm), IMHO. Let's let the legacy encodings die in peace.
> 
> Would it be possible to do automatic detection and unicode conversion in the 
> portage install stage? I think that would probably be the best option. At a 
> later stage a simple detection and warning might be sufficient.
> 
> Paul
> 
I'd agree. Forcing UTF-8/Unicode on those of us who don't want the extra
bloat is a "bad thing".



Re: [gentoo-dev] UTF-8 encoding and file format of manuals

2006-06-01 Thread Paul de Vrieze
On Thursday 01 June 2006 20:19, Jan Kundrát wrote:
> Wiktor Wandachowicz wrote:
> > Summing up:
> > * UTF-8 manuals: good or bad?
>
> The Only Way To Go (tm), IMHO. Let's let the legacy encodings die in peace.

Would it be possible to do automatic detection and unicode conversion in the 
portage install stage? I think that would probably be the best option. At a 
later stage a simple detection and warning might be sufficient.
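
A rough Python sketch of what such a detect-and-convert step could look
like. This is not Portage code: the function name to_utf8 is hypothetical,
and the fallback encoding (ISO-8859-2 here) is an assumption that would have
to come from the page's language, since it cannot be read from the file.

```python
from typing import Tuple


def to_utf8(data: bytes, legacy_encoding: str = "iso-8859-2") -> Tuple[bytes, bool]:
    """Return (utf8_bytes, converted).

    Data that already decodes as UTF-8 is passed through unchanged.
    Anything else is assumed to be in legacy_encoding and re-encoded.
    This is a heuristic: clean UTF-8 decoding is a hint, not a proof.
    """
    try:
        data.decode("utf-8")
        return data, False
    except UnicodeDecodeError:
        return data.decode(legacy_encoding).encode("utf-8"), True


page = "Zażółć gęślą jaźń".encode("iso-8859-2")
converted, changed = to_utf8(page)
print(changed, converted.decode("utf-8"))
```

The returned flag is what a later "detection and warning" stage could
report instead of converting.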

Paul

-- 
Paul de Vrieze
Gentoo Developer
Homepage: http://www.devrieze.net




Re: [gentoo-dev] UTF-8 encoding and file format of manuals

2006-06-01 Thread Josh Saddler
Jan Kundrát wrote:
> Wiktor Wandachowicz wrote:
> 
>>Summing up:
>>* UTF-8 manuals: good or bad?
> 
> 
> The Only Way To Go (tm), IMHO. Let's let the legacy encodings die in peace.

Agreed. I'd like to see much more extensive use of Unicode throughout my system
by default. Unicode man pages are a good idea.




Re: [gentoo-dev] UTF-8 encoding and file format of manuals

2006-06-01 Thread Jan Kundrát
Wiktor Wandachowicz wrote:
> Summing up:
> * UTF-8 manuals: good or bad?

The Only Way To Go (tm), IMHO. Let's let the legacy encodings die in peace.

> Any constructive comments are more than welcome!

The very same problem exists with man-pages-cs (which are outdated as a
bonus).

Blésmrt,
-jkt

-- 
cd /local/pub && more beer > /dev/mouth





[gentoo-dev] UTF-8 encoding and file format of manuals

2006-06-01 Thread Wiktor Wandachowicz
Respectful Gentoo developers,

I would like to ask what you think about UTF-8 encoded manual pages.
I mean the files like ls.1.gz, which are used by the honorable "man" program.
Recently I attacked the problem a little and before submitting any
patches/proposals to Gentoo bugzilla I'd like to know your opinions first.

Disclaimer: for daily use I have LANG="pl_PL.UTF-8" and LC_ALL="pl_PL.UTF-8",
but the original issue is of a more universal nature.

Back on subject. ISO-8859-* 8-bit encodings are fine and most localized
manuals use them. However, there are some examples where UTF-8 manuals are
installed as well. Namely, the newest portage uses "linguas_pl" like this:

$ emerge -pv portage
[ebuild   R   ] sys-apps/portage-2.1_rc3-r3  USE="-build -doc" LINGUAS="pl"

As a result, translated manual pages are added to the system. The problem
is that they use UTF-8 encoding. Having both man-pages-pl and this version
of portage installed gives unexpected results. This way "man ls" prints all
the letters with correct encoding, but "man emerge" does not. On the other
hand, if "man" is configured to display UTF-8 encoded manuals correctly,
all the other manuals print funny characters instead of desired output.

I wrote a simple script [1] which checks all installed Polish manuals by
using the "file" program. For the "pl" locale it currently produces about
70kB of text, and for the default locale it's about 458kB. After grepping for
all occurrences of "UTF" I've found out that only the newest portage's manuals
are in UTF-8 ("pl"), plus: flow.1, gnome-keyring-manager.1, ImageMagick.1,
Encode::Unicode::UTF7.3pm (but I think they are false positives, anyway).
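
For illustration, a small Python sketch of a similar survey. It is not the
checkman script from [1], and it inspects the decompressed bytes directly
instead of calling "file"; the names classify and scan_man_dir are
hypothetical.

```python
import gzip
from pathlib import Path


def classify(data: bytes) -> str:
    """Classify bytes as 'ascii', 'utf-8', or 'other' (some 8-bit encoding)."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"


def scan_man_dir(root: Path) -> dict:
    """Map every gzip-compressed page under root to its classification."""
    results = {}
    for path in sorted(root.rglob("*.gz")):
        with gzip.open(path, "rb") as fh:
            results[str(path)] = classify(fh.read())
    return results


if __name__ == "__main__":
    # Path taken from the thread; adjust for the locale being checked.
    for name, kind in scan_man_dir(Path("/usr/share/man/pl")).items():
        if kind == "utf-8":
            print(name)
```

Checking content rather than file names should avoid matches such as
Encode::Unicode::UTF7.3pm, which only mention "UTF" in the name.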

While it's easy to contact Polish translators of the portage's manuals so
they could correct them, the problem will have to be solved sooner or later.
UTF-8 encoded manuals will probably occur with higher frequency, and some
general resolution should be made.

After some discussion on the Polish forum [2] I've learnt about groff
deficiencies with UTF-8 handling. However, a wrapper exists [3] that helps
somewhat in that matter. But it also requires that all manuals be unified
wrt. encoding: *all* ISO-8859-* or *all* UTF-8, no compromise.
So I don't know what course to take.

Summing up:
* UTF-8 manuals: good or bad?
* how to handle mixed encodings of manuals?
* should man and/or groff handle UTF-8 better?
* should an eclass function be created to aid in correcting the encoding
  of manual pages while installing them?

Any constructive comments are more than welcome!

Best regards,
Wiktor Wandachowicz
(SirYes)

[1] http://ics.p.lodz.pl/~wiktorw/gentoo/checkman
[2] http://forums.gentoo.org/viewtopic-p-3352287.html
[3] http://hoth.amu.edu.pl/~d_szeluga/groff-utf8.tar.bz2

