Kaixo!

On Tue, Apr 10, 2001 at 02:40:51PM +0200, Bruno Haible wrote:

>> The problem with man pages is that they don't declare their encoding; that will
>> be a big problem when people start to switch to utf-8.
[the proposed algo]

> Sounds complicated indeed. Why would you convert the man page itself
> to UTF-8, when groff already has an option (-Tutf8) to produce UTF-8
> output?

That's not the problem.
The problem is that it is *impossible* to know for sure what encoding
is used in the source troff file; there is no charset=....
line.

So some heuristics are needed to guess it.
Hopefully the choice will only be between the traditional encoding
(eg: koi8-r for 'ru', iso-8859-2 for 'pl', etc) and utf-8
(distinguishing between koi8-r and cp1251 would be hopeless imho),
and utf-8 has the nice property of being quite easily
detectable by the special patterns of its 8bit bytes.
CJK man pages can pose an additional difficulty, as
they may exist in an 8bit encoding (eg: euc-jp, euc-kr, 8bit gb2312)
or in a 7bit iso-2022-* encoding; but the latter can usually be recognized
by its special escape sequences.
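
To make the heuristic above concrete, here is a minimal sketch in Python. It is only an illustration, not something man or groff implements: the function name, the fallback table and the iso-8859-1 default are all assumptions made for the example.

```python
# Hypothetical fallback table: traditional encoding per language code.
TRADITIONAL = {"ru": "koi8-r", "pl": "iso-8859-2"}


def guess_manpage_encoding(data: bytes, lang: str) -> str:
    """Guess the encoding of raw troff source (a heuristic, not a guarantee)."""
    seven_bit = all(byte < 0x80 for byte in data)
    # 7bit data containing ESC bytes is probably some iso-2022-* variant.
    if seven_bit and b"\x1b" in data:
        return "iso-2022"
    # Plain ASCII reads the same in every candidate encoding.
    if seven_bit:
        return "ascii"
    # 8bit bytes that happen to form valid UTF-8 sequences are very
    # unlikely to be text in a traditional 8bit encoding.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumed default when the language is not in the table.
        return TRADITIONAL.get(lang, "iso-8859-1")
```

Note that this cannot tell koi8-r from cp1251; it only separates utf-8 from "whatever is traditional for that language".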

> I'd suggest:
>   - Assume the manpages are in traditional format.

But what happens when utf-8 encoded pages start to appear?
(And once people start to use utf-8 as their default
encoding you will see such cases, as translators will release
their translated man pages in the encoding they use by default.)


>     The groff
>     developers will have to define how UTF-8 manpages shall define
>     their encoding.

Indeed that would be a very nice thing; and it should even be extended to all
encodings, the way html pages or *.po files declare their encoding.
It is the only way to properly solve the problem.
Can we reasonably expect that such a solution can be defined
and put into wide use in the near future? (I have no idea, so I ask.)
If yes, that would indeed be a much better solution.
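
For comparison, reading an explicit declaration is trivial once one exists. The sketch below pulls the charset out of a *.po header line or an html meta tag; the function name is hypothetical, and no such declaration is defined for man pages today.

```python
import re


def declared_charset(header: str):
    """Return the charset named in a 'charset=' declaration, or None."""
    match = re.search(r"charset=([A-Za-z0-9._:-]+)", header, re.IGNORECASE)
    return match.group(1).lower() if match else None
```

With a line such as "Content-Type: text/plain; charset=UTF-8" from a *.po header this gives "utf-8"; on a troff source line it gives None, which is exactly the problem.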

> Now all that will remain to be done is to fix 'more' and 'less' to
> correctly display the resulting UTF-8 encoded output.

??? What is the problem? I can use less and more without any
problem to view utf-8 files.

> 
> Bruno

-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://www.srtxg.easynet.be/            PGP Key available, key ID: 0x8F0E4975
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
