tags 354321 fixed-upstream
thanks
On Sat, Feb 25, 2006 at 03:15:41PM +0800, wu songhai wrote:
> Package: man-db
> Version: 2.4.3-3
> Severity: important for Chinese users who do not use a UTF-8 locale
>
> The current version of man-db cannot handle Chinese manual pages
> in a GBK locale, because there is no such nroff device. But we can use
> the utf8 device instead. Here is the patch. With it, man-db supports
> the GB2312, GBK, BIG5 and EUC-TW encodings well, supports the locales
> zh_CN.GB2312, zh_CN.GBK, zh_TW.BIG5 and zh_TW.EUCTW, and should
> support zh_SG.GBK.
Thanks for the patch. Sorry it's taken me a while to respond; in the
intervening period I independently made a number of other improvements
to internationalisation support, and I didn't quite get round to working
out how that interacted with this bug until now.
You should find that versions of man-db from 2.4.4 onwards have decent
Chinese support, although I'd recommend using 2.5.1 as a variety of
other internationalisation problems are fixed there. I just tested it
with a zh_TW (not UTF-8) locale in cwterm, and everything appears to be
working well.
However, there were a few loose ends which your patch included and my
changes didn't, so I've incorporated those for man-db 2.5.2.

Mon May  5 01:06:56 BST 2008  Colin Watson  <[EMAIL PROTECTED]>

	Clean up some loose ends of Chinese support (thanks, Wu Songhai;
	Debian bug #354321).

	* src/encodings.c (directory_table): Add zh_SG, defaulting to the
	  GBK encoding.
	  (charset_alias_table): Map EUCTW to EUC-TW.
	  (charset_table): Add EUC-TW, defaulting to the nippon driver.
	  (compatible_encodings): Recognise EUC-TW encoding.

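For anyone reading along without the tree to hand, the two table additions amount to roughly the following. This is only a sketch: the struct layouts, the field names and the two lookup helpers (directory_encoding, canonical_charset) are made up for illustration and are not copied from src/encodings.c, and the zh_CN and zh_TW rows simply follow the defaults described in the quoted patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical layouts for illustration; the real structs in
 * src/encodings.c may differ. */
struct directory_entry {
	const char *lang_dir;        /* manual page directory, e.g. "zh_TW" */
	const char *source_encoding; /* encoding assumed for pages in it */
};

struct charset_alias_entry {
	const char *alias;
	const char *canonical_name;
};

static const struct directory_entry directory_table[] = {
	{ "zh_CN", "GBK"  },
	{ "zh_SG", "GBK"  }, /* new in this change */
	{ "zh_TW", "BIG5" },
	{ NULL,    NULL   }
};

static const struct charset_alias_entry charset_alias_table[] = {
	{ "EUCTW", "EUC-TW" }, /* new in this change */
	{ NULL,    NULL     }
};

/* Return the assumed page encoding for a directory, or NULL if unknown. */
const char *directory_encoding (const char *lang_dir)
{
	const struct directory_entry *entry;

	for (entry = directory_table; entry->lang_dir; ++entry)
		if (strcmp (entry->lang_dir, lang_dir) == 0)
			return entry->source_encoding;
	return NULL;
}

/* Return the canonical name for a charset, or the name itself if it
 * has no alias entry. */
const char *canonical_charset (const char *name)
{
	const struct charset_alias_entry *entry;

	for (entry = charset_alias_table; entry->alias; ++entry)
		if (strcmp (entry->alias, name) == 0)
			return entry->canonical_name;
	return name;
}
```

In this sketch plain zh is deliberately absent from the directory table, so looking it up yields NULL.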
Comments on your original patch, perhaps for future reference, follow:
> The patch assumes that pages in the zh_CN and zh_SG directories use
> the GBK encoding, pages in zh_TW use BIG5, and pages in the plain zh
> directory use UTF-8.
I'm not convinced that plain zh is appropriate; there's no sensible way
for translators to know whether to use Simplified or Traditional Chinese
there. There's only one manual page (and one symlink to it) in Debian
right now in /usr/share/man/zh/. I think it's better to continue with
the zh_CN vs. zh_TW division for this.
> Another problem is that the less utility cannot correctly handle
> bold style Chinese. We can use w3m instead, or use less with the -u
> option and without bold style.
It looks OK to my uneducated eyes now; no mojibake or anything.
Has this improved since you filed your bug report, or am I missing
something?
> diff -bNru man-db-2.4.3/src/encodings.c man-db-2.4.3-new/src/encodings.c
> --- man-db-2.4.3/src/encodings.c 2005-01-05 23:11:54.000000000 +0800
> +++ man-db-2.4.3-new/src/encodings.c 2006-02-25 14:31:50.000000000 +0800
> @@ -133,6 +137,10 @@
> { "ISO-8859-1", "latin1" },
> { "ISO-8859-15", "latin1" },
> { "UTF-8", "utf8" },
> + { "GBK", "gb" },
> + { "GB2312", "gb" },
> + { "BIG5", "big5" },
> + { "EUC-TW", "euc" },
>
> #ifdef MULTIBYTE_GROFF
> { "EUC-JP", "nippon" },
> @@ -160,20 +168,24 @@
> struct device_entry {
> const char *roff_device;
> const char *roff_encoding;
> + const char *virtual_device;
> const char *output_encoding;
> };
>
> static struct device_entry device_table[] = {
> - { "ascii", "ISO-8859-1", "ANSI_X3.4-1968" },
> - { "latin1", "ISO-8859-1", "ISO-8859-1" },
> - { "utf8", "ISO-8859-1", "UTF-8" },
> + { "ascii", "ISO-8859-1", "ascii", "ANSI_X3.4-1968" },
> + { "latin1", "ISO-8859-1", "latin1", "ISO-8859-1" },
> + { "utf8", "UTF-8", "utf8", "UTF-8" },
> + { "gb", "GBK", "utf8", "UTF-8" },
> + { "big5", "BIG5", "utf8", "UTF-8" },
> + { "euc", "EUC-TW", "utf8", "UTF-8" },
>
> #ifdef MULTIBYTE_GROFF
> - { "ascii8", NULL, NULL },
> - { "nippon", "EUC-JP", "EUC-JP" },
> + { "ascii8", NULL, "ascii8", NULL },
> + { "nippon", "EUC-JP", "nippon", "EUC-JP" },
> #endif /* MULTIBYTE_GROFF */
>
> - { NULL, NULL, NULL }
> + { NULL, NULL, NULL, NULL }
> };
>
> static const char *fallback_roff_encoding = "ISO-8859-1";
All the new encodings should be inside MULTIBYTE_GROFF.
Given how groff works at the moment, it's better to map these encodings
to the nippon device and let it sort it out. It's misnamed for
historical reasons - it can actually handle Chinese input too, as far as
I know.
I think the whole virtual_device thing is unnecessary complication.
roff_device is only used for three things: get_roff_encoding (which you
don't want to change), get_output_encoding (which you don't want to
change), and the -T argument to *roff (which should work fine as
nippon). Thus it's much more straightforward just to use roff_device.
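To make that concrete, here is a rough sketch of the simpler arrangement: the Chinese charsets go inside MULTIBYTE_GROFF and default to the nippon device, with no extra virtual_device field anywhere. The struct layout and the default_device helper are assumptions for illustration, not the actual contents of src/encodings.c, and MULTIBYTE_GROFF is defined by hand here only so that the fragment compiles on its own.

```c
#include <stddef.h>
#include <string.h>

/* Normally decided at configure time; defined here only so that this
 * sketch compiles standalone. */
#ifndef MULTIBYTE_GROFF
#define MULTIBYTE_GROFF 1
#endif

/* Hypothetical layout: a charset and the *roff device it defaults to. */
struct charset_entry {
	const char *charset;
	const char *default_device;
};

static const struct charset_entry charset_table[] = {
	{ "ISO-8859-1",  "latin1" },
	{ "ISO-8859-15", "latin1" },
	{ "UTF-8",       "utf8"   },

#ifdef MULTIBYTE_GROFF
	{ "EUC-JP",      "nippon" },
	/* Chinese charsets reuse the (misnamed) nippon device. */
	{ "GBK",         "nippon" },
	{ "GB2312",      "nippon" },
	{ "BIG5",        "nippon" },
	{ "EUC-TW",      "nippon" },
#endif /* MULTIBYTE_GROFF */

	{ NULL,          NULL     }
};

/* Return the device to pass to *roff via -T for a charset, or NULL if
 * the charset is not in the table. */
const char *default_device (const char *charset)
{
	const struct charset_entry *entry;

	for (entry = charset_table; entry->charset; ++entry)
		if (strcmp (entry->charset, charset) == 0)
			return entry->default_device;
	return NULL;
}
```

The device name that comes back is then the only key needed for the roff and output encoding lookups and for -T.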
> @@ -382,9 +394,11 @@
> * we want or else it probably won't work at all no matter what we
> * do. We might as well try it for now.
> */
> - if (STREQ (input, "UTF-8"))
> + if (STREQ (input, "UTF-8")||STREQ (output, "UTF-8"))
> return 1;
>
> + if (STREQ (input, "BIG5") && STREQ (output, "EUC-TW"))
> + return 1;
> #ifdef MULTIBYTE_GROFF
> /* Special case for ja_JP.UTF-8, which takes UTF-8 input recoded
> * from EUC-JP and produces UTF-8 output. This is rather filthy.
The problem with this is that, because only CJK locales may use UTF-8
input at the moment, this causes nippon always to be an acceptable
fallback, which causes problems for non-CJK locales in some
circumstances. It's better to handle Chinese the same way we handle
Japanese.
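Something along these lines is what I mean: UTF-8 output is accepted only when the input is one of a named set of CJK encodings, rather than for any input at all. This is a simplified sketch that models only the cases discussed in this mail; the is_cjk_encoding helper and the exact list of encodings are assumptions, and the real compatible_encodings has other cases that are left out here.

```c
#include <string.h>

#define STREQ(a, b) (strcmp ((a), (b)) == 0)

/* Normally decided at configure time; defined here only so that this
 * sketch compiles standalone. */
#ifndef MULTIBYTE_GROFF
#define MULTIBYTE_GROFF 1
#endif

#ifdef MULTIBYTE_GROFF
/* Hypothetical helper: is this one of the CJK source encodings that
 * may be recoded to produce UTF-8 output? */
static int is_cjk_encoding (const char *encoding)
{
	return STREQ (encoding, "EUC-JP") ||
	       STREQ (encoding, "GBK") ||
	       STREQ (encoding, "GB2312") ||
	       STREQ (encoding, "BIG5") ||
	       STREQ (encoding, "EUC-TW");
}
#endif /* MULTIBYTE_GROFF */

/* Simplified: return 1 if output is acceptable for input, else 0. */
int compatible_encodings (const char *input, const char *output)
{
	if (STREQ (input, output))
		return 1;

	/* Kept from the context lines of the quoted diff. */
	if (STREQ (input, "UTF-8"))
		return 1;

#ifdef MULTIBYTE_GROFF
	/* Chinese handled like Japanese: the test is on the input
	 * encoding, not merely on the output being UTF-8. */
	if (is_cjk_encoding (input) && STREQ (output, "UTF-8"))
		return 1;
#endif /* MULTIBYTE_GROFF */

	return 0;
}
```

With this shape, a non-CJK input such as KOI8-R paired with UTF-8 output is not waved through by the CJK rule.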
Thanks,
--
Colin Watson [EMAIL PROTECTED]