Bug#354321: man-db Chinese support and may be the way to internationalize man-db

Colin Watson Sun, 04 May 2008 17:21:23 -0700

tags 354321 fixed-upstream
thanks

On Sat, Feb 25, 2006 at 03:15:41PM +0800, wu songhai wrote:
> Package: man-db
> Version: 2.4.3-3
> Severity: important for Chinese users who do not use a UTF-8 locale
> 
> The current version of man-db cannot handle Chinese manual pages
> with a GBK locale, because there is no such nroff device. But we can use
> utf8 instead. Here is the patch. With it, man-db supports the GB2312, GBK,
> BIG5 and EUC-TW encodings well, supports the locales zh_CN.GB2312, zh_CN.GBK,
> zh_TW.BIG5 and zh_TW.EUCTW, and should also support zh_SG.GBK.


Thanks for the patch. Sorry it's taken me a while to respond; in the
intervening period I independently made a number of other improvements
to internationalisation support, and I didn't quite get round to working
out how that interacted with this bug until now.

You should find that versions of man-db from 2.4.4 onwards have decent
Chinese support, although I'd recommend using 2.5.1 as a variety of
other internationalisation problems are fixed there. I just tested it
with a zh_TW (not UTF-8) locale in cwterm, and everything appears to be
working well.

However, there were a few loose ends which your patch included and my
changes didn't, so I've incorporated those for man-db 2.5.2.

Mon May  5 01:06:56 BST 2008  Colin Watson  <[EMAIL PROTECTED]>

        Clean up some loose ends of Chinese support (thanks, Wu Songhai;
        Debian bug #354321).

        * src/encodings.c (directory_table): Add zh_SG, defaulting to the
          GBK encoding.
          (charset_alias_table): Map EUCTW to EUC-TW.
          (charset_table): Add EUC-TW, defaulting to the nippon driver.
          (compatible_encodings): Recognise EUC-TW encoding.
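
As an illustration only, the table changes listed in this changelog entry can be sketched as follows. This is not the man-db source: the `struct pair` type, the `lookup` helper and the `sketch_*` function names are hypothetical, and the zh_CN and zh_TW rows are taken from the reporter's description of the patch rather than from the changelog. Only the zh_SG, EUCTW and EUC-TW rows come from the entry above.

```c
#include <stddef.h>
#include <string.h>

#define STREQ(a, b) (strcmp((a), (b)) == 0)

/* Generic two-column row; this struct name is an assumption. */
struct pair {
	const char *key;
	const char *value;
};

/* Man page directory -> encoding assumed for pages stored there. */
static const struct pair directory_table[] = {
	{ "zh_CN", "GBK"  },	/* from the reporter's description */
	{ "zh_TW", "BIG5" },	/* from the reporter's description */
	{ "zh_SG", "GBK"  },	/* added by the changelog entry */
	{ NULL,    NULL   }
};

/* Alternative spelling -> canonical charset name. */
static const struct pair charset_alias_table[] = {
	{ "EUCTW", "EUC-TW" },
	{ NULL,    NULL     }
};

/* Canonical charset -> default roff device. */
static const struct pair charset_table[] = {
	{ "EUC-TW", "nippon" },
	{ NULL,     NULL     }
};

/* Linear search; returns NULL when the key is not in the table. */
static const char *lookup(const struct pair *table, const char *key)
{
	for (; table->key; ++table)
		if (STREQ(table->key, key))
			return table->value;
	return NULL;
}

/* Resolve an alias such as "EUCTW"; unknown names pass through unchanged. */
const char *sketch_canonical_charset(const char *charset)
{
	const char *canonical = lookup(charset_alias_table, charset);
	return canonical ? canonical : charset;
}

/* Default device for a charset, after alias resolution. */
const char *sketch_default_device(const char *charset)
{
	return lookup(charset_table, sketch_canonical_charset(charset));
}

/* Encoding assumed for a man page directory such as "zh_SG". */
const char *sketch_directory_encoding(const char *dir)
{
	return lookup(directory_table, dir);
}
```

In this sketch a zh_TW.EUCTW locale resolves "EUCTW" to "EUC-TW" and then to the nippon device, while a charset missing from the table yields NULL.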

Comments on your original patch, perhaps for future reference, follow:

> The patch assumes that pages in the zh_CN and zh_SG directories use
> GBK encoding, pages in zh_TW use BIG5 encoding, and pages in the zh
> directory use UTF-8 encoding.

I'm not convinced that plain zh is appropriate; there's no sensible way
for translators to know whether to use Simplified or Traditional Chinese
there. There's only one manual page (and one symlink to it) in Debian
right now in /usr/share/man/zh/. I think it's better to continue with
the zh_CN vs. zh_TW division for this.

> Another problem is that the less utility cannot correctly handle
> Chinese in bold style. We can use w3m instead, or use less with the
> -u option, without bold style.

It looks OK to me now, with uneducated eyes; no mojibake or anything.
Has this improved since you filed your bug report, or am I missing
something?

> diff -bNru man-db-2.4.3/src/encodings.c man-db-2.4.3-new/src/encodings.c
> --- man-db-2.4.3/src/encodings.c      2005-01-05 23:11:54.000000000 +0800
> +++ man-db-2.4.3-new/src/encodings.c  2006-02-25 14:31:50.000000000 +0800
> @@ -133,6 +137,10 @@
>       { "ISO-8859-1",         "latin1"        },
>       { "ISO-8859-15",        "latin1"        },
>       { "UTF-8",              "utf8"          },
> +     { "GBK",                "gb"            },
> +     { "GB2312",             "gb"            },
> +     { "BIG5",               "big5"          },
> +     { "EUC-TW",             "euc"           },
> 
>  #ifdef MULTIBYTE_GROFF
>       { "EUC-JP",             "nippon"        },
> @@ -160,20 +168,24 @@
>  struct device_entry {
>       const char *roff_device;
>       const char *roff_encoding;
> +     const char *virtual_device;
>       const char *output_encoding;
>  };
> 
>  static struct device_entry device_table[] = {
> -     { "ascii",      "ISO-8859-1",   "ANSI_X3.4-1968"        },
> -     { "latin1",     "ISO-8859-1",   "ISO-8859-1"            },
> -     { "utf8",       "ISO-8859-1",   "UTF-8"                 },
> +     { "ascii",      "ISO-8859-1",   "ascii",        "ANSI_X3.4-1968"        },
> +     { "latin1",     "ISO-8859-1",   "latin1",       "ISO-8859-1"            },
> +     { "utf8",       "UTF-8",        "utf8",         "UTF-8"                 },
> +     { "gb",         "GBK",          "utf8",         "UTF-8"                 },
> +     { "big5",       "BIG5",         "utf8",         "UTF-8"                 },
> +     { "euc",        "EUC-TW",       "utf8",         "UTF-8"                 },
> 
>  #ifdef MULTIBYTE_GROFF
> -     { "ascii8",     NULL,           NULL                    },
> -     { "nippon",     "EUC-JP",       "EUC-JP"                },
> +     { "ascii8",     NULL,           "ascii8",       NULL                    },
> +     { "nippon",     "EUC-JP",       "nippon",       "EUC-JP"                },
>  #endif /* MULTIBYTE_GROFF */
> 
> -     { NULL,         NULL,           NULL                    }
> +     { NULL,         NULL,           NULL,           NULL                    }
>  };
> 
>  static const char *fallback_roff_encoding = "ISO-8859-1";

All the new encodings should be inside MULTIBYTE_GROFF.

Given how groff works at the moment, it's better to map these encodings
to the nippon device and let it sort it out. It's misnamed for
historical reasons - it can actually handle Chinese input too, as far as
I know.

I think the whole virtual_device thing is unnecessary complication.
roff_device is only used for three things: get_roff_encoding (which you
don't want to change), get_output_encoding (which you don't want to
change), and the -T argument to *roff (which should work fine as
nippon). Thus it's much more straightforward just to use roff_device.
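
To make the three uses of roff_device concrete, here is a minimal sketch that keeps the three-column table from the unpatched code quoted above and keys everything on roff_device. The table rows are copied from the removed lines of the diff (the ascii8 row is left out, and the MULTIBYTE_GROFF guard is omitted for brevity). The function signatures, the `find_device` and `format_device_arg` helpers, and the fallback behaviour for unknown devices are assumptions made for illustration, not the actual man-db implementation.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define STREQ(a, b) (strcmp((a), (b)) == 0)

/* Three columns, as in the table before the patch. */
struct device_entry {
	const char *roff_device;
	const char *roff_encoding;
	const char *output_encoding;
};

static const struct device_entry device_table[] = {
	{ "ascii",  "ISO-8859-1", "ANSI_X3.4-1968" },
	{ "latin1", "ISO-8859-1", "ISO-8859-1"     },
	{ "utf8",   "ISO-8859-1", "UTF-8"          },
	{ "nippon", "EUC-JP",     "EUC-JP"         },
	{ NULL,     NULL,         NULL             }
};

static const char *fallback_roff_encoding = "ISO-8859-1";

/* Hypothetical helper: find the row for a device, or NULL. */
static const struct device_entry *find_device(const char *device)
{
	const struct device_entry *entry;

	for (entry = device_table; entry->roff_device; ++entry)
		if (STREQ(entry->roff_device, device))
			return entry;
	return NULL;
}

/* Use 1: the encoding the formatter expects as input (assumed fallback). */
const char *get_roff_encoding(const char *device)
{
	const struct device_entry *entry = find_device(device);
	return entry ? entry->roff_encoding : fallback_roff_encoding;
}

/* Use 2: the encoding the device produces; NULL if the device is unknown. */
const char *get_output_encoding(const char *device)
{
	const struct device_entry *entry = find_device(device);
	return entry ? entry->output_encoding : NULL;
}

/* Use 3: build the -T argument passed to *roff, e.g. "-Tnippon". */
int format_device_arg(const char *device, char *buf, size_t size)
{
	return snprintf(buf, size, "-T%s", device);
}
```

Because all three lookups share the same key, no separate virtual_device column is needed in this sketch: mapping a charset to nippon is enough to select the encodings and the -T argument together.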

> @@ -382,9 +394,11 @@
>        * we want or else it probably won't work at all no matter what we
>        * do. We might as well try it for now.
>        */
> -     if (STREQ (input, "UTF-8"))
> +     if (STREQ (input, "UTF-8")||STREQ (output, "UTF-8"))
>               return 1;
> 
> +     if (STREQ (input, "BIG5") && STREQ (output, "EUC-TW"))
> +             return 1;
>  #ifdef MULTIBYTE_GROFF
>       /* Special case for ja_JP.UTF-8, which takes UTF-8 input recoded
>        * from EUC-JP and produces UTF-8 output. This is rather filthy.

The problem with this is that, because only CJK locales may use UTF-8
input at the moment, this causes nippon always to be an acceptable
fallback, which causes problems for non-CJK locales in some
circumstances. It's better to handle Chinese the same way we handle
Japanese.
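
One possible reading of the difference, sketched under stated assumptions: the broad rule below mirrors the condition added in the quoted hunk, and the narrow rule accepts UTF-8 output only when the input is one of a short list of CJK encodings. The function names, the encoding list, and the initial equality check are all hypothetical; this message does not spell out the exact semantics of the two parameters, and the real compatible_encodings may differ.

```c
#include <stddef.h>
#include <string.h>

#define STREQ(a, b) (strcmp((a), (b)) == 0)

/* Broad rule, following the quoted hunk: any UTF-8 output is accepted. */
int sketch_compatible_broad(const char *input, const char *output)
{
	if (STREQ(input, output))	/* assumed: identical encodings match */
		return 1;
	return STREQ(input, "UTF-8") || STREQ(output, "UTF-8");
}

/* Hypothetical list of legacy CJK encodings named in this thread. */
static int is_cjk_legacy(const char *encoding)
{
	static const char *const names[] = {
		"EUC-JP", "GBK", "GB2312", "BIG5", "EUC-TW", NULL
	};
	size_t i;

	for (i = 0; names[i]; ++i)
		if (STREQ(encoding, names[i]))
			return 1;
	return 0;
}

/* Narrow rule: treat Chinese like the existing Japanese special case. */
int sketch_compatible_narrow(const char *input, const char *output)
{
	if (STREQ(input, output))	/* assumed: identical encodings match */
		return 1;
	if (STREQ(input, "UTF-8"))
		return 1;
	return is_cjk_legacy(input) && STREQ(output, "UTF-8");
}
```

With these definitions, an ISO-8859-1 page paired with UTF-8 output is accepted by the broad rule but rejected by the narrow one, while a BIG5 page paired with UTF-8 output is accepted by both.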

Thanks,

-- 
Colin Watson                                       [EMAIL PROTECTED]


