Hi,

At Thu, 01 Mar 2001 17:35:03 +0000,
Markus Kuhn <[EMAIL PROTECTED]> wrote:

> I wonder if anyone has a workaround for this? I've seen a few comments
> made about Unicode and groff, but nothing definite.

I agree that Groff should be able to handle UTF-8.

I have been using Japanese-enabled groff for many years.  It adds a
"-Tnippon" device to enable EUC encoding and a special line-breaking
algorithm.  However, we have always known that this approach is a
dirty localization hack.

At first, as a quick hack by a Debian member who aims to develop a
"world version" OS, I implemented a "-Tascii8" device last year.
Though I don't know whether my patch was adopted by official groff,
it has been used in the Debian distribution for months.  This
"-Tascii8" device is for encodings which need the MSB (i.e., all
encodings other than ASCII) but in which (for example) 0xad does not
mean a soft hyphen.

Werner and I agreed last year that the core part of troff should
accept only UTF-8 input/output.  Thus, I am expected to write a
wrapper (pre/postprocessor) for the encoding conversion.  However, I
would be glad if someone else wrote such a pre/postprocessor.

The design of the pre/postprocessor is already determined.  Please
refer to the groff mailing list.

 - Groff is widely used for formatting manual pages.  Manual pages
   are installed into the system and have their own encodings
   regardless of the users' locale.  Thus, some mechanism for manual
   pages to specify their own encodings must be added.  The mechanism
   will follow Mule's convention, i.e., -*-coding:foobar;-*- on the
   first line.  This mechanism will enable manpages in both UTF-8 and
   "legacy" encodings to co-exist.
 - If a manual page doesn't have such an indicator, the locale's
   encoding is assumed.  I think this is reasonable because it
   doesn't break most existing systems.
 - If the encoding is specified on the command line (like
   --encoding=foobar), it will take priority over the other two ways,
   as in the sketch after this list.
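Here is a sketch of how the preprocessor could resolve the encoding
with this priority.  The function names are hypothetical and the
parsing of the Mule tag is simplified:

/* Hypothetical sketch: resolve the source encoding with the
 * priority described above: --encoding > coding tag > locale. */
#include <string.h>
#include <locale.h>
#include <langinfo.h>

/* Look for "-*-coding: NAME;-*-" (Mule style) in the first line. */
static const char *coding_tag(const char *first_line, char *buf, size_t len)
{
    const char *p = strstr(first_line, "-*-");
    if (!p || !(p = strstr(p, "coding:")))
        return NULL;
    p += strlen("coding:");
    while (*p == ' ' || *p == '\t')
        p++;
    size_t i = 0;
    while (i + 1 < len && *p && *p != ';' && *p != ' ' && *p != '\t'
           && strncmp(p, "-*-", 3) != 0)
        buf[i++] = *p++;
    buf[i] = '\0';
    return i ? buf : NULL;
}

const char *resolve_encoding(const char *cmdline_opt, const char *first_line)
{
    static char buf[64];
    if (cmdline_opt)                 /* 1. --encoding=... wins */
        return cmdline_opt;
    const char *tag = coding_tag(first_line, buf, sizeof(buf));
    if (tag)                         /* 2. then the coding tag */
        return tag;
    setlocale(LC_CTYPE, "");         /* 3. fall back to the locale */
    return nl_langinfo(CODESET);
}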

To accept common encoding names, the pre/postprocessor will have a
table of encoding names which converts between "common" names
(preferred MIME names) and internal ones (i.e., names understood by
iconv()).
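For example, the table could look like the following.  The entries
are only illustrative; the real table should be generated from the
IANA list of preferred MIME names:

/* Illustrative mapping from preferred MIME names to names accepted
 * by iconv_open().  The pairs shown are assumptions; glibc's iconv
 * accepts most MIME names directly anyway. */
#include <string.h>
#include <strings.h>

struct encoding_alias {
    const char *mime_name;   /* preferred MIME name */
    const char *iconv_name;  /* name passed to iconv_open() */
};

static const struct encoding_alias alias_table[] = {
    { "ISO-8859-1", "ISO-8859-1" },
    { "EUC-JP",     "EUC-JP"     },
    { "Shift_JIS",  "SJIS"       },
    { "UTF-8",      "UTF-8"      },
    { NULL, NULL }
};

/* Return the iconv name for a MIME name, case-insensitively;
 * fall back to the given name itself. */
static const char *to_iconv_name(const char *mime_name)
{
    for (const struct encoding_alias *a = alias_table; a->mime_name; a++)
        if (strcasecmp(a->mime_name, mime_name) == 0)
            return a->iconv_name;
    return mime_name;
}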

The current way of specifying the encoding as a "device" must be
modified, because the encoding of the manual page source and the
encoding of the user's environment can differ.  Werner and I also
agreed that the "latin1", "ascii", "ascii8", "nippon", and "utf8"
devices will all be abolished in favor of a single "tty" device
(any other name would be fine too).

The range of valid characters differs from encoding to encoding.
For example, in an ISO-8859-1 environment, U+00A9 can be used for
the copyright mark (for "\(co" in manpage sources), and it will be
converted into 0xa9 by the postprocessor.  However, there are
encodings which don't have a character corresponding to U+00A9.
Should the postprocessor convert it into "(C)"?  No; that would
break the typesetting.  Thus, some method is needed to let troff
know the range of valid characters.  A "tty-char"-like macro could
be used for this purpose.
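For example, the postprocessor could determine that range by probing
iconv() itself.  This is only my sketch of how such a check could
work (the function names are mine, and I rely on glibc's iconv()
failing on unmappable characters):

/* Sketch: test whether a Unicode code point is representable in the
 * output encoding, by trying to convert its UTF-8 form with iconv().
 * Whether the real postprocessor will work this way is an assumption. */
#include <stdio.h>
#include <iconv.h>

/* Encode a code point (< 0x10000 is enough here) as UTF-8. */
static size_t to_utf8(unsigned int c, char *out)
{
    if (c < 0x80) {
        out[0] = (char)c;
        return 1;
    } else if (c < 0x800) {
        out[0] = (char)(0xc0 | (c >> 6));
        out[1] = (char)(0x80 | (c & 0x3f));
        return 2;
    }
    out[0] = (char)(0xe0 | (c >> 12));
    out[1] = (char)(0x80 | ((c >> 6) & 0x3f));
    out[2] = (char)(0x80 | (c & 0x3f));
    return 3;
}

/* Non-zero if U+c exists in the encoding `to`. */
int representable(unsigned int c, const char *to)
{
    iconv_t cd = iconv_open(to, "UTF-8");
    if (cd == (iconv_t)-1)
        return 0;
    char in[4], out[16];
    char *ip = in, *op = out;
    size_t il = to_utf8(c, in), ol = sizeof(out);
    size_t r = iconv(cd, &ip, &il, &op, &ol);
    iconv_close(cd);
    return r == 0;   /* 0 = fully, reversibly converted; glibc fails
                        with EILSEQ on unmappable characters */
}

int main(void)
{
    /* U+00A9 COPYRIGHT SIGN: present in ISO-8859-1, absent in ASCII. */
    printf("latin1: %d\n", representable(0xa9, "ISO-8859-1"));
    printf("ascii:  %d\n", representable(0xa9, "US-ASCII"));
    return 0;
}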

For manpage writers: I think non-English pages may use non-ASCII
characters which their native-speaker readers can accept.  However,
English manpages should be written using ASCII characters only, not
ISO-8859-1.  This is because English manpages are for all people all
over the world, while non-English ones are for native speakers.

---
Tomohiro KUBOTA <[EMAIL PROTECTED]>
http://surfchem0.riken.go.jp/~kubota/
"Introduction to I18N"
http://www.debian.org/doc/manuals/intro-i18n/
