Hi,
you probably know that mandoc(1) has been providing a -Tutf8 *output*
mode for more than three years now. To *input* non-ASCII characters,
however, encoding them as \[u] roff(7) escape sequences, also
documented in mandoc_char(7), is required.
In ports land, many manual pages contain occasional non-ASCII
characters - not a particularly smart idea in my opinion, but
let's face it, those characters *are* out there. There
are even some manual pages in ports completely written in non-latin
scripts; for these, using explicit \[u] escapes for almost every
letter would be rather impractical, so i can't really blame the
authors or translators for not doing that. The way to read such
pages was to install the preconv(1) utility (contained in textproc/groff
and in portable mandoc) and do stunts like
  $ preconv -eutf8 utf8_manual_file | mandoc -Tutf8 | less
I doubt many people did that.
The patch below integrates the preconv(1) code, written by kristaps@
in 2011, into mandoc(1), hooking it into the input reading module,
doing the necessary UTF-8 to \[u] encoding on the fly when
encountering non-ASCII characters. It also does some simple encoding
autodetection such that you will hopefully almost never need the -K
command line option borrowed from groff(1) to specify the input
encoding manually.
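
To illustrate the idea of the on-the-fly conversion, here is a minimal
sketch. It is not the code from preconv.c or from the patch; the
function names utf8_decode() and encode_line() are hypothetical, and the
per-byte fallback to ISO-8859-1 for bytes that do not form valid UTF-8
is an assumption made for the example, not a description of how the
real autodetection works.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: decode one UTF-8 sequence from in[0..len-1].
 * Returns the number of bytes consumed and stores the code point
 * in *cp, or returns 0 if the bytes are not valid UTF-8.
 */
static size_t
utf8_decode(const unsigned char *in, size_t len, unsigned int *cp)
{
	/* Smallest code point allowed for each sequence length. */
	static const unsigned int min[] = { 0, 0, 0x80, 0x800, 0x10000 };
	size_t i, n;

	if (len == 0)
		return 0;
	if (in[0] < 0x80) {
		*cp = in[0];
		return 1;
	} else if ((in[0] & 0xe0) == 0xc0) {
		n = 2;
		*cp = in[0] & 0x1f;
	} else if ((in[0] & 0xf0) == 0xe0) {
		n = 3;
		*cp = in[0] & 0x0f;
	} else if ((in[0] & 0xf8) == 0xf0) {
		n = 4;
		*cp = in[0] & 0x07;
	} else
		return 0;
	if (n > len)
		return 0;
	for (i = 1; i < n; i++) {
		if ((in[i] & 0xc0) != 0x80)
			return 0;
		*cp = (*cp << 6) | (in[i] & 0x3f);
	}
	/* Reject overlong forms, surrogates, and out-of-range values. */
	if (*cp < min[n] || *cp > 0x10ffff ||
	    (*cp >= 0xd800 && *cp <= 0xdfff))
		return 0;
	return n;
}

/*
 * Hypothetical helper: copy in[0..len-1] to out, replacing every
 * non-ASCII character with a \[uXXXX] escape.  Bytes that are not
 * valid UTF-8 are treated as ISO-8859-1 (assumption for this sketch).
 * Returns the length written, not counting the terminating NUL,
 * or -1 if out is too small.
 */
static int
encode_line(const char *in, size_t len, char *out, size_t outsz)
{
	const unsigned char *p = (const unsigned char *)in;
	size_t n, used = 0;
	unsigned int cp;
	int w;

	while (len > 0) {
		n = utf8_decode(p, len, &cp);
		if (n == 0) {
			/* Not UTF-8: take the byte as ISO-8859-1. */
			cp = *p;
			n = 1;
		}
		if (cp < 0x80) {
			if (used + 1 >= outsz)
				return -1;
			out[used++] = (char)cp;
		} else {
			w = snprintf(out + used, outsz - used,
			    "\\[u%.4X]", cp);
			if (w < 0 || (size_t)w >= outsz - used)
				return -1;
			used += (size_t)w;
		}
		p += n;
		len -= n;
	}
	if (used >= outsz)
		return -1;
	out[used] = '\0';
	return (int)used;
}
```

With these helpers, the five input bytes "caf" 0xc3 0xa9 come out as
caf\[u00E9], and a lone ISO-8859-1 byte 0xe9 yields the same escape.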
There are three reasons for doing all that:
 * For the average user using the default configuration, that is,
   LC_ALL=C, show reasonable ASCII approximations of the
   occasional UTF-8 and ISO-8859-1 characters showing up in
   ports manuals instead of ??.
 * For users of LC_CTYPE=foo_BAR.UTF-8, in the above situation,
   show non-ASCII glyphs when available, again instead of ??.
 * Make life slightly easier for users reading manuals in
   languages like Russian, Japanese, Chinese, or Greek.
Try, for example,
$ mandoc -aTutf8 /usr/local/man/ru/man6/wesnoth.6
Nothing changes for manuals containing ASCII characters only,
in particular for base manuals.
Since this is a somewhat bigger and user-visible change, i'm
asking whether there are any concerns or comments before committing.
Thanks,
Ingo
Index: Makefile
===
RCS file: /cvs/src/usr.bin/mandoc/Makefile,v
retrieving revision 1.82
diff -u -p -r1.82 Makefile
--- Makefile 27 Aug 2014 00:06:08 - 1.82
+++ Makefile 26 Oct 2014 19:05:12 -
@@ -7,7 +7,7 @@ CFLAGS += -W -Wall -Wstrict-prototypes
DPADD += ${LIBUTIL}
LDADD += -lsqlite3 -lutil
-SRCS= mandoc.c mandoc_aux.c read.c \
+SRCS= mandoc.c mandoc_aux.c preconv.c read.c \
roff.c tbl.c tbl_opts.c tbl_layout.c tbl_data.c eqn.c
SRCS+= mdoc_macro.c mdoc.c mdoc_hash.c \
mdoc_argv.c mdoc_validate.c lib.c att.c \
Index: apropos.1
===
RCS file: /cvs/src/usr.bin/mandoc/apropos.1,v
retrieving revision 1.27
diff -u -p -r1.27 apropos.1
--- apropos.1 3 Sep 2014 05:17:08 - 1.27
+++ apropos.1 26 Oct 2014 19:05:12 -
@@ -79,7 +79,7 @@ to paginate them.
In
.Fl a
mode, the options
-.Fl IOTW
+.Fl IKOTW
described in the
.Xr mandoc 1
manual are also available.
Index: libmandoc.h
===
RCS file: /cvs/src/usr.bin/mandoc/libmandoc.h,v
retrieving revision 1.30
diff -u -p -r1.30 libmandoc.h
--- libmandoc.h 16 Oct 2014 01:10:06 - 1.30
+++ libmandoc.h 26 Oct 2014 19:05:12 -
@@ -30,6 +30,12 @@ enum rofferr {
ROFF_ERR /* badness: puke and stop */
};
+struct buf {
+ char *buf;
+ size_t sz;
+ size_t offs;
+};
+
__BEGIN_DECLS
struct roff;
@@ -62,6 +68,9 @@ int man_parseln(struct man *, int, cha
int man_endparse(struct man *);
int man_addspan(struct man *, const struct tbl_span *);
int man_addeqn(struct man *, const struct eqn *);
+
+int preconv_cue(const struct buf *);
+int preconv_encode(struct buf *, struct buf *, int *);
void roff_free(struct roff *);
struct roff *roff_alloc(struct mparse *, int);
Index: main.c
===
RCS file: /cvs/src/usr.bin/mandoc/main.c,v
retrieving revision 1.101
diff -u -p -r1.101 main.c
--- main.c 18 Oct 2014 15:46:16 - 1.101
+++ main.c 26 Oct 2014 19:05:12 -
@@ -75,6 +75,7 @@ struct curparse {
char outopts[BUFSIZ]; /* buf of output opts */
};
+static int koptions(int *, char *);
int mandocdb(int, char**);
static int moptions(int *, char *);
static void mmsg(enum mandocerr, enum mandoclevel,
@@ -145,14 +146,15 @@ main(int argc, char *argv[])
memset(curp, 0, sizeof(struct curparse));
curp.outtype = OUTT_ASCII;
curp.wlevel = MANDOCLEVEL_FATAL;
- options = MPARSE_SO;
+