Re: native UTF-8 and ISO-8859-1 *input* support for mandoc(1)

2014-10-27 Thread Anthony J. Bentley
Hi Ingo,

Ingo Schwarze writes:
 In ports land, many manual pages contain occasional non-ASCII
 characters - even though i don't consider that a particularly smart
 idea, but let's face it, those characters *are* out there.

I agree that this is appropriate for mandoc to try to handle for a
common, very limited subset of encodings.

 Since this is a somewhat bigger and user-visible change, i'm
 asking whether there are any concerns or comments before committing.

After applying this diff, mandoc -Tutf8 shows U+FFFD anywhere there's a
\ in the source... very obvious in the mdoc(7) page.

 +If not specified, autodetection uses the first match:
 +.Bl -tag -width iso-8859-1
 +.It Cm utf-8
 +if the first three bytes of the input file
 +are the UTF-8 byte order mark (BOM, 0xefbbbf)
 +.It Ar encoding
 +if the first or second line of the input file matches the
 +.Sy emacs
 +mode line format
 +.Pp
 +.D1 .\e -*- Oo ...; Oc coding: Ar encoding ; No -*-
 +.It Cm utf-8
 +if the first non-ASCII byte in the file introduces a valid UTF-8 sequence
 +.It Cm iso-8859-1
 +otherwise
 +.El

I agree with this logic as well. I would be uncomfortable if it got any
more complicated.
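
For reference, the detection order quoted above can be sketched in C
roughly as follows.  This is not code from the diff: guess_encoding(),
utf8_seq_ok() and enum enc are names made up for illustration, the
emacs mode line rule is left out to keep the sketch short, and the
validity check only looks at lead and continuation bytes (it does not
reject overlong forms or surrogates).

```c
#include <stddef.h>

/* Hypothetical names, for illustration only. */
enum enc { ENC_UTF8, ENC_LATIN1 };

/*
 * Return 1 if the n bytes at p begin with a plausible UTF-8
 * multibyte sequence: a lead byte followed by the right number
 * of continuation bytes (10xxxxxx).
 */
static int
utf8_seq_ok(const unsigned char *p, size_t n)
{
	size_t len, i;

	if (n == 0)
		return 0;
	if (p[0] >= 0xc2 && p[0] <= 0xdf)
		len = 2;
	else if (p[0] >= 0xe0 && p[0] <= 0xef)
		len = 3;
	else if (p[0] >= 0xf0 && p[0] <= 0xf4)
		len = 4;
	else
		return 0;	/* stray continuation or invalid lead byte */
	if (n < len)
		return 0;
	for (i = 1; i < len; i++)
		if ((p[i] & 0xc0) != 0x80)
			return 0;
	return 1;
}

/* First match wins, in the order given in the manual text. */
static enum enc
guess_encoding(const unsigned char *buf, size_t sz)
{
	size_t i;

	/* Rule 1: UTF-8 byte order mark. */
	if (sz >= 3 && buf[0] == 0xef && buf[1] == 0xbb && buf[2] == 0xbf)
		return ENC_UTF8;

	/* Rule 2 (emacs mode line) is omitted in this sketch. */

	/* Rule 3: decide on the first non-ASCII byte only. */
	for (i = 0; i < sz; i++)
		if (buf[i] & 0x80)
			return utf8_seq_ok(buf + i, sz - i) ?
			    ENC_UTF8 : ENC_LATIN1;

	/* Rule 4: otherwise; irrelevant for pure ASCII input. */
	return ENC_LATIN1;
}
```

Looking at the first non-ASCII byte only keeps the check cheap, at the
price of misjudging files that mix encodings.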

-- 
Anthony J. Bentley



native UTF-8 and ISO-8859-1 *input* support for mandoc(1)

2014-10-26 Thread Ingo Schwarze
Hi,

you probably know that mandoc(1) has been providing a -Tutf8 *output*
mode for more than three years now.  To *input* non-ASCII characters,
however, you have to encode them as \[u] roff(7) escape sequences,
as documented in mandoc_char(7).

In ports land, many manual pages contain occasional non-ASCII
characters - even though i don't consider that a particularly smart
idea, but let's face it, those characters *are* out there.  There
are even some manual pages in ports completely written in non-latin
scripts; for these, using explicit \[u] escapes for almost every
letter would be rather impractical, so i can't really blame the
authors or translators for not doing that.  The way to read such
pages was to install the preconv(1) utility (contained in textproc/groff
and in portable mandoc) and do stunts like

  preconv -eutf8 utf8_manual_file | mandoc -Tutf8 | less

I doubt many people did that.

The patch below integrates the preconv(1) code, written by kristaps@
in 2011, into mandoc(1), hooking it into the input reading module,
doing the necessary UTF-8 to \[u] encoding on the fly when
encountering non-ASCII characters.  It also does some simple encoding
autodetection such that you will hopefully almost never need the -K
command line option borrowed from groff(1) to specify the input
encoding manually.
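
To illustrate the on-the-fly conversion, here is a minimal,
self-contained sketch of turning UTF-8 input into Unicode escapes of
the \[uXXXX] form listed in mandoc_char(7).  It is not the preconv.c
code from the patch: utf8_decode() and escape_utf8() are hypothetical
names, the fixed-size output buffer is a simplification, and replacing
undecodable bytes with U+FFFD is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Decode one UTF-8 sequence from in[0..n) into *cp.
 * Returns the number of bytes consumed, or 0 if the input is
 * empty or malformed.  Overlong forms and surrogates are not
 * rejected in this simplified version.
 */
static size_t
utf8_decode(const unsigned char *in, size_t n, unsigned int *cp)
{
	size_t len, i;
	unsigned int c;

	if (n == 0)
		return 0;
	if (in[0] < 0x80) {
		*cp = in[0];
		return 1;
	}
	if ((in[0] & 0xe0) == 0xc0) {
		len = 2;
		c = in[0] & 0x1f;
	} else if ((in[0] & 0xf0) == 0xe0) {
		len = 3;
		c = in[0] & 0x0f;
	} else if ((in[0] & 0xf8) == 0xf0) {
		len = 4;
		c = in[0] & 0x07;
	} else
		return 0;
	if (n < len)
		return 0;
	for (i = 1; i < len; i++) {
		if ((in[i] & 0xc0) != 0x80)
			return 0;
		c = (c << 6) | (in[i] & 0x3f);
	}
	*cp = c;
	return len;
}

/*
 * Copy in[0..n) to out, passing ASCII through unchanged and
 * replacing every other character with a \[uXXXX] escape.
 * Returns the length of the NUL-terminated result, or -1 if
 * out is too small.
 */
static int
escape_utf8(const unsigned char *in, size_t n, char *out, size_t outsz)
{
	size_t i = 0, o = 0, used;
	unsigned int cp;
	int w;

	if (outsz == 0)
		return -1;
	while (i < n) {
		used = utf8_decode(in + i, n - i, &cp);
		if (used == 0) {	/* assumption: substitute U+FFFD */
			cp = 0xfffd;
			used = 1;
		}
		if (cp < 0x80) {
			if (o + 1 >= outsz)
				return -1;
			out[o++] = (char)cp;
		} else {
			w = snprintf(out + o, outsz - o, "\\[u%.4X]", cp);
			if (w < 0 || (size_t)w >= outsz - o)
				return -1;
			o += (size_t)w;
		}
		i += used;
	}
	out[o] = '\0';
	return (int)o;
}
```

Since the escapes themselves are plain ASCII, the rest of the parser
does not need to know what encoding the file was in.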

There are three reasons for doing all that:

 * For the average user using the default configuration, that is,
   LC_ALL=C, show reasonable ASCII approximations of the
   occasional UTF-8 and ISO-8859-1 characters showing up in
   ports manuals instead of ??.

 * For users of LC_CTYPE=foo_BAR.UTF-8, in the above situation,
   show non-ASCII glyphs when available, again instead of ??.

 * Make life slightly easier for users reading manuals in
   languages like Russian, Japanese, Chinese, or Greek.
   Try, for example,

     $ mandoc -aTutf8 /usr/local/man/ru/man6/wesnoth.6

Nothing changes for manuals containing ASCII characters only,
in particular for base manuals.

Since this is a somewhat bigger and user-visible change, i'm
asking whether there are any concerns or comments before committing.

Thanks,
  Ingo


Index: Makefile
===
RCS file: /cvs/src/usr.bin/mandoc/Makefile,v
retrieving revision 1.82
diff -u -p -r1.82 Makefile
--- Makefile    27 Aug 2014 00:06:08 -  1.82
+++ Makefile    26 Oct 2014 19:05:12 -
@@ -7,7 +7,7 @@ CFLAGS  += -W -Wall -Wstrict-prototypes 
 DPADD += ${LIBUTIL}
 LDADD  += -lsqlite3 -lutil
 
-SRCS=  mandoc.c mandoc_aux.c read.c \
+SRCS=  mandoc.c mandoc_aux.c preconv.c read.c \
        roff.c tbl.c tbl_opts.c tbl_layout.c tbl_data.c eqn.c
 SRCS+= mdoc_macro.c mdoc.c mdoc_hash.c \
        mdoc_argv.c mdoc_validate.c lib.c att.c \
Index: apropos.1
===
RCS file: /cvs/src/usr.bin/mandoc/apropos.1,v
retrieving revision 1.27
diff -u -p -r1.27 apropos.1
--- apropos.1   3 Sep 2014 05:17:08 -   1.27
+++ apropos.1   26 Oct 2014 19:05:12 -
@@ -79,7 +79,7 @@ to paginate them.
 In
 .Fl a
 mode, the options
-.Fl IOTW
+.Fl IKOTW
 described in the
 .Xr mandoc 1
 manual are also available.
Index: libmandoc.h
===
RCS file: /cvs/src/usr.bin/mandoc/libmandoc.h,v
retrieving revision 1.30
diff -u -p -r1.30 libmandoc.h
--- libmandoc.h 16 Oct 2014 01:10:06 -  1.30
+++ libmandoc.h 26 Oct 2014 19:05:12 -
@@ -30,6 +30,12 @@ enum rofferr {
        ROFF_ERR /* badness: puke and stop */
 };
 
+struct buf {
+   char    *buf;
+   size_t   sz;
+   size_t   offs;
+};
+
 __BEGIN_DECLS
 
 struct roff;
@@ -62,6 +68,9 @@ int man_parseln(struct man *, int, cha
 int man_endparse(struct man *);
 int man_addspan(struct man *, const struct tbl_span *);
 int man_addeqn(struct man *, const struct eqn *);
+
+int preconv_cue(const struct buf *);
+int preconv_encode(struct buf *, struct buf *, int *);
 
 void        roff_free(struct roff *);
 struct roff *roff_alloc(struct mparse *, int);
Index: main.c
===
RCS file: /cvs/src/usr.bin/mandoc/main.c,v
retrieving revision 1.101
diff -u -p -r1.101 main.c
--- main.c  18 Oct 2014 15:46:16 -  1.101
+++ main.c  26 Oct 2014 19:05:12 -
@@ -75,6 +75,7 @@ struct curparse {
        char  outopts[BUFSIZ]; /* buf of output opts */
 };
 
+static int   koptions(int *, char *);
 int  mandocdb(int, char**);
 static int   moptions(int *, char *);
 static void  mmsg(enum mandocerr, enum mandoclevel,
@@ -145,14 +146,15 @@ main(int argc, char *argv[])
        memset(curp, 0, sizeof(struct curparse));
        curp.outtype = OUTT_ASCII;
        curp.wlevel  = MANDOCLEVEL_FATAL;
-   options = MPARSE_SO;
+