Re: [groff] 01/01: preconv: Support Emacs coding tags at file ends.

Ingo Schwarze Thu, 07 May 2020 16:28:09 -0700

Hi Branden,

G. Branden Robinson wrote on Wed, May 06, 2020 at 04:10:33AM -0400:


> commit 04783d5c7184e4c7c85d0e80b077d521c0ac15ed
> Author: G. Branden Robinson <g.branden.robin...@gmail.com>
> AuthorDate: Wed May 6 18:00:55 2020 +1000
> 
>     preconv: Support Emacs coding tags at file ends.
[...]
> diff --git a/src/preproc/preconv/preconv.cpp b/src/preproc/preconv/preconv.cpp
> index 50f1c42..a6e1b00 100644
> --- a/src/preproc/preconv/preconv.cpp
> +++ b/src/preproc/preconv/preconv.cpp
> @@ -813,8 +813,8 @@ get_BOM(FILE *fp, string &BOM, string &data)
>  // or NULL in case no coding tag can occur in the data
>  // (which is stored unmodified in 'data').
>  // ---------------------------------------------------------
> -char *
> -get_tag_lines(FILE *fp, string &data)
> +static char *
> +get_early_tag_lines(FILE *fp, string &data)
>  {
>    int newline_count = 0;
>    int c, prev = -1;
> @@ -934,8 +934,111 @@ get_variable_value_pair(char *d1, char **variable, char 
> **value)
>    return NULL;
>  }
>  
> +// Get coding tag from Emacs local variables list at end of file.
> +//
> +// The region looks like this:
> +//
> +// Local Variables:
> +// coding: latin-2
> +// mode: nroff
> +// End:
> +//
> +// Like Emacs, we search at most 3000 bytes from the end of the file, or
> +// from the last form-feed control (^L) that occurs.
> +//
> +// Our string class doesn't support reverse searches so just use C
> +// strings.
> +static char *
> +get_late_coding_tag(FILE *fp)
> +{
> +  char *coding_tag = NULL;
> +  const int limit = 3000;
> +  if (fseek(fp, 0, SEEK_END) != 0)
> +    return NULL;
> +  // Seek to `limit` bytes from the end of the buffer, or the beginning.
> +  if (fseek(fp, -limit, SEEK_END) != 0)
> +    if (errno == EINVAL)
> +      rewind(fp);
> +    else
> +      return NULL;

This looks suspicious.  Roff, and UNIX in general, is famous for using
an ingenious feature called "pipes".  However, pipes are not seekable,
so the above fseek(3) will fail with an error code like ESPIPE or EBADF.
That implies even when

  preconv tmp.roff

successfully finds the coding at the end of the file,

  cat tmp.roff | preconv

will silently ignore the coding at the end of the file.
The severity of that restriction may be somewhat mitigated by the
fact that preconv(1) is often the first preprocessor in the pipe,
but are we really sure that that nobody relies on the possibility
to run their own preprocessor before preconv(1)?  And that nobody
pipes input into groff(1) on stdin?  If somebody does either of
these, the above difference in behaviour looks like a vicious trap
that would be quite hard to debug for users.  So i'm not sure
encouraging people to put the encoding specifier at the end of files
is a wise move.  It is definitely less robust and likely less
portable, too.

In any cases, my impression is that the rest of the code has been
carefully written to make sure it works with pipes and does not
seek or rewind - except for uchardet support, which was also
introduced somewhat recently and may also have caused some breakage,
i didn't check in detail right now.

> +  char *tmpbuf = (char *) calloc(1, limit + 1 /* trailing '\0' */);
> +  if (!tmpbuf) {
> +    error("unable to allocate memory");
> +    rewind(fp);
> +    return NULL;
> +  }
> +  (void) fread(tmpbuf, 1, limit, fp);
> +  if (ferror(fp)) {
> +    error("file read error");
> +    free(tmpbuf);
> +    rewind(fp);
> +    return NULL;
> +  }
> +  char *start = tmpbuf;
> +  char *end = tmpbuf + strlen(tmpbuf);
> +  char *ff = strrchr(tmpbuf, '\f');
> +  if (ff)
> +    start = ff;
> +  // Find the _last_ occurrence of a local-variables section in the
> +  // buffer, because the document might have Emacs file-local variables
> +  // as a discussion topic, as our roff(7) man page does.
> +  //
> +  // strcasestr() is a GNU extension we're not using.  TODO: Gnulib has
> +  // it, so we can have it, too.
> +  char *lv = NULL, *nextlv = NULL;
> +  const char lvstr[] = "Local Variables:";
> +  // Declare these now because GCC 8 doesn't like `goto`s crossing them.
> +  const char codingstr[] = "coding:";
> +  // From here we must 'goto cleanup' to free our buffer and rewind the
> +  // file position instead of returning early.
> +  lv = strstr(start, lvstr);
> +  if (!lv)
> +    goto cleanup;
> +  else
> +    do {
> +      start += strlen(lvstr);

There is a logical bug here.
You want to move to the end of "Local Variables:",
so what you mean is

        start = lv + strlen(lvstr);

rather than

        start = start + strlen(lvstr);

> +      nextlv = strstr(start, lvstr);
> +      if (nextlv) {
> +     lv = nextlv;
> +     start = lv;
> +      }
> +    } while(nextlv);
> +  end = strstr(start, "End:");
> +  if (!end)
> +    end = strstr(start, "end:");
> +  if (!end)
> +    goto cleanup;
> +  // Tighten [start, end) bracket until only the coding string remains.
> +  // Locate "coding:".
> +  start = strstr(start, codingstr);
> +  if (!start)
> +    goto cleanup;
> +  // Move past it.
> +  start += strlen(codingstr);
> +  // Skip horizontal whitespace.
> +  while (strchr(" \t", *start))
> +    start++;
> +  // Find the next newline and advance the end pointer to it.
> +  end = strchr(start, '\n');
> +  if (!end)
> +    end = strchr(start, '\r');
> +  if (!end)
> +    goto cleanup;
> +  // Back up over any trailing whitespace.
> +  do {
> +    *end = '\0';
> +    end--;
> +  } while ((end > start) && strchr(" \t", *end));
> +  if (start < end)
> +    coding_tag = start;
> +cleanup:
> +  free(tmpbuf);
> +  rewind(fp);
> +  return coding_tag;
> +}

This is a more serious bug, a potential security vulnerability:
Use after free.
First, you set coding_tag = start, which is inside tmpbuf.
Then, you free(tmpbuf), so now coding_tag points to invalid memory.
But then you return the invalid pointer.

By contrast, check_early_coding_tag() returns a pointer obtained
with get_variable_value_pair(), i.e. a pointer to a buffer
declared "static" inside a function.  That is ugly and somewhat
error-prone because later calls of get_variable_value_pair()
would silently overwrite the results from earlier calls, but
since get_variable_value_pair() is only called from one single
place, it is not incorrect.

On OpenBSD, which is careful to catch as many use-after-free bugs
as possible, preconv now prints garbage:

schwarze@isnote $ cat tmp.roff  
test
.\" Local Variables:
.\" coding: utf-8
.\" mode: nroff
.\" End:
schwarze@isnote $ ./preconv -d tmp.roff 2>&1 | hexdump -C
00000000  64 65 66 61 75 6c 74 20  65 6e 63 6f 64 69 6e 67  |default encoding|
00000010  3a 20 27 6c 61 74 69 6e  31 27 0a 66 69 6c 65 20  |: 'latin1'.file |
00000020  27 74 6d 70 2e 72 6f 66  66 27 3a 0a 20 20 63 6f  |'tmp.roff':.  co|
00000030  64 69 6e 67 20 74 61 67  3a 20 27 df df df df df  |ding tag: '.....|
00000040  df df df df df df df df  df df df df df df df df  |................|
*
000003d0  df df df df 27 0a 20 20  65 6e 63 6f 64 69 6e 67  |....'.  encoding|
000003e0  20 75 73 65 64 3a 20 27  df df df df df df df df  | used: '........|
000003f0  df df df df df df df df  df df df df df df df df  |................|
*
00000440  df df df df df df df df  df df df 27 0a 2e 2f 70  |...........'../p|
00000450  72 65 63 6f 6e 76 3a 20  65 72 72 6f 72 3a 20 65  |reconv: error: e|
00000460  6e 63 6f 64 69 6e 67 20  73 79 73 74 65 6d 20 27  |ncoding system '|
00000470  df df df df df df df df  df df df df df df df df  |................|
*
000004d0  df df df 27 20 6e 6f 74  20 73 75 70 70 6f 72 74  |...' not support|
000004e0  65 64 20 62 79 20 69 63  6f 6e 76 28 29 0a 2e 6c  |ed by iconv()..l|
000004f0  66 20 31 20 74 6d 70 2e  72 6f 66 66 0a           |f 1 tmp.roff.|
000004fd

> +cleanup:
> +  free(tmpbuf);
> +  rewind(fp);
> +  return coding_tag;
> +}

And the rewind(3) above is yet another bug.
The function check_coding_tag() is called from do_file().
But before do_file() calls check_coding_tag(), it already calls
the function get_BOM() which starts reading from the input file
and stores the input in the data buffer.
So when you rewind(3), that implies that the first four bytes in
the input file will be read, processed, and written twice, which
can be harmless in some cases, for example when the file starts
with a comment, but which can have all kinds of weird effects in
other cases, for example when the file starts with a macro like .TH.

After fixing the use after free and the pointer arithmetic error,
but not fixing the rewind(3) bug, here is what your patch does:

  schwarze@isnote $ ./preconv -d tmp.roff
  default encoding: 'latin1'
  file 'tmp.roff':
    coding tag: 'utf-8'
    encoding used: 'UTF-8'
  .lf 1 tmp.roff
  t\[u00E9]st\[u00E9]st     # <<< note the duplicate first four bytes
  .\" Local Variables:
  .\" coding: utf-8
  .\" mode: nroff
  .\" End:

  schwarze@isnote $ cat tmp.roff | ./preconv -d 
  default encoding: 'latin1'
  standard input:
    no coding tag           # <<< note that the coding: is not found
    could not detect encoding with uchardet
    encoding used: 'ISO-8859-1'
  .lf 1 -
  t\[u00C3]\[u00A9]st       # <<< note the incorrect use of latin1
  .\" Local Variables:
  .\" coding: utf-8
  .\" mode: nroff
  .\" End:

I hope you don't take this as an insult, i appreciate when people are
working on the code, but i start feeling doubtful whether you are
experienced enough to commit source code changes (as opposed to
documentation changes) without an OK from another developer.

Committing
 * a use after free
 * a logical bug that implies that you didn't fully understand
   how the calling code works
 * another, less severe pointer arithmetic bug
 * and a functional change that might be contoversial because
   it reduces robustness without even mentioning the tradeoff
   in the commit message
all in a single patch feels like quite a bit of trouble.

Regarding how to proceed with this particular patch, i'm not convinced
an outright backout is required.

I'd suggest that we first decide whether we want to encourage people
to put the coding at the end even though that clearly reduces
robustness and possibly harms portability.  Myself, i would probably
argue slightly against encouraging that, but i don't feel very
strongly about it, so if others really hate coding tags at the more
robust place at the beginning, i can live with such a change.

If we decide that we don't want coding at the end, then the whole
thing can be backed out.

If we decide that we do want coding at the end, then i think Branden
can fix the three bugs in the tree with follow-up commits.  The
rewind(3) bug is probably the hardest to fix cleanly, but all should
be doable.  (I hope i found all bugs contained in the patch, but
i'm not completely sure yet whether i did...)

What do you think?

Yours,
  Ingo

Re: [groff] 01/01: preconv: Support Emacs coding tags at file ends.

Reply via email to