Hi Branden, G. Branden Robinson wrote on Wed, May 06, 2020 at 04:10:33AM -0400:
> commit 04783d5c7184e4c7c85d0e80b077d521c0ac15ed > Author: G. Branden Robinson <g.branden.robin...@gmail.com> > AuthorDate: Wed May 6 18:00:55 2020 +1000 > > preconv: Support Emacs coding tags at file ends. [...] > diff --git a/src/preproc/preconv/preconv.cpp b/src/preproc/preconv/preconv.cpp > index 50f1c42..a6e1b00 100644 > --- a/src/preproc/preconv/preconv.cpp > +++ b/src/preproc/preconv/preconv.cpp > @@ -813,8 +813,8 @@ get_BOM(FILE *fp, string &BOM, string &data) > // or NULL in case no coding tag can occur in the data > // (which is stored unmodified in 'data'). > // --------------------------------------------------------- > -char * > -get_tag_lines(FILE *fp, string &data) > +static char * > +get_early_tag_lines(FILE *fp, string &data) > { > int newline_count = 0; > int c, prev = -1; > @@ -934,8 +934,111 @@ get_variable_value_pair(char *d1, char **variable, char > **value) > return NULL; > } > > +// Get coding tag from Emacs local variables list at end of file. > +// > +// The region looks like this: > +// > +// Local Variables: > +// coding: latin-2 > +// mode: nroff > +// End: > +// > +// Like Emacs, we search at most 3000 bytes from the end of the file, or > +// from the last form-feed control (^L) that occurs. > +// > +// Our string class doesn't support reverse searches so just use C > +// strings. > +static char * > +get_late_coding_tag(FILE *fp) > +{ > + char *coding_tag = NULL; > + const int limit = 3000; > + if (fseek(fp, 0, SEEK_END) != 0) > + return NULL; > + // Seek to `limit` bytes from the end of the buffer, or the beginning. > + if (fseek(fp, -limit, SEEK_END) != 0) > + if (errno == EINVAL) > + rewind(fp); > + else > + return NULL; This looks suspicious. Roff, and UNIX in general, is famous for using an ingenious feature called "pipes". However, pipes are not seekable, so the above fseek(3) will fail with an error code like ESPIPE or EBADF. That implies even when preconv tmp.roff successfully finds the coding at the end of the file, cat tmp.roff | preconv will silently ignore the coding at the end of the file. The severity of that restriction may be somewhat mitigated by the fact that preconv(1) is often the first preprocessor in the pipe, but are we really sure that that nobody relies on the possibility to run their own preprocessor before preconv(1)? And that nobody pipes input into groff(1) on stdin? If somebody does either of these, the above difference in behaviour looks like a vicious trap that would be quite hard to debug for users. So i'm not sure encouraging people to put the encoding specifier at the end of files is a wise move. It is definitely less robust and likely less portable, too. In any cases, my impression is that the rest of the code has been carefully written to make sure it works with pipes and does not seek or rewind - except for uchardet support, which was also introduced somewhat recently and may also have caused some breakage, i didn't check in detail right now. > + char *tmpbuf = (char *) calloc(1, limit + 1 /* trailing '\0' */); > + if (!tmpbuf) { > + error("unable to allocate memory"); > + rewind(fp); > + return NULL; > + } > + (void) fread(tmpbuf, 1, limit, fp); > + if (ferror(fp)) { > + error("file read error"); > + free(tmpbuf); > + rewind(fp); > + return NULL; > + } > + char *start = tmpbuf; > + char *end = tmpbuf + strlen(tmpbuf); > + char *ff = strrchr(tmpbuf, '\f'); > + if (ff) > + start = ff; > + // Find the _last_ occurrence of a local-variables section in the > + // buffer, because the document might have Emacs file-local variables > + // as a discussion topic, as our roff(7) man page does. > + // > + // strcasestr() is a GNU extension we're not using. TODO: Gnulib has > + // it, so we can have it, too. > + char *lv = NULL, *nextlv = NULL; > + const char lvstr[] = "Local Variables:"; > + // Declare these now because GCC 8 doesn't like `goto`s crossing them. > + const char codingstr[] = "coding:"; > + // From here we must 'goto cleanup' to free our buffer and rewind the > + // file position instead of returning early. > + lv = strstr(start, lvstr); > + if (!lv) > + goto cleanup; > + else > + do { > + start += strlen(lvstr); There is a logical bug here. You want to move to the end of "Local Variables:", so what you mean is start = lv + strlen(lvstr); rather than start = start + strlen(lvstr); > + nextlv = strstr(start, lvstr); > + if (nextlv) { > + lv = nextlv; > + start = lv; > + } > + } while(nextlv); > + end = strstr(start, "End:"); > + if (!end) > + end = strstr(start, "end:"); > + if (!end) > + goto cleanup; > + // Tighten [start, end) bracket until only the coding string remains. > + // Locate "coding:". > + start = strstr(start, codingstr); > + if (!start) > + goto cleanup; > + // Move past it. > + start += strlen(codingstr); > + // Skip horizontal whitespace. > + while (strchr(" \t", *start)) > + start++; > + // Find the next newline and advance the end pointer to it. > + end = strchr(start, '\n'); > + if (!end) > + end = strchr(start, '\r'); > + if (!end) > + goto cleanup; > + // Back up over any trailing whitespace. > + do { > + *end = '\0'; > + end--; > + } while ((end > start) && strchr(" \t", *end)); > + if (start < end) > + coding_tag = start; > +cleanup: > + free(tmpbuf); > + rewind(fp); > + return coding_tag; > +} This is a more serious bug, a potential security vulnerability: Use after free. First, you set coding_tag = start, which is inside tmpbuf. Then, you free(tmpbuf), so now coding_tag points to invalid memory. But then you return the invalid pointer. By contrast, check_early_coding_tag() returns a pointer obtained with get_variable_value_pair(), i.e. a pointer to a buffer declared "static" inside a function. That is ugly and somewhat error-prone because later calls of get_variable_value_pair() would silently overwrite the results from earlier calls, but since get_variable_value_pair() is only called from one single place, it is not incorrect. On OpenBSD, which is careful to catch as many use-after-free bugs as possible, preconv now prints garbage: schwarze@isnote $ cat tmp.roff test .\" Local Variables: .\" coding: utf-8 .\" mode: nroff .\" End: schwarze@isnote $ ./preconv -d tmp.roff 2>&1 | hexdump -C 00000000 64 65 66 61 75 6c 74 20 65 6e 63 6f 64 69 6e 67 |default encoding| 00000010 3a 20 27 6c 61 74 69 6e 31 27 0a 66 69 6c 65 20 |: 'latin1'.file | 00000020 27 74 6d 70 2e 72 6f 66 66 27 3a 0a 20 20 63 6f |'tmp.roff':. co| 00000030 64 69 6e 67 20 74 61 67 3a 20 27 df df df df df |ding tag: '.....| 00000040 df df df df df df df df df df df df df df df df |................| * 000003d0 df df df df 27 0a 20 20 65 6e 63 6f 64 69 6e 67 |....'. encoding| 000003e0 20 75 73 65 64 3a 20 27 df df df df df df df df | used: '........| 000003f0 df df df df df df df df df df df df df df df df |................| * 00000440 df df df df df df df df df df df 27 0a 2e 2f 70 |...........'../p| 00000450 72 65 63 6f 6e 76 3a 20 65 72 72 6f 72 3a 20 65 |reconv: error: e| 00000460 6e 63 6f 64 69 6e 67 20 73 79 73 74 65 6d 20 27 |ncoding system '| 00000470 df df df df df df df df df df df df df df df df |................| * 000004d0 df df df 27 20 6e 6f 74 20 73 75 70 70 6f 72 74 |...' not support| 000004e0 65 64 20 62 79 20 69 63 6f 6e 76 28 29 0a 2e 6c |ed by iconv()..l| 000004f0 66 20 31 20 74 6d 70 2e 72 6f 66 66 0a |f 1 tmp.roff.| 000004fd > +cleanup: > + free(tmpbuf); > + rewind(fp); > + return coding_tag; > +} And the rewind(3) above is yet another bug. The function check_coding_tag() is called from do_file(). But before do_file() calls check_coding_tag(), it already calls the function get_BOM() which starts reading from the input file and stores the input in the data buffer. So when you rewind(3), that implies that the first four bytes in the input file will be read, processed, and written twice, which can be harmless in some cases, for example when the file starts with a comment, but which can have all kinds of weird effects in other cases, for example when the file starts with a macro like .TH. After fixing the use after free and the pointer arithmetic error, but not fixing the rewind(3) bug, here is what your patch does: schwarze@isnote $ ./preconv -d tmp.roff default encoding: 'latin1' file 'tmp.roff': coding tag: 'utf-8' encoding used: 'UTF-8' .lf 1 tmp.roff t\[u00E9]st\[u00E9]st # <<< note the duplicate first four bytes .\" Local Variables: .\" coding: utf-8 .\" mode: nroff .\" End: schwarze@isnote $ cat tmp.roff | ./preconv -d default encoding: 'latin1' standard input: no coding tag # <<< note that the coding: is not found could not detect encoding with uchardet encoding used: 'ISO-8859-1' .lf 1 - t\[u00C3]\[u00A9]st # <<< note the incorrect use of latin1 .\" Local Variables: .\" coding: utf-8 .\" mode: nroff .\" End: I hope you don't take this as an insult, i appreciate when people are working on the code, but i start feeling doubtful whether you are experienced enough to commit source code changes (as opposed to documentation changes) without an OK from another developer. Committing * a use after free * a logical bug that implies that you didn't fully understand how the calling code works * another, less severe pointer arithmetic bug * and a functional change that might be contoversial because it reduces robustness without even mentioning the tradeoff in the commit message all in a single patch feels like quite a bit of trouble. Regarding how to proceed with this particular patch, i'm not convinced an outright backout is required. I'd suggest that we first decide whether we want to encourage people to put the coding at the end even though that clearly reduces robustness and possibly harms portability. Myself, i would probably argue slightly against encouraging that, but i don't feel very strongly about it, so if others really hate coding tags at the more robust place at the beginning, i can live with such a change. If we decide that we don't want coding at the end, then the whole thing can be backed out. If we decide that we do want coding at the end, then i think Branden can fix the three bugs in the tree with follow-up commits. The rewind(3) bug is probably the hardest to fix cleanly, but all should be doable. (I hope i found all bugs contained in the patch, but i'm not completely sure yet whether i did...) What do you think? Yours, Ingo