3] mailinfo: refactor commit message processing

Junio C Hamano Fri, 16 Sep 2016 12:13:37 -0700

Jonathan Tan <jonathanta...@google.com> writes:

> Within the processing of the commit message, check for a scissors line
> or a patchbreak line first (before checking for in-body headers) so that
> a subsequent patch modifying the processing of in-body headers would not
> cause a scissors line or patchbreak line to be misidentified.
>
> If a line could be both an in-body header and a scissors line (for
> example, "From: -- >8 --"), this is considered a fatal error
> (previously, it would be interpreted as an in-body header).


The scissors line is designed to allow garbage other than scissors
and perforation marks to be on the same line, i.e.

        /*
         * The mark must be at least 8 bytes long (e.g. "-- >8 --").
         * Even though there can be arbitrary cruft on the same line
         * (e.g. "cut here"), in order to avoid misidentification, the
         * perforation must occupy more than a third of the visible
         * width of the line, and dashes and scissors must occupy more
         * than half of the perforation.
         */

Even though it is not likely for people to do so, it would probably
be nicer if we can treat

        From: -- >8 -- cut -- >8 -- >8 -- here -- >8 --

as a scissors line instead of making it a fatal error, by treating
that "From:" as just a random garbage.

But this is a minor point.  It is not worth to make it work like so
if the resulting code will become messier.

> The following enumeration shows that processing is the same except (as
> described above) the in-body header + scissors line case.
>
> o in-body header (check_header OK)
>   o passes UTF-8 conversion
>     o [described above] is scissors line
>     o [not possible] is patchbreak line
>     o [not possible] is blank line
>     o is none of the above - processed as header
>   o fails UTF-8 conversion - processed as header
> o not in-body header
>   o passes UTF-8 conversion
>     o is scissors line - processed as scissors
>     o is patchbreak line - processed as patchbreak
>     o is blank line - ignored if in header_stage
>     o is none of the above - log message
>   o fails UTF-8 conversion - input error
>
> As for the result left in "line" (after the invocation of
> handle_commit_msg), it is unused (by its caller, handle_filter, and by
> handle_filter's callers, handle_boundary and handle_body) unless this
> line is a patchbreak line, in which case handle_patch is subsequently
> called (in handle_filter) on "line". In this case, "line" must have
> passed UTF-8 conversion both before and after this patch, so the result
> is still the same overall.
>
> Signed-off-by: Jonathan Tan <jonathanta...@google.com>
> ---
>  mailinfo.c | 145 
> ++++++++++++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 115 insertions(+), 30 deletions(-)
>
> diff --git a/mailinfo.c b/mailinfo.c
> index e19abe3..23a56c2 100644
> --- a/mailinfo.c
> +++ b/mailinfo.c
> @@ -340,23 +340,56 @@ static struct strbuf *decode_b_segment(const struct 
> strbuf *b_seg)
>       return out;
>  }
>  
> -static int convert_to_utf8(struct mailinfo *mi,
> -                        struct strbuf *line, const char *charset)
> +/*
> + * Attempts to convert line into UTF-8, storing the result in line.
> + *
> + * This differs from convert_to_utf8 in that conversion non-success is not
> + * considered an error case - mi->input_error is not set, and no error 
> message
> + * is printed.
> + *
> + * If the conversion is unnecessary, returns 0 and stores NULL in old_buf (if
> + * old_buf is not NULL).
> + *
> + * If the conversion is successful, returns 0 and stores the unconverted 
> string
> + * in old_buf and old_len (if they are respectively not NULL).
> + *
> + * If the conversion is unsuccessful, returns -1.
> + */
> +static int try_convert_to_utf8(const struct mailinfo *mi, struct strbuf 
> *line,
> +                            const char *charset, char **old_buf,
> +                            size_t *old_len)
>  {
> -     char *out;
> +     char *utf8;
>  
> -     if (!mi->metainfo_charset || !charset || !*charset)
> +     if (!mi->metainfo_charset || !charset || !*charset ||
> +         same_encoding(mi->metainfo_charset, charset)) {
> +             if (old_buf)
> +                     *old_buf = NULL;
>               return 0;
> +     }
>  
> -     if (same_encoding(mi->metainfo_charset, charset))
> +     utf8 = reencode_string(line->buf, mi->metainfo_charset, charset);
> +     if (utf8) {
> +             char *temp = strbuf_detach(line, old_len);
> +             if (old_buf)
> +                     *old_buf = temp;
> +             strbuf_attach(line, utf8, strlen(utf8), strlen(utf8));
>               return 0;
> -     out = reencode_string(line->buf, mi->metainfo_charset, charset);
> -     if (!out) {
> +     }
> +     return -1;
> +}
> +
> +/*
> + * Converts line into UTF-8, setting mi->input_error to -1 upon failure.
> + */
> +static int convert_to_utf8(struct mailinfo *mi,
> +                        struct strbuf *line, const char *charset)
> +{
> +     if (try_convert_to_utf8(mi, line, charset, NULL, NULL)) {
>               mi->input_error = -1;
>               return error("cannot convert from %s to %s",
>                            charset, mi->metainfo_charset);
>       }
> -     strbuf_attach(line, out, strlen(out), strlen(out));
>       return 0;
>  }

Please split this part into its own patch.  IIUC, it moves the meat
of convert_to_utf8() to a more silent try_convert_to_utf8() and then
makes the former a thin wrapper of the latter.  Which by itself is a
good change but does not have anything to do with "fix handling of
the in-body headers", other than that the main fix wants to have
such a more silent helper for its own use.


> @@ -515,6 +548,13 @@ static int check_header(struct mailinfo *mi,
>       return ret;
>  }
>  
> +static int check_header_raw(struct mailinfo *mi,
> +                         char *buf, size_t len,
> +                         struct strbuf *hdr_data[], int overwrite) {
> +     const struct strbuf sb = {0, len, buf};
> +     return check_header(mi, &sb, hdr_data, overwrite);
> +}

IIUC, this is a helper for callers that do not have a strbuf but
instead have <buf, len> pair to perform the same check_header() the
callers that have strbuf can do.

As check_header() uses the strbuf as a read-only entity, wrapping
the <buf, len> pair in a temporary strbuf like this is safe.

The incoming <buf> should conceptually be "const char *", but it's
OK.

If check_header() didn't call any helper function that gets passed
&sb as a strbuf, or if convertiong the helper function to take a
<buf, len> pair instead, I would actually suggest refactoring this
the other way around, though.  That is, move the implementation of
check_header() to this function, updating its reference to line->buf
and line->len to reference to <buf> and <len>, and then make
check_header() a thin wrapper that does

        check_header(mi, const struct strbuf *line,
                     struct strbuf *hdr_data[], int overwrite)
        {
                return check_header_raw(mi, line->buf, line->len,
                                        hdr_data, overwrite);
        }

I didn't check how involved to update cmp_header() to take <buf,len>
pair.  If it does not look too bad, then I think I would prefer to
do it that way, and as before, make that conversion a separate
preparatory patch.

> @@ -623,32 +663,48 @@ static int is_scissors_line(const struct strbuf *line)
>               gap * 2 < perforation);
>  }
>  
> -static int handle_commit_msg(struct mailinfo *mi, struct strbuf *line)
> +static int resembles_rfc2822_header(const struct strbuf *line)
>  {
> -     assert(!mi->filter_stage);
> +     char *c;
>  
> -     if (mi->header_stage) {
> -             if (!line->len || (line->len == 1 && line->buf[0] == '\n'))
> +     if (!isalpha(line->buf[0]))
> +             return 0;
> +
> +     for (c = line->buf + 1; *c != 0; c++) {
> +             if (*c == ':')
> +                     return 1;
> +             else if (*c != '-' && !isalpha(*c))
>                       return 0;
>       }
> +     return 0;
> +}

Is this helper supposed to handle any rfc2822 looking header, or
only the ones we expect to see as in-body header?


> -     if (mi->use_inbody_headers && mi->header_stage) {
> -             mi->header_stage = check_header(mi, line, mi->s_hdr_data, 0);
> -             if (mi->header_stage)
> -                     return 0;
> -     } else
> -             /* Only trim the first (blank) line of the commit message
> -              * when ignoring in-body headers.
> -              */
> -             mi->header_stage = 0;
> +static int handle_commit_msg(struct mailinfo *mi, struct strbuf *line)
> +{
> +     int ret = 0;
> +     int utf8_result;
> +     char *old_buf;
> +     size_t old_len;
> +
> +     assert(!mi->filter_stage);
>  
> -     /* normalize the log message to UTF-8. */
> -     if (convert_to_utf8(mi, line, mi->charset.buf))
> -             return 0; /* mi->input_error already set */
> +     /*
> +      * Obtain UTF8 for scissors line and patchbreak checks, but retain the
> +      * undecoded line in case we need to process it as an in-body header.
> +      */
> +     utf8_result = try_convert_to_utf8(mi, line, mi->charset.buf, &old_buf,
> +                                       &old_len);

Just a minor style suggestion.  As <old_buf, old_len> come in a
pair, fold the line before them, so that the readers can easily
see the association between them.  I.e.

        utf8_result = try_convert_to_utf8(mi, line, mi->charset.buf,
                                          &old_buf, &old_len);


> -     if (mi->use_scissors && is_scissors_line(line)) {
> +     if (!utf8_result && mi->use_scissors && is_scissors_line(line)) {
>               int i;
>  
> +             if (resembles_rfc2822_header(line))
> +                     /*
> +                      * Explicitly reject scissor lines that resemble a RFC
> +                      * 2822 header, to avoid being prone to error.
> +                      */
> +                     die("scissors line resembles RFC 2822 header");
> +

I guess "disambiguate to favor scissors" is not that difficult ;-)

>               strbuf_setlen(&mi->log_message, 0);
>               mi->header_stage = 1;
>  
> @@ -661,18 +717,47 @@ static int handle_commit_msg(struct mailinfo *mi, 
> struct strbuf *line)
>                               strbuf_release(mi->s_hdr_data[i]);
>                       mi->s_hdr_data[i] = NULL;
>               }
> -             return 0;
> +             goto handle_commit_msg_out;
>       }
> -
> -     if (patchbreak(line)) {
> +     if (!utf8_result && patchbreak(line)) {
>               if (mi->message_id)
>                       strbuf_addf(&mi->log_message,
>                                   "Message-Id: %s\n", mi->message_id);
> -             return 1;
> +             ret = 1;
> +             goto handle_commit_msg_out;
>       }
>  
> +     if (mi->header_stage) {
> +             char *buf = old_buf ? old_buf : line->buf;
> +             if (buf[0] == 0 || (buf[0] == '\n' && buf[1] == 0))
> +                     goto handle_commit_msg_out;
> +     }
> +
> +     if (mi->use_inbody_headers && mi->header_stage) {
> +             char *buf = old_buf ? old_buf : line->buf;
> +             size_t len = old_buf ? old_len : line->len;
> +             mi->header_stage = check_header_raw(mi, buf, len,
> +                                                 mi->s_hdr_data, 0);

Just a minor comment, but I guess check_header_raw() refactoring is
not strictly needed after all, as this callsite can wrap <buf,len>
into a temporary strbuf.

Unlike the real header that is read in read_one_header_line() inside
a loop to implement line folding before check_header() is called, we
call check_header() before possibly-foldable header lines is fully
assembled into one header.  Probably it comes in later patches, I
guess.

It is not immediately obvious to me how this step helps further work
done by later patches in the series until I read them, but so far
what this patch did looks understandable to me ;-)

Thanks.


> +             if (mi->header_stage)
> +                     goto handle_commit_msg_out;
> +     } else
> +             /* Only trim the first (blank) line of the commit message
> +              * when ignoring in-body headers.
> +              */
> +             mi->header_stage = 0;
> +
> +     /* If adding as a log message, conversion to UTF-8 is required. */
> +     if (utf8_result) {
> +             mi->input_error = -1;
> +             error("cannot convert from %s to %s",
> +                   mi->charset.buf, mi->metainfo_charset);
> +             goto handle_commit_msg_out;
> +     }
>       strbuf_addbuf(&mi->log_message, line);
> -     return 0;
> +
> +handle_commit_msg_out:
> +     free(old_buf);
> +     return ret;
>  }
>  
>  static void handle_patch(struct mailinfo *mi, const struct strbuf *line)

Re: [RFC/PATCH 1/3] mailinfo: refactor commit message processing

Reply via email to