Re: [PATCH v2 2/2] mailinfo: unescape quoted-pair in header fields

2016-09-25 Thread Kevin Daudt
On Mon, Sep 26, 2016 at 12:38:42AM +0200, Jakub Narębski wrote:
> W dniu 25.09.2016 o 22:17, Kevin Daudt pisze:
> > On Fri, Sep 23, 2016 at 12:15:41AM -0400, Jeff King wrote:
> 
> >> Oops, yes. It is beginning to make the "strbuf_swap()" look less
> >> convoluted. :)
> >>
> > 
> > I've switched to strbuf_swap now, much better. I've implemented
> > recursive parsing without looking at what you provided, just to see what
> > I'd came up with. Though I've not implemented a recursive descent
> > parser, but it might suffice.
> 
> I think you can implement a parser handling proper nesting of parens
> without recursion.
> 
> Though... what is the definition in the RFC?

This part describes comments.

ccontent=   ctext / quoted-pair / comment

comment =   "(" *([FWS] ccontent) [FWS] ")"

CFWS=   *([FWS] comment) (([FWS] comment) / FWS)

So each comment can itself also contain a comment.

This could be done without recursion by keeping a count of how many open
parens we have encountered.

Kevin


Re: [PATCH v2 2/2] mailinfo: unescape quoted-pair in header fields

2016-09-25 Thread Jakub Narębski
W dniu 25.09.2016 o 22:17, Kevin Daudt pisze:
> On Fri, Sep 23, 2016 at 12:15:41AM -0400, Jeff King wrote:

>> Oops, yes. It is beginning to make the "strbuf_swap()" look less
>> convoluted. :)
>>
> 
> I've switched to strbuf_swap now, much better. I've implemented
> recursive parsing without looking at what you provided, just to see what
> I'd came up with. Though I've not implemented a recursive descent
> parser, but it might suffice.

I think you can implement a parser handling proper nesting of parens
without recursion.

Though... what is the definition in the RFC?
-- 
Jakub Narębski



Re: [PATCH v2 2/2] mailinfo: unescape quoted-pair in header fields

2016-09-25 Thread Kevin Daudt
On Fri, Sep 23, 2016 at 12:15:41AM -0400, Jeff King wrote:
> On Thu, Sep 22, 2016 at 03:17:23PM -0700, Junio C Hamano wrote:
> 
> > Jeff King  writes:
> > 
> > > On Mon, Sep 19, 2016 at 08:54:40PM +0200, Kevin Daudt wrote:
> > >
> > >> + ...
> > >> +while ((c = *in++) != 0) {
> > >> +if (take_next_literally) {
> > >> +take_next_literally = 0;
> > >> +} else {
> > >> [...]
> > >> +}
> > >> +
> > >> +strbuf_addch(line, c);
> > >> +}
> > >> +}
> > >
> > > It needs to `free(in)` at the end of the function.
> > 
> > Ehh, in has been incremented and is pointing at the terminating NUL
> > there, so it would be more like
> > 
> > char *to_free, *in;
> > 
> > to_free = strbuf_detach(line, NULL);
> > in = to_free;
> > ...
> > while ((c = *in++)) {
> > ...
> > }
> > free(to_free);
> > 
> > I would think ;-).
> 
> Oops, yes. It is beginning to make the "strbuf_swap()" look less
> convoluted. :)
> 

I've switched to strbuf_swap now, much better. I've implemented
recursive parsing without looking at what you provided, just to see what
I'd came up with. Though I've not implemented a recursive descent
parser, but it might suffice.

I'm sending the patches now.



Re: [PATCH v2 2/2] mailinfo: unescape quoted-pair in header fields

2016-09-22 Thread Jeff King
On Thu, Sep 22, 2016 at 03:17:23PM -0700, Junio C Hamano wrote:

> Jeff King  writes:
> 
> > On Mon, Sep 19, 2016 at 08:54:40PM +0200, Kevin Daudt wrote:
> >
> >> + ...
> >> +  while ((c = *in++) != 0) {
> >> +  if (take_next_literally) {
> >> +  take_next_literally = 0;
> >> +  } else {
> >> [...]
> >> +  }
> >> +
> >> +  strbuf_addch(line, c);
> >> +  }
> >> +}
> >
> > It needs to `free(in)` at the end of the function.
> 
> Ehh, in has been incremented and is pointing at the terminating NUL
> there, so it would be more like
> 
>   char *to_free, *in;
> 
> to_free = strbuf_detach(line, NULL);
> in = to_free;
>   ...
> while ((c = *in++)) {
>   ...
>   }
> free(to_free);
> 
> I would think ;-).

Oops, yes. It is beginning to make the "strbuf_swap()" look less
convoluted. :)

-Peff


Re: [PATCH v2 2/2] mailinfo: unescape quoted-pair in header fields

2016-09-22 Thread Junio C Hamano
Jeff King  writes:

> On Mon, Sep 19, 2016 at 08:54:40PM +0200, Kevin Daudt wrote:
>
>> + ...
>> +while ((c = *in++) != 0) {
>> +if (take_next_literally) {
>> +take_next_literally = 0;
>> +} else {
>> [...]
>> +}
>> +
>> +strbuf_addch(line, c);
>> +}
>> +}
>
> It needs to `free(in)` at the end of the function.

Ehh, in has been incremented and is pointing at the terminating NUL
there, so it would be more like

char *to_free, *in;

to_free = strbuf_detach(line, NULL);
in = to_free;
...
while ((c = *in++)) {
...
}
free(to_free);

I would think ;-).




Re: [PATCH v2 2/2] mailinfo: unescape quoted-pair in header fields

2016-09-21 Thread Jeff King
On Mon, Sep 19, 2016 at 08:54:40PM +0200, Kevin Daudt wrote:

> diff --git a/mailinfo.c b/mailinfo.c
> index e19abe3..6a7c2f2 100644
> --- a/mailinfo.c
> +++ b/mailinfo.c
> @@ -54,6 +54,50 @@ static void parse_bogus_from(struct mailinfo *mi, const 
> struct strbuf *line)
>   get_sane_name(>name, >name, >email);
>  }
>  
> +static void unquote_quoted_string(struct strbuf *line)
> +{
> + const char *in = strbuf_detach(line, NULL);

I see that this version uses the "detach, and then write into the
replacement" approach, which is good. But...

> + int c, take_next_literally = 0;
> + int found_error = 0;
> +
> + /*
> +  * Stores the character that started the escape mode so that we know 
> what
> +  * character will stop it
> +  */
> + char escape_context = 0;
> +
> + while ((c = *in++) != 0) {
> + if (take_next_literally) {
> + take_next_literally = 0;
> + } else {
> [...]
> + }
> +
> + strbuf_addch(line, c);
> + }
> +}

It needs to `free(in)` at the end of the function.

Your original also fed "line->len" as a hint, but I doubt it really
matters in practice, so I don't mind losing that.

-Peff


Re: [PATCH v2 2/2] mailinfo: unescape quoted-pair in header fields

2016-09-19 Thread Jeff King
On Mon, Sep 19, 2016 at 08:54:40PM +0200, Kevin Daudt wrote:

> diff --git a/t/t5100/comment.expect b/t/t5100/comment.expect
> new file mode 100644
> index 000..1197e76
> --- /dev/null
> +++ b/t/t5100/comment.expect
> @@ -0,0 +1,5 @@
> +Author: A U Thor (this is a comment (really))

Hmm. I don't see any recursion in your parsing, so after the first ")"
our escape_context would be 0 again, right? So a more tricky test is:

  Author: A U Thor (this is a comment (really) with \(quoted\) pairs)

We are still inside "ctext" when we hit those quoted pairs, and they
should be unquoted, but your code would not do so (unless we go the
route of simply unquoting pairs everywhere).

I think your parser would have to follow the BNF more closely with a
recursive descent parser, like:

  const char *parse_comment(const char *in, struct strbuf *out)
  {
size_t orig_out = out->len;

if ((in = parse_char('(', in, out))) &&
(in = parse_ccontent(in, out)) &&
(in = parse_char(')', in, out
return in;

strbuf_setlen(out, orig_out);
return NULL;
  }

  const char *parse_ccontent(const char *in, struct strbuf *out)
  {
while (*in && *in != ')') {
const char *next;

if ((next = parse_quoted_pair(in, out)) ||
(next = parse_comment(in, out)) ||
(next = parse_ctext(in, out))) {
in = next;
continue;
}
}

/*
 * if "in" is NUL here we have an unclosed comment; but we'll
 * just silently ignore and accept it
 */
return in;
  }

  const char *parse_char(char c, const char *in, struct strbuf *out)
  {
if (*in != c)
return NULL;
strbuf_addch(out, c);
return in + 1;
  }

You can probably guess at the implementation of parse_quoted_pair(),
parse_ctext(), etc (and naturally, the above is completely untested and
probably has some bugs in it).

In a former life (back when it was still rfc822!) I remember
implementing a similar parser, which I think was in turn based on the
cclient code in pine. It's not _too_ hard to get it all right based on
the BNF in the RFC, but as you can see it's a bit tedious. And I'm not
convinced we actually need it to be completely right for our purposes.
We really are looking for a single address, with the email in "<>" and
the name as everything before that, but de-quoted.

-Peff


Re: [PATCH v2 2/2] mailinfo: unescape quoted-pair in header fields

2016-09-19 Thread Junio C Hamano
Junio C Hamano  writes:

> Kevin Daudt  writes:
>
>> +static void unquote_quoted_string(struct strbuf *line)
>> +{
>> +const char *in = strbuf_detach(line, NULL);
>> +int c, take_next_literally = 0;
>> +int found_error = 0;
> ...
>> +}
>> +}
>
> The additional comment makes it very clear what is going on.
>
> Is it an event unusual enough that is worth warning() about if we
> have either take_next_literally or escape_context set to non-NUL
> upon leaving the loop, I wonder?
>
> Will queue.  Thanks.

It turns out that found_error is not used anywhere and tripped the
-Werror=unused-variable check.  I've removed that line while
queuing.

Thanks.


Re: [PATCH v2 2/2] mailinfo: unescape quoted-pair in header fields

2016-09-19 Thread Junio C Hamano
Kevin Daudt  writes:

> +static void unquote_quoted_string(struct strbuf *line)
> +{
> + const char *in = strbuf_detach(line, NULL);
> + int c, take_next_literally = 0;
> + int found_error = 0;
> +
> + /*
> +  * Stores the character that started the escape mode so that we know 
> what
> +  * character will stop it
> +  */
> + char escape_context = 0;
> +
> + while ((c = *in++) != 0) {
> + if (take_next_literally) {
> + take_next_literally = 0;
> + } else {
> + switch (c) {
> + case '"':
> + if (!escape_context)
> + escape_context = '"';
> + else if (escape_context == '"')
> + escape_context = 0;
> + continue;
> + case '\\':
> + if (escape_context) {
> + take_next_literally = 1;
> + continue;
> + }
> + break;
> + case '(':
> + if (!escape_context)
> + escape_context = '(';
> + break;
> + case ')':
> + if (escape_context == '(')
> + escape_context = 0;
> + break;
> + }
> + }
> +
> + strbuf_addch(line, c);
> + }
> +}

The additional comment makes it very clear what is going on.

Is it an event unusual enough that is worth warning() about if we
have either take_next_literally or escape_context set to non-NUL
upon leaving the loop, I wonder?

Will queue.  Thanks.




[PATCH v2 2/2] mailinfo: unescape quoted-pair in header fields

2016-09-19 Thread Kevin Daudt
rfc2822 has provisions for quoted strings in structured header fields,
but also allows for escaping these with so-called quoted-pairs.

The only thing git currently does is removing exterior quotes, but
quotes within are left alone.

Remove exterior quotes and remove escape characters so that they don't
show up in the author field.

Signed-off-by: Kevin Daudt 
---
 mailinfo.c   | 46 
 t/t5100-mailinfo.sh  | 14 ++
 t/t5100/comment.expect   |  5 +
 t/t5100/comment.in   |  9 +
 t/t5100/quoted-string.expect |  5 +
 t/t5100/quoted-string.in |  9 +
 6 files changed, 88 insertions(+)
 create mode 100644 t/t5100/comment.expect
 create mode 100644 t/t5100/comment.in
 create mode 100644 t/t5100/quoted-string.expect
 create mode 100644 t/t5100/quoted-string.in

diff --git a/mailinfo.c b/mailinfo.c
index e19abe3..6a7c2f2 100644
--- a/mailinfo.c
+++ b/mailinfo.c
@@ -54,6 +54,50 @@ static void parse_bogus_from(struct mailinfo *mi, const 
struct strbuf *line)
get_sane_name(>name, >name, >email);
 }
 
+static void unquote_quoted_string(struct strbuf *line)
+{
+   const char *in = strbuf_detach(line, NULL);
+   int c, take_next_literally = 0;
+   int found_error = 0;
+
+   /*
+* Stores the character that started the escape mode so that we know 
what
+* character will stop it
+*/
+   char escape_context = 0;
+
+   while ((c = *in++) != 0) {
+   if (take_next_literally) {
+   take_next_literally = 0;
+   } else {
+   switch (c) {
+   case '"':
+   if (!escape_context)
+   escape_context = '"';
+   else if (escape_context == '"')
+   escape_context = 0;
+   continue;
+   case '\\':
+   if (escape_context) {
+   take_next_literally = 1;
+   continue;
+   }
+   break;
+   case '(':
+   if (!escape_context)
+   escape_context = '(';
+   break;
+   case ')':
+   if (escape_context == '(')
+   escape_context = 0;
+   break;
+   }
+   }
+
+   strbuf_addch(line, c);
+   }
+}
+
 static void handle_from(struct mailinfo *mi, const struct strbuf *from)
 {
char *at;
@@ -63,6 +107,8 @@ static void handle_from(struct mailinfo *mi, const struct 
strbuf *from)
strbuf_init(, from->len);
strbuf_addbuf(, from);
 
+   unquote_quoted_string();
+
at = strchr(f.buf, '@');
if (!at) {
parse_bogus_from(mi, from);
diff --git a/t/t5100-mailinfo.sh b/t/t5100-mailinfo.sh
index 27bf3b8..8c21434 100755
--- a/t/t5100-mailinfo.sh
+++ b/t/t5100-mailinfo.sh
@@ -144,4 +144,18 @@ test_expect_success 'mailinfo unescapes with --mboxrd' '
test_cmp expect mboxrd/msg
 '
 
+test_expect_success 'mailinfo handles rfc2822 quoted-string' '
+   mkdir quoted-string &&
+   git mailinfo /dev/null /dev/null <"$DATA"/quoted-string.in \
+   >quoted-string/info &&
+   test_cmp "$DATA"/quoted-string.expect quoted-string/info
+'
+
+test_expect_success 'mailinfo handles rfc2822 comment' '
+   mkdir comment &&
+   git mailinfo /dev/null /dev/null <"$DATA"/comment.in \
+   >comment/info &&
+   test_cmp "$DATA"/comment.expect comment/info
+'
+
 test_done
diff --git a/t/t5100/comment.expect b/t/t5100/comment.expect
new file mode 100644
index 000..1197e76
--- /dev/null
+++ b/t/t5100/comment.expect
@@ -0,0 +1,5 @@
+Author: A U Thor (this is a comment (really))
+Email: someb...@example.com
+Subject: testing comments
+Date: Sun, 25 May 2008 00:38:18 -0700
+
diff --git a/t/t5100/comment.in b/t/t5100/comment.in
new file mode 100644
index 000..430ba97
--- /dev/null
+++ b/t/t5100/comment.in
@@ -0,0 +1,9 @@
+From 1234567890123456789012345678901234567890 Mon Sep 17 00:00:00 2001
+From: "A U Thor"  (this is a comment \(really\))
+Date: Sun, 25 May 2008 00:38:18 -0700
+Subject: [PATCH] testing comments
+
+
+
+---
+patch
diff --git a/t/t5100/quoted-string.expect b/t/t5100/quoted-string.expect
new file mode 100644
index 000..cab1bce
--- /dev/null
+++ b/t/t5100/quoted-string.expect
@@ -0,0 +1,5 @@
+Author: Author "The Author" Name
+Email: someb...@example.com
+Subject: testing quoted-pair
+Date: Sun, 25 May 2008 00:38:18 -0700
+
diff --git a/t/t5100/quoted-string.in