On Mon, Jul 30, 2012 at 02:30:35PM +0200, Thomas Rast wrote:

> > Removing this line
> > s/_/ /g;
> > here
> > https://github.com/git/git/blob/master/git-send-email.perl#L867
> >
> > Solves this problem for me. But I really don't have any clue, what
> > kind of side effects this modification on "sub unquote_rfc2047" might
> > have.
> 
> It would prevent spaces from being decoded correctly if the encoding
> program chooses to make the '_'.  git-format-patch does not actually do
> this, see the big comment around pretty.c:304.

Yeah, I think we need to be prepared to decode arbitrary rfc2047. Even
though most of our input will come from format-patch, people can munge
the input between format-patch and send-email.

> I think this patch would be a better match for what RFC2047 specifies.
> On the one hand it avoids substituting _ outside of encodings, but OTOH
> it also handles more than one encoded-word.  It still does not handle
> the case where there are several encoded-words of *different* encodings,
> but who would do such a crazy thing?
> 
> diff --git i/git-send-email.perl w/git-send-email.perl
> index ef30c55..88c4758 100755
> --- i/git-send-email.perl
> +++ w/git-send-email.perl
> @@ -862,11 +862,13 @@ sub make_message_id {
>  sub unquote_rfc2047 {
>       local ($_) = @_;
>       my $encoding;
> -     if (s/=\?([^?]+)\?q\?(.*)\?=/$2/g) {
> +     s{=\?([^?]+)\?q\?(.*)\?=}{
>               $encoding = $1;
> -             s/_/ /g;
> -             s/=([0-9A-F]{2})/chr(hex($1))/eg;
> -     }
> +             my $e = $2;
> +             $e =~ s/_/ /g;
> +             $e =~ s/=([0-9A-F]{2})/chr(hex($1))/eg;
> +             $e;
> +     }eg;
>       return wantarray ? ($_, $encoding) : $_;

That is definitely an improvement. The existing code was just plain
wrong to assume that the whole string would be rfc2047 encoded, as that
is never the case (e.g., I don't think it is even legal to encode the
addr-spec part).

You are right that the result does not handle multiple encoded chunks
with different encodings. However, it also does not handle a single
encoded chunk with non-encoded bits. E.g., consider:

  From: =?UTF-8?q?Some=20N=C3=A1me?= <y...@example.com>

We will return:

  ("Some NĂ¡me <y...@example.com>", "UTF-8")

Note that the second half is unencoded ASCII, but we have simply
concatenated it with the encoded bit and claimed the whole thing at
UTF-8. This works for utf-8, of course, because it is an ASCII superset.
But what about utf-16? You now have incompatible mixed encodings in a
single string. Which of course is only the tip of the iceberg; we later
may stick the author field into the body, which does not necessarily
have the same encoding (and there is a big "uh oh we should reencode"
argument that does not actually do so).

Furthermore, even if we do use an ASCII superset, the callers do not use
the encoding field very well. They do things like:

  if (unquote_rfc2047($addr) eq $sender) {

So we are comparing the unquoted address, in some encoding (that we
throw away) to whatever we have in $sender, which may have come from the
author ident, or the command-line, or the expansion of a MUA alias. So
we would fail to match utf-8 and latin1 versions of the same identity.

I think the only sane thing for unquote_rfc2047 to do would be to
actually return just a string in some canonical encoding (i.e., utf8),
and iconv each of the bits separately from its appropriate
encoding. And then we assume that everything internal is in the same
encoding and we can just compare strings to get the right answer. And
outgoing, we produce encoded utf8 for the headers, or match the body for
an in-body "From" header.

Of course, that means we would have to start using perl's encoding
functions, which we do not use currently. I think Encode has been in the
perl base system long enough for us to rely on it, but I am not very up
to speed on encoding issues in perl. And nobody has been complaining
about those issues, which I assume means everybody is just using utf8
for outgoing messages.  So it might not be worth the trouble to fix.

I also notice that we do not handle the 'b' rfc2047 encoding at all.
We do handle this in mailinfo for incoming messages, but again, I guess
everybody just uses the more common 'q' for outgoing (since it's mostly
generated by git, anyway). So if nobody is complaining, I'd be inclined
to just leave it.

So after all that rambling, I think my response is "yes, your patch is a
step in the right direction, and we should do it". There are still a ton
of unsupported setups, but really, if we are going to handle everything,
I'd rather see us make the jump to using one of the existing
mail-handling modules rather than reinventing the wheel. And I know
that some people really like the no-dependencies aspect of what we have
currently.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to