Re: git format-patch doesn't add Content-type for UTF-8 diffs

Jeff King Mon, 30 Jun 2014 10:31:42 -0700

On Mon, Jun 30, 2014 at 02:03:25AM -0700, Paul Eggert wrote:

> I've been having trouble sending my Git-generated patches to the tz mailing
> list.  Patches containing UTF-8 text are garbled, e.g., if you visit
> <http://mm.icann.org/pipermail/tz/2014-June/021086.html> you'll see
> "ÃœrÃ¼mqi" where the patch actually had "Ürümqi".
> 
> I've tracked this down to the fact that "git format-patch" isn't outputting
> a Content-Type: line in the outgoing email.  I thought it was supposed to do
> that; the man page implies that it does.


format-patch will add a content-type header if the commit message
contains non-ascii characters, and is marked as an alternate encoding
(usually this is utf8, but you can use i18n.commitEncoding to store them
in a different format).

However, it doesn't look at the filenames or diff contents at all. If it
were to do so, it would have to guess at the correct encoding, since git
doesn't know anything about the encoding of filenames or contents.
Worse, you could actually have several different encodings, across
multiple files in the same diff.

Typically, the next stage in the pipeline is to give the output to
send-email, or to a MUA. Send-email will detect high-bit characters in
this case and ask you which encoding you want. Many MUAs will do some
kind of auto-detection and fill in the content-type (e.g., I know that
mutt handles this correctly).

How do you send the mails after they come out of format-patch?

> Here's how I can reproduce the bug with the git 1.9.3 that's shipped with
> Fedora 20.  Notice that the patch is missing the line "Content-Type:
> text/plain; charset=UTF-8" that the git-format-patch man page implies it
> should be generating, and this causes the ICANN email software to
> misinterpret the patch's character set encoding.
> [...]

Thanks for the reproduction recipe. While I think we have to accept that
some hard cases (e.g., multiple encodings in a single diff) can't be
handled cleanly, it would be really nice if this all-utf8 case worked
out of the box. And perhaps the complex cases could use binary diffs
when we see multiple encodings.

One tricky thing about the implementation is that we stream the output
from format-patch, and write the content-type header (if any) before we
start opening the blobs for diff.

I wonder if it would be enough to do:

  1. Always add a content-type header, even if the commit is utf-8 and
     contains only ascii characters. This _shouldn't_ hurt anything,
     though I suppose it would if you have latin1 (for example) commit
     messages and did not correctly set the encoding header in your
     commits.

  2. When producing diff header lines do not respect core.quotepath if
     the filename does is not valid for the encoding we claimed earlier.

  3. When producing lines of textual diff, use a binary diff if the
     contents are not valid for the encoding we claimed earlier.

That would make the utf-8 case "just work", and would prevent us from
ever sending malformed contents (i.e., mismatched encodings in the
commit message and diff contents). However, it is not perfect:

  1. Right now if you send a diff for a latin1 file but do not use any
     non-ascii characters in your commit message (and do not set
     i18n.commitEncoding, so it is "utf8"), you get no claimed encoding
     in the email. If your receiving end is OK with that, everything
     works, and you get to see text diffs.

     So my scheme would be a slight regression there. But it is somewhat
     of an accident waiting to happen. If you ever use a utf-8 character
     in your commit message, that particular email will be marked as
     utf-8, and your diff will be broken.

  2. We can only check "is it valid?" for the encoding. That works well
     with utf-8, which has rules. But for something like latin1 versus
     another "use a code page for the high-bit bytes" type of encoding,
     we cannot really tell the difference. However, I do not think we
     are making anything _worse_ there. You'd already get mojibake in
     such a case.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: git format-patch doesn't add Content-type for UTF-8 diffs

Reply via email to