On Thu, Apr 02, 2015 at 05:49:24PM -0700, Kyle J. McKay wrote:

> Subject: [PATCH v2] diff-highlight: do not split multibyte characters
> 
> When the input is UTF-8 and Perl is operating on bytes instead
> of characters, a diff that changes one multibyte character to
> another that shares an initial byte sequence will result in a
> broken diff display as the common byte sequence prefix will be
> separated from the rest of the bytes in the multibyte character.

Thanks, I had a feeling we should be able to do something with perl's
builtin utf8 support.  This doesn't help people with other encodings,
but I'm not sure the original was all that helpful either (in that we
don't actually _know_ the file encodings in the first place).

I briefly confirmed that this seems to do the right thing on po/bg.po,
which has a couple of sheared characters when viewed with the existing
code.

I timed this one versus the existing diff-highlight. It's about 7%
slower. That's not great, but is acceptable to me. The String::Multibyte
version was a lot faster, which was nice (but I'm still unclear on
_why_).

> Fix this by putting Perl into character mode when splitting the
> line and then back into byte mode after the split is finished.

I also wondered if we could simply put stdin into utf8 mode. But it
looks like it will barf whenever it gets invalid utf8. Checking for
valid utf8 and only doing the multi-byte split in that case (as you do
here) is a lot more robust.

> While the utf8::xxx functions are built-in and do not require
> any 'use' statement, the utf8::is_utf8 function did not appear
> until Perl 5.8.1, but is identical to the Encode::is_utf8
> function which is available in 5.8 so we use that instead of
> utf8::is_utf8.

Makes sense. I'm happy enough listing perl 5.8 as a dependency.

EungJun, does this version meet your needs?

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to