On Sun, Feb 25, 2018 at 08:44:46PM -0500, Jeff King wrote:
> On Sat, Feb 24, 2018 at 04:18:36PM +0100, Lars Schneider wrote:
>
> > > We always use the in-repo contents when generating 'diff'. I think
> > > by "attribute to be used in diff", what you are really after is
> > > to convert the in-repo contents to that encoding _BEFORE_ running
> > > 'diff' on it. E.g. in-repo UTF-16 that can have NUL bytes all over
> > > the place will not diff well with the xdiff machinery, but if you
> > > first convert it to UTF-8 and have xdiff work on it, you can get
> > > reasonable result out of it. It is unclear what encoding you want
> > > your final diff output in (it is equally valid in such a set-up to
> > > desire your patch output in UTF-16 or UTF-8), but assuming that you
> > > want UTF-8 in your patch output, perhaps we do not have to break
> > > gitk users by hijacking the 'encoding' attribute. Instead what you
> > > want is a single bit that says between in-repo or working tree which
> > > representation should be given to the xdiff machinery.
> >
> > I fear that we could confuse users with an additional knob/bit that
> > defines what we diff against. Git always diff'ed against in-repo
> > content and I feel it should stay that way.
>
> Well, except for textconv. You can already do this:
>
> echo "foo diff=utf16" >.gitattributes
> git config diff.utf16.textconv 'iconv -f utf16 -t utf8'
>
> We could make that easier to use and much more efficient by:
>
> 1. Allowing a special syntax for textconv filters that kicks off an
> internal iconv.
>
> 2. Providing baked-in config for utf16.
>
> The patch below provides a sketch. But I think Torsten raised a good
> point that you might want the encoding conversion to be independent of
> other diff characteristics (so, e.g., you might say "this is utf16 but
> once converted treat it like C code for finding funcnames, etc").
>
> ---
> diff --git a/diff.c b/diff.c
> index 21c3838b25..04032e059c 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -5968,6 +5968,21 @@ struct diff_filepair *diff_unmerge(struct diff_options *options, const char *pat
>  	return pair;
>  }
>
> +static char *iconv_textconv(const char *encoding, struct diff_filespec *spec,
> +			    size_t *outsize)
> +{
> +	char *ret;
> +	int outsize_int; /* this really should be a size_t */
> +
> +	if (diff_populate_filespec(spec, 0))
> +		die("unable to load content for %s", spec->path);
> +	ret = reencode_string_len(spec->data, spec->size,
> +				  "utf-8", /* should be log_output_encoding? */
> +				  encoding, &outsize_int);
> +	*outsize = outsize_int;
> +	return ret;
> +}
> +
>  static char *run_textconv(const char *pgm, struct diff_filespec *spec,
>  		size_t *outsize)
>  {
> @@ -5978,6 +5993,9 @@ static char *run_textconv(const char *pgm, struct diff_filespec *spec,
>  	struct strbuf buf = STRBUF_INIT;
>  	int err = 0;
>
> +	if (skip_prefix(pgm, "iconv:", &pgm))
> +		return iconv_textconv(pgm, spec, outsize);
> +
>  	temp = prepare_temp_file(spec->path, spec);
>  	*arg++ = pgm;
>  	*arg++ = temp->name;
> diff --git a/userdiff.c b/userdiff.c
> index dbfb4e13cd..48fa7e8bdd 100644
> --- a/userdiff.c
> +++ b/userdiff.c
> @@ -161,6 +161,7 @@ IPATTERN("css",
>  	 "-?[_a-zA-Z][-_a-zA-Z0-9]*" /* identifiers */
>  	 "|-?[0-9]+|\\#[0-9a-fA-F]+" /* numbers */
>  ),
> +{ "utf16", NULL, -1, { NULL, 0 }, NULL, "iconv:utf16" },
>  { "default", NULL, -1, { NULL, 0 } },
>  };
> #undef PATTERNS

The patch looks like a possible step in the right direction.
Some minor notes: "utf8" is better written as "UTF-8" when talking
to iconv; the same goes for "utf16" ("UTF-16").
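
To illustrate the naming point, here is a minimal standalone sketch (not
part of the patch above) that hands encoding names directly to
iconv_open(). The helper name to_utf8() is made up for this mail, and I
am assuming that which aliases (such as "utf8") are accepted depends on
the iconv implementation, while the "UTF-8" and "UTF-16LE" spellings
should be understood everywhere I know of:

```c
#include <iconv.h>
#include <stdlib.h>

/*
 * Hypothetical helper for illustration only: convert `inlen` bytes in
 * encoding `from` to a NUL-terminated UTF-8 string.
 * Returns NULL on failure; the caller frees the result.
 */
char *to_utf8(const char *from, const char *in, size_t inlen)
{
	/* iconv_open() takes the target encoding first, then the source. */
	iconv_t cd = iconv_open("UTF-8", from);
	/* Assumes UTF-16 input, which grows by at most 1.5x as UTF-8. */
	size_t outleft = inlen * 2;
	char *inp = (char *)in;
	char *out, *outp;

	if (cd == (iconv_t)-1)
		return NULL; /* unknown encoding name */

	out = malloc(outleft + 1);
	if (!out) {
		iconv_close(cd);
		return NULL;
	}
	outp = out;

	if (iconv(cd, &inp, &inlen, &outp, &outleft) == (size_t)-1) {
		free(out);
		iconv_close(cd);
		return NULL;
	}
	*outp = '\0';
	iconv_close(cd);
	return out;
}
```

Calling to_utf8("UTF-16LE", buf, len) on little-endian UTF-16 data
without a BOM should give back plain UTF-8; an unknown encoding name
makes iconv_open() fail and the helper return NULL.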
But how do I activate the diff?
I have in .gitattributes:
    XXXenglish.txt diff=UTF-16
and in .git/config:
    [diff "UTF-16"]
        command = iconv:UTF-16
What am I doing wrong?