Re: [PATCH v7 0/7] convert: add support for different encodings

2018-03-04 Thread Torsten Bögershausen
On 2018-02-28 14:21, Jeff King wrote:
> On Wed, Feb 28, 2018 at 09:20:05AM +0100, Torsten Bögershausen wrote:
> 
>>>   2. auto-detect utf-16 (your patch)
>>>  - Just Works for existing repositories storing utf-16
>>>
>>>  - carries some risk of kicking in when people would like it not to
>>>(e.g., when they really do want a binary patch that can be
>>>applied).
>>
>> The binary patch is still supported, but that detail may need some more 
>> explanation
>> in the commit message. Please see  t4066-diff-encoding.sh
> 
> Yeah, but if you don't have binary-patches enabled we'd generate a bogus
> patch. Which, granted, without that you wouldn't be able to apply the
> patch either. But somehow it feels funny to me to generate something
> that _looks_ like a patch but you can't actually apply.
> 
> I also think we'd want a plan for this to be used consistently in other
> diff-like tools. E.g., "git blame" uses textconv for the starting file
> content, and it would be nice for this to kick in then, too. Ditto for
> things like grep, pickaxe, etc.
> 
> I have some patches that reuse some of the textconv infrastructure for
> this, which should mostly make it "just work" everywhere. They need a
> little more polishing before I post them, but you can take a look at:
> 
>   https://github.com/peff/git.git jk/textconv-utf16
> 
> if you want.
> 
> -Peff
> 

Thanks for your work (I actually found some time to take look)

I am looking at the code to put 2 or 3 things on top of it:
- test case(s)
- documentation
- teach diff to add a line "b is converted to UTF-8 from UTF-16"
- teach apply to reads & understands the encoding line and throws
  in a "reencode_string_len() like your patch does

This would keep "git diff | git apply" happy.
All in all the changes do not look too invasive, at least from my point of view.





Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-28 Thread Jeff King
On Wed, Feb 28, 2018 at 09:42:27AM -0800, Junio C Hamano wrote:

> > I also think we'd want a plan for this to be used consistently in other
> > diff-like tools. E.g., "git blame" uses textconv for the starting file
> > content, and it would be nice for this to kick in then, too. Ditto for
> > things like grep, pickaxe, etc.
> 
> You probably do not want to limit your thinking to the generation
> side.  It is entirely plausible to have "we deal with diff in this
> encoding X" in addition to "the in-repo encoding for this project is
> this encoding Y" and "the working tree encoding for this path is Z"
> and allow them to interact in "git diff | git apply" pipeline.
> 
> "diff/format-patch --stdout/etc." on the upstream would first iconv
> Y to X and feed the contents in X to xdiff machinery, which is sent
> down the pipe and received by apply, which reads the preimage from
> the disk or from the repository.  If doing "apply" without
> "--cached/--index", the preimage data from the disk would go through
> iconv Z to X.  If doing "apply --cached/--index", the preimage data
> from the repo would go through iconv Y to X.  The incoming patch is
> in X, so we apply, and the resulting postimage will be re-encoded in
> Z in the working tree and Y in the repository.

I agree that would be convenient, but I have to wonder if all the
complexity is worth it to maintain the idea of a distinct in-repo
representation. It seems like it would open up a ton of corner cases.
And I suspect most people would be happy enough with either a
clean/smudge style worktree conversion or a textconv-style view.

So if somebody wants to work on it, I don't want to stop them. But I
think there's room for the simpler solutions in the meantime.

-Peff


Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-28 Thread Lars Schneider

> On 27 Feb 2018, at 22:25, Jeff King  wrote:
> 
> On Tue, Feb 27, 2018 at 10:05:17PM +0100, Torsten Bögershausen wrote:
> 
> Of the three solutions, I think the relative merits are something like
> this:
> 
>  1. baked-in textconv (my patch)
> 
> - reuses an existing diff feature, so minimal code and not likely to
>   break things
> 
> - requires people to add a .gitattributes entry
> 
> - needs work to make bare-repo .gitattributes work (though I think
>   this would be useful for other features, too)
> 
> - has a run-time cost at each diff to do the conversion
> 
> - may sometimes annoy people when it doesn't kick in (e.g.,
>   emailed patches from format-patch won't have a readable diff)
> 
> - doesn't combine with other custom-diff config (e.g., utf-16
>   storing C code should still use diff=c funcname rules, but
>   wouldn't with my patch)
> 
>  2. auto-detect utf-16 (your patch)
> - Just Works for existing repositories storing utf-16
> 
> - carries some risk of kicking in when people would like it not to
>   (e.g., when they really do want a binary patch that can be
>   applied).
> 
>   I think it would probably be OK if this kicked in only when
>   ALLOW_TEXTCONV is set (the default for porcelain), and --binary
>   is not (i.e., when we would otherwise just say "binary
>   files differ").
> 
> - similar to (1), carries a run-time cost for each diff, and users
>   may sometimes still see binary diffs
> 
>  3. w-t-e (Lars's patch)
> 
> - requires no server-side modifications; the diff is plain vanilla
> 
> - works everywhere you diff, plumbing and porcelain
> 
> - does require people to add a .gitattributes entry
> 
> - run-time cost is per-checkout, not per-diff
> 
> So I can see room for (3) to co-exist alongside the others. Between (1)
> and (2), I think (2) is probably the better direction.

Thanks for the great summary! I agree they could co-exist and people
could use whatever works best for them.

I'll incorporate Eric's feedback and send a w-t-e v9 soonish.

- Lars




Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-28 Thread Junio C Hamano
Jeff King  writes:

>> The binary patch is still supported, but that detail may need some more 
>> explanation
>> in the commit message. Please see  t4066-diff-encoding.sh
>
> Yeah, but if you don't have binary-patches enabled we'd generate a bogus
> patch. Which, granted, without that you wouldn't be able to apply the
> patch either. But somehow it feels funny to me to generate something
> that _looks_ like a patch but you can't actually apply.

True.  And at least you _could_ apply a properly formatted binary
patch to the original.

> I also think we'd want a plan for this to be used consistently in other
> diff-like tools. E.g., "git blame" uses textconv for the starting file
> content, and it would be nice for this to kick in then, too. Ditto for
> things like grep, pickaxe, etc.

You probably do not want to limit your thinking to the generation
side.  It is entirely plausible to have "we deal with diff in this
encoding X" in addition to "the in-repo encoding for this project is
this encoding Y" and "the working tree encoding for this path is Z"
and allow them to interact in "git diff | git apply" pipeline.

"diff/format-patch --stdout/etc." on the upstream would first iconv
Y to X and feed the contents in X to xdiff machinery, which is sent
down the pipe and received by apply, which reads the preimage from
the disk or from the repository.  If doing "apply" without
"--cached/--index", the preimage data from the disk would go through
iconv Z to X.  If doing "apply --cached/--index", the preimage data
from the repo would go through iconv Y to X.  The incoming patch is
in X, so we apply, and the resulting postimage will be re-encoded in
Z in the working tree and Y in the repository.



Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-28 Thread Jeff King
On Wed, Feb 28, 2018 at 09:20:05AM +0100, Torsten Bögershausen wrote:

> >   2. auto-detect utf-16 (your patch)
> >  - Just Works for existing repositories storing utf-16
> > 
> >  - carries some risk of kicking in when people would like it not to
> >(e.g., when they really do want a binary patch that can be
> >applied).
> 
> The binary patch is still supported, but that detail may need some more 
> explanation
> in the commit message. Please see  t4066-diff-encoding.sh

Yeah, but if you don't have binary-patches enabled we'd generate a bogus
patch. Which, granted, without that you wouldn't be able to apply the
patch either. But somehow it feels funny to me to generate something
that _looks_ like a patch but you can't actually apply.

I also think we'd want a plan for this to be used consistently in other
diff-like tools. E.g., "git blame" uses textconv for the starting file
content, and it would be nice for this to kick in then, too. Ditto for
things like grep, pickaxe, etc.

I have some patches that reuse some of the textconv infrastructure for
this, which should mostly make it "just work" everywhere. They need a
little more polishing before I post them, but you can take a look at:

  https://github.com/peff/git.git jk/textconv-utf16

if you want.

-Peff


Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-28 Thread Torsten Bögershausen
On Tue, Feb 27, 2018 at 04:25:38PM -0500, Jeff King wrote:
> On Tue, Feb 27, 2018 at 10:05:17PM +0100, Torsten Bögershausen wrote:
> 
> > The other question is:
> > Would this help showing diffs of UTF-16 encoded files on a "git hoster",
> > github/bitbucket/ ?
> 
> Almost. There's probably one more thing needed. We don't currently read
> in-tree .gitattributes when doing a diff in a bare repository. And most
> hosting sites will store bare repositories.
> 
> And of course it would require the users to actually set the attributes
> themselves.
> 
> > Or would the auto-magic UTF-16 avoid binary patch that I send out be more 
> > helpful ?
> > Or both ?
> > Or the w-t-e encoding ?
> 
> Of the three solutions, I think the relative merits are something like
> this:
> 
>   1. baked-in textconv (my patch)
> 
>  - reuses an existing diff feature, so minimal code and not likely to
>break things
> 
>  - requires people to add a .gitattributes entry
> 
>  - needs work to make bare-repo .gitattributes work (though I think
>this would be useful for other features, too)
> 
>  - has a run-time cost at each diff to do the conversion
> 
>  - may sometimes annoy people when it doesn't kick in (e.g.,
>emailed patches from format-patch won't have a readable diff)
> 
>  - doesn't combine with other custom-diff config (e.g., utf-16
>storing C code should still use diff=c funcname rules, but
>wouldn't with my patch)
> 
>   2. auto-detect utf-16 (your patch)
>  - Just Works for existing repositories storing utf-16
> 
>  - carries some risk of kicking in when people would like it not to
>(e.g., when they really do want a binary patch that can be
>applied).

The binary patch is still supported, but that detail may need some more 
explanation
in the commit message. Please see  t4066-diff-encoding.sh
  test_expect_success 'diff --binary against local change' '
 cp file2 file &&
 test_tick &&
 cat >expect <<-\EOF &&
 diff --git a/file b/file
 index 
26acf09b0aad19fb22566956d1a39cb4e2a3b420..e98d27acfb90cfcfc84fcc5173baa4aa7828290f
 100644
 GIT binary patch
 literal 28
 ecmezW?;ArgLn;Fo!ykquAe{qbJq3!C0BHb{ln3Pi

 literal 32
 icmezW?+HT@Lpnn$kmO?c#!w7oaWVX1NCMJ1Ko$VA_z0~4

  EOF
 git diff --binary file >actual &&
 test_cmp expect actual

> 
>I think it would probably be OK if this kicked in only when
>ALLOW_TEXTCONV is set (the default for porcelain), and --binary
>is not (i.e., when we would otherwise just say "binary
>files differ").

The user can still use "git diff" (Where auto-detection of UTF-16 kicks in
and replaces "binary files differ" with an UTF-8 diff.
When the user wants a patch, "git diff --binary" will generate a binary patch,
as before.
The only thing which is missing is the line "binary files differ", which may be 
a
regression. I can re-add it in V2.

> 
>  - similar to (1), carries a run-time cost for each diff, and users
>may sometimes still see binary diffs
> 
>   3. w-t-e (Lars's patch)
> 
>  - requires no server-side modifications; the diff is plain vanilla
> 
>  - works everywhere you diff, plumbing and porcelain
> 
>  - does require people to add a .gitattributes entry
> 
>  - run-time cost is per-checkout, not per-diff
> 
> So I can see room for (3) to co-exist alongside the others. Between (1)
> and (2), I think (2) is probably the better direction.
> 
> -Peff


Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-27 Thread Jeff King
On Tue, Feb 27, 2018 at 02:10:20PM -0800, Junio C Hamano wrote:

> > I thought it solved that by the hosting folks never seeing the strange
> > binary-looking data. They see only utf8, which diffs well.
> 
> Ah, OK, that is a "fix" in a wider context (in a narrower context,
> "work around" is a more appropriate term ;-).
> 
> The reason why I have been nudging people toward considering in-repo
> encoding attribute is because forcing projects that already have
> their contents in a strange binary-looking encoding to switch is
> costly.  But perhaps having them pay one-time conversion pain is a
> better investment longer term.

Yeah, thanks for mentioning that. It should have gone in my "relative
merits" list. The conversion flag-day is definitely going to be a pain
for users, and doesn't help with diffing older versions.

-Peff


Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-27 Thread Junio C Hamano
Jeff King  writes:

> On Tue, Feb 27, 2018 at 01:55:02PM -0800, Junio C Hamano wrote:
>
>> Jeff King  writes:
>> 
>> > Of the three solutions, I think the relative merits are something like
>> > this:
>> > ...
>> >   3. w-t-e (Lars's patch)
>> 
>> I thought Lars's w-t-e was about keeping the in-repo contents in
>> UTF-8 and externalize in whatever encoding (e.g. UTF-16), so it
>> won't help the issue hosting folks want to deal with, i.e. showing
>> in-repo data that is stored in a strange binary-looking encoding in
>> a more reasonable encodign while diffing, no?
>
> I thought it solved that by the hosting folks never seeing the strange
> binary-looking data. They see only utf8, which diffs well.

Ah, OK, that is a "fix" in a wider context (in a narrower context,
"work around" is a more appropriate term ;-).

The reason why I have been nudging people toward considering in-repo
encoding attribute is because forcing projects that already have
their contents in a strange binary-looking encoding to switch is
costly.  But perhaps having them pay one-time conversion pain is a
better investment longer term.


Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-27 Thread Jeff King
On Tue, Feb 27, 2018 at 01:55:02PM -0800, Junio C Hamano wrote:

> Jeff King  writes:
> 
> > Of the three solutions, I think the relative merits are something like
> > this:
> > ...
> >   3. w-t-e (Lars's patch)
> 
> I thought Lars's w-t-e was about keeping the in-repo contents in
> UTF-8 and externalize in whatever encoding (e.g. UTF-16), so it
> won't help the issue hosting folks want to deal with, i.e. showing
> in-repo data that is stored in a strange binary-looking encoding in
> a more reasonable encodign while diffing, no?

I thought it solved that by the hosting folks never seeing the strange
binary-looking data. They see only utf8, which diffs well.

-Peff


Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-27 Thread Junio C Hamano
Jeff King  writes:

> Of the three solutions, I think the relative merits are something like
> this:
> ...
>   3. w-t-e (Lars's patch)

I thought Lars's w-t-e was about keeping the in-repo contents in
UTF-8 and externalize in whatever encoding (e.g. UTF-16), so it
won't help the issue hosting folks want to deal with, i.e. showing
in-repo data that is stored in a strange binary-looking encoding in
a more reasonable encodign while diffing, no?

Usually we only work in-repo encoding when producing a diff and show
the result in in-repo encoding, but I can imagine a new attribute,
when set, we first convert in-repo to the specified encoding before
passing the result to xdiff machinery.  Then convert it back to
in-repo encoding before showing the diff (or just show the result in
that encoding xdiff machinery processed---I do not know which one
should be the default).




Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-27 Thread Jeff King
On Tue, Feb 27, 2018 at 10:05:17PM +0100, Torsten Bögershausen wrote:

> The other question is:
> Would this help showing diffs of UTF-16 encoded files on a "git hoster",
> github/bitbucket/ ?

Almost. There's probably one more thing needed. We don't currently read
in-tree .gitattributes when doing a diff in a bare repository. And most
hosting sites will store bare repositories.

And of course it would require the users to actually set the attributes
themselves.

> Or would the auto-magic UTF-16 avoid binary patch that I send out be more 
> helpful ?
> Or both ?
> Or the w-t-e encoding ?

Of the three solutions, I think the relative merits are something like
this:

  1. baked-in textconv (my patch)

 - reuses an existing diff feature, so minimal code and not likely to
   break things

 - requires people to add a .gitattributes entry

 - needs work to make bare-repo .gitattributes work (though I think
   this would be useful for other features, too)

 - has a run-time cost at each diff to do the conversion

 - may sometimes annoy people when it doesn't kick in (e.g.,
   emailed patches from format-patch won't have a readable diff)

 - doesn't combine with other custom-diff config (e.g., utf-16
   storing C code should still use diff=c funcname rules, but
   wouldn't with my patch)

  2. auto-detect utf-16 (your patch)
 - Just Works for existing repositories storing utf-16

 - carries some risk of kicking in when people would like it not to
   (e.g., when they really do want a binary patch that can be
   applied).

   I think it would probably be OK if this kicked in only when
   ALLOW_TEXTCONV is set (the default for porcelain), and --binary
   is not (i.e., when we would otherwise just say "binary
   files differ").

 - similar to (1), carries a run-time cost for each diff, and users
   may sometimes still see binary diffs

  3. w-t-e (Lars's patch)

 - requires no server-side modifications; the diff is plain vanilla

 - works everywhere you diff, plumbing and porcelain

 - does require people to add a .gitattributes entry

 - run-time cost is per-checkout, not per-diff

So I can see room for (3) to co-exist alongside the others. Between (1)
and (2), I think (2) is probably the better direction.

-Peff


Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-27 Thread Torsten Bögershausen
On Mon, Feb 26, 2018 at 03:46:35PM -0500, Jeff King wrote:
> On Mon, Feb 26, 2018 at 06:35:33PM +0100, Torsten Bögershausen wrote:
> 
> > > diff --git a/userdiff.c b/userdiff.c
> > > index dbfb4e13cd..48fa7e8bdd 100644
> > > --- a/userdiff.c
> > > +++ b/userdiff.c
> > > @@ -161,6 +161,7 @@ IPATTERN("css",
> > >"-?[_a-zA-Z][-_a-zA-Z0-9]*" /* identifiers */
> > >"|-?[0-9]+|\\#[0-9a-fA-F]+" /* numbers */
> > >  ),
> > > +{ "utf16", NULL, -1, { NULL, 0 }, NULL, "iconv:utf16" },
> > >  { "default", NULL, -1, { NULL, 0 } },
> > >  };
> > >  #undef PATTERNS
> > 
> > The patch looks like a possible step into the right direction -
> > some minor notes: "utf8" is better written as "UTF-8", when talking
> > to iconv.h, same for utf16.
> > 
> > But, how do I activate the diff ?
> > I have in .gitattributes
> > XXXenglish.txt diff=UTF-16
> > 
> > and in .git/config
> > [diff "UTF-16"]
> >   command = iconv:UTF-16
> > 
> > 
> > What am I doing wrong ?
> 
> After applying the patch, if I do:
> 
>   git init
>   echo hello | iconv -f utf8 -t utf16 >file
>   git add file
>   git commit -m one
>   echo goodbye | iconv -f utf8 -t utf16 >file
>   git add file
>   git commit -m two
> 
> then:
> 
>   git log -p
> 
> shows "binary files differ" but:
> 
>   echo "file diff=utf16" >.gitattributes
>   git log -p
> 
> shows text diffs. I assume you tweaked the patch before switching to
> the UTF-16 spelling in your example. Did you use a plumbing command to
> show the diff? textconv isn't enabled for plumbing, because the
> resulting patches cannot actually be applied (in that sense an encoding
> switch is potentially special, since in theory one could convert to the
> canonical text format, apply the patch, and then convert back).
> 
> -Peff

Thanks for helping me out.
I didn't use "git log -p", but a simple "git diff".
(And after re-using utf16 with lowercase, it works as you described it)

I wasn't aware of "git log -p", something learned (or re-learned)

The other question is:
Would this help showing diffs of UTF-16 encoded files on a "git hoster",
github/bitbucket/ ?

Or would the auto-magic UTF-16 avoid binary patch that I send out be more 
helpful ?
Or both ?
Or the w-t-e encoding ?

Questions over questions.



Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-26 Thread Jeff King
On Mon, Feb 26, 2018 at 06:35:33PM +0100, Torsten Bögershausen wrote:

> > diff --git a/userdiff.c b/userdiff.c
> > index dbfb4e13cd..48fa7e8bdd 100644
> > --- a/userdiff.c
> > +++ b/userdiff.c
> > @@ -161,6 +161,7 @@ IPATTERN("css",
> >  "-?[_a-zA-Z][-_a-zA-Z0-9]*" /* identifiers */
> >  "|-?[0-9]+|\\#[0-9a-fA-F]+" /* numbers */
> >  ),
> > +{ "utf16", NULL, -1, { NULL, 0 }, NULL, "iconv:utf16" },
> >  { "default", NULL, -1, { NULL, 0 } },
> >  };
> >  #undef PATTERNS
> 
> The patch looks like a possible step into the right direction -
> some minor notes: "utf8" is better written as "UTF-8", when talking
> to iconv.h, same for utf16.
> 
> But, how do I activate the diff ?
> I have in .gitattributes
> XXXenglish.txt diff=UTF-16
> 
> and in .git/config
> [diff "UTF-16"]
>   command = iconv:UTF-16
> 
> 
> What am I doing wrong ?

After applying the patch, if I do:

  git init
  echo hello | iconv -f utf8 -t utf16 >file
  git add file
  git commit -m one
  echo goodbye | iconv -f utf8 -t utf16 >file
  git add file
  git commit -m two

then:

  git log -p

shows "binary files differ" but:

  echo "file diff=utf16" >.gitattributes
  git log -p

shows text diffs. I assume you tweaked the patch before switching to
the UTF-16 spelling in your example. Did you use a plumbing command to
show the diff? textconv isn't enabled for plumbing, because the
resulting patches cannot actually be applied (in that sense an encoding
switch is potentially special, since in theory one could convert to the
canonical text format, apply the patch, and then convert back).

-Peff


Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-26 Thread Torsten Bögershausen
On Sun, Feb 25, 2018 at 08:44:46PM -0500, Jeff King wrote:
> On Sat, Feb 24, 2018 at 04:18:36PM +0100, Lars Schneider wrote:
> 
> > > We always use the in-repo contents when generating 'diff'.  I think
> > > by "attribute to be used in diff", what you are reallying after is
> > > to convert the in-repo contents to that encoding _BEFORE_ running
> > > 'diff' on it.  E.g. in-repo UTF-16 that can have NUL bytes all over
> > > the place will not diff well with the xdiff machinery, but if you
> > > first convert it to UTF-8 and have xdiff work on it, you can get
> > > reasonable result out of it.  It is unclear what encoding you want
> > > your final diff output in (it is equally valid in such a set-up to
> > > desire your patch output in UTF-16 or UTF-8), but assuming that you
> > > want UTF-8 in your patch output, perhaps we do not have to break
> > > gitk users by hijacking the 'encoding' attribute.  Instead what you
> > > want is a single bit that says between in-repo or working tree which
> > > representation should be given to the xdiff machinery.
> > 
> > I fear that we could confuse users with an additional knob/bit that
> > defines what we diff against. Git always diff'ed against in-repo 
> > content and I feel it should stay that way.
> 
> Well, except for textconv. You can already do this:
> 
>   echo "foo diff=utf16" >.gitattributes
>   git config diff.utf16.textconv 'iconv -f utf16 -t utf8'
> 
> We could make that easier to use and much more efficient by:
> 
>   1. Allowing a special syntax for textconv filters that kicks off an
>  internal iconv.
> 
>   2. Providing baked-in config for utf16.
> 
> The patch below provides a sketch. But I think Torsten raised a good
> point that you might want the encoding conversion to be independent of
> other diff characteristics (so, e.g., you might say "this is utf16 but
> once converted treat it like C code for finding funcnames, etc").
> 
> ---
> diff --git a/diff.c b/diff.c
> index 21c3838b25..04032e059c 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -5968,6 +5968,21 @@ struct diff_filepair *diff_unmerge(struct diff_options 
> *options, const char *pat
>   return pair;
>  }
>  
> +static char *iconv_textconv(const char *encoding, struct diff_filespec *spec,
> + size_t *outsize)
> +{
> + char *ret;
> + int outsize_int; /* this really should be a size_t */
> +
> + if (diff_populate_filespec(spec, 0))
> + die("unable to load content for %s", spec->path);
> + ret = reencode_string_len(spec->data, spec->size,
> +   "utf-8", /* should be log_output_encoding? */
> +   encoding, _int);
> + *outsize = outsize_int;
> + return ret;
> +}
> +
>  static char *run_textconv(const char *pgm, struct diff_filespec *spec,
>   size_t *outsize)
>  {
> @@ -5978,6 +5993,9 @@ static char *run_textconv(const char *pgm, struct 
> diff_filespec *spec,
>   struct strbuf buf = STRBUF_INIT;
>   int err = 0;
>  
> + if (skip_prefix(pgm, "iconv:", ))
> + return iconv_textconv(pgm, spec, outsize);
> +
>   temp = prepare_temp_file(spec->path, spec);
>   *arg++ = pgm;
>   *arg++ = temp->name;
> diff --git a/userdiff.c b/userdiff.c
> index dbfb4e13cd..48fa7e8bdd 100644
> --- a/userdiff.c
> +++ b/userdiff.c
> @@ -161,6 +161,7 @@ IPATTERN("css",
>"-?[_a-zA-Z][-_a-zA-Z0-9]*" /* identifiers */
>"|-?[0-9]+|\\#[0-9a-fA-F]+" /* numbers */
>  ),
> +{ "utf16", NULL, -1, { NULL, 0 }, NULL, "iconv:utf16" },
>  { "default", NULL, -1, { NULL, 0 } },
>  };
>  #undef PATTERNS

The patch looks like a possible step into the right direction -
some minor notes: "utf8" is better written as "UTF-8", when talking
to iconv.h, same for utf16.

But, how do I activate the diff ?
I have in .gitattributes
XXXenglish.txt diff=UTF-16

and in .git/config
[diff "UTF-16"]
  command = iconv:UTF-16


What am I doing wrong ?


Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-25 Thread Jeff King
On Sat, Feb 24, 2018 at 04:18:36PM +0100, Lars Schneider wrote:

> > We always use the in-repo contents when generating 'diff'.  I think
> > by "attribute to be used in diff", what you are reallying after is
> > to convert the in-repo contents to that encoding _BEFORE_ running
> > 'diff' on it.  E.g. in-repo UTF-16 that can have NUL bytes all over
> > the place will not diff well with the xdiff machinery, but if you
> > first convert it to UTF-8 and have xdiff work on it, you can get
> > reasonable result out of it.  It is unclear what encoding you want
> > your final diff output in (it is equally valid in such a set-up to
> > desire your patch output in UTF-16 or UTF-8), but assuming that you
> > want UTF-8 in your patch output, perhaps we do not have to break
> > gitk users by hijacking the 'encoding' attribute.  Instead what you
> > want is a single bit that says between in-repo or working tree which
> > representation should be given to the xdiff machinery.
> 
> I fear that we could confuse users with an additional knob/bit that
> defines what we diff against. Git always diff'ed against in-repo 
> content and I feel it should stay that way.

Well, except for textconv. You can already do this:

  echo "foo diff=utf16" >.gitattributes
  git config diff.utf16.textconv 'iconv -f utf16 -t utf8'

We could make that easier to use and much more efficient by:

  1. Allowing a special syntax for textconv filters that kicks off an
 internal iconv.

  2. Providing baked-in config for utf16.

The patch below provides a sketch. But I think Torsten raised a good
point that you might want the encoding conversion to be independent of
other diff characteristics (so, e.g., you might say "this is utf16 but
once converted treat it like C code for finding funcnames, etc").

---
diff --git a/diff.c b/diff.c
index 21c3838b25..04032e059c 100644
--- a/diff.c
+++ b/diff.c
@@ -5968,6 +5968,21 @@ struct diff_filepair *diff_unmerge(struct diff_options 
*options, const char *pat
return pair;
 }
 
+static char *iconv_textconv(const char *encoding, struct diff_filespec *spec,
+   size_t *outsize)
+{
+   char *ret;
+   int outsize_int; /* this really should be a size_t */
+
+   if (diff_populate_filespec(spec, 0))
+   die("unable to load content for %s", spec->path);
+   ret = reencode_string_len(spec->data, spec->size,
+ "utf-8", /* should be log_output_encoding? */
+ encoding, _int);
+   *outsize = outsize_int;
+   return ret;
+}
+
 static char *run_textconv(const char *pgm, struct diff_filespec *spec,
size_t *outsize)
 {
@@ -5978,6 +5993,9 @@ static char *run_textconv(const char *pgm, struct 
diff_filespec *spec,
struct strbuf buf = STRBUF_INIT;
int err = 0;
 
+   if (skip_prefix(pgm, "iconv:", ))
+   return iconv_textconv(pgm, spec, outsize);
+
temp = prepare_temp_file(spec->path, spec);
*arg++ = pgm;
*arg++ = temp->name;
diff --git a/userdiff.c b/userdiff.c
index dbfb4e13cd..48fa7e8bdd 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -161,6 +161,7 @@ IPATTERN("css",
 "-?[_a-zA-Z][-_a-zA-Z0-9]*" /* identifiers */
 "|-?[0-9]+|\\#[0-9a-fA-F]+" /* numbers */
 ),
+{ "utf16", NULL, -1, { NULL, 0 }, NULL, "iconv:utf16" },
 { "default", NULL, -1, { NULL, 0 } },
 };
 #undef PATTERNS


Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-24 Thread Lars Schneider

> On 23 Feb 2018, at 21:11, Junio C Hamano  wrote:
> 
> Junio C Hamano  writes:
> 
>> Lars Schneider  writes:
>> 
>>> I still think it would be nice to see diffs for arbitrary encodings.
>>> Would it be an option to read the `encoding` attribute and use it in
>>> `git diff`?
>> 
>> Reusing that gitk-only thing and suddenly start doing so would break
>> gitk users, no?  The tool expects the diff to come out encoded in
>> the encoding that is specified by that attribute (which is learned
>> from get_path_encoding helper) and does its thing.
>> 
>> I guess that gitk uses diff-tree plumbing and you won't be applying
>> this change to the plumbing, perhaps?  If so, it might not be too
>> bad, but those who decided to postprocess "git diff" output (instead
>> of "git diff-tree" output) mimicking how gitk does it by thinking
>> that is the safe and sane thing to do will be broken by such a
>> change.  You could do "use the encoding only when a command line
>> option says so", but then people will add a configuration variable
>> to turn it always on and these existing scripts will be broken.
>> 
>> I do not personally have much sympathy for the last case (i.e. those
>> who scripted around 'git diff' instead of 'git diff-tree' to get
>> broken), so making the new feature only work with the Porcelain "git
>> diff" might be an option.  I'll need a bit more time to formulate
>> the rest of my thought ;-)
> 
> So we are introducing in this series a way to say in what encoding
> the things should be placed in the working tree files (i.e. the
> w-t-e attribute attached to the paths).  Currently there is no
> mechanism to say what encoding the in-repo contents are and UTF-8 is
> assumed when conversion from/to w-t-e is required, but there is no
> fundamental reason why it shouldn't be customizable (if anything, as
> a piece of fact on the in-repo data, in-repo-encoding is *more*
> appropriate to be an attribute than w-t-e that can merely be project
> preference at best, as I mentioned earlier in this thread).

Correct.


> We always use the in-repo contents when generating 'diff'.  I think
> by "attribute to be used in diff", what you are reallying after is
> to convert the in-repo contents to that encoding _BEFORE_ running
> 'diff' on it.  E.g. in-repo UTF-16 that can have NUL bytes all over
> the place will not diff well with the xdiff machinery, but if you
> first convert it to UTF-8 and have xdiff work on it, you can get
> reasonable result out of it.  It is unclear what encoding you want
> your final diff output in (it is equally valid in such a set-up to
> desire your patch output in UTF-16 or UTF-8), but assuming that you
> want UTF-8 in your patch output, perhaps we do not have to break
> gitk users by hijacking the 'encoding' attribute.  Instead what you
> want is a single bit that says between in-repo or working tree which
> representation should be given to the xdiff machinery.

I fear that we could confuse users with an additional knob/bit that
defines what we diff against. Git always diff'ed against in-repo 
content and I feel it should stay that way.

However, I agree with your earlier emails that "working-tree-encoding"
is just one half of the feature. I also think it would be nice to be
able to define the "in-repo-encoding" as well. Then we could define 
something like that:

*.foo   text in-repo-encoding=UTF-16LE

This tells Git that the file is stored as UTF-16LE. This would help Git
generating a diff via UTF-8 conversion. I feel that the final patch 
should be in UTF-16LE again. Maybe over time we could then deprecate the
"encoding" attribute as the "in-repo-encoding" attribute serves a similar 
purpose (maybe gitk can switch to it).

In that case we could also do things like that:

*.bar   text working-tree-encoding=SHIFT-JIS 
in-repo-encoding=UTF-16LE

SHIFT-JIS encoded files would be reencoded to UTF-16LE on checkin.
On checkout the opposite would happen. This way we would lift the
"UTF-8 is the only in-repo encoding" limitation of the current w-t-e
implementation.

Does this sound sensible to you? That being said, I think "in-repo-encoding"
would deserve an own series.

- Lars


Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-23 Thread Junio C Hamano
Junio C Hamano  writes:

> Lars Schneider  writes:
>
>> I still think it would be nice to see diffs for arbitrary encodings.
>> Would it be an option to read the `encoding` attribute and use it in
>> `git diff`?
>
> Reusing that gitk-only thing and suddenly start doing so would break
> gitk users, no?  The tool expects the diff to come out encoded in
> the encoding that is specified by that attribute (which is learned
> from get_path_encoding helper) and does its thing.
>
> I guess that gitk uses diff-tree plumbing and you won't be applying
> this change to the plumbing, perhaps?  If so, it might not be too
> bad, but those who decided to postprocess "git diff" output (instead
> of "git diff-tree" output) mimicking how gitk does it by thinking
> that is the safe and sane thing to do will be broken by such a
> change.  You could do "use the encoding only when a command line
> option says so", but then people will add a configuration variable
> to turn it always on and these existing scripts will be broken.
>
> I do not personally have much sympathy for the last case (i.e. those
> who scripted around 'git diff' instead of 'git diff-tree' to get
> broken), so making the new feature only work with the Porcelain "git
> diff" might be an option.  I'll need a bit more time to formulate
> the rest of my thought ;-)

So we are introducing in this series a way to say in what encoding
the things should be placed in the working tree files (i.e. the
w-t-e attribute attached to the paths).  Currently there is no
mechanism to say what encoding the in-repo contents are and UTF-8 is
assumed when conversion from/to w-t-e is required, but there is no
fundamental reason why it shouldn't be customizable (if anything, as
a piece of fact on the in-repo data, in-repo-encoding is *more*
appropriate to be an attribute than w-t-e that can merely be project
preference at best, as I mentioned earlier in this thread).  

We always use the in-repo contents when generating 'diff'.  I think
by "attribute to be used in diff", what you are reallying after is
to convert the in-repo contents to that encoding _BEFORE_ running
'diff' on it.  E.g. in-repo UTF-16 that can have NUL bytes all over
the place will not diff well with the xdiff machinery, but if you
first convert it to UTF-8 and have xdiff work on it, you can get
reasonable result out of it.  It is unclear what encoding you want
your final diff output in (it is equally valid in such a set-up to
desire your patch output in UTF-16 or UTF-8), but assuming that you
want UTF-8 in your patch output, perhaps we do not have to break
gitk users by hijacking the 'encoding' attribute.  Instead what you
want is a single bit that says between in-repo or working tree which
representation should be given to the xdiff machinery.  When that
bit is set, then we

 - First ensure that both sides of the diff input is in the working
   tree encoding by running it through convert_to_working_tree();

 - Run xdiff on it;

 - Take the xdiff result, and run it through convert_to_git(),
   before emitting (this is optional, making this a one-and-half bit
   option).

That would allow you to say "I have in-repo data in UTF-16 which I
check out as UTF-8.  xdiff machinery is unhappy.  Please do
something." perhaps?

The other way around (i.e. in-repo is UTF-8, but working tree
encoding is UTF-16) won't need xdiff issues, I would imagine.


Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-23 Thread Junio C Hamano
Lars Schneider  writes:

> I still think it would be nice to see diffs for arbitrary encodings.
> Would it be an option to read the `encoding` attribute and use it in
> `git diff`?

Reusing that gitk-only thing and suddenly start doing so would break
gitk users, no?  The tool expects the diff to come out encoded in
the encoding that is specified by that attribute (which is learned
from get_path_encoding helper) and does its thing.

I guess that gitk uses diff-tree plumbing and you won't be applying
this change to the plumbing, perhaps?  If so, it might not be too
bad, but those who decided to postprocess "git diff" output (instead
of "git diff-tree" output) mimicking how gitk does it by thinking
that is the safe and sane thing to do will be broken by such a
change.  You could do "use the encoding only when a command line
option says so", but then people will add a configuration variable
to turn it always on and these existing scripts will be broken.

I do not personally have much sympathy for the last case (i.e. those
who scripted around 'git diff' instead of 'git diff-tree' to get
broken), so making the new feature only work with the Porcelain "git
diff" might be an option.  I'll need a bit more time to formulate
the rest of my thought ;-)



Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-22 Thread Jeff King
On Thu, Feb 22, 2018 at 09:00:45PM +0100, Lars Schneider wrote:

> > If it was only about a diff of UTF-16 files, I may suggest a patch.
> > I simply copy-paste it here for review, if someone thinks that it may
> > be useful, I can send it as a real patch/RFC.
> 
> That's a nice idea but I see two potential problems:
> 
> (1) Git hosting services (GitLab, BitBucket, GitHub, ...) would still
> show these files as binary. That's a huge problem for my users as
> they interact more with these services than the Git command line.
> That's the main reason why I implemented the "UTF-8 as canonical
> form" approach in my series.

I can't speak for the other services, but I can tell you that GitHub
would be pretty eager to enable such a feature if it existed.

I suspect most services providing human-readable diffs would want to do
the same. Though there are still cases where you'd see a binary patch
(e.g., format-patch in emails, or GitHub's .patch endpoint, since those
are meant to be applied and must contain the "real" data).

-Peff


Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-22 Thread Lars Schneider

> On 16 Feb 2018, at 17:58, Torsten Bögershausen  wrote:
> 
> On Fri, Feb 16, 2018 at 03:42:35PM +0100, Lars Schneider wrote:
> []
>> 
>> Agreed. However, people using ShiftJIS are not my target audience.
>> My target audience are:
>> 
>> (1) People that have to encode their text files in UTF-16 (for 
>>whatever reason - usually because of legacy processes or tools).
>> 
>> (2) People that want to see textual diffs of their UTF-16 encoded 
>>files in their Git tools without special adjustments (on the 
>>command line, on the web, etc).
>> 
>> That was my primary motivation. The fact that w-t-e supports any
>> other encoding too is just a nice side effect. I don't foresee people
>> using other w-t-encodings other than UTF-16 in my organization.
>> 
>> I have the suspicion that the feature could be useful for the Git
>> community at large. Consider this Stack Overflow question:
>> https://stackoverflow.com/questions/777949/can-i-make-git-recognize-a-utf-16-file-as-text
>> 
>> This question was viewed 42k times and there is no good solution.
>> I believe w-t-e could be a good solution.
>> 
> 
> If it was only about a diff of UTF-16 files, I may suggest a patch.
> I simply copy-paste it here for review, if someone thinks that it may
> be useful, I can send it as a real patch/RFC.

That's a nice idea but I see two potential problems:

(1) Git hosting services (GitLab, BitBucket, GitHub, ...) would still
show these files as binary. That's a huge problem for my users as
they interact more with these services than the Git command line.
That's the main reason why I implemented the "UTF-8 as canonical
form" approach in my series.

(2) You can only detect a BOM if the encoding is UTF-16. UTF-16BE and
UTF-16LE must not have a BOM and therefore cannot be easily
detected. Plus, even if you detect an UTF-16 BOM then it would be 
just a hint that the file is likely UTF-16 encoded as the sequence
could be there by chance. 

I still think it would be nice to see diffs for arbitrary encodings.
Would it be an option to read the `encoding` attribute and use it in
`git diff`?

- Lars


> 
> git show HEAD
> 
> 
> commit 9f7d43f29eaf6017b7b16261ce91d8ef182cf415
> Author: Torsten Bögershausen 
> Date:   Fri Feb 2 15:35:23 2018 +0100
> 
>Auto diff of UTF-16 files in UTF-8
> 
>When an UTF-16 file is commited and later changed, `git diff` shows
>"Binary files XX and YY differ".
> 
>When the user wants a diff in UTF-8, a textconv needs to be specified
>in .gitattributes and the textconv must be configured.
> 
>A more user-friendly diff can be produced for UTF-16 if
>- the user did not use `git diff --binary`
>- the blob is identified as binary
>- the blob has an UTF-16 BOM
>- the blob can be converted into UTF-8
> 
>Enhance the diff machinery to auto-detect UTF-16 blobs and show them
>as UTF-8, unless the user specifies `git diff --binary` which creates
>a binary diff.
> 
> diff --git a/diff.c b/diff.c
> index fb22b19f09..51831ee94d 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -3192,6 +3192,10 @@ static void builtin_diff(const char *name_a,
>   strbuf_reset();
>   }
> 
> + if (one && one->reencoded_from_utf16)
> + strbuf_addf(, "a is converted to UTF-8 from 
> UTF-16\n");
> + if (two && two->reencoded_from_utf16)
> + strbuf_addf(, "b is converted to UTF-8 from 
> UTF-16\n");
>   mf1.size = fill_textconv(textconv_one, one, );
>   mf2.size = fill_textconv(textconv_two, two, );
> 
> @@ -3611,8 +3615,25 @@ int diff_populate_filespec(struct diff_filespec *s, 
> unsigned int flags)
>   s->size = size;
>   s->should_free = 1;
>   }
> - }
> - else {
> + if (!s->binary && buffer_is_binary(s->data, s->size) &&
> + buffer_has_utf16_bom(s->data, s->size)) {
> + int outsz = 0;
> + char *outbuf;
> + outbuf = reencode_string_len(s->data, (int)s->size,
> +  "UTF-8", "UTF-16", );
> + if (outbuf) {
> + if (s->should_free)
> + free(s->data);
> + if (s->should_munmap)
> + munmap(s->data, s->size);
> + s->should_munmap = 0;
> + s->data = outbuf;
> + s->size = outsz;
> + s->reencoded_from_utf16 = 1;
> + s->should_free = 1;
> + }
> + }
> + } else {
>   enum object_type type;
>   if (size_only || (flags & CHECK_BINARY)) {
>   type = sha1_object_info(s->oid.hash, 

Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-21 Thread Lars Schneider

> On 16 Feb 2018, at 19:55, Junio C Hamano  wrote:
> 
> Jeff King  writes:
> 
>> So a full proposal would support both cases: "check this out in the
>> local platform's preferred encoding" and "always check this out in
>> _this_ encoding". And Lars's proposal is just the second half of that.
> 
> Actually, what you seem to take as a whole is just half of the
> story.  The other half that is an ability to say "what is in the
> repository for this path is stored in this encoding".  I agree that
> "check it out in this encoding" is a useful thing to have, and using
> the in-tree .gitattributes as a place to state the project-wide
> preference may be OK (and .git/info/attributes should be able to
> override it if needed -- this probably deserves to be added to a
> test somewhere by this series).

Good call! I'll add this test case!


> Luckily, lack of 'in-repository-encoding' attribute is not a show
> stopper for this series.  A later topic could start with "earlier,
> in order to make use of w-t-e attribute, you had to have your
> contents in UTF-8.  Teach the codepath to honor a new attribute that
> tells in what encoding the blob contents are stored." without having
> to be a part of this topic.

I have the impression that this is the purpose of the already existing 
"encoding" attribute, no? AFAIK this attribute is only respected by 
gitk, though. A future series could make Git respect this attribute too.


- Lars



Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-16 Thread Junio C Hamano
Jeff King  writes:

> In which case yeah, I could see choosing an in-repo encoding to possibly
> be useful (but it also seems like a feature that could easily be tacked
> on later if somebody cares).

Yes, I think we are on the same page---in-repo-encoding that is a
natural counterpart to w-t-e attribute can be added later if/when
somebody finds it useful, and it is perfectly OK to declare that we
cater only to UTF-8 users until that happens.


Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-16 Thread Jeff King
On Fri, Feb 16, 2018 at 02:25:41PM -0500, Jeff King wrote:

> On Fri, Feb 16, 2018 at 10:55:58AM -0800, Junio C Hamano wrote:
> 
> > Jeff King  writes:
> > 
> > > So a full proposal would support both cases: "check this out in the
> > > local platform's preferred encoding" and "always check this out in
> > > _this_ encoding". And Lars's proposal is just the second half of that.
> > 
> > Actually, what you seem to take as a whole is just half of the
> > story.  The other half that is an ability to say "what is in the
> > repository for this path is stored in this encoding".  I agree that
> > "check it out in this encoding" is a useful thing to have, and using
> > the in-tree .gitattributes as a place to state the project-wide
> > preference may be OK (and .git/info/attributes should be able to
> > override it if needed -- this probably deserves to be added to a
> > test somewhere by this series).
> 
> If we are just talking about a check-out feature, I'm not sure that the
> in-repository encoding is all that interesting. As with CRLFs, we would
> be declaring UTF-8 as the "canonical" in-repo encoding for such
> conversions. Is there a reason you'd want something else?

Maybe answering my own question: because your encoding of choice does
not round-trip to UTF-8?

In which case yeah, I could see choosing an in-repo encoding to possibly
be useful (but it also seems like a feature that could easily be tacked
on later if somebody cares).

-Peff


Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-16 Thread Jeff King
On Fri, Feb 16, 2018 at 10:55:58AM -0800, Junio C Hamano wrote:

> Jeff King  writes:
> 
> > So a full proposal would support both cases: "check this out in the
> > local platform's preferred encoding" and "always check this out in
> > _this_ encoding". And Lars's proposal is just the second half of that.
> 
> Actually, what you seem to take as a whole is just half of the
> story.  The other half that is an ability to say "what is in the
> repository for this path is stored in this encoding".  I agree that
> "check it out in this encoding" is a useful thing to have, and using
> the in-tree .gitattributes as a place to state the project-wide
> preference may be OK (and .git/info/attributes should be able to
> override it if needed -- this probably deserves to be added to a
> test somewhere by this series).

If we are just talking about a check-out feature, I'm not sure that the
in-repository encoding is all that interesting. As with CRLFs, we would
be declaring UTF-8 as the "canonical" in-repo encoding for such
conversions. Is there a reason you'd want something else?

If the feature were "the in-repo encoding is X, and I want you to show
me a diff using encoding Y", then I could see the use of that (and I
think for most people's purposes that would be an equally valid solution
to their problem).

-Peff


Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-16 Thread Junio C Hamano
Lars Schneider  writes:

>> One thing I find more problematic is that the above places *too*
>> much stress on the UTF-8 centric worldview.  It is perfectly valid
>> to store your text contents encoded in ShiftJIS and check them out
>> as-is, with or without this patch.  It is grossly misleading to say
>> that older versions of Git will check them out in UTF-8.  "will
>> checkout these files as-is without encoding conversion" is a better
>> way to say it, probably.
>
> True. But that's not what I wanted to say in the "pitfalls" section.
> If my Git client supports w-t-e and I add the ShiftJIS encoded
> file "foo.bar" to my repository, then Git will store the file as
> UTF-8 _internally_. That means if you clone my repository and your 
> Git client does _not_ support w-t-e, then you will see "foo.bar" as 
> UTF-8 encoded.

What you wrote implies *more* than that, which is what I had trouble
with.

If you say "what you have is checked out as-is", then it is still
clear that those who use w-t-e to convert non UTF-8 into UTF-8 when
checking in will get UTF-8 out when they use an older version of
Git.  If you say "what you have will be checked out in UTF-8", it
makes it sound as if pre w-t-e Git will somehow reject non UTF-8
in-tree contents, or magically convert anything to UTF-8 while
checking out, which is *not* what you want to imply.

>> Also notice that even in the world with w-t-e, such a project won't
>> benefit from w-t-e at all.  After all, they have been happy using
>> ShiftJIS in repository and using the same encoding on the working
>> tree, and because w-t-e assumes that everybody should be using UTF-8
>> internally, such a project cannot take advantage of the new
>> mechanism.
>
> Agreed. However, people using ShiftJIS are not my target audience.

Be aware that you are writing *not* *solely* for your target
audience.  You write document for everybody, and make sure the
description of a feature makes it clear who the feature primarily
targets and how using (or not using) the feature affects users.



Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-16 Thread Junio C Hamano
Jeff King  writes:

> So a full proposal would support both cases: "check this out in the
> local platform's preferred encoding" and "always check this out in
> _this_ encoding". And Lars's proposal is just the second half of that.

Actually, what you seem to take as a whole is just half of the
story.  The other half that is an ability to say "what is in the
repository for this path is stored in this encoding".  I agree that
"check it out in this encoding" is a useful thing to have, and using
the in-tree .gitattributes as a place to state the project-wide
preference may be OK (and .git/info/attributes should be able to
override it if needed -- this probably deserves to be added to a
test somewhere by this series).

Luckily, lack of 'in-repository-encoding' attribute is not a show
stopper for this series.  A later topic could start with "earlier,
in order to make use of w-t-e attribute, you had to have your
contents in UTF-8.  Teach the codepath to honor a new attribute that
tells in what encoding the blob contents are stored." without having
to be a part of this topic.


Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-16 Thread Torsten Bögershausen
On Fri, Feb 16, 2018 at 03:42:35PM +0100, Lars Schneider wrote:
[]
> 
> Agreed. However, people using ShiftJIS are not my target audience.
> My target audience are:
> 
> (1) People that have to encode their text files in UTF-16 (for 
> whatever reason - usually because of legacy processes or tools).
> 
> (2) People that want to see textual diffs of their UTF-16 encoded 
> files in their Git tools without special adjustments (on the 
> command line, on the web, etc).
> 
> That was my primary motivation. The fact that w-t-e supports any
> other encoding too is just a nice side effect. I don't foresee people
> using other w-t-encodings other than UTF-16 in my organization.
> 
> I have the suspicion that the feature could be useful for the Git
> community at large. Consider this Stack Overflow question:
> https://stackoverflow.com/questions/777949/can-i-make-git-recognize-a-utf-16-file-as-text
> 
> This question was viewed 42k times and there is no good solution.
> I believe w-t-e could be a good solution.
> 

If it was only about a diff of UTF-16 files, I may suggest a patch.
I simply copy-paste it here for review, if someone thinks that it may
be useful, I can send it as a real patch/RFC.

git show HEAD


commit 9f7d43f29eaf6017b7b16261ce91d8ef182cf415
Author: Torsten Bögershausen 
Date:   Fri Feb 2 15:35:23 2018 +0100

Auto diff of UTF-16 files in UTF-8

When an UTF-16 file is commited and later changed, `git diff` shows
"Binary files XX and YY differ".

When the user wants a diff in UTF-8, a textconv needs to be specified
in .gitattributes and the textconv must be configured.

A more user-friendly diff can be produced for UTF-16 if
- the user did not use `git diff --binary`
- the blob is identified as binary
- the blob has an UTF-16 BOM
- the blob can be converted into UTF-8

Enhance the diff machinery to auto-detect UTF-16 blobs and show them
as UTF-8, unless the user specifies `git diff --binary` which creates
a binary diff.

diff --git a/diff.c b/diff.c
index fb22b19f09..51831ee94d 100644
--- a/diff.c
+++ b/diff.c
@@ -3192,6 +3192,10 @@ static void builtin_diff(const char *name_a,
strbuf_reset();
}
 
+   if (one && one->reencoded_from_utf16)
+   strbuf_addf(, "a is converted to UTF-8 from 
UTF-16\n");
+   if (two && two->reencoded_from_utf16)
+   strbuf_addf(, "b is converted to UTF-8 from 
UTF-16\n");
mf1.size = fill_textconv(textconv_one, one, );
mf2.size = fill_textconv(textconv_two, two, );
 
@@ -3611,8 +3615,25 @@ int diff_populate_filespec(struct diff_filespec *s, 
unsigned int flags)
s->size = size;
s->should_free = 1;
}
-   }
-   else {
+   if (!s->binary && buffer_is_binary(s->data, s->size) &&
+   buffer_has_utf16_bom(s->data, s->size)) {
+   int outsz = 0;
+   char *outbuf;
+   outbuf = reencode_string_len(s->data, (int)s->size,
+"UTF-8", "UTF-16", );
+   if (outbuf) {
+   if (s->should_free)
+   free(s->data);
+   if (s->should_munmap)
+   munmap(s->data, s->size);
+   s->should_munmap = 0;
+   s->data = outbuf;
+   s->size = outsz;
+   s->reencoded_from_utf16 = 1;
+   s->should_free = 1;
+   }
+   }
+   } else {
enum object_type type;
if (size_only || (flags & CHECK_BINARY)) {
type = sha1_object_info(s->oid.hash, >size);
@@ -3629,6 +3650,19 @@ int diff_populate_filespec(struct diff_filespec *s, 
unsigned int flags)
s->data = read_sha1_file(s->oid.hash, , >size);
if (!s->data)
die("unable to read %s", oid_to_hex(>oid));
+   if (!s->binary && buffer_is_binary(s->data, s->size) &&
+   buffer_has_utf16_bom(s->data, s->size)) {
+   int outsz = 0;
+   char *buf;
+   buf = reencode_string_len(s->data, (int)s->size,
+ "UTF-8", "UTF-16", );
+   if (buf) {
+   free(s->data);
+   s->data = buf;
+   s->size = outsz;
+   s->reencoded_from_utf16 = 1;
+   }
+   }
s->should_free = 1;
}
return 0;
@@ -5695,6 +5729,10 @@ 

Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-16 Thread Lars Schneider

> On 15 Feb 2018, at 21:03, Junio C Hamano  wrote:
> 
> lars.schnei...@autodesk.com writes:
> 
>> -- Git clients that do not support the `working-tree-encoding` attribute
>> -  will checkout the respective files UTF-8 encoded and not in the
>> -  expected encoding. Consequently, these files will appear different
>> -  which typically causes trouble. This is in particular the case for
>> -  older Git versions and alternative Git implementations such as JGit
>> -  or libgit2 (as of February 2018).
>> +- Third party Git implementations that do not support the
>> +  `working-tree-encoding` attribute will checkout the respective files
>> +  UTF-8 encoded and not in the expected encoding. Consequently, these
>> +  files will appear different which typically causes trouble. This is
>> +  in particular the case for older Git versions and alternative Git
>> +  implementations such as JGit or libgit2 (as of February 2018).
> 
> I know somebody found "clients" misleading in the original, but the
> ones that do not understand w-t-e do not have to be third party
> reimplementations and imitations.  All existing Git implementations,
> including ours, don't.

Agreed!


> One thing I find more problematic is that the above places *too*
> much stress on the UTF-8 centric worldview.  It is perfectly valid
> to store your text contents encoded in ShiftJIS and check them out
> as-is, with or without this patch.  It is grossly misleading to say
> that older versions of Git will check them out in UTF-8.  "will
> checkout these files as-is without encoding conversion" is a better
> way to say it, probably.

True. But that's not what I wanted to say in the "pitfalls" section.
If my Git client supports w-t-e and I add the ShiftJIS encoded
file "foo.bar" to my repository, then Git will store the file as
UTF-8 _internally_. That means if you clone my repository and your 
Git client does _not_ support w-t-e, then you will see "foo.bar" as 
UTF-8 encoded.


> Also notice that even in the world with w-t-e, such a project won't
> benefit from w-t-e at all.  After all, they have been happy using
> ShiftJIS in repository and using the same encoding on the working
> tree, and because w-t-e assumes that everybody should be using UTF-8
> internally, such a project cannot take advantage of the new
> mechanism.

Agreed. However, people using ShiftJIS are not my target audience.
My target audience are:

(1) People that have to encode their text files in UTF-16 (for 
whatever reason - usually because of legacy processes or tools).

(2) People that want to see textual diffs of their UTF-16 encoded 
files in their Git tools without special adjustments (on the 
command line, on the web, etc).

That was my primary motivation. The fact that w-t-e supports any
other encoding too is just a nice side effect. I don't foresee people
using other w-t-encodings other than UTF-16 in my organization.

I have the suspicion that the feature could be useful for the Git
community at large. Consider this Stack Overflow question:
https://stackoverflow.com/questions/777949/can-i-make-git-recognize-a-utf-16-file-as-text

This question was viewed 42k times and there is no good solution.
I believe w-t-e could be a good solution.



With your comments in mind, I tried to improve the text like this:

Git recognizes files encoded with ASCII or one of its supersets (e.g.
UTF-8, ISO-8859-1, ...) as text files.  Files encoded with certain other
encodings (e.g. UTF-16) are interpreted as binary and consequently
built-in Git text processing tools (e.g. 'git diff') as well as most Git
web front ends do not visualize the contents of these files by default.

...

Please note that using the `working-tree-encoding` attribute may have a
number of pitfalls:

- Alternative Git implementations (e.g. JGit or libgit2) and older Git 
  versions (as of February 2018) do not support the `working-tree-encoding`
  attribute. If you decide to use the `working-tree-encoding` attribute
  in your repository, then it is strongly recommended to ensure that all
  clients working with the repository support it.

  If you declare `*.proj` files as UTF-16 and you add `foo.proj` with an
  `working-tree-encoding` enabled Git client, then `foo.proj` will be
  stored as UTF-8 internally. A client without `working-tree-encoding`
  support will checkout `foo.proj` as UTF-8 encoded file. This will
  typically cause trouble for the users of this file.

  If a Git client, that does not support the `working-tree-encoding`
  attribute, adds a new file `bar.proj`, then `bar.proj` will be
  stored "as-is" internally (in this example probably as UTF-16). 
  A client with `working-tree-encoding` support will interpret the 
  internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
  That operation will fail and cause an error.

...



> And from that point of view, perhaps 

Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-15 Thread Jeff King
On Thu, Feb 15, 2018 at 12:03:06PM -0800, Junio C Hamano wrote:

> And from that point of view, perhaps w-t-e attribute is somewhat
> misdesigned.
> 
> In general, an attribute is about the project's contents in the
> manner independent of platform or environment.  You define "this
> file is a C source" or "this file has JPEG image" there.  What exact
> program you use to present diffs between the two versions of such a
> file (external diff command) or what exact program you use to
> extract the textual representations (textconv filter) is environment
> and platform dependent and is left to the configuration mechanism
> for each repository.
> 
> To be in line with the above design principle, I think the attribute
> ought to be "the in-tree contents of this path is encoded in ..."
> whose values could be things like UTF-8, ShiftJIS, etc.  What
> external encoding the paths should be checked out is not a
> project-wide matter, especially when talking about cross platform
> projects.  Perhaps a project in Japanese language wants to check
> out its contents in EUC-jp on Unices and in ShiftJIS on DOS derived
> systems.  The participants all need to know what in-repository
> encoding is used, which is a sensible use of attributes.  They also
> need to know what the recommended external encoding to be used in
> the working tree is for their platforms, but that is more like what
> Makefile variable to set for their platforms, etc., and is not a
> good match to the attributes system.

While I agree what you're saying philosophically here, I suspect you'd
still need another attribute for "no really, this needs to be checked
out as encoding X". The same way we treat line endings as a platform
decision, but we still need to have `eol=crlf` for those files which
really, no matter what platform you're on, have external tools depending
on them to have some particular line ending.

So a full proposal would support both cases: "check this out in the
local platform's preferred encoding" and "always check this out in
_this_ encoding". And Lars's proposal is just the second half of that.

But I'm not sure anybody even really cares about the first part; I don't
think we've seen anybody actually ask for it.

-Peff


Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-15 Thread Junio C Hamano
lars.schnei...@autodesk.com writes:

> -- Git clients that do not support the `working-tree-encoding` attribute
> -  will checkout the respective files UTF-8 encoded and not in the
> -  expected encoding. Consequently, these files will appear different
> -  which typically causes trouble. This is in particular the case for
> -  older Git versions and alternative Git implementations such as JGit
> -  or libgit2 (as of February 2018).
> +- Third party Git implementations that do not support the
> +  `working-tree-encoding` attribute will checkout the respective files
> +  UTF-8 encoded and not in the expected encoding. Consequently, these
> +  files will appear different which typically causes trouble. This is
> +  in particular the case for older Git versions and alternative Git
> +  implementations such as JGit or libgit2 (as of February 2018).

I know somebody found "clients" misleading in the original, but the
ones that do not understand w-t-e do not have to be third party
reimplementations and imitations.  All existing Git implementations,
including ours, don't.

One thing I find more problematic is that the above places *too*
much stress on the UTF-8 centric worldview.  It is perfectly valid
to store your text contents encoded in ShiftJIS and check them out
as-is, with or without this patch.  It is grossly misleading to say
that older versions of Git will check them out in UTF-8.  "will
checkout these files as-is without encoding conversion" is a better
way to say it, probably.

Also notice that even in the world with w-t-e, such a project won't
benefit from w-t-e at all.  After all, they have been happy using
ShiftJIS in repository and using the same encoding on the working
tree, and because w-t-e assumes that everybody should be using UTF-8
internally, such a project cannot take advantage of the new
mechanism.

And from that point of view, perhaps w-t-e attribute is somewhat
misdesigned.

In general, an attribute is about the project's contents in the
manner independent of platform or environment.  You define "this
file is a C source" or "this file has JPEG image" there.  What exact
program you use to present diffs between the two versions of such a
file (external diff command) or what exact program you use to
extract the textual representations (textconv filter) is environment
and platform dependent and is left to the configuration mechanism
for each repository.

To be in line with the above design principle, I think the attribute
ought to be "the in-tree contents of this path is encoded in ..."
whose values could be things like UTF-8, ShiftJIS, etc.  What
external encoding the paths should be checked out is not a
project-wide matter, especially when talking about cross platform
projects.  Perhaps a project in Japanese language wants to check
out its contents in EUC-jp on Unices and in ShiftJIS on DOS derived
systems.  The participants all need to know what in-repository
encoding is used, which is a sensible use of attributes.  They also
need to know what the recommended external encoding to be used in
the working tree is for their platforms, but that is more like what
Makefile variable to set for their platforms, etc., and is not a
good match to the attributes system.



[PATCH v7 0/7] convert: add support for different encodings

2018-02-15 Thread lars . schneider
From: Lars Schneider 

Hi,

Patches 1-4, 6 are preparation and helper functions.
Patch 5,7 are the actual change.

This series depends on Torsten's 8462ff43e4 (convert_to_git():
safe_crlf/checksafe becomes int conv_flags, 2018-01-13) which is already
in master.

Changes since v6:

* use consistent casing for core.checkRoundtripEncoding (Junio)
* fix gibberish in commit message (Junio)
* improve documentation (Torsten)
* improve advise messages (Torsten)


Thanks,
Lars

  RFC: 
https://public-inbox.org/git/bdb9b884-6d17-4be3-a83c-f67e2afa2...@gmail.com/
   v1: 
https://public-inbox.org/git/20171211155023.1405-1-lars.schnei...@autodesk.com/
   v2: 
https://public-inbox.org/git/2017122915.39680-1-lars.schnei...@autodesk.com/
   v3: 
https://public-inbox.org/git/20180106004808.77513-1-lars.schnei...@autodesk.com/
   v4: 
https://public-inbox.org/git/20180120152418.52859-1-lars.schnei...@autodesk.com/
   v5: https://public-inbox.org/git/20180129201855.9182-1-tbo...@web.de/
   v6: 
https://public-inbox.org/git/20180209132830.55385-1-lars.schnei...@autodesk.com/


Base Ref:
Web-Diff: https://github.com/larsxschneider/git/commit/2b94bec353
Checkout: git fetch https://github.com/larsxschneider/git encoding-v7 && git 
checkout 2b94bec353


### Interdiff (v6..v7):

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index ea5a9509c6..10cb37795d 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -291,19 +291,20 @@ the content is reencoded back to the specified encoding.
 Please note that using the `working-tree-encoding` attribute may have a
 number of pitfalls:

-- Git clients that do not support the `working-tree-encoding` attribute
-  will checkout the respective files UTF-8 encoded and not in the
-  expected encoding. Consequently, these files will appear different
-  which typically causes trouble. This is in particular the case for
-  older Git versions and alternative Git implementations such as JGit
-  or libgit2 (as of February 2018).
+- Third party Git implementations that do not support the
+  `working-tree-encoding` attribute will checkout the respective files
+  UTF-8 encoded and not in the expected encoding. Consequently, these
+  files will appear different which typically causes trouble. This is
+  in particular the case for older Git versions and alternative Git
+  implementations such as JGit or libgit2 (as of February 2018).

 - Reencoding content to non-UTF encodings can cause errors as the
   conversion might not be UTF-8 round trip safe. If you suspect your
-  encoding to not be round trip safe, then add it to 
`core.checkRoundtripEncoding`
-  to make Git check the round trip encoding (see linkgit:git-config[1]).
-  SHIFT-JIS (Japanese character set) is known to have round trip issues
-  with UTF-8 and is checked by default.
+  encoding to not be round trip safe, then add it to
+  `core.checkRoundtripEncoding` to make Git check the round trip
+  encoding (see linkgit:git-config[1]). SHIFT-JIS (Japanese character
+  set) is known to have round trip issues with UTF-8 and is checked by
+  default.

 - Reencoding content requires resources that might slow down certain
   Git operations (e.g 'git checkout' or 'git add').
@@ -327,7 +328,7 @@ explicitly define the line endings with `eol` if the 
`working-tree-encoding`
 attribute is used to avoid ambiguity.

 
-*.proj working-tree-encoding=UTF-16LE text eol=CRLF
+*.proj text working-tree-encoding=UTF-16LE eol=CRLF
 

 You can get a list of all available encodings on your platform with the
diff --git a/convert.c b/convert.c
index 71dffc7167..398cd9cf7b 100644
--- a/convert.c
+++ b/convert.c
@@ -352,29 +352,29 @@ static int encode_to_git(const char *path, const char 
*src, size_t src_len,

if (has_prohibited_utf_bom(enc->name, src, src_len)) {
const char *error_msg = _(
-   "BOM is prohibited for '%s' if encoded as %s");
+   "BOM is prohibited in '%s' if encoded as %s");
+   /*
+* This advise is shown for UTF-??BE and UTF-??LE encodings.
+* We truncate the encoding name to 6 chars with %.6s to cut
+* off the last two "byte order" characters.
+*/
const char *advise_msg = _(
-   "You told Git to treat '%s' as %s. A byte order mark "
-   "(BOM) is prohibited with this encoding. Either use "
-   "%.6s as working tree encoding or remove the BOM from 
the "
-   "file.");
-
-   advise(advise_msg, path, enc->name, enc->name, enc->name);
+   "The file '%s' contains a byte order mark (BOM). "
+   "Please use %.6s as working-tree-encoding.");
+   advise(advise_msg, path, enc->name);
if