Re: [PATCH v7 0/7] convert: add support for different encodings
On 2018-02-28 14:21, Jeff King wrote: > On Wed, Feb 28, 2018 at 09:20:05AM +0100, Torsten Bögershausen wrote: > >>> 2. auto-detect utf-16 (your patch) >>> - Just Works for existing repositories storing utf-16 >>> >>> - carries some risk of kicking in when people would like it not to >>>(e.g., when they really do want a binary patch that can be >>>applied). >> >> The binary patch is still supported, but that detail may need some more >> explanation >> in the commit message. Please see t4066-diff-encoding.sh > > Yeah, but if you don't have binary-patches enabled we'd generate a bogus > patch. Which, granted, without that you wouldn't be able to apply the > patch either. But somehow it feels funny to me to generate something > that _looks_ like a patch but you can't actually apply. > > I also think we'd want a plan for this to be used consistently in other > diff-like tools. E.g., "git blame" uses textconv for the starting file > content, and it would be nice for this to kick in then, too. Ditto for > things like grep, pickaxe, etc. > > I have some patches that reuse some of the textconv infrastructure for > this, which should mostly make it "just work" everywhere. They need a > little more polishing before I post them, but you can take a look at: > > https://github.com/peff/git.git jk/textconv-utf16 > > if you want. > > -Peff > Thanks for your work (I actually found some time to take look) I am looking at the code to put 2 or 3 things on top of it: - test case(s) - documentation - teach diff to add a line "b is converted to UTF-8 from UTF-16" - teach apply to reads & understands the encoding line and throws in a "reencode_string_len() like your patch does This would keep "git diff | git apply" happy. All in all the changes do not look too invasive, at least from my point of view.
Re: [PATCH v7 0/7] convert: add support for different encodings
On Wed, Feb 28, 2018 at 09:42:27AM -0800, Junio C Hamano wrote: > > I also think we'd want a plan for this to be used consistently in other > > diff-like tools. E.g., "git blame" uses textconv for the starting file > > content, and it would be nice for this to kick in then, too. Ditto for > > things like grep, pickaxe, etc. > > You probably do not want to limit your thinking to the generation > side. It is entirely plausible to have "we deal with diff in this > encoding X" in addition to "the in-repo encoding for this project is > this encoding Y" and "the working tree encoding for this path is Z" > and allow them to interact in "git diff | git apply" pipeline. > > "diff/format-patch --stdout/etc." on the upstream would first iconv > Y to X and feed the contents in X to xdiff machinery, which is sent > down the pipe and received by apply, which reads the preimage from > the disk or from the repository. If doing "apply" without > "--cached/--index", the preimage data from the disk would go through > iconv Z to X. If doing "apply --cached/--index", the preimage data > from the repo would go through iconv Y to X. The incoming patch is > in X, so we apply, and the resulting postimage will be re-encoded in > Z in the working tree and Y in the repository. I agree that would be convenient, but I have to wonder if all the complexity is worth it to maintain the idea of a distinct in-repo representation. It seems like it would open up a ton of corner cases. And I suspect most people would be happy enough with either a clean/smudge style worktree conversion or a textconv-style view. So if somebody wants to work on it, I don't want to stop them. But I think there's room for the simpler solutions in the meantime. -Peff
Re: [PATCH v7 0/7] convert: add support for different encodings
> On 27 Feb 2018, at 22:25, Jeff Kingwrote: > > On Tue, Feb 27, 2018 at 10:05:17PM +0100, Torsten Bögershausen wrote: > > Of the three solutions, I think the relative merits are something like > this: > > 1. baked-in textconv (my patch) > > - reuses an existing diff feature, so minimal code and not likely to > break things > > - requires people to add a .gitattributes entry > > - needs work to make bare-repo .gitattributes work (though I think > this would be useful for other features, too) > > - has a run-time cost at each diff to do the conversion > > - may sometimes annoy people when it doesn't kick in (e.g., > emailed patches from format-patch won't have a readable diff) > > - doesn't combine with other custom-diff config (e.g., utf-16 > storing C code should still use diff=c funcname rules, but > wouldn't with my patch) > > 2. auto-detect utf-16 (your patch) > - Just Works for existing repositories storing utf-16 > > - carries some risk of kicking in when people would like it not to > (e.g., when they really do want a binary patch that can be > applied). > > I think it would probably be OK if this kicked in only when > ALLOW_TEXTCONV is set (the default for porcelain), and --binary > is not (i.e., when we would otherwise just say "binary > files differ"). > > - similar to (1), carries a run-time cost for each diff, and users > may sometimes still see binary diffs > > 3. w-t-e (Lars's patch) > > - requires no server-side modifications; the diff is plain vanilla > > - works everywhere you diff, plumbing and porcelain > > - does require people to add a .gitattributes entry > > - run-time cost is per-checkout, not per-diff > > So I can see room for (3) to co-exist alongside the others. Between (1) > and (2), I think (2) is probably the better direction. Thanks for the great summary! I agree they could co-exist and people could use whatever works best for them. I'll incorporate Eric's feedback and send a w-t-e v9 soonish. - Lars
Re: [PATCH v7 0/7] convert: add support for different encodings
Jeff Kingwrites: >> The binary patch is still supported, but that detail may need some more >> explanation >> in the commit message. Please see t4066-diff-encoding.sh > > Yeah, but if you don't have binary-patches enabled we'd generate a bogus > patch. Which, granted, without that you wouldn't be able to apply the > patch either. But somehow it feels funny to me to generate something > that _looks_ like a patch but you can't actually apply. True. And at least you _could_ apply a properly formatted binary patch to the original. > I also think we'd want a plan for this to be used consistently in other > diff-like tools. E.g., "git blame" uses textconv for the starting file > content, and it would be nice for this to kick in then, too. Ditto for > things like grep, pickaxe, etc. You probably do not want to limit your thinking to the generation side. It is entirely plausible to have "we deal with diff in this encoding X" in addition to "the in-repo encoding for this project is this encoding Y" and "the working tree encoding for this path is Z" and allow them to interact in "git diff | git apply" pipeline. "diff/format-patch --stdout/etc." on the upstream would first iconv Y to X and feed the contents in X to xdiff machinery, which is sent down the pipe and received by apply, which reads the preimage from the disk or from the repository. If doing "apply" without "--cached/--index", the preimage data from the disk would go through iconv Z to X. If doing "apply --cached/--index", the preimage data from the repo would go through iconv Y to X. The incoming patch is in X, so we apply, and the resulting postimage will be re-encoded in Z in the working tree and Y in the repository.
Re: [PATCH v7 0/7] convert: add support for different encodings
On Wed, Feb 28, 2018 at 09:20:05AM +0100, Torsten Bögershausen wrote: > > 2. auto-detect utf-16 (your patch) > > - Just Works for existing repositories storing utf-16 > > > > - carries some risk of kicking in when people would like it not to > >(e.g., when they really do want a binary patch that can be > >applied). > > The binary patch is still supported, but that detail may need some more > explanation > in the commit message. Please see t4066-diff-encoding.sh Yeah, but if you don't have binary-patches enabled we'd generate a bogus patch. Which, granted, without that you wouldn't be able to apply the patch either. But somehow it feels funny to me to generate something that _looks_ like a patch but you can't actually apply. I also think we'd want a plan for this to be used consistently in other diff-like tools. E.g., "git blame" uses textconv for the starting file content, and it would be nice for this to kick in then, too. Ditto for things like grep, pickaxe, etc. I have some patches that reuse some of the textconv infrastructure for this, which should mostly make it "just work" everywhere. They need a little more polishing before I post them, but you can take a look at: https://github.com/peff/git.git jk/textconv-utf16 if you want. -Peff
Re: [PATCH v7 0/7] convert: add support for different encodings
On Tue, Feb 27, 2018 at 04:25:38PM -0500, Jeff King wrote: > On Tue, Feb 27, 2018 at 10:05:17PM +0100, Torsten Bögershausen wrote: > > > The other question is: > > Would this help showing diffs of UTF-16 encoded files on a "git hoster", > > github/bitbucket/ ? > > Almost. There's probably one more thing needed. We don't currently read > in-tree .gitattributes when doing a diff in a bare repository. And most > hosting sites will store bare repositories. > > And of course it would require the users to actually set the attributes > themselves. > > > Or would the auto-magic UTF-16 avoid binary patch that I send out be more > > helpful ? > > Or both ? > > Or the w-t-e encoding ? > > Of the three solutions, I think the relative merits are something like > this: > > 1. baked-in textconv (my patch) > > - reuses an existing diff feature, so minimal code and not likely to >break things > > - requires people to add a .gitattributes entry > > - needs work to make bare-repo .gitattributes work (though I think >this would be useful for other features, too) > > - has a run-time cost at each diff to do the conversion > > - may sometimes annoy people when it doesn't kick in (e.g., >emailed patches from format-patch won't have a readable diff) > > - doesn't combine with other custom-diff config (e.g., utf-16 >storing C code should still use diff=c funcname rules, but >wouldn't with my patch) > > 2. auto-detect utf-16 (your patch) > - Just Works for existing repositories storing utf-16 > > - carries some risk of kicking in when people would like it not to >(e.g., when they really do want a binary patch that can be >applied). The binary patch is still supported, but that detail may need some more explanation in the commit message. Please see t4066-diff-encoding.sh test_expect_success 'diff --binary against local change' ' cp file2 file && test_tick && cat >expect <<-\EOF && diff --git a/file b/file index 26acf09b0aad19fb22566956d1a39cb4e2a3b420..e98d27acfb90cfcfc84fcc5173baa4aa7828290f 100644 GIT binary patch literal 28 ecmezW?;ArgLn;Fo!ykquAe{qbJq3!C0BHb{ln3Pi literal 32 icmezW?+HT@Lpnn$kmO?c#!w7oaWVX1NCMJ1Ko$VA_z0~4 EOF git diff --binary file >actual && test_cmp expect actual > >I think it would probably be OK if this kicked in only when >ALLOW_TEXTCONV is set (the default for porcelain), and --binary >is not (i.e., when we would otherwise just say "binary >files differ"). The user can still use "git diff" (Where auto-detection of UTF-16 kicks in and replaces "binary files differ" with an UTF-8 diff. When the user wants a patch, "git diff --binary" will generate a binary patch, as before. The only thing which is missing is the line "binary files differ", which may be a regression. I can re-add it in V2. > > - similar to (1), carries a run-time cost for each diff, and users >may sometimes still see binary diffs > > 3. w-t-e (Lars's patch) > > - requires no server-side modifications; the diff is plain vanilla > > - works everywhere you diff, plumbing and porcelain > > - does require people to add a .gitattributes entry > > - run-time cost is per-checkout, not per-diff > > So I can see room for (3) to co-exist alongside the others. Between (1) > and (2), I think (2) is probably the better direction. > > -Peff
Re: [PATCH v7 0/7] convert: add support for different encodings
On Tue, Feb 27, 2018 at 02:10:20PM -0800, Junio C Hamano wrote: > > I thought it solved that by the hosting folks never seeing the strange > > binary-looking data. They see only utf8, which diffs well. > > Ah, OK, that is a "fix" in a wider context (in a narrower context, > "work around" is a more appropriate term ;-). > > The reason why I have been nudging people toward considering in-repo > encoding attribute is because forcing projects that already have > their contents in a strange binary-looking encoding to switch is > costly. But perhaps having them pay one-time conversion pain is a > better investment longer term. Yeah, thanks for mentioning that. It should have gone in my "relative merits" list. The conversion flag-day is definitely going to be a pain for users, and doesn't help with diffing older versions. -Peff
Re: [PATCH v7 0/7] convert: add support for different encodings
Jeff Kingwrites: > On Tue, Feb 27, 2018 at 01:55:02PM -0800, Junio C Hamano wrote: > >> Jeff King writes: >> >> > Of the three solutions, I think the relative merits are something like >> > this: >> > ... >> > 3. w-t-e (Lars's patch) >> >> I thought Lars's w-t-e was about keeping the in-repo contents in >> UTF-8 and externalize in whatever encoding (e.g. UTF-16), so it >> won't help the issue hosting folks want to deal with, i.e. showing >> in-repo data that is stored in a strange binary-looking encoding in >> a more reasonable encodign while diffing, no? > > I thought it solved that by the hosting folks never seeing the strange > binary-looking data. They see only utf8, which diffs well. Ah, OK, that is a "fix" in a wider context (in a narrower context, "work around" is a more appropriate term ;-). The reason why I have been nudging people toward considering in-repo encoding attribute is because forcing projects that already have their contents in a strange binary-looking encoding to switch is costly. But perhaps having them pay one-time conversion pain is a better investment longer term.
Re: [PATCH v7 0/7] convert: add support for different encodings
On Tue, Feb 27, 2018 at 01:55:02PM -0800, Junio C Hamano wrote: > Jeff Kingwrites: > > > Of the three solutions, I think the relative merits are something like > > this: > > ... > > 3. w-t-e (Lars's patch) > > I thought Lars's w-t-e was about keeping the in-repo contents in > UTF-8 and externalize in whatever encoding (e.g. UTF-16), so it > won't help the issue hosting folks want to deal with, i.e. showing > in-repo data that is stored in a strange binary-looking encoding in > a more reasonable encodign while diffing, no? I thought it solved that by the hosting folks never seeing the strange binary-looking data. They see only utf8, which diffs well. -Peff
Re: [PATCH v7 0/7] convert: add support for different encodings
Jeff Kingwrites: > Of the three solutions, I think the relative merits are something like > this: > ... > 3. w-t-e (Lars's patch) I thought Lars's w-t-e was about keeping the in-repo contents in UTF-8 and externalize in whatever encoding (e.g. UTF-16), so it won't help the issue hosting folks want to deal with, i.e. showing in-repo data that is stored in a strange binary-looking encoding in a more reasonable encodign while diffing, no? Usually we only work in-repo encoding when producing a diff and show the result in in-repo encoding, but I can imagine a new attribute, when set, we first convert in-repo to the specified encoding before passing the result to xdiff machinery. Then convert it back to in-repo encoding before showing the diff (or just show the result in that encoding xdiff machinery processed---I do not know which one should be the default).
Re: [PATCH v7 0/7] convert: add support for different encodings
On Tue, Feb 27, 2018 at 10:05:17PM +0100, Torsten Bögershausen wrote: > The other question is: > Would this help showing diffs of UTF-16 encoded files on a "git hoster", > github/bitbucket/ ? Almost. There's probably one more thing needed. We don't currently read in-tree .gitattributes when doing a diff in a bare repository. And most hosting sites will store bare repositories. And of course it would require the users to actually set the attributes themselves. > Or would the auto-magic UTF-16 avoid binary patch that I send out be more > helpful ? > Or both ? > Or the w-t-e encoding ? Of the three solutions, I think the relative merits are something like this: 1. baked-in textconv (my patch) - reuses an existing diff feature, so minimal code and not likely to break things - requires people to add a .gitattributes entry - needs work to make bare-repo .gitattributes work (though I think this would be useful for other features, too) - has a run-time cost at each diff to do the conversion - may sometimes annoy people when it doesn't kick in (e.g., emailed patches from format-patch won't have a readable diff) - doesn't combine with other custom-diff config (e.g., utf-16 storing C code should still use diff=c funcname rules, but wouldn't with my patch) 2. auto-detect utf-16 (your patch) - Just Works for existing repositories storing utf-16 - carries some risk of kicking in when people would like it not to (e.g., when they really do want a binary patch that can be applied). I think it would probably be OK if this kicked in only when ALLOW_TEXTCONV is set (the default for porcelain), and --binary is not (i.e., when we would otherwise just say "binary files differ"). - similar to (1), carries a run-time cost for each diff, and users may sometimes still see binary diffs 3. w-t-e (Lars's patch) - requires no server-side modifications; the diff is plain vanilla - works everywhere you diff, plumbing and porcelain - does require people to add a .gitattributes entry - run-time cost is per-checkout, not per-diff So I can see room for (3) to co-exist alongside the others. Between (1) and (2), I think (2) is probably the better direction. -Peff
Re: [PATCH v7 0/7] convert: add support for different encodings
On Mon, Feb 26, 2018 at 03:46:35PM -0500, Jeff King wrote: > On Mon, Feb 26, 2018 at 06:35:33PM +0100, Torsten Bögershausen wrote: > > > > diff --git a/userdiff.c b/userdiff.c > > > index dbfb4e13cd..48fa7e8bdd 100644 > > > --- a/userdiff.c > > > +++ b/userdiff.c > > > @@ -161,6 +161,7 @@ IPATTERN("css", > > >"-?[_a-zA-Z][-_a-zA-Z0-9]*" /* identifiers */ > > >"|-?[0-9]+|\\#[0-9a-fA-F]+" /* numbers */ > > > ), > > > +{ "utf16", NULL, -1, { NULL, 0 }, NULL, "iconv:utf16" }, > > > { "default", NULL, -1, { NULL, 0 } }, > > > }; > > > #undef PATTERNS > > > > The patch looks like a possible step into the right direction - > > some minor notes: "utf8" is better written as "UTF-8", when talking > > to iconv.h, same for utf16. > > > > But, how do I activate the diff ? > > I have in .gitattributes > > XXXenglish.txt diff=UTF-16 > > > > and in .git/config > > [diff "UTF-16"] > > command = iconv:UTF-16 > > > > > > What am I doing wrong ? > > After applying the patch, if I do: > > git init > echo hello | iconv -f utf8 -t utf16 >file > git add file > git commit -m one > echo goodbye | iconv -f utf8 -t utf16 >file > git add file > git commit -m two > > then: > > git log -p > > shows "binary files differ" but: > > echo "file diff=utf16" >.gitattributes > git log -p > > shows text diffs. I assume you tweaked the patch before switching to > the UTF-16 spelling in your example. Did you use a plumbing command to > show the diff? textconv isn't enabled for plumbing, because the > resulting patches cannot actually be applied (in that sense an encoding > switch is potentially special, since in theory one could convert to the > canonical text format, apply the patch, and then convert back). > > -Peff Thanks for helping me out. I didn't use "git log -p", but a simple "git diff". (And after re-using utf16 with lowercase, it works as you described it) I wasn't aware of "git log -p", something learned (or re-learned) The other question is: Would this help showing diffs of UTF-16 encoded files on a "git hoster", github/bitbucket/ ? Or would the auto-magic UTF-16 avoid binary patch that I send out be more helpful ? Or both ? Or the w-t-e encoding ? Questions over questions.
Re: [PATCH v7 0/7] convert: add support for different encodings
On Mon, Feb 26, 2018 at 06:35:33PM +0100, Torsten Bögershausen wrote: > > diff --git a/userdiff.c b/userdiff.c > > index dbfb4e13cd..48fa7e8bdd 100644 > > --- a/userdiff.c > > +++ b/userdiff.c > > @@ -161,6 +161,7 @@ IPATTERN("css", > > "-?[_a-zA-Z][-_a-zA-Z0-9]*" /* identifiers */ > > "|-?[0-9]+|\\#[0-9a-fA-F]+" /* numbers */ > > ), > > +{ "utf16", NULL, -1, { NULL, 0 }, NULL, "iconv:utf16" }, > > { "default", NULL, -1, { NULL, 0 } }, > > }; > > #undef PATTERNS > > The patch looks like a possible step into the right direction - > some minor notes: "utf8" is better written as "UTF-8", when talking > to iconv.h, same for utf16. > > But, how do I activate the diff ? > I have in .gitattributes > XXXenglish.txt diff=UTF-16 > > and in .git/config > [diff "UTF-16"] > command = iconv:UTF-16 > > > What am I doing wrong ? After applying the patch, if I do: git init echo hello | iconv -f utf8 -t utf16 >file git add file git commit -m one echo goodbye | iconv -f utf8 -t utf16 >file git add file git commit -m two then: git log -p shows "binary files differ" but: echo "file diff=utf16" >.gitattributes git log -p shows text diffs. I assume you tweaked the patch before switching to the UTF-16 spelling in your example. Did you use a plumbing command to show the diff? textconv isn't enabled for plumbing, because the resulting patches cannot actually be applied (in that sense an encoding switch is potentially special, since in theory one could convert to the canonical text format, apply the patch, and then convert back). -Peff
Re: [PATCH v7 0/7] convert: add support for different encodings
On Sun, Feb 25, 2018 at 08:44:46PM -0500, Jeff King wrote: > On Sat, Feb 24, 2018 at 04:18:36PM +0100, Lars Schneider wrote: > > > > We always use the in-repo contents when generating 'diff'. I think > > > by "attribute to be used in diff", what you are reallying after is > > > to convert the in-repo contents to that encoding _BEFORE_ running > > > 'diff' on it. E.g. in-repo UTF-16 that can have NUL bytes all over > > > the place will not diff well with the xdiff machinery, but if you > > > first convert it to UTF-8 and have xdiff work on it, you can get > > > reasonable result out of it. It is unclear what encoding you want > > > your final diff output in (it is equally valid in such a set-up to > > > desire your patch output in UTF-16 or UTF-8), but assuming that you > > > want UTF-8 in your patch output, perhaps we do not have to break > > > gitk users by hijacking the 'encoding' attribute. Instead what you > > > want is a single bit that says between in-repo or working tree which > > > representation should be given to the xdiff machinery. > > > > I fear that we could confuse users with an additional knob/bit that > > defines what we diff against. Git always diff'ed against in-repo > > content and I feel it should stay that way. > > Well, except for textconv. You can already do this: > > echo "foo diff=utf16" >.gitattributes > git config diff.utf16.textconv 'iconv -f utf16 -t utf8' > > We could make that easier to use and much more efficient by: > > 1. Allowing a special syntax for textconv filters that kicks off an > internal iconv. > > 2. Providing baked-in config for utf16. > > The patch below provides a sketch. But I think Torsten raised a good > point that you might want the encoding conversion to be independent of > other diff characteristics (so, e.g., you might say "this is utf16 but > once converted treat it like C code for finding funcnames, etc"). > > --- > diff --git a/diff.c b/diff.c > index 21c3838b25..04032e059c 100644 > --- a/diff.c > +++ b/diff.c > @@ -5968,6 +5968,21 @@ struct diff_filepair *diff_unmerge(struct diff_options > *options, const char *pat > return pair; > } > > +static char *iconv_textconv(const char *encoding, struct diff_filespec *spec, > + size_t *outsize) > +{ > + char *ret; > + int outsize_int; /* this really should be a size_t */ > + > + if (diff_populate_filespec(spec, 0)) > + die("unable to load content for %s", spec->path); > + ret = reencode_string_len(spec->data, spec->size, > + "utf-8", /* should be log_output_encoding? */ > + encoding, _int); > + *outsize = outsize_int; > + return ret; > +} > + > static char *run_textconv(const char *pgm, struct diff_filespec *spec, > size_t *outsize) > { > @@ -5978,6 +5993,9 @@ static char *run_textconv(const char *pgm, struct > diff_filespec *spec, > struct strbuf buf = STRBUF_INIT; > int err = 0; > > + if (skip_prefix(pgm, "iconv:", )) > + return iconv_textconv(pgm, spec, outsize); > + > temp = prepare_temp_file(spec->path, spec); > *arg++ = pgm; > *arg++ = temp->name; > diff --git a/userdiff.c b/userdiff.c > index dbfb4e13cd..48fa7e8bdd 100644 > --- a/userdiff.c > +++ b/userdiff.c > @@ -161,6 +161,7 @@ IPATTERN("css", >"-?[_a-zA-Z][-_a-zA-Z0-9]*" /* identifiers */ >"|-?[0-9]+|\\#[0-9a-fA-F]+" /* numbers */ > ), > +{ "utf16", NULL, -1, { NULL, 0 }, NULL, "iconv:utf16" }, > { "default", NULL, -1, { NULL, 0 } }, > }; > #undef PATTERNS The patch looks like a possible step into the right direction - some minor notes: "utf8" is better written as "UTF-8", when talking to iconv.h, same for utf16. But, how do I activate the diff ? I have in .gitattributes XXXenglish.txt diff=UTF-16 and in .git/config [diff "UTF-16"] command = iconv:UTF-16 What am I doing wrong ?
Re: [PATCH v7 0/7] convert: add support for different encodings
On Sat, Feb 24, 2018 at 04:18:36PM +0100, Lars Schneider wrote: > > We always use the in-repo contents when generating 'diff'. I think > > by "attribute to be used in diff", what you are reallying after is > > to convert the in-repo contents to that encoding _BEFORE_ running > > 'diff' on it. E.g. in-repo UTF-16 that can have NUL bytes all over > > the place will not diff well with the xdiff machinery, but if you > > first convert it to UTF-8 and have xdiff work on it, you can get > > reasonable result out of it. It is unclear what encoding you want > > your final diff output in (it is equally valid in such a set-up to > > desire your patch output in UTF-16 or UTF-8), but assuming that you > > want UTF-8 in your patch output, perhaps we do not have to break > > gitk users by hijacking the 'encoding' attribute. Instead what you > > want is a single bit that says between in-repo or working tree which > > representation should be given to the xdiff machinery. > > I fear that we could confuse users with an additional knob/bit that > defines what we diff against. Git always diff'ed against in-repo > content and I feel it should stay that way. Well, except for textconv. You can already do this: echo "foo diff=utf16" >.gitattributes git config diff.utf16.textconv 'iconv -f utf16 -t utf8' We could make that easier to use and much more efficient by: 1. Allowing a special syntax for textconv filters that kicks off an internal iconv. 2. Providing baked-in config for utf16. The patch below provides a sketch. But I think Torsten raised a good point that you might want the encoding conversion to be independent of other diff characteristics (so, e.g., you might say "this is utf16 but once converted treat it like C code for finding funcnames, etc"). --- diff --git a/diff.c b/diff.c index 21c3838b25..04032e059c 100644 --- a/diff.c +++ b/diff.c @@ -5968,6 +5968,21 @@ struct diff_filepair *diff_unmerge(struct diff_options *options, const char *pat return pair; } +static char *iconv_textconv(const char *encoding, struct diff_filespec *spec, + size_t *outsize) +{ + char *ret; + int outsize_int; /* this really should be a size_t */ + + if (diff_populate_filespec(spec, 0)) + die("unable to load content for %s", spec->path); + ret = reencode_string_len(spec->data, spec->size, + "utf-8", /* should be log_output_encoding? */ + encoding, _int); + *outsize = outsize_int; + return ret; +} + static char *run_textconv(const char *pgm, struct diff_filespec *spec, size_t *outsize) { @@ -5978,6 +5993,9 @@ static char *run_textconv(const char *pgm, struct diff_filespec *spec, struct strbuf buf = STRBUF_INIT; int err = 0; + if (skip_prefix(pgm, "iconv:", )) + return iconv_textconv(pgm, spec, outsize); + temp = prepare_temp_file(spec->path, spec); *arg++ = pgm; *arg++ = temp->name; diff --git a/userdiff.c b/userdiff.c index dbfb4e13cd..48fa7e8bdd 100644 --- a/userdiff.c +++ b/userdiff.c @@ -161,6 +161,7 @@ IPATTERN("css", "-?[_a-zA-Z][-_a-zA-Z0-9]*" /* identifiers */ "|-?[0-9]+|\\#[0-9a-fA-F]+" /* numbers */ ), +{ "utf16", NULL, -1, { NULL, 0 }, NULL, "iconv:utf16" }, { "default", NULL, -1, { NULL, 0 } }, }; #undef PATTERNS
Re: [PATCH v7 0/7] convert: add support for different encodings
> On 23 Feb 2018, at 21:11, Junio C Hamanowrote: > > Junio C Hamano writes: > >> Lars Schneider writes: >> >>> I still think it would be nice to see diffs for arbitrary encodings. >>> Would it be an option to read the `encoding` attribute and use it in >>> `git diff`? >> >> Reusing that gitk-only thing and suddenly start doing so would break >> gitk users, no? The tool expects the diff to come out encoded in >> the encoding that is specified by that attribute (which is learned >> from get_path_encoding helper) and does its thing. >> >> I guess that gitk uses diff-tree plumbing and you won't be applying >> this change to the plumbing, perhaps? If so, it might not be too >> bad, but those who decided to postprocess "git diff" output (instead >> of "git diff-tree" output) mimicking how gitk does it by thinking >> that is the safe and sane thing to do will be broken by such a >> change. You could do "use the encoding only when a command line >> option says so", but then people will add a configuration variable >> to turn it always on and these existing scripts will be broken. >> >> I do not personally have much sympathy for the last case (i.e. those >> who scripted around 'git diff' instead of 'git diff-tree' to get >> broken), so making the new feature only work with the Porcelain "git >> diff" might be an option. I'll need a bit more time to formulate >> the rest of my thought ;-) > > So we are introducing in this series a way to say in what encoding > the things should be placed in the working tree files (i.e. the > w-t-e attribute attached to the paths). Currently there is no > mechanism to say what encoding the in-repo contents are and UTF-8 is > assumed when conversion from/to w-t-e is required, but there is no > fundamental reason why it shouldn't be customizable (if anything, as > a piece of fact on the in-repo data, in-repo-encoding is *more* > appropriate to be an attribute than w-t-e that can merely be project > preference at best, as I mentioned earlier in this thread). Correct. > We always use the in-repo contents when generating 'diff'. I think > by "attribute to be used in diff", what you are reallying after is > to convert the in-repo contents to that encoding _BEFORE_ running > 'diff' on it. E.g. in-repo UTF-16 that can have NUL bytes all over > the place will not diff well with the xdiff machinery, but if you > first convert it to UTF-8 and have xdiff work on it, you can get > reasonable result out of it. It is unclear what encoding you want > your final diff output in (it is equally valid in such a set-up to > desire your patch output in UTF-16 or UTF-8), but assuming that you > want UTF-8 in your patch output, perhaps we do not have to break > gitk users by hijacking the 'encoding' attribute. Instead what you > want is a single bit that says between in-repo or working tree which > representation should be given to the xdiff machinery. I fear that we could confuse users with an additional knob/bit that defines what we diff against. Git always diff'ed against in-repo content and I feel it should stay that way. However, I agree with your earlier emails that "working-tree-encoding" is just one half of the feature. I also think it would be nice to be able to define the "in-repo-encoding" as well. Then we could define something like that: *.foo text in-repo-encoding=UTF-16LE This tells Git that the file is stored as UTF-16LE. This would help Git generating a diff via UTF-8 conversion. I feel that the final patch should be in UTF-16LE again. Maybe over time we could then deprecate the "encoding" attribute as the "in-repo-encoding" attribute serves a similar purpose (maybe gitk can switch to it). In that case we could also do things like that: *.bar text working-tree-encoding=SHIFT-JIS in-repo-encoding=UTF-16LE SHIFT-JIS encoded files would be reencoded to UTF-16LE on checkin. On checkout the opposite would happen. This way we would lift the "UTF-8 is the only in-repo encoding" limitation of the current w-t-e implementation. Does this sound sensible to you? That being said, I think "in-repo-encoding" would deserve an own series. - Lars
Re: [PATCH v7 0/7] convert: add support for different encodings
Junio C Hamanowrites: > Lars Schneider writes: > >> I still think it would be nice to see diffs for arbitrary encodings. >> Would it be an option to read the `encoding` attribute and use it in >> `git diff`? > > Reusing that gitk-only thing and suddenly start doing so would break > gitk users, no? The tool expects the diff to come out encoded in > the encoding that is specified by that attribute (which is learned > from get_path_encoding helper) and does its thing. > > I guess that gitk uses diff-tree plumbing and you won't be applying > this change to the plumbing, perhaps? If so, it might not be too > bad, but those who decided to postprocess "git diff" output (instead > of "git diff-tree" output) mimicking how gitk does it by thinking > that is the safe and sane thing to do will be broken by such a > change. You could do "use the encoding only when a command line > option says so", but then people will add a configuration variable > to turn it always on and these existing scripts will be broken. > > I do not personally have much sympathy for the last case (i.e. those > who scripted around 'git diff' instead of 'git diff-tree' to get > broken), so making the new feature only work with the Porcelain "git > diff" might be an option. I'll need a bit more time to formulate > the rest of my thought ;-) So we are introducing in this series a way to say in what encoding the things should be placed in the working tree files (i.e. the w-t-e attribute attached to the paths). Currently there is no mechanism to say what encoding the in-repo contents are and UTF-8 is assumed when conversion from/to w-t-e is required, but there is no fundamental reason why it shouldn't be customizable (if anything, as a piece of fact on the in-repo data, in-repo-encoding is *more* appropriate to be an attribute than w-t-e that can merely be project preference at best, as I mentioned earlier in this thread). We always use the in-repo contents when generating 'diff'. I think by "attribute to be used in diff", what you are reallying after is to convert the in-repo contents to that encoding _BEFORE_ running 'diff' on it. E.g. in-repo UTF-16 that can have NUL bytes all over the place will not diff well with the xdiff machinery, but if you first convert it to UTF-8 and have xdiff work on it, you can get reasonable result out of it. It is unclear what encoding you want your final diff output in (it is equally valid in such a set-up to desire your patch output in UTF-16 or UTF-8), but assuming that you want UTF-8 in your patch output, perhaps we do not have to break gitk users by hijacking the 'encoding' attribute. Instead what you want is a single bit that says between in-repo or working tree which representation should be given to the xdiff machinery. When that bit is set, then we - First ensure that both sides of the diff input is in the working tree encoding by running it through convert_to_working_tree(); - Run xdiff on it; - Take the xdiff result, and run it through convert_to_git(), before emitting (this is optional, making this a one-and-half bit option). That would allow you to say "I have in-repo data in UTF-16 which I check out as UTF-8. xdiff machinery is unhappy. Please do something." perhaps? The other way around (i.e. in-repo is UTF-8, but working tree encoding is UTF-16) won't need xdiff issues, I would imagine.
Re: [PATCH v7 0/7] convert: add support for different encodings
Lars Schneiderwrites: > I still think it would be nice to see diffs for arbitrary encodings. > Would it be an option to read the `encoding` attribute and use it in > `git diff`? Reusing that gitk-only thing and suddenly start doing so would break gitk users, no? The tool expects the diff to come out encoded in the encoding that is specified by that attribute (which is learned from get_path_encoding helper) and does its thing. I guess that gitk uses diff-tree plumbing and you won't be applying this change to the plumbing, perhaps? If so, it might not be too bad, but those who decided to postprocess "git diff" output (instead of "git diff-tree" output) mimicking how gitk does it by thinking that is the safe and sane thing to do will be broken by such a change. You could do "use the encoding only when a command line option says so", but then people will add a configuration variable to turn it always on and these existing scripts will be broken. I do not personally have much sympathy for the last case (i.e. those who scripted around 'git diff' instead of 'git diff-tree' to get broken), so making the new feature only work with the Porcelain "git diff" might be an option. I'll need a bit more time to formulate the rest of my thought ;-)
Re: [PATCH v7 0/7] convert: add support for different encodings
On Thu, Feb 22, 2018 at 09:00:45PM +0100, Lars Schneider wrote: > > If it was only about a diff of UTF-16 files, I may suggest a patch. > > I simply copy-paste it here for review, if someone thinks that it may > > be useful, I can send it as a real patch/RFC. > > That's a nice idea but I see two potential problems: > > (1) Git hosting services (GitLab, BitBucket, GitHub, ...) would still > show these files as binary. That's a huge problem for my users as > they interact more with these services than the Git command line. > That's the main reason why I implemented the "UTF-8 as canonical > form" approach in my series. I can't speak for the other services, but I can tell you that GitHub would be pretty eager to enable such a feature if it existed. I suspect most services providing human-readable diffs would want to do the same. Though there are still cases where you'd see a binary patch (e.g., format-patch in emails, or GitHub's .patch endpoint, since those are meant to be applied and must contain the "real" data). -Peff
Re: [PATCH v7 0/7] convert: add support for different encodings
> On 16 Feb 2018, at 17:58, Torsten Bögershausenwrote: > > On Fri, Feb 16, 2018 at 03:42:35PM +0100, Lars Schneider wrote: > [] >> >> Agreed. However, people using ShiftJIS are not my target audience. >> My target audience are: >> >> (1) People that have to encode their text files in UTF-16 (for >>whatever reason - usually because of legacy processes or tools). >> >> (2) People that want to see textual diffs of their UTF-16 encoded >>files in their Git tools without special adjustments (on the >>command line, on the web, etc). >> >> That was my primary motivation. The fact that w-t-e supports any >> other encoding too is just a nice side effect. I don't foresee people >> using other w-t-encodings other than UTF-16 in my organization. >> >> I have the suspicion that the feature could be useful for the Git >> community at large. Consider this Stack Overflow question: >> https://stackoverflow.com/questions/777949/can-i-make-git-recognize-a-utf-16-file-as-text >> >> This question was viewed 42k times and there is no good solution. >> I believe w-t-e could be a good solution. >> > > If it was only about a diff of UTF-16 files, I may suggest a patch. > I simply copy-paste it here for review, if someone thinks that it may > be useful, I can send it as a real patch/RFC. That's a nice idea but I see two potential problems: (1) Git hosting services (GitLab, BitBucket, GitHub, ...) would still show these files as binary. That's a huge problem for my users as they interact more with these services than the Git command line. That's the main reason why I implemented the "UTF-8 as canonical form" approach in my series. (2) You can only detect a BOM if the encoding is UTF-16. UTF-16BE and UTF-16LE must not have a BOM and therefore cannot be easily detected. Plus, even if you detect an UTF-16 BOM then it would be just a hint that the file is likely UTF-16 encoded as the sequence could be there by chance. I still think it would be nice to see diffs for arbitrary encodings. Would it be an option to read the `encoding` attribute and use it in `git diff`? - Lars > > git show HEAD > > > commit 9f7d43f29eaf6017b7b16261ce91d8ef182cf415 > Author: Torsten Bögershausen > Date: Fri Feb 2 15:35:23 2018 +0100 > >Auto diff of UTF-16 files in UTF-8 > >When an UTF-16 file is commited and later changed, `git diff` shows >"Binary files XX and YY differ". > >When the user wants a diff in UTF-8, a textconv needs to be specified >in .gitattributes and the textconv must be configured. > >A more user-friendly diff can be produced for UTF-16 if >- the user did not use `git diff --binary` >- the blob is identified as binary >- the blob has an UTF-16 BOM >- the blob can be converted into UTF-8 > >Enhance the diff machinery to auto-detect UTF-16 blobs and show them >as UTF-8, unless the user specifies `git diff --binary` which creates >a binary diff. > > diff --git a/diff.c b/diff.c > index fb22b19f09..51831ee94d 100644 > --- a/diff.c > +++ b/diff.c > @@ -3192,6 +3192,10 @@ static void builtin_diff(const char *name_a, > strbuf_reset(); > } > > + if (one && one->reencoded_from_utf16) > + strbuf_addf(, "a is converted to UTF-8 from > UTF-16\n"); > + if (two && two->reencoded_from_utf16) > + strbuf_addf(, "b is converted to UTF-8 from > UTF-16\n"); > mf1.size = fill_textconv(textconv_one, one, ); > mf2.size = fill_textconv(textconv_two, two, ); > > @@ -3611,8 +3615,25 @@ int diff_populate_filespec(struct diff_filespec *s, > unsigned int flags) > s->size = size; > s->should_free = 1; > } > - } > - else { > + if (!s->binary && buffer_is_binary(s->data, s->size) && > + buffer_has_utf16_bom(s->data, s->size)) { > + int outsz = 0; > + char *outbuf; > + outbuf = reencode_string_len(s->data, (int)s->size, > + "UTF-8", "UTF-16", ); > + if (outbuf) { > + if (s->should_free) > + free(s->data); > + if (s->should_munmap) > + munmap(s->data, s->size); > + s->should_munmap = 0; > + s->data = outbuf; > + s->size = outsz; > + s->reencoded_from_utf16 = 1; > + s->should_free = 1; > + } > + } > + } else { > enum object_type type; > if (size_only || (flags & CHECK_BINARY)) { > type = sha1_object_info(s->oid.hash,
Re: [PATCH v7 0/7] convert: add support for different encodings
> On 16 Feb 2018, at 19:55, Junio C Hamanowrote: > > Jeff King writes: > >> So a full proposal would support both cases: "check this out in the >> local platform's preferred encoding" and "always check this out in >> _this_ encoding". And Lars's proposal is just the second half of that. > > Actually, what you seem to take as a whole is just half of the > story. The other half that is an ability to say "what is in the > repository for this path is stored in this encoding". I agree that > "check it out in this encoding" is a useful thing to have, and using > the in-tree .gitattributes as a place to state the project-wide > preference may be OK (and .git/info/attributes should be able to > override it if needed -- this probably deserves to be added to a > test somewhere by this series). Good call! I'll add this test case! > Luckily, lack of 'in-repository-encoding' attribute is not a show > stopper for this series. A later topic could start with "earlier, > in order to make use of w-t-e attribute, you had to have your > contents in UTF-8. Teach the codepath to honor a new attribute that > tells in what encoding the blob contents are stored." without having > to be a part of this topic. I have the impression that this is the purpose of the already existing "encoding" attribute, no? AFAIK this attribute is only respected by gitk, though. A future series could make Git respect this attribute too. - Lars
Re: [PATCH v7 0/7] convert: add support for different encodings
Jeff Kingwrites: > In which case yeah, I could see choosing an in-repo encoding to possibly > be useful (but it also seems like a feature that could easily be tacked > on later if somebody cares). Yes, I think we are on the same page---in-repo-encoding that is a natural counterpart to w-t-e attribute can be added later if/when somebody finds it useful, and it is perfectly OK to declare that we cater only to UTF-8 users until that happens.
Re: [PATCH v7 0/7] convert: add support for different encodings
On Fri, Feb 16, 2018 at 02:25:41PM -0500, Jeff King wrote: > On Fri, Feb 16, 2018 at 10:55:58AM -0800, Junio C Hamano wrote: > > > Jeff Kingwrites: > > > > > So a full proposal would support both cases: "check this out in the > > > local platform's preferred encoding" and "always check this out in > > > _this_ encoding". And Lars's proposal is just the second half of that. > > > > Actually, what you seem to take as a whole is just half of the > > story. The other half that is an ability to say "what is in the > > repository for this path is stored in this encoding". I agree that > > "check it out in this encoding" is a useful thing to have, and using > > the in-tree .gitattributes as a place to state the project-wide > > preference may be OK (and .git/info/attributes should be able to > > override it if needed -- this probably deserves to be added to a > > test somewhere by this series). > > If we are just talking about a check-out feature, I'm not sure that the > in-repository encoding is all that interesting. As with CRLFs, we would > be declaring UTF-8 as the "canonical" in-repo encoding for such > conversions. Is there a reason you'd want something else? Maybe answering my own question: because your encoding of choice does not round-trip to UTF-8? In which case yeah, I could see choosing an in-repo encoding to possibly be useful (but it also seems like a feature that could easily be tacked on later if somebody cares). -Peff
Re: [PATCH v7 0/7] convert: add support for different encodings
On Fri, Feb 16, 2018 at 10:55:58AM -0800, Junio C Hamano wrote: > Jeff Kingwrites: > > > So a full proposal would support both cases: "check this out in the > > local platform's preferred encoding" and "always check this out in > > _this_ encoding". And Lars's proposal is just the second half of that. > > Actually, what you seem to take as a whole is just half of the > story. The other half that is an ability to say "what is in the > repository for this path is stored in this encoding". I agree that > "check it out in this encoding" is a useful thing to have, and using > the in-tree .gitattributes as a place to state the project-wide > preference may be OK (and .git/info/attributes should be able to > override it if needed -- this probably deserves to be added to a > test somewhere by this series). If we are just talking about a check-out feature, I'm not sure that the in-repository encoding is all that interesting. As with CRLFs, we would be declaring UTF-8 as the "canonical" in-repo encoding for such conversions. Is there a reason you'd want something else? If the feature were "the in-repo encoding is X, and I want you to show me a diff using encoding Y", then I could see the use of that (and I think for most people's purposes that would be an equally valid solution to their problem). -Peff
Re: [PATCH v7 0/7] convert: add support for different encodings
Lars Schneiderwrites: >> One thing I find more problematic is that the above places *too* >> much stress on the UTF-8 centric worldview. It is perfectly valid >> to store your text contents encoded in ShiftJIS and check them out >> as-is, with or without this patch. It is grossly misleading to say >> that older versions of Git will check them out in UTF-8. "will >> checkout these files as-is without encoding conversion" is a better >> way to say it, probably. > > True. But that's not what I wanted to say in the "pitfalls" section. > If my Git client supports w-t-e and I add the ShiftJIS encoded > file "foo.bar" to my repository, then Git will store the file as > UTF-8 _internally_. That means if you clone my repository and your > Git client does _not_ support w-t-e, then you will see "foo.bar" as > UTF-8 encoded. What you wrote implies *more* than that, which is what I had trouble with. If you say "what you have is checked out as-is", then it is still clear that those who use w-t-e to convert non UTF-8 into UTF-8 when checking in will get UTF-8 out when they use an older version of Git. If you say "what you have will be checked out in UTF-8", it makes it sound as if pre w-t-e Git will somehow reject non UTF-8 in-tree contents, or magically convert anything to UTF-8 while checking out, which is *not* what you want to imply. >> Also notice that even in the world with w-t-e, such a project won't >> benefit from w-t-e at all. After all, they have been happy using >> ShiftJIS in repository and using the same encoding on the working >> tree, and because w-t-e assumes that everybody should be using UTF-8 >> internally, such a project cannot take advantage of the new >> mechanism. > > Agreed. However, people using ShiftJIS are not my target audience. Be aware that you are writing *not* *solely* for your target audience. You write document for everybody, and make sure the description of a feature makes it clear who the feature primarily targets and how using (or not using) the feature affects users.
Re: [PATCH v7 0/7] convert: add support for different encodings
Jeff Kingwrites: > So a full proposal would support both cases: "check this out in the > local platform's preferred encoding" and "always check this out in > _this_ encoding". And Lars's proposal is just the second half of that. Actually, what you seem to take as a whole is just half of the story. The other half that is an ability to say "what is in the repository for this path is stored in this encoding". I agree that "check it out in this encoding" is a useful thing to have, and using the in-tree .gitattributes as a place to state the project-wide preference may be OK (and .git/info/attributes should be able to override it if needed -- this probably deserves to be added to a test somewhere by this series). Luckily, lack of 'in-repository-encoding' attribute is not a show stopper for this series. A later topic could start with "earlier, in order to make use of w-t-e attribute, you had to have your contents in UTF-8. Teach the codepath to honor a new attribute that tells in what encoding the blob contents are stored." without having to be a part of this topic.
Re: [PATCH v7 0/7] convert: add support for different encodings
On Fri, Feb 16, 2018 at 03:42:35PM +0100, Lars Schneider wrote: [] > > Agreed. However, people using ShiftJIS are not my target audience. > My target audience are: > > (1) People that have to encode their text files in UTF-16 (for > whatever reason - usually because of legacy processes or tools). > > (2) People that want to see textual diffs of their UTF-16 encoded > files in their Git tools without special adjustments (on the > command line, on the web, etc). > > That was my primary motivation. The fact that w-t-e supports any > other encoding too is just a nice side effect. I don't foresee people > using other w-t-encodings other than UTF-16 in my organization. > > I have the suspicion that the feature could be useful for the Git > community at large. Consider this Stack Overflow question: > https://stackoverflow.com/questions/777949/can-i-make-git-recognize-a-utf-16-file-as-text > > This question was viewed 42k times and there is no good solution. > I believe w-t-e could be a good solution. > If it was only about a diff of UTF-16 files, I may suggest a patch. I simply copy-paste it here for review, if someone thinks that it may be useful, I can send it as a real patch/RFC. git show HEAD commit 9f7d43f29eaf6017b7b16261ce91d8ef182cf415 Author: Torsten BögershausenDate: Fri Feb 2 15:35:23 2018 +0100 Auto diff of UTF-16 files in UTF-8 When an UTF-16 file is commited and later changed, `git diff` shows "Binary files XX and YY differ". When the user wants a diff in UTF-8, a textconv needs to be specified in .gitattributes and the textconv must be configured. A more user-friendly diff can be produced for UTF-16 if - the user did not use `git diff --binary` - the blob is identified as binary - the blob has an UTF-16 BOM - the blob can be converted into UTF-8 Enhance the diff machinery to auto-detect UTF-16 blobs and show them as UTF-8, unless the user specifies `git diff --binary` which creates a binary diff. diff --git a/diff.c b/diff.c index fb22b19f09..51831ee94d 100644 --- a/diff.c +++ b/diff.c @@ -3192,6 +3192,10 @@ static void builtin_diff(const char *name_a, strbuf_reset(); } + if (one && one->reencoded_from_utf16) + strbuf_addf(, "a is converted to UTF-8 from UTF-16\n"); + if (two && two->reencoded_from_utf16) + strbuf_addf(, "b is converted to UTF-8 from UTF-16\n"); mf1.size = fill_textconv(textconv_one, one, ); mf2.size = fill_textconv(textconv_two, two, ); @@ -3611,8 +3615,25 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags) s->size = size; s->should_free = 1; } - } - else { + if (!s->binary && buffer_is_binary(s->data, s->size) && + buffer_has_utf16_bom(s->data, s->size)) { + int outsz = 0; + char *outbuf; + outbuf = reencode_string_len(s->data, (int)s->size, +"UTF-8", "UTF-16", ); + if (outbuf) { + if (s->should_free) + free(s->data); + if (s->should_munmap) + munmap(s->data, s->size); + s->should_munmap = 0; + s->data = outbuf; + s->size = outsz; + s->reencoded_from_utf16 = 1; + s->should_free = 1; + } + } + } else { enum object_type type; if (size_only || (flags & CHECK_BINARY)) { type = sha1_object_info(s->oid.hash, >size); @@ -3629,6 +3650,19 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags) s->data = read_sha1_file(s->oid.hash, , >size); if (!s->data) die("unable to read %s", oid_to_hex(>oid)); + if (!s->binary && buffer_is_binary(s->data, s->size) && + buffer_has_utf16_bom(s->data, s->size)) { + int outsz = 0; + char *buf; + buf = reencode_string_len(s->data, (int)s->size, + "UTF-8", "UTF-16", ); + if (buf) { + free(s->data); + s->data = buf; + s->size = outsz; + s->reencoded_from_utf16 = 1; + } + } s->should_free = 1; } return 0; @@ -5695,6 +5729,10 @@
Re: [PATCH v7 0/7] convert: add support for different encodings
> On 15 Feb 2018, at 21:03, Junio C Hamanowrote: > > lars.schnei...@autodesk.com writes: > >> -- Git clients that do not support the `working-tree-encoding` attribute >> - will checkout the respective files UTF-8 encoded and not in the >> - expected encoding. Consequently, these files will appear different >> - which typically causes trouble. This is in particular the case for >> - older Git versions and alternative Git implementations such as JGit >> - or libgit2 (as of February 2018). >> +- Third party Git implementations that do not support the >> + `working-tree-encoding` attribute will checkout the respective files >> + UTF-8 encoded and not in the expected encoding. Consequently, these >> + files will appear different which typically causes trouble. This is >> + in particular the case for older Git versions and alternative Git >> + implementations such as JGit or libgit2 (as of February 2018). > > I know somebody found "clients" misleading in the original, but the > ones that do not understand w-t-e do not have to be third party > reimplementations and imitations. All existing Git implementations, > including ours, don't. Agreed! > One thing I find more problematic is that the above places *too* > much stress on the UTF-8 centric worldview. It is perfectly valid > to store your text contents encoded in ShiftJIS and check them out > as-is, with or without this patch. It is grossly misleading to say > that older versions of Git will check them out in UTF-8. "will > checkout these files as-is without encoding conversion" is a better > way to say it, probably. True. But that's not what I wanted to say in the "pitfalls" section. If my Git client supports w-t-e and I add the ShiftJIS encoded file "foo.bar" to my repository, then Git will store the file as UTF-8 _internally_. That means if you clone my repository and your Git client does _not_ support w-t-e, then you will see "foo.bar" as UTF-8 encoded. > Also notice that even in the world with w-t-e, such a project won't > benefit from w-t-e at all. After all, they have been happy using > ShiftJIS in repository and using the same encoding on the working > tree, and because w-t-e assumes that everybody should be using UTF-8 > internally, such a project cannot take advantage of the new > mechanism. Agreed. However, people using ShiftJIS are not my target audience. My target audience are: (1) People that have to encode their text files in UTF-16 (for whatever reason - usually because of legacy processes or tools). (2) People that want to see textual diffs of their UTF-16 encoded files in their Git tools without special adjustments (on the command line, on the web, etc). That was my primary motivation. The fact that w-t-e supports any other encoding too is just a nice side effect. I don't foresee people using other w-t-encodings other than UTF-16 in my organization. I have the suspicion that the feature could be useful for the Git community at large. Consider this Stack Overflow question: https://stackoverflow.com/questions/777949/can-i-make-git-recognize-a-utf-16-file-as-text This question was viewed 42k times and there is no good solution. I believe w-t-e could be a good solution. With your comments in mind, I tried to improve the text like this: Git recognizes files encoded with ASCII or one of its supersets (e.g. UTF-8, ISO-8859-1, ...) as text files. Files encoded with certain other encodings (e.g. UTF-16) are interpreted as binary and consequently built-in Git text processing tools (e.g. 'git diff') as well as most Git web front ends do not visualize the contents of these files by default. ... Please note that using the `working-tree-encoding` attribute may have a number of pitfalls: - Alternative Git implementations (e.g. JGit or libgit2) and older Git versions (as of February 2018) do not support the `working-tree-encoding` attribute. If you decide to use the `working-tree-encoding` attribute in your repository, then it is strongly recommended to ensure that all clients working with the repository support it. If you declare `*.proj` files as UTF-16 and you add `foo.proj` with an `working-tree-encoding` enabled Git client, then `foo.proj` will be stored as UTF-8 internally. A client without `working-tree-encoding` support will checkout `foo.proj` as UTF-8 encoded file. This will typically cause trouble for the users of this file. If a Git client, that does not support the `working-tree-encoding` attribute, adds a new file `bar.proj`, then `bar.proj` will be stored "as-is" internally (in this example probably as UTF-16). A client with `working-tree-encoding` support will interpret the internal contents as UTF-8 and try to convert it to UTF-16 on checkout. That operation will fail and cause an error. ... > And from that point of view, perhaps
Re: [PATCH v7 0/7] convert: add support for different encodings
On Thu, Feb 15, 2018 at 12:03:06PM -0800, Junio C Hamano wrote: > And from that point of view, perhaps w-t-e attribute is somewhat > misdesigned. > > In general, an attribute is about the project's contents in the > manner independent of platform or environment. You define "this > file is a C source" or "this file has JPEG image" there. What exact > program you use to present diffs between the two versions of such a > file (external diff command) or what exact program you use to > extract the textual representations (textconv filter) is environment > and platform dependent and is left to the configuration mechanism > for each repository. > > To be in line with the above design principle, I think the attribute > ought to be "the in-tree contents of this path is encoded in ..." > whose values could be things like UTF-8, ShiftJIS, etc. What > external encoding the paths should be checked out is not a > project-wide matter, especially when talking about cross platform > projects. Perhaps a project in Japanese language wants to check > out its contents in EUC-jp on Unices and in ShiftJIS on DOS derived > systems. The participants all need to know what in-repository > encoding is used, which is a sensible use of attributes. They also > need to know what the recommended external encoding to be used in > the working tree is for their platforms, but that is more like what > Makefile variable to set for their platforms, etc., and is not a > good match to the attributes system. While I agree what you're saying philosophically here, I suspect you'd still need another attribute for "no really, this needs to be checked out as encoding X". The same way we treat line endings as a platform decision, but we still need to have `eol=crlf` for those files which really, no matter what platform you're on, have external tools depending on them to have some particular line ending. So a full proposal would support both cases: "check this out in the local platform's preferred encoding" and "always check this out in _this_ encoding". And Lars's proposal is just the second half of that. But I'm not sure anybody even really cares about the first part; I don't think we've seen anybody actually ask for it. -Peff
Re: [PATCH v7 0/7] convert: add support for different encodings
lars.schnei...@autodesk.com writes: > -- Git clients that do not support the `working-tree-encoding` attribute > - will checkout the respective files UTF-8 encoded and not in the > - expected encoding. Consequently, these files will appear different > - which typically causes trouble. This is in particular the case for > - older Git versions and alternative Git implementations such as JGit > - or libgit2 (as of February 2018). > +- Third party Git implementations that do not support the > + `working-tree-encoding` attribute will checkout the respective files > + UTF-8 encoded and not in the expected encoding. Consequently, these > + files will appear different which typically causes trouble. This is > + in particular the case for older Git versions and alternative Git > + implementations such as JGit or libgit2 (as of February 2018). I know somebody found "clients" misleading in the original, but the ones that do not understand w-t-e do not have to be third party reimplementations and imitations. All existing Git implementations, including ours, don't. One thing I find more problematic is that the above places *too* much stress on the UTF-8 centric worldview. It is perfectly valid to store your text contents encoded in ShiftJIS and check them out as-is, with or without this patch. It is grossly misleading to say that older versions of Git will check them out in UTF-8. "will checkout these files as-is without encoding conversion" is a better way to say it, probably. Also notice that even in the world with w-t-e, such a project won't benefit from w-t-e at all. After all, they have been happy using ShiftJIS in repository and using the same encoding on the working tree, and because w-t-e assumes that everybody should be using UTF-8 internally, such a project cannot take advantage of the new mechanism. And from that point of view, perhaps w-t-e attribute is somewhat misdesigned. In general, an attribute is about the project's contents in the manner independent of platform or environment. You define "this file is a C source" or "this file has JPEG image" there. What exact program you use to present diffs between the two versions of such a file (external diff command) or what exact program you use to extract the textual representations (textconv filter) is environment and platform dependent and is left to the configuration mechanism for each repository. To be in line with the above design principle, I think the attribute ought to be "the in-tree contents of this path is encoded in ..." whose values could be things like UTF-8, ShiftJIS, etc. What external encoding the paths should be checked out is not a project-wide matter, especially when talking about cross platform projects. Perhaps a project in Japanese language wants to check out its contents in EUC-jp on Unices and in ShiftJIS on DOS derived systems. The participants all need to know what in-repository encoding is used, which is a sensible use of attributes. They also need to know what the recommended external encoding to be used in the working tree is for their platforms, but that is more like what Makefile variable to set for their platforms, etc., and is not a good match to the attributes system.
[PATCH v7 0/7] convert: add support for different encodings
From: Lars SchneiderHi, Patches 1-4, 6 are preparation and helper functions. Patch 5,7 are the actual change. This series depends on Torsten's 8462ff43e4 (convert_to_git(): safe_crlf/checksafe becomes int conv_flags, 2018-01-13) which is already in master. Changes since v6: * use consistent casing for core.checkRoundtripEncoding (Junio) * fix gibberish in commit message (Junio) * improve documentation (Torsten) * improve advise messages (Torsten) Thanks, Lars RFC: https://public-inbox.org/git/bdb9b884-6d17-4be3-a83c-f67e2afa2...@gmail.com/ v1: https://public-inbox.org/git/20171211155023.1405-1-lars.schnei...@autodesk.com/ v2: https://public-inbox.org/git/2017122915.39680-1-lars.schnei...@autodesk.com/ v3: https://public-inbox.org/git/20180106004808.77513-1-lars.schnei...@autodesk.com/ v4: https://public-inbox.org/git/20180120152418.52859-1-lars.schnei...@autodesk.com/ v5: https://public-inbox.org/git/20180129201855.9182-1-tbo...@web.de/ v6: https://public-inbox.org/git/20180209132830.55385-1-lars.schnei...@autodesk.com/ Base Ref: Web-Diff: https://github.com/larsxschneider/git/commit/2b94bec353 Checkout: git fetch https://github.com/larsxschneider/git encoding-v7 && git checkout 2b94bec353 ### Interdiff (v6..v7): diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt index ea5a9509c6..10cb37795d 100644 --- a/Documentation/gitattributes.txt +++ b/Documentation/gitattributes.txt @@ -291,19 +291,20 @@ the content is reencoded back to the specified encoding. Please note that using the `working-tree-encoding` attribute may have a number of pitfalls: -- Git clients that do not support the `working-tree-encoding` attribute - will checkout the respective files UTF-8 encoded and not in the - expected encoding. Consequently, these files will appear different - which typically causes trouble. This is in particular the case for - older Git versions and alternative Git implementations such as JGit - or libgit2 (as of February 2018). +- Third party Git implementations that do not support the + `working-tree-encoding` attribute will checkout the respective files + UTF-8 encoded and not in the expected encoding. Consequently, these + files will appear different which typically causes trouble. This is + in particular the case for older Git versions and alternative Git + implementations such as JGit or libgit2 (as of February 2018). - Reencoding content to non-UTF encodings can cause errors as the conversion might not be UTF-8 round trip safe. If you suspect your - encoding to not be round trip safe, then add it to `core.checkRoundtripEncoding` - to make Git check the round trip encoding (see linkgit:git-config[1]). - SHIFT-JIS (Japanese character set) is known to have round trip issues - with UTF-8 and is checked by default. + encoding to not be round trip safe, then add it to + `core.checkRoundtripEncoding` to make Git check the round trip + encoding (see linkgit:git-config[1]). SHIFT-JIS (Japanese character + set) is known to have round trip issues with UTF-8 and is checked by + default. - Reencoding content requires resources that might slow down certain Git operations (e.g 'git checkout' or 'git add'). @@ -327,7 +328,7 @@ explicitly define the line endings with `eol` if the `working-tree-encoding` attribute is used to avoid ambiguity. -*.proj working-tree-encoding=UTF-16LE text eol=CRLF +*.proj text working-tree-encoding=UTF-16LE eol=CRLF You can get a list of all available encodings on your platform with the diff --git a/convert.c b/convert.c index 71dffc7167..398cd9cf7b 100644 --- a/convert.c +++ b/convert.c @@ -352,29 +352,29 @@ static int encode_to_git(const char *path, const char *src, size_t src_len, if (has_prohibited_utf_bom(enc->name, src, src_len)) { const char *error_msg = _( - "BOM is prohibited for '%s' if encoded as %s"); + "BOM is prohibited in '%s' if encoded as %s"); + /* +* This advise is shown for UTF-??BE and UTF-??LE encodings. +* We truncate the encoding name to 6 chars with %.6s to cut +* off the last two "byte order" characters. +*/ const char *advise_msg = _( - "You told Git to treat '%s' as %s. A byte order mark " - "(BOM) is prohibited with this encoding. Either use " - "%.6s as working tree encoding or remove the BOM from the " - "file."); - - advise(advise_msg, path, enc->name, enc->name, enc->name); + "The file '%s' contains a byte order mark (BOM). " + "Please use %.6s as working-tree-encoding."); + advise(advise_msg, path, enc->name); if