Re: [PATCH v1] convert: add support for 'encoding' attribute

2018-01-03 Thread Lars Schneider

On 03 Jan 2018, at 20:15, Junio C Hamano  wrote:

> Torsten Bögershausen  writes:
> 
>> May be.
>> Originally utf8.c was about encoding and all kind of UTF-8 related stuff.
>> Especially it didn't know anything about strbuf.
>> I don't know why strbuf.h and other functions had been added here,
>> 
>> I once moved them into strbuf.c without any problems, but never send out
>> a patch, because of possible merge conflicts in ongoing patches.
>> 
>> In any case, if it is about strbuf, I would try to put it into strbuf.c
> 
> Please don't.
> 
> A code that happens to use strbuf as a container and about
> manipulating the contents is quite different from a code about
> strbuf.  The latter is to enhance and extend how the strbuf as a
> container behaves.  An operation about character encoding for a
> string that happens to be stored in strbuf is more about the
> encoding, and much much less about strbuf.
> 
> convert.c is about massaging contents coming from the outside world
> into a shape stored in Git and the other way around, and there are
> multiple ways the contents are massaged.  EOL convention may be
> adjusted, characters may be reencoded, end-user defined conversion
> may be applied.  Some of these operations may use helpers specific
> for the task from other more library-ish files, like checking if a
> string looks like encoded in UTF-8 from utf8.[ch].

Agreed. I did that in v2. See these patches:

https://public-inbox.org/git/2017122915.39680-3-lars.schnei...@autodesk.com/
https://public-inbox.org/git/2017122915.39680-4-lars.schnei...@autodesk.com/

- Lars

Re: [PATCH v1] convert: add support for 'encoding' attribute

2018-01-03 Thread Junio C Hamano
Torsten Bögershausen  writes:

> May be.
> Originally utf8.c was about encoding and all kind of UTF-8 related stuff.
> Especially it didn't know anything about strbuf.
> I don't know why strbuf.h and other functions had been added here,
>
> I once moved them into strbuf.c without any problems, but never send out
> a patch, because of possible merge conflicts in ongoing patches.
>
> In any case, if it is about strbuf, I would try to put it into strbuf.c

Please don't.

A code that happens to use strbuf as a container and about
manipulating the contents is quite different from a code about
strbuf.  The latter is to enhance and extend how the strbuf as a
container behaves.  An operation about character encoding for a
string that happens to be stored in strbuf is more about the
encoding, and much much less about strbuf.

convert.c is about massaging contents coming from the outside world
into a shape stored in Git and the other way around, and there are
multiple ways the contents are massaged.  EOL convention may be
adjusted, characters may be reencoded, end-user defined conversion
may be applied.  Some of these operations may use helpers specific
for the task from other more library-ish files, like checking if a
string looks like encoded in UTF-8 from utf8.[ch].



Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-29 Thread Lars Schneider

> On 29 Dec 2017, at 13:59, Torsten Bögershausen  wrote:
> 
> On 2017-12-28 17:14, Lars Schneider wrote:> 
>>> On 17 Dec 2017, at 18:14, Torsten Bögershausen  wrote:
>>> 
>>> On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schnei...@autodesk.com wrote:
 From: Lars Schneider 
 
>> 
 +`encoding`
 +^^
 +
 +By default Git assumes UTF-8 encoding for text files.  The `encoding`
>>> 
>>> This is probably "ASCII" and it's supersets like ISO-8859-1 or UTF-8.
>> 
>> I am not sure I understand what you want to tell me here.
>> Do you think UTF-8 encoding is not correct in the sentence above?
> 
> Git itself was designed to handle source code in ASCII.
> Text files which are encoded in ISO-8859-1, UTF-8 or any
> super-set of ASCII are handled as well, and what exactly to do
> with bytes >0x80 is outside the responsibility of Git.
> E.g. the interpretation and rendering on the screen may be
> dependent on the locale or being guessed.
> All in all, saying that Git expects UTF-8 may be "overdriven".
> Any kind of file that uses an '\n' as an end of line
> and has no NUL bytes '\0' works as a text file in Git.
> (End of BlaBla ;-)
> 
> We can probably replace:
> "By default Git assumes UTF-8 encoding for text files"
> 
> with something like
> "Git handles UTF-8 encoded files as text files, but UTF-16 encoded
> files are handled as binary files"

Well, UTF-32 are handled as binary file, too :-)

How about this?

Git recognizes files encoded with ASCII or one of its supersets
(e.g. UTF-8 or ISO-8859-1) as text files.  All other...


> 
>>> 
 +attribute sets the encoding to be used in the working directory.
 +If the path is added to the index, then Git encodes the content to
 +UTF-8.  On checkout the content is encoded back to the original
 +encoding.  Consequently, you can use all built-in Git text processing
 +tools (e.g. git diff, line ending conversions, etc.) with your
 +non-UTF-8 encoded file.
 +
 +Please note that re-encoding content can cause errors and requires
 +resources. Use the `encoding` attribute only if you cannot store
 +a file in UTF-8 encoding and if you want Git to be able to process
 +the text.
 +
 +
 +*.txt text encoding=UTF-16
 +
>>> 
>>> I think that encoding (or worktree-encoding as said elsewhere) must imply 
>>> text.
>>> (That is not done in convert.c, but assuming binary or even auto
>>> does not make much sense to me)
>> 
>> "text" only means "LF". "-text" means CRLF which would be perfectly fine
>> for UTF-16. Therefore I don't think we need to imply text.
>> Does this make sense?
> Yes and now.
> 
> "text" means convert CRLF at "git add" (or commit) into LF,
> And potentially LF into CRLF at checkout,
> depending on the EOL attribute (if set), core.autocrlf and/or core.eol.
> 
> "-text" means don't touch CRLF or LF at all. Files with CRLF are commited
> with CRLF.
> Running a Unix like "diff" tool will often show "^M" to indicate that there
> is a CR before the LF, which may be distracting or confusing.
> 
> If someone ever wants to handle the UTF-16 files on a Linux/Mac/Unix system,
> it makes sense to convert the line endings into LF before storing them
> into the index (at least this is my experience).
> 
> (Not specifying "text" or "-text" at all will trigger the auto-handling,
> which is not needed at all).
> So if we want to be sure that line endings are CRLF in the worktree we
> should ask the user to specify things like this:
> 
> *.ext worktree-encoding=UTF-16 text eol=CRLF
> 
> It may be enough to give this example in the documentation.
> and I would rather be over-careful here, to avoid future problems.

Agreed. I'll add that example!


 +
 +All `iconv` encodings with a stable round-trip conversion to and from
 +UTF-8 are supported.  You can see a full list with the following command:
 +
 +
 +iconv --list
 +
 +
 `eol`
 ^
 
 diff --git a/convert.c b/convert.c
 index 20d7ab67bd..ee19c17104 100644
 --- a/convert.c
 +++ b/convert.c
 @@ -7,6 +7,7 @@
 #include "sigchain.h"
 #include "pkt-line.h"
 #include "sub-process.h"
 +#include "utf8.h"
 
 /*
 * convert.c - convert a file when checking it out and checking it in.
 @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, 
 struct text_stat *stats,
 
 }
>>> 
>>> I would avoid to use these #ifdefs here.
>>> All functions can be in strbuf.c (and may have #ifdefs there), but not here.
>> 
>> I'll try that. But wouldn't it make more sense to move the functions to 
>> utf.c?
> 
> May be.
> Originally utf8.c was about encoding and all kind of UTF-8 related stuff.
> Especially it didn't know anything about strbuf.
> I don't know why strbuf.h and other functions 

Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-29 Thread Torsten Bögershausen
On 2017-12-28 17:14, Lars Schneider wrote:> 
>> On 17 Dec 2017, at 18:14, Torsten Bögershausen  wrote:
>>
>> On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schnei...@autodesk.com wrote:
>>> From: Lars Schneider 
>>>
> 
>>> +`encoding`
>>> +^^
>>> +
>>> +By default Git assumes UTF-8 encoding for text files.  The `encoding`
>>
>> This is probably "ASCII" and it's supersets like ISO-8859-1 or UTF-8.
> 
> I am not sure I understand what you want to tell me here.
> Do you think UTF-8 encoding is not correct in the sentence above?

Git itself was designed to handle source code in ASCII.
Text files which are encoded in ISO-8859-1, UTF-8 or any
super-set of ASCII are handled as well, and what exactly to do
with bytes >0x80 is outside the responsibility of Git.
E.g. the interpretation and rendering on the screen may be
dependent on the locale or being guessed.
All in all, saying that Git expects UTF-8 may be "overdriven".
Any kind of file that uses an '\n' as an end of line
and has no NUL bytes '\0' works as a text file in Git.
(End of BlaBla ;-)

We can probably replace:
"By default Git assumes UTF-8 encoding for text files"

with something like
"Git handles UTF-8 encoded files as text files, but UTF-16 encoded
 files are handled as binary files"

>>
>>> +attribute sets the encoding to be used in the working directory.
>>> +If the path is added to the index, then Git encodes the content to
>>> +UTF-8.  On checkout the content is encoded back to the original
>>> +encoding.  Consequently, you can use all built-in Git text processing
>>> +tools (e.g. git diff, line ending conversions, etc.) with your
>>> +non-UTF-8 encoded file.
>>> +
>>> +Please note that re-encoding content can cause errors and requires
>>> +resources. Use the `encoding` attribute only if you cannot store
>>> +a file in UTF-8 encoding and if you want Git to be able to process
>>> +the text.
>>> +
>>> +
>>> +*.txt  text encoding=UTF-16
>>> +
>>
>> I think that encoding (or worktree-encoding as said elsewhere) must imply 
>> text.
>> (That is not done in convert.c, but assuming binary or even auto
>> does not make much sense to me)
> 
> "text" only means "LF". "-text" means CRLF which would be perfectly fine
> for UTF-16. Therefore I don't think we need to imply text.
> Does this make sense?
Yes and now.

"text" means convert CRLF at "git add" (or commit) into LF,
And potentially LF into CRLF at checkout,
depending on the EOL attribute (if set), core.autocrlf and/or core.eol.

"-text" means don't touch CRLF or LF at all. Files with CRLF are commited
with CRLF.
Running a Unix like "diff" tool will often show "^M" to indicate that there
is a CR before the LF, which may be distracting or confusing.

If someone ever wants to handle the UTF-16 files on a Linux/Mac/Unix system,
it makes sense to convert the line endings into LF before storing them
into the index (at least this is my experience).

(Not specifying "text" or "-text" at all will trigger the auto-handling,
 which is not needed at all).
So if we want to be sure that line endings are CRLF in the worktree we
should ask the user to specify things like this:

*.ext worktree-encoding=UTF-16 text eol=CRLF

It may be enough to give this example in the documentation.
and I would rather be over-careful here, to avoid future problems.

> 
>>
>>
>>> +
>>> +All `iconv` encodings with a stable round-trip conversion to and from
>>> +UTF-8 are supported.  You can see a full list with the following command:
>>> +
>>> +
>>> +iconv --list
>>> +
>>> +
>>> `eol`
>>> ^
>>>
>>> diff --git a/convert.c b/convert.c
>>> index 20d7ab67bd..ee19c17104 100644
>>> --- a/convert.c
>>> +++ b/convert.c
>>> @@ -7,6 +7,7 @@
>>> #include "sigchain.h"
>>> #include "pkt-line.h"
>>> #include "sub-process.h"
>>> +#include "utf8.h"
>>>
>>> /*
>>>  * convert.c - convert a file when checking it out and checking it in.
>>> @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct 
>>> text_stat *stats,
>>>
>>> }
>>
>> I would avoid to use these #ifdefs here.
>> All functions can be in strbuf.c (and may have #ifdefs there), but not here.
> 
> I'll try that. But wouldn't it make more sense to move the functions to utf.c?

May be.
Originally utf8.c was about encoding and all kind of UTF-8 related stuff.
Especially it didn't know anything about strbuf.
I don't know why strbuf.h and other functions had been added here,

I once moved them into strbuf.c without any problems, but never send out
a patch, because of possible merge conflicts in ongoing patches.

In any case, if it is about strbuf, I would try to put it into strbuf.c

>>
>>>
>>> +#ifdef NO_ICONV
>>> +#ifndef _ICONV_T
>>> +/* The type is just a placeholder and not actually used. */
>>> +typedef void* iconv_t;
>>> +#endif
>>> +#endif
>>> +
>>> +static struct encoding {
>>> +   const char *name;
>>> +   iconv_t 

Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-28 Thread Lars Schneider

> On 17 Dec 2017, at 18:14, Torsten Bögershausen  wrote:
> 
> On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schnei...@autodesk.com wrote:
>> From: Lars Schneider 
>> 

>> +`encoding`
>> +^^
>> +
>> +By default Git assumes UTF-8 encoding for text files.  The `encoding`
> 
> This is probably "ASCII" and it's supersets like ISO-8859-1 or UTF-8.

I am not sure I understand what you want to tell me here.
Do you think UTF-8 encoding is not correct in the sentence above?

> 
>> +attribute sets the encoding to be used in the working directory.
>> +If the path is added to the index, then Git encodes the content to
>> +UTF-8.  On checkout the content is encoded back to the original
>> +encoding.  Consequently, you can use all built-in Git text processing
>> +tools (e.g. git diff, line ending conversions, etc.) with your
>> +non-UTF-8 encoded file.
>> +
>> +Please note that re-encoding content can cause errors and requires
>> +resources. Use the `encoding` attribute only if you cannot store
>> +a file in UTF-8 encoding and if you want Git to be able to process
>> +the text.
>> +
>> +
>> +*.txt   text encoding=UTF-16
>> +
> 
> I think that encoding (or worktree-encoding as said elsewhere) must imply 
> text.
> (That is not done in convert.c, but assuming binary or even auto
> does not make much sense to me)

"text" only means "LF". "-text" means CRLF which would be perfectly fine
for UTF-16. Therefore I don't think we need to imply text.
Does this make sense?

> 
> 
>> +
>> +All `iconv` encodings with a stable round-trip conversion to and from
>> +UTF-8 are supported.  You can see a full list with the following command:
>> +
>> +
>> +iconv --list
>> +
>> +
>> `eol`
>> ^
>> 
>> diff --git a/convert.c b/convert.c
>> index 20d7ab67bd..ee19c17104 100644
>> --- a/convert.c
>> +++ b/convert.c
>> @@ -7,6 +7,7 @@
>> #include "sigchain.h"
>> #include "pkt-line.h"
>> #include "sub-process.h"
>> +#include "utf8.h"
>> 
>> /*
>>  * convert.c - convert a file when checking it out and checking it in.
>> @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct 
>> text_stat *stats,
>> 
>> }
> 
> I would avoid to use these #ifdefs here.
> All functions can be in strbuf.c (and may have #ifdefs there), but not here.

I'll try that. But wouldn't it make more sense to move the functions to utf.c?

> 
>> 
>> +#ifdef NO_ICONV
>> +#ifndef _ICONV_T
>> +/* The type is just a placeholder and not actually used. */
>> +typedef void* iconv_t;
>> +#endif
>> +#endif
>> +
>> +static struct encoding {
>> +const char *name;
>> +iconv_t to_git;   /* conversion to Git's canonical encoding (UTF-8) 
>> */
>> +iconv_t to_worktree;  /* conversion to user-defined encoding */
>> +struct encoding *next;
>> +} *encoding, **encoding_tail;
> 
> This seems like an optimazation, assuning that iconv_open() eats a lot of 
> ressources.
> I don't think this is the case. (Undere MacOS we run iconv_open() together 
> with
> every opendir(), and optimizing this out did not show any measurable 
> improvements)

True, but then I would need to free() the memory in a lot of places.
Therefore I thought it is easier to keep the object. OK for you?


>> +static const char *default_encoding = "utf-8";
> 
> The most portable form is "UTF-8" (correct me if that is wrong)

It shouldn't matter. But I've changed it to uppercase to be on the safe side.


>> +static iconv_t invalid_conversion = (iconv_t)-1;
>> +
>> +static int encode_to_git(const char *path, const char *src, size_t src_len,
>> + struct strbuf *buf, struct encoding *enc)
>> +{
>> +#ifndef NO_ICONV
>> +char *dst, *re_src;
>> +int dst_len, re_src_len;
>> +
>> +/*
>> + * No encoding is specified or there is nothing to encode.
>> + * Tell the caller that the content was not modified.
>> + */
>> +if (!enc || (src && !src_len))
>> +return 0;
>> +
>> +/*
>> + * Looks like we got called from "would_convert_to_git()".
>> + * This means Git wants to know if it would encode (= modify!)
>> + * the content. Let's answer with "yes", since an encoding was
>> + * specified.
>> + */
>> +if (!buf && !src)
>> +return 1;
>> +
>> +if (enc->to_git == invalid_conversion) {
>> +enc->to_git = iconv_open(default_encoding, encoding->name);
>> +if (enc->to_git == invalid_conversion)
>> +warning(_("unsupported encoding %s"), encoding->name);
>> +}
> 
>   /* There are 2 different types of reaction:
> Either users  know what that a warning means: You asked for 
> problems,
>   and do the right thing. Other may may ignore the warning
>  - in both cases an error is useful */

Agreed!


>> +if (enc->to_worktree == invalid_conversion)
>> +  

Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-23 Thread Torsten Bögershausen
On Mon, Dec 18, 2017 at 08:12:49AM -0500, Jeff King wrote:
> On Mon, Dec 18, 2017 at 11:13:34AM +0100, Torsten Bögershausen wrote:
> 
> > Just to confirm my missing knowledge here:
> > Does this mean, that git-gui and gitk can decode/reencode
> > the content of a file/blob, when the .gitattributes say so ?
> 
> That's my impression, yes.
> 
> > If yes, would it make sense to enhance the "git diff" instead ?
> > "git diff --encoding" will pick up the commited encoding from
> > .attributes, convert it into UTF-8, and run the diff ?
> 
> You can do something like this already:
> 
>   git config diff.utf16.textconv 'iconv -f utf16 -t utf8'
>   echo file diff=utf16 >.gitattributes
> 
> I have no objection to teaching it that "encoding" means roughly the
> same thing, which would have two advantages:
> 
>  - we could do it in-process, which would be much faster
> 
>  - we could skip the extra config step, which is a blocker for
>server-side diffs on sites like GitHub
> 
> I don't think you'd really need "--encoding". This could just be
> triggered by the normal "--textconv" rules (i.e., just treat this "as
> if" the textconv above was configured when we see an encoding
> attribute).

I can probably come up with a patch the next weeks or so.

> 
> > We actually could enhance the "git diff" output with a single
> > line saying
> > "Git index-encoding=cp1251"
> > or so, which can be picked up by "git apply".
> 
> That extends it beyond what textconv can do (we assume that textconv
> patches are _just_ for human consumption, and can't be applied). And it
> would mean you'd potentially want to trigger it in more places. On the
> other hand, I don't know that we're guaranteed that a round-trip
> between encodings will produce a byte-wise identical result. The nice
> thing about piggy-backing on textconv is that it's already dealt with
> that problem.
> 
> -Peff


Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-18 Thread Johannes Sixt

Am 18.12.2017 um 11:13 schrieb Torsten Bögershausen:

Just to confirm my missing knowledge here:
Does this mean, that git-gui and gitk can decode/reencode
the content of a file/blob, when the .gitattributes say so ?


No. I think they parse the output of git-diff et.al., split it per file, 
and then consult .gitattributes to interpret the byte sequence according 
to the encoding.


-- Hannes


Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-18 Thread Junio C Hamano
Torsten Bögershausen  writes:

> Just to confirm my missing knowledge here:
> Does this mean, that git-gui and gitk can decode/reencode
> the content of a file/blob, when the .gitattributes say so ?

These programs, when told that a file is in an encoding, read bytes
from that file and interpret that byte sequence in the given
encoding, and blits the corresponding glyphs on the screen---giving
UTF-8 out is not the primary feature of them, so I am not sure
decode/reencode is a good term to use here.


Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-18 Thread Jeff King
On Mon, Dec 18, 2017 at 11:13:34AM +0100, Torsten Bögershausen wrote:

> Just to confirm my missing knowledge here:
> Does this mean, that git-gui and gitk can decode/reencode
> the content of a file/blob, when the .gitattributes say so ?

That's my impression, yes.

> If yes, would it make sense to enhance the "git diff" instead ?
> "git diff --encoding" will pick up the commited encoding from
> .attributes, convert it into UTF-8, and run the diff ?

You can do something like this already:

  git config diff.utf16.textconv 'iconv -f utf16 -t utf8'
  echo file diff=utf16 >.gitattributes

I have no objection to teaching it that "encoding" means roughly the
same thing, which would have two advantages:

 - we could do it in-process, which would be much faster

 - we could skip the extra config step, which is a blocker for
   server-side diffs on sites like GitHub

I don't think you'd really need "--encoding". This could just be
triggered by the normal "--textconv" rules (i.e., just treat this "as
if" the textconv above was configured when we see an encoding
attribute).

> We actually could enhance the "git diff" output with a single
> line saying
> "Git index-encoding=cp1251"
> or so, which can be picked up by "git apply".

That extends it beyond what textconv can do (we assume that textconv
patches are _just_ for human consumption, and can't be applied). And it
would mean you'd potentially want to trigger it in more places. On the
other hand, I don't know that we're guaranteed that a round-trip
between encodings will produce a byte-wise identical result. The nice
thing about piggy-backing on textconv is that it's already dealt with
that problem.

-Peff


Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-18 Thread Jeff King
On Mon, Dec 18, 2017 at 11:54:32AM +0100, Lars Schneider wrote:

> >  warning: failed to encode 'file' from utf-8 to utf16
> > 
> > At least it figured out that it couldn't convert the content. It's
> > slightly troubling that it would try in the first place, though; are
> > there encoding pairs where we might accidentally generate nonsense?
> 
> At this point we interpret utf-16 content as utf-8 and try to convert
> it to utf-16. That of course fails because utf-16 content is no valid
> utf-8. How could we stop trying that? How could Git possibly know what
> kind of encoding is used (apart from our new hint in gitattributes)?

Yeah, sorry if I wasn't clear: I don't really have an answer to those
questions either. So this is probably the best we can do. I was mostly
just trying to think through the worst case, and what could go wrong.

> > It may make sense to die() during "git add ." (since we're actually
> > changing the index entry, and we don't want to put nonsense into a
> > tree). But I'm not sure it's the best thing for operations which just
> > want to read the content. For them, perhaps it would be more appropriate
> > to issue a warning and return the untouched content.
> 
> Absolutely! Thanks for spotting this. I will try to run die() only on
> "git add" in v2.

Great, thanks!

-Peff


Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-18 Thread Lars Schneider

> On 15 Dec 2017, at 10:58, Jeff King  wrote:
> 
> On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schnei...@autodesk.com wrote:
> 
>> From: Lars Schneider 
>> 
>> Git and its tools (e.g. git diff) expect all text files in UTF-8
>> encoding. Git will happily accept content in all other encodings, too,
>> but it might not be able to process the text (e.g. viewing diffs or
>> changing line endings).
>> 
>> Add an attribute to tell Git what encoding the user has defined for a
>> given file. If the content is added to the index, then Git converts the
>> content to a canonical UTF-8 representation. On checkout Git will
>> reverse the conversion.
> 
> This made me wonder what happens when you have a file that _was_ utf16,
> and then you later convert it to utf8 and add a .gitattributes entry.
> 
> I tried this with your patch:
> 
>  git init repo
>  cd repo
> 
>  echo foo | iconv -t utf16 >file
>  git add file
>  git commit -m utf16
> 
>  echo 'file encoding=utf16' >.gitattributes
>  touch file ;# make it stat-dirty
>  git commit -m convert
> 
>  git checkout HEAD^
> 
> That works OK, because we try to read the attributes from the
> destination tree for a checkout. If you do this:
> 
>  echo 'file encoding=utf16' >.git/info/attributes
>  git checkout HEAD^
> 
> then we get:
> 
>  warning: failed to encode 'file' from utf-8 to utf16
> 
> At least it figured out that it couldn't convert the content. It's
> slightly troubling that it would try in the first place, though; are
> there encoding pairs where we might accidentally generate nonsense?

At this point we interpret utf-16 content as utf-8 and try to convert
it to utf-16. That of course fails because utf-16 content is no valid
utf-8. How could we stop trying that? How could Git possibly know what
kind of encoding is used (apart from our new hint in gitattributes)?

I asked myself the question about nonsense encoding pairs, too. That
was the reason I added the "X -> UTF-8 -> Y && X == Y" check on Git
add. However, with ".git/info/attributes" you could circumvent that
check and probably generate nonsense.


> It _should_ be uncommon, I think, to have a repo-level attribute set
> like that. It's very common for us to use whatever happens to be in the
> checked-out .gitattributes for some attributes (e.g., when doing a diff
> of an older commit), but I think for checkout-related ones it's not.  So
> I think it may generally work in practice. And certainly the line-ending
> code would share any similar problems, but at least there the result is
> less confusing than mojibake.
> 
> Playing around, I also managed to do this:
> 
>  echo 'file encoding=utf16' >.gitattributes
>  echo foo >file
> 
>  # I did these with an old version of git that didn't
>  # support the new attribute, so they blindly added the utf8 content.
>  git add .
>  git commit -m convert
> 
>  git.compile checkout HEAD^
> 
> which yielded:
> 
>  fatal: encoding 'file' from utf16 to utf-8 and back is not the same
> 
> Now obviously my situation is somewhat nonsense. I was trying to force
> the in-repo representation to utf8, but ended up with a mismatched
> working tree file. But what's somewhat troubling is that I couldn't
> checkout _away_ from that state due to the die() in convert_to_git().
> Which is in turn just there as part of refresh_index().
> 
> And indeed, other commands hit the same problem:
> 
>  $ git.compile diff
>  fatal: encoding 'file' from utf16 to utf-8 and back is not the same
> 
>  $ git.compile checkout -f
>  fatal: encoding 'file' from utf16 to utf-8 and back is not the same
> 
> It may make sense to die() during "git add ." (since we're actually
> changing the index entry, and we don't want to put nonsense into a
> tree). But I'm not sure it's the best thing for operations which just
> want to read the content. For them, perhaps it would be more appropriate
> to issue a warning and return the untouched content.

Absolutely! Thanks for spotting this. I will try to run die() only on
"git add" in v2.

- Lars

Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-18 Thread Torsten Bögershausen
On Mon, Dec 11, 2017 at 09:47:24PM +0100, Johannes Sixt wrote:
> Am 11.12.2017 um 16:50 schrieb lars.schnei...@autodesk.com:
> >From: Lars Schneider 
> >
> >Git and its tools (e.g. git diff) expect all text files in UTF-8
> >encoding. Git will happily accept content in all other encodings, too,
> >but it might not be able to process the text (e.g. viewing diffs or
> >changing line endings).
> >
> >Add an attribute to tell Git what encoding the user has defined for a
> >given file. If the content is added to the index, then Git converts the
> >content to a canonical UTF-8 representation. On checkout Git will
> >reverse the conversion.
> >
> >Reviewed-by: Patrick Lühne 
> >Signed-off-by: Lars Schneider 
> >---
> >
> >Hi,
> >
> >here is a WIP patch to add text encoding support for files encoded with
> >something other than UTF-8 [RFC].
> >
> >The 'encoding' attribute is already used to view blobs in gitk. That
> >could be a problem as the content is stored in Git with the defined
> >encoding. This patch would interpret the content as UTF-8 encoded and
> 
> This will be a major drawback for me because my code base stores text files
> that are not UTF-8 encoded. And I do use the existing 'encoding' attribute
> to view the text in git-gui and gitk. Repurposing this attribute name is not
> an option, IMO.

Just to confirm my missing knowledge here:
Does this mean, that git-gui and gitk can decode/reencode
the content of a file/blob, when the .gitattributes say so ?

If yes, would it make sense to enhance the "git diff" instead ?
"git diff --encoding" will pick up the commited encoding from
.attributes, convert it into UTF-8, and run the diff ?
We actually could enhance the "git diff" output with a single
line saying
"Git index-encoding=cp1251"
or so, which can be picked up by "git apply".

The advantage would be that we could continue to commit in UTF-16
as before, and avoid the glitches with .gitattributes, that Peff
pointed out.

Does this make sense ?

> 
> >it would try to reencode it to the defined encoding on checkout > Plus,
> >many repos define the attribute very broad (e.g. "*
> encoding=cp1251").

Is this a user mistake ?

> >These folks would see errors like these with my patch:
> > error: failed to encode 'foo.bar' from utf-8 to cp1251
> 
> -- Hannes


Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-17 Thread Torsten Bögershausen
On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schnei...@autodesk.com wrote:
> From: Lars Schneider 
> 
> Git and its tools (e.g. git diff) expect all text files in UTF-8
> encoding. Git will happily accept content in all other encodings, too,
> but it might not be able to process the text (e.g. viewing diffs or
> changing line endings).
> 
> Add an attribute to tell Git what encoding the user has defined for a
> given file. If the content is added to the index, then Git converts the
> content to a canonical UTF-8 representation. On checkout Git will
> reverse the conversion.
> 
> Reviewed-by: Patrick Lühne 
> Signed-off-by: Lars Schneider 
> ---
> 
> Hi,
> 
> here is a WIP patch to add text encoding support for files encoded with
> something other than UTF-8 [RFC].
> 
> The 'encoding' attribute is already used to view blobs in gitk. That
> could be a problem as the content is stored in Git with the defined
> encoding. This patch would interpret the content as UTF-8 encoded and
> it would try to reencode it to the defined encoding on checkout.
> Plus, many repos define the attribute very broad (e.g. "* encoding=cp1251").
> These folks would see errors like these with my patch:
> error: failed to encode 'foo.bar' from utf-8 to cp1251
> 
> A quick search on GitHub reveals 2,722 repositories that use the
> 'encoding' attribute [1]. Using the GitHub API [2], I identified the
> following encodings in all publicly accessible repositories:
> 
> ANSI<-- garbage (ignore by my implementation)
> cp1251
> cp866
> cp950
> iso8859-1
> koi8-r
> shiftjis<-- garbage (ignore by my implementation)
> UTF-8   <-- unnecessary (ignore by my implementation)
> utf8<-- garbage (ignore by my implementation)
> 
> TODOs:
> - The iconv docs mention that "errno == EINVAL" is harmless but
>   we don't handle that case in utf8.c:reencode_string_iconv()
> - Git does not compile with NO_ICONV=1 right now because of
>   compat/precompose_utf8.c. I will send a patch to fix that.

Minor: Does Git not compile under MacOS when setting NO_ICONV=1 ?
This is possible not introduced by the patch. 

> 
> Questions:
> - Should I use warning() or error() in the patch?
>   (currently I use the warning)

Errors, see below.

> - Do you agree with the approach in general?

Yes, some comments inline.

> 
> Thanks,
> Lars
> 
> 
> [RFC] 
> http://public-inbox.org/git/bdb9b884-6d17-4be3-a83c-f67e2afa2...@gmail.com
>   [1] 
> https://github.com/search?p=1=encoding+filename%3Agitattributes=Code=%E2%9C%93
>   [2] curl --user larsxschneider:secret -H 'Accept: 
> application/vnd.github.v3.text-match+json' 
> 'https://api.github.com/search/code?q=encoding+in:file+filename:gitattributes'
>  | jq -r '.items[].text_matches[].fragment' | sed 's/.*encoding=//' | sort | 
> uniq
>   [3] 
> https://www.gnu.org/software/libc/manual/html_node/iconv-Examples.html#iconv-Examples
> 
> 
> 
> 
> Notes:
> Base Ref: master
> Web-Diff: https://github.com/larsxschneider/git/commit/afc9e88a4d
> Checkout: git fetch https://github.com/larsxschneider/git encoding-v1 && 
> git checkout afc9e88a4d
> 
>  Documentation/gitattributes.txt |  27 ++
>  convert.c   | 192 
> +++-
>  t/t0028-encoding.sh |  60 +
>  3 files changed, 278 insertions(+), 1 deletion(-)
>  create mode 100755 t/t0028-encoding.sh
> 
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index 30687de81a..84de2fa49c 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -146,6 +146,33 @@ Unspecified::
>  Any other value causes Git to act as if `text` has been left
>  unspecified.
> 
> +`encoding`
> +^^
> +
> +By default Git assumes UTF-8 encoding for text files.  The `encoding`

This is probably "ASCII" and it's supersets like ISO-8859-1 or UTF-8.

> +attribute sets the encoding to be used in the working directory.
> +If the path is added to the index, then Git encodes the content to
> +UTF-8.  On checkout the content is encoded back to the original
> +encoding.  Consequently, you can use all built-in Git text processing
> +tools (e.g. git diff, line ending conversions, etc.) with your
> +non-UTF-8 encoded file.
> +
> +Please note that re-encoding content can cause errors and requires
> +resources. Use the `encoding` attribute only if you cannot store
> +a file in UTF-8 encoding and if you want Git to be able to process
> +the text.
> +
> +
> +*.txttext encoding=UTF-16
> +

I think that encoding (or worktree-encoding as said elsewhere) must imply text.
(That is not done in convert.c, but assuming binary or even auto
does not make much sense to me)


> +
> +All `iconv` encodings with a stable round-trip conversion to and from
> +UTF-8 

Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-15 Thread Jeff King
On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schnei...@autodesk.com wrote:

> From: Lars Schneider 
> 
> Git and its tools (e.g. git diff) expect all text files in UTF-8
> encoding. Git will happily accept content in all other encodings, too,
> but it might not be able to process the text (e.g. viewing diffs or
> changing line endings).
> 
> Add an attribute to tell Git what encoding the user has defined for a
> given file. If the content is added to the index, then Git converts the
> content to a canonical UTF-8 representation. On checkout Git will
> reverse the conversion.

This made me wonder what happens when you have a file that _was_ utf16,
and then you later convert it to utf8 and add a .gitattributes entry.

I tried this with your patch:

  git init repo
  cd repo
  
  echo foo | iconv -t utf16 >file
  git add file
  git commit -m utf16
  
  echo 'file encoding=utf16' >.gitattributes
  touch file ;# make it stat-dirty
  git commit -m convert
  
  git checkout HEAD^

That works OK, because we try to read the attributes from the
destination tree for a checkout. If you do this:

  echo 'file encoding=utf16' >.git/info/attributes
  git checkout HEAD^

then we get:

  warning: failed to encode 'file' from utf-8 to utf16

At least it figured out that it couldn't convert the content. It's
slightly troubling that it would try in the first place, though; are
there encoding pairs where we might accidentally generate nonsense?

It _should_ be uncommon, I think, to have a repo-level attribute set
like that. It's very common for us to use whatever happens to be in the
checked-out .gitattributes for some attributes (e.g., when doing a diff
of an older commit), but I think for checkout-related ones it's not.  So
I think it may generally work in practice. And certainly the line-ending
code would share any similar problems, but at least there the result is
less confusing than mojibake.

Playing around, I also managed to do this:

  echo 'file encoding=utf16' >.gitattributes
  echo foo >file

  # I did these with an old version of git that didn't
  # support the new attribute, so they blindly added the utf8 content.
  git add .
  git commit -m convert

  git.compile checkout HEAD^

which yielded:

  fatal: encoding 'file' from utf16 to utf-8 and back is not the same

Now obviously my situation is somewhat nonsense. I was trying to force
the in-repo representation to utf8, but ended up with a mismatched
working tree file. But what's somewhat troubling is that I couldn't
checkout _away_ from that state due to the die() in convert_to_git().
Which is in turn just there as part of refresh_index().

And indeed, other commands hit the same problem:

  $ git.compile diff
  fatal: encoding 'file' from utf16 to utf-8 and back is not the same

  $ git.compile checkout -f
  fatal: encoding 'file' from utf16 to utf-8 and back is not the same

It may make sense to die() during "git add ." (since we're actually
changing the index entry, and we don't want to put nonsense into a
tree). But I'm not sure it's the best thing for operations which just
want to read the content. For them, perhaps it would be more appropriate
to issue a warning and return the untouched content.

-Peff


Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-14 Thread Junio C Hamano
Lars Schneider  writes:

> That way you could control the encoding for a text file specific
> for each path similar to the "mode bits". That also means you could
> change the encoding of a file while the blob content stays the same.

That is exactly why I said that I doubt it makes sense.

When you have the same blob object (that is in UTF-8) appear at two
places in a tree (or two different subtrees inside a single tree)
and the two tree entries that point at the blob tells contradicting
story, you would have a checkout of the same contents in two
different encodings.  If you have the same blob object appear in two
adjacent commits at the same path, with one commit's tree recording
its encoding differently from the other's, you would switch
encodings when you switch branches.  I doubt these are "features",
but sources of confusion.

Be it a feature, or a misfeature, it is shared with the existing
solution which is .gitattributes embedded in the tree, so you are
not making anything better or worse by moving the information to
tree entry.

What I actually expected to hear when somebody talks about "ideal"
(which may not even be achievable given existing user base) was to
make blob object declare its desired external representation.  That
would remove the two possible confusion cited above, because once
you have a blob, you have everything needed to externalize it.

In any case, I do not think we'd go there anyway, so ...


Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-13 Thread Lars Schneider

> On 13 Dec 2017, at 19:11, Junio C Hamano  wrote:
> 
> Lars Schneider  writes:
> 
>> ... In a perfect world I think I would store
>> the encoding of a file in the tree object. I didn't pursue that solution
>> as this would change the Git data model which would open a can of worms
>> for a problem that not that many people have (almost everyone is on
>> UTF-8 anyways).
> 
> Having that "encoding" trailt recorded in the tree that contains the
> blob would mean that the same blob can be recorded with one
> "encoding" trait in a tree, and in a different tree it can be
> recorded with a different "encoding" trait.  I doubt it really makes
> sense.

The "blob object" would store the text data encoded in a canonical 
UTF-8 form. The "tree object" would store the encoding. On checkin 
Git would convert the text from the stored encoding to UTF-8 and on 
checkout it would do the reverse.

That way you could control the encoding for a text file specific
for each path similar to the "mode bits". That also means you could
change the encoding of a file while the blob content stays the same.

Changing the encoding of a file with the .gitattributes approach
can be difficult if you have configured the attribute with a very 
broad pattern (e.g. *.foo). You would either need to rename the
file or limit the scope of your pattern.

- Lars


Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-13 Thread Junio C Hamano
Lars Schneider  writes:

> ... In a perfect world I think I would store
> the encoding of a file in the tree object. I didn't pursue that solution
> as this would change the Git data model which would open a can of worms
> for a problem that not that many people have (almost everyone is on
> UTF-8 anyways).

Having that "encoding" trailt recorded in the tree that contains the
blob would mean that the same blob can be recorded with one
"encoding" trait in a tree, and in a different tree it can be
recorded with a different "encoding" trait.  I doubt it really makes
sense.



Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-13 Thread Lars Schneider

> On 12 Dec 2017, at 20:31, Junio C Hamano  wrote:
> 
> Lars Schneider  writes:
> 
>> Our favorite is "treat-encoding-as". Do you consider this better
>> or worse than "checkout-encoding"?
> 
> I am afraid that "treat as" is not sufficiently specific and would
> invite a misinterpretation, e.g. "You record the bytes I throw at
> you as-is in the object store, but treat them appropriately as
> contents that are encoded in cp1252 when presenting".
> 
> what is missing is at what stage in the overall user experience does
> that "treating" happens.  That causes such a misinterpretation.
> 
> So from that point of view, "checkout-" or" working-tree-" would be
> a better phrase to accompany "encoding" to clarify what this attr is
> for than "treat-as".

Alright. I'll go with "checkout-encoding" then.


> Having said all that, this "feature" would need a moderate amount of
> clear description in the documentation, and between the perfect and
> a suboptimal name, the amount of explanation required would not be
> all that different, I suspect.  We need to say that those who wish
> to use this attribute are buying into recording their contents in
> UTF-8 and when the contents are externalized to the working tree,
> they are converted to the encoding the value of this attribute
> specifies and vice versa.

Absolutely! The docs will come in v2. I just wanted to send out a v1
to make people aware of what I am working on.

I think storing the encoding in gitattributes might not be the perfect
solution but probably a pragmatic one (because it works and the changes
to Git are rather small). In a perfect world I think I would store
the encoding of a file in the tree object. I didn't pursue that solution
as this would change the Git data model which would open a can of worms
for a problem that not that many people have (almost everyone is on
UTF-8 anyways).

- Lars


Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-12 Thread Junio C Hamano
Lars Schneider  writes:

> Our favorite is "treat-encoding-as". Do you consider this better
> or worse than "checkout-encoding"?

I am afraid that "treat as" is not sufficiently specific and would
invite a misinterpretation, e.g. "You record the bytes I throw at
you as-is in the object store, but treat them appropriately as
contents that are encoded in cp1252 when presenting".

what is missing is at what stage in the overall user experience does
that "treating" happens.  That causes such a misinterpretation.

So from that point of view, "checkout-" or" working-tree-" would be
a better phrase to accompany "encoding" to clarify what this attr is
for than "treat-as".

Having said all that, this "feature" would need a moderate amount of
clear description in the documentation, and between the perfect and
a suboptimal name, the amount of explanation required would not be
all that different, I suspect.  We need to say that those who wish
to use this attribute are buying into recording their contents in
UTF-8 and when the contents are externalized to the working tree,
they are converted to the encoding the value of this attribute
specifies and vice versa.



Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-12 Thread Lars Schneider

> On 12 Dec 2017, at 00:58, Eric Sunshine  wrote:
> 
> On Mon, Dec 11, 2017 at 6:47 PM, Lars Schneider
>  wrote:
>> On 11 Dec 2017, at 19:39, Eric Sunshine  wrote:
>>> On Mon, Dec 11, 2017 at 10:50 AM,   wrote:
 From: Lars Schneider 
 
 Git and its tools (e.g. git diff) expect all text files in UTF-8
 encoding. Git will happily accept content in all other encodings, too,
 but it might not be able to process the text (e.g. viewing diffs or
 changing line endings).
 
 Add an attribute to tell Git what encoding the user has defined for a
 given file. If the content is added to the index, then Git converts the
 content to a canonical UTF-8 representation. On checkout Git will
 reverse the conversion.
 
 Reviewed-by: Patrick Lühne 
 Signed-off-by: Lars Schneider 
 ---
 +static int encode_to_git(const char *path, const char *src, size_t 
 src_len,
 +struct strbuf *buf, struct encoding *enc)
 +{
 +   if (enc->to_git == invalid_conversion) {
 +   enc->to_git = iconv_open(default_encoding, encoding->name);
 +   if (enc->to_git == invalid_conversion)
 +   warning(_("unsupported encoding %s"), 
 encoding->name);
 +   }
 +
 +   if (enc->to_worktree == invalid_conversion)
 +   enc->to_worktree = iconv_open(encoding->name, 
 default_encoding);
>>> 
>>> Do you need to be calling iconv_close() somewhere on the result of the
>>> iconv_open() calls? [Answering myself after reading the rest of the
>>> patch: You're caching these opened 'iconv' descriptors, so you don't
>>> plan on closing them.]
>> 
>> Should this information go into the commit message to avoid confusing
>> future readers? I think, yes.
> 
> Maybe. However, the code which does the actual caching is so distant
> from these iconv_open() invocations that it might be more helpful to
> have an in-code comment here saying that the "missing" iconv_close()
> invocations is intentional.

Agreed. I will add that in v2.

Thanks,
Lars



Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-12 Thread Lars Schneider

> On 12 Dec 2017, at 08:15, Johannes Sixt  wrote:
> 
> Am 12.12.2017 um 01:59 schrieb Junio C Hamano:
>> Stepping back a bit, what does this thing do you are introducing?
>> And what does the other thing do that J6t is using, that would get
>> confused with this new one?
>> What does the other one do?  "Declare that the contents of this path
>> is in this encoding"?  As opposed to the new one, which tells Git to
>> "run iconv from and to this encoding when checking out and checking
>> in"?
>> If so, any phrase that depends heavily on the word "encode" would
>> not help differenciating the two uses.  The phrase needs to be
>> something that contrasts the new one, which actively modifies things
>> (what is on the filesystem is not what is stored in the object
>> store), with the old one, which does not (passed as a declaration to
>> a viewer what encoding the contents already use and does not change
>> anything).
> 
> Well explained!
> 
>> ...  perhaps "smudge-encoding" would work (we declare that the
>> result of smudge operations are left in this encoding, so the
>> opposite operation "clean" will do the reverse---and we say this
>> without explicitly saying that the other end of the conversion is
>> always UTF-8)?  Or "checkout-encoding" (the same explanation; we do
>> not say the opposite operation "checkin/add" will do the reverse).
> 
> I would favor "checkout-encoding" over "smudge-encoding" only because 
> "checkout" is better known than "smudge", I would think. I do not have better 
> suggestions.

Thanks for your thoughts! "checkout-encoding" would work for me.
However, it might convey that Git "does change the files of the
user in some way" (which is true from Git's perspective but not
from the user perspective).

Patrick and I brainstormed a few more possible alternatives:

apply-encoding
blob-encoding
checkout-encoding
diff-encoding
encoding-hint
external-encoding
handle-as
internal-encoding
keep-encoding
present-as
preserve-encoding
source-encoding
text-conversion
text-encoding
treat-as 
treat-encoding-as

Our favorite is "treat-encoding-as". Do you consider this better
or worse than "checkout-encoding"?

Thanks,
Lars


PS: Naming things is hard ;-)

Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-11 Thread Johannes Sixt

Am 12.12.2017 um 01:59 schrieb Junio C Hamano:

Stepping back a bit, what does this thing do you are introducing?
And what does the other thing do that J6t is using, that would get
confused with this new one?

What does the other one do?  "Declare that the contents of this path
is in this encoding"?  As opposed to the new one, which tells Git to
"run iconv from and to this encoding when checking out and checking
in"?

If so, any phrase that depends heavily on the word "encode" would
not help differenciating the two uses.  The phrase needs to be
something that contrasts the new one, which actively modifies things
(what is on the filesystem is not what is stored in the object
store), with the old one, which does not (passed as a declaration to
a viewer what encoding the contents already use and does not change
anything).


Well explained!


...  perhaps "smudge-encoding" would work (we declare that the
result of smudge operations are left in this encoding, so the
opposite operation "clean" will do the reverse---and we say this
without explicitly saying that the other end of the conversion is
always UTF-8)?  Or "checkout-encoding" (the same explanation; we do
not say the opposite operation "checkin/add" will do the reverse).


I would favor "checkout-encoding" over "smudge-encoding" only because 
"checkout" is better known than "smudge", I would think. I do not have 
better suggestions.


-- Hannes


Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-11 Thread Johannes Sixt

Am 12.12.2017 um 00:42 schrieb Lars Schneider:

BTW: I am curios, can you share what encoding you use?
My main use case is UTF-16 and I was surprised that I haven't
found a single public repo on github.com with "encoding=utf-16"


Shift-JIS and CP1252. These are used for Windows resource files (*.rc).

-- Hannes


Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-11 Thread Junio C Hamano
Lars Schneider  writes:

> I contemplated:
>   - "enc" or "encode" because "eol" and "ident" use abbreviations, too
> (enc could be confused with encryption. plus, a user might ask
>  what is the difference between "enc" and "encoding" attribute :-)
>   - "wte", "wtenc", or "worktree-encoding" to emphasize that this is 
> the encoding used in the worktree 
> (I fear that users think that is git-worktree, the command, related)

In the context of Git, the word "worktree" does have a specific
meaning that is different from working tree.  

Stepping back a bit, what does this thing do you are introducing?
And what does the other thing do that J6t is using, that would get
confused with this new one?

What does the other one do?  "Declare that the contents of this path
is in this encoding"?  As opposed to the new one, which tells Git to
"run iconv from and to this encoding when checking out and checking
in"?

If so, any phrase that depends heavily on the word "encode" would
not help differenciating the two uses.  The phrase needs to be
something that contrasts the new one, which actively modifies things
(what is on the filesystem is not what is stored in the object
store), with the old one, which does not (passed as a declaration to
a viewer what encoding the contents already use and does not change
anything).

Do people who will use this feature familiar with the concept of
smudge/clean?  If you want to avoid "working-tree" (or "worktree",
which definitely you would want to avoid) because you fear confused
users, perhaps "smudge-encoding" would work (we declare that the
result of smudge operations are left in this encoding, so the
opposite operation "clean" will do the reverse---and we say this
without explicitly saying that the other end of the conversion is
always UTF-8)?  Or "checkout-encoding" (the same explanation; we do
not say the opposite operation "checkin/add" will do the reverse).

I personally do not think "working-tree-encoding" is too horrible,
but I do agree that some users may be confused.  So I dunno.





Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-11 Thread Eric Sunshine
On Mon, Dec 11, 2017 at 6:47 PM, Lars Schneider
 wrote:
> On 11 Dec 2017, at 19:39, Eric Sunshine  wrote:
>> On Mon, Dec 11, 2017 at 10:50 AM,   wrote:
>>> From: Lars Schneider 
>>>
>>> Git and its tools (e.g. git diff) expect all text files in UTF-8
>>> encoding. Git will happily accept content in all other encodings, too,
>>> but it might not be able to process the text (e.g. viewing diffs or
>>> changing line endings).
>>>
>>> Add an attribute to tell Git what encoding the user has defined for a
>>> given file. If the content is added to the index, then Git converts the
>>> content to a canonical UTF-8 representation. On checkout Git will
>>> reverse the conversion.
>>>
>>> Reviewed-by: Patrick Lühne 
>>> Signed-off-by: Lars Schneider 
>>> ---
>>> +static int encode_to_git(const char *path, const char *src, size_t src_len,
>>> +struct strbuf *buf, struct encoding *enc)
>>> +{
>>> +   if (enc->to_git == invalid_conversion) {
>>> +   enc->to_git = iconv_open(default_encoding, encoding->name);
>>> +   if (enc->to_git == invalid_conversion)
>>> +   warning(_("unsupported encoding %s"), 
>>> encoding->name);
>>> +   }
>>> +
>>> +   if (enc->to_worktree == invalid_conversion)
>>> +   enc->to_worktree = iconv_open(encoding->name, 
>>> default_encoding);
>>
>> Do you need to be calling iconv_close() somewhere on the result of the
>> iconv_open() calls? [Answering myself after reading the rest of the
>> patch: You're caching these opened 'iconv' descriptors, so you don't
>> plan on closing them.]
>
> Should this information go into the commit message to avoid confusing
> future readers? I think, yes.

Maybe. However, the code which does the actual caching is so distant
from these iconv_open() invocations that it might be more helpful to
have an in-code comment here saying that the "missing" iconv_close()
invocations is intentional.


Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-11 Thread Lars Schneider

On 11 Dec 2017, at 19:39, Eric Sunshine  wrote:

> On Mon, Dec 11, 2017 at 10:50 AM,   wrote:
>> From: Lars Schneider 
>> 
>> Git and its tools (e.g. git diff) expect all text files in UTF-8
>> encoding. Git will happily accept content in all other encodings, too,
>> but it might not be able to process the text (e.g. viewing diffs or
>> changing line endings).
>> 
>> Add an attribute to tell Git what encoding the user has defined for a
>> given file. If the content is added to the index, then Git converts the
>> content to a canonical UTF-8 representation. On checkout Git will
>> reverse the conversion.
>> 
>> Reviewed-by: Patrick Lühne 
>> Signed-off-by: Lars Schneider 
>> ---
>> diff --git a/convert.c b/convert.c
>> @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct 
>> text_stat *stats,
>> +static int encode_to_git(const char *path, const char *src, size_t src_len,
>> +struct strbuf *buf, struct encoding *enc)
>> +{
>> +#ifndef NO_ICONV
>> +   char *dst, *re_src;
>> +   int dst_len, re_src_len;
>> +
>> +   /*
>> +* No encoding is specified or there is nothing to encode.
>> +* Tell the caller that the content was not modified.
>> +*/
>> +   if (!enc || (src && !src_len))
>> +   return 0;
>> +
>> +   /*
>> +* Looks like we got called from "would_convert_to_git()".
>> +* This means Git wants to know if it would encode (= modify!)
>> +* the content. Let's answer with "yes", since an encoding was
>> +* specified.
>> +*/
>> +   if (!buf && !src)
>> +   return 1;
>> +
>> +   if (enc->to_git == invalid_conversion) {
>> +   enc->to_git = iconv_open(default_encoding, encoding->name);
>> +   if (enc->to_git == invalid_conversion)
>> +   warning(_("unsupported encoding %s"), 
>> encoding->name);
>> +   }
>> +
>> +   if (enc->to_worktree == invalid_conversion)
>> +   enc->to_worktree = iconv_open(encoding->name, 
>> default_encoding);
> 
> Do you need to be calling iconv_close() somewhere on the result of the
> iconv_open() calls? [Answering myself after reading the rest of the
> patch: You're caching these opened 'iconv' descriptors, so you don't
> plan on closing them.]

Should this information go into the commit message to avoid confusing
future readers? I think, yes.


>> + [...]
>> +   /*
>> +* Encode dst back to ensure no information is lost. This wastes
>> +* a few cycles as most conversions are round trip conversion
>> +* safe. However, content that has an invalid encoding might not
>> +* match its original byte sequence after the UTF-8 conversion
>> +* round trip. Let's play safe here and check the round trip
>> +* conversion.
>> +*/
>> +   re_src = reencode_string_iconv(dst, dst_len, enc->to_worktree, 
>> _src_len);
>> +   if (!re_src || strcmp(src, re_src)) {
> 
> You're using strcmp() as opposed to memcmp() because you expect
> 're_src' will unconditionally be UTF-8-encoded, right?

Yes. But since you mention it, I think it would be better to use
memcmp() here! Plus, it might be a tiny bit faster ;-)

Thanks,
Lars

Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-11 Thread Lars Schneider

On 11 Dec 2017, at 21:47, Johannes Sixt  wrote:

> Am 11.12.2017 um 16:50 schrieb lars.schnei...@autodesk.com:
>> From: Lars Schneider 
>> Git and its tools (e.g. git diff) expect all text files in UTF-8
>> encoding. Git will happily accept content in all other encodings, too,
>> but it might not be able to process the text (e.g. viewing diffs or
>> changing line endings).
>> Add an attribute to tell Git what encoding the user has defined for a
>> given file. If the content is added to the index, then Git converts the
>> content to a canonical UTF-8 representation. On checkout Git will
>> reverse the conversion.
>> Reviewed-by: Patrick Lühne 
>> Signed-off-by: Lars Schneider 
>> ---
>> Hi,
>> here is a WIP patch to add text encoding support for files encoded with
>> something other than UTF-8 [RFC].
>> The 'encoding' attribute is already used to view blobs in gitk. That
>> could be a problem as the content is stored in Git with the defined
>> encoding. This patch would interpret the content as UTF-8 encoded and
> 
> This will be a major drawback for me because my code base stores text files 
> that are not UTF-8 encoded. And I do use the existing 'encoding' attribute to 
> view the text in git-gui and gitk. Repurposing this attribute name is not an 
> option, IMO.

I understand your point of view and I kind of expected that that reply.
Thanks for the feedback!

Question is: Given that "encoding" is not available, how could I name
 the attribute without confusing the user?

I contemplated:
  - "enc" or "encode" because "eol" and "ident" use abbreviations, too
(enc could be confused with encryption. plus, a user might ask
 what is the difference between "enc" and "encoding" attribute :-)
  - "wte", "wtenc", or "worktree-encoding" to emphasize that this is 
the encoding used in the worktree 
(I fear that users think that is git-worktree, the command, related)

I think my favorite is "worktree-encoding".
What do you think?

Thanks,
Lars 


BTW: I am curios, can you share what encoding you use?
My main use case is UTF-16 and I was surprised that I haven't
found a single public repo on github.com with "encoding=utf-16"



Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-11 Thread Johannes Sixt

Am 11.12.2017 um 16:50 schrieb lars.schnei...@autodesk.com:

From: Lars Schneider 

Git and its tools (e.g. git diff) expect all text files in UTF-8
encoding. Git will happily accept content in all other encodings, too,
but it might not be able to process the text (e.g. viewing diffs or
changing line endings).

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Reviewed-by: Patrick Lühne 
Signed-off-by: Lars Schneider 
---

Hi,

here is a WIP patch to add text encoding support for files encoded with
something other than UTF-8 [RFC].

The 'encoding' attribute is already used to view blobs in gitk. That
could be a problem as the content is stored in Git with the defined
encoding. This patch would interpret the content as UTF-8 encoded and


This will be a major drawback for me because my code base stores text 
files that are not UTF-8 encoded. And I do use the existing 'encoding' 
attribute to view the text in git-gui and gitk. Repurposing this 
attribute name is not an option, IMO.


it would try to reencode it to the defined encoding on checkout > Plus, many repos define the attribute very broad (e.g. "* 

encoding=cp1251").

These folks would see errors like these with my patch:
 error: failed to encode 'foo.bar' from utf-8 to cp1251


-- Hannes


Re: [PATCH v1] convert: add support for 'encoding' attribute

2017-12-11 Thread Eric Sunshine
On Mon, Dec 11, 2017 at 10:50 AM,   wrote:
> From: Lars Schneider 
>
> Git and its tools (e.g. git diff) expect all text files in UTF-8
> encoding. Git will happily accept content in all other encodings, too,
> but it might not be able to process the text (e.g. viewing diffs or
> changing line endings).
>
> Add an attribute to tell Git what encoding the user has defined for a
> given file. If the content is added to the index, then Git converts the
> content to a canonical UTF-8 representation. On checkout Git will
> reverse the conversion.
>
> Reviewed-by: Patrick Lühne 
> Signed-off-by: Lars Schneider 
> ---
> diff --git a/convert.c b/convert.c
> @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct 
> text_stat *stats,
> +static int encode_to_git(const char *path, const char *src, size_t src_len,
> +struct strbuf *buf, struct encoding *enc)
> +{
> +#ifndef NO_ICONV
> +   char *dst, *re_src;
> +   int dst_len, re_src_len;
> +
> +   /*
> +* No encoding is specified or there is nothing to encode.
> +* Tell the caller that the content was not modified.
> +*/
> +   if (!enc || (src && !src_len))
> +   return 0;
> +
> +   /*
> +* Looks like we got called from "would_convert_to_git()".
> +* This means Git wants to know if it would encode (= modify!)
> +* the content. Let's answer with "yes", since an encoding was
> +* specified.
> +*/
> +   if (!buf && !src)
> +   return 1;
> +
> +   if (enc->to_git == invalid_conversion) {
> +   enc->to_git = iconv_open(default_encoding, encoding->name);
> +   if (enc->to_git == invalid_conversion)
> +   warning(_("unsupported encoding %s"), encoding->name);
> +   }
> +
> +   if (enc->to_worktree == invalid_conversion)
> +   enc->to_worktree = iconv_open(encoding->name, 
> default_encoding);

Do you need to be calling iconv_close() somewhere on the result of the
iconv_open() calls? [Answering myself after reading the rest of the
patch: You're caching these opened 'iconv' descriptors, so you don't
plan on closing them.]

> + [...]
> +   /*
> +* Encode dst back to ensure no information is lost. This wastes
> +* a few cycles as most conversions are round trip conversion
> +* safe. However, content that has an invalid encoding might not
> +* match its original byte sequence after the UTF-8 conversion
> +* round trip. Let's play safe here and check the round trip
> +* conversion.
> +*/
> +   re_src = reencode_string_iconv(dst, dst_len, enc->to_worktree, 
> _src_len);
> +   if (!re_src || strcmp(src, re_src)) {

You're using strcmp() as opposed to memcmp() because you expect
're_src' will unconditionally be UTF-8-encoded, right?

> +   die(_("encoding '%s' from %s to %s and back is not the same"),
> +   path, enc->name, default_encoding);
> +   }
> +   free(re_src);
> +
> +   strbuf_attach(buf, dst, dst_len, dst_len + 1);
> +   return 1;
> +#else
> +   warning(_("cannot encode '%s' from %s to %s because "
> +   "your Git was not compiled with encoding support"),
> +   path, enc->name, default_encoding);
> +   return 0;
> +#endif
> +}


[PATCH v1] convert: add support for 'encoding' attribute

2017-12-11 Thread lars . schneider
From: Lars Schneider 

Git and its tools (e.g. git diff) expect all text files in UTF-8
encoding. Git will happily accept content in all other encodings, too,
but it might not be able to process the text (e.g. viewing diffs or
changing line endings).

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Reviewed-by: Patrick Lühne 
Signed-off-by: Lars Schneider 
---

Hi,

here is a WIP patch to add text encoding support for files encoded with
something other than UTF-8 [RFC].

The 'encoding' attribute is already used to view blobs in gitk. That
could be a problem as the content is stored in Git with the defined
encoding. This patch would interpret the content as UTF-8 encoded and
it would try to reencode it to the defined encoding on checkout.
Plus, many repos define the attribute very broad (e.g. "* encoding=cp1251").
These folks would see errors like these with my patch:
error: failed to encode 'foo.bar' from utf-8 to cp1251

A quick search on GitHub reveals 2,722 repositories that use the
'encoding' attribute [1]. Using the GitHub API [2], I identified the
following encodings in all publicly accessible repositories:

ANSI<-- garbage (ignore by my implementation)
cp1251
cp866
cp950
iso8859-1
koi8-r
shiftjis<-- garbage (ignore by my implementation)
UTF-8   <-- unnecessary (ignore by my implementation)
utf8<-- garbage (ignore by my implementation)

TODOs:
- The iconv docs mention that "errno == EINVAL" is harmless but
  we don't handle that case in utf8.c:reencode_string_iconv()
- Git does not compile with NO_ICONV=1 right now because of
  compat/precompose_utf8.c. I will send a patch to fix that.

Questions:
- Should I use warning() or error() in the patch?
  (currently I use the warning)
- Do you agree with the approach in general?

Thanks,
Lars


[RFC] http://public-inbox.org/git/bdb9b884-6d17-4be3-a83c-f67e2afa2...@gmail.com
  [1] 
https://github.com/search?p=1=encoding+filename%3Agitattributes=Code=%E2%9C%93
  [2] curl --user larsxschneider:secret -H 'Accept: 
application/vnd.github.v3.text-match+json' 
'https://api.github.com/search/code?q=encoding+in:file+filename:gitattributes' 
| jq -r '.items[].text_matches[].fragment' | sed 's/.*encoding=//' | sort | uniq
  [3] 
https://www.gnu.org/software/libc/manual/html_node/iconv-Examples.html#iconv-Examples




Notes:
Base Ref: master
Web-Diff: https://github.com/larsxschneider/git/commit/afc9e88a4d
Checkout: git fetch https://github.com/larsxschneider/git encoding-v1 && 
git checkout afc9e88a4d

 Documentation/gitattributes.txt |  27 ++
 convert.c   | 192 +++-
 t/t0028-encoding.sh |  60 +
 3 files changed, 278 insertions(+), 1 deletion(-)
 create mode 100755 t/t0028-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..84de2fa49c 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -146,6 +146,33 @@ Unspecified::
 Any other value causes Git to act as if `text` has been left
 unspecified.

+`encoding`
+^^
+
+By default Git assumes UTF-8 encoding for text files.  The `encoding`
+attribute sets the encoding to be used in the working directory.
+If the path is added to the index, then Git encodes the content to
+UTF-8.  On checkout the content is encoded back to the original
+encoding.  Consequently, you can use all built-in Git text processing
+tools (e.g. git diff, line ending conversions, etc.) with your
+non-UTF-8 encoded file.
+
+Please note that re-encoding content can cause errors and requires
+resources. Use the `encoding` attribute only if you cannot store
+a file in UTF-8 encoding and if you want Git to be able to process
+the text.
+
+
+*.txt  text encoding=UTF-16
+
+
+All `iconv` encodings with a stable round-trip conversion to and from
+UTF-8 are supported.  You can see a full list with the following command:
+
+
+iconv --list
+
+
 `eol`
 ^

diff --git a/convert.c b/convert.c
index 20d7ab67bd..ee19c17104 100644
--- a/convert.c
+++ b/convert.c
@@ -7,6 +7,7 @@
 #include "sigchain.h"
 #include "pkt-line.h"
 #include "sub-process.h"
+#include "utf8.h"

 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct 
text_stat *stats,

 }

+#ifdef NO_ICONV
+#ifndef _ICONV_T
+/* The type is just a placeholder and not actually used. */
+typedef void* iconv_t;
+#endif
+#endif
+
+static struct encoding {
+   const char *name;
+