Re: [PATCH v2 4/5] convert: add support for 'checkout-encoding' attribute

2017-12-30 Thread Lars Schneider

> On 29 Dec 2017, at 18:28, Torsten Bögershausen  wrote:
> 
> I probably need to look at convert.c more closer, some other comments inline.
> 
> 
> On Fri, Dec 29, 2017 at 04:22:21PM +0100, lars.schnei...@autodesk.com wrote:
>> From: Lars Schneider 
>> 
>> Git and its tools (e.g. git diff) expect all text files in UTF-8
>> encoding. Git will happily accept content in all other encodings, too,
>> but it might not be able to process the text (e.g. viewing diffs or
>> changing line endings).
> 
> UTF-8 is too strict, the text from below is more correct:
> +Git recognizes files encoded with ASCII or one of its supersets (e.g.
> +UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
> +interpreted as binary and consequently built-in Git text processing
> +tools (e.g. 'git diff') as well as most Git web front ends do not
> +visualize the content.

Agreed. I'll replace it.


>> Add an attribute to tell Git what encoding the user has defined for a
>> given file. If the content is added to the index, then Git converts the
>> content to a canonical UTF-8 representation. On checkout Git will
> 
> Minor question about "canonical":
> Would this mean the same ?
> ...then Git converts the content into  UTF-8.

I feel the word canonical makes sense in this context to express
that UTF-8 is not some randomly chosen encoding. AFAIK UTF-8 and 
LF are the canonical form for text in Git, no?


>> +In these cases you can teach Git the encoding of a file in the working
>  teach ? tell ? 

"tell", agreed!


>> +directory with the `checkout-encoding` attribute. If a file with this
>> +attributes is added to Git, then Git reencodes the content from the
>> +specified encoding to UTF-8 and stores the result in its internal data
>> +structure.
> 
> Minor Q:
>> its internal data structure.
> Should we simply write "stores the result in the index" ?

This text is targeted at the end user and I know a lot of end users
that are confused by the word "index". How about:

  ...stores the result in its internal data structure ("the index").

Would that work?

> 
>> On checkout the content is encoded back to the specified
>> +encoding.
> 
>> +
>> +Please note that using the `checkout-encoding` attribute has a number
>> +of drawbacks:
>> +
>> +- Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
>> +  errors as the conversion might not be round trip safe.
>> +
>> +- Reencoding content requires resources that might slow down certain
>> +  Git operations (e.g 'git checkout' or 'git add').
>> +
>> +- Git clients that do not support the `checkout-encoding` attribute or
>> +  the used encoding will checkout the respective files as UTF-8 encoded.
>> +  That means the content appears to be different which could cause
>> +  trouble. Affected clients are older Git versions and alternative Git
>> +  implementations such as JGit or libgit2 (as of January 2018).
>> +
>> +Use the `checkout-encoding` attribute only if you cannot store a file in
>> +UTF-8 encoding and if you want Git to be able to process the content as
>> +text.
>> +
> 
> I would maybe rephrase a little bit (first things first):
> 
> Please note that using the `checkout-encoding` attribute may have a number
> of pitfalls:
> 
> - Git clients that do not support the `checkout-encoding` attribute
>  will checkout the respective files as UTF-8 encoded,  which typically
>  causes trouble.
>  Escpecialy when older Git versions are used or alternative Git
>  implementations such as JGit or libgit2 (as of January 2018).
> 
> - Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
>  errors as the conversion might not be round trip safe.
> 
> - Reencoding content requires resources that might slow down certain
>  Git operations (e.g 'git checkout' or 'git add').
> 
> Use the `checkout-encoding` attribute only if you cannot store a file in
> UTF-8 encoding and if you want Git to be able to process the content as
> text.

Agreed!


> -
> Side question: What happens if "the used encoding" is not supported by
> the underlying iconv lib ?
> Will Git fail, delete the file from the working tree ?
> That may be worth to mention. (And I need to do the code-review)

It should checkout the file in UTF-8 and print an error.
If you would try to add a file with an unsupported encoding,
then Git should die() and refuse the operation.
(see t0028: check unsupported encoding)


> 
>> +Use the following attributes if your '*.txt' files are UTF-16 encoded
>> +with byte order mark (BOM) and you want Git to perform automatic line
>> +ending conversion based on your platform.
>> +
>> +
>> +*.txt   text checkout-encoding=UTF-16
>> +
>> +
>> +Use the following attributes if your '*.txt' files are UTF-16 little
>> +endian encoded without BOM and you want Git to use Windows line endings
>> +in the working directory.
>> +
>> +
>> 

Re: [PATCH v2 4/5] convert: add support for 'checkout-encoding' attribute

2017-12-29 Thread Torsten Bögershausen
I probably need to look at convert.c more closer, some other comments inline.


On Fri, Dec 29, 2017 at 04:22:21PM +0100, lars.schnei...@autodesk.com wrote:
> From: Lars Schneider 
> 
> Git and its tools (e.g. git diff) expect all text files in UTF-8
> encoding. Git will happily accept content in all other encodings, too,
> but it might not be able to process the text (e.g. viewing diffs or
> changing line endings).

UTF-8 is too strict, the text from below is more correct:
 +Git recognizes files encoded with ASCII or one of its supersets (e.g.
 +UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
 +interpreted as binary and consequently built-in Git text processing
 +tools (e.g. 'git diff') as well as most Git web front ends do not
 +visualize the content.


> 
> Add an attribute to tell Git what encoding the user has defined for a
> given file. If the content is added to the index, then Git converts the
> content to a canonical UTF-8 representation. On checkout Git will

Minor question about "canonical":
Would this mean the same ?
 ...then Git converts the content into  UTF-8.


> reverse the conversion.
> 
> Signed-off-by: Lars Schneider 
> ---
>  Documentation/gitattributes.txt |  59 
>  apply.c |   2 +-
>  blame.c |   2 +-
>  combine-diff.c  |   2 +-
>  convert.c   | 196 ++-
>  convert.h   |   8 +-
>  diff.c  |   2 +-
>  sha1_file.c |   5 +-
>  t/t0028-checkout-encoding.sh| 197 
> 
>  9 files changed, 460 insertions(+), 13 deletions(-)
>  create mode 100755 t/t0028-checkout-encoding.sh
> 
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index 30687de81a..0039bd38c3 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -272,6 +272,65 @@ few exceptions.  Even though...
>catch potential problems early, safety triggers.
>  
>  
> +`checkout-encoding`
> +^^^
> +
> +Git recognizes files encoded with ASCII or one of its supersets (e.g.
> +UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
> +interpreted as binary and consequently built-in Git text processing
> +tools (e.g. 'git diff') as well as most Git web front ends do not
> +visualize the content.
> +

> +In these cases you can teach Git the encoding of a file in the working
  teach ? tell ? 
> +directory with the `checkout-encoding` attribute. If a file with this
> +attributes is added to Git, then Git reencodes the content from the
> +specified encoding to UTF-8 and stores the result in its internal data
> +structure.

Minor Q:
> its internal data structure.
Should we simply write "stores the result in the index" ?

> On checkout the content is encoded back to the specified
> +encoding.

> +
> +Please note that using the `checkout-encoding` attribute has a number
> +of drawbacks:
> +
> +- Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
> +  errors as the conversion might not be round trip safe.
> +
> +- Reencoding content requires resources that might slow down certain
> +  Git operations (e.g 'git checkout' or 'git add').
> +
> +- Git clients that do not support the `checkout-encoding` attribute or
> +  the used encoding will checkout the respective files as UTF-8 encoded.
> +  That means the content appears to be different which could cause
> +  trouble. Affected clients are older Git versions and alternative Git
> +  implementations such as JGit or libgit2 (as of January 2018).
> +
> +Use the `checkout-encoding` attribute only if you cannot store a file in
> +UTF-8 encoding and if you want Git to be able to process the content as
> +text.
> +

I would maybe rephrase a little bit (first things first):

Please note that using the `checkout-encoding` attribute may have a number
of pitfalls:

- Git clients that do not support the `checkout-encoding` attribute
  will checkout the respective files as UTF-8 encoded,  which typically
  causes trouble.
  Escpecialy when older Git versions are used or alternative Git
  implementations such as JGit or libgit2 (as of January 2018).

- Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
  errors as the conversion might not be round trip safe.

- Reencoding content requires resources that might slow down certain
  Git operations (e.g 'git checkout' or 'git add').

Use the `checkout-encoding` attribute only if you cannot store a file in
UTF-8 encoding and if you want Git to be able to process the content as
text.

-
Side question: What happens if "the used encoding" is not supported by
the underlying iconv lib ?
Will Git fail, delete the file from the working tree ?
That may be worth to mention. (And I need to do the code-review)

> +Use the 

[PATCH v2 4/5] convert: add support for 'checkout-encoding' attribute

2017-12-29 Thread lars . schneider
From: Lars Schneider 

Git and its tools (e.g. git diff) expect all text files in UTF-8
encoding. Git will happily accept content in all other encodings, too,
but it might not be able to process the text (e.g. viewing diffs or
changing line endings).

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Signed-off-by: Lars Schneider 
---
 Documentation/gitattributes.txt |  59 
 apply.c |   2 +-
 blame.c |   2 +-
 combine-diff.c  |   2 +-
 convert.c   | 196 ++-
 convert.h   |   8 +-
 diff.c  |   2 +-
 sha1_file.c |   5 +-
 t/t0028-checkout-encoding.sh| 197 
 9 files changed, 460 insertions(+), 13 deletions(-)
 create mode 100755 t/t0028-checkout-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..0039bd38c3 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,6 +272,65 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.
 
 
+`checkout-encoding`
+^^^
+
+Git recognizes files encoded with ASCII or one of its supersets (e.g.
+UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
+interpreted as binary and consequently built-in Git text processing
+tools (e.g. 'git diff') as well as most Git web front ends do not
+visualize the content.
+
+In these cases you can teach Git the encoding of a file in the working
+directory with the `checkout-encoding` attribute. If a file with this
+attributes is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8 and stores the result in its internal data
+structure. On checkout the content is encoded back to the specified
+encoding.
+
+Please note that using the `checkout-encoding` attribute has a number
+of drawbacks:
+
+- Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
+  errors as the conversion might not be round trip safe.
+
+- Reencoding content requires resources that might slow down certain
+  Git operations (e.g 'git checkout' or 'git add').
+
+- Git clients that do not support the `checkout-encoding` attribute or
+  the used encoding will checkout the respective files as UTF-8 encoded.
+  That means the content appears to be different which could cause
+  trouble. Affected clients are older Git versions and alternative Git
+  implementations such as JGit or libgit2 (as of January 2018).
+
+Use the `checkout-encoding` attribute only if you cannot store a file in
+UTF-8 encoding and if you want Git to be able to process the content as
+text.
+
+Use the following attributes if your '*.txt' files are UTF-16 encoded
+with byte order mark (BOM) and you want Git to perform automatic line
+ending conversion based on your platform.
+
+
+*.txt  text checkout-encoding=UTF-16
+
+
+Use the following attributes if your '*.txt' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory.
+
+
+*.txt  checkout-encoding=UTF-16LE text eol=CRLF
+
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+
+iconv --list
+
+
+
 `ident`
 ^^^
 
diff --git a/apply.c b/apply.c
index 321a9fa68d..c4bd5cf1f2 100644
--- a/apply.c
+++ b/apply.c
@@ -2281,7 +2281,7 @@ static int read_old_data(struct stat *st, struct patch 
*patch,
 * should never look at the index when explicit crlf option
 * is given.
 */
-   convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf);
+   convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf, 
0);
return 0;
default:
return -1;
diff --git a/blame.c b/blame.c
index 2893f3c103..388b66897b 100644
--- a/blame.c
+++ b/blame.c
@@ -229,7 +229,7 @@ static struct commit *fake_working_tree_commit(struct 
diff_options *opt,
if (strbuf_read(, 0, 0) < 0)
die_errno("failed to read from stdin");
}
-   convert_to_git(_index, path, buf.buf, buf.len, , 0);
+   convert_to_git(_index, path, buf.buf, buf.len, , 0, 0);
origin->file.ptr = buf.buf;
origin->file.size = buf.len;
pretend_sha1_file(buf.buf, buf.len, OBJ_BLOB, origin->blob_oid.hash);
diff --git a/combine-diff.c b/combine-diff.c
index 2505de119a..4555e49b5f 100644
--- a/combine-diff.c
+++