Re: [PATCH v10 6/9] convert: add 'working-tree-encoding' attribute

2018-03-07 Thread Junio C Hamano
Lars Schneider  writes:

> Nice catch. What do you think about this solution using is_encoding_utf8()
> from utf.c?

That helper was invented exactly for a case like this, I would
think.


Re: [PATCH v10 6/9] convert: add 'working-tree-encoding' attribute

2018-03-07 Thread Lars Schneider

> On 07 Mar 2018, at 18:54, Eric Sunshine  wrote:
> 
> On Wed, Mar 7, 2018 at 12:30 PM,   wrote:
>> [...]
>> Add an attribute to tell Git what encoding the user has defined for a
>> given file. If the content is added to the index, then Git converts the
>> content to a canonical UTF-8 representation. On checkout Git will
>> reverse the conversion.
>> 
>> Signed-off-by: Lars Schneider 
>> ---
>> Documentation/gitattributes.txt  |  80 +++
>> diff --git a/convert.c b/convert.c
>> @@ -265,6 +266,78 @@ static int will_convert_lf_to_crlf(size_t len, struct 
>> text_stat *stats,
>> +static const char *default_encoding = "UTF-8";
>> @@ -978,6 +1051,21 @@ static int ident_to_worktree(const char *path, const 
>> char *src, size_t len,
>> +static const char *git_path_check_encoding(struct attr_check_item *check)
>> +{
>> +   const char *value = check->value;
>> +
>> +   if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
>> +   !strlen(value))
>> +   return NULL;
>> +
>> +   /* Don't encode to the default encoding */
>> +   if (!strcasecmp(value, default_encoding))
>> +   return NULL;
> 
> As of v10, the rest of the code accepts encoding names "UTF-xx" and
> "UTFxx" (case insensitive), however, this check recognizes only
> "UTF-8" (case insensitive). For consistency, one would expect this
> also to recognize "UTF8" (case insensitive).

Nice catch. What do you think about this solution using is_encoding_utf8()
from utf.c?

if (is_encoding_utf8(value) && is_encoding_utf8(default_encoding)) 

- Lars

Re: [PATCH v10 6/9] convert: add 'working-tree-encoding' attribute

2018-03-07 Thread Junio C Hamano
lars.schnei...@autodesk.com writes:

> +static const char *git_path_check_encoding(struct attr_check_item *check)
> +{
> + const char *value = check->value;
> +
> + if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
> + !strlen(value))
> + return NULL;

This means that having "* working-tree-encoding" (without "=value")
in your .gitattributes file is silently ignored.  

Thinking aloud.  I wonder if that is something we would want to warn
about so that the project can fix it, perhaps?  Or would that become
too noisy to use an older version of Git that includes this series
when a newer version that defines new meanings to boolean
(set/unset) values for w-t-e attribute becomes available and
projects adopt it?  On the other hand, with this code as-is or with
an additional warning, such an "older version" of Git by definition
behaves differently from such a "newer version" that does something
different when the attribute is not a non-empty string, so it is
quite likely that we won't be able to redefine or extend the meaning
of w-t-e in a way like that.

Which means that the only sensible way to make sure we _could_ later
extend the meaning of w-t-e and make it behave differently when set
to a non-empty string is to make it an error in _this_ "older"
version.  That way, of course people cannot use the "older" version
on a newer project that depends on the way how "newer" Git treats
w-t-e that is set to, say, "true", but we won't silently (or loudly)
do wrong things to the project's data.



Re: [PATCH v10 6/9] convert: add 'working-tree-encoding' attribute

2018-03-07 Thread Eric Sunshine
On Wed, Mar 7, 2018 at 12:30 PM,   wrote:
> [...]
> Add an attribute to tell Git what encoding the user has defined for a
> given file. If the content is added to the index, then Git converts the
> content to a canonical UTF-8 representation. On checkout Git will
> reverse the conversion.
>
> Signed-off-by: Lars Schneider 
> ---
>  Documentation/gitattributes.txt  |  80 +++
> diff --git a/convert.c b/convert.c
> @@ -265,6 +266,78 @@ static int will_convert_lf_to_crlf(size_t len, struct 
> text_stat *stats,
> +static const char *default_encoding = "UTF-8";
> @@ -978,6 +1051,21 @@ static int ident_to_worktree(const char *path, const 
> char *src, size_t len,
> +static const char *git_path_check_encoding(struct attr_check_item *check)
> +{
> +   const char *value = check->value;
> +
> +   if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
> +   !strlen(value))
> +   return NULL;
> +
> +   /* Don't encode to the default encoding */
> +   if (!strcasecmp(value, default_encoding))
> +   return NULL;

As of v10, the rest of the code accepts encoding names "UTF-xx" and
"UTFxx" (case insensitive), however, this check recognizes only
"UTF-8" (case insensitive). For consistency, one would expect this
also to recognize "UTF8" (case insensitive).

> +
> +   return value;
> +}


[PATCH v10 6/9] convert: add 'working-tree-encoding' attribute

2018-03-07 Thread lars . schneider
From: Lars Schneider 

Git recognizes files encoded with ASCII or one of its supersets (e.g.
UTF-8 or ISO-8859-1) as text files. All other encodings are usually
interpreted as binary and consequently built-in Git text processing
tools (e.g. 'git diff') as well as most Git web front ends do not
visualize the content.

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Signed-off-by: Lars Schneider 
---
 Documentation/gitattributes.txt  |  80 +++
 convert.c| 110 +++-
 convert.h|   1 +
 sha1_file.c  |   2 +-
 t/t0028-working-tree-encoding.sh | 133 +++
 5 files changed, 324 insertions(+), 2 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..31a4f92840 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,6 +272,86 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.
 
 
+`working-tree-encoding`
+^^^
+
+Git recognizes files encoded in ASCII or one of its supersets (e.g.
+UTF-8, ISO-8859-1, ...) as text files. Files encoded in certain other
+encodings (e.g. UTF-16) are interpreted as binary and consequently
+built-in Git text processing tools (e.g. 'git diff') as well as most Git
+web front ends do not visualize the contents of these files by default.
+
+In these cases you can tell Git the encoding of a file in the working
+directory with the `working-tree-encoding` attribute. If a file with this
+attribute is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8. Finally, Git stores the UTF-8 encoded
+content in its internal data structure (called "the index"). On checkout
+the content is reencoded back to the specified encoding.
+
+Please note that using the `working-tree-encoding` attribute may have a
+number of pitfalls:
+
+- Alternative Git implementations (e.g. JGit or libgit2) and older Git
+  versions (as of March 2018) do not support the `working-tree-encoding`
+  attribute. If you decide to use the `working-tree-encoding` attribute
+  in your repository, then it is strongly recommended to ensure that all
+  clients working with the repository support it.
+
+  For example, Microsoft Visual Studio resources files (`*.rc`) or
+  PowerShell script files (`*.ps1`) are sometimes encoded in UTF-16.
+  If you declare `*.ps1` as files as UTF-16 and you add `foo.ps1` with
+  a `working-tree-encoding` enabled Git client, then `foo.ps1` will be
+  stored as UTF-8 internally. A client without `working-tree-encoding`
+  support will checkout `foo.ps1` as UTF-8 encoded file. This will
+  typically cause trouble for the users of this file.
+
+  If a Git client, that does not support the `working-tree-encoding`
+  attribute, adds a new file `bar.ps1`, then `bar.ps1` will be
+  stored "as-is" internally (in this example probably as UTF-16).
+  A client with `working-tree-encoding` support will interpret the
+  internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
+  That operation will fail and cause an error.
+
+- Reencoding content requires resources that might slow down certain
+  Git operations (e.g 'git checkout' or 'git add').
+
+Use the `working-tree-encoding` attribute only if you cannot store a file
+in UTF-8 encoding and if you want Git to be able to process the content
+as text.
+
+As an example, use the following attributes if your '*.ps1' files are
+UTF-16 encoded with byte order mark (BOM) and you want Git to perform
+automatic line ending conversion based on your platform.
+
+
+*.ps1  text working-tree-encoding=UTF-16
+
+
+Use the following attributes if your '*.ps1' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory. Please note, it is highly recommended to
+explicitly define the line endings with `eol` if the `working-tree-encoding`
+attribute is used to avoid ambiguity.
+
+
+*.ps1  text working-tree-encoding=UTF-16LE eol=CRLF
+
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+
+iconv --list
+
+
+If you do not know the encoding of a file, then you can use the `file`
+command to guess the encoding:
+
+
+file foo.ps1
+
+
+
 `ident`
 ^^^
 
diff --git a/convert.c b/convert.c
index b976eb968c..80549ff2b5 100644
--- a/convert.c
+++ b/convert.c
@@ -7,6 +7,7 @@
 #include "sigchain.h"
 #include "pkt-line.h"
 #in