> On 25 Feb 2018, at 08:15, Eric Sunshine <[email protected]> wrote:
>
> On Sat, Feb 24, 2018 at 11:27 AM, <[email protected]> wrote:
>> Git recognizes files encoded with ASCII or one of its supersets (e.g.
>> UTF-8 or ISO-8859-1) as text files. All other encodings are usually
>> interpreted as binary and consequently built-in Git text processing
>> tools (e.g. 'git diff') as well as most Git web front ends do not
>> visualize the content.
>>
>> Add an attribute to tell Git what encoding the user has defined for a
>> given file. If the content is added to the index, then Git converts the
>> content to a canonical UTF-8 representation. On checkout Git will
>> reverse the conversion.
>>
>> Signed-off-by: Lars Schneider <[email protected]>
>> ---
>> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
>> @@ -272,6 +272,84 @@ few exceptions. Even though...
>> +Please note that using the `working-tree-encoding` attribute may have a
>> +number of pitfalls:
>> +
>> +- Alternative Git implementations (e.g. JGit or libgit2) and older Git
>> + versions (as of March 2018) do not support the `working-tree-encoding`
>> + attribute. If you decide to use the `working-tree-encoding` attribute
>> + in your repository, then it is strongly recommended to ensure that all
>> + clients working with the repository support it.
>> +
>> + If you declare `*.proj` files as UTF-16 and you add `foo.proj` with an
>> + `working-tree-encoding` enabled Git client, then `foo.proj` will be
>> + stored as UTF-8 internally. A client without `working-tree-encoding`
>> + support will checkout `foo.proj` as UTF-8 encoded file. This will
>> + typically cause trouble for the users of this file.
>
> The above paragraph is giving an example of the scenario described in
> the paragraph above it. It might, therefore, be clearer to start this
> paragraph with "For example,...". Also, as an aid to non-Windows
> developers, it might be helpful to introduce ".proj" files as
> "Microsoft Visual Studio `*.proj` files...".
Agreed. How about this?
For example, Microsoft Visual Studio resources files (`*.rc`) or
PowerShell script files (`*.ps1`) are sometimes encoded in UTF-16.
If you declare `*.ps1` files as UTF-16 and you add `foo.ps1` with
a `working-tree-encoding` enabled Git client, then `foo.ps1` will be
stored as UTF-8 internally. A client without `working-tree-encoding`
support will checkout `foo.ps1` as a UTF-8 encoded file. This will
typically cause trouble for the users of this file.
>> diff --git a/convert.c b/convert.c
>> @@ -265,6 +266,110 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
>> +static struct encoding {
>> + const char *name;
>> + struct encoding *next;
>> +} *encoding, **encoding_tail;
>
> Why does this struct need to be a linked-list since it contains only a
> name? In fact, why do the struct and linked list exist at all? Back in
> v1 when the struct held more members, it made sense to place them in a
> collection so they could be re-used, but today it just seems an
> unnecessary artifact which buys you nothing.
>
> Is the plan that some future patch series will add members to the
> struct? If not, then it seems that it should be retired altogether.
Absolutely! Thank you!
>> +static const char *default_encoding = "UTF-8";
>> +
>> +static int encode_to_git(const char *path, const char *src, size_t src_len,
>> +                         struct strbuf *buf, struct encoding *enc, int conv_flags)
>> +{
>> + [...]
>> + if (has_prohibited_utf_bom(enc->name, src, src_len)) {
>> + const char *error_msg = _(
>> + "BOM is prohibited in '%s' if encoded as %s");
>> + /*
>> + * This advise is shown for UTF-??BE and UTF-??LE encodings.
>
> s/advise/advice/
Agreed!
>> + * We truncate the encoding name to 6 chars with %.6s to cut
>> + * off the last two "byte order" characters.
>> + */
>> + const char *advise_msg = _(
>> + "The file '%s' contains a byte order mark (BOM). "
>> + "Please use %.6s as working-tree-encoding.");
>> + advise(advise_msg, path, enc->name);
>> + if (conv_flags & CONV_WRITE_OBJECT)
>> + die(error_msg, path, enc->name);
>> + else
>> + error(error_msg, path, enc->name);
>
> What is the intended behavior in the non-die() case? As implemented,
> this is still going to attempt the conversion even though it threw an
> error. Should it be returning 0 here instead? Same question for the
> couple additional cases below.
Good question. My intention was to print an error and then let iconv try
the conversion anyway. I agree that this is bogus. Let's return right
away in case of an error.
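The early-return fix could look roughly like this (a sketch; `check_bom_and_bail` is a hypothetical helper, only `CONV_WRITE_OBJECT` and the die()/error() split come from the patch):

```c
#include <stdio.h>

#define CONV_WRITE_OBJECT 1	/* mirrors the flag used in convert.c */

/*
 * Sketch of the fixed error path: on a BOM problem, report the error
 * and return 0 immediately instead of falling through to the iconv
 * conversion. In the real code the CONV_WRITE_OBJECT case calls die().
 */
static int check_bom_and_bail(int bom_error, int conv_flags)
{
	if (!bom_error)
		return 1;	/* fine, caller proceeds with conversion */
	fprintf(stderr, "%s: BOM is prohibited\n",
		(conv_flags & CONV_WRITE_OBJECT) ? "fatal" : "error");
	return 0;	/* caller must NOT attempt reencode_string_len() */
}
```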
>> +
>> + } else if (is_missing_required_utf_bom(enc->name, src, src_len)) {
>> + const char *error_msg = _(
>> + "BOM is required in '%s' if encoded as %s");
>> + const char *advise_msg = _(
>> + "The file '%s' is missing a byte order mark (BOM). "
>> +                "Please use %sBE or %sLE (depending on the byte order) "
>> + "as working-tree-encoding.");
>> + advise(advise_msg, path, enc->name, enc->name);
>> + if (conv_flags & CONV_WRITE_OBJECT)
>> + die(error_msg, path, enc->name);
>> + else
>> + error(error_msg, path, enc->name);
>> + }
>
> For a function which presumably is agnostic about the working-tree
> encoding (supporting whatever iconv does), it nevertheless has
> unusually intimate knowledge about UTF BOMs, in general, and the
> internals of has_prohibited_utf_bom() and
> is_missing_required_utf_bom(), in particular. It might be cleaner, and
> would simplify this function, if all this UTF BOM-specific gunk was
> moved elsewhere and generalized into a "validate conversion" function.
Agreed! How about this?
/*
 * If we convert to a UTF encoding, then check for common BOM errors.
*/
if (!memcmp("UTF-", enc, 4))
if (has_utf_bom_error(path, enc, src, src_len, die_on_error))
return 0;
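For context, the `UTF-` prefix guard and the BE/LE distinction it builds on can be sketched as follows (`is_utf_encoding` and `encoding_pins_byte_order` are illustrative names, not part of the patch):

```c
#include <string.h>

/* Only UTF-* encodings are subject to the BOM sanity checks. */
static int is_utf_encoding(const char *enc)
{
	return !strncmp(enc, "UTF-", 4);
}

/*
 * A BOM is prohibited when the encoding name pins the byte order
 * itself (UTF-16BE/LE, UTF-32BE/LE) and required when it does not
 * (UTF-16, UTF-32); this predicate captures the distinction that a
 * combined "validate conversion" helper would rely on.
 */
static int encoding_pins_byte_order(const char *enc)
{
	size_t len = strlen(enc);
	return len >= 2 && (!strcmp(enc + len - 2, "BE") ||
			    !strcmp(enc + len - 2, "LE"));
}
```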
>> + dst = reencode_string_len(src, src_len, default_encoding, enc->name,
>> + &dst_len);
>> + if (!dst) {
>> + /*
>> + * We could add the blob "as-is" to Git. However, on checkout
>> + * we would try to reencode to the original encoding. This
>> + * would fail and we would leave the user with a messed-up
>> + * working tree. Let's try to avoid this by screaming loud.
>> + */
>> + const char* msg = _("failed to encode '%s' from %s to %s");
>> + if (conv_flags & CONV_WRITE_OBJECT)
>> + die(msg, path, enc->name, default_encoding);
>> + else
>> + error(msg, path, enc->name, default_encoding);
>> + }
>> +
>> + strbuf_attach(buf, dst, dst_len, dst_len + 1);
>> + return 1;
>> +}
>> +
>> +static int encode_to_worktree(const char *path, const char *src, size_t src_len,
>> + struct strbuf *buf, struct encoding *enc)
>> +{
>> + [...]
>> + dst = reencode_string_len(src, src_len, enc->name, default_encoding,
>> + &dst_len);
>> + if (!dst) {
>> + error("failed to encode '%s' from %s to %s",
>> + path, enc->name, default_encoding);
>
> The last two arguments need to be swapped.
Nice catch!
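To make the swap concrete: encode_to_worktree() converts from the internal UTF-8 representation to the declared working-tree encoding, so the message must name them in that order (a sketch; only the format string comes from the patch):

```c
#include <stdio.h>
#include <string.h>

static const char *default_encoding = "UTF-8";

/*
 * encode_to_worktree() reencodes FROM the internal UTF-8 form TO the
 * user-declared working-tree encoding, so the "from"/"to" slots of
 * the error message must be filled in that order (this helper is
 * illustrative; only the format string appears in the patch).
 */
static void format_encode_error(char *buf, size_t size, const char *path,
				const char *worktree_enc)
{
	snprintf(buf, size, "failed to encode '%s' from %s to %s",
		 path, default_encoding, worktree_enc);
}
```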
>> + return 0;
>> + }
>> +
>> + strbuf_attach(buf, dst, dst_len, dst_len + 1);
>> + return 1;
>> +}
>> @@ -978,6 +1083,35 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
>> +static struct encoding *git_path_check_encoding(struct attr_check_item *check)
>> +{
>> + if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
>> + !strlen(value))
>> + return NULL;
>> +
>> + for (enc = encoding; enc; enc = enc->next)
>> + if (!strcasecmp(value, enc->name))
>> + return enc;
>> +
>> + /* Don't encode to the default encoding */
>> + if (!strcasecmp(value, default_encoding))
>> + return NULL;
>
> You could handle this easy-to-detect case before attempting the more
> expensive loop just above. (Not necessarily worth a re-roll.)
True, but I deleted the loop anyway when I removed the "encoding
linked list" as suggested by your comment above.
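With the linked list retired, the attribute check might reduce to something like this (a sketch of a possible next revision, not the final code; `xstrdup_toupper` is stubbed here with a portable equivalent of git's helper):

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>

/* portable stand-in for git's xstrdup_toupper() */
static char *xstrdup_toupper(const char *s)
{
	size_t i, len = strlen(s);
	char *r = malloc(len + 1);
	for (i = 0; i <= len; i++)
		r[i] = toupper((unsigned char)s[i]);
	return r;
}

/*
 * Sketch: without the encoding list there is nothing to cache, so the
 * check simply returns an upper-cased copy of the attribute value, or
 * NULL when no conversion is needed (unset value or default UTF-8).
 */
static char *git_path_check_encoding(const char *value)
{
	if (!value || !*value)
		return NULL;
	if (!strcasecmp(value, "UTF-8"))
		return NULL;	/* don't convert to the default encoding */
	return xstrdup_toupper(value);
}
```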
>> + enc = xcalloc(1, sizeof(*enc));
>> + /*
>> + * Ensure encoding names are always upper case (e.g. UTF-8) to
>> + * simplify subsequent string comparisons.
>> + */
>> + enc->name = xstrdup_toupper(value);
>> + *encoding_tail = enc;
>> + encoding_tail = &(enc->next);
>> +
>> + return enc;
>> +}
>> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
>> @@ -0,0 +1,226 @@
>> +test_expect_success 'ensure UTF-8 is stored in Git' '
>> + git cat-file -p :test.utf16 >test.utf16.git &&
>> + test_cmp_bin test.utf8.raw test.utf16.git &&
>> +
>> + # cleanup
>> + rm test.utf16.git
>> +'
>
> This cleanup won't take place if test_cmp_bin() fails. Instead,
> cleanups are typically handled by test_when_finished() to ensure that
> the cleanup action occurs even if the test fails.
>
> test_expect_success 'ensure UTF-8 is stored in Git' '
> test_when_finished "rm -f test.utf16.git" &&
> ...
>
> Same comment for remaining tests.
Agreed! That's the proper way to cleanup the tests. I'll fix my tests.
>> +test_expect_success 'check prohibited UTF BOM' '
>> + printf "\0a\0b\0c" >nobom.utf16be.raw &&
>> + printf "a\0b\0c\0" >nobom.utf16le.raw &&
>> + printf "\376\377\0a\0b\0c" >bebom.utf16be.raw &&
>> + printf "\377\376a\0b\0c\0" >lebom.utf16le.raw &&
>> +
>> + printf "\0\0\0a\0\0\0b\0\0\0c" >nobom.utf32be.raw &&
>> + printf "a\0\0\0b\0\0\0c\0\0\0" >nobom.utf32le.raw &&
>> + printf "\0\0\376\377\0\0\0a\0\0\0b\0\0\0c" >bebom.utf32be.raw &&
>> + printf "\377\376\0\0a\0\0\0b\0\0\0c\0\0\0" >lebom.utf32le.raw &&
>
> These resources seem to be used by multiple tests. Perhaps create them
> in a distinct "setup" test instead of bundling them in this test?
Good idea!
>> + echo "*.utf16be text working-tree-encoding=utf-16be" >>.gitattributes &&
>> + echo "*.utf16le text working-tree-encoding=utf-16le" >>.gitattributes &&
>> + echo "*.utf32be text working-tree-encoding=utf-32be" >>.gitattributes &&
>> + echo "*.utf32le text working-tree-encoding=utf-32le" >>.gitattributes &&
>> +
>> + # Here we add a UTF-16 files with BOM (big-endian and little-endian)
>> + # but we tell Git to treat it as UTF-16BE/UTF-16LE. In these cases
>> + # the BOM is prohibited.
>> + cp bebom.utf16be.raw bebom.utf16be &&
>> + test_must_fail git add bebom.utf16be 2>err.out &&
>> + test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
>> +
>> + cp lebom.utf16le.raw lebom.utf16be &&
>> + test_must_fail git add lebom.utf16be 2>err.out &&
>> + test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
>> +
>> + cp bebom.utf16be.raw bebom.utf16le &&
>> + test_must_fail git add bebom.utf16le 2>err.out &&
>> + test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
>> +
>> + cp lebom.utf16le.raw lebom.utf16le &&
>> + test_must_fail git add lebom.utf16le 2>err.out &&
>> + test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
>> +
>> + # ... and the same for UTF-32
>> + cp bebom.utf32be.raw bebom.utf32be &&
>> + test_must_fail git add bebom.utf32be 2>err.out &&
>> + test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
>> +
>> + cp lebom.utf32le.raw lebom.utf32be &&
>> + test_must_fail git add lebom.utf32be 2>err.out &&
>> + test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
>> +
>> + cp bebom.utf32be.raw bebom.utf32le &&
>> + test_must_fail git add bebom.utf32le 2>err.out &&
>> + test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
>> +
>> + cp lebom.utf32le.raw lebom.utf32le &&
>> + test_must_fail git add lebom.utf32le 2>err.out &&
>> + test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
>> +
>> + # cleanup
>> + git reset --hard HEAD
>> +'
>
> Clumping all these checks into the same test makes it difficult to
> determine which one failed (if one does fail). Also, the amount of
> copy/paste code is easy to get wrong and a bit onerous to review.
> Automating via for-loops would address these concerns. For instance,
> to check all combinations (not just 8 combinations tested above):
>
> for i in 16be 16le 32be 32le
> do
> for j in 16be 16le 32be 32le
> do
> test_expect_success "check prohibited UTF BOM utf$i/utf$j" '
> test_when_finished "git reset --hard HEAD" &&
> cp bebom.utf$i.raw bebom.utf$j &&
> test_must_fail git add bebom.utf$j 2>err.out &&
> J=$(echo $j | tr bel BEL) &&
> test_i18ngrep "fatal: BOM is prohibited .* UTF-$J" err.out
> '
> done
> done
>
> Alternately, the test could be moved to a function and called for each
> combination:
>
> check_prohibited_bom () {
> i=$1
> j=$2
> test_expect_success "check prohibited UTF BOM utf$i/utf$j" '
> ...
> '
> }
> check_prohibited_bom 16be 16be
> check_prohibited_bom 16be 16le
> ...
>
> Same comment regarding copy/paste in tests below.
In general I share your concern about code duplication.
However, I don't want to be "too clever" in test code,
as that makes it hard to read. Therefore, I implemented
a loop over "16 32", which I think is a good compromise.

Thank you very much for this thorough review,
Lars