Re: [PATCH v9 0/8] convert: add support for different encodings

2018-03-06 Thread Eric Sunshine
On Sun, Mar 4, 2018 at 3:14 PM,   wrote:
> Changes since v8: [...]
>
> Thanks a lot Eric for your great review! I think I fixed everything you
> objected with one exception. You noticed that the current code only
> checks for BOMs corresponding to the declared size (16 or 32 bits) [1].
> I understand your point of view and I agree that any BOM in these cases
> is *most likely* an error. However, according to the spec it might
> still be valid. The comments on my related question on StackOverflow
> seem to support that view [2]. Therefore, I would like to leave it as
> it is in this series. If it turns out to be a problem in practice, then
> I am happy to change it later. OK for you?

Fine. As this is not a correctness issue with the conversion itself,
but rather an attempt to detect misconfiguration by the user, it can
always be revisited later if someone comes up with a more solid
argument that one interpretation is more correct than the other.

I did leave a few review comments on v9, but I don't think any are
necessarily worth a re-roll. If anything, and if they seem actionable,
then perhaps a patch or two atop the series could address them (at
some point).


[PATCH v9 0/8] convert: add support for different encodings

2018-03-04 Thread lars . schneider
From: Lars Schneider 

Hi,

Patches 1-4,7 are preparation and helper functions.
Patch 5,6,8 are the actual change.

This series depends on Torsten's 8462ff43e4 (convert_to_git():
safe_crlf/checksafe becomes int conv_flags, 2018-01-13) which is
already in master.

Changes since v8:

* move UTF BOM error checks in a new dedicated function
  validate_encoding() and into a dedicated commit (6)
* remove unnecessary encoding struct (became a plain char*)
* fail early and do not try to run the reencode content in case of a
  validation error
* return early if roundtrip encoding was not found (avoid undefined
  pointer arithmetic)
* fix wrong argument order in encode_to_worktree() error message
* use test_when_finished to cleanup tests
* move UTF-16/32 BOM test file generation into "setup test"
* reduce code duplication in tests
* improve documentation:
- use *.rc and *.ps1 as examples as they are usually UTF-16 encoded
* fix comment: /advise/advice/

Thanks a lot Eric for your great review! I think I fixed everything you
objected with one exception. You noticed that the current code only
checks for BOMs corresponding to the declared size (16 or 32 bits) [1].
I understand your point of view and I agree that any BOM in these cases
is *most likely* an error. However, according to the spec it might
still be valid. The comments on my related question on StackOverflow
seem to support that view [2]. Therefore, I would like to leave it as
it is in this series. If it turns out to be a problem in practice, then
I am happy to change it later. OK for you?


Thanks,
Lars

[1] https://public-inbox.org/git/df6f3855-efe7-4c13-aa53-819aae0de...@gmail.com/
[2] 
https://stackoverflow.com/questions/49038872/is-a-utf-32be-bom-valid-in-an-utf-16le-declared-data-stream


  RFC: 
https://public-inbox.org/git/bdb9b884-6d17-4be3-a83c-f67e2afa2...@gmail.com/
   v1: 
https://public-inbox.org/git/20171211155023.1405-1-lars.schnei...@autodesk.com/
   v2: 
https://public-inbox.org/git/2017122915.39680-1-lars.schnei...@autodesk.com/
   v3: 
https://public-inbox.org/git/20180106004808.77513-1-lars.schnei...@autodesk.com/
   v4: 
https://public-inbox.org/git/20180120152418.52859-1-lars.schnei...@autodesk.com/
   v5: https://public-inbox.org/git/20180129201855.9182-1-tbo...@web.de/
   v6: 
https://public-inbox.org/git/20180209132830.55385-1-lars.schnei...@autodesk.com/
   v7: 
https://public-inbox.org/git/20180215152711.158-1-lars.schnei...@autodesk.com/
   v8: 
https://public-inbox.org/git/20180224162801.98860-1-lars.schnei...@autodesk.com/


Base Ref:
Web-Diff: https://github.com/larsxschneider/git/commit/fdf0d63188
Checkout: git fetch https://github.com/larsxschneider/git encoding-v9 && git 
checkout fdf0d63188


### Interdiff (v8..v9):

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 11315054f4..aa3deae392 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -297,14 +297,16 @@ number of pitfalls:
   in your repository, then it is strongly recommended to ensure that all
   clients working with the repository support it.

-  If you declare `*.proj` files as UTF-16 and you add `foo.proj` with an
-  `working-tree-encoding` enabled Git client, then `foo.proj` will be
+  For example, Microsoft Visual Studio resources files (`*.rc`) or
+  PowerShell script files (`*.ps1`) are sometimes encoded in UTF-16.
+  If you declare `*.ps1` as files as UTF-16 and you add `foo.ps1` with
+  a `working-tree-encoding` enabled Git client, then `foo.ps1` will be
   stored as UTF-8 internally. A client without `working-tree-encoding`
-  support will checkout `foo.proj` as UTF-8 encoded file. This will
+  support will checkout `foo.ps1` as UTF-8 encoded file. This will
   typically cause trouble for the users of this file.

   If a Git client, that does not support the `working-tree-encoding`
-  attribute, adds a new file `bar.proj`, then `bar.proj` will be
+  attribute, adds a new file `bar.ps1`, then `bar.ps1` will be
   stored "as-is" internally (in this example probably as UTF-16).
   A client with `working-tree-encoding` support will interpret the
   internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
@@ -325,22 +327,22 @@ Use the `working-tree-encoding` attribute only if you 
cannot store a file
 in UTF-8 encoding and if you want Git to be able to process the content
 as text.

-As an example, use the following attributes if your '*.proj' files are
+As an example, use the following attributes if your '*.ps1' files are
 UTF-16 encoded with byte order mark (BOM) and you want Git to perform
 automatic line ending conversion based on your platform.

 
-*.proj text working-tree-encoding=UTF-16
+*.ps1  text working-tree-encoding=UTF-16
 

-Use the following attributes if your '*.proj' files are UTF-16 little
+Use the following attributes if your '*.ps1' files are UTF-16 little
 endian encoded without BOM and y