From: Lars Schneider
Hi,
Patches 1-4,7 are preparation and helper functions.
Patch 5,6,8 are the actual change.
This series depends on Torsten's 8462ff43e4 (convert_to_git():
safe_crlf/checksafe becomes int conv_flags, 2018-01-13) which is
already in master.
Changes since v8:
* move UTF BOM error checks in a new dedicated function
validate_encoding() and into a dedicated commit (6)
* remove unnecessary encoding struct (became a plain char*)
* fail early and do not try to run the reencode content in case of a
validation error
* return early if roundtrip encoding was not found (avoid undefined
pointer arithmetic)
* fix wrong argument order in encode_to_worktree() error message
* use test_when_finished to cleanup tests
* move UTF-16/32 BOM test file generation into "setup test"
* reduce code duplication in tests
* improve documentation:
- use *.rc and *.ps1 as examples as they are usually UTF-16 encoded
* fix comment: /advise/advice/
Thanks a lot Eric for your great review! I think I fixed everything you
objected with one exception. You noticed that the current code only
checks for BOMs corresponding to the declared size (16 or 32 bits) [1].
I understand your point of view and I agree that any BOM in these cases
is *most likely* an error. However, according to the spec it might
still be valid. The comments on my related question on StackOverflow
seem to support that view [2]. Therefore, I would like to leave it as
it is in this series. If it turns out to be a problem in practice, then
I am happy to change it later. OK for you?
Thanks,
Lars
[1] https://public-inbox.org/git/df6f3855-efe7-4c13-aa53-819aae0de...@gmail.com/
[2]
https://stackoverflow.com/questions/49038872/is-a-utf-32be-bom-valid-in-an-utf-16le-declared-data-stream
RFC:
https://public-inbox.org/git/bdb9b884-6d17-4be3-a83c-f67e2afa2...@gmail.com/
v1:
https://public-inbox.org/git/20171211155023.1405-1-lars.schnei...@autodesk.com/
v2:
https://public-inbox.org/git/2017122915.39680-1-lars.schnei...@autodesk.com/
v3:
https://public-inbox.org/git/20180106004808.77513-1-lars.schnei...@autodesk.com/
v4:
https://public-inbox.org/git/20180120152418.52859-1-lars.schnei...@autodesk.com/
v5: https://public-inbox.org/git/20180129201855.9182-1-tbo...@web.de/
v6:
https://public-inbox.org/git/20180209132830.55385-1-lars.schnei...@autodesk.com/
v7:
https://public-inbox.org/git/20180215152711.158-1-lars.schnei...@autodesk.com/
v8:
https://public-inbox.org/git/20180224162801.98860-1-lars.schnei...@autodesk.com/
Base Ref:
Web-Diff: https://github.com/larsxschneider/git/commit/fdf0d63188
Checkout: git fetch https://github.com/larsxschneider/git encoding-v9 && git
checkout fdf0d63188
### Interdiff (v8..v9):
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 11315054f4..aa3deae392 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -297,14 +297,16 @@ number of pitfalls:
in your repository, then it is strongly recommended to ensure that all
clients working with the repository support it.
- If you declare `*.proj` files as UTF-16 and you add `foo.proj` with an
- `working-tree-encoding` enabled Git client, then `foo.proj` will be
+ For example, Microsoft Visual Studio resources files (`*.rc`) or
+ PowerShell script files (`*.ps1`) are sometimes encoded in UTF-16.
+ If you declare `*.ps1` as files as UTF-16 and you add `foo.ps1` with
+ a `working-tree-encoding` enabled Git client, then `foo.ps1` will be
stored as UTF-8 internally. A client without `working-tree-encoding`
- support will checkout `foo.proj` as UTF-8 encoded file. This will
+ support will checkout `foo.ps1` as UTF-8 encoded file. This will
typically cause trouble for the users of this file.
If a Git client, that does not support the `working-tree-encoding`
- attribute, adds a new file `bar.proj`, then `bar.proj` will be
+ attribute, adds a new file `bar.ps1`, then `bar.ps1` will be
stored "as-is" internally (in this example probably as UTF-16).
A client with `working-tree-encoding` support will interpret the
internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
@@ -325,22 +327,22 @@ Use the `working-tree-encoding` attribute only if you
cannot store a file
in UTF-8 encoding and if you want Git to be able to process the content
as text.
-As an example, use the following attributes if your '*.proj' files are
+As an example, use the following attributes if your '*.ps1' files are
UTF-16 encoded with byte order mark (BOM) and you want Git to perform
automatic line ending conversion based on your platform.
-*.proj text working-tree-encoding=UTF-16
+*.ps1 text working-tree-encoding=UTF-16
-Use the following attributes if your '*.proj' files are UTF-16 little
+Use the following attributes if your '*.ps1' files are UTF-16 little
endian encoded without BOM and y