Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM

2018-03-07 Thread Torsten Bögershausen
On Tue, Mar 06, 2018 at 03:37:16PM -0800, Junio C Hamano wrote:
> Lars Schneider  writes:
> 
> > After thinking about it I wonder if we should barf on "utf16" without
> > dash. Your Linux iconv would handle this correctly. My macOS iconv would 
> > not.
> > That means the repo would checkout correctly on your machine but not on 
> > mine.
> >
> > What do you think?
> 
> To be bluntly honest, I prefer not to have excess "sanity checks";
> there is no need to barf on utf16 in a project run by those who are
> without macOS friends, for example.  For that matter, while I do not
> hate it so much to reject it, I am not all that supportive of this
> "The consortium says without -LE or -BE suffix there must be BOM,
> and this lacks one, so barf loudly" step in this topic myself.

Losing or adding a BOM could render a file useless, depending
on the software that reads and processes it.

The main reason for being critical is to make sure that the
material in the working tree does a safe round trip when committed,
pushed, pulled, checked out...

The best thing we can do is probably to follow the specification as
closely as possible.

Having a clear specification (UTF-16LE/BE has no BOM, UTF-16 has a
BOM) would even open up the chance to work around a buggy iconv
library, because Git can check the BOM itself. If, and only if, a
platform messes it up, we can add a (compile-time) workaround in Git
for that very platform.
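
For illustration, a rough sketch (not taken from the patch series; the
BOM byte values are the ones defined by the Unicode standard) of how Git
could tell the byte order from the BOM on its own, without trusting iconv:

#include <stddef.h>
#include <string.h>

/*
 * Illustration only: return the name of the UTF encoding indicated by a
 * BOM at the start of the buffer, or NULL if no BOM is present.  The
 * UTF-32 patterns must be checked first, because the UTF-32LE BOM starts
 * with the same two bytes as the UTF-16LE BOM.
 */
static const char *detect_utf_bom(const char *data, size_t len)
{
        if (len >= 4 && !memcmp(data, "\x00\x00\xFE\xFF", 4))
                return "UTF-32BE";
        if (len >= 4 && !memcmp(data, "\xFF\xFE\x00\x00", 4))
                return "UTF-32LE";
        if (len >= 2 && !memcmp(data, "\xFE\xFF", 2))
                return "UTF-16BE";
        if (len >= 2 && !memcmp(data, "\xFF\xFE", 2))
                return "UTF-16LE";
        return NULL;
}

With something like that, content whose byte order does not match the
declared encoding could be caught regardless of the platform's iconv.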

Consistent behavior across all platforms is a good thing;
sometimes I have a network share mounted on Linux, macOS and
Windows, and it shouldn't matter under which Git on which
machine I check out or commit.

Oh, I see, there is a new patch coming...


Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM

2018-03-06 Thread Junio C Hamano
Lars Schneider  writes:

> After thinking about it I wonder if we should barf on "utf16" without
> dash. Your Linux iconv would handle this correctly. My macOS iconv would not.
> That means the repo would checkout correctly on your machine but not on mine.
>
> What do you think?

To be bluntly honest, I prefer not to have excess "sanity checks";
there is no need to barf on utf16 in a project run by those who are
without macOS friends, for example.  For that matter, while I do not
hate it so much to reject it, I am not all that supportive of this
"The consortium says without -LE or -BE suffix there must be BOM,
and this lacks one, so barf loudly" step in this topic myself.


Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM

2018-03-06 Thread Lars Schneider

> On 07 Mar 2018, at 00:07, Junio C Hamano  wrote:
> 
> Junio C Hamano  writes:
> 
>> Lars Schneider  writes:
>> 
 Also "UTF16" or other spelling
 the platform may support but this code fails to recognise will go
 unchecked.
>>> 
>>> That is true. However, I would assume all iconv implementations use the
>>> same encoding names for UTF encodings, no? That means UTF16 would never be
>>> valid. Would you agree?
>> 
>> After seeing "UTF16" and others in "iconv -l | grep -i utf" output,
>> I am not sure what you mean by "Would you agree?"  People can say in
>> their .gitattributes file that this path is in "UTF16" without dash
>> and that is what will be fed to this code, no?
> 
> FWIW, "iconv -f utf8 -t utf16" seems not to complain, so I am
> reasonably sure that people expect downcased ones to work as well.

Sure! That's why I normalized it to upper case:
https://public-inbox.org/git/CAPig+cQE0pKs-AMvh4GndyCXBMnx=70jppdm6k4jjte-74f...@mail.gmail.com/

After thinking about it I wonder if we should barf on "utf16" without
dash. Your Linux iconv would handle this correctly. My macOS iconv would not.
That means the repo would checkout correctly on your machine but not on mine.

What do you think?

- Lars



Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM

2018-03-06 Thread Junio C Hamano
Junio C Hamano  writes:

> Lars Schneider  writes:
>
>>>  Also "UTF16" or other spelling
>>> the platform may support but this code fails to recognise will go
>>> unchecked.
>>
>> That is true. However, I would assume all iconv implementations use the
>> same encoding names for UTF encodings, no? That means UTF16 would never be
>> valid. Would you agree?
>
> After seeing "UTF16" and others in "iconv -l | grep -i utf" output,
> I am not sure what you mean by "Would you agree?"  People can say in
> their .gitattributes file that this path is in "UTF16" without dash
> and that is what will be fed to this code, no?

FWIW, "iconv -f utf8 -t utf16" seems not to complain, so I am
reasonably sure that people expect downcased ones to work as well.
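
A small stand-alone program (illustration only) makes it easy to check
which spellings a given platform's iconv accepts; as discussed in this
thread, the dashless and lowercase names behave differently between
Linux and macOS:

#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        /* Which of these are accepted depends on the iconv implementation. */
        const char *names[] = { "UTF-16", "UTF16", "utf-16", "utf16" };
        int i;

        for (i = 0; i < 4; i++) {
                iconv_t cd = iconv_open(names[i], "UTF-8");
                if (cd == (iconv_t)-1)
                        printf("%s: rejected (%s)\n", names[i], strerror(errno));
                else {
                        printf("%s: accepted\n", names[i]);
                        iconv_close(cd);
                }
        }
        return 0;
}

(Link with -liconv on platforms where iconv is not part of libc, e.g. macOS.)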


Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM

2018-03-06 Thread Lars Schneider

> On 06 Mar 2018, at 23:53, Junio C Hamano  wrote:
> 
> Lars Schneider  writes:
> 
>>> Also "UTF16" or other spelling
>>> the platform may support but this code fails to recognise will go
>>> unchecked.
>> 
>> That is true. However, I would assume all iconv implementations use the
>> same encoding names for UTF encodings, no? That means UTF16 would never be
>> valid. Would you agree?
> 
> After seeing "UTF16" and others in "iconv -l | grep -i utf" output,
> I am not sure what you mean by "Would you agree?"  People can say in
> their .gitattributes file that this path is in "UTF16" without dash
> and that is what will be fed to this code, no?

On macOS I don't see UTF16... but I just checked on my Linux box and
there it is. Consequently, we need to check both versions: with and
without dash.
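
In other words, the name check would have to accept both spellings;
just a sketch of the idea (not the actual reroll), assuming the
encoding name has already been normalized to upper case:

#include <string.h>

/* Sketch only: treat the dashless aliases the same as the canonical names. */
static int is_utf16_name(const char *enc)
{
        return !strcmp(enc, "UTF-16") || !strcmp(enc, "UTF16");
}

static int is_utf32_name(const char *enc)
{
        return !strcmp(enc, "UTF-32") || !strcmp(enc, "UTF32");
}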

I'll reroll ASAP (I'll try to do it first thing in the morning).

Thanks,
Lars


Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM

2018-03-06 Thread Junio C Hamano
Lars Schneider  writes:

>>  Also "UTF16" or other spelling
>> the platform may support but this code fails to recognise will go
>> unchecked.
>
> That is true. However, I would assume all iconv implementations use the
> same encoding names for UTF encodings, no? That means UTF16 would never be
> valid. Would you agree?

After seeing "UTF16" and others in "iconv -l | grep -i utf" output,
I am not sure what you mean by "Would you agree?"  People can say in
their .gitattributes file that this path is in "UTF16" without dash
and that is what will be fed to this code, no?


Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM

2018-03-06 Thread Lars Schneider

> On 06 Mar 2018, at 21:50, Junio C Hamano  wrote:
> 
> lars.schnei...@autodesk.com writes:
> 
>> +int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
>> +{
>> +return (
>> +   !strcmp(enc, "UTF-16") &&
>> +   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
>> + has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
>> +) || (
>> +   !strcmp(enc, "UTF-32") &&
>> +   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
>> + has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
>> +);
>> +}
> 
> These strcmp() calls seem inconsistent with the principle embodied
> by utf8.c::fallback_encoding(), i.e. "be lenient to what we accept",
> and make the interface uneven. I am wondering if we also want to
> complain when the user gave us "utf16" and there is no byte order
> mark in the contents, for example?

Well, if I use stricmp() then I don't need to call and clean up
xstrdup_toupper() as discussed with Eric [1]. Is there a
case-insensitive starts_with() function?

[1] https://public-inbox.org/git/CAPig+cQE0pKs-AMvh4GndyCXBMnx=70jppdm6k4jjte-74f...@mail.gmail.com/
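
One possible shape for such a helper, just as a sketch modeled on the
existing starts_with() (as far as I know, a case-insensitive variant
does not exist in git.git yet):

#include <ctype.h>

/* Sketch of a case-insensitive starts_with(); not an existing git helper. */
static int istarts_with(const char *str, const char *prefix)
{
        for (; ; str++, prefix++) {
                if (!*prefix)
                        return 1;
                if (tolower((unsigned char)*str) != tolower((unsigned char)*prefix))
                        return 0;
        }
}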


>  Also "UTF16" or other spelling
> the platform may support but this code fails to recognise will go
> unchecked.

That is true. However, I would assume all iconv implementations use the
same encoding names for UTF encodings, no? That means UTF16 would never be
valid. Would you agree?

- Lars

Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM

2018-03-06 Thread Junio C Hamano
lars.schnei...@autodesk.com writes:

> +int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
> +{
> + return (
> +!strcmp(enc, "UTF-16") &&
> +!(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
> +  has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
> + ) || (
> +!strcmp(enc, "UTF-32") &&
> +!(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
> +  has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
> + );
> +}

These strcmp() calls seem inconsistent with the principle embodied
by utf8.c::fallback_encoding(), i.e. "be lenient to what we accept",
and make the interface uneven.  I am wondering if we also want to
complain when the user gave us "utf16" and there is no byte order
mark in the contents, for example?  Also "UTF16" or other spelling
the platform may support but this code fails to recognise will go
unchecked.

Which actually may be a feature, not a bug, to be able to bypass
this check---I dunno.

The same comment applies to the previous step.



[PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM

2018-03-04 Thread lars . schneider
From: Lars Schneider 

If the endianness is not defined in the encoding name, then let's
be strict and require a BOM to avoid any encoding confusion. The
is_missing_required_utf_bom() function returns true if a required BOM
is missing.

The Unicode standard says to assume big-endian if there is no BOM
for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
in HTML5 recommends assuming little-endian to "deal with deployed
content" [3]. Strictly requiring a BOM seems to be the safest option
for content in Git.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#gen6
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
 Section 3.10, D98, page 132
[3] https://encoding.spec.whatwg.org/#utf-16le

Signed-off-by: Lars Schneider 
---
 utf8.c | 13 +
 utf8.h | 19 +++
 2 files changed, 32 insertions(+)

diff --git a/utf8.c b/utf8.c
index 914881cd1f..5113d26e56 100644
--- a/utf8.c
+++ b/utf8.c
@@ -562,6 +562,19 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
);
 }
 
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
+{
+   return (
+  !strcmp(enc, "UTF-16") &&
+  !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+   ) || (
+  !strcmp(enc, "UTF-32") &&
+  !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+   );
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 0db1db4519..cce654a64a 100644
--- a/utf8.h
+++ b/utf8.h
@@ -79,4 +79,23 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
  */
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
 
+/*
+ * If the endianness is not defined in the encoding name, then we
+ * require a BOM. The function returns true if a required BOM is missing.
+ *
+ * The Unicode standard says to assume big-endian if there is no
+ * BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard
+ * used in HTML5 recommends assuming little-endian to "deal with
+ * deployed content" [3].
+ *
+ * Therefore, strictly requiring a BOM seems to be the safest option for
+ * content in Git.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen6
+ * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
+ * Section 3.10, D98, page 132
+ * [3] https://encoding.spec.whatwg.org/#utf-16le
+ */
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.2
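
Not part of the patch, just for context: a minimal sketch of how a
caller could use the new helper. The real call site and its exact
error message come in a subsequent commit of this series, so the
function name and die() text below are only illustrative.

#include "cache.h"
#include "utf8.h"

/*
 * Illustrative only: the function name and message are assumptions;
 * the actual call site is added in a subsequent commit of this series.
 */
static void check_utf_bom_or_die(const char *enc, const char *data, size_t len)
{
        if (is_missing_required_utf_bom(enc, data, len))
                die("BOM is required in '%s' encoded content", enc);
}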