Re: [PATCH v8 4/7] utf8: add function to detect a missing UTF-16/32 BOM

2018-02-25 Thread Lars Schneider

> On 25 Feb 2018, at 04:52, Eric Sunshine  wrote:
> 
> On Sat, Feb 24, 2018 at 11:27 AM,   wrote:
>> If the endianness is not defined in the encoding name, then let's
>> be strict and require a BOM to avoid any encoding confusion. The
>> is_missing_required_utf_bom() function returns true if a required BOM
>> is missing.
>> 
>> The Unicode standard instructs to assume big-endian if there is no BOM
>> for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
>> in HTML5 recommends to assume little-endian to "deal with deployed
>> content" [3]. Strictly requiring a BOM seems to be the safest option
>> for content in Git.
>> 
>> Signed-off-by: Lars Schneider 
>> ---
>> diff --git a/utf8.h b/utf8.h
>> @@ -79,4 +79,20 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
>> +/*
>> + * If the endianness is not defined in the encoding name, then we
>> + * require a BOM. The function returns true if a required BOM is missing.
>> + *
>> + * The Unicode standard instructs to assume big-endian if there
>> + * is no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG
>> + * encoding standard used in HTML5 recommends to assume
>> + * little-endian to "deal with deployed content" [3].
> 
> Perhaps you could tack on to the comment here the final bit of
> explanation from the commit message which ties these conflicting
> recommendations together. In particular:
> 
>     Therefore, strictly requiring a BOM seems to be the
>     safest option for content in Git.

Agreed. I'll change it.

Thanks,
Lars

Re: [PATCH v8 4/7] utf8: add function to detect a missing UTF-16/32 BOM

2018-02-24 Thread Eric Sunshine
On Sat, Feb 24, 2018 at 11:27 AM,   wrote:
> If the endianness is not defined in the encoding name, then let's
> be strict and require a BOM to avoid any encoding confusion. The
> is_missing_required_utf_bom() function returns true if a required BOM
> is missing.
>
> The Unicode standard instructs to assume big-endian if there is no BOM
> for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
> in HTML5 recommends to assume little-endian to "deal with deployed
> content" [3]. Strictly requiring a BOM seems to be the safest option
> for content in Git.
>
> Signed-off-by: Lars Schneider 
> ---
> diff --git a/utf8.h b/utf8.h
> @@ -79,4 +79,20 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
> +/*
> + * If the endianness is not defined in the encoding name, then we
> + * require a BOM. The function returns true if a required BOM is missing.
> + *
> + * The Unicode standard instructs to assume big-endian if there
> + * is no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG
> + * encoding standard used in HTML5 recommends to assume
> + * little-endian to "deal with deployed content" [3].

Perhaps you could tack on to the comment here the final bit of
explanation from the commit message which ties these conflicting
recommendations together. In particular:

    Therefore, strictly requiring a BOM seems to be the
    safest option for content in Git.

> + */
> +int is_missing_required_utf_bom(const char *enc, const char *data, size_t len);
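
To make the conflicting recommendations concrete, here is a minimal sketch (not part of the patch) of why a BOM-less "UTF-16" buffer is ambiguous. The helper names utf16_unit_be() and utf16_unit_le() are hypothetical and exist only for this illustration: the same two bytes decode to different code units depending on the assumed byte order.

```c
#include <stdint.h>

/* Hypothetical helper: read the first 16-bit code unit as big-endian. */
static uint16_t utf16_unit_be(const unsigned char *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Hypothetical helper: read the first 16-bit code unit as little-endian. */
static uint16_t utf16_unit_le(const unsigned char *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}
```

For the bytes 0x00 0x41, the big-endian reading (the Unicode default) gives U+0041, while the little-endian reading (the WHATWG default) gives U+4100. Without a BOM, nothing in the data says which one was meant.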


[PATCH v8 4/7] utf8: add function to detect a missing UTF-16/32 BOM

2018-02-24 Thread lars . schneider
From: Lars Schneider 

If the endianness is not defined in the encoding name, then let's
be strict and require a BOM to avoid any encoding confusion. The
is_missing_required_utf_bom() function returns true if a required BOM
is missing.

The Unicode standard instructs to assume big-endian if there is no BOM
for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
in HTML5 recommends to assume little-endian to "deal with deployed
content" [3]. Strictly requiring a BOM seems to be the safest option
for content in Git.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#gen6
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
 Section 3.10, D98, page 132
[3] https://encoding.spec.whatwg.org/#utf-16le

Signed-off-by: Lars Schneider 
---
 utf8.c | 13 +++++++++++++
 utf8.h | 16 ++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/utf8.c b/utf8.c
index 914881cd1f..5113d26e56 100644
--- a/utf8.c
+++ b/utf8.c
@@ -562,6 +562,19 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
);
 }
 
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
+{
+   return (
+  !strcmp(enc, "UTF-16") &&
+  !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+   ) || (
+  !strcmp(enc, "UTF-32") &&
+  !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+   );
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 4711429af9..62f86fba64 100644
--- a/utf8.h
+++ b/utf8.h
@@ -79,4 +79,20 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
  */
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
 
+/*
+ * If the endianness is not defined in the encoding name, then we
+ * require a BOM. The function returns true if a required BOM is missing.
+ *
+ * The Unicode standard instructs to assume big-endian if there
+ * is no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG
+ * encoding standard used in HTML5 recommends to assume
+ * little-endian to "deal with deployed content" [3].
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen6
+ * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
+ * Section 3.10, D98, page 132
+ * [3] https://encoding.spec.whatwg.org/#utf-16le
+ */
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.1
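
For readers who want to try the check outside the Git tree, the following is a standalone sketch of the function above. The BOM byte arrays and has_bom_prefix() are not shown in this patch, so their definitions here are assumptions based on the standard UTF-16/32 BOM byte sequences; only the body of is_missing_required_utf_bom() mirrors the patch.

```c
#include <stddef.h>
#include <string.h>

/* Assumed definitions: standard BOM byte sequences (not shown in this patch). */
static const char utf16_be_bom[] = {'\xFE', '\xFF'};
static const char utf16_le_bom[] = {'\xFF', '\xFE'};
static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};

/* Assumed helper: true if `data` begins with the given BOM bytes. */
static int has_bom_prefix(const char *data, size_t len,
			  const char *bom, size_t bom_len)
{
	return data && bom && len >= bom_len && !memcmp(data, bom, bom_len);
}

/* Same logic as the patch: only the endianness-less names require a BOM. */
int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
{
	return (
	   !strcmp(enc, "UTF-16") &&
	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
	) || (
	   !strcmp(enc, "UTF-32") &&
	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
	);
}
```

With these assumptions, is_missing_required_utf_bom("UTF-16", "h\0i\0", 4) is true because no BOM is present, while the same call with "UTF-16LE" is false because the name already states the byte order. Note that the comparison uses strcmp(), so in this sketch the encoding name is matched case-sensitively.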