Re: [PATCH v8 4/7] utf8: add function to detect a missing UTF-16/32 BOM
> On 25 Feb 2018, at 04:52, Eric Sunshine wrote:
>
> On Sat, Feb 24, 2018 at 11:27 AM, wrote:
>> If the endianness is not defined in the encoding name, then let's
>> be strict and require a BOM to avoid any encoding confusion. The
>> is_missing_required_utf_bom() function returns true if a required BOM
>> is missing.
>>
>> The Unicode standard instructs to assume big-endian if there is no BOM
>> for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
>> in HTML5 recommends to assume little-endian to "deal with deployed
>> content" [3]. Strictly requiring a BOM seems to be the safest option
>> for content in Git.
>>
>> Signed-off-by: Lars Schneider
>> ---
>> diff --git a/utf8.h b/utf8.h
>> @@ -79,4 +79,20 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
>> +/*
>> + * If the endianness is not defined in the encoding name, then we
>> + * require a BOM. The function returns true if a required BOM is missing.
>> + *
>> + * The Unicode standard instructs to assume big-endian if there
>> + * is no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG
>> + * encoding standard used in HTML5 recommends to assume
>> + * little-endian to "deal with deployed content" [3].
>
> Perhaps you could tack on to the comment here the final bit of
> explanation from the commit message which ties these conflicting
> recommendations together. In particular:
>
>    Therefore, strictly requiring a BOM seems to be the
>    safest option for content in Git.

Agreed. I'll change it.

Thanks,
Lars
Re: [PATCH v8 4/7] utf8: add function to detect a missing UTF-16/32 BOM
On Sat, Feb 24, 2018 at 11:27 AM, wrote:
> If the endianness is not defined in the encoding name, then let's
> be strict and require a BOM to avoid any encoding confusion. The
> is_missing_required_utf_bom() function returns true if a required BOM
> is missing.
>
> The Unicode standard instructs to assume big-endian if there is no BOM
> for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
> in HTML5 recommends to assume little-endian to "deal with deployed
> content" [3]. Strictly requiring a BOM seems to be the safest option
> for content in Git.
>
> Signed-off-by: Lars Schneider
> ---
> diff --git a/utf8.h b/utf8.h
> @@ -79,4 +79,20 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
> +/*
> + * If the endianness is not defined in the encoding name, then we
> + * require a BOM. The function returns true if a required BOM is missing.
> + *
> + * The Unicode standard instructs to assume big-endian if there
> + * is no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG
> + * encoding standard used in HTML5 recommends to assume
> + * little-endian to "deal with deployed content" [3].

Perhaps you could tack on to the comment here the final bit of
explanation from the commit message which ties these conflicting
recommendations together. In particular:

    Therefore, strictly requiring a BOM seems to be the
    safest option for content in Git.

> + */
> +int is_missing_required_utf_bom(const char *enc, const char *data, size_t len);
[PATCH v8 4/7] utf8: add function to detect a missing UTF-16/32 BOM
From: Lars Schneider

If the endianness is not defined in the encoding name, then let's
be strict and require a BOM to avoid any encoding confusion. The
is_missing_required_utf_bom() function returns true if a required BOM
is missing.

The Unicode standard instructs to assume big-endian if there is no BOM
for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
in HTML5 recommends to assume little-endian to "deal with deployed
content" [3]. Strictly requiring a BOM seems to be the safest option
for content in Git.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#gen6
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
    Section 3.10, D98, page 132
[3] https://encoding.spec.whatwg.org/#utf-16le

Signed-off-by: Lars Schneider
---
 utf8.c | 13 +++++++++++++
 utf8.h | 16 ++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/utf8.c b/utf8.c
index 914881cd1f..5113d26e56 100644
--- a/utf8.c
+++ b/utf8.c
@@ -562,6 +562,19 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
 	);
 }
 
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	   !strcmp(enc, "UTF-16") &&
+	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	   !strcmp(enc, "UTF-32") &&
+	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 4711429af9..62f86fba64 100644
--- a/utf8.h
+++ b/utf8.h
@@ -79,4 +79,20 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
  */
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
 
+/*
+ * If the endianness is not defined in the encoding name, then we
+ * require a BOM. The function returns true if a required BOM is missing.
+ *
+ * The Unicode standard instructs to assume big-endian if there
+ * is no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG
+ * encoding standard used in HTML5 recommends to assume
+ * little-endian to "deal with deployed content" [3].
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen6
+ * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
+ *     Section 3.10, D98, page 132
+ * [3] https://encoding.spec.whatwg.org/#utf-16le
+ */
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.1
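
For readers who want to try the check outside of the Git tree, here is a
minimal standalone sketch of the function from the patch above. The patch
relies on has_bom_prefix() and the utf16_*/utf32_* BOM arrays, which are
defined elsewhere in the series and are not shown in this message, so the
versions below are stand-ins written for illustration and may differ from
the real definitions.

```c
#include <stddef.h>
#include <string.h>

/*
 * Stand-in BOM byte sequences (assumption: the real arrays are defined
 * elsewhere in the series; these are the standard Unicode BOM encodings).
 */
static const char utf16_be_bom[] = {'\xFE', '\xFF'};
static const char utf16_le_bom[] = {'\xFF', '\xFE'};
static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};

/*
 * Stand-in helper (assumption): returns non-zero if the first bom_len
 * bytes of data equal bom.
 */
static int has_bom_prefix(const char *data, size_t len,
			  const char *bom, size_t bom_len)
{
	return data && bom && (len >= bom_len) && !memcmp(data, bom, bom_len);
}

/*
 * Same logic as the patch: a BOM is required only when the encoding
 * name does not state the endianness ("UTF-16" or "UTF-32").
 */
int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
{
	return (
	   !strcmp(enc, "UTF-16") &&
	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
	) || (
	   !strcmp(enc, "UTF-32") &&
	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
	);
}
```

With these stand-ins, "UTF-16" content without a BOM is reported as missing
one, while "UTF-16LE" content is never flagged because the name already
states the endianness. Note that strcmp() is an exact match as written, so
an encoding name spelled "utf-16" would not be flagged by this sketch.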