On Sat, Feb 24, 2018 at 11:27 AM, <[email protected]> wrote:
> If the endianness is not defined in the encoding name, then let's
> be strict and require a BOM to avoid any encoding confusion. The
> is_missing_required_utf_bom() function returns true if a required BOM
> is missing.
>
> The Unicode standard instructs to assume big-endian if there is no BOM
> for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
> in HTML5 recommends to assume little-endian to "deal with deployed
> content" [3]. Strictly requiring a BOM seems to be the safest option
> for content in Git.
>
> Signed-off-by: Lars Schneider <[email protected]>
> ---
> diff --git a/utf8.h b/utf8.h
> @@ -79,4 +79,20 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
> +/*
> + * If the endianness is not defined in the encoding name, then we
> + * require a BOM. The function returns true if a required BOM is missing.
> + *
> + * The Unicode standard instructs to assume big-endian if there
> + * is no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG
> + * encoding standard used in HTML5 recommends to assume
> + * little-endian to "deal with deployed content" [3].
Perhaps you could append to this comment the final bit of explanation
from the commit message, which ties these conflicting recommendations
together. In particular:

    Therefore, strictly requiring a BOM seems to be the
    safest option for content in Git.
> + */
> +int is_missing_required_utf_bom(const char *enc, const char *data, size_t len);
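
For readers following along, here is a minimal sketch of the behavior
the comment describes. It is not the implementation from the patch: the
helper names same_encoding_name() and has_bom() are hypothetical, and
the sketch assumes that only the exact names "UTF-16" and "UTF-32"
(compared case-insensitively) leave the endianness undefined.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: case-insensitive comparison of encoding names. */
static int same_encoding_name(const char *a, const char *b)
{
	if (!a || !b)
		return 0;
	for (; *a && *b; a++, b++)
		if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
			return 0;
	return !*a && !*b;
}

/* Hypothetical helper: does the buffer start with the given BOM bytes? */
static int has_bom(const char *data, size_t len,
		   const unsigned char *bom, size_t bom_len)
{
	return data && len >= bom_len && !memcmp(data, bom, bom_len);
}

/*
 * Sketch only: returns 1 if the encoding name leaves the endianness
 * undefined ("UTF-16" or "UTF-32") and the data carries no BOM,
 * 0 otherwise.
 */
int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
{
	static const unsigned char utf16_be[] = { 0xFE, 0xFF };
	static const unsigned char utf16_le[] = { 0xFF, 0xFE };
	static const unsigned char utf32_be[] = { 0x00, 0x00, 0xFE, 0xFF };
	static const unsigned char utf32_le[] = { 0xFF, 0xFE, 0x00, 0x00 };

	if (same_encoding_name(enc, "UTF-16"))
		return !(has_bom(data, len, utf16_be, sizeof(utf16_be)) ||
			 has_bom(data, len, utf16_le, sizeof(utf16_le)));
	if (same_encoding_name(enc, "UTF-32"))
		return !(has_bom(data, len, utf32_be, sizeof(utf32_be)) ||
			 has_bom(data, len, utf32_le, sizeof(utf32_le)));
	/* Endianness is given in the name (e.g. UTF-16LE) or not applicable. */
	return 0;
}
```

With this shape, a caller can reject BOM-less "UTF-16" content up front,
while names such as "UTF-16LE" never trigger the check because the
endianness is already stated in the name.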