Lars Schneider Tue, 30 Jan 2018 12:59:10 -0800

> On 30 Jan 2018, at 20:15, Junio C Hamano <[email protected]> wrote:
> 
> [email protected] writes:
> 
>> From: Lars Schneider <[email protected]>
>> 
>> If the endianness is not defined in the encoding name, then let's
>> be strict and require a BOM to avoid any encoding confusion. The
>> has_missing_utf_bom() function returns true if a required BOM is
>> missing.
>> 
>> The Unicode standard instructs to assume big-endian if there is no BOM
>> for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
>> in HTML5 recommends to assume little-endian to "deal with deployed
>> content" [3]. Strictly requiring a BOM seems to be the safest option
>> for content in Git.
> 
> I do not have strong opinion on encoding such policy-ish behaviour
> as our default, but am I alone to find that "has missing X" is a
> confusing name for a helper function?  "is missing X" (or "lacks
> X") is a bit more understandable, I guess.


That might be a German/English translation thingy, but I think I get
your point. "has" implies there is something and "missing" implies
there is nothing :)

"is_missing_utf_bom()" might even be a bit unspecific, as UTF-8
usually lacks a BOM but the function would still return
"false". Therefore, "is_missing_required_utf_bom()" might be
lengthy but should fit.

OK for you?

- Lars


> 
>> +int has_missing_utf_bom(const char *enc, const char *data, size_t len)
>> +{
>> +    return (
>> +       !strcmp(enc, "UTF-16") &&
>> +       !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
>> +         has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
>> +    ) || (
>> +       !strcmp(enc, "UTF-32") &&
>> +       !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
>> +         has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
>> +    );
>> +}
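
For illustration, the quoted function with the proposed name could look roughly like the sketch below. The BOM byte arrays and the body of has_bom_prefix() are not part of the quoted hunk, so the definitions here are assumptions made only to keep the sketch self-contained. The logic of the renamed function is unchanged from the patch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Assumed definitions: the quoted hunk references these arrays but does
 * not show them. The byte values are the standard Unicode BOMs.
 */
static const char utf16_be_bom[] = { '\xFE', '\xFF' };
static const char utf16_le_bom[] = { '\xFF', '\xFE' };
static const char utf32_be_bom[] = { '\0', '\0', '\xFE', '\xFF' };
static const char utf32_le_bom[] = { '\xFF', '\xFE', '\0', '\0' };

/*
 * Assumed helper (body not shown in the quoted hunk): returns non-zero
 * if the first bom_len bytes of data equal the given BOM.
 */
static int has_bom_prefix(const char *data, size_t len,
			  const char *bom, size_t bom_len)
{
	return data && bom && len >= bom_len && !memcmp(data, bom, bom_len);
}

/*
 * Returns 1 only if the encoding name leaves the endianness open
 * ("UTF-16" or "UTF-32") and the data does not start with a matching
 * BOM. Any other encoding name, including "UTF-8", yields 0.
 */
int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
{
	return (
	   !strcmp(enc, "UTF-16") &&
	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
	) || (
	   !strcmp(enc, "UTF-32") &&
	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
	);
}
```

With these definitions, BOM-less content declared as "UTF-16" is reported as missing a required BOM, while the same content declared as "UTF-8" or "UTF-16LE" is not, which is the distinction the longer name is meant to express.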

Re: [PATCH v5 4/7] utf8: add function to detect a missing UTF-16/32 BOM
