Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings

Lars Schneider Wed, 07 Mar 2018 14:13:10 -0800

> On 07 Mar 2018, at 20:49, Junio C Hamano <[email protected]> wrote:
> 
> [email protected] writes:
> 
>> +static int validate_encoding(const char *path, const char *enc,
>> +                  const char *data, size_t len, int die_on_error)
>> +{
>> +    /* We only check for UTF here as UTF?? can be an alias for UTF-?? */
>> +    if (startscase_with(enc, "UTF")) {
>> +            /*
>> +             * Check for detectable errors in UTF encodings
>> +             */
>> +            if (has_prohibited_utf_bom(enc, data, len)) {
>> +                    const char *error_msg = _(
>> +                            "BOM is prohibited in '%s' if encoded as %s");
>> +                    /*
>> +                     * This advice is shown for UTF-??BE and UTF-??LE 
>> encodings.
>> +                     * We cut off the last two characters of the encoding 
>> name
>> +                     # to generate the encoding name suitable for BOMs.
>> +                     */
> 
> Yuck.  The code pretends to abstract away the details in a helper
> has_prohibited_x() yet the caller still knows quite a lot.


True, but has_prohibited_x() cannot create a proper error/advise
message unless we give it more parameters (e.g. path name).
Therefore, I don't see a better way right now.


>> +                    const char *advise_msg = _(
>> +                            "The file '%s' contains a byte order "
>> +                            "mark (BOM). Please use %s as "
>> +                            "working-tree-encoding.");
>> +                    char *upper_enc = xstrdup_toupper(enc);
>> +                    upper_enc[strlen(upper_enc)-2] = '\0';
>> +                    advise(advise_msg, path, upper_enc);
>> +                    free(upper_enc);
> 
> I think this up-casing is more problematic than without, not from
> the point of view of the internal code, but from the point of view
> of the end user experience.  When the user writes utf16le or
> utf-16le and the data does not trigger the BOM check, we are likely
> to successfully convert it.  I do not see the merit of suggesting
> UTF16 or UTF-16 in such a case, over telling them to just drop the
> byte-order suffix from the encoding names (i.e. utf16 or utf-16).
> 
> If you are trying to force/nudge people in the direction of
> canonical way of spelling things (which may not be a bad idea), then
> "utf16le" as the original input would want to result in "UTF-16"
> with dash in the advise, no?

Correct. In the error messages I kept the encoding name "as-is" and
only in the advise message I used the uppercase variant to steer
people into the canonical direction. My initial reason for this was
that in is_missing_required_utf_bom() we add "BE/LE" to the encoding
in the advise message. Let's say the user used "Utf-16" as encoding.
 Should "BE/LE" be upper case or lower case? To avoid that question 
I made it always upper case.

I also would have liked to advise "UTF-16" instead of "UTF16" as
you suggested. However, that required a few more lines and I wanted
to keep the change to a minimum. I feel this could be added in a
follow up patch.


> On the other hand, if we are not enforcing such a policy decision
> but merely explaining a way to work around this check, then it may
> be better to give a variant with the smaller difference from the
> original (i.e. without up-casing).

See example mentioned above: "Utf-16". How would you handle that?


Thanks,
Lars

Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings

Reply via email to