On Tue, Nov 11, 2014 at 9:00 AM, Stephan Beal <sgb...@googlemail.com> wrote:

> On Tue, Nov 11, 2014 at 1:12 PM, Jan Nijtmans <jan.nijtm...@gmail.com>
> wrote:
>
>> The convention on Windows is to assume CP1252, unless the file
>> starts with the UTF-8 BOM. That's exactly what fossil is doing here:
>>      <http://fossil-scm.org/index.html/artifact/cbd7a598c8?ln=1745-1747>
>> So, make sure that the file starts with the UTF-8 BOM, otherwise
>> fossil cannot make any valid guess on what encoding is used.
>> Assuming UTF-8 on windows is wrong, because if it is really CP1252
>> then that leads to invalid utf-8 byte sequences.
>>
>
> i think those very lines are the fix. The poster was using 1.29 (from
> June, 2014), which is much newer than the last of those lines. The poster
> claims that the text is in UTF-8. So it sounds to me like the fix for him
> is, "use a BOM or CP1252" (but i assume CP1252 is not Chinese-capable). It
> seems to me that Fossil is doing all that it can there (namely, following a
> heuristic for determining the encoding, and no heuristic is infallible).
>
> Regarding the BOM: the Unicode consortium recommends against using a BOM
> because (A) it's senseless (for its original purpose) in UTF-8 and (B)
> because so many tools don't deal well with it (i've seen PHP sites go
> offline when someone checked a BOM into one of the source files). i
> understand that it's probably the lesser of several evils here, though.
>

The Unicode consortium also said that 16 bits would be enough for
everything, which "bit" Microsoft in the backside when they went all in on
Unicode before the expansion to 20+ bits (or even the development of
UTF-8). :)

Regardless, if someone is using valid UTF-8 that contains non-ASCII
characters, it is almost guaranteed not to be CP1252. Or maybe more
precisely, if someone is using CP1252 with non-ASCII characters, it almost
certainly isn't valid UTF-8. It would be possible (and I think code already
exists in Fossil) to check a buffer for UTF-8 well-formedness. Perhaps that
would be a better check than (or an additional check to) just "does it have
a BOM".
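
To make the idea concrete, here is a minimal sketch of such a well-formedness
check. It is written in Python for brevity and is not Fossil's code; the
function name is_well_formed_utf8 is made up for illustration, and it simply
leans on Python's strict UTF-8 decoder.

```python
def is_well_formed_utf8(buf: bytes) -> bool:
    """Return True if buf decodes as strict UTF-8 (pure ASCII also passes)."""
    try:
        buf.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Real UTF-8 text, including Chinese characters.
utf8_text = "中文 café".encode("utf-8")

# The same Latin word in CP1252: the lone 0xE9 byte is not valid UTF-8.
cp1252_text = "café".encode("cp1252")

# The rare exception: CP1252 bytes 0xC3 0xA9 also happen to be valid UTF-8.
ambiguous = "Ã©".encode("cp1252")

print(is_well_formed_utf8(utf8_text))
print(is_well_formed_utf8(cp1252_text))
print(is_well_formed_utf8(ambiguous))
```

The last case is why this is only "almost" guaranteed: a few CP1252 byte
pairs are also legal UTF-8, but they are unlikely to show up in real text.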

Note: The preceding does not attempt to deal with anything other than
CP1252. Sniffing an encoding is a Hard Problem. ICU provides statistical
models for sniffing encodings, but even it does not get it right every time
(not because it is a bad library, but because the problem is hard). Short of
mandating linking with ICU, I think checking for a well-formed UTF-8 buffer
and switching between it and "raw code page style encoding" isn't a bad
option.
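
Roughly, the switching I have in mind looks like the sketch below. Again this
is Python rather than Fossil's C, and decode_with_fallback and its return
labels are hypothetical names of my own. It assumes CP1252 is the only
fallback code page, and it replaces the few bytes CP1252 leaves undefined
instead of failing on them.

```python
import codecs


def decode_with_fallback(buf: bytes, fallback: str = "cp1252"):
    """Return (text, label): BOM first, then strict UTF-8, then the code page."""
    # 1. An explicit UTF-8 BOM settles the question; strip it before decoding.
    if buf.startswith(codecs.BOM_UTF8):
        body = buf[len(codecs.BOM_UTF8):]
        return body.decode("utf-8", errors="replace"), "utf-8-bom"
    # 2. No BOM: accept the buffer as UTF-8 only if it is well formed.
    try:
        return buf.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # 3. Otherwise treat it as raw code page text (assumption: CP1252).
    return buf.decode(fallback, errors="replace"), fallback


print(decode_with_fallback("中文".encode("utf-8")))
print(decode_with_fallback(b"caf\xe9"))
print(decode_with_fallback(codecs.BOM_UTF8 + b"hi"))
```

Pure ASCII input takes the UTF-8 branch, which is harmless because ASCII
decodes identically under both encodings.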

-- 
Scott Robison
_______________________________________________
fossil-dev mailing list
fossil-dev@lists.fossil-scm.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/fossil-dev
