On Tue, Nov 11, 2014 at 9:00 AM, Stephan Beal <sgb...@googlemail.com> wrote:
> On Tue, Nov 11, 2014 at 1:12 PM, Jan Nijtmans <jan.nijtm...@gmail.com>
> wrote:
>
>> The convention on Windows is to assume CP1252, unless the file
>> starts with the UTF-8 BOM. That's exactly what fossil is doing here:
>> <http://fossil-scm.org/index.html/artifact/cbd7a598c8?ln=1745-1747>
>> So, make sure that the file starts with the UTF-8 BOM, otherwise
>> fossil cannot make any valid guess on what encoding is used.
>> Assuming UTF-8 on windows is wrong, because if it is really CP1252
>> then that leads to invalid utf-8 byte sequences.
>>
>
> i think those very lines are the fix. The poster was using 1.29 (from
> June, 2014), which is much newer than the last of those lines. The poster
> claims that the text is in UTF-8. So it sounds to me like the fix for him
> is, "use a BOM or CP1252" (but i assume CP1252 is not Chinese-capable). It
> seems to me that Fossil is doing all that it can there (namely, following a
> heuristic for determining the encoding, and no heuristic is infallible).
>
> Regarding the BOM: the Unicode consortium recommends against using a BOM
> because (A) it's senseless (for its original purpose) in UTF-8 and (B)
> because so many tools don't deal well with it (i've seen PHP sites go
> offline when someone checked a BOM into one of the source files). i
> understand that it's probably the lesser of several evils here, though.
>

The Unicode consortium also said that 16 bits would be enough for everything, which "bit" Microsoft in the backside when they went all in on Unicode before the expansion to 20+ bits (or even the development of UTF-8). :)

Regardless, if someone is using valid UTF-8, it is almost guaranteed not to be CP1252. Or maybe more precisely, if someone is using CP1252, it almost certainly isn't valid UTF-8. It would be possible (and I think code already exists in Fossil) to check a buffer for UTF-8 "well formedness". Perhaps that would be a better check than (or an additional check to) just "does it have a BOM".
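To make the idea concrete, here is a rough sketch of "BOM first, then well-formedness, then fall back to CP1252". It is written in Python for brevity rather than in Fossil's C, it is not Fossil's actual code, and the names `is_well_formed_utf8` and `guess_encoding` are made up for illustration. It leans on Python's strict UTF-8 decoder to do the validation.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def is_well_formed_utf8(buf: bytes) -> bool:
    """Return True if buf decodes as strict UTF-8 (hypothetical helper)."""
    try:
        buf.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def guess_encoding(buf: bytes) -> str:
    """Guess between UTF-8 and CP1252 only; nothing else is considered."""
    # An explicit BOM is the strongest signal, so check it first.
    if buf.startswith(UTF8_BOM):
        return "utf-8-sig"
    # No BOM: accept the buffer as UTF-8 if every byte sequence is valid.
    if is_well_formed_utf8(buf):
        return "utf-8"
    # Otherwise fall back to the Windows convention.
    return "cp1252"


if __name__ == "__main__":
    print(guess_encoding("中文".encode("utf-8")))
    print(guess_encoding("café".encode("cp1252")))
```

Pure ASCII input is reported as UTF-8 here, which is harmless because ASCII bytes mean the same thing in both encodings. A short CP1252 buffer can still happen to be valid UTF-8, so this remains a heuristic.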
Note: The preceding does not attempt to deal with anything other than CP1252. Sniffing an encoding is a Hard Problem. Short of mandating linking with ICU, which provides statistical models for sniffing encodings (and which itself does not get it right every time, not because it is a bad library but because it is a hard problem), I think checking for a well-formed UTF-8 buffer and switching between it and "raw code page style encoding" isn't a bad option.

-- 
Scott Robison