Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.
On Tue, Jul 8, 2014 at 9:37 PM, Stephan Beal sgb...@googlemail.com wrote:

> No characters between 128 and 255 are valid UTF-8, to avoid confusion with the many encodings which use that range.

For the record, that's apparently wrong. My local man pages (and experimentation with the termbox API) say otherwise:

    Encoding
    The following byte sequences are used to represent a character. The sequence to be used depends on the UCS code number of the character:

    0x00000000 - 0x0000007F: 0xxxxxxx
    0x00000080 - 0x000007FF: 110xxxxx 10xxxxxx

So the range is used, but it encodes to two UTF-8 characters.

--
- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
Freedom is sloppy. But since tyranny's the only guaranteed byproduct of those who insist on a perfect world, freedom will have to do. -- Bigby Wolf

___
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users
Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.
On Tue, Jul 22, 2014 at 11:48 AM, Stephan Beal sgb...@googlemail.com wrote:

> On Tue, Jul 8, 2014 at 9:37 PM, Stephan Beal sgb...@googlemail.com wrote:
>
> > No characters between 128 and 255 are valid UTF-8, to avoid confusion with the many encodings which use that range.
>
> For the record, that's apparently wrong. My local man pages (and experimentation with the termbox API) say otherwise:
>
> So the range is used, but it encodes to two UTF-8 characters.

Actually, 1 Unicode character encoded into 2 UTF-8 bytes.

FWIW, FYI, UTF-8 has an optional Byte Order Mark, 0xEF 0xBB 0xBF, that can appear at the beginning of a file. This is just the UTF-8 encoding of code point U+FEFF, which is the actual Unicode Byte Order Mark. For UTF-8, this mark is really only useful as a suggestion that the following text might be UTF-8 encoded Unicode. For UTF-16 and UTF-32 encodings, this mark is used to inform the receiver of the text of the order of bytes within the 16- or 32-bit encoding units (presuming that the file is actually UTF-16 or UTF-32 encoded text).
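The point about the Byte Order Mark can be checked with a few lines of Python. This is only a minimal sketch: the helper names `bom_bytes` and `strip_utf8_bom` are made up for illustration and have nothing to do with Fossil's own code.

```python
import codecs

# U+FEFF is the code point that serves as the Byte Order Mark.
BOM = "\ufeff"


def bom_bytes(encoding: str) -> bytes:
    """Return the bytes that U+FEFF turns into under the given encoding."""
    return BOM.encode(encoding)


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM if one is present (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# In UTF-8 the mark is always the same three bytes, so it carries no
# byte-order information; it only hints that the text may be UTF-8.
print(bom_bytes("utf-8").hex(" "))      # ef bb bf

# In UTF-16 the two possible byte orders produce different byte sequences,
# which is what lets a reader detect the order.
print(bom_bytes("utf-16-le").hex(" "))  # ff fe
print(bom_bytes("utf-16-be").hex(" "))  # fe ff
```

The UTF-16 lines show why the mark matters there: the same code point arrives as different bytes depending on the byte order, while the UTF-8 form never varies.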
Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.
Thus said Stephan Beal on Tue, 22 Jul 2014 19:01:27 +0200:

> One would think i'd be more conscious of how i throw around byte vs character :/. i'm still not clear on the whole char-vs-code point bit, though.

The whole char-vs-codepoint has always been unclear for me, no matter how many times I read technical descriptions of it. :-)

Andy
--
TAI64 timestamp: 400053cf4d1f
Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.
2014-07-09 0:05 GMT+02:00 Andy Bradford amb-fos...@bradfords.org:

> Or perhaps just making the documentation more clear that all files must be valid UTF-8.

Oh no, fossil doesn't require at all that all files are valid UTF-8. Only fossil ui assumes UTF-8 encoding for non-binary files, otherwise it cannot display the file content in a reasonable way to the user. If the file is not UTF-8, it just might look strange in the UI, that's all.

I added the possibility to convert a file to a valid UTF-8 stream when doing a fossil commit. That will not always be what's desired: fossil will convert è (0xe8) to è (0xc3 0xa8) for you if you answer 'c' to the prompt. If you want something else (e.g. escaping), fossil cannot do that for you.

Regards,
Jan Nijtmans
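The conversion Jan describes, one 0xe8 byte becoming the two bytes 0xc3 0xa8, can be reproduced in Python. This is a rough sketch that assumes the source bytes are ISO-8859-1; it is not Fossil's conversion code, and the function names are invented for the example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Report whether the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def latin1_to_utf8(data: bytes) -> bytes:
    """Reinterpret bytes as ISO-8859-1 (an assumption) and re-encode as UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")


original = b"caf\xe8"                  # ASCII text plus one 0xe8 byte
converted = latin1_to_utf8(original)

print(is_valid_utf8(original))         # False
print(converted.hex(" "))              # 63 61 66 c3 a8
print(is_valid_utf8(converted))        # True
```

The result is only right if the file really was ISO-8859-1. For a file in some other 8-bit encoding the same bytes would stand for different characters, which is presumably why the conversion is offered as a choice at the prompt instead of being applied silently.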
[fossil-users] File contains invalid UTF-8, but is not UTF-8.
Hello,

I have some Tcl scripts (for IRC) that previously had no problems when I committed. They don't have UTF-8 characters at all, but when I try to commit them I get the warning:

    ./test.tcl contains invalid UTF-8. Use --no-warnings or the encoding-glob setting to disable this warning.

Prior to [0cb00c0b8f4e5b03] I was able to commit these without errors/warnings and without encoding-glob settings. I'm not sure why it thinks they have UTF-8 characters (or invalid ones at that). They are just ASCII with a few non-printable characters (0x03 primarily) for IRC colors and one è (0xe8) character.

If I remove the è (0xe8) character I can commit. I didn't think 0xe8 was UTF-8, but maybe I'm mistaken?

Thanks,
Andy
--
TAI64 timestamp: 400053bc3cec
Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.
2014-07-08 20:47 GMT+02:00 Andy Bradford amb-fos...@bradfords.org:

> Hello, I have some Tcl scripts (for IRC) that previously had no problems when I committed. They don't have UTF-8 characters at all, but when I try to commit them I get the warning:
>
>     ./test.tcl contains invalid UTF-8. Use --no-warnings or the encoding-glob setting to disable this warning.

If you don't want this warning, just set 'encoding-glob' to '*'. But did you ever view this file in the fossil UI? Did the è really look like è there?

The warning is meant to warn you that Tcl scripts containing an 0xe8 byte are not portable: which character it really is depends on the encoding. Better replace that by \xe8, that will make your script portable, working identically no matter what Tcl's system encoding is set to.

> I didn't think 0xe8 was UTF-8, but maybe I'm mistaken?

In the fossil UI, all files are displayed assuming the encoding is UTF-8. Invalid bytes are displayed in the browser as the replacement character. If you want that, that's OK, just answer 'y' to the question. More likely is that people are not aware that such characters can cause unexpected problems.

Hope this helps.

Regards,
Jan Nijtmans
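The display behaviour described above, where an invalid byte shows up as the replacement character, can be imitated with Python's decoder. This is a sketch only: `view_as` is a made-up helper, the sample bytes are invented, and a browser's own decoder may differ in detail.

```python
def view_as(data: bytes, encoding: str) -> str:
    """Decode bytes for display, substituting U+FFFD for undecodable input."""
    return data.decode(encoding, errors="replace")


# Hypothetical script fragment: ASCII text ending in a lone 0xe8 byte.
script = b"set word caf\xe8"

as_utf8 = view_as(script, "utf-8")          # what a UTF-8-only viewer shows
as_latin1 = view_as(script, "iso-8859-1")   # what the author intended

print(as_utf8.endswith("\ufffd"))    # True: 0xe8 became the replacement character
print(as_latin1.endswith("\u00e8"))  # True: 0xe8 read as e-grave
```

The same bytes yield two different strings depending on the assumed encoding, which is the portability problem the warning points at.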
Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.
On Tue, Jul 8, 2014 at 8:47 PM, Andy Bradford amb-fos...@bradfords.org wrote:

> If I remove the è (0xe8) character I can commit. I didn't think 0xe8 was UTF-8, but maybe I'm mistaken?

No characters between 128 and 255 are valid UTF-8, to avoid confusion with the many encodings which use that range.

--
- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
Freedom is sloppy. But since tyranny's the only guaranteed byproduct of those who insist on a perfect world, freedom will have to do. -- Bigby Wolf
Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.
Thus said Jan Nijtmans on Tue, 08 Jul 2014 21:35:07 +0200:

> If you don't want this warning, just set 'encoding-glob' to '*'.

I might actually want encoding warnings though...

> But did you ever view this file in the fossil UI? Did the è really look like è there?

I did not. However, if I put the same file in any web server and serve it up, it displays correctly, probably because my browser figured out to use ISO-8859-1, or the server defaulted to it. If I try to view it with Fossil UI it refuses and instead says ``10062 bytes of binary data.'' I suppose this is largely true---all data is binary. :-) It would be nice if there were a button that said, ``display the bytes anyway.''

If I annotate the file it puts a different character there than was included in the .tcl script. If I then change my browser to ISO-8859-1 it displays fine.

Also, I notice that you converted (or your email client did) the è from my email to è which are not the same characters (at least not as far as the bytes are concerned). How did it manage to convert è to è?

> Better replace that by \xe8, that will make your script portable, working identically no matter what Tcl's system encoding is set to.

That's a good suggestion for fixing the Tcl script, but I'm still not sure why Fossil thinks that è is UTF-8. I thought it was extended ASCII.

> > I didn't think 0xe8 was UTF-8, but maybe I'm mistaken?
>
> In the fossil UI, all files are displayed assuming the encoding is UTF-8.

That explains the strange character displayed in the browser. If I switch my browser to ISO-8859-1 it displays fine.

> More likely is that people are not aware that such characters can cause unexpected problems.

The only thing unexpected has been the warning from Fossil for a file that previously had no warnings. :-) Sounds like my options are either to answer Yes, or update the Tcl file that I have stored in a Fossil repository to use \xe8.

Thanks,
Andy
--
TAI64 timestamp: 400053bc6507
Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.
Thus said Stephan Beal on Tue, 08 Jul 2014 21:37:50 +0200:

> No characters between 128 and 255 are valid UTF-8, to avoid confusion with the many encodings which use that range.

If no characters between 128 and 255 are valid UTF-8, and they can never be valid UTF-8 characters, and are used by many encodings, why doesn't Fossil simply ignore them when they are committed? I guess I'm confused why they are being treated specially as to warrant either a setting or a prompt to continue.

Thanks,
Andy
--
TAI64 timestamp: 400053bc6637
Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.
On Tue, Jul 8, 2014 at 3:38 PM, Andy Bradford amb-fos...@bradfords.org wrote:

> That's a good suggestion for fixing the Tcl script, but I'm still not sure why Fossil thinks that è is UTF-8. I thought it was extended ASCII.
>
> > > I didn't think 0xe8 was UTF-8, but maybe I'm mistaken?
> >
> > In the fossil UI, all files are displayed assuming the encoding is UTF-8.
>
> That explains the strange character displayed in the browser. If I switch my browser to ISO-8859-1 it displays fine.
>
> > More likely is that people are not aware that such characters can cause unexpected problems.
>
> The only thing unexpected has been the warning from Fossil for a file that previously had no warnings. :-) Sounds like my options are either to answer Yes, or update the Tcl file that I have stored in a Fossil repository to use \xe8.

UTF-8 characters are encoded as a series of strictly formatted bytes, from 1 to 4 bytes in length. The bit patterns of the bytes control whether a stream is considered valid UTF-8 or not. For UTF-8 the 0xE8 byte must be followed by two bytes of the form 0b10xxxxxx. The warning you are seeing is that the stream is invalid UTF-8.

The 0xE8 byte could be an extended ASCII character from one of the ISO-8859-X code pages. Or it could be real binary data that just happens to mostly have ASCII text in it.

I think the best idea is to encode these special characters as escaped sequences whenever possible.

--
Scott Robison
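The bit-pattern rules Scott describes can be sketched as a small validator. This is an illustration in Python following the byte ranges of RFC 3629, not the check Fossil itself performs; the function names are invented for the example.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the sequence length a lead byte announces, or 0 if it cannot lead."""
    if lead < 0x80:             # 0xxxxxxx: plain ASCII, one byte
        return 1
    if 0xC2 <= lead <= 0xDF:    # 110xxxxx (0xC0 and 0xC1 would be overlong)
        return 2
    if 0xE0 <= lead <= 0xEF:    # 1110xxxx: 0xE8 falls in this group
        return 3
    if 0xF0 <= lead <= 0xF4:    # 11110xxx (above 0xF4 would exceed U+10FFFF)
        return 4
    return 0                    # 10xxxxxx continuation byte or unused value


def first_invalid_offset(data: bytes) -> int:
    """Return the offset of the first ill-formed sequence, or -1 if all valid."""
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        length = utf8_sequence_length(lead)
        if length == 0 or i + length > n:
            return i
        # Every byte after the lead must have the form 10xxxxxx.
        for j in range(i + 1, i + length):
            if data[j] & 0xC0 != 0x80:
                return i
        # Extra limits on the second byte exclude overlong forms,
        # UTF-16 surrogates, and values beyond U+10FFFF.
        if length >= 3:
            second = data[i + 1]
            if lead == 0xE0 and second < 0xA0:
                return i
            if lead == 0xED and second > 0x9F:
                return i
            if lead == 0xF0 and second < 0x90:
                return i
            if lead == 0xF4 and second > 0x8F:
                return i
        i += length
    return -1


print(first_invalid_offset(b"caf\xe8"))      # 3: 0xE8 lacks its two followers
print(first_invalid_offset(b"caf\xc3\xa8"))  # -1: well-formed two-byte sequence
```

A lone 0xE8 fails because it announces a three-byte sequence and nothing of the form 10xxxxxx follows, which matches the situation in the Tcl script under discussion.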
Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.
Interesting question/option, but i have no answer. Something to possibly consider?

(sent from a mobile device - please excuse brevity, typos, and top-posting)
- stephan beal
http://wanderinghorse.net

On Jul 8, 2014 11:43 PM, Andy Bradford amb-fos...@bradfords.org wrote:

> Thus said Stephan Beal on Tue, 08 Jul 2014 21:37:50 +0200:
>
> > No characters between 128 and 255 are valid UTF-8, to avoid confusion with the many encodings which use that range.
>
> If no characters between 128 and 255 are valid UTF-8, and they can never be valid UTF-8 characters, and are used by many encodings, why doesn't Fossil simply ignore them when they are committed? I guess I'm confused why they are being treated specially as to warrant either a setting or a prompt to continue.
>
> Thanks,
> Andy
> --
> TAI64 timestamp: 400053bc6637
Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.
Thus said Scott Robison on Tue, 08 Jul 2014 15:48:05 -0600:

> The warning you are seeing is that the stream is invalid UTF-8. The 0xE8 byte could be an extended ASCII character from one of the ISO-8859-X code pages. Or it could be real binary data that just happens to mostly have ASCII text in it.

Until today, I didn't fully realize that Fossil treated all files as UTF-8 streams. Now it is being enforced, and that will reveal some files that have extended ASCII in them; the user will have to choose to either deal with the prompt each time they modify the file, consider altering the glob to ignore encodings, or change the character if possible.

It's starting to sink in though. Thanks for the explanation.

Andy
--
TAI64 timestamp: 400053bc6971
Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.
Andy,

> If no characters between 128 and 255 are valid UTF-8, and they can never be valid UTF-8 characters, and are used by many encodings, why doesn't Fossil simply ignore them when they are committed?

I think Stephan said it poorly. A solitary byte in that range is never valid UTF-8, but UTF-8 represents all code points higher than 127 as a sequence of bytes in the 128 to 255 range. Those byte sequences have a structure, so it is possible to tell if a string of bytes in that range represents a valid UTF-8 sequence.

-- Shal

--
Shal Farley                        s...@cheshireeng.com
Cheshire Engineering Corporation   http://www.CheshireEng.com
120 West Olive Avenue              +1 626 303 1602
Monrovia, CA 91016                 FAX +1 626 303 1590
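Shal's two claims are easy to check from the encoding side: every code point above 127 becomes a run of bytes that all lie in the 128 to 255 range, while no single byte in that range is valid on its own. A small Python sketch follows; the helper names and sample characters are chosen for illustration only.

```python
def utf8_bytes(ch: str) -> list[int]:
    """Return the UTF-8 encoding of one character as a list of byte values."""
    return list(ch.encode("utf-8"))


def lone_byte_is_valid(value: int) -> bool:
    """Check whether a single byte, by itself, is well-formed UTF-8."""
    try:
        bytes([value]).decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# One ASCII character, then code points needing 2, 3 and 4 bytes.
for ch in ["A", "\u00e8", "\u20ac", "\U0001F600"]:
    encoded = utf8_bytes(ch)
    high_only = all(b >= 0x80 for b in encoded)
    print(f"U+{ord(ch):04X}", [f"0x{b:02X}" for b in encoded], high_only)

# No byte in 128..255 stands alone, while every byte in 0..127 does.
print(any(lone_byte_is_valid(b) for b in range(128, 256)))  # False
print(all(lone_byte_is_valid(b) for b in range(128)))       # True
```

U+00E8 comes out as 0xC3 0xA8, the same pair mentioned elsewhere in this thread, and neither of those bytes is the 0xE8 that the ISO-8859-1 file contains.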
Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.
Thus said Stephan Beal on Tue, 08 Jul 2014 23:50:40 +0200:

> Interesting question/option, but i have no answer. Something to possibly consider?

Or perhaps just making the documentation more clear that all files must be valid UTF-8. There is already an option to control how encodings are handled, and it is versionable (which means that I can include it with the repository so others won't get warnings if they modify it).

I see the setting mentions it here, but I don't recall ever reading anything that indicated that I would get a warning for invalid UTF-8.

http://fossil-scm.org/index.html/help/setting

This could just be a tool acclimation problem on my part. :-)

Andy
--
TAI64 timestamp: 400053bc6b46