Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.

2014-07-22 Thread Stephan Beal
On Tue, Jul 8, 2014 at 9:37 PM, Stephan Beal sgb...@googlemail.com wrote:

 No characters between 128 and 255 are valid UTF-8, to avoid confusion with
 the many encodings which use that range.


For the record, that's apparently wrong. My local man pages (and
experimentation with the termbox API) say otherwise:

   Encoding
   The following byte sequences are used to represent a character.
   The sequence to be used depends on the UCS code number of the
   character:

   0x00000000 - 0x0000007F:
       0xxxxxxx

   0x00000080 - 0x000007FF:
       110xxxxx 10xxxxxx

So the range is used, but it encodes to two UTF-8 characters.
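
The two-byte pattern in the man page excerpt can be checked by encoding a code point in the 0x80 - 0x7FF range and printing the bits. This is only a minimal Python sketch; the helper name `utf8_bits` is made up for illustration.

```python
def utf8_bits(ch: str) -> str:
    """Return the UTF-8 encoding of one character as space-separated bit strings."""
    return " ".join(f"{b:08b}" for b in ch.encode("utf-8"))


# U+00E8 (è) lies in the 0x80 - 0x7FF range, so it takes two bytes:
# a lead byte 110xxxxx followed by a continuation byte 10xxxxxx.
print(utf8_bits("\u00e8"))  # 11000011 10101000

# U+0041 (A) is below 0x80 and stays a single byte 0xxxxxxx.
print(utf8_bits("A"))       # 01000001
```

The two bytes printed for è are 0xC3 0xA8.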


-- 
- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
Freedom is sloppy. But since tyranny's the only guaranteed byproduct of
those who insist on a perfect world, freedom will have to do. -- Bigby Wolf
___
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users


Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.

2014-07-22 Thread Ron W
On Tue, Jul 22, 2014 at 11:48 AM, Stephan Beal sgb...@googlemail.com
wrote:

 On Tue, Jul 8, 2014 at 9:37 PM, Stephan Beal sgb...@googlemail.com
 wrote:

 No characters between 128 and 255 are valid UTF-8, to avoid confusion
 with the many encodings which use that range.


 For the record, that's apparently wrong. My local man pages (and
 experimentation with the termbox API) say otherwise:
 
 So the range is used, but it encodes to two UTF-8 characters.


Actually, 1 Unicode character encoded into 2 UTF-8 bytes.

FWIW, FYI, UTF-8 has an optional Byte Order Mark, 0xEF 0xBB 0xBF, that can
appear at the beginning of a file. This is just the UTF-8 encoding of code
point U+FEFF, which is the actual Unicode Byte Order Mark. For UTF-8,
this mark is really only useful as a suggestion that the following text
might be UTF-8 encoded Unicode. For UTF-16 and UTF-32 encodings, this mark
is used to inform the receiver of the text of the order of bytes within the
16 or 32 bit encoding units (presuming that the file is actually UTF-16 or
UTF-32 encoded text).
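
The byte values above can be reproduced with Python's standard codecs. This is a small sketch, not a statement about how Fossil handles a BOM; the sample string "hello" is arbitrary.

```python
import codecs

bom = "\ufeff"  # U+FEFF, the code point used as the byte order mark

print(bom.encode("utf-8"))      # b'\xef\xbb\xbf'
print(bom.encode("utf-16-le"))  # b'\xff\xfe'
print(bom.encode("utf-16-be"))  # b'\xfe\xff'

# The "utf-8-sig" codec drops a leading BOM when decoding, if one is present.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")
text = data.decode("utf-8-sig")
print(text)  # hello
```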


Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.

2014-07-22 Thread Andy Bradford
Thus said Stephan Beal on Tue, 22 Jul 2014 19:01:27 +0200:

 One would think  i'd be more conscious  of how i throw  around byte vs
 character :/. i'm still not clear on the whole char-vs-code point bit,
 though.

The whole  char-vs-codepoint has always  been unclear for me,  no matter
how many times I read technical descriptions of it. :-)

Andy
-- 
TAI64 timestamp: 400053cf4d1f




Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.

2014-07-10 Thread Jan Nijtmans
2014-07-09 0:05 GMT+02:00 Andy Bradford amb-fos...@bradfords.org:
 Or perhaps just making the documentation  more clear that all files must
 be valid UTF-8.

Oh no, fossil doesn't require at all that all files are valid UTF-8. Only
fossil ui assumes UTF-8 encoding for non-binary files, otherwise it
cannot display the file content in a reasonable way to the user. If the
file is not UTF-8, it just might look strange in the UI, that's all.

I added the possibility to convert a file to a valid UTF-8 stream when
doing a fossil commit. That will not always be what's desired: fossil
will convert è (0xe8) to è (0xc3 0xa8) for you if you answer 'c' to
the prompt. If you want something else (e.g. escaping), fossil cannot
do that for you.
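
The same byte-level conversion can be reproduced outside Fossil. The sketch below assumes the source bytes are ISO-8859-1; which source encoding Fossil itself assumes is not stated in this thread, and the sample line of Tcl is invented.

```python
raw = b"set accent \xe8\n"       # a lone 0xE8 byte, which is not valid UTF-8

text = raw.decode("iso-8859-1")  # assumption: the file is ISO-8859-1
converted = text.encode("utf-8")

print(converted)  # b'set accent \xc3\xa8\n'
```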

Regards,
   Jan Nijtmans


[fossil-users] File contains invalid UTF-8, but is not UTF-8.

2014-07-08 Thread Andy Bradford
Hello,

I have some Tcl scripts (for IRC) that previously had no problems when I
committed. They  don't have UTF-8 characters  at all, but when  I try to
commit them I get the warning:

./test.tcl contains invalid UTF-8. Use --no-warnings or the encoding-glob 
setting to disable this warning.

Prior  to  [0cb00c0b8f4e5b03]  I  was   able  to  commit  these  without
errors/warnings and without encoding-glob settings.  I'm not sure why it
thinks they  have UTF-8 characters (or  invalid ones at that).  They are
just ASCII with a few  non-printable characters (0x03 primarily) for IRC
colors and one è (0xe8) character.

If I remove the è (0xe8) character I can commit.

I didn't think 0xe8 was UTF-8, but maybe I'm mistaken?

Thanks,

Andy
--
TAI64 timestamp: 400053bc3cec


Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.

2014-07-08 Thread Jan Nijtmans
2014-07-08 20:47 GMT+02:00 Andy Bradford amb-fos...@bradfords.org:
 Hello,

 I have some Tcl scripts (for IRC) that previously had no problems when I
 committed. They  don't have UTF-8 characters  at all, but when  I try to
 commit them I get the warning:

 ./test.tcl contains invalid UTF-8. Use --no-warnings or the encoding-glob 
 setting to disable this warning.

If you don't want this warning, just set 'encoding-glob' to '*'. But
did you ever view this file in the fossil UI? Did the è really
look like è there? The warning is meant to tell you that
Tcl scripts containing a 0xe8 byte are not portable: which character
the byte represents depends on the encoding. Better replace
it with \xe8; that will make your script portable,
working identically no matter what Tcl's system encoding is set to.

 I didn't think 0xe8 was UTF-8, but maybe I'm mistaken?

In the fossil UI, all files are displayed assuming the encoding
is UTF-8. Invalid bytes are displayed in the browser as the
replacement character. If you want that, that's OK, just answer
'y' to the question. More likely is that people are not aware
that such characters can cause unexpected problems.
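
What a UTF-8 viewer does with such a byte can be imitated in Python, whose `errors="replace"` mode substitutes U+FFFD, the replacement character. This is a rough illustration with an invented sample string, not Fossil's or any browser's actual code path.

```python
raw = b"tr\xe8s"  # ASCII text plus one ISO-8859-1 byte (è)

as_utf8 = raw.decode("utf-8", errors="replace")
as_latin1 = raw.decode("iso-8859-1")

print(as_utf8 == "tr\ufffds")    # True: 0xE8 became U+FFFD
print(as_latin1 == "tr\u00e8s")  # True: 0xE8 read as è
```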

Hope this helps.

Regards,
   Jan Nijtmans


Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.

2014-07-08 Thread Stephan Beal
On Tue, Jul 8, 2014 at 8:47 PM, Andy Bradford amb-fos...@bradfords.org
wrote:

 If I remove the è (0xe8) character I can commit.

 I didn't think 0xe8 was UTF-8, but maybe I'm mistaken?


No characters between 128 and 255 are valid UTF-8, to avoid confusion with
the many encodings which use that range.



Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.

2014-07-08 Thread Andy Bradford
Thus said Jan Nijtmans on Tue, 08 Jul 2014 21:35:07 +0200:

 If you don't want this warning, just set 'encoding-glob' to '*'.

I might actually want encoding warnings though...

 But did you ever view this file in the fossil UI?
 Did the è  really look like è there?

I did not, however,  if I put the same file in any  web server and serve
it up it displays correctly, probably  because my browser figured out to
use ISO-8859-1, or the server defaulted to it.

If I try to  view it with Fossil UI it refuses  and instead says ``10062
bytes of  binary data.'' I  suppose this  is largely true---all  data is
binary. :-) It would be nice if there were a button that said, ``display
the bytes anyway.''

If I  annotate the  file it  puts a different  character there  than was
included in the  .tcl script. If I then change  my browser to ISO-8859-1
it displays fine.

Also, I notice that you converted (or your email client did) the è from
my email (a single 0xe8 byte) to a differently encoded è; the two are not
the same as far as the bytes are concerned. How did it manage to convert
one to the other?

 Better replace  that by  \xe8, that will  make your  script portable,
 working identically no matter what Tcl's system encoding is set to.

That's a  good suggestion for fixing  the Tcl script, but  I'm still not
sure why Fossil thinks that è is UTF-8. I thought it was extended ASCII.

  I didn't think 0xe8 was UTF-8, but maybe I'm mistaken?

 In the  fossil UI, all  files are  displayed assuming the  encoding is
 UTF-8.

That  explains the  strange character  displayed  in the  browser. If  I
switch my browser to ISO-8859-1 it displays fine.

 More likely  is that  people are  not aware  that such  characters can
 cause unexpected problems.

The only  thing unexpected has been  the warning from Fossil  for a file
that previously had no warnings. :-)

Sounds like my options are either to  answer Yes, or update the Tcl file
that I have stored in a Fossil repository to use \xe8.

Thanks,

Andy
--
TAI64 timestamp: 400053bc6507


Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.

2014-07-08 Thread Andy Bradford
Thus said Stephan Beal on Tue, 08 Jul 2014 21:37:50 +0200:

 No characters between 128 and 255  are valid UTF-8, to avoid confusion
 with the many encodings which use that range.

If no characters between 128 and 255 are valid UTF-8, and they can never
be valid UTF-8  characters, and are used by many  encodings, why doesn't
Fossil simply ignore them when they  are committed? I guess I'm confused
why they are being treated specially as to warrant either a setting or a
prompt to continue.

Thanks,

Andy
--
TAI64 timestamp: 400053bc6637


Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.

2014-07-08 Thread Scott Robison
On Tue, Jul 8, 2014 at 3:38 PM, Andy Bradford amb-fos...@bradfords.org
wrote:

 That's a  good suggestion for fixing  the Tcl script, but  I'm still not
 sure why Fossil thinks that è is UTF-8. I thought it was extended ASCII.

   I didn't think 0xe8 was UTF-8, but maybe I'm mistaken?
 
  In the  fossil UI, all  files are  displayed assuming the  encoding is
  UTF-8.

 That  explains the  strange character  displayed  in the  browser. If  I
 switch my browser to ISO-8859-1 it displays fine.

  More likely  is that  people are  not aware  that such  characters can
  cause unexpected problems.

 The only  thing unexpected has been  the warning from Fossil  for a file
 that previously had no warnings. :-)

 Sounds like my options are either to  answer Yes, or update the Tcl file
 that I have stored in a Fossil repository to use \xe8.


UTF-8 characters are encoded as a series of strictly formatted bytes, from
1 to 4 bytes in length. The bit patterns of the bytes control whether a
stream is considered valid UTF-8 or not. For UTF-8, the 0xE8 byte must be
followed by two bytes of the form 0b10xxxxxx. The warning you are seeing is
that the stream is invalid UTF-8. The 0xE8 byte could be an extended ASCII
character from one of the ISO-8859-X code pages. Or it could be real binary
data that just happens to mostly have ASCII text in it.
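
The rule for 0xE8 can be sketched in Python. The helper names `expected_length` and `is_valid_utf8` are invented for this example; the first only reads the lead byte, and the second defers to Python's strict decoder rather than reimplementing every rule. Nothing here is taken from Fossil's own checker.

```python
def expected_length(lead: int) -> int:
    """Sequence length implied by a lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1  # 0xxxxxxx
    if 0xC2 <= lead <= 0xDF:
        return 2  # 110xxxxx (0xC0 and 0xC1 would only give overlong forms)
    if 0xE0 <= lead <= 0xEF:
        return 3  # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4  # 11110xxx, capped so the result stays within U+10FFFF
    return 0      # continuation byte or a byte that never appears in UTF-8


def is_valid_utf8(data: bytes) -> bool:
    """Use Python's strict decoder to apply the full set of UTF-8 rules."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(expected_length(0xE8))           # 3
print(is_valid_utf8(b"\xe8"))          # False: the two continuation bytes are missing
print(is_valid_utf8(b"\xe8 abc"))      # False: 0x20 is not of the form 10xxxxxx
print(is_valid_utf8(b"\xe8\x80\x80"))  # True: a complete three-byte sequence
```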

I think the best idea is to encode these special characters as escaped
sequences whenever possible.

-- 
Scott Robison


Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.

2014-07-08 Thread Stephan Beal
Interesting question/option, but i have no answer. Something to possibly
consider?

(sent from a mobile device - please excuse brevity, typos, and top-posting)
- stephan beal
http://wanderinghorse.net
On Jul 8, 2014 11:43 PM, Andy Bradford amb-fos...@bradfords.org wrote:

 Thus said Stephan Beal on Tue, 08 Jul 2014 21:37:50 +0200:

  No characters between 128 and 255  are valid UTF-8, to avoid confusion
  with the many encodings which use that range.

 If no characters between 128 and 255 are valid UTF-8, and they can never
 be valid UTF-8  characters, and are used by many  encodings, why doesn't
 Fossil simply ignore them when they  are committed? I guess I'm confused
 why they are being treated specially as to warrant either a setting or a
 prompt to continue.

 Thanks,

 Andy
 --
 TAI64 timestamp: 400053bc6637



Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.

2014-07-08 Thread Andy Bradford
Thus said Scott Robison on Tue, 08 Jul 2014 15:48:05 -0600:

 The warning you  are seeing is that the stream  is invalid UTF-8. 0xE8
 byte could be an extended ASCII character from one of the ISO-8859-X
 code  pages. Or  it could  be real  binary data  that just  happens to
 mostly have ASCII text in it.

Until today,  I didn't fully  realize that  Fossil treated all  files as
UTF-8 streams. Now it is being  enforced and that will reveal some files
that have  extended ASCII in  them and the user  will have to  choose to
either deal  with the prompt  each time  they modify the  file, consider
altering  the  glob to  ignore  encodings  or  change the  character  if
possible. It's starting to sink in though.

Thanks for the explanation.

Andy
--
TAI64 timestamp: 400053bc6971


Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.

2014-07-08 Thread Shal Farley

Andy,

 If no characters between 128 and 255 are valid UTF-8, and they can
 never be valid UTF-8  characters, and are used by many  encodings,
 why doesn't Fossil simply ignore them when they  are committed?

I think Stephan said it poorly. A solitary byte in that range is never 
valid UTF-8, but UTF-8 represents all code points higher than 127 as a 
sequence of bytes in the 128 to 255 range. Those byte sequences have a 
structure, so it is possible to tell if a string of bytes in that range 
represents a valid UTF-8 sequence.
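
To make that structure visible, the sketch below labels each byte by its top bits. The function name `byte_roles` is hypothetical, and the "lead" label is deliberately loose: it covers every byte from 0xC0 to 0xFF, not all of which are legal lead bytes.

```python
def byte_roles(data: bytes):
    """Classify each byte as ASCII, a continuation byte, or a (possible) lead byte."""
    roles = []
    for b in data:
        if b < 0x80:
            roles.append("ascii")         # 0xxxxxxx
        elif b & 0xC0 == 0x80:
            roles.append("continuation")  # 10xxxxxx
        else:
            roles.append("lead")          # 11xxxxxx
    return roles


# è encoded as UTF-8: a lead byte followed by its continuation byte.
print(byte_roles("a\u00e8".encode("utf-8")))  # ['ascii', 'lead', 'continuation']

# A solitary 0xE8 between ASCII bytes: a lead byte with nothing following it.
print(byte_roles(b"a\xe8b"))                  # ['ascii', 'lead', 'ascii']
```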


-- Shal

--
Shal Farley s...@cheshireeng.com
Cheshire Engineering Corporation  http://www.CheshireEng.com
120 West Olive Avenue    +1 626 303 1602
Monrovia, CA 91016   FAX +1 626 303 1590


Re: [fossil-users] File contains invalid UTF-8, but is not UTF-8.

2014-07-08 Thread Andy Bradford
Thus said Stephan Beal on Tue, 08 Jul 2014 23:50:40 +0200:

 Interesting  question/option,  but  i  have no  answer.  Something  to
 possibly consider?

Or perhaps just making the documentation  more clear that all files must
be valid UTF-8. There is already  an option to control how encodings are
handled, and it  is versionable (which means that I  can include it with
the repository so others won't get warnings if they modify it).

I see  the setting  mentions it  here, but I  don't recall  ever reading
anything that indicated that I would get a warning for invalid UTF-8.

http://fossil-scm.org/index.html/help/setting

This could just be a tool acclimation problem on my part. :-)

Andy
--
TAI64 timestamp: 400053bc6b46