<to...@acm.org> wrote:
> -----Original Message-----
> From: Will Parsons
>
>>And how could one possibly distinguish a file containing all 256 byte
>> bit patterns from a binary file?
>
> That's the point, in effect you can't. It's up to you to decide how to
> interpret a file.
>
>>Referring to "all 256 ASCII codes" is a misnomer. There is *no* *such*
>>*thing* as "8-bit ASCII", and using that term is a source of confusion.
>
> Fine detail if you want to pedantic but irrelevant to the issue! I should
> have said all possible 8-bit values because whether you decide to interpret
> them as ASCII or EBCDIC or whatever is another matter.
> So, we can limit the problem to the actual 7-bit ASCII or change ASCII to
> EBCDIC and keep it 8 bits. Now, if you have a file containing bytes values
> between 0 and 127, is the file binary or text? There is no correct answer,
> is there?

No, there isn't (although for any substantial content where the bytes
were in the printable ASCII range [plus some expected non-printable
exceptions such as space and LF], one might reasonably *assume* it was
ASCII).

>>using the term "8-bit ASCII" indicates a fundamental misunderstanding.
>
> Actually, the term 8-bit ASCII is very commonly used (even if it is not
> pedantically correct), although you still need to know the code page to use
> for the upper half of the codes, but it usually refers to ISO Latin-1.

You are correct in stating that the term "8-bit ASCII" is commonly used.
But when I've seen it used, it almost always indicates a misunderstanding
on the user's part.  Perhaps you regard this as "pedantic", but I ask you
to think about this a little bit more before totally dismissing my remarks.

>>It certainly could be. Given an arbitrary block of bits, there is no
>>certain way to determine whether it is
>> intended as text or not without knowing what encoding is being used for
>> text.
>
> Agreed.
>
>>Since SQLite (which underlies Fossil) uses UTF-8 encoded Unicode (which
>>includes [7-bit] ASCII as a subset)
>>as its text encoding, *any* byte that is not part of a UTF-8 encoding makes
>>the file "binary", whether
>>you intended it to be or not.
>
> I don’t think so. SQLite is just a storage medium. Its job is to save what
> I give it, not to interpret its meaning.
> So, if SQLite decided to use Chinese for its encoding, all my ASCII text
> files would become binary?

What do you mean by "to use Chinese"?  The supported method for storing
Chinese is to use Unicode (encoded as UTF-8, or, as I neglected to mention
before, UTF-16).  If you do that, then I would think Chinese is text just
like a European language in UTF-encoded Unicode is text.

>>I'd rather have simple rule (such as "text" is UTF-8 encoded) rather
>> than some fuzzy heuristic that is sure to fail when you don't want it to.
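
To make the "simple rule" concrete, here is a minimal Python sketch of what
I have in mind.  It is only an illustration, not Fossil's actual code, and
the function name classify_content is made up for this example.  It assumes
the rule is exactly "content is text if and only if it decodes as UTF-8",
so it does not treat NUL bytes or control characters specially.

```python
def classify_content(data: bytes) -> str:
    """Return "text" if data is valid UTF-8, otherwise "binary".

    Hypothetical helper for illustration only; this is not how Fossil
    itself decides.
    """
    try:
        # Strict decoding raises UnicodeDecodeError on any byte sequence
        # that is not well-formed UTF-8.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "binary"
    return "text"


if __name__ == "__main__":
    samples = {
        "7-bit ASCII": b"hello, world\n",
        "Chinese as UTF-8": "中文".encode("utf-8"),
        "Latin-1 with a high byte": "café".encode("latin-1"),
        "all 256 byte values": bytes(range(256)),
    }
    for label, data in samples.items():
        print(f"{label}: {classify_content(data)}")
```

Under this rule, pure 7-bit ASCII and UTF-8 encoded Chinese both count as
text, while a Latin-1 file containing a single byte above 127 counts as
binary, as does a file containing all 256 byte values.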
>
> One rule that never fails is to let me decide what a file is because I know
> what my files contain better than anyone else!

If you're asking that Fossil should let you mark a given file as "text" by
some option, then I think that's a reasonable request, but if you're asking
that Fossil should guess that arbitrary content that is not valid UTF
should be magically recognized as "text", then no, I don't think that's a
reasonable expectation.

(That brings up a meta-question: why don't you simply encode your files as
UTF-8 encoded Unicode?  It would be better in the long run.)

-- 
Will
_______________________________________________
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users