Re: [fossil-users] How to force text for all files?

Will Parsons Sun, 07 Dec 2014 18:57:52 -0800

<to...@acm.org> wrote:
> -----Original Message----- 
> From: Will Parsons
>
>>And how could one possibly distinguish a file containing all 256 byte
>> bit patterns from a binary file?
>
> That's the point, in effect you can't.  It's up to you to decide how to 
> interpret a file.
>
>>Referring to "all 256 ASCII codes" is a misnomer.  There is *no* *such* 
>>*thing* as "8-bit ASCII", and using that term is a source of confusion.
>
> Fine detail if you want to be pedantic but irrelevant to the issue!  I should 
> have said all possible 8-bit values because whether you decide to interpret 
> them as ASCII or EBCDIC or whatever is another matter.
> So, we can limit the problem to the actual 7-bit ASCII or change ASCII to 
> EBCDIC and keep it 8 bits.  Now, if you have a file containing byte values 
> between 0 and 127, is the file binary or text?  There is no correct answer, 
> is there?


No, there isn't (although for any substantial content where the bytes
were in the printable ASCII range [plus some expected non-printable
exceptions such as space and LF], one might reasonably *assume* it was
ASCII).
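
To make that assumption concrete, here is a minimal Python sketch of such
a check.  The function name and the exact set of tolerated control
characters (TAB, LF, CR) are my own illustrative choices, not anything
Fossil does; a True result only means the content is *plausibly* ASCII
text, not that it was intended as such.

```python
# Bytes accepted as "plausibly ASCII text": the range 0x20-0x7E
# (space through tilde) plus a few common control characters.
# The control set (TAB, LF, CR) is an assumption for illustration.
ALLOWED = frozenset(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}


def looks_like_ascii_text(data: bytes) -> bool:
    """Return True if every byte falls in the assumed text-like set.

    This is a heuristic guess about intent, not a proof: a binary
    format could happen to use only these byte values.
    """
    return all(b in ALLOWED for b in data)
```

Note that an empty file passes the check trivially, which is one more
reminder that the result is an assumption rather than an answer.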

>>using the term "8-bit ASCII" indicates a fundamental misunderstanding.
>
> Actually, the term 8-bit ASCII is very commonly used (even if it is not 
> pedantically correct), although you still need to know the code page to use 
> for the upper half of the codes, but it usually refers to ISO Latin-1.

You are correct in stating that the term "8-bit ASCII" is commonly
used.  But, when I've seen it used, it almost always indicates a
misunderstanding on the user's part.  Perhaps you regard this as
"pedantic", but I ask you to think about this a little bit more before
totally dismissing my remarks.

>>It certainly could be.  Given an arbitrary block of bits, there is no 
>>certain way to determine whether it is
>> intended as text or not without knowing what encoding is being used for 
>> text.
>
> Agreed.
>
>>Since SQLite (which underlies Fossil) uses UTF-8 encoded Unicode (which 
>>includes [7-bit] ASCII as a subset)
>>as its text encoding, *any* byte that is not part of a UTF-8 encoding makes 
>>the file "binary", whether
>>you intended it to be or not.
>
> I don’t think so.  SQLite is just a storage medium.  Its job is to save what 
> I give it, not to interpret its meaning.
> So, if SQLite decided to use Chinese for its encoding, all my ASCII text 
> files would become binary?

What do you mean by "to use Chinese"?  The supported method for
storing Chinese is to use Unicode (encoded as UTF-8, or, as I
neglected to mention before, UTF-16).  If you do that, then I would
think Chinese is text just like a European language stored as
UTF-encoded Unicode is text.

>>I'd rather have a simple rule (such as "text" is UTF-8 encoded)
>> than some fuzzy heuristic that is sure to fail when you don't want it to.
>
> One rule that never fails is to let me decide what a file is because I know 
> what my files contain better than anyone else!

If you're asking that Fossil should let you mark a given file as
"text" by some option, then I think that's a reasonable request, but
if you're asking that Fossil should guess that arbitrary content that
is not valid UTF should be magically recognized as "text", then no, I
don't think that's a reasonable expectation.
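
For illustration, the simple rule I have in mind ("text" means valid
UTF-8) can be sketched in a few lines of Python.  This is only a sketch
of the rule itself, with a function name of my own choosing; it is not
Fossil's actual detection code, which may apply further checks.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 with no errors.

    Pure 7-bit ASCII always passes, since ASCII is a subset of UTF-8.
    """
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True
```

With this rule, is_valid_utf8(b"caf\xe9") is False, because a lone 0xE9
byte (an accented "e" in ISO Latin-1) is not a complete UTF-8 sequence,
while the same word encoded as UTF-8 passes.  The rule gives a definite
answer for every input, which is its whole appeal over a heuristic.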

(That brings up a meta-question - why don't you simply encode your
files as UTF-8 encoded Unicode?  It would be better in the long run.)
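
As a rough sketch of what that conversion could look like, assuming the
existing files really are ISO Latin-1 (that is an assumption about your
data, and the function name is just for illustration):

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode bytes assumed to be ISO Latin-1 as UTF-8.

    Every byte value is a valid Latin-1 character, so this never raises;
    if the input was really in some other code page, the output will be
    valid UTF-8 but the wrong characters.
    """
    return data.decode("latin-1").encode("utf-8")
```

Pure ASCII content comes out byte-for-byte unchanged; only bytes in the
upper half (0x80-0xFF) are expanded to two-byte UTF-8 sequences.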

-- 
Will
