<to...@acm.org> wrote:
> No irony at all. Certainly a file is either *assumed* to be text or
> binary, not both at the same time. But isn’t ‘binary’ (or ‘text’) a
> matter of perspective/interpretation? I don’t know of any formal
> definition or international standard of what constitutes ‘binary’ or
> ‘text’ files. For example, since CR’s are not expected in Linux text
> files (unlike with Windows), having them in your file makes it binary
> instead of text?
>
> I could claim that a file containing all 256 ASCII codes is a text
> file for my use.

And how could one possibly distinguish a file containing all 256
possible byte values from a binary file?

Referring to "all 256 ASCII codes" is a misnomer. There is *no* *such*
*thing* as "8-bit ASCII", and using that term is a source of confusion.
ASCII is by definition a 7-bit code. Sure, there are a host of 8-bit ISO
standard encodings that embed ASCII within them, such as what's commonly
referred to as Latin-1 (perhaps that's what you're using?), so I suppose
you could call that an "extended ASCII encoding", but using the term
"8-bit ASCII" indicates a fundamental misunderstanding.

> On the contrary I could also claim that a file containing the string
> ‘Hello’ is binary because of how it is treated by my app. Maybe it’s
> an accidental display of a certain integer pattern.

It certainly could be. Given an arbitrary block of bits, there is no
certain way to determine whether it is intended as text or not without
knowing what encoding is being used for text. Since SQLite (which
underlies Fossil) uses UTF-8 encoded Unicode (which includes [7-bit]
ASCII as a subset) as its text encoding, *any* byte sequence that is not
valid UTF-8 makes the file "binary", whether you intended it to be or
not.

> At any rate, the distinction seems less important these days when most
> ‘text’ editors can load and let you edit practically any file,
> regardless of content. In the old days, text editors would usually
> choke if they were given binary files.

They may not choke, but they generally don't let you edit random bit
patterns as if they were text either. Try running "diff" on two binary
files. You will get a message "files differ", or some such. Not a real
comparison. ("cmp" gives you the first byte where a difference was
detected.)
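
To make the UTF-8 rule above concrete, here is a minimal sketch in
Python. It is not Fossil's or SQLite's actual check, and the function
names are made up for illustration; it only shows that a strict decode
gives a yes/no answer with no guessing involved.

```python
def is_7bit_ascii(data: bytes) -> bool:
    """True if every byte is below 0x80, i.e. fits in 7-bit ASCII."""
    return all(b < 0x80 for b in data)


def looks_like_utf8_text(data: bytes) -> bool:
    """True if the bytes decode as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # error handling defaults to "strict"
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    samples = {
        "plain ASCII": b"Hello\r\n",
        "UTF-8 with a curly apostrophe": "isn\u2019t".encode("utf-8"),
        "Latin-1 with an accented e": "caf\u00e9".encode("latin-1"),
        "all 256 byte values": bytes(range(256)),
    }
    for label, data in samples.items():
        print(label, is_7bit_ascii(data), looks_like_utf8_text(data))
```

Under this rule the Latin-1 sample and the "all 256 byte values" sample
both count as binary, while a stray CR does not change the answer. Note
that a NUL byte is also valid UTF-8, so a real tool may well add further
checks on top of this.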

> To clarify, by ‘mostly’ I meant that even though looking at a file
> with a text editor you can easily determine the file is text, there
> may be a couple of characters inside it that do not conform to what
> most (?) people would expect in a text file. However, is that enough
> to claim the file is not text when ‘obviously’ it is (i.e., you can
> read it in a text editor and just skip over the couple of funny
> looking special characters)?

Yes it is. How much is enough to determine if a file is "really" text?
It may be easy (or not) as a human judgement call, but a program is
something else, and I'd rather have a simple rule (such as "text" is
UTF-8 encoded) than some fuzzy heuristic that is sure to fail when you
don't want it to.

--
Will
_______________________________________________
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users