Re: [fossil-users] How to force text for all files?

Will Parsons Sun, 07 Dec 2014 15:02:13 -0800

<to...@acm.org> wrote:
> No irony at all.  Certainly a file is either *assumed* to be text or
> binary, not both at the same time.  But isn’t ‘binary’ (or ‘text’) a
> matter of perspective/interpretation?  I don’t know of any formal
> definition or international standard of what constitutes ‘binary’ or
> ‘text’ files.  For example, since CR’s are not expected in Linux text
> files (unlike with Windows), having them in your file makes it binary
> instead of text?
>
> I could claim that a file containing all 256 ASCII codes is a text file
> for my use.


And how could one possibly distinguish a file containing all 256
possible byte values from a binary file?  Referring to "all 256 ASCII codes"
is a misnomer.  There is *no* *such* *thing* as "8-bit ASCII", and
using that term is a source of confusion.  ASCII is by definition a
7-bit code.  Sure, there are a host of 8-bit ISO standard encodings
that embed ASCII within them, such as what's commonly referred to as
Latin-1 (perhaps that's what you're using?), so I suppose you could
call that an "extended ASCII encoding", but using the term "8-bit
ASCII" indicates a fundamental misunderstanding.
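
To make the 7-bit point concrete, here is a minimal Python sketch.  The
function name is_seven_bit_ascii is made up for illustration; the only
assumption is that "ASCII" means every byte is below 0x80.

```python
def is_seven_bit_ascii(data: bytes) -> bool:
    """Return True if every byte fits in 7 bits (values 0x00..0x7F)."""
    return all(b < 0x80 for b in data)


# A file holding every possible byte value once.
all_bytes = bytes(range(256))

# Only the first 128 values are ASCII; the rest are not.
lower_half_is_ascii = is_seven_bit_ascii(all_bytes[:128])
whole_range_is_ascii = is_seven_bit_ascii(all_bytes)

# Latin-1, by contrast, assigns a character to all 256 byte values,
# so decoding never fails, whatever the bytes "really" are.
as_latin1 = all_bytes.decode("latin-1")
```

Since Latin-1 accepts any byte sequence at all, "it decodes as Latin-1"
tells you nothing about whether the file was meant as text.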

> On the contrary I could also claim that a file containing the string
> ‘Hello’ is binary because of how it is treated by my app.  Maybe it’s
> an accidental display of a certain integer pattern.

It certainly could be.  Given an arbitrary block of bits, there is no
certain way to determine whether it is intended as text or not without
knowing what encoding is being used for text.  Since SQLite (which
underlies Fossil) uses UTF-8 encoded Unicode (which includes [7-bit]
ASCII as a subset) as its text encoding, *any* byte that is not part
of a UTF-8 encoding makes the file "binary", whether you intended it
to be or not.
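
As a rough sketch of that rule in Python (this is only an illustration
of "text means valid UTF-8", not Fossil's actual check, and the name
is_utf8_text is hypothetical):

```python
def is_utf8_text(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")  # strict decoding: raises on any invalid byte
    except UnicodeDecodeError:
        return False
    return True


plain_ascii = b"Hello"                      # ASCII is a subset of UTF-8
curly_quote = "isn\u2019t".encode("utf-8")  # multi-byte but valid UTF-8
latin1_word = "caf\u00e9".encode("latin-1") # lone 0xE9 byte: invalid UTF-8

results = [is_utf8_text(plain_ascii),
           is_utf8_text(curly_quote),
           is_utf8_text(latin1_word)]
```

Under this rule a single stray Latin-1 byte is enough to make the whole
file "binary", which is exactly the point above.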

> At any rate, the distinction seems less important these days when most
> ‘text’ editors can load and let you edit practically any file,
> regardless of content.  In the old days, text editors would usually
> choke if they were given binary files.

They may not choke, but they generally don't let you edit random bit
patterns as if they were text either.  Try running "diff" on two binary
files.  You will get a message "files differ", or some such.  Not a
real comparison.  ("cmp" gives you the first byte where a difference
was detected.)
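
For comparison, a byte-level check in the spirit of "cmp" can be
sketched in a few lines of Python.  The helper first_difference is
hypothetical, and it reports 0-based offsets, whereas "cmp" itself
numbers bytes from 1.

```python
from typing import Optional


def first_difference(a: bytes, b: bytes) -> Optional[int]:
    """Return the 0-based offset of the first differing byte.

    Returns None if the inputs are identical.  If one input is a
    prefix of the other, the offset is the length of the shorter one.
    """
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


offset = first_difference(b"\x00\x01\x02", b"\x00\x01\xff")
```

That tells you where two blobs diverge, but nothing like the line-by-line
comparison you get for text.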

> To clarify, by ‘mostly’ I meant that even though looking at a file with
> a text editor you can easily determine the file is text, there may be a
> couple of characters inside it that do not conform to what most (?)
> people would expect in a text file.  However, is that enough to claim
> the file is not text when ‘obviously’ it is (i.e., you can read it in a
> text editor and just skip over the couple of funny-looking special
> characters)?

Yes, it is.  How much is enough to determine if a file is "really" text?
It may be easy (or not) for a human to make that judgement call, but a
program is something else, and I'd rather have a simple rule (such as
"text" is UTF-8 encoded) than some fuzzy heuristic that is sure to fail
when you don't want it to.

-- 
Will

_______________________________________________
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users
