On Mon, Dec 08, 2014 at 12:09:57PM -0800, Shal Farley wrote:
> Stephan,
> 
> > If it has ANY bytes above 127, it's not, by definition, ASCII. i.e.
> > "it's binary."
> 
> I would disagree with part of this statement. I agree that ASCII
> defines only the 7-bit code values, but I think this whole thread
> has run off the rails in talking about the content values as
> determining whether the file is "text" or "binary".
> 
> But this discussion of content heuristics misses the point of why
> there is a distinction to be made in the first place. And that I
> think has more to do with whether the content is organized into
> "lines".
> 
> In a functional sense for Fossil, a "text" file is one for which it
> is useful to display a line-oriented difference. For all other files
> ("binary" files) the difference can only be displayed in a way that
> is agnostic of the internal structure (if any) of the content.
> 
> Given that there is no universal heuristic for discriminating "text"
> from "binary" files based on content, that determination must be
> treated as a bit of metadata about the file.
> 
> Likewise, it is necessary to know for a given file what
> representation is used to separate lines. Knowledge of the line
> separator is seldom carried as metadata, because it is usually
> uniform in a given system. But in these days of interoperable
> systems and multi-platform support, this detail also may be a
> necessary piece of metadata to know about a file. ASCII code calls
> out the CR (carriage return) and LF (line-feed) control characters.
> DOS-based systems (including Windows) follow the direct ASCII
> tradition of using CR and LF, paired in that order (and often
> represented as CRLF) as the line separator. That tradition is also
> embodied in the Internet Mail standards for message content, header
> and body (absent MIME extensions). Unix-based systems use the LF
> character alone as the line separator in files (aka "newline").
> Other systems have used CR alone.
> 
> And additionally, the character set used to represent text in a file
> must also be carried as metadata (because of the ISO-8859 and other
> code-page based character sets).
> 
> Only if all these items of metadata are known can the file content,
> or differences in the file content, be displayed in a useful form.
> So returning to this thread, it is convenient to have a heuristic
> that works most of the time to discriminate "text" from "binary"
> files, but it is necessary to also have a way for the user to
> explicitly provide that metadata (and ideally the character set
> metadata).
> 

+1

So, to summarize: we could implement a text-glob setting that works like
binary-glob, except that it would force matching files to be treated as
text instead of binary. If a file doesn't match either of those two glob
settings, the default Fossil heuristic would be used.

Does that make sense?
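
To make the proposal concrete, here is a rough sketch of the lookup order
I have in mind. It is Python rather than Fossil's C, and everything in it
is hypothetical: the names classify and looks_binary are made up, the
NUL-byte test is only a stand-in for whatever heuristic Fossil really
uses, and letting text-glob win when both lists match is just one
possible choice.

```python
from fnmatch import fnmatchcase


def looks_binary(content: bytes) -> bool:
    # Stand-in heuristic (an assumption, not Fossil's actual check):
    # treat any content containing a NUL byte as binary.
    return b"\0" in content


def classify(path, content, text_globs=(), binary_globs=()):
    """Return "text" or "binary" for a file.

    text_globs and binary_globs are iterables of glob patterns, standing
    in for the proposed text-glob and the existing binary-glob settings.
    """
    # Explicit user metadata comes first. Assumption: text-glob is
    # checked before binary-glob, so it wins if both match.
    if any(fnmatchcase(path, pattern) for pattern in text_globs):
        return "text"
    if any(fnmatchcase(path, pattern) for pattern in binary_globs):
        return "binary"
    # No glob matched: fall back to the content heuristic.
    return "binary" if looks_binary(content) else "text"
```

With this, classify("data.csv", b"a;b\x00", text_globs=["*.csv"]) gives
"text" even though the heuristic alone would have said "binary".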

-- 
Martin G.

_______________________________________________
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users
