On Mon, Dec 08, 2014 at 12:09:57PM -0800, Shal Farley wrote:
> Stephan,
>
> > If it has ANY bytes above 127, it's not, by definition, ASCII. i.e.
> > "it's binary."
>
> I would disagree with part of this statement. I agree that ASCII
> defines only the 7-bit code values, but I think this whole thread
> has run off the rails in talking about the content values as
> determining whether the file is "text" or "binary".
>
> But this discussion of content heuristics misses the point of why
> there is a distinction to be made in the first place. And that I
> think has more to do with whether the content is organized into
> "lines".
>
> In a functional sense for Fossil, a "text" file is one for which it
> is useful to display a line-oriented difference. For all other files
> ("binary" files) the difference can only be displayed in a way that
> is agnostic of the internal structure (if any) of the content.
>
> Given that there is no universal heuristic for discriminating "text"
> from "binary" files based on content, that determination must be
> treated as a bit of metadata about the file.
>
> Likewise, it is necessary to know for a given file what
> representation is used to separate lines. Knowledge of the line
> separator is seldom carried as metadata, because it is usually
> uniform in a given system. But in these days of interoperable
> systems and multi-platform support, this detail also may be a
> necessary piece of metadata to know about a file. ASCII code calls
> out the CR (carriage return) and LF (line-feed) control characters.
> DOS-based systems (including Windows) follow the direct ASCII
> tradition of using CR and LF, paired in that order (and often
> represented as CRLF) as the line separator. That tradition is also
> embodied in the Internet Mail standards for message content, header
> and body (absent MIME extensions). Unix-based systems use the LF
> character alone as the line separator in files (aka "newline").
> Other systems have used CR alone.
>
> And additionally, the character set used to represent text in a file
> must also be carried as metadata (because of the ISO-8859 and other
> code-page based character sets).
>
> Only if all these items of metadata are known can the file content,
> or differences in the file content, be displayed in a useful form.
> So returning to this thread, it is convenient to have a heuristic
> that works most of the time to discriminate "text" from "binary"
> files, but it is necessary to also have a way for the user to
> explicitly provide that metadata (and ideally the character set
> metadata).
>
+1

So, to summarize, we could implement a text-glob setting that works
like binary-glob, except that it would force text instead of binary.
If a file doesn't match either of those two glob settings, the default
Fossil heuristic would be used.

Does that make sense?

--
Martin G.
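
To make the proposal concrete, here is a rough Python sketch of the
lookup order. It is not Fossil's actual code. text-glob is only the
setting proposed in this thread; the function names, the use of
Python's fnmatch for glob matching, the NUL-byte check standing in for
the default heuristic, and the choice to let text-glob win when both
settings match are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase


def matches_any(path, patterns):
    """Return True if path matches at least one glob pattern."""
    # Assumption: fnmatch-style globs; Fossil's own glob rules may differ.
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def looks_binary(data):
    """Stand-in content heuristic: treat any NUL byte as a sign of binary.

    Assumption: this is NOT Fossil's real heuristic, just a placeholder.
    """
    return b"\x00" in data


def classify(path, data, text_globs=(), binary_globs=()):
    """Classify a file as "text" or "binary".

    Explicit user metadata (the glob settings) is consulted first; the
    content heuristic is used only when neither setting matches.
    """
    # Assumption: text-glob takes precedence if both settings match.
    if matches_any(path, text_globs):
        return "text"
    if matches_any(path, binary_globs):
        return "binary"
    return "binary" if looks_binary(data) else "text"
```

With something like this, a Latin-1 file with bytes above 127 would
still be diffed line by line as long as its name matches a text-glob
pattern, whatever the heuristic thinks of its content.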