Does the "mode" (text/binary) factor into the SHA1 fingerprint hash of
an artifact (single file) or commit?

-bch


On 12/8/14, Martin Gagnon <eme...@gmail.com> wrote:
> On Mon, Dec 08, 2014 at 12:09:57PM -0800, Shal Farley wrote:
>> Stephan,
>>
>> > If it has ANY bytes above 127, it's not, by definition, ASCII. i.e.
>> > "it's binary."
>>
>> I would disagree with part of this statement. I agree that ASCII
>> defines only the 7-bit code values, but I think this whole thread
>> has run off the rails in talking about the content values as
>> determining whether the file is "text" or "binary".
>>
>> But this discussion of content heuristics misses the point of why
>> there is a distinction to be made in the first place. And that I
>> think has more to do with whether the content is organized into
>> "lines".
>>
>> In a functional sense for Fossil, a "text" file is one for which it
>> is useful to display a line-oriented difference. For all other files
>> ("binary" files) the difference can only be displayed in a way that
>> is agnostic of the internal structure (if any) of the content.
>>
>> Given that there is no universal heuristic for discriminating "text"
>> from "binary" files based on content, that determination must be
>> treated as a bit of metadata about the file.
>>
>> Likewise, it is necessary to know for a given file what
>> representation is used to separate lines. Knowledge of the line
>> separator is seldom carried as metadata, because it is usually
>> uniform in a given system. But in these days of interoperable
>> systems and multi-platform support, this detail also may be a
>> necessary piece of metadata to know about a file. ASCII code calls
>> out the CR (carriage return) and LF (line-feed) control characters.
>> DOS-based systems (including Windows) follow the direct ASCII
>> tradition of using CR and LF, paired in that order (and often
>> represented as CRLF) as the line separator. That tradition is also
>> embodied in the Internet Mail standards for message content, header
>> and body (absent MIME extensions). Unix-based systems use the LF
>> character alone as the line separator in files (aka "newline").
>> Other systems have used CR alone.
>>
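
To make the three line-separator conventions above concrete, here is a
minimal Python sketch that counts how often each one occurs in a byte
string. The function name count_line_separators is made up for this
illustration; it is not something Fossil provides.

```python
def count_line_separators(data: bytes) -> dict:
    """Count CRLF pairs, lone LFs and lone CRs in raw file content."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair contributes one CR and one LF, so subtract the
    # pairs to get the separators that stand alone.
    lone_cr = data.count(b"\r") - crlf
    lone_lf = data.count(b"\n") - crlf
    return {"CRLF": crlf, "LF": lone_lf, "CR": lone_cr}


if __name__ == "__main__":
    sample = b"dos line\r\nunix line\nold mac line\rlast\r\n"
    print(count_line_separators(sample))
```

A file where more than one of the counts is non-zero has mixed line
endings, which is exactly the case where a fixed per-platform
assumption breaks down.
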
>> And additionally, the character set used to represent text in a file
>> must also be carried as metadata (because of the ISO-8859 and other
>> code-page based character sets).
>>
>> Only if all these items of metadata are known can the file content,
>> or differences in the file content, be displayed in a useful form.
>> So returning to this thread, it is convenient to have a heuristic
>> that works most of the time to discriminate "text" from "binary"
>> files, but it is necessary to also have a way for the user to
>> explicitly provide that metadata (and ideally the character set
>> metadata).
>>
>
> +1
>
> To summarize: we could implement a text-glob setting that works like
> binary-glob, except that it would force text instead of binary. If a
> file doesn't match either of those two glob settings, the default
> Fossil heuristic would be used.
>
> Does that make sense?
>
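
A rough Python sketch of the lookup order proposed above, purely as an
illustration. Assumptions: text-glob does not exist yet, so its name
and behaviour here are hypothetical; the proposal does not say which
setting wins when both match, so this sketch arbitrarily checks
text-glob first; and looks_binary() is a stand-in (NUL byte check),
not Fossil's actual heuristic.

```python
from fnmatch import fnmatchcase


def looks_binary(data: bytes) -> bool:
    # Stand-in heuristic for illustration only: a NUL byte means binary.
    # Fossil's real content check is not reproduced here.
    return b"\x00" in data


def classify(path, data, text_globs=(), binary_globs=()):
    """Return "text" or "binary" for one file.

    Explicit glob settings override the content heuristic.
    """
    # Hypothetical text-glob: forces "text" regardless of content.
    if any(fnmatchcase(path, g) for g in text_globs):
        return "text"
    # binary-glob: forces "binary" regardless of content.
    if any(fnmatchcase(path, g) for g in binary_globs):
        return "binary"
    # No explicit metadata from the user: fall back to the heuristic.
    return "binary" if looks_binary(data) else "text"


if __name__ == "__main__":
    print(classify("notes.txt", b"\x00\x01", text_globs=["*.txt"]))
    print(classify("logo.png", b"plain", binary_globs=["*.png"]))
    print(classify("main.c", b"int x;\n"))
```

The point is only the ordering: user-supplied metadata first, content
heuristic last.
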
> --
> Martin G.
>
> _______________________________________________
> fossil-users mailing list
> fossil-users@lists.fossil-scm.org
> http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users
>