Re: [git-users] SHA-1 checksum

Konstantin Khomoutov Mon, 08 Aug 2016 08:51:16 -0700

On Sun, 7 Aug 2016 09:26:30 -0700 (PDT)
Sharan Basappa <[email protected]> wrote:


> I would like to know why GIT calculates checksum of a file.
> Typically, checksum is used for the purpose of integrity.

Well, Git does this for two reasons:

1) It's what makes the "D" in "DVCS" ("Distributed Version Control
System") possible.  When two Git instances exchange histories from
their repositories over the wire, they need to have a way to figure out
what parts of them they share.  Now suppose that the user of the first
repository created a file containing the string "Hello world" and named
that file "foo.txt".  The user of the second repository created a file
with identical contents but named it "bar.txt" and placed it in a
directory named "stuff".  If we look at file names only, these files
are clearly different.  But they have identical contents, and that is
what DVCSes exchange with each other.

Enter cryptographic hashes.  They have two major properties:
* Identical sets of data "compress" to identical hash values.
* No two different sets of data compress to identical hash values
  (well, in fact it's theoretically possible for real-world hash
  functions to fail to keep this invariant, and that is called
  "a collision", but such an event is quite improbable in real-world
  applications).

So cryptographic hashes neatly serve as short "handles" to chunks of
data of arbitrary size: with my toy example of the data string
"Hello world" it's not quite obvious, but a cryptographic hash is
perfectly able to uniquely identify the contents of a multi-megabyte
file as well.
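
To make the foo.txt / bar.txt example concrete, here is a minimal
sketch in Python.  It relies on the fact that Git names a file's
contents (a "blob") by the SHA-1 of a short header, "blob <size>"
followed by a NUL byte, plus the contents themselves.  The helper name
git_blob_id and the third file, other.txt, are my own inventions for
the illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 name Git would give a blob with this content."""
    # Git hashes "blob <size in bytes>\0" followed by the raw content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Same contents under different paths, plus one file that differs by a
# single character.  The path is never part of what gets hashed.
files = {
    "foo.txt": b"Hello world",
    "stuff/bar.txt": b"Hello world",
    "other.txt": b"Hello world!",
}

ids = {path: git_blob_id(data) for path, data in files.items()}
for path, oid in ids.items():
    print(oid, path)
```

The first two lines printed carry the same 40-character ID, the third a
different one, so two repositories only need to compare these short IDs
to learn that they already share that content.  Running
"git hash-object" on a real file should print the same ID as the helper
does for that file's bytes.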

2) At its very bottom, Git implements a so-called
"content-addressable filesystem".  Its chief principle is that every
unique piece of data is stored exactly once, and these pieces are
identified by their contents.  Since using the contents "as is" would
be unwieldy, each piece is addressed using -- again -- the cryptographic
hash calculated over its contents.  This is what lets Git effectively
implement its paradigm where each commit refers to a complete state of
all the project's files: even though something like 99.9% of the
content of each commit in a typical big project is the same as in its
parent commit, each unique chunk of information -- a file or a tree
referring to a set of files -- is stored in the repository exactly once.
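
The "stored exactly once" idea can be sketched with a toy in-memory
store in Python.  This is only an illustration under simplifying
assumptions: the ObjectStore class, the snapshot helper and the textual
tree listing are hypothetical and much simpler than Git's real on-disk
object format, and the file names and contents are made up.

```python
import hashlib


class ObjectStore:
    """Toy content-addressable store: objects are keyed by their hash."""

    def __init__(self):
        self.objects = {}

    def put(self, kind: str, payload: bytes) -> str:
        data = kind.encode() + b" " + str(len(payload)).encode() + b"\0" + payload
        oid = hashlib.sha1(data).hexdigest()
        # If this exact content is already present, nothing new is stored.
        self.objects.setdefault(oid, data)
        return oid

    def put_tree(self, entries: dict) -> str:
        # A "tree" here is just a sorted text listing of "<blob id> <name>".
        listing = "".join(f"{oid} {name}\n" for name, oid in sorted(entries.items()))
        return self.put("tree", listing.encode())


def snapshot(store: ObjectStore, files: dict) -> str:
    """Record a complete project state and return the ID of its tree."""
    return store.put_tree(
        {name: store.put("blob", data) for name, data in files.items()}
    )


store = ObjectStore()
v1 = {"README": b"docs\n", "main.c": b"int main(void) { return 0; }\n"}
v2 = dict(v1, README=b"docs, updated\n")  # only README changes

t1 = snapshot(store, v1)
count_after_v1 = len(store.objects)  # two blobs and one tree
t2 = snapshot(store, v2)
count_after_v2 = len(store.objects)  # one new blob and one new tree
print(count_after_v1, count_after_v2)
```

Both snapshots describe the full state of the project, yet the
unchanged main.c is kept only once: the second snapshot adds just the
new README blob and a new tree pointing at it.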
