On Sun, 1 Nov 2015 12:31:59 -0800
Michael <keybou...@gmail.com> wrote:
> So I'm starting to be aware that the "index file" isn't a file, but
> effectively a full commit that isn't finalized.
> Can someone point me to a good explanation of what the index actually
> is, and is not, so I'm not trying to understand it by trial and error?
First, while currently, in stock Git, the index is indeed a file, this
is a so-called implementation detail. That is, you never need to know
that the index is implemented this way, and it might be reimplemented
some other way in some future Git release. (Of course, care will be
taken so that Git understands old ways of representing the index,
and/or will provide an option to upgrade the data representation
format in your existing repos. Oh, and while we're at it, the current Git
release implements something like the 3rd or 4th iteration of the index
format; that is, it gets improved over time, transparently).
Second, no, the index is not "a full commit that isn't finalized".
I think your attempt to give another meaning to what's already
well-defined works against you. "The index" is an unfortunate name
which originated from the initial purpose of the index (see below) but
stuck. The contemporary "official" name for this entity is "the staging
area".
OK, so what's there in the staging area?
The staging area contains entries for all the trees and blobs
(representations of filesystem directories and files) recorded in the
commit which is currently checked out -- what the HEAD reference
points at. The staging area is organized in a way which allows it to help
Git carry out certain tasks efficiently:
* In addition to the references to the data stored in the Git repository
each entry contains the so-called "stat information" which is grabbed
from the filesystem when a file is created in the work tree from its
entry in the index. The most important bit of this information is the
file's last modification time (usually dubbed "mtime").
Certain Git commands which compare the state of files in the work tree
to that of their corresponding entries in the staging area make use of
this information: if the file's mtime is exactly equal to that
recorded in the index for that file, there's no need to calculate the
SHA-1 hash over the copy of the file in the work tree to check if its
contents have been modified compared to what's in the index.
* Each file entry in the index is in fact capable of keeping three
references to its data, not one. This is useful to handle merge
conflicts: if, when merging, Git detects a merge conflict for a
particular file, it populates the index entry for that file with
three references to the different versions of the data of that
file: "ours" (local), "theirs" (remote, being merged in) and
"base" (the latest common version for both lines of history).
This provides for fast data inspection, comparison etc.
* The index is organized in a way which provides super-fast lookup
of the data by pathname and otherwise.
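To make the stat shortcut from the first point more concrete, here is a rough Python sketch. It is not how Git itself is written: the `Entry` class and the `make_entry` and `is_modified` helpers are names made up for illustration, and a real index entry records more stat fields than the two used here. The only part taken from Git is the way a blob is hashed (SHA-1 over a `blob <size>\0` header followed by the content).

```python
import hashlib
import os
import tempfile
from dataclasses import dataclass


def blob_id(data: bytes) -> str:
    """Hash content the way Git hashes a blob: SHA-1 over a header plus the data."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


@dataclass
class Entry:
    """Hypothetical, simplified index entry (real entries record more fields)."""
    blob: str       # ID of the content recorded for this path
    mtime_ns: int   # modification time seen when the entry was written
    size: int       # file size seen at the same moment


def make_entry(path: str) -> Entry:
    """Record the content ID together with the file's current stat data."""
    st = os.stat(path)
    with open(path, "rb") as f:
        return Entry(blob_id(f.read()), st.st_mtime_ns, st.st_size)


def is_modified(path: str, entry: Entry) -> bool:
    """Cheap check first; hash the file only when the stat data differs."""
    st = os.stat(path)
    if st.st_mtime_ns == entry.mtime_ns and st.st_size == entry.size:
        return False  # stat data matches: assume unchanged, skip hashing
    with open(path, "rb") as f:
        return blob_id(f.read()) != entry.blob


# Toy "index": a dict gives fast lookup of entries by pathname.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "hello.txt")
with open(path, "wb") as f:
    f.write(b"hello\n")
index = {path: make_entry(path)}
print(is_modified(path, index[path]))
```

The point of the design is that `os.stat` is far cheaper than reading and hashing every file, so the expensive path is taken only for files whose stat data no longer matches the recorded values.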
 and  -- in this order -- are good at explaining what the staging
area is all about.
OK, so why the name "index"?
An oft-forgotten thing about Git is that it began life as an
implementation of a so-called "content-addressable filesystem".
That is, the initial vision was to provide a set of low-level tools
which would provide very efficient means for maintaining several
connected "versions" (or "snapshots") of the states of a filesystem
(typically a directory on a conventional filesystem known as "the work
tree", but this does not matter for the concepts).
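If "content-addressable" sounds abstract, the idea can be sketched in a few lines of Python. This is a toy model only: the `ObjectStore` class and its `put` and `get` methods are made-up names, everything lives in memory, and the key is a plain SHA-1 of the content, whereas Git adds an object header and stores objects compressed on disk.

```python
import hashlib


class ObjectStore:
    """Toy in-memory content-addressable store (hypothetical, not Git's code)."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        """Store the data under a key derived from the data itself."""
        key = hashlib.sha1(data).hexdigest()
        self._objects[key] = data  # identical content always maps to the same key
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


store = ObjectStore()

# A "snapshot" is then just a mapping from pathname to content key.
v1 = {"README": store.put(b"hello\n"), "main.c": store.put(b"int main(){}\n")}
v2 = dict(v1, README=store.put(b"hello, world\n"))

# main.c is unchanged between the snapshots, so both share one stored object.
print(v1["main.c"] == v2["main.c"])
```

Because the key depends only on the content, two snapshots that share a file also share its stored object, which is what makes keeping many connected versions cheap.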
A new commit could in theory be created by directly taking each file
from the work tree. Comparison of what's in the work tree with what's
in the tip commit can be carried out this -- direct -- way as well.
Incremental committing ("staging") can also be done directly -- by
creating a new commit each time you `git add` new changes and replacing
the prospective new tip commit with it.
While this could work, it's inefficient. If you consider the "target
project" for managing with Git -- the Linux kernel, which contains
tens of thousands of files worth several hundred megabytes
-- you will understand that the straightforward approach outlined above
will suck big time performance-wise. Enter the index: it sits between
the object store (the repository's data) and the work tree and *caches*
the data which needs to be accessed quickly. Notice this name: in early
Git revisions, the index was itself called "the cache" (and hence you can
still run `git diff --cached` as well as `git diff --staged`).
Later, when the Git UI got overhauled several times to make it more
accessible to "outsiders" not familiar with low-level Git concepts, it
was deemed that "the staging area" is the most understandable naming as
it directly conveys the higher-level concept: a place to gradually
prepare the next commit by "staging" and "unstaging" the changes which
should go into it.