On Sun, 1 Nov 2015 12:31:59 -0800
Michael <keybou...@gmail.com> wrote:

> So I'm starting to be aware that the "index file" isn't a file, but
> effectively a full commit that isn't finalized.
> 
> Can someone point me to a good explanation of what the index actually
> is, and is not, so I'm not trying to understand it by trial and error?

First, while currently, in stock Git, the index is indeed a file, this
is a so-called implementation detail.  That is, you never need to know
that the index is implemented this way, and it might be reimplemented
some other way in some future Git release. (Of course, care will be
taken so that Git understands old ways of representing the index,
and/or will provide an option to upgrade the data representation
format in your existing repos.  Oh, and while we're on it, the current Git
release implements something like the 3rd or 4th iteration of the index
format; that is, it gets improved over time, transparently.)

Second, no, the index is not "a full commit that isn't finalized".
I think your attempt to give another meaning to what's already
well-defined works against you.  "The index" is an unfortunate name
which originated from the initial purpose of the index (see below) but
stuck.  The contemporary "official" name for this entity is "the staging
area".

OK, so what's there in the staging area?

The staging area contains an entry for every file recorded in the
commit which is currently checked out -- what the HEAD reference
points at -- plus whatever you have staged on top of that.  (Directories
normally get no entries of their own; each file entry carries its full
pathname.)  The staging area is organized in a way which allows it to
help Git carry out certain tasks efficiently:

* In addition to the references to the data stored in the Git repository,
  each entry contains the so-called "stat information" which is grabbed
  from the filesystem when a file is created in the work tree from its
  entry in the index.  The most important bit of this information is the
  file's last modification time (usually dubbed "mtime").

  Certain Git commands which compare the state of files in the work tree
  to that of their corresponding entries in the staging area make use of
  this information: if the file's mtime is exactly equal to that
  recorded in the index for that file, there's no need to calculate the
  SHA-1 hash over the copy of the file in the work tree to check if its
  contents have been modified compared to what's in the index.

* Each file entry in the index is in fact capable of keeping three
  references to its data, not one.  This is useful to handle merge
  conflicts:  if, when merging, Git detects a merge conflict for a
  particular file, it populates the index entry for that file with
  three references to the different versions of the data of that
  file: "ours" (local), "theirs" (remote, being merged in) and
  "base" (the latest common version for both lines of history).

  This provides for fast data inspection, comparison etc.

* The index is organized in a way which provides super-fast lookup
  of the data by pathname: the entries are kept sorted by path.
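
To make the first two points a bit more tangible, here is a toy model in
Python.  It is only a sketch and not Git's actual data format: the names
`Entry`, `ToyIndex`, `stage_file`, `is_modified` and `record_conflict` are
made up for this illustration, the hash is a plain SHA-1 of the content,
and real Git looks at more stat fields than just the mtime and the size.

```python
import hashlib
import os
from dataclasses import dataclass


def content_id(data: bytes) -> str:
    # Simplification: Git hashes a short header plus the content; a plain
    # SHA-1 of the content keeps this sketch short.
    return hashlib.sha1(data).hexdigest()


@dataclass
class Entry:
    object_id: str  # reference to the data in the object store
    mtime_ns: int   # stat information captured when the entry was written
    size: int


class ToyIndex:
    def __init__(self):
        # Keyed by (pathname, stage). Stage 0 is a normal entry; stages
        # 1, 2 and 3 hold "base", "ours" and "theirs" during a conflict.
        self.entries = {}
        self.hash_calls = 0  # how many times a file had to be re-hashed

    def stage_file(self, path):
        with open(path, "rb") as f:
            data = f.read()
        st = os.stat(path)
        self.entries[(path, 0)] = Entry(content_id(data), st.st_mtime_ns, st.st_size)

    def is_modified(self, path):
        entry = self.entries[(path, 0)]
        st = os.stat(path)
        if (st.st_mtime_ns, st.st_size) == (entry.mtime_ns, entry.size):
            return False  # stat information matches: no hashing needed
        self.hash_calls += 1
        with open(path, "rb") as f:
            return content_id(f.read()) != entry.object_id

    def record_conflict(self, path, base, ours, theirs):
        self.entries.pop((path, 0), None)  # the stage-0 entry goes away
        for stage, data in ((1, base), (2, ours), (3, theirs)):
            self.entries[(path, stage)] = Entry(content_id(data), 0, len(data))

    def conflicted_paths(self):
        return sorted({path for (path, stage) in self.entries if stage != 0})
```

Calling `is_modified()` on an untouched file returns without reading the
file at all, which is the whole point of keeping the stat information
around; only when the stat data differs does the sketch fall back to
hashing the contents.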

[1] and [2] -- in this order -- are good at explaining what the staging
area is all about.


OK, so why the name "index"?

An oft-forgotten thing about Git is that it began life as an
implementation of a so-called "content-addressable filesystem".
That is, the initial vision was to provide a set of low-level tools
which would provide very efficient means for maintaining several
connected "versions" (or "snapshots") of the states of a filesystem
(typically a directory on a conventional filesystem known as "the work
tree", but this does not matter for the concepts).
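
If "content-addressable" sounds abstract, the following Python sketch
may help.  It is an in-memory stand-in, not how Git stores things on
disk, and `ToyObjectStore` with its `put` and `get` methods is a name
invented for this example.  The one detail taken from Git is how a blob
gets its name: the SHA-1 of the header "blob <size>" plus a NUL byte,
followed by the content.

```python
import hashlib


class ToyObjectStore:
    """In-memory stand-in for a content-addressable store (hypothetical)."""

    def __init__(self):
        self.objects = {}

    def put(self, data: bytes) -> str:
        # The name of an object is derived from its content alone.
        key = hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()
        self.objects[key] = data  # storing the same content again is a no-op
        return key

    def get(self, key: str) -> bytes:
        return self.objects[key]


store = ToyObjectStore()
first = store.put(b"hello\n")
second = store.put(b"hello\n")  # same content, hence the same name
print(first == second, len(store.objects))  # True 1
```

Two files with identical contents end up as one stored object, and any
"version" of the filesystem only has to record which names it refers to.
You can compare the names this sketch produces with what
`git hash-object` prints for the same data.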

A new commit could in theory be created by directly taking each file
from the work tree.  Comparison of what's in the work tree with what's
in the tip commit can be carried out this -- direct -- way as well.
Incremental committing ("staging") can also be done directly -- by
creating a new commit each time you `git add` new changes and replacing
the prospective new tip commit with it.

While this could work, it's inefficient.  If you consider the "target
project" for managing with Git -- the Linux kernel, which contains
several tens of thousands of files worth several hundred megabytes
-- you will understand that the straightforward approach outlined above
will suck big time performance-wise.  Enter the index: it sits between
the object store (the repository's data) and the work tree and *caches*
the data which needs to be accessed quickly.  Notice this name: in early
Git, "the cache" and "the index" were used pretty much interchangeably
for this entity (and hence you can still run `git diff --cached` as well
as `git diff --staged`).

Later, when the Git UI got overhauled several times to make it more
accessible to "outsiders" not familiar with low-level Git concepts, it
was deemed that "the staging area" is the most understandable naming as
it directly conveys the higher-level concept: a place to gradually
prepare the next commit by "staging" and "unstaging" the changes which
should go into it.

[1] https://jwiegley.github.io/git-from-the-bottom-up/
[2] https://git-scm.com/blog/2011/07/11/reset.html
