Re: Understanding Git Under The Hood: Trees

2013-08-16 Thread Andreas Ericsson

On 2013-08-15 21:32, Erik Bernoth wrote:

On Thu, Aug 15, 2013 at 7:31 PM, Junio C Hamano gits...@pobox.com wrote:

While the last statement applies to other parts of the system, it is
not true for the in-core index design.  We always had a flat index,
and it is not cheating at all.  The original tree was also a flat
representation of everything under the sun, and hierarchical tree
objects came much later.


To some degree that revalidates my interpretation of Andreas'
statements. If I understand it correctly eacht time a shell command is
executed, which requires tree interaction, the corresponding tree is
read from filesystem to memory completely before anything is done?



More or less, yes, but please don't confuse directory tree with git
tree. They're not the same. A directory tree can contain multiple
levels of directories, whereas a git tree can only contain a list
of objects. The index (aka staging area) represents a directory tree, 
but when it gets stored on-disk a directory tree gets broken down into

as many git trees as is necessary.

The index is just a cache though. Until changes have been staged to it
in preparation for the next commit, it can be recreated exactly from
the currently checked out commit. As Junio pointed out, the index has 
been flat from the very beginning. Don't confuse the index with the

git tree objects found in the object storage though, or the working tree
with git trees. They're really not the same.

To illustrate the differences, here's a few commands and what they do
and operate on, with regards to the three different kinds of trees that
have come up in this discussion.


Ignore everything git-related and only print the worktree:
  find .

Ignores everything index- and worktree-related and only print the root
git tree of the currently checked out commit. You won't see any
relative paths or directories in there; Just a list of trees and blobs:
  git cat-file -p $(git cat-file -p HEAD | sed -n 's/^tree //p;q')


List staged files only, regardless of what you have in the worktree or
what the latest commit looks like. This will look pretty much like the
last command, but with files located in subdirectories as well, and an
additional field where the index-state is stored:
  git ls-files -s


 So
 if I git-add a file, the whole index is read first, then the memory
 object is changed and then the resulting change is written to disk
 bottom-up from the point of view of the tree?


When you git-add a file, we read in the index, update it with the new
contents of the file you pointed to, or add the new file to it if the
file isn't known to us since before. We also add the blob to the
object store and write out the new tree(s) to the object store as well.
Then we write out the new index, and then we're done. We do all that
bottom up, as you say, or the object store will be inconsistent after
we started writing root objects but before we're done writing leaf
objects.

For a simple git-add, that's it, and you'll now see git status list
files as added to the index without being committed. They're what we
call staged at this point.

If you also do git commit after having done git-add, we write out a
commit object, pointing to its parent commit and the root tree we
created in the git-add stage. git cat-file -p HEAD will give you an
idea of how that looks.

--
Andreas Ericsson   andreas.erics...@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Understanding Git Under The Hood: Trees

2013-08-16 Thread Erik Bernoth
Hi Andreas,

you gave me a lot of new insight and keywords I can google (Junio as
well!). Thanks a lot!

On Fri, Aug 16, 2013 at 11:12 AM, Andreas Ericsson a...@op5.se wrote:
 More or less, yes, but please don't confuse directory tree with git
 tree. They're not the same. A directory tree can contain multiple
 levels of directories, whereas a git tree can only contain a list
 of objects. The index (aka staging area) represents a directory tree, but
 when it gets stored on-disk a directory tree gets broken down into
 as many git trees as is necessary.

I was confusing it in a way, that I didn't even realize that it was
confusion. I thought from the Git-Book chapter 9 that the index itself
would store a git-tree reference, like a commit does. Therefore in my
own git add implementation I always started with reading and writing
git-trees without any cache in the middle. With a simple, flat index
in the middle of course the whole problem becomes much simpler!

Also after working through the man files of the plumbing commands you
showed I can use that much better in my daily git usage. Can't thank
you guys enough!

I think I start my python-git from scratch again, now that I
understand everything much better. For now I assume the following
algorithms:

 a) git add path/to/file

   1. read the index into a memory object (probably a dictionary like
index = { 'path/to/file' : (mode, sha1), ... })
   2. write a blob of path/to/file to the object store
   3. update index[path/to/file] with the new sha1
   4. write updated index to filesystem

b) git commit -m msg

   1. read index to memory
   2. recursively create memory git-tree objects top-down
   3. write git-tree objects to object store recursively bottom-up,
tracking the sha1 of child trees for parent trees
   4. add root-sha1 and commit-msg to memory commit object (author,
committer and so on can be added later)
   5. write commit object to object store
   6. update HEAD (branches will be added later)
   7. clean index (index = {})
   8. write empty index to filesystem

Cheers
Erik
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Understanding Git Under The Hood: Trees

2013-08-15 Thread Andreas Ericsson

On 2013-08-15 12:29, Erik Bernoth wrote:

Hi,

I'm currently trying to understand the inner workings of git better by
writing a git clone in Python. I find it rather hard to understand how
to efficiently use trees.

What I understand is this: Trees are in essence blobs with a specific
content. The content is a relation between other blobs names, types
and modes. With these lines they can refer to other blobs and trees
and thus make a tine filesystem.

Now I constructed a system to read and write blobs and have an index
file that potentially references to a top tree object, which
represents the currently cached repository state. I can add and remove
files from the index manually, by creating the blob of the file and
working my way up adding or updating trees until I hit the one
referenced in INDEX. My algorithm to automate it is really ugly, error
prone and inefficient, though. I also have a hard time to find my way
around in C files, so maybe some people here in the list could explain
the algorithm in Git to me.

suppose we have the following Index:

INDEX
  - tree
- tree a/
- blob b.txt
- tree c/
- blob d.txt

Now you want to stage a file with the following reference a/e/g.txt.

One approach would be to walk top-down, splitting the path into its
elements and looking for the corresponding trees, either retrieving an
existing tree or creating a new one. Then finally create the blob
g.txt and be done with it. This seems rather inefficient, though,
because each created or updated tree means that all trees way back up
need to be updated as well, once for every step in the loop.

The other way is to go bottom-up, first creating the blob, then
creating trees up to the project root folder. But then I don't see a
way to find which tree elements already exist and need to be updated.

So the only algorithm I can come up with is this:
  1. walk down the tree with help of the path string to the tree that
is closest to the file I want to store. On the way remember all the
trees on the path from INDEX to the resulting file. (In the example
above I'd like to get the hash of the a/ tree)
  2. create the blob (in the example with the context of g.txt)
  3. create the trees bottom-up until one step before the tree found in
1. (in the example create a e/ tree, containing the g.txt's blob)
  4. Add the resulting tree from 3. to the one found in 1. and create
the updated hash
  5. Now with help of the list from 1. walk the existing trees
bottom-up and update each one with the new hashes until INDEX is hit.
  6. Update INDEX.


Alltogether the idea of trees looked really simple and recursive which
makes me quite unhappy with the algorithm I came up with.


What is the algorithm to stage single files in Git itself?


You seem to believe that the in-memory representation of trees have to
be the same as the on-disk one. That's simply not true. Git cheats
outrageously with internal formats for pretty much everything in order
to squeeze out more performance.

You also seem to believe that a tree is more than one directory.
That's not true either.

So... Each tree that's updated can (and does) have an in-memory link
to its parent directory. Whenever we update a directory and create a
commit from it, we make sure to write them out from right to left (ie, 
children before parents), so that we're never in a state where a tree on 
disk can't find any of its content blobs in the same object storage.


With that information in hand, I'm sure you can quite easily create an
algorithm that turns out a bit prettier than what you have now.



Also: I thought to myself, why not just make a function that returns
the relative path to the repository root folder and consider that
string the file name, drop the idea of trees at all and add the
information that is traditionally stored in tree objects directly in a
commit object.


Because commit objects must have parents and things like that. Also,
the idea of trees containing trees saves a *huge* amount of disk I/O
and space when updating a project with many subdirectories.


Wouldn't that be much simpler and still accomplish the
same?


Simpler; Yes. More efficient; No.

The commit that changed the tree structure from flat to hierarchical
dates back to April of 2005. It's commit number 25 in the history of
git and solves one of the first real problems from live-testing git
on the kernel repository, which was that tree-files got so huge when
they had to be rewritten in their entirety that creating a new commit
took several seconds.


I think the idea of keeping information in separate small files
instead of single big files was dropped at one point anyway, when
pack-files were introduced.



Not really. We strive hard to minimize disk I/O whenever we can. The
packfiles actually reduce I/O, since it lets the kernel use the disk's
builtin caches a lot better, and read-ahead works to our advantage.

Btw, when I say minimize disk I/O, I really mean minimize user 

Re: Understanding Git Under The Hood: Trees

2013-08-15 Thread Junio C Hamano
Andreas Ericsson a...@op5.se writes:

 You seem to believe that the in-memory representation of trees have to
 be the same as the on-disk one. That's simply not true. Git cheats
 outrageously with internal formats for pretty much everything in order
 to squeeze out more performance.

While the last statement applies to other parts of the system, it is
not true for the in-core index design.  We always had a flat index,
and it is not cheating at all.  The original tree was also a flat
representation of everything under the sun, and hierarchical tree
objects came much later.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Understanding Git Under The Hood: Trees

2013-08-15 Thread Erik Bernoth
On Thu, Aug 15, 2013 at 7:31 PM, Junio C Hamano gits...@pobox.com wrote:
 While the last statement applies to other parts of the system, it is
 not true for the in-core index design.  We always had a flat index,
 and it is not cheating at all.  The original tree was also a flat
 representation of everything under the sun, and hierarchical tree
 objects came much later.

To some degree that revalidates my interpretation of Andreas'
statements. If I understand it correctly eacht time a shell command is
executed, which requires tree interaction, the corresponding tree is
read from filesystem to memory completely before anything is done? So
if I git-add a file, the whole index is read first, then the memory
object is changed and then the resulting change is written to disk
bottom-up from the point of view of the tree?
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html