On Mon, 13 Jun 2016 21:24:55 -0700 (PDT)
Jack Poon <j...@atcipher.com> wrote:

> I am trying to parallelize git activities for a large git repository. 
>  Would like to know if the inner core of git-add is thread-safe?  For 
> example,
> 
> % git add file1 &
> % git add file2 &
> % git add file3 &
> ...
> % git add file100 &
> 
> assuming file1 - file100 are large files... 
> 
> Is there a hidden switch in 'git-add' that will do multiple files in 
> parallel?  For example,
> % git add --thread=10 file1 file2 file3 file4 .... file100

(Disclaimer: these details are pretty hard-core and are better
discussed on the main Git list.)

`git add` basically works like this (for each file name passed to it
on the command-line):

The SHA-1 hash is calculated over the file's contents -- this requires
reading the entire file -- and then that content ("the blob") gets
written into the object database hierarchy (which is kept under the
".git" directory), and an appropriate entry is inserted into the index,
or updated there if one already exists.  Presently, in the stock Git
implementation the index is a single file on the file system, and
writing an entry to the index basically means writing an updated copy
of the index file and then atomically renaming it over the existing
one.
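
To make that concrete, here is roughly the plumbing equivalent of
`git add file1` (a sketch; "file1" is just a placeholder name):

% sha=$(git hash-object -w file1)
% git update-index --add --cacheinfo 100644,"$sha",file1

The first command writes the blob and prints its SHA-1; the second
records that blob in the index under the given path.  (Plain
`git update-index --add file1` does both steps in one go.)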

The operation of shoveling the file's contents into the object store
while calculating the SHA-1 hash over it is done using the
`git hash-object -w` plumbing command.
This command works in such a way that if a blob which hashes to the
same SHA-1 value already exists in the object store, nothing is
written there and the existing blob gets "shared".
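
For illustration (hypothetical file names; the hash shown is the
actual SHA-1 of a blob containing the single line "hello"):

% echo hello > file1
% echo hello > file2
% git hash-object -w file1
ce013625030ba8dba906f756967f9e9ca394464a
% git hash-object -w file2
ce013625030ba8dba906f756967f9e9ca394464a

The second invocation finds the blob already present in the object
store and writes nothing new.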

So what we're dealing with here is that `git add` is mostly I/O-bound,
not CPU-bound (calculating a SHA-1 hash is cheap on today's hardware),
and running several `git add` commands in parallel might actually make
things worse, not better -- especially if your work tree is kept on
rotating media, and especially if both the work tree and the object
store are on the same medium (the most typical case).  The index is
most certainly locked (via a ".git/index.lock" lock file) at key
points of a `git add` operation, but I don't know at which points
exactly.
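
You can actually see that locking in action: when two `git add`
processes race, the loser finds the lock file already present and
bails out (hypothetical file names; the exact wording varies between
Git versions):

% git add file1 & git add file2 &
fatal: Unable to create '/path/to/repo/.git/index.lock': File exists.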

You could try to see where the contention is by checking out your work
tree on a ramdisk (Git understands the "--work-tree" command-line
option and the GIT_WORK_TREE environment variable, which cover such
cases) and seeing whether that makes your multiple `git add` commands
run truly in parallel.  If it does, disk I/O is the bottleneck.
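
A minimal sketch of such an experiment, assuming a tmpfs is mounted
at /mnt/ram and your repository lives at /path/to/repo (both paths
are placeholders):

% mkdir /mnt/ram/wt
% cd /mnt/ram/wt
% export GIT_DIR=/path/to/repo/.git GIT_WORK_TREE=/mnt/ram/wt
% git checkout -- .
% git add file1 &
% git add file2 &
...

Here the work tree lives in RAM while the object store and the index
stay on disk, so any remaining serialization points at the ".git"
side rather than at reading the files.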

As to your imaginary "--thread" option: AFAIK, no, `git add` does not
support anything like this -- presumably for the reasons explained
above.

A possible kludge which could help a bit in your case is running
`git hash-object -w` on your files ahead of `git add`-ing them (which
is typically done right before committing).  Assuming you have plenty
of RAM and the files won't change in the meantime, this could let the
subsequent runs of `git add` be served from the filesystem cache.
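
A sketch of that kludge (file names are placeholders; since
`git hash-object -w` does not touch the index, these invocations
should be safe to run concurrently):

% git hash-object -w file1 &
% git hash-object -w file2 &
...
% git hash-object -w file100 &
% wait
% git add file1 file2 ... file100

By the time `git add` runs, the blobs already exist in the object
store, so it mostly just updates the index.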
