On Mon, 13 Jun 2016 21:24:55 -0700 (PDT) Jack Poon <j...@atcipher.com> wrote:
> I am trying to parallelize git activities for a large git repository.
> Would like to know if the inner core of git-add is thread-safe? For
> example,
>
> % git add file1 &
> % git add file2 &
> % git add file3 &
> ...
> % git add file100 &
>
> assuming file1 - file100 are large files...
>
> Is there a hidden switch in 'git-add' that will do multiple files in
> parallel? For example,
> % git add --thread=10 file1 file2 file3 file4 .... file100

(Disclaimer: these details are pretty hard-core and are better discussed on the main Git list.)

`git add` basically works like this, for each file name passed to it on the command line: the SHA-1 hash is calculated over the file's contents -- this requires reading the entire file -- then that content ("the blob") gets written into the object database hierarchy (which is kept under the ".git" directory), and an appropriate entry is inserted into the index, or updated there if it already exists.

Presently, in the stock Git implementation the index is a single file on the file system, and writing an entry to the index basically means writing an updated copy of the index file and then atomically renaming it over the existing one.

The operation of shoveling the file's contents into the object store while calculating the SHA-1 hash over it is done using the `git hash-object -w` plumbing command. This command works in such a way that if a blob hashing to the same SHA-1 value already exists in the object store, nothing is written there and the blob gets "shared".

So what we're dealing with here is that `git add` is mostly I/O-bound, not CPU-bound (calculating a SHA-1 hash is cheap on today's hardware), and running several `git add` commands in parallel might actually make things worse, not better -- especially if your work tree is kept on rotating media, and especially if both the work tree and the object store are on the same media (the most typical case).
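The content-addressed sharing described above is easy to observe with the plumbing command itself. A minimal sketch in a throwaway repository (the file names and contents here are made up for illustration):

```shell
#!/bin/sh
set -e

# Throwaway repository so we don't touch any real work tree.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

printf 'hello\n' > a.txt
printf 'hello\n' > b.txt            # same contents as a.txt

ha=$(git hash-object -w a.txt)      # writes the blob, prints its SHA-1
hb=$(git hash-object -w b.txt)      # same contents -> same hash; nothing new is written

echo "$ha"
echo "$hb"
test "$ha" = "$hb" && echo "blob is shared"
```

Since the hash is computed over the blob's contents (plus a small header), the two identical files map to the same object, which is stored exactly once under `.git/objects`.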
The database is most certainly locked at key points of `git add` operations, but I don't know at which points exactly. You could try to see where the contention is by checking out your work tree on a ramdisk (Git understands the "--work-tree" command-line option and the GIT_WORK_TREE environment variable, which cover such cases) and seeing whether that makes your multiple `git add` commands really run in parallel. If so, disk I/O is the bottleneck.

As to your imaginary "--thread" option: AFAIK no, `git add` does not support anything like this -- supposedly for the reason I explained above.

A possible kludge which could help a bit in your case is running `git hash-object -w` on your files ahead of `git add`-ing them (which is typically done right before committing). Assuming you have plenty of RAM and the files won't change, this could make the subsequent runs of `git add` be served from the filesystem cache.

-- 
You received this message because you are subscribed to the Google Groups "Git for human beings" group.
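A sketch of that kludge, again in a throwaway repository. Here `file1`..`file3` stand in for the large files from the question; the names and contents are placeholders:

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Placeholder stand-ins for the actual large files.
for i in 1 2 3; do printf 'payload %s\n' "$i" > "file$i"; done

# Step 1: shovel the blobs into the object store ahead of time.
# `git hash-object -w` accepts several paths and prints one hash per line.
git hash-object -w file1 file2 file3 > /dev/null

# Step 2: the later `git add` finds the blobs already present, so it
# mostly just has to update the index (and the file data may still be
# warm in the filesystem cache).
git add file1 file2 file3
git status --porcelain
```

The second step should print the three files as newly added (`A  file1` and so on); the win, if any, comes from the blobs having been written -- and the file contents read -- before the `git add` run.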