Borrowing objects from nearby repositories

Andrew Keller Tue, 11 Mar 2014 20:38:22 -0700

Hi all,

I am considering developing a new feature, and I'd like to poll the group for 
opinions.


Background: A couple of years ago, I wrote a set of scripts that speed up 
cloning of frequently used repositories.  The scripts use a bare Git repository 
located at a known location, and automate providing a --reference parameter to 
`git clone` and `git submodule update`.  Recently, some coworkers of mine 
expressed an interest in using the scripts, so I published the current version 
of my scripts, called `git repocache`, described at the bottom of 
<https://github.com/andrewkeller/ak-git-tools>.
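
For readers unfamiliar with the mechanism, here is a minimal sketch of the kind 
of thing such scripts automate.  It is not the actual `git repocache` code: the 
function name `clone_with_cache` and its argument order are hypothetical, and 
using one cache for both the superproject and its submodules is an assumption.  
Only `git clone --reference` and `git submodule update --reference` are 
existing Git options.

```shell
# Hypothetical helper; not the actual `git repocache` implementation.
# Usage: clone_with_cache <cache-repo> <url> <directory>
clone_with_cache() {
    cache=$1 url=$2 dir=$3

    # Objects already present in the cache are not downloaded again; the
    # new clone reaches them through .git/objects/info/alternates.
    git clone --reference "$cache" "$url" "$dir" || return 1

    # Assumption: the same cache also holds the submodules' objects.
    git -C "$dir" submodule update --init --reference "$cache"
}
```

A clone made this way keeps depending on the cache repository, which is the 
limitation the rest of this message is about.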

Slowly, it has occurred to me that this feature, or something similar to it, 
may be worth adding to Git, so I've been thinking about the best approach.  
Here's my best idea so far:

1)  Introduce '--borrow' to `git fetch`.  This would behave similarly to 
'--reference', except that it operates on a temporary basis and does not 
assume that the reference repository will exist after the operation completes, 
so any objects used are copied into the local object database.  In theory, 
this mechanism would be distinct from '--reference', so if both are used, some 
objects would be copied, and others would remain accessible through the 
reference repository listed in the alternates file.

2)  Teach `git fetch` to read 'repocache.path' (or a better-named 
configuration), and use it to automatically activate borrowing.

3)  For consistency, `git clone`, `git pull`, and `git submodule update` should 
probably all learn '--borrow', and forward it to `git fetch`.

4)  In some scenarios, it may be necessary to temporarily disable automatic 
borrowing, so `git fetch` and everything that calls it may need an option for 
that.
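
The proposed '--borrow' does not exist in Git.  As a rough illustration of the 
intended end state, the sketch below approximates it for a clone using only 
existing commands: borrow through '--reference', repack so the borrowed objects 
are copied locally, then drop the alternates file.  The function name 
`clone_borrowing` is hypothetical, and copying by a full repack is an 
assumption made for the sketch, not the copying strategy being proposed.

```shell
# Hypothetical helper approximating the proposed '--borrow' for a clone.
# Usage: clone_borrowing <reference-repo> <url> <directory>
clone_borrowing() {
    reference=$1 url=$2 dir=$3

    # Step 1: borrow objects from the reference repository via alternates.
    git clone --reference "$reference" "$url" "$dir" || return 1

    # Step 2: pack every reachable object, including the borrowed ones,
    # into the clone's own object database (no -l, so alternates count).
    git -C "$dir" repack -a -d || return 1

    # Step 3: drop the link so the reference repository may disappear.
    rm -f "$dir/.git/objects/info/alternates"
}
```

After this runs, the reference repository can be deleted without breaking the 
clone, which is the property '--borrow' is meant to provide directly.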

Intended outcome: With 'repocache.path' set, and the cached repository properly 
updated, one could run `git clone <url>`, and the operation would complete much 
faster than it does now due to less load on the network.
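
Until Git itself understands such a setting, the intended behavior can be 
imitated with a wrapper.  The sketch below is hypothetical throughout: 
`clone_auto` is an invented name, Git does not act on 'repocache.path' by 
itself, and the wrapper uses '--reference' rather than the proposed '--borrow', 
so the resulting clone still depends on the cache.

```shell
# Hypothetical wrapper: clone, using 'repocache.path' when it is configured.
# Usage: clone_auto <url> <directory>
clone_auto() {
    url=$1 dir=$2

    # 'repocache.path' is the setting name proposed above; Git itself does
    # not act on it, so this wrapper reads it explicitly.
    cache=$(git config --get repocache.path) || cache=

    if [ -n "$cache" ] && [ -d "$cache" ]; then
        git clone --reference "$cache" "$url" "$dir"
    else
        # No usable cache configured: fall back to an ordinary clone.
        git clone "$url" "$dir"
    fi
}
```

The setting would be made once, for example with 
`git config --global repocache.path <path-to-cache>`.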

Things I haven't figured out yet:

*  What's the best approach to copying the needed objects?  It's probably 
inefficient to copy individual objects out of pack files one at a time, but it 
could be wasteful to copy entire pack files just because you need one object.  
Hard-linking could help, but that won't always be available.  One of my 
previous ideas was to add a '--auto-repack' option to `git clone`, which solves 
this problem better, but introduces some other front-end usability problems.
*  To remain effective, the cache repository has to be fetched regularly.  Not 
all users know how to set up a scheduled task on their computer, so this might 
become a maintenance problem for the user.  I think this kind of problem calls 
into question the viability of the underlying design, assuming that the 
ultimate goal is faster clones with little or no change in how Git is used.
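
To make the maintenance burden in the last point concrete, this is roughly what 
a user would have to schedule.  The function name `update_cache` is 
hypothetical, and the sketch assumes the cache has remotes with fetch refspecs 
configured, for example because it was created with `git clone --mirror`.

```shell
# Hypothetical helper: bring the cache repository up to date.
# Usage: update_cache <cache-repo>
update_cache() {
    cache=$1

    # Assumes the cache has remotes with fetch refspecs configured, for
    # example because it was created with `git clone --mirror`.
    git --git-dir="$cache" fetch --all --prune --quiet
}
```

Running this from cron or a similar scheduler is the step that not every user 
will know how to set up.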


Thoughts?

Thanks,
Andrew Keller
