A design for subrepositories

Lauri Alanko Sat, 13 Oct 2012 06:56:44 -0700

Hello.

I intend to work on a "subrepository" tool for git, but before I embark on the actual programming, I thought to first invite comments on the general design.

Some background first. I know that there are already several existing approaches for managing nested repositories, but none of them quite seems to fit my purposes. My primary goal is to use git for home directory backup and mirroring, while the home directory itself may of course contain repositories.

Git-subtree doesn't quite fit the bill. It allows merging a subtree into a larger tree and then again splitting it out for exporting, but this is tedious. More importantly, a merged tree gets branched along with the containing tree, whereas I want to have subrepositories precisely because the subtrees need to be branched independently of the container.

Submodules are a bit closer to what I want, but they have clearly been designed for a different purpose: a repository with submodules is only supposed to collate existing repositories, not act as a source for them. So they aren't really faithful to the distributed nature of git: there's no easy way to completely clone a repository and its submodules.

Moreover, submodules have some other annoyances like not supporting bare repositories and checking out the submodules in detached heads.

Now, in other circumstances I might just patch git-submodule to add the features I want, but it turns out that it is written in shell. I know that is a git tradition, but I'm going to get a bit religious here: anything longer than a screenful shouldn't be written in shell, and I'm certainly not going to add more lines to an already overlong script. Hence I'm going to write a separate tool using something a bit more... structured. Probably Python with Dulwich.


So here are some preliminary thoughts on how the tool should work.


* Repository layout

Every subrepository has a unique identifier. The heads of subrepository <subname> are simply stored as heads in a subdirectory of the main repository: e.g. refs/heads/subrepos/<subname>/<branchname>. Likewise for tags: refs/tags/subrepos/<subname>/<tagname>.

Rationale: if we had fully independent repositories under the main repository directory, like what git-submodule uses, there would be no easy way to enumerate all the existing subrepositories to copy them. Since the only things we can directly list from a remote repository are references, it makes sense to store the subrepositories just as a bunch of them.

The reason for storing the subrepo references under refs/heads/ and refs/tags/ (instead of, say, refs/subrepos/) is simply that this way everything is directly compatible with standard git tools: one can do a normal git clone/push/pull for mirroring and backup purposes without any need for special tools. You only need tools once you operate on a working tree.
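The naming scheme above can be sketched in a few lines of Python. This is only an illustration, not part of any existing tool: the function names are made up, and it assumes that a <subname> never contains a slash, which the design does not actually state.

```python
def container_ref(subname, ref):
    """Map a subrepo's own ref to the name it gets in the container.

    Hypothetical helper; assumes `subname` contains no "/".
    """
    for prefix in ("refs/heads/", "refs/tags/"):
        if ref.startswith(prefix):
            return f"{prefix}subrepos/{subname}/{ref[len(prefix):]}"
    raise ValueError(f"unsupported ref: {ref}")


def subrepos_in(container_refs):
    """Enumerate subrepo names from nothing but a list of ref names."""
    names = set()
    for ref in container_refs:
        parts = ref.split("/")
        # Expected shape: refs/<heads|tags>/subrepos/<subname>/<name...>
        if (len(parts) >= 5 and parts[0] == "refs"
                and parts[1] in ("heads", "tags")
                and parts[2] == "subrepos"):
            names.add(parts[3])
    return sorted(names)
```

The second function is the point of the rationale: a plain listing of remote refs is enough to discover every subrepository.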



* Tree layout

A tree can mount references of subrepositories. There are two components to a mount: a gitlink under <path> to a particular commit of a subrepo, and an entry in .gitrepos. This is very similar to how git-submodule works.

The entry in .gitrepos specifies two things: the name of the subrepository mounted under <path>, and the active branch in that mount at the time of commit. So .gitrepos would look like this:


[mount "<path>"]
   subrepo = <subname>
   branch = <branchname>

Rationale: by storing the active branch name we can cater for the very common case where we check out a gitlink pointing to the current head of the branch. Then, when we check out the subrepository at the mount point, we can adjust HEAD to point to the correct branch.

By mapping a path to a subrepository (instead of the other way around, as git-submodule does), we can have multiple mount points for the same subrepository, presumably with different active branches. Sometimes we want to have separate working trees for various branches, and it's good to be able to store this configuration in the containing tree.
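To make the .gitrepos format concrete, here is a minimal reader for it. It is a sketch under assumptions: it handles only the section and key shapes shown above, not the full git config syntax, and the function name and example paths are invented for illustration.

```python
import re

SECTION = re.compile(r'^\[mount "(?P<path>[^"]+)"\]$')
ENTRY = re.compile(r'^(?P<key>\w+)\s*=\s*(?P<value>.*)$')


def parse_gitrepos(text):
    """Return {path: {"subrepo": ..., "branch": ...}} from .gitrepos text."""
    mounts = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        section = SECTION.match(line)
        if section:
            current = mounts.setdefault(section.group("path"), {})
            continue
        entry = ENTRY.match(line)
        if entry is None or current is None:
            raise ValueError(f"cannot parse line: {raw!r}")
        current[entry.group("key")] = entry.group("value").strip()
    return mounts


example = '''
[mount "src/proj"]
   subrepo = proj
   branch = master
[mount "src/proj-stable"]
   subrepo = proj
   branch = stable
'''
mounts = parse_gitrepos(example)
```

Because the path is the key, the same subrepository ("proj" here) can appear under two mount points with different active branches.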



* Working tree layout

When a tree containing mount points is checked out, a repository is created at each of those mount points. For every <path> specified in .gitrepos with subrepo <subname> and active branch <branchname>, and a gitlink in <path> pointing to <commit>, we do the following:


- Create a repository under <path>/.git

- Add the object store of the containing repository to <path>/.git/objects/info/alternates

- Pull (just copy, really) the containing repository's references to the subrepository as follows:


 - refs/heads/subrepos/<subname>/* -> refs/heads/*
 - refs/tags/subrepos/<subname>/* -> refs/tags/*
 - refs/remotes/<remote>/subrepos/<subname>/* -> refs/remotes/<remote>/*

- If refs/heads/<branchname> in the subrepository now points to <commit>, set HEAD as a symref to it. Otherwise set a detached HEAD directly to <commit>.


- Check out HEAD in the subrepository.
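The ref copying and the HEAD decision in the steps above can be sketched as a pure function over a mapping of ref names to commit ids. This is an illustration only: the function name and the dict representation are assumptions, and nothing here touches a real repository.

```python
def plan_checkout(container_refs, subname, branch, commit):
    """Compute the subrepo's refs and HEAD from the container's refs.

    container_refs: dict mapping ref name -> commit id (hypothetical input).
    Returns (sub_refs, head), where head is either "ref: <refname>" for a
    symbolic HEAD or a bare commit id for a detached HEAD.
    """
    sub_refs = {}
    for ref, sha in container_refs.items():
        parts = ref.split("/")
        if (parts[:2] in (["refs", "heads"], ["refs", "tags"])
                and parts[2:4] == ["subrepos", subname] and len(parts) > 4):
            # refs/heads/subrepos/<subname>/X -> refs/heads/X (same for tags)
            sub_refs["/".join(parts[:2] + parts[4:])] = sha
        elif (parts[:2] == ["refs", "remotes"] and len(parts) > 5
                and parts[3:5] == ["subrepos", subname]):
            # refs/remotes/<remote>/subrepos/<subname>/X -> refs/remotes/<remote>/X
            sub_refs["/".join(parts[:3] + parts[5:])] = sha

    head_ref = f"refs/heads/{branch}"
    if branch and sub_refs.get(head_ref) == commit:
        head = f"ref: {head_ref}"   # branch tip matches the gitlink
    else:
        head = commit               # otherwise detach at the gitlink commit
    return sub_refs, head
```

Refs that belong to the container itself or to other subrepositories are simply ignored.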

Rationale: it was a tempting idea to make refs/heads and refs/tags symlinks directly to the correct subdirectories in the containing repository, and likewise make objects/ directly a symlink to the containing repository's object store. However, this is not really feasible due to packed-refs, and it would make symlinks a requirement, something that git tries to avoid. (Of course "directory symrefs" would be a simple addition to the core.)

More importantly, a symlink to the object store would break git-gc. Also, it would be ugly to have ref manipulations under the mount point directly affect the refs in the containing repository. It's better that none of the changes under the mount point affect the containing repository in any way before an explicit add and check-in. At this point the refs are pulled back in the reverse direction.



* Basic commands


** git subrepo add <path> [<subname>]

Add a subrepository to the containing repository, or add the changes in a subrepository to the index.

If <path> is not yet found in .gitrepos, <subname> must be specified. Otherwise <subname> is looked up from .gitrepos.


The command performs the following:

- Add or update the gitlink to the index: git add <path>

- Add or change an entry in .gitrepos, setting mount.<path>.subrepo to <subname> and mount.<path>.branch to the active branch under <path> (if any).

- git add .gitrepos


** git subrepo checkin [-f] [<path>...]

Update the subrepo references in the containing repository to the references in the mount points. This is meant to be run as a pre-commit hook with no arguments.

If no paths are given, <path>... defaults to every mount path in .gitrepos that has been changed in the index. For each <path> mounting <subname>, perform the following:


- git fetch [-f] <path> refs/heads/*:refs/heads/subrepos/<subname>/*
- git fetch [-f] <path> refs/tags/*:refs/tags/subrepos/<subname>/*

If [-f] is given, it is passed to git fetch.
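As a rough sketch, the two fetches could be assembled like this before being handed to something like subprocess.run. The function name is hypothetical; the refspecs follow the list above, with a wildcard on both sides of the colon, as git requires for pattern refspecs.

```python
def checkin_commands(path, subname, force=False):
    """Build the git fetch invocations for checking in one mount point.

    Hypothetical helper: returns argument lists and runs nothing itself.
    """
    base = ["git", "fetch"] + (["-f"] if force else [])
    return [
        base + [path, f"refs/heads/*:refs/heads/subrepos/{subname}/*"],
        base + [path, f"refs/tags/*:refs/tags/subrepos/{subname}/*"],
    ]


commands = checkin_commands("src/proj", "proj", force=True)
```

Without -f, git fetch refuses non-fast-forward updates of the destination branches, which is where the failure described next for diverged mount points would come from.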

The operation can fail in the unlikely case that there are multiple mount points for the same subrepository, and a branch has diverged between those mount points.

Note: after this operation, any new objects that were added under the mount point are now duplicated in the containing repository. A git gc in the containing repository followed by a git gc in the mount point should remove the now-redundant objects from the mount point.

Note: the default paths overlook the spurious case where we have modified the head of a non-active branch under the mount point, but the active branch (and hence the commit in the gitlink) has remained unchanged. I don't know if there's a reasonable way to make "git subrepo add" somehow stage even these kinds of changes.



** git subrepo checkout [<path>...]

Check out the subrepositories at mount points <path>..., or at all the mount points if none are specified. This is meant to be run as a post-checkout hook with no arguments.

This is described above in "Working tree layout". If this is not an initial checkout, then the first two steps are skipped and just the refs and working tree are updated.



** git subrepo mv <path> <path>

Move a mount point: git mv the actual directory and adjust the path in .gitrepos and possibly the relative path in <path>/.git/objects/info/alternates. (An absolute path would fix the latter, but then we couldn't move the entire containing repository. This is the lesser evil, IMHO.)
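To show why the alternates entry changes on a move, here is a small sketch of computing it. Git resolves relative entries in objects/info/alternates against the object directory itself, so that is the base used here. The function name and the example paths are invented, and posixpath is used so that the result does not depend on the platform.

```python
import posixpath


def alternates_entry(container_git_dir, mount_path):
    """Relative path from <mount_path>/.git/objects to the container's objects."""
    sub_objects = posixpath.join(mount_path, ".git", "objects")
    container_objects = posixpath.join(container_git_dir, "objects")
    return posixpath.relpath(container_objects, sub_objects)


before = alternates_entry("/home/u/.git", "/home/u/src/proj")
after = alternates_entry("/home/u/.git", "/home/u/proj")
```

Moving the mount point one directory level up shortens the entry by one "..", so the file has to be rewritten; moving the whole containing repository changes nothing, which is the advantage of the relative form.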

Gripe: why doesn't git support arbitrary metadata for tree entries? Then we wouldn't need to worry about syncing various path attributes that are stored in separate files, but a simple git mv could automatically move everything associated with the path.



** git subrepo rm <path>

Remove the mount point and its entry in .gitrepos.


* A variant design

The above design is straightforward to implement, but it has a bit of an ad-hoc feel in that we have these magic commands that transfer refs between the containing repository and the mount points. But there are already standard tools for transferring refs: push and pull/fetch. It would be more "git-like" to use these directly, and make the containing repository be simply a remote for the mount point. We need a special remote for this purpose: git-remote-subrepo gives a "view" of the refs of a particular subrepo within the ref tree of the containing repository. It just makes the following translations for push and fetch:


subrepo://<URL>/<subname> refs/heads/<branchname>
-> <URL> heads/subrepos/<subname>/<branchname>

subrepo://<URL>/<subname> refs/tags/<tagname>
-> <URL> tags/subrepos/<subname>/<tagname>

subrepo://<URL>/<subname>/<remote> refs/heads/<branchname>
-> <URL> remotes/<remote>/heads/subrepos/<subname>/<branchname>
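The translation the helper would perform can be sketched as a pure function. This mirrors the mappings listed above literally, including right-hand sides written without a leading refs/. The function name is made up, the URL is assumed to be already split into its <URL>, <subname> and optional <remote> parts, and only heads are handled for the <remote> form because that is the only case the list gives.

```python
def translate(subname, ref, remote=None):
    """Translate a ref as seen in the mount point to its name in the container.

    Hypothetical sketch of what a git-remote-subrepo helper would compute.
    """
    if ref.startswith("refs/heads/"):
        kind, name = "heads", ref[len("refs/heads/"):]
    elif ref.startswith("refs/tags/"):
        kind, name = "tags", ref[len("refs/tags/"):]
    else:
        raise ValueError(f"unsupported ref: {ref}")

    if remote is None:
        return f"{kind}/subrepos/{subname}/{name}"
    if kind != "heads":
        # The design only lists a heads mapping for the <remote> form.
        raise ValueError("only heads are mapped for a named remote")
    return f"remotes/{remote}/heads/subrepos/{subname}/{name}"
```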

Then subrepo://<containingrepo>/<subname> is set as the origin in the mount point, so one can just do a normal git push to push the changes to the containing repository. Likewise, for all the remotes in the containing repository, a remote with the same name is created under the mount point with the url subrepo://<containingrepo>/<subname>/<remote>. Or it can be set to directly access the actual remote: subrepo://<url-of-remote>/<subname>. It's a matter of taste.

The problem with explicit pushing to the containing repository is that then changes to the refs happen completely independently of changes to the gitlinks, and ideally these should be synchronized in a single commit. So I'm not quite sure if the additional complexity of a remote helper is warranted.

I hope I managed to make some sense of what this is about. Questions and comments are appreciated.


Cheers,


Lauri
