I have started working on pluggable ref backends. In this email I
would like to share my plans and solicit feedback.
(This morning I removed this project from the GSoC ideas page, because
it is unfair to ask a student to shoot at a moving target.)
Why?
====
Currently, the reference- and reflog-handling code in Git is too
tightly coupled to the rest of the system. There are too many places that
know, for example, the difference between loose and packed refs, or
that loose references are stored as files directly under
$GIT_DIR/refs/heads/, or the locking protocols that have to be adhered
to when managing references. This tight coupling, in turn, makes it
nearly impossible to experiment with alternate reference storage
schemes.
But there is a lot of potential to use alternate reference storage
schemes to fix some currently-unfixable problems, and to implement
some cool new features.
Unfixable problems
------------------
The on-disk format that we currently use to store references makes
some problems impossible to fix:
* It is impossible to get a self-consistent snapshot of all references
at a given moment in time. This makes it impossible, even in
principle, to do object pruning in a 100% race-free way. (Our
current workaround of not deleting objects that are less than two
weeks old works in most cases but, aside from being ugly, has holes.)
* There are awkward filesystem-imposed constraints on reference
naming, for example:
* D/F conflicts (I): it is not possible to have branches named
"my-feature" and "my-feature/base" at the same time.
* D/F conflicts (II): it is not possible to have reflogs for
branches named "my-feature" and "my-feature/base" at the same
time. This leads to the problem that it is not, in general,
possible to retain reflogs for branches that have been deleted.
* There are additional constraints on reference names depending on
the filesystem used to store them. For example, a Git repository
on a case-insensitive filesystem fails in confusing ways if there
are two loose references whose names differ only in case; however,
packed references differing in case might work for a while. Also,
reference names that include Unicode characters can have their
normalization form changed if they are written on Mac OS.
* The packed-refs file has to be rewritten whenever a packed reference
is deleted. (It might be nice to write 0{40} to a loose reference
file to indicate that the reference has been deleted, but that would
open the way for more D/F conflicts.)
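To make the D/F conflict concrete, here is a minimal sketch in Python (not Git's C code). It assumes a toy store that keeps one file per reference, which is how loose references behave; the helper name write_loose_ref is hypothetical.

```python
import os
import tempfile


def write_loose_ref(refs_root, refname, value):
    """Store `refname` as a plain file under `refs_root`, one file per ref.

    Returns True on success and False if the filesystem refuses, which
    happens when a needed directory name is already taken by a file, or
    the target file name is already taken by a directory.
    """
    path = os.path.join(refs_root, *refname.split("/"))
    try:
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            f.write(value + "\n")
        return True
    except OSError:
        return False


with tempfile.TemporaryDirectory() as root:
    sha = "1" * 40  # placeholder object name, not a real commit
    print(write_loose_ref(root, "heads/my-feature", sha))       # True
    print(write_loose_ref(root, "heads/my-feature/base", sha))  # False
```

The second call fails because "my-feature" would have to be a file and a directory at the same time. Nothing in the reference namespace itself forbids the pair; the restriction comes purely from the storage format.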
Wild new ideas
--------------
So, I would like to reorganize the Git code to allow pluggable
reference backends. If we had this, we could try out ideas like
* Retain the idea of loose/packed references, but encode loose
reference names using a portable naming scheme before storing them
to the filesystem; maybe something like
    refs/heads/Foo.42 -> refs.dir/heads.dir/%46oo%2e42
    logs/refs/heads/Foo.42 -> refs.dir/heads.dir/%46oo%2e42.log
Yes, it looks uglier. But users shouldn't be looking in these
directories anyway. This single change would prevent D/F conflicts,
allow a reference to be deleted by writing 0{40} to its loose
reference file, allow reflogs to be kept for deleted refs, and
remove the problem of filesystem-dependent naming constraints.
* Store references in a SQLite database, to get correct transaction
handling.
* Store references directly in the Git object database.
* Implement repository "groups" that share a common object database
and also a common reference store. Each repository in a group would
get a sub-namespace in the shared database, and store its references
in names like "refs/member/$MEMBERID/refs/heads/...". The member
repos would act like restricted views of the shared database. This
would be like a combination of alternates (with lowered risk of
corruption) and gitnamespaces(7) (but usable for all git commands).
* Reference transactions that can be used across multiple Git
commands. Imagine,
    export GIT_TRANSACTION=$(git transaction begin)
    trap 'git transaction rollback' ERR
    git foo ...
    git bar ...
    git baz ...
    if ! git transaction commit
    then
        # Transaction failed; all references rolled back
    else
        # Transaction succeeded; all references updated atomically
    fi
    trap '' ERR
    unset GIT_TRANSACTION
The "GIT_TRANSACTION" environment variable would tell git to read
from the usual references, overridden with any reference changes
that have occurred during the transaction, but write any changes
(including both old and new values) to the transaction. The command
"git transaction commit" would verify that the old values listed in
the transaction still agree with the current values, and then make
all of the changes atomically.
Such transactions could also be broadcast to mirrors when they are
committed to keep multiple Git repositories in sync.
* One alternate backend might even be a shim that delegates to libgit2
to do the actual reading/writing of references. Then new backends
could be implemented in libgit2 to allow both git and libgit2 to
benefit.
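The portable naming scheme from the first idea in the list above could be sketched roughly as follows. This is a Python illustration, not Git code. The exact escaping rules are an assumption: the only thing taken from the example mapping is that uppercase letters and "." are escaped as lowercase "%xx", that directory components get a ".dir" suffix, and that reflogs get a ".log" suffix. The choice of "safe" characters and the function names are hypothetical.

```python
# Assumed safe set: characters that survive unchanged on case-insensitive
# and normalizing filesystems. Everything else is escaped byte by byte.
SAFE = set("abcdefghijklmnopqrstuvwxyz0123456789-_")


def encode_component(component):
    """Percent-encode one path component of a reference name."""
    out = []
    for byte in component.encode("utf-8"):
        ch = chr(byte)
        out.append(ch if ch in SAFE else "%{:02x}".format(byte))
    return "".join(out)


def encoded_ref_path(refname, reflog=False):
    """Map a reference name to its encoded relative path.

    Directory components get a ".dir" suffix and reflogs a ".log" suffix.
    Because "." is always escaped inside a component, these suffixes can
    never collide with an encoded reference name.
    """
    *dirs, leaf = refname.split("/")
    parts = [encode_component(d) + ".dir" for d in dirs]
    parts.append(encode_component(leaf) + (".log" if reflog else ""))
    return "/".join(parts)


print(encoded_ref_path("refs/heads/Foo.42"))
print(encoded_ref_path("refs/heads/Foo.42", reflog=True))
print(encoded_ref_path("refs/heads/my-feature"))
print(encoded_ref_path("refs/heads/my-feature/base"))
```

With this mapping "my-feature" is stored as the file "my-feature" while "my-feature/base" lives under the directory "my-feature.dir", so the two no longer compete for the same filesystem name.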
The plan
========
It is currently not possible to experiment with any of these things
because of the tight coupling between the reference code and the rest
of git. The goal of this project is first to choke the interactions
down to a coherent interface, and second to make the implementation
selectable at runtime. The implementation of specific alternate
backends will hopefully follow.
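As a rough picture of what "a coherent interface" plus "selectable at runtime" might look like, here is a minimal Python sketch. It is not a proposal for the actual C API: the class names, the method set, and the registry are all hypothetical, and the in-memory backend only stands in for a real storage scheme.

```python
from abc import ABC, abstractmethod


class RefBackend(ABC):
    """Hypothetical interface: the only way callers may touch references."""

    @abstractmethod
    def read_ref(self, refname):
        """Return the stored value for refname, or None if it is absent."""

    @abstractmethod
    def write_ref(self, refname, value):
        """Create or overwrite refname."""

    @abstractmethod
    def delete_ref(self, refname):
        """Remove refname if it exists."""

    @abstractmethod
    def iter_refs(self, prefix=""):
        """Return sorted (refname, value) pairs whose names start with prefix."""


class InMemoryBackend(RefBackend):
    """Toy backend that keeps everything in a dict."""

    def __init__(self):
        self._refs = {}

    def read_ref(self, refname):
        return self._refs.get(refname)

    def write_ref(self, refname, value):
        self._refs[refname] = value

    def delete_ref(self, refname):
        self._refs.pop(refname, None)

    def iter_refs(self, prefix=""):
        return sorted(
            (name, value)
            for name, value in self._refs.items()
            if name.startswith(prefix)
        )


# Registry of known backends; a real one would list "files", "sqlite", etc.
BACKENDS = {"memory": InMemoryBackend}


def open_ref_backend(name):
    """Pick a backend implementation by name at runtime."""
    try:
        return BACKENDS[name]()
    except KeyError:
        raise ValueError("unknown ref backend: " + name) from None
```

Callers would only ever hold a RefBackend, so details such as loose versus packed storage or locking protocols stay behind the interface.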
quagga references
-----------------
The overriding task is to isolate the reference-handling code; i.e.,
make sure that only code within refs.c touches git references, and
that the refs API provides all of the features that other code needs
to do its work.
So as a whimsical first milestone, I want to make it possible to
choose a different directory name for storing references and reflogs
by changing one #define statement in refs.c. The goal is to get the
test suite to run correctly regardless of how this variable is set,
which would be a pretty good check that all reference-handling code
paths go through the refs API. For no special reason I've been using
"quagga" as the new place, so references go to "$GIT_DIR/quagga/HEAD",
"$GIT_DIR/quagga/refs/heads/master", etc. (Of course we wouldn't
actually *change* this name; it is only for testing purposes.) I've
started working on this but there is a lot of code to change
(including test code).
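The idea behind the milestone can be illustrated with a small Python sketch (again, not the actual refs.c code). The constant and helper names are hypothetical, and placing reflogs under "quagga/logs/" is an assumption; only the "$GIT_DIR/quagga/HEAD" and "$GIT_DIR/quagga/refs/heads/master" locations come from the description above.

```python
import posixpath

# The single knob: every reference and reflog path is derived from it.
# An empty string would give the traditional layout.
REF_BASE_DIR = "quagga"


def ref_path(git_dir, refname):
    """Return the file that would hold `refname` as a loose reference."""
    return posixpath.join(git_dir, REF_BASE_DIR, refname)


def reflog_path(git_dir, refname):
    """Return the reflog file for `refname` (location is an assumption)."""
    return posixpath.join(git_dir, REF_BASE_DIR, "logs", refname)


print(ref_path("/repo/.git", "HEAD"))
print(ref_path("/repo/.git", "refs/heads/master"))
```

Any code that builds a path like "$GIT_DIR/refs/heads/..." by hand instead of going through such a helper would look in the wrong place and show up as a test failure, which is the point of the exercise.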
Reference transactions
----------------------
I want to orient the new reference API as much as possible around
transactions. I think a transaction is a flexible abstraction that
should be implementable by any backend (albeit not always with 100%
ACID compliance) and will allow a couple of existing races to be
fixed.
So as a first step, I will soon submit a patch series that starts
fleshing out the concept of a ref_transaction, and rewrites "git
update-ref --stdin" to use the new API. For now, ref_transaction will
only be usable within a single git command invocation, but I want to
leave the way open to the GIT_TRANSACTION idea mentioned above.
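To illustrate the semantics I have in mind, here is a minimal Python sketch of a transaction over an in-memory reference store. It is not the planned C ref_transaction API; the class and method names are hypothetical. Reads see pending changes, each update records the old value it was based on, and commit verifies every old value before applying anything.

```python
class TransactionError(Exception):
    """Raised when a reference changed behind the transaction's back."""


class RefTransaction:
    def __init__(self, store):
        self._store = store   # dict: refname -> value (committed state)
        self._updates = {}    # refname -> (old_value, new_value)

    def read(self, refname):
        """Read a reference, preferring changes queued in this transaction."""
        if refname in self._updates:
            return self._updates[refname][1]
        return self._store.get(refname)

    def update(self, refname, new_value):
        """Queue a change, remembering the value it was based on."""
        if refname in self._updates:
            old_value = self._updates[refname][0]
        else:
            old_value = self._store.get(refname)
        self._updates[refname] = (old_value, new_value)

    def delete(self, refname):
        """Queue a deletion (represented as a new value of None)."""
        self.update(refname, None)

    def commit(self):
        """Verify all old values, then apply all changes, or apply none."""
        for refname, (old_value, _) in self._updates.items():
            if self._store.get(refname) != old_value:
                raise TransactionError(refname + " changed since it was read")
        for refname, (_, new_value) in self._updates.items():
            if new_value is None:
                self._store.pop(refname, None)
            else:
                self._store[refname] = new_value
        self._updates.clear()

    def rollback(self):
        """Discard all queued changes."""
        self._updates.clear()
```

In this toy version the two loops in commit() are trivially atomic because nothing else runs concurrently; a real backend would have to hold locks (or use a database transaction) across the verify and apply phases.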
Transition
==========
The current project is only to isolate the reference-handling code and
make it, in principle, exchangeable with another implementation. It
doesn't require any transition.
Moreover, the changes will improve the modularity of the Git code, and
will be beneficial purely on those grounds.
When/if alternate backends are implemented, then the transition will
have to be handled on a case-by-case basis. How references are stored
is mostly a decision internal to a single repository. Any new
repository storage formats should be supported *in addition to* the
traditional storage scheme, to prevent the need for a flag day when
all repositories have to be converted simultaneously.
Git hosters [1] will likely be able to take advantage of alternate
reference backends pretty easily, because they know which tools touch
their repositories and need only update those tools. It is expected
that alternate reference backends will be useful for hosters even if
they don't become practical for end-users.
For end-users it is important that their repository be readable by all
of the tools that they use. So if we want to make a new format a
viable option for normal Git users (let alone make it the new default
format), some coordination will be needed between all of the
commonly-used Git implementations (git-core, libgit2, JGit, and maybe
Dulwich, Grit, ...). Whether or not this happens in real life depends
on how advantageous the hypothetical new format is to Git users and is
beyond the scope of this proposal.
Michael
[1] Full disclosure: this includes my employer, GitHub.
--
Michael Haggerty
[email protected]
http://softwareswirl.blogspot.com/