[RFC/WIP] Pluggable reference backends

Michael Haggerty Mon, 10 Mar 2014 04:02:03 -0700

I have started working on pluggable ref backends.  In this email I
would like to share my plans and solicit feedback.


(This morning I removed this project from the GSoC ideas page, because
it is unfair to ask a student to shoot at a moving target.)

Why?
====

Currently, the reference- and reflog-handling code in Git is too
coupled to the rest of the system.  There are too many places that
know, for example, the difference between loose and packed refs, or
that loose references are stored as files directly under
$GIT_DIR/refs/heads/, or the locking protocols that have to be adhered
to when managing references.  This tight coupling, in turn, makes it
nearly impossible to experiment with alternate reference storage
schemes.

But there is a lot of potential to use alternate reference storage
schemes to fix some currently-unfixable problems, and to implement
some cool new features.

Unfixable problems
------------------

The on-disk format that we currently use to store references makes
some problems impossible to fix:

* It is impossible to get a self-consistent snapshot of all references
  at a given moment in time.  This makes it impossible, even in
  principle, to do object pruning in a 100% race-free way.  (Our
  current workaround of not deleting objects that are less than two
  weeks old works in most cases but, aside from being ugly, has
  holes.)

* There are awkward filesystem-imposed constraints on reference
  naming, for example:

  * D/F conflicts (I): it is not possible to have branches named
    "my-feature" and "my-feature/base" at the same time.

  * D/F conflicts (II): it is not possible to have reflogs for
    branches named "my-feature" and "my-feature/base" at the same
    time.  This leads to the problem that it is not, in general,
    possible to retain reflogs for branches that have been deleted.

  * There are additional constraints on reference names depending on
    the filesystem used to store them.  For example, a Git repository
    on a case-insensitive filesystem fails in confusing ways if there
    are two loose references whose names differ only in case; however,
    packed references differing in case might work for a while.  Also,
    reference names that include Unicode characters can have their
    normalization form changed if they are written on Mac OS.

* The packed-refs file has to be rewritten whenever a packed reference
  is deleted.  (It might be nice to write 0{40} to a loose reference
  file to indicate that the reference has been deleted, but that would
  open the way for more D/F conflicts.)
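
The first D/F conflict can be reproduced with plain files, without
involving git at all.  The following is only an illustrative sketch:
the temporary directory and the dummy value written to the file are
made up for the demonstration.

```shell
#!/bin/sh
# Reproduce a D/F conflict using ordinary files in a scratch directory.
tmp=$(mktemp -d)
mkdir -p "$tmp/refs/heads"

# A loose reference "my-feature" is stored as a regular file...
echo 1111111111111111111111111111111111111111 >"$tmp/refs/heads/my-feature"

# ...so "my-feature/base" would need a directory with the same name,
# which the filesystem refuses to create.
if mkdir "$tmp/refs/heads/my-feature" 2>/dev/null
then
    df_conflict=no
else
    df_conflict=yes
fi
echo "D/F conflict: $df_conflict"

rm -rf "$tmp"
```

The same thing happens under logs/, which is why a reflog for a
deleted "my-feature" cannot coexist with a new "my-feature/base".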

Wild new ideas
--------------

So, I would like to reorganize the Git code to allow pluggable
reference backends.  If we had this, we could try out ideas like

* Retain the idea of loose/packed references, but encode loose
  reference names using a portable naming scheme before storing them
  to the filesystem; maybe something like

      refs/heads/Foo.42 -> refs.dir/heads.dir/%46oo%2e42
      logs/refs/heads/Foo.42 -> refs.dir/heads.dir/%46oo%2e42.log

  Yes, it looks uglier.  But users shouldn't be looking in these
  directories anyway.  This single change would prevent D/F conflicts,
  allow a reference to be deleted by writing 0{40} to its loose
  reference file, allow reflogs to be kept for deleted refs, and
  remove the problem of filesystem-dependent naming constraints.

* Store references in a SQLite database, to get correct transaction
  handling.

* Store references directly in the Git object database.

* Implement repository "groups" that share a common object database
  and also a common reference store.  Each repository in a group would
  get a sub-namespace in the shared database, and store its references
  in names like "refs/member/$MEMBERID/refs/heads/...".  The member
  repos would act like restricted views of the shared database.  This
  would be like a combination of alternates (with lowered risk of
  corruption) and gitnamespaces(7) (but usable for all git commands).

* Reference transactions that can be used across multiple Git
  commands.  Imagine,

      export GIT_TRANSACTION=$(git transaction begin)
      trap 'git transaction rollback' ERR
      git foo ...
      git bar ...
      git baz ...
      if ! git transaction commit
      then
          # Transaction failed; all references rolled back
      else
          # Transaction succeeded; all references updated atomically
      fi
      trap '' ERR
      unset GIT_TRANSACTION

  The "GIT_TRANSACTION" environment variable would tell git to read
  from the usual references, overridden with any reference changes
  that have occurred during the transaction, but write any changes
  (including both old and new values) to the transaction.  The command
  "git transaction commit" would verify that the old values listed in
  the transaction still agree with the current values, and then make
  all of the changes atomically.

  Such transactions could also be broadcast to mirrors when they are
  committed to keep multiple Git repositories in sync.

* One alternate backend might even be a shim that delegates to libgit2
  to do the actual reading/writing of references.  Then new backends
  could be implemented in libgit2 to allow both git and libgit2 to
  benefit.
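
To make the portable naming idea from the first bullet concrete, here
is a minimal sketch of one possible encoding.  The exact rules are an
assumption chosen to reproduce the Foo.42 example above: every
directory component gets a ".dir" suffix, and every character other
than a lowercase ASCII letter, digit, "-" or "_" is written as %xx.
The function names are hypothetical, and only ASCII names are handled.

```shell
#!/bin/sh
# Work bytewise, so that [a-z] means exactly the 26 lowercase letters.
LC_ALL=C
export LC_ALL

# Hypothetical helper: percent-encode a single path component.
encode_component () {
    rest=$1
    out=
    while [ -n "$rest" ]
    do
        tail=${rest#?}        # everything after the first character
        c=${rest%"$tail"}     # the first character itself
        rest=$tail
        case $c in
        [a-z0-9_-])
            out=$out$c ;;
        *)
            # "'$c" makes printf use the character's numeric value.
            out=$out$(printf '%%%02x' "'$c") ;;
        esac
    done
    printf '%s\n' "$out"
}

# Hypothetical helper: map a reference name to its on-disk path.
encode_ref () {
    path=$1
    encoded=
    # Every leading component is a directory and gets a ".dir" suffix.
    while :
    do
        case $path in
        */*)
            encoded=$encoded$(encode_component "${path%%/*}").dir/
            path=${path#*/} ;;
        *)
            break ;;
        esac
    done
    printf '%s\n' "$encoded$(encode_component "$path")"
}

encode_ref refs/heads/Foo.42              # refs.dir/heads.dir/%46oo%2e42
echo "$(encode_ref refs/heads/Foo.42).log"
encode_ref refs/heads/my-feature          # refs.dir/heads.dir/my-feature
encode_ref refs/heads/my-feature/base     # refs.dir/heads.dir/my-feature.dir/base
```

Because "." is always escaped inside a component, the ".dir" and
".log" suffixes cannot collide with an encoded reference name, which
is what removes the D/F conflicts in this sketch.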


The plan
========

It is currently not possible to experiment with any of these things
because of the tight coupling between the reference code and the rest
of git. The goal of this project is first to choke the interactions
down to a coherent interface, and second to make the implementation
selectable at runtime.  The implementation of specific alternate
backends will hopefully follow.

quagga references
-----------------

The overriding task is to isolate the reference-handling code; i.e.,
make sure that only code within refs.c touches git references, and
that the refs API provides all of the features that other code needs
to do its work.

So as a whimsical first milestone, I want to make it possible to
choose a different directory name for storing references and reflogs
by changing one #define statement in refs.c.  The goal is to get the
test suite to run correctly regardless of how this variable is set,
which would be a pretty good check that all reference-handling code
paths go through the refs API.  For no special reason I've been using
"quagga" as the new place, so references go to "$GIT_DIR/quagga/HEAD",
"$GIT_DIR/quagga/refs/heads/master", etc.  (Of course we wouldn't
actually *change* this name; it is only for testing purposes.)  I've
started working on this but there is a lot of code to change
(including test code).

Reference transactions
----------------------

I want to orient the new reference API as much as possible around
transactions.  I think a transaction is a flexible abstraction that
should be implementable by any backend (albeit not always with 100%
ACID compliance) and will allow a couple of existing races to be
fixed.

So as a first step, I will soon submit a patch series that starts
fleshing out the concept of a ref_transaction, and rewrites "git
update-ref --stdin" to use the new API.  For now, ref_transaction will
only be usable within a single git command invocation, but I want to
leave the way open to the GIT_TRANSACTION idea mentioned above.
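
The commit semantics described earlier (record old and new values,
verify the old values, then apply everything) can be sketched with a
toy reference store.  Everything here is hypothetical: the function
names, the TRANSACTION file, and the short stand-in values are
invented for illustration and are not the ref_transaction API.  The
sketch does no locking, so it is not atomic against concurrent
writers, and it does not handle creating or deleting references.

```shell
#!/bin/sh
# Toy reference store: one file per reference under $store.
store=$(mktemp -d)
txn=$store/TRANSACTION

read_ref () {
    # Print the reference's value, or nothing if it does not exist.
    cat "$store/$1" 2>/dev/null
}

txn_update () {
    # Queue "refname old new"; the reference itself is not touched yet.
    printf '%s %s %s\n' "$1" "$2" "$3" >>"$txn"
}

txn_commit () {
    # Phase 1: check that every recorded old value is still current.
    ok=yes
    while read -r ref old new
    do
        [ "$(read_ref "$ref")" = "$old" ] || ok=no
    done <"$txn"
    if [ "$ok" = no ]
    then
        rm -f "$txn"    # roll back: discard the queued updates
        return 1
    fi
    # Phase 2: apply all queued updates.
    while read -r ref old new
    do
        mkdir -p "$store/${ref%/*}"
        printf '%s\n' "$new" >"$store/$ref"
    done <"$txn"
    rm -f "$txn"
}

mkdir -p "$store/refs/heads"
echo aaa >"$store/refs/heads/master"
echo bbb >"$store/refs/heads/topic"

# Both old values match, so both references are updated.
txn_update refs/heads/master aaa ccc
txn_update refs/heads/topic bbb ddd
txn_commit && echo "first commit: ok"

# The old value for master is stale, so neither update is applied.
txn_update refs/heads/master aaa eee
txn_update refs/heads/topic ddd fff
txn_commit || echo "second commit: rejected"

final_master=$(read_ref refs/heads/master)
final_topic=$(read_ref refs/heads/topic)
echo "master=$final_master topic=$final_topic"

rm -rf "$store"
```

A real backend would have to hold locks (or use a database
transaction) between the two phases; the point here is only the
all-or-nothing check of old values.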


Transition
==========

The current project is only to isolate the reference-handling code and
make it, in principle, exchangeable with another implementation.  It
doesn't require any transition.

Moreover, the changes will improve the modularity of the Git code, and
will be beneficial purely on those grounds.

When/if alternate backends are implemented, then the transition will
have to be handled on a case-by-case basis.  How references are stored
is mostly a decision internal to a single repository.  Any new
repository storage formats should be supported *in addition to* the
traditional storage scheme, to prevent the need for a flag day when
all repositories have to be converted simultaneously.

Git hosters [1] will likely be able to take advantage of alternate
reference backends pretty easily, because they know which tools touch
their repositories and need only update those tools.  It is expected
that alternate reference backends will be useful for hosters even if
they don't become practical for end-users.

For end-users it is important that their repository be readable by all
of the tools that they use.  So if we want to make a new format a
viable option for normal Git users (let alone make it the new default
format), some coordination will be needed between all of the
commonly-used Git implementations (git-core, libgit2, JGit, and maybe
Dulwich, Grit, ...).  Whether or not this happens in real life depends
on how advantageous the hypothetical new format is to Git users and is
beyond the scope of this proposal.

Michael

[1] Full disclosure: this includes my employer, GitHub.

-- 
Michael Haggerty
[email protected]
http://softwareswirl.blogspot.com/