Re: [PATCH 00/17] Remove assumptions about refname lifetimes

2013-05-21 Thread Junio C Hamano
Johan Herland  writes:

> On Mon, May 20, 2013 at 6:37 PM, Junio C Hamano  wrote:
> ...
>> That "backing out" can be made more intelligently than just dying
>> with "compare and swap failed--please retry" message, e.g. you at
>> that point notice what you started with, what the other party did
>> while you were working on (i.e. updating refs/tags/), and three-way
>> merge the refs tree, and in cases where "all refs recorded as loose
>> refs" scheme wouldn't have resulted in problematic conflict, such a
>> three-way merge would resolve trivially (you updated refs/heads/ and
>> the update by the other process to refs/tags/ would not conflict
>> with what you did).  But the same three-way merge scheme can be
>> employed with the current flat single packed-refs scheme, can't it?
>
> Yes. (albeit without reusing the machinery we already have for doing
> three-way merges)

I do not think that is relevant, as you do *not* want to reuse the
usual three-way xdiff merge machinery that leaves "<=>" markers for
this usage anyway.

But that is not the primary reason why I am beating this dead horse;
it is the following.

Your version of packed-refs file, and his version of packed-refs
file, and the original you started with, are all sorted in a known
order.  You just look at lines from all and merge them line by line.

I think that logic should become a new blob-level merge driver that
can be reused for normal in-working-tree objects that are sorted.
For that kind of "sorted wordlist" file, conflicts that may
artificially arise only because two sides updated adjacent lines in
the list and you used the normal xdiff merge machinery is unwanted,
and users would benefit by having such a specialized merge driver
outside the context of merging two proposed list of refs.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/17] Remove assumptions about refname lifetimes

2013-05-20 Thread Junio C Hamano
Jeff King  writes:

> But more importantly, it introduces contention between two unrelated
> refs that are being updated. Even if we reconcile the differences
> automatically (e.g., with a "merge-and-retry" strategy), that is going
> to be a serious performance regression for a busy repository, as we
> repeatedly try to reconcile the serialized updates to the refs/ root
> tree.

I think we are on the same page.

> Any transactional solution needs to have the equivalent of ref-level
> locking (i.e., row-level locking in a relational setting).

Not necessarily if we can exploit assumptions such as deletion is
far rare compared to update and creation which in turn are far rare
compared to lookup.  I've been wondering if we can find a cheap
reader-writer lock mechanism, use a single instance of it to govern
the whole repository, and have everybody but ref deleters and "git
pack-ref" take the read side of the lock.

Then only while ref deletion or ref packing is going on, everybody
need to stall, but otherwise the most common "read or update
recently touched refs (aka loose refs)" will be taking the reader
side of that cheap read-write lock and reading a single loose ref
file.

Perhaps such a "cheap on reader side" reader-write lock is hard to
come by?
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/17] Remove assumptions about refname lifetimes

2013-05-20 Thread Johan Herland
On Mon, May 20, 2013 at 6:59 PM, Jeff King  wrote:
> On Mon, May 20, 2013 at 09:37:30AM -0700, Junio C Hamano wrote:
>> Johan Herland  writes:
>>
>> > For server-class installations we need ref storage that can be read
>> > (and updated?) atomically, and the current system of loose + packed
>> > files won't work since reading (and updating) more than a single file
>> > is not an atomic operation. Trivially, one could resolve this by
>> > dropping loose refs, and always using a single packed-refs file, but
>> > that would make it prohibitively expensive to update refs (the entire
>> > packed-refs file must be rewritten for every update).
>> >
>> > Now, observe that we don't have these race conditions in the object
>> > database, because it is an add-only immutable data store.
>> >
>> > What if we stored the refs as a tree object in the object database,
>> > referenced by a single (loose) ref?
>>
>> What is the cost of updating a single branch with that scheme?
>
> Yeah, I had the same thoughts as you here; I don't think this is really
> much better than updating a single packed-refs file. You may only
> have to touch log(N) of the entries to update the trees along the path
> to your ref (assuming you have a bushy refs/ tree; in practice, of
> course, you have a lot of entries in refs/heads/, so you do not quite
> get log behavior). So algorithmically it may be slightly better, but you
> pay a much higher constant, in that you are creating many tree objects
> along the path (and reading them on lookup).

True.

> But more importantly, it introduces contention between two unrelated
> refs that are being updated. Even if we reconcile the differences
> automatically (e.g., with a "merge-and-retry" strategy), that is going
> to be a serious performance regression for a busy repository, as we
> repeatedly try to reconcile the serialized updates to the refs/ root
> tree.
>
> Any transactional solution needs to have the equivalent of ref-level
> locking (i.e., row-level locking in a relational setting). It is OK for
> two simultaneous updates to different rows to happen in random order; we
> don't care about that level of atomicity. The important thing is that
> the _readers_ see transactions in consistent order; if we know that
> update B started after update A finished, then we must never see update
> A without update B reflected. And I think that is more or less a solved
> problem in the database world.
>
> That is what prevents:
>
>   git update-ref A B
>   git update-ref -d B
>
> from creating a view where the reader sees neither A nor B (which is
> what can happen with the current filesystem view).

All true. Just a last spasm from the dying notion that we could solve
this in the filesystem, without using "proper" database...

...Johan

-- 
Johan Herland, 
www.herland.net
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/17] Remove assumptions about refname lifetimes

2013-05-20 Thread Johan Herland
On Mon, May 20, 2013 at 6:37 PM, Junio C Hamano  wrote:
> Johan Herland  writes:
>> For server-class installations we need ref storage that can be read
>> (and updated?) atomically, and the current system of loose + packed
>> files won't work since reading (and updating) more than a single file
>> is not an atomic operation. Trivially, one could resolve this by
>> dropping loose refs, and always using a single packed-refs file, but
>> that would make it prohibitively expensive to update refs (the entire
>> packed-refs file must be rewritten for every update).
>>
>> Now, observe that we don't have these race conditions in the object
>> database, because it is an add-only immutable data store.
>>
>> What if we stored the refs as a tree object in the object database,
>> referenced by a single (loose) ref?
>
> What is the cost of updating a single branch with that scheme?
>
> Doesn't it end up recording roughly the same amount of information
> as updating a single packed-refs file that is flat, but with the
> need to open a few tree objects (top-level, refs/, and refs/heads/),
> writing out a blob that stores the object name at the tip, computing
> the updated trees (refs/heads/, refs/ and top-level), and then
> finally doing the compare-and-swap of that single loose ref?

Yes, except that when you update packed-refs, you have to write out
the _entire_ file, whereas with this scheme you only have to write out
the part of the refs hierarchy you actually touched (e.g. rewriting
refs/heads/foo would not have to write out anything inside
refs/tags/*). If you have small number of branches, and a huge number
of tags, this scheme might end up being cheaper than writing the
entire packed-refs. But in general, it's probably much more expensive
to go via the odb.

> You may guarantee atomicity but it is the same granularity of
> atomicity as a single packed-refs file.

Yes, as I argued elsewhere in this thread: It seems that _any_
filesystem-based solution must resort to having all updates depend on
a single file in order to guarantee atomicity.

> When you are updating a
> branch while somebody else is updating a tag, of course you do not
> have to look at refs/tags/ in your operation and you can write out
> your final packed-refs equivalent tree to the object database
> without racing with the other process, but the top-level you come up
> with and the top-level the other process comes up with (which has
> an up-to-date refs/tags/ part, but has a stale refs/heads/ part from
> your point of view) have to race to update that single loose ref,
> and one of you have to back out.

True.

> That "backing out" can be made more intelligently than just dying
> with "compare and swap failed--please retry" message, e.g. you at
> that point notice what you started with, what the other party did
> while you were working on (i.e. updating refs/tags/), and three-way
> merge the refs tree, and in cases where "all refs recorded as loose
> refs" scheme wouldn't have resulted in problematic conflict, such a
> three-way merge would resolve trivially (you updated refs/heads/ and
> the update by the other process to refs/tags/ would not conflict
> with what you did).  But the same three-way merge scheme can be
> employed with the current flat single packed-refs scheme, can't it?

Yes. (albeit without reusing the machinery we already have for doing
three-way merges)

> Even worse, what is the cost of looking up the value of a single
> branch?  You would need to open a few tree objects and the leaf blob
> that records the object name the ref points at, wouldn't you?

Yes. Probably a showstopper, although single branch/ref lookup might
not be so common on server side, as it is on the user side.

> Right now, such a look-up is either opening a single small file and
> reading the first 41 bytes off of it, and falling back (when the ref
> in question is packed) to read a single packed-refs file and finding
> the ref you want from it.
>
> So...

Yep...


...Johan

-- 
Johan Herland, 
www.herland.net
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/17] Remove assumptions about refname lifetimes

2013-05-20 Thread Jeff King
On Mon, May 20, 2013 at 09:37:30AM -0700, Junio C Hamano wrote:

> Johan Herland  writes:
> 
> > For server-class installations we need ref storage that can be read
> > (and updated?) atomically, and the current system of loose + packed
> > files won't work since reading (and updating) more than a single file
> > is not an atomic operation. Trivially, one could resolve this by
> > dropping loose refs, and always using a single packed-refs file, but
> > that would make it prohibitively expensive to update refs (the entire
> > packed-refs file must be rewritten for every update).
> >
> > Now, observe that we don't have these race conditions in the object
> > database, because it is an add-only immutable data store.
> >
> > What if we stored the refs as a tree object in the object database,
> > referenced by a single (loose) ref?
> 
> What is the cost of updating a single branch with that scheme?

Yeah, I had the same thoughts as you here; I don't think this is really
much better than updating a single packed-refs file. You may only
have to touch log(N) of the entries to update the trees along the path
to your ref (assuming you have a bushy refs/ tree; in practice, of
course, you have a lot of entries in refs/heads/, so you do not quite
get log behavior). So algorithmically it may be slightly better, but you
pay a much higher constant, in that you are creating many tree objects
along the path (and reading them on lookup).

But more importantly, it introduces contention between two unrelated
refs that are being updated. Even if we reconcile the differences
automatically (e.g., with a "merge-and-retry" strategy), that is going
to be a serious performance regression for a busy repository, as we
repeatedly try to reconcile the serialized updates to the refs/ root
tree.

Any transactional solution needs to have the equivalent of ref-level
locking (i.e., row-level locking in a relational setting). It is OK for
two simultaneous updates to different rows to happen in random order; we
don't care about that level of atomicity. The important thing is that
the _readers_ see transactions in consistent order; if we know that
update B started after update A finished, then we must never see update
A without update B reflected. And I think that is more or less a solved
problem in the database world.

That is what prevents:

  git update-ref A B
  git update-ref -d B

from creating a view where the reader sees neither A nor B (which is
what can happen with the current filesystem view).

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/17] Remove assumptions about refname lifetimes

2013-05-20 Thread Junio C Hamano
Johan Herland  writes:

> For server-class installations we need ref storage that can be read
> (and updated?) atomically, and the current system of loose + packed
> files won't work since reading (and updating) more than a single file
> is not an atomic operation. Trivially, one could resolve this by
> dropping loose refs, and always using a single packed-refs file, but
> that would make it prohibitively expensive to update refs (the entire
> packed-refs file must be rewritten for every update).
>
> Now, observe that we don't have these race conditions in the object
> database, because it is an add-only immutable data store.
>
> What if we stored the refs as a tree object in the object database,
> referenced by a single (loose) ref?

What is the cost of updating a single branch with that scheme?

Doesn't it end up recording roughly the same amount of information
as updating a single packed-refs file that is flat, but with the
need to open a few tree objects (top-level, refs/, and refs/heads/),
writing out a blob that stores the object name at the tip, computing
the updated trees (refs/heads/, refs/ and top-level), and then
finally doing the compare-and-swap of that single loose ref?

You may guarantee atomicity but it is the same granularity of
atomicity as a single packed-refs file.  When you are updating a
branch while somebody else is updating a tag, of course you do not
have to look at refs/tags/ in your operation and you can write out
your final packed-refs equivalent tree to the object database
without racing with the other process, but the top-level you come up
with and the top-level the other process comes up with (which has
an up-to-date refs/tags/ part, but has a stale refs/heads/ part from
your point of view) have to race to update that single loose ref,
and one of you have to back out.

That "backing out" can be made more intelligently than just dying
with "compare and swap failed--please retry" message, e.g. you at
that point notice what you started with, what the other party did
while you were working on (i.e. updating refs/tags/), and three-way
merge the refs tree, and in cases where "all refs recorded as loose
refs" scheme wouldn't have resulted in problematic conflict, such a
three-way merge would resolve trivially (you updated refs/heads/ and
the update by the other process to refs/tags/ would not conflict
with what you did).  But the same three-way merge scheme can be
employed with the current flat single packed-refs scheme, can't it?

Even worse, what is the cost of looking up the value of a single
branch?  You would need to open a few tree objects and the leaf blob
that records the object name the ref points at, wouldn't you?

Right now, such a look-up is either opening a single small file and
reading the first 41 bytes off of it, and falling back (when the ref
in question is packed) to read a single packed-refs file and finding
the ref you want from it.

So...
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Storing refs in the odb (was: Re: [PATCH 00/17] Remove assumptions about refname lifetimes)

2013-05-20 Thread Johan Herland
On Mon, May 20, 2013 at 2:15 PM, Michael Haggerty  wrote:
> This is a very interesting idea.  "It's turtles all the way down."

:)

> On 05/20/2013 12:28 PM, Johan Herland wrote:
>> For server-class installations we need ref storage that can be read
>> (and updated?) atomically, and the current system of loose + packed
>> files won't work since reading (and updating) more than a single file
>> is not an atomic operation. Trivially, one could resolve this by
>> dropping loose refs, and always using a single packed-refs file, but
>> that would make it prohibitively expensive to update refs (the entire
>> packed-refs file must be rewritten for every update).
>
> Correct, or the "packed-refs" file would have to be updated in place
> using some database-style approach for locking/transactions/whatever.
>
>> Now, observe that we don't have these race conditions in the object
>> database, because it is an add-only immutable data store.
>
> Except for prune, of course, which can cause race conditions WRT to writers.

Yes, but that is a different race, in need of a different solution.
E.g. that race is concerned with pruning unreachable objects that are
about to become reachable by a concurrent operation, which is AFAICS
independent from the ref update race that we're discussing here.

>> What if we stored the refs as a tree object in the object database,
>> referenced by a single (loose) ref? There would be a _single_ (albeit
>> highly contentious) file outside the object database that represent
>> the current state of the refs, but hopefully we can guarantee
>> atomicity when reading (and updating?) that one file. Transactions can
>> be done by:
>>  1. Recording the tree id holding the refs before starting manipulation.
>>  2. Creating a new tree object holding the manipulated state.
>>  3. Re-checking the tree id before replacing the loose ref. If
>> unchanged: commit, else: rollback/error out.
>
> There are two closely related possibilities and I'm not sure which one
> you mean:
>
> * Effectively treat all of the refs as loose refs, but stored not in the
> filesystem but rather in a hierarchical tree structure in the object
> database.  E.g., all of the refs directly under "refs/heads" would be in
> one tree object, those in refs/remotes/foo in a second, those for
> refs/remotes/bar in another etc. and all of them linked up together in a
> tree object representing "refs".
>
> * Effectively treat all of the refs as packed refs, but store the single
> "packed-refs" file as a single object in the object database.
>
> (The first alternative sounds more practical to me.  I also guess that's
> what you mean, since down below you say that each change would require
> producing "a few objects".)

The first alternative is what I had in mind.

Initially I thought to record it as if one were to record a new tree
using .git/refs as the root of your worktree (having exploded all
packed-refs into loose refs). I.e. you would have "heads", "tags",
"remotes" as subtrees of "reference tree", and then e.g. in the
"heads" subtree, there would be an entry named "master" pointing to a
_blob_, and the contents of that blob would be the commit id of the
current tip of the master branch.

Obviously the next optimization would be to drop the "master" -> blob
-> commit indirection, and use "master" -> commit instead, i.e. the
"master" tree entry corresponds directly to the commit to which it
points (symrefs would naturally be recorded as symlinks). This would
automatically provide reachability for all refs, but as you correctly
observe:

> Of course in either case we couldn't use a tree object directly, because
> these new "reference tree" objects would refer not only to blobs and
> other trees but also to commits and tags.

Indeed. I don't know if the best solution would be to actually _allow_
that (which would complicate the object parsing code somewhat; a tree
entry pointing to a commit is usually interpreted as a submodule, but
that is not what we'd want for the ref tree, and a tree entry pointing
at a tag has AFAIK not yet been done), or whether it means we need to
come up with a different kind of structure.

> [I know this is not what you are suggesting, but I am reminded of
> Subversion, which stores trunk, branches, and tags in the same "tree"
> space as the contents of the working trees.  A Subversion commit
> references a gigantic tree encompassing all branches of development and
> all files on all of those branches (with cheap copies to reduce the
> redundancy):
>
> /
> /trunk/
> /trunk/Makefile
> /trunk/src/
> /trunk/src/foo.c
> /branches/
> /branches/next/
> /branches/next/Makefile
> /branches/next/src/
> /branches/next/src/foo.c
> /branches/pu/
> /branches/pu/Makefile
> /branches/pu/src/
> /branches/pu/src/foo.c
> /tags/
> /tags/v1.8.2/
> /tags/v1.8.2/Makefile
> /tags/v1.8.2/src/
> /tags/v1.8.2/src/foo.c
> etc...
>
> A Subversion commit thus desc

Re: [PATCH 00/17] Remove assumptions about refname lifetimes

2013-05-20 Thread Michael Haggerty
This is a very interesting idea.  "It's turtles all the way down."

On 05/20/2013 12:28 PM, Johan Herland wrote:
> (Sorry for going slightly off-topic and returning to the general
> discussion on how to resolve the race conditions...)
> 
> For server-class installations we need ref storage that can be read
> (and updated?) atomically, and the current system of loose + packed
> files won't work since reading (and updating) more than a single file
> is not an atomic operation. Trivially, one could resolve this by
> dropping loose refs, and always using a single packed-refs file, but
> that would make it prohibitively expensive to update refs (the entire
> packed-refs file must be rewritten for every update).

Correct, or the "packed-refs" file would have to be updated in place
using some database-style approach for locking/transactions/whatever.

> Now, observe that we don't have these race conditions in the object
> database, because it is an add-only immutable data store.

Except for prune, of course, which can cause race conditions WRT to writers.

> What if we stored the refs as a tree object in the object database,
> referenced by a single (loose) ref? There would be a _single_ (albeit
> highly contentious) file outside the object database that represent
> the current state of the refs, but hopefully we can guarantee
> atomicity when reading (and updating?) that one file. Transactions can
> be done by:
>  1. Recording the tree id holding the refs before starting manipulation.
>  2. Creating a new tree object holding the manipulated state.
>  3. Re-checking the tree id before replacing the loose ref. If
> unchanged: commit, else: rollback/error out.

There are two closely related possibilities and I'm not sure which one
you mean:

* Effectively treat all of the refs as loose refs, but stored not in the
filesystem but rather in a hierarchical tree structure in the object
database.  E.g., all of the refs directly under "refs/heads" would be in
one tree object, those in refs/remotes/foo in a second, those for
refs/remotes/bar in another etc. and all of them linked up together in a
tree object representing "refs".

* Effectively treat all of the refs as packed refs, but store the single
"packed-refs" file as a single object in the object database.

(The first alternative sounds more practical to me.  I also guess that's
what you mean, since down below you say that each change would require
producing "a few objects".)

Of course in either case we couldn't use a tree object directly, because
these new "reference tree" objects would refer not only to blobs and
other trees but also to commits and tags.

> All readers would trivially have access to a consistent refs view,
> since the state of the entire refs hierarchy is held in the tree id
> read from that single loose ref.
> 
> It seems to me this should be somewhat less prohibitively expensive
> than maintaining all refs in a single packed-refs file. That said, we
> do end up producing a few new objects for every single ref update,
> most of which would be thrown away by a future "gc". This might bog
> things down, but I'm not sure how much.
> 
> I'm sure someone must have had this idea before (although I don't
> remember this alternative being raised at the Git Merge conference),
> so please enlighten me as to why this won't work... ;)

[I know this is not what you are suggesting, but I am reminded of
Subversion, which stores trunk, branches, and tags in the same "tree"
space as the contents of the working trees.  A Subversion commit
references a gigantic tree encompassing all branches of development and
all files on all of those branches (with cheap copies to reduce the
redundancy):

/
/trunk/
/trunk/Makefile
/trunk/src/
/trunk/src/foo.c
/branches/
/branches/next/
/branches/next/Makefile
/branches/next/src/
/branches/next/src/foo.c
/branches/pu/
/branches/pu/Makefile
/branches/pu/src/
/branches/pu/src/foo.c
/tags/
/tags/v1.8.2/
/tags/v1.8.2/Makefile
/tags/v1.8.2/src/
/tags/v1.8.2/src/foo.c
etc...

A Subversion commit thus describes the state of *every* branch and tag
at that moment in time.  The model is conceptually very simple (in fact,
too simple, and I believe the Subversion developers regret not having
distinguished between the branch namespace and the file namespace).]

The main difficulty with this idea will be the extreme contention on
that "last loose reference file" pointing at the root of the reference
tree.  Essentially *every* change to the repository will have to create
a new reference tree and point this file at the new version.  I doubt
that would be a problem for short-lived operations, but I fear that a
long-lived operation would *never* get done.  By the time it had
finished constructing its new reference tree, some other short-lived
operation will have changed it, and the long-lived process will have to
choose between

* Restart from the beginning.

* Die with

Re: [PATCH 00/17] Remove assumptions about refname lifetimes

2013-05-20 Thread Johan Herland
On Sun, May 19, 2013 at 10:26 PM, Michael Haggerty  wrote:
> Recently a number of race conditions related to references have been
> discovered.  There is likely to be a two-pronged solution to the
> races:
>
> * For traditional, filesystem-based references, there will have to be
>   more checks that the ref caches are still up-to-date at the time of
>   their use (see, for example, [1]).  If not, the ref cache will have
>   to be invalidated and reloaded.  Assuming that the invalidation of
>   the old cache includes freeing its memory, such an invalidation will
>   cause lots of refname strings to be freed even though callers might
>   still hold references to them.
>
> * For server-class installations, filesystem-based references might
>   not be robust enough for 100% reliable operation, because the
>   reading of the complete set of references is not an atomic
>   operation.  If another reference storage mechanism is developed,
>   there is no reason to expect the refnames strings to have long
>   lifetimes.

(Sorry for going slightly off-topic and returning to the general
discussion on how to resolve the race conditions...)

For server-class installations we need ref storage that can be read
(and updated?) atomically, and the current system of loose + packed
files won't work since reading (and updating) more than a single file
is not an atomic operation. Trivially, one could resolve this by
dropping loose refs, and always using a single packed-refs file, but
that would make it prohibitively expensive to update refs (the entire
packed-refs file must be rewritten for every update).

Now, observe that we don't have these race conditions in the object
database, because it is an add-only immutable data store.

What if we stored the refs as a tree object in the object database,
referenced by a single (loose) ref? There would be a _single_ (albeit
highly contentious) file outside the object database that represent
the current state of the refs, but hopefully we can guarantee
atomicity when reading (and updating?) that one file. Transactions can
be done by:
 1. Recording the tree id holding the refs before starting manipulation.
 2. Creating a new tree object holding the manipulated state.
 3. Re-checking the tree id before replacing the loose ref. If
unchanged: commit, else: rollback/error out.

All readers would trivially have access to a consistent refs view,
since the state of the entire refs hierarchy is held in the tree id
read from that single loose ref.

It seems to me this should be somewhat less prohibitively expensive
than maintaining all refs in a single packed-refs file. That said, we
do end up producing a few new objects for every single ref update,
most of which would be thrown away by a future "gc". This might bog
things down, but I'm not sure how much.

I'm sure someone must have had this idea before (although I don't
remember this alternative being raised at the Git Merge conference),
so please enlighten me as to why this won't work... ;)

...Johan


PS: Keeping reflogs is just a matter of wrapping the ref tree in a
commit object using the previous state of the ref tree as its parent.

-- 
Johan Herland, 
www.herland.net
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 00/17] Remove assumptions about refname lifetimes

2013-05-19 Thread Michael Haggerty
Prior to this patch series, the refs API said nothing about the
lifetime of the refname parameter passed to each_ref_fn callbacks by
the for_each_ref()-style iteration functions.  De facto, the refnames
usually had long lives because they were pointers into the ref_cache
data structures, and those are only invalidated under rare
circumstances.  And some callers were assuming a long lifetime, for
example storing references to the refname string instead of copying
it.

But it has long been the case that ref caches could be invalidated,
for example when a packed ref is deleted.  AFAIK there was never much
clarity about what that might mean for callers.

Recently a number of race conditions related to references have been
discovered.  There is likely to be a two-pronged solution to the
races:

* For traditional, filesystem-based references, there will have to be
  more checks that the ref caches are still up-to-date at the time of
  their use (see, for example, [1]).  If not, the ref cache will have
  to be invalidated and reloaded.  Assuming that the invalidation of
  the old cache includes freeing its memory, such an invalidation will
  cause lots of refname strings to be freed even though callers might
  still hold references to them.

* For server-class installations, filesystem-based references might
  not be robust enough for 100% reliable operation, because the
  reading of the complete set of references is not an atomic
  operation.  If another reference storage mechanism is developed,
  there is no reason to expect the refnames strings to have long
  lifetimes.

A prerequisite for either of these approaches is to harmonize what
callers assume and what the API guarantees.

The purpose of this patch series is to track down callers who assume
that the refnames that they receive via a each_ref_fn callback have
lifetimes beyond the duration of the callback invocation and to
rewrite them to work without that assumption.  The final patch
documents explicitly that callers should not retain references to the
refnames.

A word about how I audited the code:

To find callers making unwarranted assumptions, I (temporarily)
changed do_one_ref() to do a xstrdup() of the refname, pass the copy
to the callback function, then free() the copy.  This caused
ill-behaved callers to access freed memory, which could be detected by
running the testsuite under valgrind.  There were indeed a number of
such errors.  All of them are fixed by this patch series, and the test
just described now runs without errors.

I plan to do a second audit by hand to see if the test suite and/or
valgrind missed anything.

The last two patches are RFCs.  I would like some input on the second
to last because I am not very familiar with how the object array entry
names are used, how many might be created, etc.  The last patch is an
illustration of how I would like to change the API docs, but it will
only be ready when all of the code has been audited and adapted.
Please see especially my comments on these two patches for more
information.

[1] http://thread.gmane.org/gmane.comp.version-control.git/223299

Michael Haggerty (17):
  describe: make own copy of refname
  fetch: make own copies of refnames
  add_rev_cmdline(): make a copy of the name argument
  builtin_diff_tree(): make it obvious that function wants two entries
  cmd_diff(): use an object_array for holding trees
  cmd_diff(): rename local variable "list" -> "entry"
  cmd_diff(): make it obvious which cases are exclusive of each other
  revision: split some overly-long lines
  gc_boundary(): move the check "alloc <= nr" to caller
  get_revision_internal(): make check less mysterious
  object_array: add function object_array_filter()
  object_array_remove_duplicates(): rewrite to reduce copying
  fsck: don't put a void*-shaped peg in a char*-shaped hole
  find_first_merges(): initialize merges variable using initializer
  find_first_merges(): remove unnecessary code
  object_array_entry: copy name before storing in name field
  refs: document the lifetime of the refname passed to each_ref_fn

 builtin/describe.c |  6 +++--
 builtin/diff.c | 68 ++
 builtin/fetch.c|  4 ++--
 builtin/fsck.c |  2 +-
 object.c   | 50 +++
 object.h   | 23 --
 refs.h | 22 +-
 revision.c | 61 +---
 revision.h | 32 -
 submodule.c|  6 ++---
 10 files changed, 172 insertions(+), 102 deletions(-)

-- 
1.8.2.3

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html