On 10/03/2014 10:39 PM, Jeff King wrote:
> [...]
> Instead, this patch pushes the extra work onto prune, which
> runs less frequently (and has to look at the whole object
> graph anyway). It creates a new category of objects: objects
> which are not recent, but which are reachable from a recent
> object. We do not prune these objects, just like the
> reachable and recent ones.
>
> This lets us avoid the recursive check above, because if we
> have an object, even if it is unreachable, we should have
> its referent:
>
> - if we are creating new objects, then we cannot create
> the parent object without having the child
>
> - and if we are pruning objects, we will not prune the child
> if we are keeping the parent
>
> The big exception would be if one were to write the object
> in a way that avoided referential integrity (e.g., using
> hash-object). But if you are in the habit of doing that, you
> deserve what you get.
>
> Naively, the simplest way to implement this would be to add
> all recent objects as tips to the reachability traversal.
> However, this does not perform well. In a recently-packed
> repository, all reachable objects will also be recent, and
> therefore we have to consider each object twice (both as a
> tip, and when we reach it in the traversal). I tested this,
> and it added about 10s to a 30s prune on linux.git. This
> patch instead performs the normal reachability traversal
> first, then follows up with a second traversal for recent
> objects, skipping any that have already been marked.
I haven't read all of the old code, but if I understand correctly, this
is your new algorithm:
1. Walk from all references etc., marking reachable objects.
2. Iterate over *all* objects, in no particular order, skipping the
objects that are already known to be reachable. Use any unreachable
object that has a recent mtime as a tip for a second traversal that
marks all of its references as "to-keep".
3. Iterate over any objects that are not marked "to-keep". (I assume
that this iteration is in no particular order.) For each object:
* [Presumably] verify that its mtime is still "old"
* If so, prune the object
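To make sure I have the three steps right, here is how I picture them in
a toy model (Python; the flat dict standing in for the object database,
the object names, and the helper itself are mine, not the real
implementation, which walks loose and packed objects on disk):

```python
import time

EXPIRE = 1209600  # two weeks, git's default expiry for prune


def prune(objects, refs, now=None):
    """objects: {sha: (mtime, [referent shas])}; refs: iterable of tip shas."""
    now = now if now is not None else time.time()
    keep = set()

    def mark(tip):
        stack = [tip]
        while stack:
            sha = stack.pop()
            if sha in keep or sha not in objects:
                continue
            keep.add(sha)
            stack.extend(objects[sha][1])

    # Step 1: traverse from all refs, marking reachable objects.
    for tip in refs:
        mark(tip)
    # Step 2: use each recent-but-unmarked object as a tip for a
    # second traversal, marking everything it reaches as "to-keep".
    for sha, (mtime, _) in list(objects.items()):
        if sha not in keep and now - mtime < EXPIRE:
            mark(sha)
    # Step 3: prune everything else, re-checking the mtime just before
    # deletion.  This final re-check is where the race lives.
    for sha, (mtime, _) in list(objects.items()):
        if sha not in keep and now - mtime >= EXPIRE:
            del objects[sha]
    return objects
```

Note that in this model an old object kept alive only via a recent
unreachable tip survives, which matches my reading of the patch.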
I see some problems with this.
* The one that you mentioned in your cover letter, namely that prune's
final mtime check is not atomic with the object deletion. I agree
that this race is so much shorter than the others that we can accept
a solution that doesn't eliminate it, so let's forget about this one.
* If the final mtime check fails, then the object is recognized as new
and not pruned. But does that prevent its referents from being pruned?
* When this situation is encountered, you would have to start another
object traversal starting at the "renewed" object to mark its
referents "to-keep". I don't see that you do this. Another, much
less attractive alternative would be to abort the prune operation
if this situation arises. But even if you do one of these...
* ...assuming the iteration in step 3 is in no defined order, a
referent might *already* have been pruned before you notice the
"renewed" object.
So although your changes are a big improvement, it seems to me that they
still leave a race with a window approximately as long as the time it
takes to scan and prune the unreachable objects.
I think that the only way to get rid of that race is to delete objects
in parent-to-child order; in other words, *only prune an object after
all objects that refer to it have been pruned*. This could be done by
maintaining reference counts of the to-be-pruned objects and only
deleting an object once its reference count is zero.
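Something like the following sketch is what I have in mind (again a toy
model, not git's code; the counting could of course be folded into the
existing traversal):

```python
def prune_in_order(doomed, objects):
    """doomed: set of shas to prune; objects: {sha: [referent shas]}."""
    # Count, for each doomed object, how many *doomed* objects refer
    # to it.  References from objects we are keeping don't matter here,
    # because those objects are not being deleted at all.
    refcount = {sha: 0 for sha in doomed}
    for sha in doomed:
        for child in objects[sha]:
            if child in doomed:
                refcount[child] += 1
    # Delete parents first; a child becomes deletable only when its
    # last doomed referrer is gone.
    order = []
    ready = [sha for sha in doomed if refcount[sha] == 0]
    while ready:
        sha = ready.pop()
        order.append(sha)  # the actual deletion would happen here
        for child in objects[sha]:
            if child in doomed:
                refcount[child] -= 1
                if refcount[child] == 0:
                    ready.append(child)
    return order  # parent-to-child deletion order
```

With that ordering, a "renewed" object can never leave a dangling
reference behind, because its referents are by construction deleted
after it.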
The next point that I'm confused by is what happens when a new object or
reference is created while prune is running, and the new object or
reference refers to old objects. I think when we discussed this
privately I claimed that the following freshenings would not be
necessary, but now I think that they are (sorry about that!).
Let's take the simpler case first. Suppose I run the following command
between steps 1 and 3:
git update-ref refs/heads/newbranch $COMMIT
where $COMMIT is a previously-unreachable object. This doesn't affect
the mtime of $COMMIT, does it? So how does prune know that it shouldn't
delete $COMMIT?
-> So ISTM that updating a reference (or any other traversal starting
point, like the index) must freshen the mtime of any object newly
referred to.
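In the toy model from above, the rule I am proposing would look roughly
like this (update_ref() is a hypothetical helper, not git's real ref
machinery):

```python
import time


def update_ref(refs, objects, name, sha, now=None):
    """Point ref `name` at `sha`, freshening that object's mtime first."""
    mtime, referents = objects[sha]
    objects[sha] = (now if now is not None else time.time(), referents)
    refs[name] = sha
```

The point is only that the freshening happens no later than the ref
update, so a concurrent prune's final mtime check sees $COMMIT as
recent.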
A more complicated case: suppose I create a new $COMMIT referring to an
old $TREE during step 2, *after* prune has scanned the directory that
now contains $COMMIT. (I.e., the scan in step 2 never notices $COMMIT.)
Then I create a new reference pointing at $COMMIT. (I.e., the scan in
step 1 never noticed that the reference exists.) None of this affects
the mtime of $TREE, does it? So how does prune know that it mustn't
prune $TREE?
-> It seems to me that the creation of $COMMIT has to freshen the mtime
of $TREE, so that the final mtime check in step 3 realizes that it
shouldn't prune $TREE. Or to generalize, whenever a new object is
created which refers to existing objects, the direct referents of the
new object have to have their mtimes freshened. However, when an attempt
to write a new object accidentally coincides with an object that already
exists, *that* object needs to be freshened but *its* referents do *not*.
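Concretely, the object-write rule I am proposing would look something
like this in the same toy model (freshen() and write_object() are
illustrative helpers, not git internals):

```python
import time


def freshen(objects, sha, now):
    mtime, referents = objects[sha]
    objects[sha] = (now, referents)


def write_object(objects, sha, referents, now=None):
    now = now if now is not None else time.time()
    if sha in objects:
        # The write collided with an existing object: freshen *that*
        # object, but not its referents.
        freshen(objects, sha, now)
    else:
        # A genuinely new object: store it and freshen its *direct*
        # referents, so a concurrent prune's final mtime check sees
        # them as recent.
        objects[sha] = (now, referents)
        for child in referents:
            if child in objects:
                freshen(objects, child, now)
```

Freshening only the direct referents suffices because, by induction,
those referents' own referents were freshened when *they* were written.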
I hope I understood that all correctly...
Michael
--
Michael Haggerty
[email protected]