On Mon, Jul 01, 2013 at 09:02:31PM -0600, Martin Fick wrote:
> A simple synthetic test case with 1M refs all pointing to the same
> sha1 seems to be easily handled by git these days. However, in our
> experience with our internal git repo, we still have performance
> issues related to having too many refs, in our kernel/msm instance we
> have around 400K.
I'm not too surprised. There's O(n^2) behavior in fetch-pack's
mark_complete function, as it adds each of the N ref tips to a
commit_list, sorting by date (so each insertion is O(n)).
I posted two alternative patches in May of 2011. The first simply
avoids adding duplicate objects, which is simple and covers many
real-world cases (e.g., an "alternates" repository which has a bunch of
copies of the same tags, one per fork). The second one switches the
commit_list out for a heap-based priority queue.
We ended up taking the first (as ea5f220), since it was trivial and
obviously correct, but didn't bother with the second since:
1. There had been no real-world reports of it.
2. While in theory a priority queue implementation could be reused in
other spots, too, it ended up being a pain to use, as most of
the callers wanted list-like splicing.
You can see the original here:
http://thread.gmane.org/gmane.comp.version-control.git/174003/focus=174005
Though it probably doesn't apply cleanly anymore. However, I've kept it
rebased over the years at:
git://github.com/peff/git.git jk/fast-commit-list
Junio recently added a priority queue implementation in b4b594a
(prio-queue: priority queue of pointers to structs, 2013-06-06), which
is currently in next. So a modern version of that series would build on
top of that, rather than my priority queue.
And yet another alternative would be to keep the list unsorted during
the mark_complete calls, and then sort it at the end. Like this:
diff --git a/fetch-pack.c b/fetch-pack.c
index abe5ffb..4df8abd 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -505,7 +505,7 @@ static int mark_complete(const char *refname, const unsigned char *sha1, int flag, void *cb_data)
 		struct commit *commit = (struct commit *)o;
 		if (!(commit->object.flags & COMPLETE)) {
 			commit->object.flags |= COMPLETE;
-			commit_list_insert_by_date(commit, &complete);
+			commit_list_insert(commit, &complete);
 		}
 	}
 	return 0;
@@ -622,6 +622,7 @@ static int everything_local(struct fetch_pack_args *args,
 	if (!args->depth) {
 		for_each_ref(mark_complete, NULL);
 		for_each_alternate_ref(mark_alternate_complete, NULL);
+		commit_list_sort_by_date(&complete);
 		if (cutoff)
 			mark_recent_complete_commits(args, cutoff);
 	}
(If you're wondering why we didn't do this trivial bit at the time, it
was because back then we did not yet have René's nice linked-list
mergesort that backs commit_list_sort_by_date).
> The result, a copy of linus' repo with a million unique
> valid refs and a git fetch of a single updated ref taking a
> very long time (55mins and it did not complete yet). Note,
> with 100K refs it completes in about 2m40s. It is likely
> not linear since 2m40s * 10 would be ~26m (but the
> difference could also just be how the data in the sha1s are
> ordered).
That sounds like the O(n^2) problem. My timings back then with 100K refs
were 1-2 minutes. Does the patch above fix it for you?
-Peff