On Tue, Aug 09, 2016 at 10:04:11AM -0400, Jeff King wrote:
> Here's the code to do the cycle-breaking.
> It seems to perform well, and it does break the cycles (the later check
> during the write does not kick in, and we get no warnings). I didn't dig
> into the fates of specific objects, but any cycles should be added to
> the delta-compression phase.
Curious about that last part, I did the repack both with and without
this patch. We should be able to see the difference in the progress
meter from the "Compressing" phase. And indeed, it goes from 3030778
objects to 3031056, an increase of 278 objects. Which just so happens to
be exactly the number of "warning: recursive delta detected" lines that
are printed without the patch.
Not surprising, but a nice confirmation that everything is working as
expected. But while looking at this, I managed to unravel what's been
going on this whole time.
I ran the repack again with stock git (no MRU patch), and the number of
objects in the delta phase jumped to 3087880, around 56,000 more than
with the MRU patch. So that's probably where the magic "3%" is coming
from. By looking at the smaller packs first, we are more likely to find
copies of objects which _aren't_ deltas, and then consider them for
delta compression. And that compression tends to find a better match
than reusing what was in the big pack, and we end up with a slightly
smaller output pack.
So why are the deltas we find so much better than what is in the big
There's something rather subtle going on, and it has to do with the fact
that the existing pack was _not_ made by a stock version of git.
I may have mentioned before that I have some patches which restrict the
delta selection process. The goal is that if you have several "islands"
of refs (in my case, refs corresponding to particular forks of a
repository), it's desirable that you don't make a delta against a base
that is not reachable from all of the same islands. Otherwise, when
somebody fetches or clones the fork, you have to throw away the delta
and find one from scratch. If this happens a lot, it gets expensive.
Unsurprisingly, the packfiles produced with the island-restrictions
turned on are slightly worse than normal packfiles (because we reject
some delta possibilities that might be better).
When I did my testing, I did not enable the delta islands, because I
didn't want them to influence the results. But of course their effect
could still be seen in the deltas in the existing packfile, because it
came off our production servers (because I wanted a real-world test
case). So any pack ordering which ends up reusing _fewer_ of those
deltas is going to be a net win in size, because it's going to find new
deltas without being hampered by the island restrictions. And that's
exactly what happened in my tests; the stock pack ordering finds more
objects in the newer packs, has fewer deltas to reuse, and is able to
find better unrestricted deltas.
Unfortunately I don't have a real-world test case that's been packed up
until now with stock git. So the next-best thing I can do is re-run my
tests with the island restrictions turned on, so that newly-found deltas
do not have an unfair advantage. And here are the sizes for each case
after doing so:
The MRU case _is_ slightly bigger, but only by 0.04%. And then adding
the cycle-breaking on top of that reclaims a little bit of space,
because rather than throwing out those 287 deltas at the write phase,
it's able to add those objects back to the delta search and find new
deltas for them.
So there. Mystery solved, and I feel confident in saying that the MRU
patches _don't_ represent a real regression in pack size. We do need to
deal with the cycles, of course, but it's less an issue of size and more
one of correctness (well, the pack is correct either way, but still, it
seems more correct to realize during the correct phase that we cannot
reuse those deltas).
I also suspect that one of the tricks I tried, simply reversing the pack
order (so the big pack is at the front, but the order is stable) would
work very well for this case. But as I mentioned before, I prefer the
MRU strategy because it's less susceptible to pathological cases.
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html