On Tue, Jun 25, 2013 at 09:33:11PM +0200, Vicent Martí wrote:
> > One way we side-stepped the size inflation problem in JGit was to only
> > use the bitmap index information when sending data on the wire to a
> > client. Here delta reuse plays a significant factor in building the
> > pack, and we don't have to be as accurate on matching deltas. During
> > the equivalent of `git repack` bitmaps are not used, allowing the
> > traditional graph enumeration algorithm to generate path hash
> > information.
> OH BOY HERE WE GO. This is worth its own thread, lots to discuss here.
> I think peff will have a patchset regarding this to upstream soon,
> we'll get back to it later.
We do the same thing (only use bitmaps during on-the-wire fetches). But
there are a few problems with assuming delta reuse.
For us (GitHub), the foremost one is that we pack many "forks" of a
repository together into a single packfile. That means when you clone
torvalds/linux, an object you want may be stored in the on-disk pack
with a delta against an object that you are not going to get. So we have
to throw out that delta and find a new one.
I'm dealing with that by adding an option to respect "islands" during
packing, where an island is a set of common objects (we split it by
fork, since we expect those objects to be fetched together, but you
could use other criteria). The rule is that an object cannot delta
against another object that is not in all of its islands. So everybody
can delta against shared history, but objects in your fork can only
delta against other objects in the fork. You are guaranteed to be able
to reuse such deltas during a full clone of a fork, and the on-disk pack
size does not suffer all that much (because there is usually a good
alternate delta base within your reachable history).
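
A minimal sketch of the island rule described above, in Python. It is a toy model, not the actual patch: the function names, the island labels, and the object names are all hypothetical. It only illustrates the subset check implied by "an object cannot delta against another object that is not in all of its islands".

```python
def can_delta_against(obj_islands, base_islands):
    """Return True if the base is in every island the object belongs to.

    Equivalent to: the object's island set is a subset of the base's.
    """
    return set(obj_islands) <= set(base_islands)


# Hypothetical island membership, split by fork. Shared history belongs
# to both forks' islands; fork-only objects belong to just one.
islands = {
    "shared_blob": {"fork_a", "fork_b"},
    "fork_a_blob": {"fork_a"},
    "fork_b_blob": {"fork_b"},
}


def allowed(obj, base):
    """Check whether `obj` may be stored as a delta against `base`."""
    return can_delta_against(islands[obj], islands[base])


if __name__ == "__main__":
    # Fork objects may delta against shared history.
    print(allowed("fork_a_blob", "shared_blob"))
    # Shared history may not delta against a fork-only object.
    print(allowed("shared_blob", "fork_a_blob"))
    # Objects in different forks may not delta against each other.
    print(allowed("fork_a_blob", "fork_b_blob"))
```

Under this rule, a full clone of fork_a receives every base that its objects delta against, which is why the on-disk deltas can be reused as they are.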
So with that series, we can get good reuse for clones. But there are
still two cases worth considering:
  1. When you fetch a subset of the commits, git marks only the edges as
     preferred bases, and does not walk the full object graph down to
     the roots. So any object you want that is delta'd against something
     older will not get reused. If you have reachability bitmaps, I
     don't think there is any reason that we cannot use the entire
     object graph (starting at the "have" tips, of course) as preferred
     bases.

  2. The server is not necessarily fully packed. In an active repo, you
     may have a large "base" pack with bitmaps, with several recently
     pushed packs on top. You still need to delta the recently pushed
     objects against the base objects.
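
To make the first case concrete, here is a toy model in Python of the reuse decision. It is a sketch under stated assumptions, not git's code: the graph, the object names, and both helper functions are hypothetical, and plain sets stand in for reachability bitmaps. A stored delta counts as reusable when its base is either being sent or is already known to be on the client.

```python
def reachable(graph, tips):
    """Return every object reachable from the given tips (toy graph walk)."""
    seen = set()
    stack = list(tips)
    while stack:
        obj = stack.pop()
        if obj in seen:
            continue
        seen.add(obj)
        stack.extend(graph.get(obj, ()))
    return seen


def reusable_deltas(delta_base, to_send, preferred_bases):
    """Return the objects in to_send whose stored delta can be reused."""
    usable = set()
    for obj in to_send:
        base = delta_base.get(obj)  # None means stored as a full object
        if base is not None and (base in to_send or base in preferred_bases):
            usable.add(obj)
    return usable


# Hypothetical history: c3 -> c2 -> c1, each commit introducing one blob.
graph = {"c3": ["c2", "b3"], "c2": ["c1", "b2"], "c1": ["b1"]}
# Hypothetical on-disk delta: b3 is stored against the much older b1.
delta_base = {"b3": "b1"}

have_all = reachable(graph, ["c2"])           # what a "have" bitmap would cover
to_send = reachable(graph, ["c3"]) - have_all  # objects the client lacks
edge_only = {"c2", "b2"}                      # only the boundary commit's objects

if __name__ == "__main__":
    print(reusable_deltas(delta_base, to_send, edge_only))
    print(reusable_deltas(delta_base, to_send, have_all))
```

With only the edge marked, the delta for b3 has to be thrown away because b1 is not a known base. With the full "have" closure available, the same delta can be sent as it is.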
I don't have measurements on how much the deltas suffer in those two
cases. I know they suffered quite badly for clones without the name
hashes in our alternates repos, but that part should go away with my