On Tue, Jun 25, 2013 at 09:33:11PM +0200, Vicent Martí wrote:

> > One way we side-stepped the size inflation problem in JGit was to only
> > use the bitmap index information when sending data on the wire to a
> > client. Here delta reuse plays a significant factor in building the
> > pack, and we don't have to be as accurate on matching deltas. During
> > the equivalent of `git repack` bitmaps are not used, allowing the
> > traditional graph enumeration algorithm to generate path hash
> > information.
>
> OH BOY HERE WE GO. This is worth its own thread, lots to discuss here.
> I think peff will have a patchset regarding this to upstream soon,
> we'll get back to it later.

We do the same thing (only use bitmaps during on-the-wire fetches). But
there are a few problems with assuming delta reuse. For us (GitHub), the
foremost one is that we pack many "forks" of a repository together into
a single packfile. That means when you clone torvalds/linux, an object
you want may be stored in the on-disk pack with a delta against an
object that you are not going to get. So we have to throw out that delta
and find a new one.

I'm dealing with that by adding an option to respect "islands" during
packing, where an island is a set of common objects (we split it by
fork, since we expect those objects to be fetched together, but you
could use other criteria). The rule is that an object cannot delta
against another object that is not in all of its islands. So everybody
can delta against shared history, but objects in your fork can only
delta against other objects in the fork. You are guaranteed to be able
to reuse such deltas during a full clone of a fork, and the on-disk pack
size does not suffer all that much (because there is usually a good
alternate delta base within your reachable history).

So with that series, we can get good reuse for clones. But there are
still two cases worth considering:

  1. When you fetch a subset of the commits, git marks only the edges as
     preferred bases, and does not walk the full object graph down to
     the roots. So any object you want that is delta'd against something
     older will not get reused. If you have reachability bitmaps, I
     don't think there is any reason that we cannot use the entire
     object graph (starting at the "have" tips, of course) as preferred
     bases.

  2. The server is not necessarily fully packed. In an active repo, you
     may have a large "base" pack with bitmaps, with several recently
     pushed packs on top. You still need to delta the recently pushed
     objects against the base objects.

I don't have measurements on how much the deltas suffer in those two
cases.
I know they suffered quite badly for clones without the name hashes in
our alternates repos, but that part should go away with my patch series.

-Peff
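
For illustration only, the island rule described above (an object may only delta against a base that is in all of the object's islands) can be sketched as a subset check. This is not the code from the patch series; the names `can_delta_against` and `islands_of`, and the fork and object names, are made up for the example.

```python
# Hypothetical sketch of the island rule; not git's actual implementation.
# Each object maps to the set of islands (here: forks) that can reach it.

def can_delta_against(target, base, islands_of):
    """Return True if `target` may be stored as a delta against `base`.

    Rule from the text: the base must be in every island the target is in,
    i.e. islands(target) must be a subset of islands(base).
    """
    return islands_of[target] <= islands_of[base]


# Assumed toy data: two forks sharing some history.
islands_of = {
    "shared_blob": {"torvalds", "alice"},  # history common to both forks
    "alice_blob": {"alice"},               # reachable only from alice's fork
    "torvalds_blob": {"torvalds"},         # reachable only from torvalds' fork
}

# Fork-only objects may delta against shared history...
print(can_delta_against("alice_blob", "shared_blob", islands_of))
# ...but shared history may not delta against a fork-only object,
print(can_delta_against("shared_blob", "alice_blob", islands_of))
# and objects in different forks may not delta against each other.
print(can_delta_against("alice_blob", "torvalds_blob", islands_of))
```

With a check like this applied when choosing delta bases, any delta stored on disk for a fork's object has a base that a full clone of that fork also receives, which is why such deltas can be reused as-is.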