Re: [RFC] Design for http-pull on repo with packs

Daniel Barkalow Sun, 10 Jul 2005 13:34:44 -0700

On Sun, 10 Jul 2005, Dan Holmsand wrote:

> Daniel Barkalow wrote:
> > I have a design for using http-pull on a packed repository, and it only
> > requires one extra file in the repository: an append-only list of the pack
> > files (because getting the directory listing is very painful and
> > failure-prone).
> 
> A few comments (as I've been tinkering with a way to solve the problem 
> myself).
> 
> As long as the pack files are named sensibly (i.e. if they are created 
> by git-repack-script), it's not very error-prone to just get the 
> directory listing, and look for matches for pack-<sha1>.idx. It seems to 
> work quite well (see below). It isn't beautiful in any way, but it works...


I may grab your code for the directory listing; apart from that step, the
version I just sent seems to be working.
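
As a rough illustration of the directory-listing approach quoted above (this
is not the code from either patch, and it is Python rather than C), one
could scan the listing text for names of the form pack-<sha1>.idx. The
sketch assumes the listing has already been fetched into a string and that
the SHA-1 is written as 40 lowercase hex digits; the function name is made
up for the example.

```python
import re

# Assumption: index files are named pack-<40 hex digits>.idx, as they
# would be when created by git-repack-script.
PACK_IDX_RE = re.compile(r"pack-([0-9a-f]{40})\.idx")


def pack_hashes_from_listing(listing):
    """Return the distinct pack hashes named in a listing, in first-seen order."""
    seen = set()
    hashes = []
    for match in PACK_IDX_RE.finditer(listing):
        sha1 = match.group(1)
        if sha1 not in seen:  # HTML listings usually repeat each name
            seen.add(sha1)
            hashes.append(sha1)
    return hashes
```

Matching on the file name alone means the format of the listing (HTML from
a web server, plain text from something else) does not matter much, which
is presumably why this works in practice even though it isn't pretty.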

> >  If an individual file is not available, figure out what packs are
> >   available:
> > 
> >    Get the list of pack files the repository has
> >     (currently, I just use "e3117bbaf6a59cb53c3f6f0d9b17b9433f0e4135")
> >    For any packs we don't have, get the index files.
> 
> This part might be slightly expensive, for large repositories. If one 
> assumes that packs are named as by git-repack-script, however, one might 
> cache indexes we've already seen (again, see below). Or, if you go for 
> the mandatory "pack-index-file", require that it has a reliable order, 
> so that you can get the last added index first.

As it turns out, nothing bad happens if you have index files for pack files
you don't have; the library ignores them. So we can keep the index files
around and quickly check whether they contain the objects we want. That
way, we don't have to worry about skipping a pack now (because it's not
needed) and then overlooking it when the branch gets merged in.

So what I actually do is make a list of the pack files that are available
from the server but not yet downloaded, and download the index file for
each of them whose index isn't already present locally.
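
A minimal sketch of that bookkeeping, in Python rather than the C of the
actual code. The function and parameter names are hypothetical, and packs
are identified only by their hash strings:

```python
def plan_index_downloads(server_packs, local_packs, local_indexes):
    """Decide which packs are candidates and which index files to fetch.

    server_packs  -- pack hashes the server advertises, in server order
    local_packs   -- set of pack hashes whose pack file we already have
    local_indexes -- set of pack hashes whose index file we already have

    Returns (candidates, missing_indexes), both as lists in server order.
    """
    # Packs the server has that we have not downloaded yet.
    candidates = [p for p in server_packs if p not in local_packs]
    # Of those, only the ones without a local index need a download now.
    missing_indexes = [p for p in candidates if p not in local_indexes]
    return candidates, missing_indexes
```

For example, with server packs a, b, c, a local pack a and a leftover index
for b, the candidates are b and c, and only the index for c is fetched.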

> >    Keep a list of the struct packed_gits for the packs the server has
> >     (these are not used as places to look for objects)
> > 
> >  Each time we need an object, check the list for it. If it is in there,
> >   download the corresponding pack and report success.
> 
> Here you will need some strategy to deal with packs that overlap with 
> what we've already got. Basically, small and overlapping packs should be 
> unpacked, big and non-overlapping ones saved as is (since 
> git-unpack-objects is painfully slow and memory-hungry...).

I don't think there's an issue with having overlapping packs, whether they
overlap with each other or with separate objects. If the user wants, things
can be repacked outside of the pull operation (note, though, that the index
files should be truncated rather than removed, so that the program doesn't
fetch them again the next time some object can't be found easily).

> One could also optimize the pack-download bit, by figuring out the last 
> object in the pack that we need (easy enough to do from the index file), 
> and just get the part of the pack file leading up to that object. That
> could be a huge win for independently packed repositories (I don't do 
> that in my code below, though).

That's only possible if you can figure out what you want to have before
you get it. My code is walking the reachability graph on the client; it
can only figure out what other objects it needs after it's mapped the pack
file.
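
To show why the set of needed objects isn't known up front, here is a toy
model of a client-side walk, in Python. It is only a sketch under stated
assumptions: objects are plain dictionary entries mapping an id to the ids
it references, "fetching" is a dictionary copy, and the function name is
invented. The point is just that an object's references can be read only
once the object itself is available locally.

```python
from collections import deque


def walk_and_fetch(start, local, remote):
    """Walk everything reachable from start, fetching what is missing.

    local and remote map an object id to the list of ids it references
    (a stand-in for parsing commits and trees). Returns the ids that had
    to be fetched, in the order they were fetched; local is updated.
    """
    fetched = []
    seen = set()
    queue = deque([start])
    while queue:
        obj = queue.popleft()
        if obj in seen:
            continue
        seen.add(obj)
        if obj not in local:
            local[obj] = remote[obj]  # stand-in for the actual download
            fetched.append(obj)
        # Only now can we learn what this object points to.
        queue.extend(local[obj])
    return fetched
```

With a remote holding commit c2 (pointing to c1 and t2) and a client that
already has c1 and its tree t1, the walk fetches c2 first and discovers
that it needs t2 only after reading c2.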

> Anyway, here's my attempt at the same thing. It introduces 
> "git-dumb-fetch", with usage like git-fetch-pack (except that it works 
> with http and rsync). And it adds some ugliness to git-cat-file, for
> figuring out which objects we already have.

I might use that method for listing the available packs, although I'd sort
of like to encourage a clean solution first.

        -Daniel
*This .sig left intentionally blank*

-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
