Junio C Hamano wrote:
> Dan Holmsand <[EMAIL PROTECTED]> writes:
>
>> Repacking all of that to a single pack file gives, somewhat
>> surprisingly, a pack size of 62M (+ 1.3M index). In other words, the
>> cost of getting all those branches, and all of the new stuff from
>> Linus, turns out to be *negative* (probably due to some strange
>> deltification coincidence).
>
> We do _not_ want to optimize for initial slurps into empty
> repositories.  Quite the opposite.  We want to optimize for
> allowing quick updates of reasonably up-to-date developer repos.
> If initial slurps are _also_ efficient then that is an added
> bonus; that is something the baseline big pack (60M Linus pack)
> would give us already.  So repacking everything into a single
> pack nightly is _not_ what we want to do, even though that would
> give the maximum compression ;-).  I know you understand this,
> but just stating the second of the above paragraphs would give
> casual readers a wrong impression.

I agree, to a point: I think the bonus is quite nice to have... As it is, it's actually faster on my machine to clone a fresh tree of Linus' than it is to "git clone" a local tree (without doing the hardlinking "cheating", that is). And it's kind of nice to have the option to start completely fresh.

Anyway, my point is this: to make pulling efficient, we should ideally (1) have as few object files to pull as possible, especially when using http, and (2) have as few packs as possible, to gain some compression for those who pull less often. Point 1 is obviously the more important one.

To make this happen, relatively frequent repacking and re-repacking (even if only of parts of the repository) would be necessary. Or at least nice to have...
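For instance, something along these lines (just a sketch; the exact commands are only meant as an illustration):

  # Cheap, frequent repack: roll only the loose objects into a new
  # pack, leaving the existing packs alone.
  git repack -d

  # Rare, expensive repack: collapse everything into one big pack
  # for maximum compression.
  git repack -a -d

  # See what a dumb-protocol puller would actually be faced with.
  git count-objects -v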

That is why I wanted the "dumb fetch" tools to at least do some relatively smart unpacking/repacking to avoid duplication, and, ideally, to avoid downloading entire packs when we only want the beginning of them. That would lessen the cost of repacking, which I happen to think is a good thing.
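At the very least, the index of a remote pack is cheap to inspect before deciding whether to grab the whole thing. Roughly (the URL and pack name below are, of course, made up):

  # Fetch just the small .idx for a remote pack (made-up name/URL).
  curl -sO http://example.org/repo.git/objects/pack/pack-XXXX.idx

  # List the objects the pack contains, without touching the .pack,
  # and compare that against what we already have locally
  # (remote-objects is just a scratch file).
  git show-index < pack-XXXX.idx | cut -d' ' -f2 > remote-objects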

Also, it's kind of strange that ssh/local fetching *always* unpacks everything, while rsync/http *never* does...

> You are correct.  For somebody like Jeff, having the Linus
> baseline pack with one pack of all of his head (incremental that
> excludes what is already in the Linus baseline pack) would help
> pullers.

That would work, of course. However, it means that Linus becomes the "official repository maintainer" in a way that doesn't feel very distributed. Perhaps Linus' packs should then be marked "official" in some way?
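For reference, building such an incremental pack is easy with the plumbing; a sketch, where "linus" and "jeff" are just example ref names:

  # Pack everything reachable from Jeff's head that isn't already
  # reachable from Linus' head ("jeff" and "linus" are example refs).
  # This writes jeff-incr-<sha1>.pack and .idx in the current directory.
  git rev-list --objects jeff ^linus | git pack-objects jeff-incr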

>> The big problem, however, comes when Jeff (or anyone else) decides to
>> repack. Then, if you fetch both his repo and Linus', you might end up
>> with several really big pack files, that mostly overlap. That could
>> easily mean storing most objects many times, if you don't do some
>> smart selective un/repacking when fetching.
>
> Indeed.  Overlapping packs is a possibility, but my gut feeling
> is that it would not be too bad, if things are arranged so that
> packs are expanded-and-then-repacked _very_ rarely if ever.
> Instead, at least for your public repository, if you only repack
> incrementally I think you would be OK.

To be exact, you're OK (in the sense of avoiding duplicates) as long as you always rsync in the "official" packs, and coordinate with anyone you're merging with, before you do any repacking of your own. Sure, that works. It just feels a bit "un-distributed" for my personal taste...
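The duplication is at least easy to measure after the fact; something like:

  # List the objects stored in each local pack, and print any object
  # that shows up in more than one of them.
  for idx in .git/objects/pack/*.idx; do
      git show-index < "$idx" | cut -d' ' -f2
  done | sort | uniq -d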

/dan