Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-10 Thread Ramkumar Ramachandra
Jeff King wrote:
 [...]

First off, thanks for the fabulous writeup. I hope more contributors
read this, and get interested in tackling the problems.

 You are missing a step in the preparation of the delta list. If we
 already have a delta on-disk, we will check whether it is usable and
 keep a note in the list (see check_object). Then when we
 prepare the list of objects to delta, it is omitted from the list (see
 prepare_pack).

Ah, prepare_pack() calls get_object_details(), which then calls
check_object(): reuse_delta is what I was looking for.

 That is why you may see a much smaller number of objects in the progress
 eye candy for "Compressing objects..." than we are actually sending.

I see.

 As Junio mentioned, that is what --thin is about; the sender omits the
 base and the receiver adds it back in (index-pack --fix-thin).

Yeah, I read about the --thin option. However, it's only for
network-packs (i.e. --stdout; why?). Also, is it turned on by default?
The documentation says so, but I ran it and found that the value of
thin is 0 in builtin/push.c:316. What is going on?

 And if
 you think about it, that is likely where most of Martin's "317 packs
 turned into 8MB" space savings are coming from.

Thin packs have nothing to do with git-exproll per se, but I think I
understand why pack generations work.

 The reason we do not store thin-packs on disk is that you
 run into problems with cycles in the delta graph (e.g., A deltas against
 B, which deltas against C, which deltas against A; at one point you had
 a full copy of one object which let you create the cycle, but you later
 deleted it as redundant with the delta, and now you cannot reconstruct
 any of the objects).

Ah, that's why.

 Somebody pushes X, which deltas against Y. They send only the delta for
 X-Y, but the receiver stores that delta plus a full copy of Y, even
 though we already have Y in another pack. Repacking throws away the
 extra copy of Y.

Right, I got that.

 And that is why something like Junio's suggestion of "do not traverse
 history; just concatenate the packs and throw away duplicate objects"
 might be helpful. You would not find new deltas, but you would get rid
 of these duplicate objects, and doing so would be relatively cheap.

... but we've already determined that git pack-objects does a good job
of reusing deltas, so I'm not sure what we can do over and above this.

If I understand correctly, the proposal is to use existing (but
sub-optimal) deltas more aggressively when repacking?

 Another solution could involve not writing the duplicate of Y in the
 first place.

 [...]

 You could possibly solve this with cycle detection, though it would be
 complicated (you need to do it not just when getting rid of objects, but
 when sending a pack, to make sure you don't send a cycle of deltas that
 the other end cannot use). You _might_ be able to get by with a kind of
 two-level hack: consider your main pack as group A and newly pushed
 packs as group B. Allow storing thin deltas on disk from group B
 against group A, but never the reverse (nor within group B). That makes
 sure you don't have cycles, and it eliminates even more I/O than any
 repacking solution (because you never write the extra copy of Y to disk
 in the first place). But I can think of two problems:

Hm, having two kinds of packs: full packs and thin packs, and
different codepaths to handle them.

   1. You still want to repack more often than every 300 packs, because
  having many packs cost both in space, but also in object lookup
  time (we can do a log(N) search through each pack index, but have
  to search linearly through the set of indices).

Once you narrow it down to a pack-index, you can bisect it since the
table is sorted. How do you find the right pack-index though? Do I
have to do the bisection-search on each one to see if the object
exists? Can this be improved by generating a global index for all
packs that contains packfile-name in addition to offset for each
object?

   2. As you accumulate group B packs with new objects, the deltas that
  people send will tend to be against objects in group B. They are
  closer to the tip of history, and therefore make better deltas for
  history built on top.

Actually, I'm not sure it's worth the additional complexity at all.
Keeping packs self-contained means simpler code and easier
isolation in the face of corruption: the space isn't a big trade-off,
unless there are lots of really tiny packs (in which case we
consolidate them into larger packs).

The way I see it, we'll be saving on some IO by creating thin packs.
But the trade-off seems to be quite heavy: self-contained packs mean
that I can just mmap() that pack-index, look for objects and apply
deltas without bothering about any other packs. Thin packs mean that
I'll have to mmap() a lot more than one pack-index, and spend a lot
more time searching for the base object to apply the delta to, right?

 That's all just off the top of my head. [...]

Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-10 Thread Duy Nguyen
On Sat, Aug 10, 2013 at 3:42 PM, Ramkumar Ramachandra artag...@gmail.com wrote:
 As Junio mentioned, that is what --thin is about; the sender omits the
 base and the receiver adds it back in (index-pack --fix-thin).

 Yeah, I read about the --thin option. However, it's only for
 network-packs (i.e. --stdout; why?). Also, is it turned on by default?
 The documentation says so, but I ran it and found that the value of
 thin is 0 in builtin/push.c:316. What is going on?

--thin is enabled by default for fetch (see
transport.c:transport_get()), but it's only effective when the server
advertises the thin-pack capability (see protocol-capabilities.txt).
push has --thin turned off by default, favoring server resources over
network traffic; see a4503a1 (Make --no-thin the default in git-push
to save server resources, 2007-09-09).
-- 
Duy


Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-10 Thread Duy Nguyen
On Sat, Aug 10, 2013 at 4:24 PM, Duy Nguyen pclo...@gmail.com wrote:
 push has --thin turned off by default favoring server resources over
 network traffic, see a4503a1 (Make --no-thin the default in git-push
 to save server resources - 2007-09-09)

Side note, I think the server should be able to control the default
behavior. Maybe if the server advertises thin-pack capability and
the user does not explicitly specify --no-thin, then --thin is
automatically turned on at the client.
-- 
Duy


Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-10 Thread Jeff King
On Sat, Aug 10, 2013 at 02:12:46PM +0530, Ramkumar Ramachandra wrote:

 First off, thanks for the fabulous writeup. I hope more contributors
 read this, and get interested in tackling the problems.

You're welcome. I'll just respond to a few questions/points you raised
in your response:

 Yeah, I read about the --thin option. However, it's only for
 network-packs (i.e. --stdout; why?). Also, is it turned on by default?
 The documentation says so, but I ran it and found that the value of
 thin is 0 in builtin/push.c:316. What is going on?

I'm not sure whether push --thin actually does anything anymore. It
looks like we always turn on thin-packs for the git transport in
transport_get, and have ever since 9b28851 (Push code for transport
library, 2007-09-10). And you cannot turn off thin-packing with
--no-thin.

So I suspect it is historical cruft at this point.

  And if
  you think about it, that is likely where most of Martin's 317 packs
  turned into 8MB space savings are coming from.
 
 Thin packs have nothing to do with git-exproll per se, but I think I
 understand why pack generations work.

I think it matters in terms of understanding the workload, and why the
repo benefits from repacking. In a normal git repo that accumulates
packs via incremental git repack (e.g., by running gc --auto until
we hit the pack limit), you will not have duplicate objects (only
objects which could be delta'd across the pack boundary but are not).
But if you accumulate packs by repeated thin-pack pushing, you will have
good deltas, but extra copies of base objects.
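
(As a rough way to see how much of that duplication a repository has
accumulated, one can list every object in every pack index and count the
names that occur more than once; this is only an illustration and assumes
the usual .git/objects/pack layout:

    for idx in .git/objects/pack/*.idx; do
        git show-index < "$idx"        # one "offset sha1 crc" line per object
    done | awk '{print $2}' | sort | uniq -d | wc -l

Each name counted there exists in more than one pack.)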

  And that is why something like Junio's suggestion of "do not traverse
  history; just concatenate the packs and throw away duplicate objects"
  might be helpful. You would not find new deltas, but you would get rid
  of these duplicate objects, and doing so would be relatively cheap.
 
 ... but we've already determined that git pack-objects does a good job
 of reusing deltas, so I'm not sure what we can do over and above this.

If I understand the proposal correctly, the problem space is something
like:

  A fully packed repository is the best way to store objects, both in
  terms of space efficiency and for delta reuse when clients fetch from
  us. But it is too expensive to do a full repack after each push. What
  makes it expensive, and how can we get around that?

I think the answer to "why is it expensive" is three-fold:

  1. We spend a lot of CPU on history traversal. Even if we have only 10
 new commits to pack, we have to traverse the _whole_ history to
 generate the pack.

  2. We spend some CPU on delta compression. This scales nicely (10
 commits means we only do deltas for those 10 commits).

  3. We do a lot of I/O, as we rewrite the whole of the old pack content
 into the new pack, along with the extra objects (so again, we might
 rewrite millions of objects just to add 10 more to the pack).

I do not think (2) is that big a deal. It scales nicely already, and the
deltas you get are worth it (and you would have to compute them on the
fly for the next clone anyway). But (1) and (3) are bad, because they
are based on the size of the repository, not the size of the new
content.

If we were to simply append the objects from the new packs onto
the big, old pack, dropping any duplicates but retaining the deltas we
have, we would make (1) and (3) go away. Our pack probably wouldn't be
as good as a real repack, as it would potentially contain unreachable
objects, and we might miss some delta opportunities. But we could
cheaply run the concatenate repack after each push, and then do a real
repack less frequently.

I am not sure if that is exactly what Junio is proposing. There are some
logistical complications (fixing up the pack header and footer, along
with the filename, and updating the pack index).
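
(For reference, the framing that such a scheme would have to rewrite is
small but checksummed: a 12-byte header and a trailing 20-byte SHA-1 over
the whole file. A quick way to peek at it on an existing pack, purely for
illustration:

    pack=$(ls .git/objects/pack/*.pack | head -n 1)
    xxd -l 12 "$pack"            # "PACK" magic, 4-byte version, 4-byte object count
    tail -c 20 "$pack" | xxd     # trailing SHA-1 over all preceding pack data

so appending objects means patching the object count and recomputing that
trailer, in addition to regenerating the .idx.)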

1. You still want to repack more often than every 300 packs, because
   having many packs cost both in space, but also in object lookup
   time (we can do a log(N) search through each pack index, but have
   to search linearly through the set of indices).
 
 Once you narrow it down to a pack-index, you can bisect it since the
 table is sorted. How do you find the right pack-index though? Do I
 have to do the bisection-search on each one to see if the object
 exists? Can this be improved by generating a global index for all
 packs that contains packfile-name in addition to offset for each
 object?

Right. The bisection-search is the log(N) I mention above; you can do
that on each pack index. But you have to search the packs linearly,
since there is no indication of which pack the object might be found in.
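
Something like the following loop is, in spirit, what the lookup has to do
today; the real code does a binary search inside each .idx instead of a
grep, but the outer scan over the packs is just as linear (the object name
here is a placeholder):

    oid=...                               # full object name to look up (placeholder)
    for idx in .git/objects/pack/*.idx; do
        if git show-index < "$idx" | awk '{print $2}' | grep -qx "$oid"; then
            echo "found in ${idx%.idx}.pack"
            break
        fi
    done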

A global index would solve that. Right now, the way to create such an
index is git repack -ad. :)

But there is nothing to say that we could not have a meta-index on top
of the existing pack indices (however it does introduce complications
with lifetimes, as right now we can delete redundant packs and their
indices).


Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-10 Thread Jeff King
On Sat, Aug 10, 2013 at 08:24:39AM +0700, Nguyen Thai Ngoc Duy wrote:

  the other end cannot use). You _might_ be able to get by with a kind of
  two-level hack: consider your main pack as group A and newly pushed
  packs as group B. Allow storing thin deltas on disk from group B
  against group A, but never the reverse (nor within group B). That makes
  sure you don't have cycles, and it eliminates even more I/O than any
  repacking solution (because you never write the extra copy of Y to disk
  in the first place). But I can think of two problems:
 [...]
 
 Some refinements on this idea
 
  - We could keep packs in group B ordered as the packs come in. The
 new pack can depend on the previous ones.

I think you could dispense with the two-level altogether and simply give
a definite ordering to packs, whereby newer packs can only depend on
older packs. Enforcing that with filesystem mtime feels a bit
error-prone; I think you'd want to explicitly store a counter somewhere.

  - A group index in addition to separate index for each pack would
 solve the linear-search object lookup problem.

Yeah. I do not even think it would be that much work. It is a pure
optimization, so you can ignore issues like "what happens if I search
for an object, but the pack it is supposed to be in went away?". The
answer is you fall back to a linear search through the packs, and
assume it happens infrequently enough not to care.

I'd wait to see how other proposed optimizations work out before doing a
global index, though.  The current wisdom is "don't have a ton of packs",
for both the index issue and other reasons, like wasting space and
on-the-fly deltas for fetches. If the other reasons go away, then a
global index would make sense to solve the remaining issue. But if the
solution for the other issues is to make it cheaper to repack so you can
do it more often, then the index issue just goes away.

-Peff


Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-10 Thread Duy Nguyen
On Sat, Aug 10, 2013 at 4:43 PM, Jeff King p...@peff.net wrote:
 push has --thin turned off by default favoring server resources over
 network traffic, see a4503a1 (Make --no-thin the default in git-push
 to save server resources - 2007-09-09)

 Hmm. I don't think that is the case anymore.

 If I do:

   git init parent &&
   (cd parent && seq 1 1 >file &&
    git add file && git commit -m base
   ) &&
   git clone parent child &&
   cd child && seq 1 10001 >file &&
   git commit -a -m more &&
   GIT_TRACE=1 git push origin HEAD:foo

 I see:

   trace: run_command: 'pack-objects' '--all-progress-implied' '--revs'
 '--stdout' '--thin' '--delta-base-offset' '--progress'


Right. transport_get() is also run for push and it sets
smart_options->thin = 1 unconditionally. Thanks for correcting.
-- 
Duy


Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-10 Thread Ramkumar Ramachandra
Duy Nguyen wrote:
 Right. transport_get() is also run for push and it sets
 smart_options->thin = 1 unconditionally.

So, it looks like smart http implies the thin capability. I think
sop's patch (6 years ago) was about turning off thin on dumb http: can
we check that before deciding that push --[no-]thin is historical
cruft?

Also, a documentation update is required.


Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-10 Thread Duy Nguyen
On Sat, Aug 10, 2013 at 5:05 PM, Ramkumar Ramachandra
artag...@gmail.com wrote:
 Duy Nguyen wrote:
 Right. transport_get() is also run for push and it sets
 smart_options->thin = 1 unconditionally.

 So, it looks like smart http implies the thin capability.

smart_options is a bit misleading. This applies to smart
http://, git://, and ssh:// alike.

 I think
 sop's patch (6 years ago) was about turning off thin on dumb http

No, dumb http walks through the remote repository object database,
there's no temporary pack created for transferring data. Dumb http has
nothing to do with this flag.
-- 
Duy


Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-09 Thread Jeff King
On Fri, Aug 09, 2013 at 01:34:48AM +0530, Ramkumar Ramachandra wrote:

 Certainly. A push will never use an existing pack as-is, as it's very
 highly unlikely that the server requested exactly what gc --auto
 packed for us locally.
 
 Sure, undeltified objects in the pack are probably better for push
 performance, but I'm talking about the majority: deltified objects.
 Don't you need to apply the deltas (ie. explode the pack), before you
 can recompute the deltas for the information you're sending across for
 the push?

It depends on what each side has, doesn't it? We generally try to
reuse on-disk deltas when we can, since they require no computation. If
I have object A delta'd against B, and I know that the other side wants
A and has B (or I am also sending B), I can simply send what I have on
disk. So we do not just blit out the existing pack as-is, but we may
reuse portions of it as appropriate.
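
If you are curious which on-disk deltas are candidates for that kind of
reuse, verify-pack shows them; delta entries carry two extra columns, the
chain depth and the base object (rough illustration, any existing pack
will do):

    git verify-pack -v .git/objects/pack/pack-*.idx |
        awk 'NF == 7 { print $1, "is a delta against", $7 }' | head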

Of course we may have to reconstruct deltas for trees in order to find
the correct set of objects (i.e., the history traversal). But that is a
separate phase from generating the pack's object content, and we do not
reuse any of the traversal work in later phases.

-Peff


Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-09 Thread Ramkumar Ramachandra
Jeff King wrote:
 It depends on what each side has, doesn't it? We generally try to
 reuse on-disk deltas when we can, since they require no computation. If
 I have object A delta'd against B, and I know that the other side wants
 A and has B (or I am also sending B), I can simply send what I have on
 disk. So we do not just blit out the existing pack as-is, but we may
 reuse portions of it as appropriate.

I'll raise some (hopefully interesting) points. Let's take the example
of a simple push: I start send-pack, which in turn starts receive_pack
on the server and connects its stdin/stdout to it (using git_connect).
Now, it reads the (sha1, ref) pairs it receives on stdin and spawns
pack-objects --stdout with the right arguments as the response, right?
Overall, nothing special: just pack-objects invoked with specific
arguments.

How does pack-objects work? ll_find_deltas() spawns some worker
threads to find_deltas(). This is where this gets fuzzy for me: it
calls try_delta() in a nested loop, trying to find the smallest delta,
right? I'm not sure whether the interfaces it uses to read objects
differentiates between packed deltas versus packed undeltified objects
versus loose objects at all.

Now, the main problem I see is that a pack has to be self-contained: I
can't have an object bar which is a delta against an object that is
not present in the pack, right? Therefore no matter what the server
already has, I have to prepare deltas only against the data that I'm
sending across.

 Of course we may have to reconstruct deltas for trees in order to find
 the correct set of objects (i.e., the history traversal). But that is a
 separate phase from generating the pack's object content, and we do not
 reuse any of the traversal work in later phases.

I see. Are we talking about tree-walk.c here? This is not unique to
packing at all; we need to do this kind of traversal for any git
operation that digs into older history, no? I recall a discussion
about using generation numbers to speed up the walk: I tried playing
with your series (where you use a cache to keep the generation
numbers), but got nowhere. Does it make sense to think about speeding
up the walk (how?).


Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-09 Thread Junio C Hamano
Ramkumar Ramachandra artag...@gmail.com writes:

 I'll raise some (hopefully interesting) points. Let's take the example
 of a simple push: I start send-pack, which in turn starts receive_pack
 on the server and connects its stdin/stdout to it (using git_connect).
 Now, it reads the (sha1, ref) pairs it receives on stdin and spawns
 pack-objects --stdout with the right arguments as the response, right?
 Overall, nothing special: just pack-objects invoked with specific
 arguments.

 How does pack-objects work?  ...

You've been here long enough to know that you can and should learn
things yourself instead of repeating "tell me tell me" ;-)

 threads to find_deltas(). This is where this gets fuzzy for me: it
 calls try_delta() in a nested loop, trying to find the smallest delta,
 right?

Yes.

 I'm not sure whether the interfaces it uses to read objects
 differentiates between packed deltas versus packed undeltified objects
 versus loose objects at all.

Yes, but that happens way before that.  Start reading from the
pack-heuristics document to get the general overall feel of what
goes on, and then go back to the source.

 Now, the main problem I see is that a pack has to be self-contained: I
 can't have an object bar which is a delta against an object that is
 not present in the pack, right? Therefore no matter what the server
 already has, I have to prepare deltas only against the data that I'm
 sending across.

There is a --thin option to allow deltas against base objects that
are known to exist on the other end to be used in the pack without
including the base objects.  The receiving end then re-adds the base
objects to the received data to complete the pack.
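
A minimal sketch of that round trip, assuming a repository where HEAD^
exists (in a real push the exclusions come from the ref negotiation rather
than hand-written revs):

    # build a thin pack of HEAD relative to HEAD^, then complete it with
    # the missing base objects on the receiving side
    printf 'HEAD\n^HEAD^\n' |
    git pack-objects --revs --thin --stdout |
    git index-pack --stdin --fix-thin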


Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-09 Thread Jeff King
On Fri, Aug 09, 2013 at 07:04:23PM +0530, Ramkumar Ramachandra wrote:

 I'll raise some (hopefully interesting) points. Let's take the example
 of a simple push: I start send-pack, which in turn starts receive_pack
 on the server and connects its stdin/stdout to it (using git_connect).
 Now, it reads the (sha1, ref) pairs it receives on stdin and spawns
 pack-objects --stdout with the right arguments as the response, right?
 Overall, nothing special: just pack-objects invoked with specific
 arguments.
 
 How does pack-objects work? ll_find_deltas() spawns some worker
 threads to find_deltas(). This is where this gets fuzzy for me: it
 calls try_delta() in a nested loop, trying to find the smallest delta,
 right? I'm not sure whether the interfaces it uses to read objects
 differentiates between packed deltas versus packed undeltified objects
 versus loose objects at all.

You are missing a step in the preparation of the delta list. If we
already have a delta on-disk, we will check whether it is usable and
keep a note in the list (see check_object). Then when we
prepare the list of objects to delta, it is omitted from the list (see
prepare_pack).

That is why you may see a much smaller number of objects in the progress
eye candy for "Compressing objects..." than we are actually sending.

 Now, the main problem I see is that a pack has to be self-contained: I
 can't have an object bar which is a delta against an object that is
 not present in the pack, right? Therefore no matter what the server
 already has, I have to prepare deltas only against the data that I'm
 sending across.

As Junio mentioned, that is what --thin is about; the sender omits the
base and the receiver adds it back in (index-pack --fix-thin). And if
you think about it, that is likely where most of Martin's "317 packs
turned into 8MB" space savings are coming from.

Somebody pushes X, which deltas against Y. They send only the delta for
X-Y, but the receiver stores that delta plus a full copy of Y, even
though we already have Y in another pack. Repacking throws away the
extra copy of Y.

And that is why something like Junio's suggestion of "do not traverse
history; just concatenate the packs and throw away duplicate objects"
might be helpful. You would not find new deltas, but you would get rid
of these duplicate objects, and doing so would be relatively cheap.

Another solution could involve not writing the duplicate of Y in the
first place. The reason we do not store thin-packs on disk is that you
run into problems with cycles in the delta graph (e.g., A deltas against
B, which deltas against C, which deltas against A; at one point you had
a full copy of one object which let you create the cycle, but you later
deleted it as redundant with the delta, and now you cannot reconstruct
any of the objects).

You could possibly solve this with cycle detection, though it would be
complicated (you need to do it not just when getting rid of objects, but
when sending a pack, to make sure you don't send a cycle of deltas that
the other end cannot use). You _might_ be able to get by with a kind of
two-level hack: consider your main pack as group A and newly pushed
packs as group B. Allow storing thin deltas on disk from group B
against group A, but never the reverse (nor within group B). That makes
sure you don't have cycles, and it eliminates even more I/O than any
repacking solution (because you never write the extra copy of Y to disk
in the first place). But I can think of two problems:

  1. You still want to repack more often than every 300 packs, because
 having many packs cost both in space, but also in object lookup
 time (we can do a log(N) search through each pack index, but have
 to search linearly through the set of indices).

  2. As you accumulate group B packs with new objects, the deltas that
 people send will tend to be against objects in group B. They are
 closer to the tip of history, and therefore make better deltas for
 history built on top.

That's all just off the top of my head. There are probably other flaws,
too, as I haven't considered it too hard. But if you are really
concerned about I/O on a busy repo (and I think for hosting sites, it is
a real problem), the best-performing solution will probably not involve
packs at all, but some kind of delta-capable storage that you can
incrementally add to without rewriting the whole repository. The
downside is that it would be significantly more complicated.

  Of course we may have to reconstruct deltas for trees in order to find
  the correct set of objects (i.e., the history traversal). But that is a
  separate phase from generating the pack's object content, and we do not
  reuse any of the traversal work in later phases.
 
 I see. Are we talking about tree-walk.c here? This is not unique to
 packing at all; we need to do this kind of traversal for any git
 operation that digs into older history, no? I recall a discussion
 about using generation numbers to speed up the walk [...]

Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-09 Thread Duy Nguyen
On Sat, Aug 10, 2013 at 5:16 AM, Jeff King p...@peff.net wrote:
 Another solution could involve not writing the duplicate of Y in the
 first place. The reason we do not store thin-packs on disk is that you
 run into problems with cycles in the delta graph (e.g., A deltas against
 B, which deltas against C, which deltas against A; at one point you had
 a full copy of one object which let you create the cycle, but you later
 deleted it as redundant with the delta, and now you cannot reconstruct
 any of the objects).

 You could possibly solve this with cycle detection, though it would be
 complicated (you need to do it not just when getting rid of objects, but
 when sending a pack, to make sure you don't send a cycle of deltas that
 the other end cannot use). You _might_ be able to get by with a kind of
 two-level hack: consider your main pack as group A and newly pushed
 packs as group B. Allow storing thin deltas on disk from group B
 against group A, but never the reverse (nor within group B). That makes
 sure you don't have cycles, and it eliminates even more I/O than any
 repacking solution (because you never write the extra copy of Y to disk
 in the first place). But I can think of two problems:

   1. You still want to repack more often than every 300 packs, because
  having many packs cost both in space, but also in object lookup
  time (we can do a log(N) search through each pack index, but have
  to search linearly through the set of indices).

   2. As you accumulate group B packs with new objects, the deltas that
  people send will tend to be against objects in group B. They are
  closer to the tip of history, and therefore make better deltas for
  history built on top.

 That's all just off the top of my head. There are probably other flaws,
 too, as I haven't considered it too hard.

Some refinements on this idea

 - We could keep packs in group B ordered as the packs come in. The
new pack can depend on the previous ones.
 - A group index in addition to separate index for each pack would
 solve the linear-search object lookup problem.
-- 
Duy


Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-09 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 ... The reason we do not store thin-packs on disk is that you
 run into problems with cycles in the delta graph (e.g., A deltas against
 B, which deltas against C, which deltas against A; at one point you had
 a full copy of one object which let you create the cycle, but you later
 deleted it as redundant with the delta, and now you cannot reconstruct
 any of the objects).

An extension to that is that a thin-pack will make the validity
of a single pack dependent on its surroundings, making verify-pack
useless.  It may say everything is fine, but corruption of another
pack that holds a base object that a delta in it depends on will render
the pack unusable.

With the current arrangement, if you grab a single pack and re-idx
it, you know you can reconstruct all the objects in it.

The original reason was not cycles; it was primarily because we did
not want to have to map more than one packfile while reconstructing
the delta chain for a single object (this was way before the "open
small mmap windows into a large packfile" approach was invented).


Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-08 Thread Junio C Hamano
Ramkumar Ramachandra artag...@gmail.com writes:

 Junio C Hamano wrote:
 Imagine we have a cheap way to enumerate the young objects without
 the usual history traversal.

 Before we discuss the advantages, can you outline how we can possibly
 get this data without actually walking downwards from the roots
 (refs)? One way to do it is to pull data out of a log of ref updates
 (aka. reflog), but we both know how unreliable that can be.

My understanding of the topic is to come up with a way that is much
cheaper than the current gc --auto, which involves a recent-history
walk to consolidate both loose objects and small young packs into
one, so that we can use that logic for gc --auto.

The key phrase is "without the usual history traversal".  We are
talking about young objects, and they are likely to be reachable
from something (like reflog entries, if not refs).  We may include
unreachable cruft in the result in the "let's be quick and collect
them into a single young pack" step, and you will need to keep them while
reflog entries are alive, and you will need periodic sweeps with the
usual history walking to remove older crufts that recently have
become unreachable due to reflog expiry from packs anyway, so it is
not a problem for the pack that consolidates young objects into a
single pack to contain some unreachable crufts.

If you start from that assumption [*1*], the way to enumerate the
young objects without the usual history traversal should be fairly
obvious.

By definition, loose objects are all young because they were created
since the last gc --auto.  Also, pack .idx files know their own
creation timestamp to let you decide how old they are, and you can see
how many objects there are in the corresponding .pack and how big it
is.

By doing an equivalent of "find .git/objects/[0-9a-f][0-9a-f]/", you
can enumerate the loose objects, and an equivalent of show-index
will enumerate the objects in the pack whose .idx file you
determined to be small and young.
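
Concretely, the enumeration might look something like this sketch; the
"an .idx newer than a week is young" heuristic is made up for illustration:

    # loose objects: fan-out directory plus file name give the object name
    find .git/objects/[0-9a-f][0-9a-f] -type f |
        sed -e 's|.*objects/||' -e 's|/||'

    # objects in packs whose .idx looks young
    find .git/objects/pack -name '*.idx' -mtime -7 |
    while read idx; do
        git show-index < "$idx" | awk '{print $2}'
    done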

Note that *1* is an assumption. I do not know offhand if such a
"consolidate young objects quickly into one to keep the number of
packs small" strategy is an overall win.



Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-08 Thread Ramkumar Ramachandra
Junio C Hamano wrote:
 it is
 not a problem for the pack that consolidates young objects into a
 single pack to contain some unreachable crufts.

So far, we have never considered putting unreachable objects in packs.
Let me ask the obvious question first: what happens when I push? Do I
pack up all the loose objects quickly (without bothering about
reachability) and send unreachable cruft to the server? Who is
ultimately going to clean up this cruft, and when?

 Note that *1* is an assumption. I do not know offhand if such a
 consolidate young objects quickly into one to keep the number of
 packs small strategy is an overall win.

The way I see it, you're just delaying the reachability analysis
beyond the usual pack-time. The whole point of separating out loose
objects from packfiles was to not make the user wait every time she
does common repository operations locally: delay the reachability
analysis and compression (aka. packing) until a network operation
kicks in.

To see if introducing another stage is an overall win, think about
what the bottlenecks are: in network operations, it's always the data
being sent over. If I understand correctly, during a network
transaction, there's very minimal computation where the server-client
just quickly tell each other where their refs are: therefore, it's
trivial to figure out what's missing and pack that up to send it over.
The strategy of including unreachable cruft in these packs might make
sense from the point of view of a local gc, but will ultimately slow
down network operations, no?


Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-08 Thread Junio C Hamano
Ramkumar Ramachandra artag...@gmail.com writes:

 Junio C Hamano wrote:
 it is
 not a problem for the pack that consolidates young objects into a
 single pack to contain some unreachable crufts.

 So far, we have never considered putting unreachable objects in packs.
 Let me ask the obvious question first: what happens when I push? Do I
 pack up all the loose objects quickly (without bothering about
 reachability) and send unreachable cruft to the server?

No.

I thought the discussion was about making the local gc cheaper, and
the "Imagine we have a cheap way" was to address it by assuming that
the daily "pack young objects into a single pack" can be sped up if
we did not have to traverse history.  More permanent packs (the
older ones in the "set of packs staggered by age" Martin proposes) in
the repository should go through the normal history traversal route.

And of course we do not transfer objects that are not asked for from
or to a repository over pack transfer.

Most importantly, it is not about butchering the pack machinery in
such a way that we can create _only_ such non history traversal
packs.

So I do not see how that question is "obvious".  The question is
obviously pointless and misses the mark by a wide margin?  The
question makes it obvious that whoever asks it does not understand
how Git works?



Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-08 Thread Martin Fick
On Thursday, August 08, 2013 10:56:38 am Junio C Hamano wrote:
 I thought the discussion was about making the local gc
 cheaper, and the Imagine we have a cheap way was to
 address it by assuming that the daily pack young
 objects into a single pack can be sped up if we did not
 have to traverse history.  More permanent packs (the
 older ones in set of packs staggered by age Martin
 proposes) in the repository should go through the normal
 history traversal route.

Assuming I understand what you are suggesting, would these 
young objects likely still get deduped in an efficient
way without doing history traversal (it sounds like they 
would)?  In other words, if I understand correctly, it would 
save time by not pruning unreferenced objects, but it would 
still be deduping things and delta compressing also, so you 
would still likely get a great benefit from creating these 
young object packs?  In other words, is there still a good 
chance that my 317 new pack files which included a 33M pack 
file will still get consolidated down to something near 8M?  

If so, then yeah this might be nice, especially if the 
history traversal is what would speed this up.  Because 
today, my solution mostly saves IO and not time.  I think it 
still saves time, I believe I have seen up to a 50% savings, 
but that is nothing compared to massive, several orders of 
magnitude IO savings.  But if what you suggest could also 
give massive time (orders of magnitude) savings along with 
the IO improvements I am seeing, then suddenly repacking 
regularly would become very cheap even on large repos.  

The only time-consuming piece would be pruning then?  Could
bitmaps eventually help out there?

-Martin




Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-08 Thread Ramkumar Ramachandra
Junio C Hamano wrote:
 So I do not see how that question is "obvious".  The question is
 obviously pointless and misses the mark by a wide margin?  The
 question makes it obvious that whoever asks it does not understand
 how Git works?

Shall we all sit and mourn over the fact that I don't understand how
Git works, or are you willing to explain it to me?

 And of course we do not transfer objects that are not asked for from
 or to a repository over pack transfer.

 Most importantly, it is not about butchering the pack machinery in
 such a way that we can create _only_ such non history traversal
 packs.

I asked you a very simple question: what happens when I do git push?
Instead of answering the question, you butchered the pack machinery to
only create packs with garbage in them (aka. stripped out the
reachability analysis code completely), and blamed me for doing it.

Explain it to me in plain English without getting agitated:

1. I'm on my terminal doing various repository operations: constantly
creating new objects and moving my refs around to create unreachable
objects. I have lots of loose objects.

2. I say git push. What happens? A reachability analysis is
performed on my loose objects, and the ones reachable by the ref I'm
sending are packed up and sent over the network. Now, I no longer have
any loose objects.

3. After a few days of working, the gc heuristics figure out that I
have too much garbage and too many packs; a cleanup is required. The
gc --auto which doesn't tolerate fragmentation: it tries to put
everything into one large pack.

Loop.

We're talking about tackling the gc aggression problem in 3. And you
propose putting the young objects in a pack without performing
reachability analysis: I'm asking how this is going to benefit me;
when I say git push (or when gc decides to repack), won't I need to
explode the young pack into loose objects, do a reachability analysis,
and repack anyway?


Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-08 Thread Junio C Hamano
Martin Fick mf...@codeaurora.org writes:

 Assuming I understand what you are suggesting, would these 
 young objects likely still get deduped in an efficient
 way without doing history traversal (it sounds like they 
 would)?

Yes.

The very first thing pack-object machinery does is to get the list
of object names and sort them in a certain order to help produce
good deltas, and this initial input preprocessing will dedup them.
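
That dedup is easy to observe from the command line; feeding the same name
twice still yields a one-object pack (a toy illustration that writes
dedup.pack and dedup.idx into the current directory):

    oid=$(git rev-parse 'HEAD^{tree}') &&
    printf '%s\n%s\n' "$oid" "$oid" |
        git pack-objects --stdout > dedup.pack &&
    git index-pack dedup.pack &&
    git verify-pack -v dedup.idx      # lists exactly one object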

 If so, then yeah this might be nice, especially if the history
 traversal is what would speed this up.

That was the assumption behind the "it might help" suggestion.  Whether
that helps or not is not known yet, and since Ram started this
subthread telling me not to talk about performance improvements, my
time on this thread is _not_ spent on that (yet).




Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-08 Thread Ramkumar Ramachandra
Junio C Hamano wrote:
 Martin Fick mf...@codeaurora.org writes:
 Assuming I understand what you are suggesting, would these
 young objects likely still get deduped in an efficient
 way without doing history traversal (it sounds like they
 would)?

 Yes.

 The very first thing pack-object machinery does is to get the list
 of object names and sort them in a certain order to help produce
 good deltas, and this initial input preprocessing will dedup them.

So, the proposal is to create an index of young objects without doing
reachability analysis (I still didn't get the point of packing them;
as I pointed out, it seems to be rather counter-productive) to help
the actual packing? From what I vaguely understood:

1. Index all the young objects to save a history traversal (?)

2. Perform the reachability analysis using the index in step 1, and
then generate the pack.

I'm not yet clear about what information 1 contains to help 2. Is it
the rough ordering? (The big important objects come near the top of
the pack, and the deltas are generated against them). I say rough
because the ordering might change after the unreachable objects are
pruned.

*scratches head*


Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-08 Thread Junio C Hamano
Ramkumar Ramachandra artag...@gmail.com writes:

 I asked you a very simple question: what happens when I do git push?

You asked "does push send unreachable cruft?" and I said "No".  Does
that answer your question?

 3. After a few days of working, the gc heuristics figure out that I
 have too much garbage and too many packs; a cleanup is required. The
 gc --auto which doesn't tolerate fragmentation: it tries to put
 everything into one large pack.

Today's gc --auto skims the history shallowly and packs loose
objects into a single _new_ pack.  If you start from a single pack
and enough loose objects and run gc --auto, you will have two
packs, one intact from the original, one new.

 We're talking about tackling the gc aggression problem in 3.

 when I say git push (or when gc decides to repack), won't I need
 to explode the young pack into loose objects, do a reachability
 analysis, and repack anyway?

You do not explode anything.  push will read objects from
anywhere, either from loose or packed, and send a newly created pack
over the wire.

And I see from the "(or when ...)" part that you do not know what you
are talking about.  push will not stream an existing pack to the
other end without computation, and it never will.  It will reuse
existing individual pieces when it can, and having data in a pack
(even without deltifying, or as a suboptimal delta) is much better
for push performance than having the same data in a loose object if
only for this reason, as you do not have to recompress and reencode
it in a different way (loose objects and undeltified objects in a pack
are encoded differently).

 ... are you willing to explain it to me?

Hope that the above clarifies.

I've treated this thread as a design discussion, not an education
session, but the above ended up having more education material than
anything that would advance the design.


Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-08 Thread Ramkumar Ramachandra
Junio C Hamano wrote:
 3. After a few days of working, the gc heuristics figure out that I
 have too much garbage and too many packs; a cleanup is required. The
 gc --auto which doesn't tolerate fragmentation: it tries to put
 everything into one large pack.

 Today's gc --auto skims the history shallowly and packs loose
 objects into a single _new_ pack.  If you start from a single pack
 and enough loose objects and run gc --auto, you will have two
 packs, one intact from the original, one new.

That is when I have lots of loose objects and few packs. We are
discussing gc aggression when there are too many packs (how did we get
there in the first place if new packs aren't created?): in which case
it doesn't tolerate fragmentation, and tries to put everything in one
pack. That is both IO and CPU intensive, and we've been trying to
tackle that since the start of the thread.

 And I see from (or when ...) part that you do not know what you
 are talking about.  push will not stream existing pack to the
 other end without computation, and it never will.  It will reuse
 existing individual pieces when it can, and having data in a pack
 (even without deltifying, or as a suboptimal delta) is much better
 for push performance than having the same data in a loose object if
 only for this reason, as you do not have to recompress and reencode
 it in a different way (loose objects and undeltified objects in a pack
 are encoded differently).

Certainly. A push will never use an existing pack as-is, as it's very
highly unlikely that the server requested exactly what gc --auto
packed for us locally.

Sure, undeltified objects in the pack are probably better for push
performance, but I'm talking about the majority: deltified objects.
Don't you need to apply the deltas (ie. explode the pack), before you
can recompute the deltas for the information you're sending across for
the push?

 I've treated this thread as a design discussion, not an education
 session, but the above ended up having more education material than
 anything that would advance the design.

You will need to educate your contributors and potential contributors
if you want these problems to be fixed.


Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-07 Thread Duy Nguyen
On Wed, Aug 7, 2013 at 7:10 AM, Martin Fick mf...@codeaurora.org wrote:
 I wonder if a simpler approach may be nearly as efficient as
 this one: keep the largest pack out, repack the rest at
 fetch/push time so there are at most 2 packs at a time.
 Or we could do the repack at 'gc --auto' time, but
 with lower pack threshold (about 10 or so). When the
 second pack is as big as, say half the size of the
 first, merge them into one at gc --auto time. This can
 be easily implemented in git-repack.sh.

 It would definitely be better than the current gc approach.

 However, I suspect it is still at least one to two orders of
 magnitude off from where it should be.  To give you a real
 world example, on our server today when gitexproll ran on
 our kernel/msm repo, it consolidated 317 pack files into one
 almost 8M packfile (it compresses/dedupes shockingly well,
 one of those new packs was 33M).  Our largest packfile in
 that repo is 1.5G!

 So let's now imagine that the second closest packfile is
 only 100M, it would keep getting consolidated with 8M worth
 of data every day (assuming the same conditions and no extra
 compression).  That would take (750M-100M)/8M ~ 81 days to
 finally build up large enough to no longer consolidate the
 new packs with the second largest pack file daily.  During
 those 80+ days, it will be on average writing 325M too much
 per day (when it should be writing just 8M).

 So I can see the appeal of a simple solution, unfortunately
 I think one layer would still suck though.  And if you are
 going to add even just one extra layer, I suspect that you
 might as well go the full distance since you probably
 already need to implement the logic to do so?

I see. It looks like your way is the best way to go.
-- 
Duy


Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-06 Thread Duy Nguyen
On Tue, Aug 6, 2013 at 9:38 AM, Ramkumar Ramachandra artag...@gmail.com wrote:
 +    Garbage collect using a pseudo logarithmic packfile maintenance
 +    approach.  This approach attempts to minimize packfile churn
 +    by keeping several generations of varying sized packfiles around
 +    and only consolidating packfiles (or loose objects) which are
 +    either new packfiles, or packfiles close to the same size as
 +    another packfile.

I wonder if a simpler approach may be nearly as efficient as this one:
keep the largest pack out, repack the rest at fetch/push time so there
are at most 2 packs at a time. Or we could do the repack at 'gc
--auto' time, but with a lower pack threshold (about 10 or so). When the
second pack is as big as, say, half the size of the first, merge them
into one at gc --auto time. This can be easily implemented in
git-repack.sh.
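
The "keep the largest pack out" half can already be prototyped with .keep
files, roughly like this sketch (it relies on repack leaving .keep-marked
packs alone, which it does):

    big=$(ls -S .git/objects/pack/*.pack | head -n 1) &&
    touch "${big%.pack}.keep" &&       # protect the biggest pack
    git repack -a -d                   # consolidate everything else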
-- 
Duy


Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-06 Thread Junio C Hamano
Duy Nguyen pclo...@gmail.com writes:

On Tue, Aug 6, 2013 at 9:38 AM, Ramkumar Ramachandra artag...@gmail.com wrote:
 +    Garbage collect using a pseudo logarithmic packfile maintenance
 +    approach.  This approach attempts to minimize packfile churn
 +    by keeping several generations of varying sized packfiles around
 +    and only consolidating packfiles (or loose objects) which are
 +    either new packfiles, or packfiles close to the same size as
 +    another packfile.

 I wonder if a simpler approach may be nearly as efficient as this one:
 keep the largest pack out, repack the rest at fetch/push time so there
 are at most 2 packs at a time. Or we could do the repack at 'gc
 --auto' time, but with lower pack threshold (about 10 or so). When the
 second pack is as big as, say half the size of the first, merge them
 into one at gc --auto time. This can be easily implemented in
 git-repack.sh.

Another random thought.

Imagine we have a cheap way to enumerate the young objects without
the usual history traversal.  For example, list of all loose objects
and what appears in the .idx files that are young.

We can reconstruct names for trees and blobs from such a list of
object names; if a commit in the list refers to a tree, that tree is
the top level, and a blob or a tree that appears in such a top-level
tree can be given a name for its place in the tree (recursively).
I suspect we would end up giving names to a large majority of trees
and blobs in such a list by doing so.

If these suspicions turn out to be true, then we could:

 - run that enumeration algorithm to come up with a set of object
   names;

 - emit the tag objects in that set in the tagger timestamp order;

 - emit the commit objects in that set in the commit timestamp
   order, while noting the tree objects contained in the set, giving
   them names;

 - traverse the trees and blobs in that set, giving the found ones
   names (do so without stepping outside the set);

 - emit the trees and blobs with their names.  Some objects may not
   have been given any name, but that is OK as long as they are in the
   minority.

And feeding it to pack-objects to produce a single pack, and then
prune away the source of these young objects in the end.

The above could turn out to be much cheaper than the traditional
history traversal.
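
A skeleton of that pipeline, leaving out the interesting middle part (the
ordering and naming of trees and blobs described above); the young.list and
"young" pack base name are made up for illustration:

    # enumerate loose object names without any history walk, then (after the
    # ordering/naming pass sketched above, omitted here) feed them to
    # pack-objects and drop the loose copies that are now packed
    find .git/objects/[0-9a-f][0-9a-f] -type f |
        sed -e 's|.*objects/||' -e 's|/||' > young.list &&
    git pack-objects .git/objects/pack/young < young.list &&
    git prune-packed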


Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-06 Thread Martin Fick
On Tuesday, August 06, 2013 06:24:50 am Duy Nguyen wrote:
On Tue, Aug 6, 2013 at 9:38 AM, Ramkumar Ramachandra artag...@gmail.com wrote:
 +    Garbage collect using a pseudo logarithmic packfile maintenance
 +    approach.  This approach attempts to minimize packfile churn
 +    by keeping several generations of varying sized packfiles around
 +    and only consolidating packfiles (or loose objects) which are
 +    either new packfiles, or packfiles close to the same size as
 +    another packfile.
 
 I wonder if a simpler approach may be nearly as efficient as
 this one: keep the largest pack out, repack the rest at
 fetch/push time so there are at most 2 packs at a time.
 Or we could do the repack at 'gc --auto' time, but
 with lower pack threshold (about 10 or so). When the
 second pack is as big as, say half the size of the
 first, merge them into one at gc --auto time. This can
 be easily implemented in git-repack.sh.

It would definitely be better than the current gc approach.  

However, I suspect it is still at least one to two orders of 
magnitude off from where it should be.  To give you a real 
world example, on our server today when gitexproll ran on 
our kernel/msm repo, it consolidated 317 pack files into one 
almost 8M packfile (it compresses/dedupes shockingly well, 
one of those new packs was 33M).  Our largest packfile in 
that repo is 1.5G!  

So let's now imagine that the second closest packfile is 
only 100M, it would keep getting consolidated with 8M worth 
of data every day (assuming the same conditions and no extra 
compression).  That would take (750M-100M)/8M ~ 81 days to 
finally build up large enough to no longer consolidate the 
new packs with the second largest pack file daily.  During 
those 80+ days, it will be on average writing 325M too much 
per day (when it should be writing just 8M).

So I can see the appeal of a simple solution, unfortunately 
I think one layer would still suck though.  And if you are 
going to add even just one extra layer, I suspect that you 
might as well go the full distance since you probably 
already need to implement the logic to do so?

-Martin



Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-06 Thread Martin Fick
On Monday, August 05, 2013 08:38:47 pm Ramkumar Ramachandra 
wrote:
 This is the rough explanation I wrote down after reading
 it:
 
 So, the problem is that my .git/objects/pack is polluted
 with little packs every time I fetch (or push, if you're
 the server), and this is problematic from the perspective
 of an overly (naively) aggressive gc that hammers out all
 fragmentation.  So, on the first run, the little packfiles
 I have are all consolidated into big packfiles; you also
 write .keep files to say "don't gc these big packs we just
 generated".  In subsequent runs, the little packfiles from
 the fetch are absorbed into a pack that is immune to gc.
 You're also using a size heuristic, to consolidate
 similarly sized packfiles.  You also have a --ratio to
 tweak the ratio of sizes.
 
 From: Martin Fick mf...@codeaurora.org
 See: https://gerrit-review.googlesource.com/#/c/35215/
 Thread: http://thread.gmane.org/gmane.comp.version-control.git/231555
 (Martin's emails are missing from the archive)
 ---

After analyzing today's data, I recognize that in some 
circumstances the size estimation after consolidation can be 
off by huge amounts.  The script naively just adds the 
current sizes together.  This gives a very rough estimate
of the new packfile size, but sometimes it can be off by
over 2 orders of magnitude. :(  While many new packfiles are 
tiny (several K only), it seems like the larger new 
packfiles have a terrible tendency to throw the estimate way 
off (I suspect they simply have many duplicate objects).  
But despite this poor estimate, the script still offers 
drastic improvements over plain git gc.

So, it has me wondering if there isn't a more accurate way 
to estimate the new packfile without wasting a ton of time?

If not, one approach which might be worth experimenting with 
is to just assume that new packfiles have size 0!  Then just 
consolidate them with any other packfile which is ready for 
consolidation, or if none are ready, with the smallest 
packfile.  I would not be surprised to see this work better,
on average, than the current summation.

-Martin


-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation
 


Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-06 Thread Ramkumar Ramachandra
Martin Fick wrote:
 So, it has me wondering if there isn't a more accurate way
 to estimate the new packfile without wasting a ton of time?

I'm not sure there is. Adding the sizes of individual packs can be off
by a lot, because your deltification will be more effective if you
have more data to slide windows over and compress. For the purposes of
illustration, take a simple example:

packfile-1 has a 30M Makefile and several tiny deltas. Total = 40M.
packfile-2 has a 31M Makefile.am and several tiny deltas. Total = 40M.

Now, what is the size of packfile-3 which contains the contents of
both packfile-1 and packfile-2? 80M is a bad estimate, because you can
store deltas against just one Makefile.

So, unless you do an in-depth analysis of the objects in the packfiles
(which can be terribly expensive), I don't see how you can arrive at a
better estimate.
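
The closest thing to a cheap improvement I can think of is only a
sketch, and it only accounts for exact duplicates, not for better
deltas: count the objects that appear in more than one of the
candidate packs by reading their .idx files.

    # count objects recorded in more than one of the given .idx files;
    # each duplicate is storage the consolidated pack won't pay for twice
    for idx in "$@"
    do
        git show-index <"$idx" | awk '{print $2}'
    done | sort | uniq -d | wc -l

That still says nothing about cross-pack deltas like the Makefile
example above, so it would only tighten the estimate, not fix it.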

 If not, one approach which might be worth experimenting with
 is to just assume that new packfiles have size 0!  Then just
 consolidate them with any other packfile which is ready for
 consolidation, or if none are ready, with the smallest
 packfile.  I would not be surprised to see this work better,
 on average, than the current summation.

That assumes that all fetches (or pushes) are small, which is
probably a good rule of thumb; you might want a smallness
threshold, although I haven't thought hard about the problem.
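
For what it's worth, a sketch of that heuristic (the 1M smallness
threshold is an arbitrary placeholder):

    # estimate the consolidated size, counting "small" (new) packs as zero
    small_limit=$(( 1024 * 1024 ))      # 1M placeholder threshold
    est=0
    for pack in "$@"
    do
        size=$(wc -c <"$pack")
        test "$size" -lt "$small_limit" || est=$(( est + size ))
    done
    echo "estimated consolidated size: $est bytes"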


Re: [PATCH] git exproll: steps to tackle gc aggression

2013-08-06 Thread Ramkumar Ramachandra
Junio C Hamano wrote:
 Imagine we have a cheap way to enumerate the young objects without
 the usual history traversal.

Before we discuss the advantages, can you outline how we can possibly
get this data without actually walking downwards from the roots
(refs)? One way to do it is to pull data out of a log of ref updates
(aka. reflog), but we both know how unreliable that can be.


[PATCH] git exproll: steps to tackle gc aggression

2013-08-05 Thread Ramkumar Ramachandra
This is the rough explanation I wrote down after reading it:

So, the problem is that my .git/objects/pack is polluted with little
packs every time I fetch (or push, if you're the server), and this is
problematic from the perspective of an overly (naively) aggressive gc
that hammers out all fragmentation.  So, on the first run, the little
packfiles I have are all consolidated into big packfiles; you also
write .keep files to say "don't gc these big packs we just
generated".  In subsequent runs, the little packfiles from the fetch
are absorbed into a pack that is immune to gc.  You're also using a
size heuristic, to consolidate similarly sized packfiles.  You also
have a --ratio to tweak the ratio of sizes.

From: Martin Fick mf...@codeaurora.org
See: https://gerrit-review.googlesource.com/#/c/35215/
Thread: http://thread.gmane.org/gmane.comp.version-control.git/231555
(Martin's emails are missing from the archive)
---
 Not a candidate for inclusion, given how difficult it is to convince
 our conservative maintainer to add anything new to the tree.

 I'm posting this in the interest of visibility: I've checked it into
 my repo and started using it. I encourage everyone else to do the
 same, so we can learn from it and discuss how to improve gc.
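
 For reference, a typical invocation looks something like this (the
 repository path is just an example; -n does a dry run):

    contrib/exproll/git-exproll.sh -v -n /path/to/repo.git    # dry run first
    contrib/exproll/git-exproll.sh -v /path/to/repo.git       # then for real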

 contrib/exproll/git-exproll.sh | 566 +
 1 file changed, 566 insertions(+)
 create mode 100755 contrib/exproll/git-exproll.sh

diff --git a/contrib/exproll/git-exproll.sh b/contrib/exproll/git-exproll.sh
new file mode 100755
index 0000000..9526d9f
--- /dev/null
+++ b/contrib/exproll/git-exproll.sh
@@ -0,0 +1,566 @@
+#!/bin/bash
+# Copyright (c) 2012, Code Aurora Forum. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+## Redistributions of source code must retain the above copyright
+#   notice, this list of conditions and the following disclaimer.
+## Redistributions in binary form must reproduce the above
+#   copyright notice, this list of conditions and the following
+#   disclaimer in the documentation and/or other materials provided
+#   with the distribution.
+## Neither the name of Code Aurora Forum, Inc. nor the names of its
+#   contributors may be used to endorse or promote products derived
+#   from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED AS IS AND ANY EXPRESS OR IMPLIED
+# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT
+# ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS
+# BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+# BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
+# WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
+# OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN
+# IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+usage() { # error_message
+
+cat <<-EOF
+   usage: $(basename $0) [-unvt] [--noref] [--noloose] [-r|--ratio number]
+          [git gc option...] git.repo

+   -u|-h              usage/help
+   -v                 verbose
+   -n                 dry-run   don't actually repack anything
+   -t                 touch     treat repo as if it had been touched
+   --noref            avoid extra ref packing timestamp checking
+   --noloose          do not run just because there are loose object dirs
+                      (repacking may still run if they are referenced)
+   -r ratio number    packfile ratio to aim for (default 10)

+   git gc option      will be passed as args to git gc

+   git.repo           to run gc against
+
+   Garbage collect using a pseudo logarithmic packfile maintenance
+   approach.  This approach attempts to minimize packfile churn
+   by keeping several generations of varying sized packfiles around
+   and only consolidating packfiles (or loose objects) which are
+   either new packfiles, or packfiles close to the same size as
+   another packfile.
+
+   An estimate is used to predict when rollups (one consolidation
+   would cause another consolidation) would occur so that this
+   rollup can be done all at once via a single repack.  This reduces
+   both the runtime and the pack file churn in rollup cases.
+
+   Approach: plan each consolidation by creating a table like this:
+
+   Id Keep Size   Sha1(or consolidation list)