On Fri, Sep 28, 2012 at 5:00 AM, Nguyen Thai Ngoc Duy <pclo...@gmail.com> wrote:
> On Thu, Sep 27, 2012 at 7:47 AM, Shawn Pearce <spea...@spearce.org> wrote:
>> * https://git.eclipse.org/r/7939
>>   Defines the new E003 index format and the bit set
>>   implementation logic.
> Quote from the patch's message:
> "Currently, the new index format can only be used with pack files that
> contain a complete closure of the object graph e.g. the result of a
> garbage collection."
> You mentioned this before in your idea mail a while back. I wonder if
> it's worth storing bitmaps for all packs, not just the self contained
> ones.

Colby and I started talking about this late last week too. It seems
feasible, but does add a bit more complexity to the algorithm used
when enumerating.

> We could have one leaf bitmap per pack to mark all leaves where
> we'll need to traverse outside the pack. Commit leaves are the best as
> we can potentially reuse commit bitmaps from other packs. Tree leaves
> will be followed in the normal/slow way.

Yes, Colby proposed the same idea.

We cannot make a "leaf bitmap per pack". The leaf SHA-1s are not in
the pack and therefore cannot have a bit assigned to them. We could
add a new section that lists the unique leaf SHA-1s in their own
private table, and then assign each bitmap a leaf bitmap whose bits
are set to 1 for any leaf object that is outside of the pack. This
would probably take up the least amount of disk space, vs. storing the
list of leaf SHA-1s after each bitmap. If a pack has only 1 bitmap
(e.g. it is a small chunk of recent history) there is really no
difference in disk usage. If the pack has 2 or 3 commit bitmaps along
a string of approximately 300 commits, you will have an identical leaf
set for each of those bitmaps, so a single leaf SHA-1 table lets them
share the leaf entries instead of storing them redundantly.
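
To make the shared table idea concrete, here is a rough sketch in
Python. It is only an illustration of the layout described above, not
the E003 format or anything in the patch; the names LeafTable, bit_for,
mask_for and leaves_in are made up for this example, and a plain
integer stands in for a compressed bit set.

```python
class LeafTable:
    """Hypothetical pack-wide table of unique leaf object IDs.

    Leaves are objects referenced from the pack but stored outside it,
    so they get bit positions in this private table rather than in the
    pack's own object bitmap.
    """

    def __init__(self):
        self.ids = []      # position in this list == bit position
        self._index = {}   # object id -> bit position

    def bit_for(self, object_id):
        """Return the bit position for a leaf, adding it if unseen."""
        pos = self._index.get(object_id)
        if pos is None:
            pos = len(self.ids)
            self._index[object_id] = pos
            self.ids.append(object_id)
        return pos

    def mask_for(self, leaf_ids):
        """Build the leaf bitmap (as an int) for one commit bitmap."""
        mask = 0
        for object_id in leaf_ids:
            mask |= 1 << self.bit_for(object_id)
        return mask

    def leaves_in(self, mask):
        """List the leaf IDs whose bits are set in a leaf bitmap."""
        return [oid for pos, oid in enumerate(self.ids) if (mask >> pos) & 1]


# Two commit bitmaps in the same pack with an identical leaf set:
# each stores only a small mask, and the IDs are stored once.
table = LeafTable()
first = table.mask_for(["leaf-a", "leaf-b"])
second = table.mask_for(["leaf-b", "leaf-a"])
```

With this layout the per-bitmap cost is one bit per table entry, and
the 20-byte SHA-1s are written once no matter how many bitmaps refer
to them.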

One of the problems we have seen with these non-closed packs is that
they waste an incredible amount of disk space. As an example, do a
`git fetch` from Linus' tree when you are more than a few weeks
behind. You will get back more than 100 objects, so the thin pack will
be saved and completed with additional base objects. That thin pack
will go from a few MiB to more than 40 MiB of data on disk, thanks to
the redundant base objects being appended to the end of the pack. For
most uses these packs are best eliminated and replaced with a new
complete closure pack. The redundant base objects disappear, and Git
stops wasting a huge amount of disk space.

> For connectivity check, fewer trees/commits to deflate/parse means
> less time. And connectivity check is done on every git-fetch (I
> suspect the other end of a push also has the same check). It's not
> unusual for me to fetch some repos once every few months so these
> incomplete packs could be quite big and it'll take some time for gc
> --auto to kick in (of course we could adjust gc --auto to start based
> on the number of non-bitmapped objects, in addition to the number of
> packs).

Yes, of course.
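
A trigger along those lines could be as simple as the sketch below.
The pack limit of 50 mirrors the default of gc.autoPackLimit; the
non-bitmapped object threshold and the function name are purely
hypothetical, since no such setting exists today.

```python
def should_auto_gc(pack_count, unbitmapped_objects,
                   pack_limit=50, unbitmapped_limit=100_000):
    """Decide whether an automatic gc should run.

    pack_limit follows the default of gc.autoPackLimit (gc runs once
    there are more packs than this). unbitmapped_limit is an assumed,
    made-up threshold on objects not covered by any bitmap.
    """
    too_many_packs = pack_count > pack_limit
    too_many_unbitmapped = unbitmapped_objects > unbitmapped_limit
    return too_many_packs or too_many_unbitmapped
```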