Re: [RFC] pack-objects: compression level for non-blobs

Nguyen Thai Ngoc Duy Sun, 30 Dec 2012 04:54:26 -0800

On Sun, Dec 30, 2012 at 7:05 PM, Jeff King <[email protected]> wrote:
> So I was thinking about this, which led to some coding, which led to
> some benchmarking.


I like your way of thinking! May I suggest you take a new year break
first, then "think" about reachability bitmaps ;-) 2013 will be an
exciting year.

> I want to clean up a few things in the code before I post it, but the
> general idea is to have arbitrary per-pack cache files in the
> objects/pack directory. Like this:
>
>   $ cd objects/pack && ls
>   pack-a3e262f40d95fc0cc97d92797ff9988551367b75.commits
>   pack-a3e262f40d95fc0cc97d92797ff9988551367b75.idx
>   pack-a3e262f40d95fc0cc97d92797ff9988551367b75.pack
>   pack-a3e262f40d95fc0cc97d92797ff9988551367b75.parents
>   pack-a3e262f40d95fc0cc97d92797ff9988551367b75.timestamps
>   pack-a3e262f40d95fc0cc97d92797ff9988551367b75.trees
>
> Each file describes the objects in the matching pack. If a new pack is
> generated, you'd throw away the old cache files along with the old pack,
> and generate new ones. Or not. These are totally optional, and an older
> version of git will just ignore them. A newer version will use them if
> they're available, and otherwise fallback to the existing code (i.e.,
> reading the whole object from the pack). So you can generate them at

You have probably thought about this (and I don't have the source to
check first), but we may need to version these extra files so we can
change the format later if needed. Git versions that do not recognize
new versions simply ignore the cahce.

> repack time, later on, or not at all. For now I have a separate command
> that generates them based on the pack index; if this turns out to be a
> good idea, it would probably get called as part of "repack".

I'd like to make it part of index-pack, where we have nearly
everything in memory. But let's leave it as a separate command first.

> Each file is a set of fixed-length records. The "commits" file contains
> the sha1 of every commit in the pack (sorted). A binary search of the
> mmap'd file gives the position of a particular commit within the list,

I think we could avoid storing sha-1 in the cache with Shawn's idea
[1]. But now I read it again I fail to see it :(

[1] http://article.gmane.org/gmane.comp.version-control.git/206485

> Of course, it does very little for the full --objects listing, where we
> spend most of our time inflating trees. We could couple this with
> uncompressed trees (which are not all that much bigger, since the sha1s
> do not compress anyway). Or we could have an external tree cache, but
> I'm not sure exactly what it would look like (this is basically
> reinventing bits of packv4, but doing so in a way that is redundant with
> the existing packfile, rather than replacing it).

Depending on the use case, we could just generate packv4-like cache
for recently-used trees only. I'm not sure how tree cache impact a
merge operation on a very large worktree (iow, a lot of trees
referenced from HEAD to be inflated). This is something a cache can
do, but a new pack version cannot.

> Or since the point of
> --objects is usually reachability, it may make more sense to pursue the
> bitmap, which should be even faster still.

Yes. And if narrow clone ever comes, which needs --objects limited by
pathspec, we could just produce extra bitmaps for frequently-used
pathspecs and only allow narrow clone with those pathspecs.
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] pack-objects: compression level for non-blobs

Reply via email to