Re: [RFC] pack-objects: compression level for non-blobs

Shawn Pearce Mon, 31 Dec 2012 10:06:45 -0800

This thread is pretty interesting. Unfortunately the holidays have
kept me busy. But I am excited by the work David and Peff are doing.
:-)

On Sun, Dec 30, 2012 at 1:31 PM, Jeff King <[email protected]> wrote:
> On Sun, Dec 30, 2012 at 07:53:48PM +0700, Nguyen Thai Ngoc Duy wrote:
>
>> >   $ cd objects/pack && ls
>> >   pack-a3e262f40d95fc0cc97d92797ff9988551367b75.commits
>> >   pack-a3e262f40d95fc0cc97d92797ff9988551367b75.idx
>> >   pack-a3e262f40d95fc0cc97d92797ff9988551367b75.pack
>> >   pack-a3e262f40d95fc0cc97d92797ff9988551367b75.parents
>> >   pack-a3e262f40d95fc0cc97d92797ff9988551367b75.timestamps
>> >   pack-a3e262f40d95fc0cc97d92797ff9988551367b75.trees
>> >
>> > Each file describes the objects in the matching pack. If a new pack is
>> > generated, you'd throw away the old cache files along with the old pack,
>> > and generate new ones. Or not. These are totally optional, and an older
>> > version of git will just ignore them. A newer version will use them if
>> > they're available, and otherwise fallback to the existing code (i.e.,
>> > reading the whole object from the pack). So you can generate them at
>>
>> You have probably thought about this (and I don't have the source to
>> check first), but we may need to version these extra files so we can
>> change the format later if needed. Git versions that do not recognize
>> new versions simply ignore the cahce.
>
> Agreed. The current code has a 4-byte magic, followed by a 4-byte
> version number, followed by a 4-byte record size[1]. Then the data,
> followed by the pack sha1, followed by a sha1 of all of the preceding
> data.  So you can verify the validity of any cache file (both its
> checksum, and that it matches the right packfile), just as you can with
> a ".idx" file.

Put the pack sha1 into the header, rather than the trailer. Its really
annoying that you read the header, determine you probably understand
this file, and then have to seek to END-40 to read the pack sha1 and
verify it matches the pack. In an ideal world the pack sha1 would have
been the file name, making this less of an issue, but someone didn't
anticipate repacking the same object set with possibly different
results. :-(

The idx format is kind of wrong here, I wish we had put the pack sha1
into the header. Given that we mmap the files the 20 bytes in front
vs. 20 bytes in the trailer wouldn't have made any difference on
access cost.

>> > Each file is a set of fixed-length records. The "commits" file contains
>> > the sha1 of every commit in the pack (sorted). A binary search of the
>> > mmap'd file gives the position of a particular commit within the list,
>>
>> I think we could avoid storing sha-1 in the cache with Shawn's idea
>> [1]. But now I read it again I fail to see it :(
>>
>> [1] http://article.gmane.org/gmane.comp.version-control.git/206485
>
> Right. My implementation is very similar to what Shawn said there. I.e.,
> the timestamps file is literally 4 bytes times the number of commits.
> The parents file is 40 bytes per commit (2 parents, with a marker to
> indicate "more or less than 2"), though a lot of it is zero bytes.

Hmm, after re-reading [1] I still like my idea better. But I won't
find the time to code it myself, so I'll have to go with whatever
someone else writes. :-)

Since tree pointers are also required when parsing a commit (even if
they might not get used e.g. `git log master`) maybe this should be 16
bytes per commit storing the commit time, tree pointer, and 2 parents,
with the last 3 fields using the N-th object in the sorted sha1 list
in the idx. Sorting the file by pack stream ordering gives you good
locality during rev-list operations and makes it compact if
pack-objects adheres to writing commits before other objects.

Unfortunately this ordering requires the pack reverse index in memory
to translate from sha1 to position in the cache file. Making the
reverse index is a non-trivial cost that may dominate the running time
for smaller traversals, or the startup time for `git log` outputting
to the pager.

> Some alternatives I'm thinking about are:
>
>   1. Using non-fixed-size records, which would allow trivial compression
>      of entries like null sha1s. This would mean adding a separate
>      lookup table, though, mapping sha1s to offsets. Still, even a
>      32-bit offset is only 4 bytes per commit. If it meant dropping 40
>      bytes of zeroes from the 2nd parent field out of half of all
>      commits, that would be a win space-wise. It would be a
>      double-indirect lookup, but it's constant effort, and only two page
>      hits (which would be warm after the first lookup anyway).

Or use a 16 byte fixed width record (see above).

>   2. Storing offsets to objects in the packfile rather than their sha1s.
>      This would save a lot of space, but would mean we couldn't refer to
>      parents outside of the pack, but that may be OK. This is an
>      optimization, and the case we want to target is a fully (or mostly)
>      packed repo. It's OK to have the lookup fail and fallback to
>      accessing the object.

I glossed over this in both [1] and this message. I think its
perfectly reasonable to require parsing the commit when the commit's
parents are outside of the pack. These edge commits are infrequent
compared to the number of commits within the pack. Just mark them the
same way you mark an octopus merge, so the reader knows the parent
data is not available in the cache. For most repositories the bulk of
the commits will be in a single giant pack that contains history to
the root.

I wouldn't store the byte offsets here, those are possibly 8 bytes
wide on bigger packs. Instead store the Nth position in the pack
stream. Even if you store byte offsets you need to use the pack
reverse index to recover the SHA-1 in log N time. If you store the Nth
position you also use the pack reverse index, but you can fit all
objects from that pack in 4 bytes rather than 8 bytes per reference.

>   3. Dropping the "commits" file and just using the pack-*.idx as the
>      index. The problem is that it is sparse in the commit space. So
>      just naively storing 40 bytes per entry is going to waste a lot of
>      space. If we had a separate index as in (1) above, that could be
>      dropped to (say) 4 bytes of offset per object. But still, right now
>      the commits file for linux-2.6 is about 7.2M (20 bytes times ~376K
>      commits). There are almost 3 million total objects, so even storing
>      4 bytes per object is going to be worse.

Fix pack-objects to behave the way JGit does, cluster commits first in
the pack stream. Now you have a dense space of commits. If I remember
right this has a tiny positive improvement for most rev-list
operations with very little downside.

>   4. Making a new index version that stores the sha1s separated by type.
>      This means we can piggy-back on the regular index to get a packed
>      list of just commits. But it also means that regular sha1 lookups
>      of the objects have to look in several places (unless the caller
>      annotates the call to read_sha1_object with "I am expecting this
>      sha1 to be a commit"). And of course it means bumping the index
>      version, which is a pain. The external index means it can be
>      completely optional on top of the current index/pack.

I don't think this is worthwhile.

>> Depending on the use case, we could just generate packv4-like cache
>> for recently-used trees only. I'm not sure how tree cache impact a
>> merge operation on a very large worktree (iow, a lot of trees
>> referenced from HEAD to be inflated). This is something a cache can
>> do, but a new pack version cannot.
>
> I do not care too much about the cost of running merge on a large
> working tree. Of course it's better to make our optimizations as
> generally applicable as possible, but there is a lot of other work going
> on in a merge. The really painful, noticeable, repetitive bits right now
> are:
>
>   1. Running git-prune.
>
>   2. Creating a pack from git-upload-pack.
>
> Which are both just reachability problems. Something like "git log --
> <pathspec>" would also be helped by packv4-ish tree access patterns,
> though, but not by reachability bitmaps. And that may be something
> worth caring about.

blame would also benefit from a packv4-ish tree.

But upload-pack and prune can make massive improvements through
bitmaps, while packv4-ish tree would be only a marginal incremental
improvement. In the case of upload-pack having a bitmap gives you much
more knowledge of the remote's have set, and allows making a smaller
pack, in a lot less time, and a smaller server memory footprint. Now
that we have implemented bitmaps in our servers, I can say you really
don't want to ignore the gains we can get from them. packv4-ish tree
might help some other workloads, but bitmaps provide a better solution
to these reachability problems than anything else we know.

>> Yes. And if narrow clone ever comes, which needs --objects limited by
>> pathspec, we could just produce extra bitmaps for frequently-used
>> pathspecs and only allow narrow clone with those pathspecs.
>
> I hadn't thought about that. But yeah, because of the optional, external
> nature, there's no reason you couldn't have extra bitmap sets for
> specialized situations.

Right. We still need to redo the JGit patch series to eject the
bitmaps into an extension file. :-(
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] pack-objects: compression level for non-blobs

Reply via email to