Re: Why does pack-objects use so much memory on incremental packing?

2018-03-19 Thread Jeff King
On Sat, Mar 17, 2018 at 11:05:59PM +0100, Ævar Arnfjörð Bjarmason wrote:

> Splitting this off into its own thread. Aside from the improvements in
> your repack memory reduction (20180317141033.21545-1-pclo...@gmail.com)
> and gc config (20180316192745.19557-1-pclo...@gmail.com) series, I'm
> wondering why repack takes so much memory to incrementally repack new
> stuff when you leave out the base pack.

I think it's a combination of a few issues:

 1. We do a complete history traversal, and then cull out objects which
our filters reject (e.g., things in a .keep pack). So you pay for
all of the "struct object", along with the obj_hash table to look
them up.

In my measurements of just "git rev-list --objects --all", that's
about 25MB for git.git. Plus a few misc things (pending object
structs for the traversal, etc).

 2. The delta-base cache used for the traversal is a fixed size. So
that's going to be 96MB regardless of your repo size.
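
   For illustration, that fixed size is the `core.deltaBaseCacheLimit`
   knob (default 96 MiB), and it can be lowered per invocation when
   memory matters more than speed. A minimal sketch; the throwaway repo
   setup exists only so the commands run standalone:

```shell
# Throwaway repo so the commands below run standalone.
set -e
repo=$(mktemp -d) && git init -q "$repo" && cd "$repo"
git -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m init

# Lower the fixed-size delta-base cache just for this walk; the
# trade-off is re-inflating delta bases more often, not correctness.
git -c core.deltaBaseCacheLimit=16m rev-list --objects --all
```

   On a real repo the walk gets slower as bases fall out of the smaller
   cache, but the 96MB floor goes away.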

I measured a total heap usage of 130MB for "rev-list --objects --all".
That's not 230MB, but I'm not sure what you're measuring. If it's RSS,
keep in mind that it includes the mmap'd packfiles, too.

Doing a separate "rev-list | pack-objects" should be slightly cheaper
(although it will still have a similar peak cost, since that memory will
just be moved to the rev-list process).

If you _just_ want to pack the loose objects, you could probably do
something like:

  find .git/objects/?? -type f |
  sed 's|.*objects/||; s|/||' |
  git pack-objects .git/objects/pack/pack
  git prune-packed

But you'd get pretty crappy deltas out of that, since the heuristics
rely on knowing the filenames of trees and blobs (which you can only get
by walking the graph).

So you'd do better with something like:

  git rev-list --objects $new_tips --not $old_tips |
  git pack-objects .git/objects/pack/pack

but it's hard to know what "$old_tips" should be, unless you recorded it
last time you did a full repack.
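
One way to make that work, sketched under the assumption that you
control the repack schedule (the `.git/last-repack-tips` file name is
made up for this example):

```shell
# Throwaway repo so the commands below run standalone.
set -e
repo=$(mktemp -d) && git init -q "$repo" && cd "$repo"
git -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m base

# At full-repack time, snapshot the ref tips (hypothetical file name):
git for-each-ref --format='%(objectname)' refs/ >.git/last-repack-tips

# ...later, after new commits have arrived...
echo change >file && git add file
git -c user.name=t -c user.email=t@example.com commit -q -m incremental

# Pack only what is new since the snapshot:
git rev-list --objects --all --not $(cat .git/last-repack-tips) |
git pack-objects -q .git/objects/pack/pack
```

The walk is then bounded by the new history, which is the cheap case
this thread is after.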

> But no, it takes around 230MB. Thinking about it a bit further:
> 
>  * This builds on top of existing history, so that needs to be
>    read/consulted

Right, I think this is the main thing.

>  * We might be reusing (if not directly, then skipping re-computing)
>    deltas from the existing pack.

I don't think that should matter. We'll reuse deltas if the base is
going into our pack, but otherwise recompute. The delta computation
itself takes some memory, but it should be fairly constant even for a
large repo (it's really average_blob_size * window_size).
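
If that per-delta-search memory does matter (e.g. with huge blobs), it
can be capped with `--window-memory` (or the `pack.windowMemory`
config), which scales the effective window down rather than exceeding
the cap. A minimal sketch on a throwaway repo:

```shell
# Throwaway repo so the commands below run standalone.
set -e
repo=$(mktemp -d) && git init -q "$repo" && cd "$repo"
echo data >file && git add file
git -c user.name=t -c user.email=t@example.com commit -q -m init

# Cap the memory held by the delta search window per thread;
# git shrinks the window instead of exceeding this (0 = unlimited).
git repack -a -d --window=10 --window-memory=32m
```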

So I think most of your memory is just going to the traversal stuff.
Running:

  valgrind --tool=massif git pack-objects --all foo

> But I get the same result if after cloning I make an orphan branch, and
> pass all the "do this as cheaply as possible" branches I can find down
> to git-repack:
> 
> (
> rm -rf /tmp/git &&
> git clone g...@github.com:git/git.git /tmp/git &&
> cd /tmp/git &&
> touch $(ls .git/objects/pack/*pack | sed 's/\.pack$/.keep/') &&
> git checkout --orphan new &&
> git reset --hard &&
> for i in {1..10}
> do
> touch $i &&
> git add $i &&
> git commit -m$i
> done &&
> git tag -d $(git tag -l) &&
> /usr/bin/time -f %M git repack -A -d -f -F --window=1 --depth=1
> )
> 
> But the memory use barely changes, my first example used 227924 kb, but
> this one uses 226788.

I think you still had to do the whole history traversal there, because
you have existing refs (the "master" branch, along with refs/remotes) as
well as reflogs.

Try:

  git branch -d master
  git remote rm origin
  rm -rf .git/logs

After that, the repack uses about 5MB.

> Jeff: Is this something ref islands[1] could be (ab)used to do, or have
> I misunderstood that concept?
> 
> 1. https://public-inbox.org/git/20130626051117.gb26...@sigill.intra.peff.net/
>    https://public-inbox.org/git/20160304153359.ga16...@sigill.intra.peff.net/
>    https://public-inbox.org/git/20160809174528.2ydgkhd7aycla...@sigill.intra.peff.net/

I think you misunderstood the concept. :)

They are about disallowing deltas between unrelated islands. They
actually require _more_ memory, because you have to store an island
bitmap for each object (though with some copy-on-write magic, it's not
too bad). But they can never save you memory, since reused deltas are
always cheaper than re-finding new ones.
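
For what it's worth, the island idea did land in later git versions as
"delta islands": refs are grouped by `pack.island` regexes, and deltas
are only allowed against bases in the same (or an enclosing) island. A
minimal sketch on a throwaway repo, assuming a git new enough to have
`--delta-islands`:

```shell
# Throwaway repo so the commands below run standalone.
set -e
repo=$(mktemp -d) && git init -q "$repo" && cd "$repo"
echo data >file && git add file
git -c user.name=t -c user.email=t@example.com commit -q -m init

# One island covering all branches; capture groups in the regex
# would split refs into separate islands instead.
git config --add pack.island 'refs/heads/'
git repack -a -d --delta-islands
```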

-Peff


Re: Why does pack-objects use so much memory on incremental packing?

2018-03-18 Thread Duy Nguyen
On Sat, Mar 17, 2018 at 11:05 PM, Ævar Arnfjörð Bjarmason wrote:
>
> On Wed, Feb 28 2018, Duy Nguyen jotted:
>
>> linux-2.6.git current has 6483999 objects. "git gc" on my poor laptop
>> consumes 1.7G out of 4G RAM, pushing lots of data to swap and making
>> all apps nearly unusable (granted the problem is partly Linux I/O
>> scheduler too). So I wonder if we can reduce pack-objects memory
>> footprint a bit.
>>
>> This demonstration patch (probably breaks some tests) would reduce the
>> size of struct object_entry from 136 down to 112 bytes on
>> x86-64. There are 6483999 of these objects, so the saving is 17% or
>> 148 MB.
>
> Splitting this off into its own thread. Aside from the improvements in
> your repack memory reduction (20180317141033.21545-1-pclo...@gmail.com)
> and gc config (20180316192745.19557-1-pclo...@gmail.com) series, I'm
> wondering why repack takes so much memory to incrementally repack new
> stuff when you leave out the base pack.
>
> Repacking git.git takes around 290MB of memory on my system, but I'd
> think that this would make it take a mere few megabytes, since all I'm
> asking it to do is pack up the few loose objects that got added and keep
> the base pack:
>
> ...
>

I left some clues in the new estimate_repack_memory() function in my gc
series that could help you figure this out. I haven't really tested this
case, but my guess is that the two cache pools we have will likely be
filled close to full anyway, hitting the delta_base_cache_limit and
max_delta_cache_size limits. When these are really full on the default
configuration, they'll take roughly ~300MB.
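
Those two pools map to the `core.deltaBaseCacheLimit` (default 96 MiB)
and `pack.deltaCacheSize` (default 256 MiB) config knobs, so a
memory-constrained repack can shrink both, trading CPU for memory. A
sketch; the throwaway repo is only so the commands run standalone:

```shell
# Throwaway repo so the commands below run standalone.
set -e
repo=$(mktemp -d) && git init -q "$repo" && cd "$repo"
echo data >file && git add file
git -c user.name=t -c user.email=t@example.com commit -q -m init

# Shrink both caches for this repack only.
# core.deltaBaseCacheLimit caps the read-side delta-base cache;
# pack.deltaCacheSize caps cached delta results on the write side.
git -c core.deltaBaseCacheLimit=32m \
    -c pack.deltaCacheSize=32m \
    repack -a -d
```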

The second thing is that I think we still go through all objects to mark
which ones are included in the new pack and which are not (and probably
which ones can be delta base candidates). Try calling the alloc_report()
function at the end of repack to see exactly how much memory is locked
up in there. This we could perhaps improve for incremental repacks by
avoiding running rev-list.
-- 
Duy


Why does pack-objects use so much memory on incremental packing?

2018-03-17 Thread Ævar Arnfjörð Bjarmason

On Wed, Feb 28 2018, Duy Nguyen jotted:

> linux-2.6.git current has 6483999 objects. "git gc" on my poor laptop
> consumes 1.7G out of 4G RAM, pushing lots of data to swap and making
> all apps nearly unusable (granted the problem is partly Linux I/O
> scheduler too). So I wonder if we can reduce pack-objects memory
> footprint a bit.
>
> This demonstration patch (probably breaks some tests) would reduce the
> size of struct object_entry from 136 down to 112 bytes on
> x86-64. There are 6483999 of these objects, so the saving is 17% or
> 148 MB.

Splitting this off into its own thread. Aside from the improvements in
your repack memory reduction (20180317141033.21545-1-pclo...@gmail.com)
and gc config (20180316192745.19557-1-pclo...@gmail.com) series, I'm
wondering why repack takes so much memory to incrementally repack new
stuff when you leave out the base pack.

Repacking git.git takes around 290MB of memory on my system, but I'd
think that this would make it take a mere few megabytes, since all I'm
asking it to do is pack up the few loose objects that got added and keep
the base pack:

(
rm -rf /tmp/git &&
git clone g...@github.com:git/git.git /tmp/git &&
cd /tmp/git &&
touch $(ls .git/objects/pack/*pack | sed 's/\.pack$/.keep/') &&
for i in {1..10}
do
touch $i &&
git add $i &&
git commit -m$i
done &&
/usr/bin/time -f %M git repack -A -d
)

But no, it takes around 230MB. Thinking about it a bit further:

 * This builds on top of existing history, so that needs to be
   read/consulted

 * We might be reusing (if not directly, skipping re-comuting) deltas
   from the existing pack.

But I get the same result if after cloning I make an orphan branch, and
pass all the "do this as cheaply as possible" branches I can find down
to git-repack:

(
rm -rf /tmp/git &&
git clone g...@github.com:git/git.git /tmp/git &&
cd /tmp/git &&
touch $(ls .git/objects/pack/*pack | sed 's/\.pack$/.keep/') &&
git checkout --orphan new &&
git reset --hard &&
for i in {1..10}
do
touch $i &&
git add $i &&
git commit -m$i
done &&
git tag -d $(git tag -l) &&
/usr/bin/time -f %M git repack -A -d -f -F --window=1 --depth=1
)

But the memory use barely changes, my first example used 227924 kb, but
this one uses 226788.

Of course nobody's going to clone a huge repo and then right away create
an --orphan branch, but is there an inherent reason why this couldn't
take as little memory as if the repo had been cloned with --depth=1?

I.e., when I have a *.keep on an existing pack, we could have some
low-memory mode that copies the trees/blobs needed for the current
commit over to the new pack, and uses that as the basis for packing
everything going forward.
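
A rough approximation of that mode can be sketched today with
`--honor-pack-keep`, which makes pack-objects skip objects that are
already in a kept pack; the `HEAD^!` walk below (tip minus its parents)
stands in for "just the new stuff" and is an illustration, not a full
solution:

```shell
# Throwaway repo: one kept base pack plus one new commit.
set -e
repo=$(mktemp -d) && git init -q "$repo" && cd "$repo"
git -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m base
git repack -a -d -q
for p in .git/objects/pack/*.pack; do touch "${p%.pack}.keep"; done

echo new >file && git add file
git -c user.name=t -c user.email=t@example.com commit -q -m tip

# Walk only the tip (HEAD^! = HEAD minus its parents) and let
# pack-objects drop anything already in a kept pack:
git rev-list --objects HEAD^! |
git pack-objects -q --honor-pack-keep .git/objects/pack/pack
git prune-packed -q
```

The walk stays small, but as noted above the delta quality suffers
unless you feed pack-objects proper name hints from a real traversal.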

Jeff: Is this something ref islands[1] could be (ab)used to do, or have
I misunderstood that concept?

1. https://public-inbox.org/git/20130626051117.gb26...@sigill.intra.peff.net/
   https://public-inbox.org/git/20160304153359.ga16...@sigill.intra.peff.net/
   https://public-inbox.org/git/20160809174528.2ydgkhd7aycla...@sigill.intra.peff.net/