Re: [PATCH 00/11] Reduce pack-objects memory footprint

2018-03-02 Thread Jeff King
On Fri, Mar 02, 2018 at 07:14:01AM +0700, Duy Nguyen wrote:

> > We have a big repo, and this gets repacked on 6-8GB of memory on dev
> > KVMs, so we're under a fair bit of memory pressure. git-gc slows things
> > down a lot.
> >
> > It would be really nice to have something that made it use drastically
> > less memory at the cost of less efficient packs. Is the property that
> 
> Ahh.. less efficient. You may be more interested in [1] then. It
> avoids rewriting the base pack. Without the base pack, bookkeeping
> becomes much much cheaper.
> 
> We still read every single byte in all packs though (I think, unless
> you use pack-bitmap) and this amount of I/O affects the rest of the
> system too. Perhaps reducing core.packedgitwindowsize might make it
> friendlier to the OS, I don't know.

Yes, the ".keep" thing is actually quite expensive. We still do a
complete rev-list to find all the objects we want, and then for each
object say "is this in a pack with .keep?". And worse, the mru doesn't
help there because even if we find it in the first pack, we have to keep
looking to see if it's in _another_ pack.

There are probably some low-hanging optimizations there (e.g., only
looking in the .keep packs if that's all we're looking for; we may even
do that already).

But I think fundamentally you'd do much better to generate the partial
list of objects outside of pack-objects entirely, and then just feed it
to pack-objects without using "--revs".

-Peff
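
The idea of building the object list outside pack-objects can be
sketched as a small filter. This is only an illustration under stated
assumptions: the function names, the 40-hex-digit object name width, and
the flat sorted buffer of "kept" object names are all hypothetical, and
how that kept set is obtained is left open.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define OID_HEXSZ 40 /* assumption: object names are 40 hex digits */

static int cmp_oid(const void *a, const void *b)
{
	return memcmp(a, b, OID_HEXSZ);
}

/*
 * `kept` is a flat buffer of `nkept` object names, OID_HEXSZ bytes each,
 * sorted in memcmp order (hypothetical representation of the objects
 * that already live in packs we do not want to rewrite).
 */
int is_kept(const char *oid, const char *kept, size_t nkept)
{
	return bsearch(oid, kept, nkept, OID_HEXSZ, cmp_oid) != NULL;
}

/*
 * Copy each "<oid> [path]" line from `in` to `out` unless its object
 * name is in `kept`. Returns the number of lines copied.
 */
size_t filter_objects(FILE *in, FILE *out, const char *kept, size_t nkept)
{
	char buf[4096];
	int at_line_start = 1, copying = 0;
	size_t copied = 0;

	while (fgets(buf, sizeof(buf), in)) {
		size_t len = strlen(buf);

		if (at_line_start) {
			/* decide once per line, on the chunk holding the oid */
			copying = len >= OID_HEXSZ && !is_kept(buf, kept, nkept);
			if (copying)
				copied++;
		}
		if (copying)
			fputs(buf, out);
		/* a chunk without a trailing newline continues the same line */
		at_line_start = len > 0 && buf[len - 1] == '\n';
	}
	return copied;
}
```

Wrapped in a small program, such a filter would sit between
"git rev-list --objects" and "git pack-objects", which reads object
names from standard input when "--revs" is not given.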


Re: [PATCH 00/11] Reduce pack-objects memory footprint

2018-03-01 Thread Duy Nguyen
On Thu, Mar 1, 2018 at 8:33 PM, Ævar Arnfjörð Bjarmason wrote:
>
> On Thu, Mar 01 2018, Nguyễn Thái Ngọc Duy jotted:
>
>> The array of object_entry in pack-objects can take a lot of memory
>> when pack-objects is run in "pack everything" mode. On linux-2.6.git,
>> this array alone takes roughly 800MB.
>>
>> This series reorders some fields and reduces field size... to keep
>> this struct smaller. Its size goes from 136 bytes to 96 bytes (29%) on
>> 64-bit linux and saves 260MB on linux-2.6.git.
>
> I'm very interested in this patch series. I don't have time to test this
> one right now (have to run), but with your previous RFC patch memory use
> (in the ~4GB range) on a big in-house repo went down by a bit over 3%,
> and it's ~5% faster.
>
> Before/after RSS 4440812 / 429 & runtime 172.73 / 162.45. This is
> after having already done a full git gc before, data via /usr/bin/time
> -v.

Jeff correctly pointed out elsewhere in this thread that RSS covers
both heap (this is what I try to reduce) and some file cache (we mmap
the whole pack file just to ease the reading), so RSS might not be a
good indicator of memory reduction. Any newly freed memory will likely
be used for cache, which raises RSS back up. I think the RssAnon field
in /proc/<pid>/status shows it better.
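
A minimal sketch of reading that field, assuming a Linux kernel recent
enough to report RssAnon (older kernels do not). The function names are
made up for illustration; only the "RssAnon:" label and the
/proc/self/status path come from the kernel interface.

```c
#include <stdio.h>
#include <string.h>

/*
 * Find the "RssAnon:" line in the text of a /proc/<pid>/status file.
 * Returns the value in kB, or -1 if the field is not present.
 */
long parse_rss_anon_kb(const char *status_text)
{
	const char *line = status_text;

	while (line && *line) {
		long kb;

		if (sscanf(line, "RssAnon: %ld kB", &kb) == 1)
			return kb;
		line = strchr(line, '\n');
		if (line)
			line++; /* step past the newline to the next field */
	}
	return -1;
}

/* Read the anonymous RSS of the calling process, in kB, or -1 on error. */
long read_own_rss_anon_kb(void)
{
	char buf[8192];
	size_t n;
	FILE *fp = fopen("/proc/self/status", "r");

	if (!fp)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, fp);
	fclose(fp);
	buf[n] = '\0';
	return parse_rss_anon_kb(buf);
}
```

Unlike the maximum RSS printed by /usr/bin/time -v, this is a snapshot,
so it would have to be sampled while pack-objects is still running.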

> So not huge, but respectable.
>
> We have a big repo, and this gets repacked on 6-8GB of memory on dev
> KVMs, so we're under a fair bit of memory pressure. git-gc slows things
> down a lot.
>
> It would be really nice to have something that made it use drastically
> less memory at the cost of less efficient packs. Is the property that

Ahh.. less efficient. You may be more interested in [1] then. It
avoids rewriting the base pack. Without the base pack, bookkeeping
becomes much much cheaper.

We still read every single byte in all packs though (I think, unless
you use pack-bitmap) and this amount of I/O affects the rest of the
system too. Perhaps reducing core.packedgitwindowsize might make it
friendlier to the OS, I don't know.

> you need to spend give or take the size of .git/objects in memory
> something inherent, or just a limitation of the current implementation?
> I.e. could we do a first pass to pick some objects based on some
> heuristic, then repack them N at a time, and finally delete the
> now-obsolete packs?
>
> Another thing I've dealt with is that on these machines their
> NFS-mounted storage gets exhausted (I'm told) due to some pathological
> operations git does during repack, I/O tends to get 5-6x slower. Of
> course ionice doesn't help because the local kernel doesn't know
> anything about how harmful it is.

[1] https://public-inbox.org/git/20180301092046.2769-1-pclo...@gmail.com/T/#u
-- 
Duy


Re: [PATCH 00/11] Reduce pack-objects memory footprint

2018-03-01 Thread Ævar Arnfjörð Bjarmason

On Thu, Mar 01 2018, Nguyễn Thái Ngọc Duy jotted:

> The array of object_entry in pack-objects can take a lot of memory
> when pack-objects is run in "pack everything" mode. On linux-2.6.git,
> this array alone takes roughly 800MB.
>
> This series reorders some fields and reduces field size... to keep
> this struct smaller. Its size goes from 136 bytes to 96 bytes (29%) on
> 64-bit linux and saves 260MB on linux-2.6.git.

I'm very interested in this patch series. I don't have time to test this
one right now (have to run), but with your previous RFC patch memory use
(in the ~4GB range) on a big in-house repo went down by a bit over 3%,
and it's ~5% faster.

Before/after RSS 4440812 / 429 & runtime 172.73 / 162.45. This is
after having already done a full git gc before, data via /usr/bin/time
-v.

So not huge, but respectable.

We have a big repo, and this gets repacked on 6-8GB of memory on dev
KVMs, so we're under a fair bit of memory pressure. git-gc slows things
down a lot.

It would be really nice to have something that made it use drastically
less memory at the cost of less efficient packs. Is the property that
you need to spend give or take the size of .git/objects in memory
something inherent, or just a limitation of the current implementation?
I.e. could we do a first pass to pick some objects based on some
heuristic, then repack them N at a time, and finally delete the
now-obsolete packs?

Another thing I've dealt with is that on these machines their
NFS-mounted storage gets exhausted (I'm told) due to some pathological
operations git does during repack, I/O tends to get 5-6x slower. Of
course ionice doesn't help because the local kernel doesn't know
anything about how harmful it is.


[PATCH 00/11] Reduce pack-objects memory footprint

2018-03-01 Thread Nguyễn Thái Ngọc Duy
The array of object_entry in pack-objects can take a lot of memory
when pack-objects is run in "pack everything" mode. On linux-2.6.git,
this array alone takes roughly 800MB.

This series reorders some fields and reduces field size... to keep
this struct smaller. Its size goes from 136 bytes to 96 bytes (29%) on
64-bit linux and saves 260MB on linux-2.6.git.
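
To illustrate why reordering alone can shrink a struct, here is a
minimal sketch. The two structs below are hypothetical and much smaller
than the real struct object_entry; they only show how small fields
placed between pointer-sized ones leave padding holes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: small fields scattered between 8-byte ones. */
struct entry_scattered {
	uint8_t type;         /* followed by padding up to the pointer's alignment */
	void *delta;
	uint8_t in_pack_type; /* followed by padding again */
	uint64_t offset;
	uint32_t hash;        /* may be followed by tail padding */
};

/* Same fields, widest first, so the small ones share one slot. */
struct entry_reordered {
	void *delta;
	uint64_t offset;
	uint32_t hash;
	uint8_t type;
	uint8_t in_pack_type;
};

/* Bytes saved per array element by the reordering alone. */
size_t bytes_saved_per_entry(void)
{
	return sizeof(struct entry_scattered) - sizeof(struct entry_reordered);
}
```

With the sizes quoted above, 136 - 96 = 40 bytes are saved per entry, so
a 260MB saving corresponds to roughly 6.5 million entries.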

Now the bad side:

- the number of pack files pack-objects can handle is reduced to 4096
  (previously unlimited)
- max delta chain is also limited to 4096 (previously practically
  unlimited)
- some patches are quite invasive (e.g. replacing pointer with
  uint32_t) and reduce readability a bit.
- it may be trickier to add more data in object_entry in the future.
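
The trade-offs listed above can be sketched as follows. This is an
illustration, not code from the series: the struct, the helper names and
the 12-bit field widths are assumptions, chosen because a 12-bit field
holds 4096 distinct values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEPTH_BITS   12 /* assumption: values 0..4095 */
#define IN_PACK_BITS 12 /* assumption: index into a table of packs */

/* Hypothetical compact entry; not the real struct object_entry. */
struct compact_entry {
	uint32_t delta_idx;                  /* 1-based index of the delta base, 0 = none */
	unsigned depth : DEPTH_BITS;         /* delta chain depth */
	unsigned in_pack_idx : IN_PACK_BITS; /* which pack the object came from */
	unsigned type : 3;
};

/* Turn the stored index back into a pointer, or NULL if there is no base. */
struct compact_entry *get_delta_base(struct compact_entry *entries,
				     const struct compact_entry *e)
{
	return e->delta_idx ? &entries[e->delta_idx - 1] : NULL;
}

/* Store a 4-byte index instead of an 8-byte pointer to the base. */
void set_delta_base(struct compact_entry *entries, struct compact_entry *e,
		    const struct compact_entry *base)
{
	e->delta_idx = base ? (uint32_t)(base - entries) + 1 : 0;
}

/* The bitfield silently truncates, so the limit has to be checked. */
void set_depth(struct compact_entry *e, unsigned depth)
{
	assert(depth < (1u << DEPTH_BITS));
	e->depth = depth;
}
```

Every access now goes through a helper that needs the base of the entry
array, which is where the loss of readability comes from.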

Nguyễn Thái Ngọc Duy (11):
  pack-objects: document holes in struct object_entry.h
  pack-objects: turn type and in_pack_type to bitfields
  pack-objects: use bitfield for object_entry::dfs_state
  pack-objects: use bitfield for object_entry::depth
  pack-objects: note about in_pack_header_size
  pack-objects: move in_pack_pos out of struct object_entry
  pack-objects: move in_pack out of struct object_entry
  pack-objects: faster reverse packed_git lookup
  pack-objects: refer to delta objects by index instead of pointer
  pack-objects: reorder 'hash' to pack struct object_entry
  pack-objects: increase pack file limit to 4096

 builtin/pack-objects.c | 189 ++---
 cache.h                |   3 +
 object.h               |   1 -
 pack-bitmap-write.c    |   8 +-
 pack-bitmap.c          |   2 +-
 pack-bitmap.h          |   4 +-
 pack-objects.h         |  70 ++-
 7 files changed, 180 insertions(+), 97 deletions(-)

-- 
2.16.1.435.g8f24da2e1a