Re: [PATCH v4 1/7] Add delta-islands.{c,h}
On Mon, Aug 13, 2018 at 9:00 PM, Jeff King wrote: > On Mon, Aug 13, 2018 at 05:33:59AM +0200, Christian Couder wrote: > >> >> + memcpy(&sha_core, oid->hash, sizeof(uint64_t)); >> >> + rl->hash += sha_core; >> > >> > Hmm, so the first 64-bits of the oid of each ref that is part of >> > this island is added together as a 'hash' for the island. And this >> > is used to de-duplicate the islands? Any false positives? (does it >> > matter - it would only affect performance, not correctness, right?) >> >> I would think that a false positive from pure chance is very unlikely. >> We would need to approach billions of delta islands (as 2 to the power >> 64/2 is in the order of billions) for the probability to be >> significant. GitHub has less than 50 millions users and it is very >> unlikely that a significant proportion of these users will fork the >> same repo. >> >> Now if there is a false positive because two forks have exactly the >> same refs, then it is not a problem if they are considered the same, >> because they are actually the same. > > Right, the idea is to find such same-ref setups to avoid spending a > pointless bit in the per-object bitmap. In the GitHub setup, it would be > an indication that two people forked at exactly the same time, so they > have the same refs and the same delta requirements. If one of them later > updates, that relationship would change at the next repack. > > I don't know that we ever collected numbers for how often this happens. > So let me see if I can dig some up. > > On our git/git repository network, it looks like we have ~14k forks, and > ~4k are unique by this hashing scheme. So it really is saving us > 10k-bits per bitmap. That's over 1k-byte per object in the worst case. > There are ~24M objects (many times what is in git.git, but people push > lots of random things to their forks), so that's saving us up to 24GB in > RAM. Of course it almost certainly isn't that helpful in practice, since > we copy-on-write the bitmaps to avoid the full cost per object. But I > think it's fair to say it is helping (more numbers below). [...] > So all in all (and I'd emphasize this is extremely rough) I think it > probably costs about 2GB for the feature in this particular case. But > you need much more to repack at this size sanely anyway. Thanks for the interesting numbers!
Re: [PATCH v4 1/7] Add delta-islands.{c,h}
On Mon, Aug 13, 2018 at 8:11 PM, Jeff King wrote: > On Mon, Aug 13, 2018 at 01:17:18PM +0100, Ramsay Jones wrote: > >> >>> +struct island_bitmap { >> >>> + uint32_t refcount; >> >>> + uint32_t bits[]; >> >> >> >> Use FLEX_ARRAY here? We are slowly moving toward requiring >> >> certain C99 features, but I can't remember a flex array >> >> weather-balloon patch. >> > >> > This was already discussed by Junio and Peff there: >> > >> > https://public-inbox.org/git/20180727130229.gb18...@sigill.intra.peff.net/ >> >> That is a fine discussion, without a firm conclusion, but I don't >> think you can simply do nothing here: >> >> $ cat -n junk.c >>1 #include >>2 >>3 struct island_bitmap { >>4 uint32_t refcount; >>5 uint32_t bits[]; >>6 }; >>7 >> $ gcc --std=c89 --pedantic -c junk.c >> junk.c:5:11: warning: ISO C90 does not support flexible array members >> [-Wpedantic] >> uint32_t bits[]; >> ^~~~ >> $ gcc --std=c99 --pedantic -c junk.c > > Right, whether we use the FLEX_ALLOC macros or not, this needs to be > declared with FLEX_ARRAY, not an empty "[]". Ok, it will use FLEX_ARRAY in the reroll I will send soon. Thanks Ramsay and Peff!
Re: [PATCH v4 1/7] Add delta-islands.{c,h}
On Mon, Aug 13, 2018 at 05:33:59AM +0200, Christian Couder wrote: > >> + memcpy(&sha_core, oid->hash, sizeof(uint64_t)); > >> + rl->hash += sha_core; > > > > Hmm, so the first 64-bits of the oid of each ref that is part of > > this island is added together as a 'hash' for the island. And this > > is used to de-duplicate the islands? Any false positives? (does it > > matter - it would only affect performance, not correctness, right?) > > I would think that a false positive from pure chance is very unlikely. > We would need to approach billions of delta islands (as 2 to the power > 64/2 is in the order of billions) for the probability to be > significant. GitHub has less than 50 millions users and it is very > unlikely that a significant proportion of these users will fork the > same repo. > > Now if there is a false positive because two forks have exactly the > same refs, then it is not a problem if they are considered the same, > because they are actually the same. Right, the idea is to find such same-ref setups to avoid spending a pointless bit in the per-object bitmap. In the GitHub setup, it would be an indication that two people forked at exactly the same time, so they have the same refs and the same delta requirements. If one of them later updates, that relationship would change at the next repack. I don't know that we ever collected numbers for how often this happens. So let me see if I can dig some up. On our git/git repository network, it looks like we have ~14k forks, and ~4k are unique by this hashing scheme. So it really is saving us 10k-bits per bitmap. That's over 1k-byte per object in the worst case. There are ~24M objects (many times what is in git.git, but people push lots of random things to their forks), so that's saving us up to 24GB in RAM. Of course it almost certainly isn't that helpful in practice, since we copy-on-write the bitmaps to avoid the full cost per object. But I think it's fair to say it is helping (more numbers below). The distribution of buckets itself looks pretty power-law-ish from eyeballing it. The biggest one has 92 forks, and then it trails off quickly to smaller numbers, and then eventually a long tail of singletons. The numbers for torvalds/linux are similar. There are ~23k forks, only ~8k of which are unique. rails/rails has 12k/17k unique. So there's some fluctuation, but the notion that there will be a bunch of identical forks seems solid. If you're curious what the actual memory usage is for a repack of git/git, here are some rough numbers from eyeballing RSS via "top" while watching the repack progress meter (I didn't want to do a full valgrind run because it's S-L-O-W): - after loading the islands, we're at ~3GB RSS. Some of that is shared (the mmap'd packed-refs file is 1GB by itself), but probably 1-1.5GB is real heap, likely from the gigantic fork list. - we grow to around 6GB RSS before starting "counting objects". I think this is largely prepare_revision_walk(), since we're making "struct commit" objects for all the ref tips (and this is mmap-ing packfiles, too, which counts against our RSS) - by the time "counting objects" is done, we're at ~10GB. We're not yet running with Duy's memory-slimming for pack-objects, so it's 100+ bytes per object_entry. Which means we're probably only spending a GB or so on island bitmaps, but at this point we've only marked commits and root-trees. - By the time "propagating island marks" is done, we'll have marked all the trees and blobs, and RSS is up to ~11GB. So all in all (and I'd emphasize this is extremely rough) I think it probably costs about 2GB for the feature in this particular case. But you need much more to repack at this size sanely anyway. If I were doing this patch series from scratch, I might implement the fork de-duping and copy-on-write bits successively so I could take good measurements of how much they help (and brag about it in the commit messages). But I don't know if it's worth untangling now or not. -Peff
Re: [PATCH v4 1/7] Add delta-islands.{c,h}
On Mon, Aug 13, 2018 at 01:17:18PM +0100, Ramsay Jones wrote: > >>> +struct island_bitmap { > >>> + uint32_t refcount; > >>> + uint32_t bits[]; > >> > >> Use FLEX_ARRAY here? We are slowly moving toward requiring > >> certain C99 features, but I can't remember a flex array > >> weather-balloon patch. > > > > This was already discussed by Junio and Peff there: > > > > https://public-inbox.org/git/20180727130229.gb18...@sigill.intra.peff.net/ > > > > That is a fine discussion, without a firm conclusion, but I don't > think you can simply do nothing here: > > $ cat -n junk.c >1 #include >2 >3 struct island_bitmap { >4 uint32_t refcount; >5 uint32_t bits[]; >6 }; >7 > $ gcc --std=c89 --pedantic -c junk.c > junk.c:5:11: warning: ISO C90 does not support flexible array members > [-Wpedantic] > uint32_t bits[]; > ^~~~ > $ gcc --std=c99 --pedantic -c junk.c Right, whether we use the FLEX_ALLOC macros or not, this needs to be declared with FLEX_ARRAY, not an empty "[]". I'm fine either way on using the FLEX_ALLOC macros. > >> ... Ah, OK, trg_ => target. > > > > I am ok to replace "trg" with "target" (or maybe "dst"? or something > > else) and "src" with "source" if you think it would make things > > clearer. > > If it had been dst_ (or target), I would not have had a 'huh?' > moment, but it is not all that important. FWIW, these are all inherited from try_delta(), etc, in the existing code. I'm fine with using another term, but we should probably do it universally. And if we do, probably "base" is a better name than "src", since the direction depends on which part of the relationship you are considering. I'm not sure what that makes "dst". -Peff
Re: [PATCH v4 1/7] Add delta-islands.{c,h}
On 13/08/18 04:33, Christian Couder wrote: > On Mon, Aug 13, 2018 at 3:14 AM, Ramsay Jones [snip] >> And neither the sha1 or str hash-maps are destroyed here. >> (That is not necessarily a problem, of course! ;-) ) > > The instances are declared as static: > > static khash_sha1 *island_marks; > > static kh_str_t *remote_islands; > > so it maybe ok. Yes, that's fine. > >>> +struct island_bitmap { >>> + uint32_t refcount; >>> + uint32_t bits[]; >> >> Use FLEX_ARRAY here? We are slowly moving toward requiring >> certain C99 features, but I can't remember a flex array >> weather-balloon patch. > > This was already discussed by Junio and Peff there: > > https://public-inbox.org/git/20180727130229.gb18...@sigill.intra.peff.net/ > That is a fine discussion, without a firm conclusion, but I don't think you can simply do nothing here: $ cat -n junk.c 1#include 2 3struct island_bitmap { 4uint32_t refcount; 5uint32_t bits[]; 6}; 7 $ gcc --std=c89 --pedantic -c junk.c junk.c:5:11: warning: ISO C90 does not support flexible array members [-Wpedantic] uint32_t bits[]; ^~~~ $ gcc --std=c99 --pedantic -c junk.c $ >>> +}; > >>> +int in_same_island(const struct object_id *trg_oid, const struct object_id >>> *src_oid) >> >> Hmm, what does the trg_ prefix stand for? >> >>> +{ >>> + khiter_t trg_pos, src_pos; >>> + >>> + /* If we aren't using islands, assume everything goes together. */ >>> + if (!island_marks) >>> + return 1; >>> + >>> + /* >>> + * If we don't have a bitmap for the target, we can delta it >> >> ... Ah, OK, trg_ => target. > > I am ok to replace "trg" with "target" (or maybe "dst"? or something > else) and "src" with "source" if you think it would make things > clearer. If it had been dst_ (or target), I would not have had a 'huh?' moment, but it is not all that important. > >>> +static void add_ref_to_island(const char *island_name, const struct >>> object_id *oid) >>> +{ >>> + uint64_t sha_core; >>> + struct remote_island *rl = NULL; >>> + >>> + int hash_ret; >>> + khiter_t pos = kh_put_str(remote_islands, island_name, &hash_ret); >>> + >>> + if (hash_ret) { >>> + kh_key(remote_islands, pos) = xstrdup(island_name); >>> + kh_value(remote_islands, pos) = xcalloc(1, sizeof(struct >>> remote_island)); >>> + } >>> + >>> + rl = kh_value(remote_islands, pos); >>> + oid_array_append(&rl->oids, oid); >>> + >>> + memcpy(&sha_core, oid->hash, sizeof(uint64_t)); >>> + rl->hash += sha_core; >> >> Hmm, so the first 64-bits of the oid of each ref that is part of >> this island is added together as a 'hash' for the island. And this >> is used to de-duplicate the islands? Any false positives? (does it >> matter - it would only affect performance, not correctness, right?) > > I would think that a false positive from pure chance is very unlikely. > We would need to approach billions of delta islands (as 2 to the power > 64/2 is in the order of billions) for the probability to be > significant. GitHub has less than 50 millions users and it is very > unlikely that a significant proportion of these users will fork the > same repo. > > Now if there is a false positive because two forks have exactly the > same refs, then it is not a problem if they are considered the same, > because they are actually the same. Yep, good point. ATB, Ramsay Jones
Re: [PATCH v4 1/7] Add delta-islands.{c,h}
On Mon, Aug 13, 2018 at 3:14 AM, Ramsay Jones wrote: > On 12/08/18 06:11, Christian Couder wrote: >> Because the inefficiency primarily arises when an >> object is delitified against another object that does > > s/delitified/deltified/ ? Ok, this will be in the next reroll if any. >> not exist in the same fork, we partition objects into >> sets that appear in the same fork, and define >> "delta islands". When finding delta base, we do not >> allow an object outside the same island to be >> considered as its base. >> --- /dev/null >> +++ b/delta-islands.c >> @@ -0,0 +1,493 @@ >> +#include "cache.h" >> +#include "attr.h" >> +#include "object.h" >> +#include "blob.h" >> +#include "commit.h" >> +#include "tag.h" >> +#include "tree.h" >> +#include "delta.h" >> +#include "pack.h" >> +#include "tree-walk.h" >> +#include "diff.h" >> +#include "revision.h" >> +#include "list-objects.h" >> +#include "progress.h" >> +#include "refs.h" >> +#include "khash.h" > > I was wondering how many copies of the inline functions > introduced by this header we had, so: > > $ nm git | grep ' t ' | cut -d' ' -f3 | sort | uniq -c | sort -rn | grep kh_ > 3 kh_resize_sha1 > 3 kh_put_sha1 > 3 kh_init_sha1 > 3 kh_get_sha1 > 1 kh_resize_str > 1 kh_resize_sha1_pos > 1 kh_put_str > 1 kh_put_sha1_pos > 1 kh_init_str > 1 kh_init_sha1_pos > 1 kh_get_str > 1 kh_get_sha1_pos > 1 kh_destroy_sha1 > $ > > Looking at the individual object files, we see: > > $ nm pack-bitmap-write.o | grep ' t ' | grep kh_ > 01cc t kh_get_sha1 > 01b7 t kh_init_sha1 > 085e t kh_put_sha1 > 0310 t kh_resize_sha1 > $ > > So, the two instances of the sha1 hash-map are never > destroyed (kh_destroy_sha1 is not present in the object > file). This is interesting (even though it seems related to more code than the current patch series). As those hash maps are in 'struct bitmap_writer' and a static instance is used: static struct bitmap_writer writer; it maybe ok. > $ nm pack-bitmap.o | grep ' t ' | grep kh_ > 02d9 t kh_destroy_sha1 > 032b t kh_get_sha1 > 0daa t kh_get_sha1_pos > 02c4 t kh_init_sha1 > 0d95 t kh_init_sha1_pos > 09bd t kh_put_sha1 > 1432 t kh_put_sha1_pos > 046f t kh_resize_sha1 > 0eee t kh_resize_sha1_pos > $ > > The sha1_pos hash-map is not destroyed here. Yeah, maybe a line like: kh_destroy_pos(b->ext_index.positions); is missing from free_bitmap_index()? Adding that should be in a separate patch from this series though. > $ nm delta-islands.o | grep ' t ' | grep kh_ > 02be t kh_get_sha1 > 0e52 t kh_get_str > 02a9 t kh_init_sha1 > 0e3d t kh_init_str > 0950 t kh_put_sha1 > 14e4 t kh_put_str > 0402 t kh_resize_sha1 > 0f96 t kh_resize_str > $ > > And neither the sha1 or str hash-maps are destroyed here. > (That is not necessarily a problem, of course! ;-) ) The instances are declared as static: static khash_sha1 *island_marks; static kh_str_t *remote_islands; so it maybe ok. >> +struct island_bitmap { >> + uint32_t refcount; >> + uint32_t bits[]; > > Use FLEX_ARRAY here? We are slowly moving toward requiring > certain C99 features, but I can't remember a flex array > weather-balloon patch. This was already discussed by Junio and Peff there: https://public-inbox.org/git/20180727130229.gb18...@sigill.intra.peff.net/ >> +}; >> +int in_same_island(const struct object_id *trg_oid, const struct object_id >> *src_oid) > > Hmm, what does the trg_ prefix stand for? > >> +{ >> + khiter_t trg_pos, src_pos; >> + >> + /* If we aren't using islands, assume everything goes together. */ >> + if (!island_marks) >> + return 1; >> + >> + /* >> + * If we don't have a bitmap for the target, we can delta it > > ... Ah, OK, trg_ => target. I am ok to replace "trg" with "target" (or maybe "dst"? or something else) and "src" with "source" if you think it would make things clearer. >> +static void add_ref_to_island(const char *island_name, const struct >> object_id *oid) >> +{ >> + uint64_t sha_core; >> + struct remote_island *rl = NULL; >> + >> + int hash_ret; >> + khiter_t pos = kh_put_str(remote_islands, island_name, &hash_ret); >> + >> + if (hash_ret) { >> + kh_key(remote_islands, pos) = xstrdup(island_name); >> + kh_value(remote_islands, pos) = xcalloc(1, sizeof(struct >> remote_island)); >> + } >> + >> + rl = kh_value(remote_islands, pos); >> + oid_array_append(&rl->oids, oid); >> + >> + memcpy(&sha_core, oid->hash, sizeof(uint64_t)); >> + rl->hash += sha_core; > > Hmm, so the first 64-bits of the oid of each ref that is part of > this is
Re: [PATCH v4 1/7] Add delta-islands.{c,h}
On 12/08/18 06:11, Christian Couder wrote: > From: Jeff King > > Hosting providers that allow users to "fork" existing > repos want those forks to share as much disk space as > possible. > > Alternates are an existing solution to keep all the > objects from all the forks into a unique central repo, > but this can have some drawbacks. Especially when > packing the central repo, deltas will be created > between objects from different forks. > > This can make cloning or fetching a fork much slower > and much more CPU intensive as Git might have to > compute new deltas for many objects to avoid sending > objects from a different fork. > > Because the inefficiency primarily arises when an > object is delitified against another object that does s/delitified/deltified/ ? > not exist in the same fork, we partition objects into > sets that appear in the same fork, and define > "delta islands". When finding delta base, we do not > allow an object outside the same island to be > considered as its base. > > So "delta islands" is a way to store objects from > different forks in the same repo and packfile without > having deltas between objects from different forks. > > This patch implements the delta islands mechanism in > "delta-islands.{c,h}", but does not yet make use of it. > > A few new fields are added in 'struct object_entry' > in "pack-objects.h" though. > > The documentation will follow in a patch that actually > uses delta islands in "builtin/pack-objects.c". > > Signed-off-by: Jeff King > Signed-off-by: Christian Couder > --- > Makefile| 1 + > delta-islands.c | 493 > delta-islands.h | 11 ++ > pack-objects.h | 4 + > 4 files changed, 509 insertions(+) > create mode 100644 delta-islands.c > create mode 100644 delta-islands.h > > diff --git a/Makefile b/Makefile > index bc4fc8eeab..e7994888e8 100644 > --- a/Makefile > +++ b/Makefile > @@ -841,6 +841,7 @@ LIB_OBJS += csum-file.o > LIB_OBJS += ctype.o > LIB_OBJS += date.o > LIB_OBJS += decorate.o > +LIB_OBJS += delta-islands.o > LIB_OBJS += diffcore-break.o > LIB_OBJS += diffcore-delta.o > LIB_OBJS += diffcore-order.o > diff --git a/delta-islands.c b/delta-islands.c > new file mode 100644 > index 00..c0b0da6837 > --- /dev/null > +++ b/delta-islands.c > @@ -0,0 +1,493 @@ > +#include "cache.h" > +#include "attr.h" > +#include "object.h" > +#include "blob.h" > +#include "commit.h" > +#include "tag.h" > +#include "tree.h" > +#include "delta.h" > +#include "pack.h" > +#include "tree-walk.h" > +#include "diff.h" > +#include "revision.h" > +#include "list-objects.h" > +#include "progress.h" > +#include "refs.h" > +#include "khash.h" I was wondering how many copies of the inline functions introduced by this header we had, so: $ nm git | grep ' t ' | cut -d' ' -f3 | sort | uniq -c | sort -rn | grep kh_ 3 kh_resize_sha1 3 kh_put_sha1 3 kh_init_sha1 3 kh_get_sha1 1 kh_resize_str 1 kh_resize_sha1_pos 1 kh_put_str 1 kh_put_sha1_pos 1 kh_init_str 1 kh_init_sha1_pos 1 kh_get_str 1 kh_get_sha1_pos 1 kh_destroy_sha1 $ Looking at the individual object files, we see: $ nm pack-bitmap-write.o | grep ' t ' | grep kh_ 01cc t kh_get_sha1 01b7 t kh_init_sha1 085e t kh_put_sha1 0310 t kh_resize_sha1 $ So, the two instances of the sha1 hash-map are never destroyed (kh_destroy_sha1 is not present in the object file). $ nm pack-bitmap.o | grep ' t ' | grep kh_ 02d9 t kh_destroy_sha1 032b t kh_get_sha1 0daa t kh_get_sha1_pos 02c4 t kh_init_sha1 0d95 t kh_init_sha1_pos 09bd t kh_put_sha1 1432 t kh_put_sha1_pos 046f t kh_resize_sha1 0eee t kh_resize_sha1_pos $ The sha1_pos hash-map is not destroyed here. $ nm delta-islands.o | grep ' t ' | grep kh_ 02be t kh_get_sha1 0e52 t kh_get_str 02a9 t kh_init_sha1 0e3d t kh_init_str 0950 t kh_put_sha1 14e4 t kh_put_str 0402 t kh_resize_sha1 0f96 t kh_resize_str $ And neither the sha1 or str hash-maps are destroyed here. (That is not necessarily a problem, of course! ;-) ) > +#include "pack-bitmap.h" > +#include "pack-objects.h" > +#include "delta-islands.h" > +#include "sha1-array.h" > +#include "config.h" > + > +KHASH_INIT(str, const char *, void *, 1, kh_str_hash_func, kh_str_hash_equal) > + > +static khash_sha1 *island_marks; > +static unsigned island_counter; > +static unsigned island_counter_core; > + > +static kh_str_t *remote_islands; > + > +struct remote_island { > + uint64_t hash; > + struct oid_array oids; > +}; > + > +struct island_bitmap { > + uint32_t refcount; > + uint32_t bits[]; Use FLEX_ARRAY here?