Re: non-smooth progress indication for git fsck and git gc
On Mon, Sep 03, 2018 at 06:48:54PM +0200, Ævar Arnfjörð Bjarmason wrote: > > And there are definitely a few nasty bits (like the way the progress is > > ended). I'm not planning on taking this further for now, but maybe > > you or somebody can find it interesting or useful. > > I think it would be really nice if this were taken further. Using my > perf test in > https://public-inbox.org/git/20180903144928.30691-7-ava...@gmail.com/T/#u > I get these results: > > $ GIT_PERF_LARGE_REPO=/home/aearnfjord/g/linux GIT_PERF_REPEAT_COUNT=5 > GIT_PERF_MAKE_OPTS='-j56 CFLAGS="-O3"' ./run HEAD~ HEAD p1450-fsck.sh > [...] > Test HEAD~ HEAD > > 1450.1: fsck 384.18(381.63+2.53) 301.52(508.28+38.34) -21.5% > > > I.e. this gives a 20% speedup, although of course some of that might be > because some of this might be skipping too much work, but looks really > promising. I'm pretty sure it's doing the correct thing, in terms of doing all the right checks. But look at your CPU time. You're getting a 20% wall-clock speedup, but spending a lot more CPU. So the main difference is really the multi-threading in index-pack. It should be strictly worse in terms of total CPU on a single-processor system because we're doing work in the sub-process (so we pay for the process invocation, but also we probably are unable to share things like in-memory commit structs, wasting a little extra time). So I'm on the fence on whether it is worth it. I like getting rid of the duplicated code. But on the other hand it is not all that complex, and maybe when it comes to things like fsck it is good to have a different implementation than the one that writes the .idx out in the first place. -Peff
Re: non-smooth progress indication for git fsck and git gc
Jeff King writes: > That code isn't lib-ified enough to be run in process, but I think the > patch below should give similar behavior to what fsck currently does. > We'd need to tell index-pack to use our fsck.* config for its checks, I > imagine. The progress here is still per-pack, but I think we could pass > in sufficient information to have it do one continuous meter across all > of the packs (see the in-code comment). > > And it makes the result multi-threaded, and lets us drop a bunch of > duplicate code. > > --- > builtin/fsck.c | 53 +++-- > pack-check.c | 142 --- > pack.h | 1 - > 3 files changed, 32 insertions(+), 164 deletions(-) The numbers here are nice, even to readers who do not necessarily care about the progress meter output ;-)
Re: non-smooth progress indication for git fsck and git gc
On Sun, Sep 02 2018, Jeff King wrote: > On Sun, Sep 02, 2018 at 03:55:28AM -0400, Jeff King wrote: > >> I still think the more interesting long-term thing here is to reuse the >> pack verification from index-pack, which actually hashes as it does the >> per-object countup. >> >> That code isn't lib-ified enough to be run in process, but I think the >> patch below should give similar behavior to what fsck currently does. >> We'd need to tell index-pack to use our fsck.* config for its checks, I >> imagine. The progress here is still per-pack, but I think we could pass >> in sufficient information to have it do one continuous meter across all >> of the packs (see the in-code comment). >> >> And it makes the result multi-threaded, and lets us drop a bunch of >> duplicate code. > > Here's a little polish on that to pass enough progress data to > index-pack to let it have one nice continuous meter. I'm not sure if > it's worth all the hackery or not. The dual-meter that index-pack > generally uses is actually more informative (since it meters the initial > pass through the pack and the delta reconstruction separately). Thanks! > And there are definitely a few nasty bits (like the way the progress is > ended). I'm not planning on taking this further for now, but maybe > you or somebody can find it interesting or useful. I think it would be really nice if this were taken further. Using my perf test in https://public-inbox.org/git/20180903144928.30691-7-ava...@gmail.com/T/#u I get these results: $ GIT_PERF_LARGE_REPO=/home/aearnfjord/g/linux GIT_PERF_REPEAT_COUNT=5 GIT_PERF_MAKE_OPTS='-j56 CFLAGS="-O3"' ./run HEAD~ HEAD p1450-fsck.sh [...] Test HEAD~ HEAD 1450.1: fsck 384.18(381.63+2.53) 301.52(508.28+38.34) -21.5% I.e. this gives a 20% speedup, although of course some of that might be because some of this might be skipping too much work, but looks really promising. I don't know if I'll have time for it, and besides, you wrote the code so you already understand it *hint* *hint* :) > --- > builtin/fsck.c | 59 +-- > builtin/index-pack.c | 43 ++- > pack-check.c | 142 --- > pack.h | 1 - > 4 files changed, 75 insertions(+), 170 deletions(-) > > diff --git a/builtin/fsck.c b/builtin/fsck.c > index 250f5af118..2d774ea2e5 100644 > --- a/builtin/fsck.c > +++ b/builtin/fsck.c > @@ -386,25 +386,6 @@ static int fsck_obj(struct object *obj, void *buffer, > unsigned long size) > return err; > } > > -static int fsck_obj_buffer(const struct object_id *oid, enum object_type > type, > -unsigned long size, void *buffer, int *eaten) > -{ > - /* > - * Note, buffer may be NULL if type is OBJ_BLOB. See > - * verify_packfile(), data_valid variable for details. > - */ > - struct object *obj; > - obj = parse_object_buffer(the_repository, oid, type, size, buffer, > - eaten); > - if (!obj) { > - errors_found |= ERROR_OBJECT; > - return error("%s: object corrupt or missing", oid_to_hex(oid)); > - } > - obj->flags &= ~(REACHABLE | SEEN); > - obj->flags |= HAS_OBJ; > - return fsck_obj(obj, buffer, size); > -} > - > static int default_refs; > > static void fsck_handle_reflog_oid(const char *refname, struct object_id > *oid, > @@ -662,6 +643,32 @@ static int mark_packed_for_connectivity(const struct > object_id *oid, > return 0; > } > > +static int verify_pack(struct packed_git *p, > +unsigned long count, unsigned long total) > +{ > + struct child_process index_pack = CHILD_PROCESS_INIT; > + > + if (is_pack_valid(p) < 0) > + return -1; > + for_each_object_in_pack(p, mark_packed_for_connectivity, NULL, 0); > + > + index_pack.git_cmd = 1; > + argv_array_pushl(&index_pack.args, > + "index-pack", > + "--verify-fsck", > + NULL); > + if (show_progress) > + argv_array_pushf(&index_pack.args, > + "--fsck-progress=%lu,%lu,Checking pack %s", > + count, total, sha1_to_hex(p->sha1)); > + argv_array_push(&index_pack.args, p->pack_name); > + > + if (run_command(&index_pack)) > + return -1; > + > + return 0; > +} > + > static char const * const fsck_usage[] = { > N_("git fsck [] [...]"), > NULL > @@ -737,7 +744,6 @@ int cmd_fsck(int argc, const char **argv, const char > *prefix) > if (check_full) { > struct packed_git *p; > uint32_t total = 0, count = 0; > - struct progress *progress = NULL; > > if (show_progress) { > for (p = get_packed_git(the_repository)
Re: non-smooth progress indication for git fsck and git gc
On Sun, Sep 02, 2018 at 03:55:28AM -0400, Jeff King wrote: > I still think the more interesting long-term thing here is to reuse the > pack verification from index-pack, which actually hashes as it does the > per-object countup. > > That code isn't lib-ified enough to be run in process, but I think the > patch below should give similar behavior to what fsck currently does. > We'd need to tell index-pack to use our fsck.* config for its checks, I > imagine. The progress here is still per-pack, but I think we could pass > in sufficient information to have it do one continuous meter across all > of the packs (see the in-code comment). > > And it makes the result multi-threaded, and lets us drop a bunch of > duplicate code. Here's a little polish on that to pass enough progress data to index-pack to let it have one nice continuous meter. I'm not sure if it's worth all the hackery or not. The dual-meter that index-pack generally uses is actually more informative (since it meters the initial pass through the pack and the delta reconstruction separately). And there are definitely a few nasty bits (like the way the progress is ended). I'm not planning on taking this further for now, but maybe you or somebody can find it interesting or useful. --- builtin/fsck.c | 59 +-- builtin/index-pack.c | 43 ++- pack-check.c | 142 --- pack.h | 1 - 4 files changed, 75 insertions(+), 170 deletions(-) diff --git a/builtin/fsck.c b/builtin/fsck.c index 250f5af118..2d774ea2e5 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -386,25 +386,6 @@ static int fsck_obj(struct object *obj, void *buffer, unsigned long size) return err; } -static int fsck_obj_buffer(const struct object_id *oid, enum object_type type, - unsigned long size, void *buffer, int *eaten) -{ - /* -* Note, buffer may be NULL if type is OBJ_BLOB. See -* verify_packfile(), data_valid variable for details. -*/ - struct object *obj; - obj = parse_object_buffer(the_repository, oid, type, size, buffer, - eaten); - if (!obj) { - errors_found |= ERROR_OBJECT; - return error("%s: object corrupt or missing", oid_to_hex(oid)); - } - obj->flags &= ~(REACHABLE | SEEN); - obj->flags |= HAS_OBJ; - return fsck_obj(obj, buffer, size); -} - static int default_refs; static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid, @@ -662,6 +643,32 @@ static int mark_packed_for_connectivity(const struct object_id *oid, return 0; } +static int verify_pack(struct packed_git *p, + unsigned long count, unsigned long total) +{ + struct child_process index_pack = CHILD_PROCESS_INIT; + + if (is_pack_valid(p) < 0) + return -1; + for_each_object_in_pack(p, mark_packed_for_connectivity, NULL, 0); + + index_pack.git_cmd = 1; + argv_array_pushl(&index_pack.args, +"index-pack", +"--verify-fsck", +NULL); + if (show_progress) + argv_array_pushf(&index_pack.args, +"--fsck-progress=%lu,%lu,Checking pack %s", +count, total, sha1_to_hex(p->sha1)); + argv_array_push(&index_pack.args, p->pack_name); + + if (run_command(&index_pack)) + return -1; + + return 0; +} + static char const * const fsck_usage[] = { N_("git fsck [] [...]"), NULL @@ -737,7 +744,6 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) if (check_full) { struct packed_git *p; uint32_t total = 0, count = 0; - struct progress *progress = NULL; if (show_progress) { for (p = get_packed_git(the_repository); p; @@ -746,18 +752,21 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) continue; total += p->num_objects; } - - progress = start_progress(_("Checking objects"), total); } for (p = get_packed_git(the_repository); p; p = p->next) { /* verify gives error messages itself */ - if (verify_pack(p, fsck_obj_buffer, - progress, count)) + if (verify_pack(p, count, total)) errors_found |= ERROR_PACK; count += p->num_objects; } - stop_progress(&progres
Re: non-smooth progress indication for git fsck and git gc
On Sun, Sep 02, 2018 at 03:46:57AM -0400, Jeff King wrote: > Something like this, which chunks it there, uses a per-packfile meter > (though still does not give any clue how many packfiles there are), and > shows a throughput meter. Actually, in typical cases it would not matter how many packfiles there are, because there's generally one big one, and then none of the small ones would take long enough to trigger the delayed meter. So you'd only see one such meter usually, and it would cover the majority of the time. In theory, anyway. I still think the more interesting long-term thing here is to reuse the pack verification from index-pack, which actually hashes as it does the per-object countup. That code isn't lib-ified enough to be run in process, but I think the patch below should give similar behavior to what fsck currently does. We'd need to tell index-pack to use our fsck.* config for its checks, I imagine. The progress here is still per-pack, but I think we could pass in sufficient information to have it do one continuous meter across all of the packs (see the in-code comment). And it makes the result multi-threaded, and lets us drop a bunch of duplicate code. --- builtin/fsck.c | 53 +++-- pack-check.c | 142 --- pack.h | 1 - 3 files changed, 32 insertions(+), 164 deletions(-) diff --git a/builtin/fsck.c b/builtin/fsck.c index 250f5af118..643e16125b 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -386,25 +386,6 @@ static int fsck_obj(struct object *obj, void *buffer, unsigned long size) return err; } -static int fsck_obj_buffer(const struct object_id *oid, enum object_type type, - unsigned long size, void *buffer, int *eaten) -{ - /* -* Note, buffer may be NULL if type is OBJ_BLOB. See -* verify_packfile(), data_valid variable for details. -*/ - struct object *obj; - obj = parse_object_buffer(the_repository, oid, type, size, buffer, - eaten); - if (!obj) { - errors_found |= ERROR_OBJECT; - return error("%s: object corrupt or missing", oid_to_hex(oid)); - } - obj->flags &= ~(REACHABLE | SEEN); - obj->flags |= HAS_OBJ; - return fsck_obj(obj, buffer, size); -} - static int default_refs; static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid, @@ -662,6 +643,37 @@ static int mark_packed_for_connectivity(const struct object_id *oid, return 0; } +static int verify_pack(struct packed_git *p) +{ + struct child_process index_pack = CHILD_PROCESS_INIT; + + for_each_object_in_pack(p, mark_packed_for_connectivity, NULL, 0); + + /* +* This should probably be replaced with a command-line option +* that teaches "index-pack --verify" to show a more compact and +* fsck-oriented progress (see also the "-v" below). +*/ + if (show_progress) + fprintf(stderr, "Checking packfile %s...\n", + p->pack_name); + + index_pack.git_cmd = 1; + argv_array_pushl(&index_pack.args, +"index-pack", +"--verify", +"--strict", +NULL); + if (show_progress) + argv_array_push(&index_pack.args, "-v"); + argv_array_push(&index_pack.args, p->pack_name); + + if (run_command(&index_pack)) + return -1; + + return 0; +} + static char const * const fsck_usage[] = { N_("git fsck [] [...]"), NULL @@ -752,8 +764,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) for (p = get_packed_git(the_repository); p; p = p->next) { /* verify gives error messages itself */ - if (verify_pack(p, fsck_obj_buffer, - progress, count)) + if (verify_pack(p)) errors_found |= ERROR_PACK; count += p->num_objects; } diff --git a/pack-check.c b/pack-check.c index d3a57df34f..ea1457ce53 100644 --- a/pack-check.c +++ b/pack-check.c @@ -15,17 +15,6 @@ struct idx_entry { unsigned int nr; }; -static int compare_entries(const void *e1, const void *e2) -{ - const struct idx_entry *entry1 = e1; - const struct idx_entry *entry2 = e2; - if (entry1->offset < entry2->offset) - return -1; - if (entry1->offset > entry2->offset) - return 1; - return 0; -} - int check_pack_crc(struct packed_git *p, struct pack_window **w_curs, off_t offset, off_t len, unsigned int nr) { @@ -48,121 +37,6 @@ int check_pack_crc(struct packed_git *p, struct pack_window **w_curs, return
Re: non-smooth progress indication for git fsck and git gc
On Sat, Sep 01, 2018 at 02:53:28PM +0200, Ævar Arnfjörð Bjarmason wrote: > With this we'll get output like: > > $ ~/g/git/git -C ~/g/2015-04-03-1M-git/ --exec-path=$PWD fsck > Checking object directories: 100% (256/256), done. > Hashing: 100% (452634108/452634108), done. > Hashing: 100% (1073741824/1073741824), done. > Hashing: 100% (1073741824/1073741824), done. > Hashing: 100% (1008001572/1008001572), done. > Checking objects: 2% (262144/13064614) > ^C > > All tests pass with this. Isn't it awesome? Except it's of course a > massive hack, we wouldn't want to just hook into SHA1DC like this. I still consider that output so-so; the byte counts are big and there's no indication how many "hashing" lines we're going to see. It's also broken up in a weird way (it's not one per file; it's one per giant chunk we fed to sha1). > The problem comes down to us needing to call git_hash_sha1_update() with > some really huge input, that function is going to take a *long* time, > and the only way we're getting incremental progress is: > > 1) If we ourselves split the input into N chunks > 2) If we hack into the SHA1 library itself > > This patch does #2, but for this to be acceptable we'd need to do > something like #1. I think we could just do the chunking in verify_packfile(), couldn't we? (And the .idx hash, if we really want to cover that case, but IMHO that's way less interesting). Something like this, which chunks it there, uses a per-packfile meter (though still does not give any clue how many packfiles there are), and shows a throughput meter. diff --git a/pack-check.c b/pack-check.c index d3a57df34f..c94223664f 100644 --- a/pack-check.c +++ b/pack-check.c @@ -62,10 +62,25 @@ static int verify_packfile(struct packed_git *p, uint32_t nr_objects, i; int err = 0; struct idx_entry *entries; + struct progress *hashing_progress; + char *title; + off_t total_hashed = 0; if (!is_pack_valid(p)) return error("packfile %s cannot be accessed", p->pack_name); + if (progress) { + /* Probably too long... */ + title = xstrfmt("Hashing %s", p->pack_name); + + /* +* I don't think it actually works to have two progresses going +* at the same time, because when the first one ends, we'll +* cancel the alarm. But hey, this is a hacky proof of concept. +*/ + hashing_progress = start_progress(title, 0); + } + the_hash_algo->init_fn(&ctx); do { unsigned long remaining; @@ -75,9 +90,25 @@ static int verify_packfile(struct packed_git *p, pack_sig_ofs = p->pack_size - the_hash_algo->rawsz; if (offset > pack_sig_ofs) remaining -= (unsigned int)(offset - pack_sig_ofs); - the_hash_algo->update_fn(&ctx, in, remaining); + while (remaining) { + int chunk = remaining < 4096 ? remaining : 4096; + the_hash_algo->update_fn(&ctx, in, chunk); + in += chunk; + remaining -= chunk; + total_hashed += chunk; + /* +* The progress code needs tweaking to show throughputs +* better for open-ended meters. +*/ + display_throughput(hashing_progress, total_hashed); + display_progress(hashing_progress, 0); + } } while (offset < pack_sig_ofs); + the_hash_algo->final_fn(hash, &ctx); + stop_progress(&hashing_progress); + free(title); + pack_sig = use_pack(p, w_curs, pack_sig_ofs, NULL); if (hashcmp(hash, pack_sig)) err = error("%s pack checksum mismatch",
Re: non-smooth progress indication for git fsck and git gc
On Sat, Sep 01 2018, Ævar Arnfjörð Bjarmason wrote: > On Thu, Aug 16 2018, Ævar Arnfjörð Bjarmason wrote: > >> On Thu, Aug 16 2018, Jeff King wrote: >> >>> On Thu, Aug 16, 2018 at 08:54:25AM +0200, Ulrich Windl wrote: >>> I'd like to point out some minor issue observed while processing some 5-object repository with many binary objects, but most are rather small: Between the two phases of "git fsck" (checking directories and checking objects) there was a break of several seconds where no progress was indicated. During "git gc" the writing objects phase did not update for some seconds, but then the percentage counter jumped like from 15% to 42%. I understand that updating the progress output too often can be a performance bottleneck, while upating it too rarely might only bore the user... ;-) >>> >>> We update the counter integer for every object we process, and then >>> actually update the display whenever the percentage increases or a >>> second has elapsed, whichever comes first. >>> >>> What you're seeing is likely an artifact of _what_ we're counting: >>> written objects. Not all objects are the same size, so it's not uncommon >>> to see thousands/sec when dealing with small ones, and then several >>> seconds for one giant blob. >>> >>> The only way to solve that is to count bytes. We don't have a total byte >>> count in most cases, and it wouldn't always make sense (e.g., the >>> "Compressing objects" meter can show the same issue, but it's not really >>> putting through bytes in a linear way). In some cases we do show >>> transmitted size and throughput, but that's just for network operations. >>> We could do the same for "gc" with the patch below. But usually >>> throughput isn't all that interesting for a filesystem write, because >>> bandwidth isn't the bottleneck. >>> >>> Possibly we could have a "half throughput" mode that counts up the total >>> size written, but omits the speed indicator. That's not an unreasonable >>> thing to show for a local pack, since you end up with the final pack >>> size. The object counter would still be jumpy, but you'd at least have >>> one number updated at least once per second as you put through a large >>> blob. >>> >>> If you really want a smooth percentage, then we'd have to start counting >>> bytes instead of objects. Two reasons we don't do that are: >>> >>> - we often don't know the total byte size exactly. E.g., for a >>> packfile write, it depends on the result of deflating each object. >>> You can make an approximation and just pretend at the end that you >>> hit 100%. Or you can count the pre-deflate sizes, but then your >>> meter doesn't match the bytes from the throughput counter. >>> >>> - for something like fsck, we're not actually writing out bytes. So I >>> guess you'd be measuring "here's how many bytes of objects I >>> fsck-ed". But is that on-disk compressed bytes, or decompressed >>> bytes? >>> >>> If the former, that's only marginally better as a measure of effort, >>> since delta compression means that a small number of on-disk bytes >>> may require a big effort (imagine processing a 100 byte blob versus >>> a 100 byte delta off of a 100MB blob). >>> >>> The latter is probably more accurate. But it's also not free to >>> pre-generate the total. We can get the number of objects or the size >>> of the packfile in constant-time, but totaling up the uncompressed >>> size of all objects is O(n). So that's extra computation, but it >>> also means a potential lag before we can start the progress meter. >>> >>> I'm also not sure how meaningful a byte count is for a user there. >>> >>> So there. That's probably more than you wanted to know about Git's >>> progress code. I think it probably _could_ be improved by counting >>> more/different things, but I also think it can be a bit of a rabbit >>> hole. Which is why AFAIK nobody's really looked too seriously into >>> changing it. >>> >>> -Peff >> >> This is all interesting, but I think unrelated to what Ulrich is talking >> about. Quote: >> >> Between the two phases of "git fsck" (checking directories and >> checking objects) there was a break of several seconds where no >> progress was indicated >> >> I.e. it's not about the pause you get with your testcase (which is >> certainly another issue) but the break between the two progress bars. >> >> Here's a test case you can clone: >> https://github.com/avar/2015-04-03-1M-git (or might already have >> "locally" :) >> >> If you fsck this repository it'll take around (on my spinning rust >> server) 30 seconds between 100% of "Checking object directories" before >> you get any output from "Checking objects". >> >> The breakdown of that is (this is from approximate eyeballing): >> >> * We spend 1-3 seconds just on this: >> >> https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21
Re: non-smooth progress indication for git fsck and git gc
On Thu, Aug 16 2018, Ævar Arnfjörð Bjarmason wrote: > On Thu, Aug 16 2018, Jeff King wrote: > >> On Thu, Aug 16, 2018 at 08:54:25AM +0200, Ulrich Windl wrote: >> >>> I'd like to point out some minor issue observed while processing some >>> 5-object repository with many binary objects, but most are rather >>> small: >>> >>> Between the two phases of "git fsck" (checking directories and >>> checking objects) there was a break of several seconds where no >>> progress was indicated. >>> >>> During "git gc" the writing objects phase did not update for some >>> seconds, but then the percentage counter jumped like from 15% to 42%. >>> >>> I understand that updating the progress output too often can be a >>> performance bottleneck, while upating it too rarely might only bore >>> the user... ;-) >> >> We update the counter integer for every object we process, and then >> actually update the display whenever the percentage increases or a >> second has elapsed, whichever comes first. >> >> What you're seeing is likely an artifact of _what_ we're counting: >> written objects. Not all objects are the same size, so it's not uncommon >> to see thousands/sec when dealing with small ones, and then several >> seconds for one giant blob. >> >> The only way to solve that is to count bytes. We don't have a total byte >> count in most cases, and it wouldn't always make sense (e.g., the >> "Compressing objects" meter can show the same issue, but it's not really >> putting through bytes in a linear way). In some cases we do show >> transmitted size and throughput, but that's just for network operations. >> We could do the same for "gc" with the patch below. But usually >> throughput isn't all that interesting for a filesystem write, because >> bandwidth isn't the bottleneck. >> >> Possibly we could have a "half throughput" mode that counts up the total >> size written, but omits the speed indicator. That's not an unreasonable >> thing to show for a local pack, since you end up with the final pack >> size. The object counter would still be jumpy, but you'd at least have >> one number updated at least once per second as you put through a large >> blob. >> >> If you really want a smooth percentage, then we'd have to start counting >> bytes instead of objects. Two reasons we don't do that are: >> >> - we often don't know the total byte size exactly. E.g., for a >> packfile write, it depends on the result of deflating each object. >> You can make an approximation and just pretend at the end that you >> hit 100%. Or you can count the pre-deflate sizes, but then your >> meter doesn't match the bytes from the throughput counter. >> >> - for something like fsck, we're not actually writing out bytes. So I >> guess you'd be measuring "here's how many bytes of objects I >> fsck-ed". But is that on-disk compressed bytes, or decompressed >> bytes? >> >> If the former, that's only marginally better as a measure of effort, >> since delta compression means that a small number of on-disk bytes >> may require a big effort (imagine processing a 100 byte blob versus >> a 100 byte delta off of a 100MB blob). >> >> The latter is probably more accurate. But it's also not free to >> pre-generate the total. We can get the number of objects or the size >> of the packfile in constant-time, but totaling up the uncompressed >> size of all objects is O(n). So that's extra computation, but it >> also means a potential lag before we can start the progress meter. >> >> I'm also not sure how meaningful a byte count is for a user there. >> >> So there. That's probably more than you wanted to know about Git's >> progress code. I think it probably _could_ be improved by counting >> more/different things, but I also think it can be a bit of a rabbit >> hole. Which is why AFAIK nobody's really looked too seriously into >> changing it. >> >> -Peff > > This is all interesting, but I think unrelated to what Ulrich is talking > about. Quote: > > Between the two phases of "git fsck" (checking directories and > checking objects) there was a break of several seconds where no > progress was indicated > > I.e. it's not about the pause you get with your testcase (which is > certainly another issue) but the break between the two progress bars. > > Here's a test case you can clone: > https://github.com/avar/2015-04-03-1M-git (or might already have > "locally" :) > > If you fsck this repository it'll take around (on my spinning rust > server) 30 seconds between 100% of "Checking object directories" before > you get any output from "Checking objects". > > The breakdown of that is (this is from approximate eyeballing): > > * We spend 1-3 seconds just on this: > > https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L181 > > * We spend the majority of the ~30s on this: > > https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-
Re: Antw: Re: non-smooth progress indication for git fsck and git gc
On Tue, Aug 21, 2018 at 3:13 AM Jeff King wrote: > I _think_ they should work together OK without further modification. > Once upon a time the caller had to say "don't show if we're past N% > after M seconds", but I think with the current code we'd just show it if > we're not completely finished after 2 seconds. > > So it really should just be a simple: > > progress = start_delayed_progress("Hashing packfile", 0); > > That said, counting bytes would probably be ugly (just because the > numbers get really big). We'd probably prefer to show a throughput or > something. Or just an ascii throbber. I think the important thing is communicate "I'm still doing something, not hanging up". "Hashing packfile" satisfies the "something" part, the throbber the "hanging". -- Duy
Re: Antw: Re: non-smooth progress indication for git fsck and git gc
>>> Jeff King schrieb am 21.08.2018 um 03:07 in Nachricht <20180821010712.ga32...@sigill.intra.peff.net>: > On Mon, Aug 20, 2018 at 10:57:13AM +0200, Ævar Arnfjörð Bjarmason wrote: > [...] > So it really should just be a simple: > > progress = start_delayed_progress("Hashing packfile", 0); > > That said, counting bytes would probably be ugly (just because the > numbers get really big). We'd probably prefer to show a throughput or Hi! With big numbers you could apply scaling (the usual ISO suffixes), but after having read GBs, there wouldn't be much change visible unless you add significant fractional digits. So scaling seems good for absolute numbers, but not for growing numbers. Of course one could recursively scale the fractional parts again, but then the result will be lengthy again, then. Regards, Ulrich > something. And as you noted, there's some refactoring to be done with > pack-check for it to show multiple progress meters. > > (I still think in the long run we would want to scrap that code, but > that's a much bigger job; I'm fine if somebody wants to do incremental > improvements in the meantime). > > -Peff
Re: Antw: Re: non-smooth progress indication for git fsck and git gc
On Mon, Aug 20, 2018 at 10:57:13AM +0200, Ævar Arnfjörð Bjarmason wrote: > > That seems to apply. BTW: Is there a way go get some repository statistics > > like a histogram of object sizes (or whatever that might be useful to help > > making decisions)? > > The git-sizer program is really helpful in this regard: > https://github.com/github/git-sizer Yeah, I'd very much agree that's the best tool for a general overview of the repository stats. Ulrich, if you want to more analysis (like a histogram), probably something like: git cat-file --batch-all-objects --batch-check='%(objectsize)' might be a good starting place to dump information (see the "BATCH OUTPUT" section of "git help cat-file" for more format items). > > If it's sorting, maybe add some code like (wild guess): > > > > if (objects_to_sort > magic_number) > >message("Sorting something..."); > > I think a good solution to these cases is to just introduce something to > the progress.c mode where it learns how to display a counter where we > don't know what the end-state will be. Something like your proposed > magic_number can just be covered under the more general case where we > don't show the progress bar unless it's been 1 second (which I believe > is the default). Yeah. We already have open-ended counters (e.g., "counting objects"), and delayed meters (we actually normalized the default to 2s recently). I _think_ they should work together OK without further modification. Once upon a time the caller had to say "don't show if we're past N% after M seconds", but I think with the current code we'd just show it if we're not completely finished after 2 seconds. So it really should just be a simple: progress = start_delayed_progress("Hashing packfile", 0); That said, counting bytes would probably be ugly (just because the numbers get really big). We'd probably prefer to show a throughput or something. And as you noted, there's some refactoring to be done with pack-check for it to show multiple progress meters. (I still think in the long run we would want to scrap that code, but that's a much bigger job; I'm fine if somebody wants to do incremental improvements in the meantime). -Peff
Re: Antw: Re: non-smooth progress indication for git fsck and git gc
Hi! Here are some stats from the repository. First the fast import ones (which had good performance, but probably all cached, also): % git fast-import <../git-stream /usr/lib/git/git-fast-import statistics: - Alloc'd objects: 55000 Total objects:51959 (56 duplicates ) blobs :47991 ( 0 duplicates 0 deltas of 0 attempts) trees : 3946 (56 duplicates994 deltas of 3929 attempts) commits: 22 ( 0 duplicates 0 deltas of 0 attempts) tags :0 ( 0 duplicates 0 deltas of 0 attempts) Total branches: 15 (15 loads ) marks:1048576 ( 48013 unique) atoms: 43335 Memory total: 9611 KiB pools: 7033 KiB objects: 2578 KiB - pack_report: getpagesize()= 4096 pack_report: core.packedGitWindowSize = 1073741824 pack_report: core.packedGitLimit = 8589934592 pack_report: pack_used_ctr= 1780 pack_report: pack_mmap_calls = 23 pack_report: pack_open_windows= 1 / 1 pack_report: pack_mapped =2905751 /2905751 - Then the output from git-sizer: Processing blobs: 47991 Processing trees: 3946 Processing commits: 22 Matching commits to trees: 22 Processing annotated tags: 0 Processing references: 15 | Name | Value | Level of concern | | | - | -- | | Overall repository size | || | * Blobs | || | * Total size | 13.7 GiB | * | | | || | Biggest objects | || | * Trees | || | * Maximum entries [1] | 13.4 k | * | | * Blobs | || | * Maximum size [2] | 279 MiB | * | | | || | Biggest checkouts| || | * Maximum path depth [3] |10 | * | | * Maximum path length[3] | 130 B | * | | * Total size of files[3] | 8.63 GiB | * | [1] b701345cbd4317276568b9d9890fd77f38933bdc (refs/heads/master:Resources/default/CIFP) [2] 19f54c4a7595667329c1be23200234f0fe50ac56 (refs/heads/master:Resources/default/apt.dat) [3] b0e3d3a2b7f2504117408f13486c905a8eb8fb1e (refs/heads/master^{tree}) Some notes: [1] is a directory with many short (typically < 1kB) text files [2] is a very large text file with many changes [3] Yes, some paths are long Regards, Ulrich >>> Ævar Arnfjörð Bjarmason schrieb am 20.08.2018 um 10:57 in Nachricht <87woslpg9i@evledraar.gmail.com>: > On Mon, Aug 20 2018, Ulrich Windl wrote: > > Jeff King schrieb am 16.08.2018 um 22:55 in Nachricht >> <20180816205556.ga8...@sigill.intra.peff.net>: >>> On Thu, Aug 16, 2018 at 10:35:53PM +0200, Ævar Arnfjörð Bjarmason wrote: >>> This is all interesting, but I think unrelated to what Ulrich is talking about. Quote: Between the two phases of "git fsck" (checking directories and checking objects) there was a break of several seconds where no progress was indicated I.e. it's not about the pause you get with your testcase (which is certainly another issue) but the break between the two progress bars. >>> >>> I think he's talking about both. What I said responds to this: >> >> Hi guys! >> >> Yes, I was wondering what git does between the two visible phases, and > between >> the lines I was suggesting another progress message between those phases. At >> least the maximum unspecific three-dot-message "Thinking..." could be > displayed >> ;-) Of course anything more appropriate would be welcome. >> Also that message should only be displayed if it's foreseeable that the >> operation will take significant time. In my case (I just repeated it a few >> minutes ago) the delay is significant (at least 10 seconds). As noted > earlier I >> was hoping to capture the timing in a screencast, but it seems all
Re: Antw: Re: non-smooth progress indication for git fsck and git gc
On Mon, Aug 20 2018, Ulrich Windl wrote: Jeff King schrieb am 16.08.2018 um 22:55 in Nachricht > <20180816205556.ga8...@sigill.intra.peff.net>: >> On Thu, Aug 16, 2018 at 10:35:53PM +0200, Ævar Arnfjörð Bjarmason wrote: >> >>> This is all interesting, but I think unrelated to what Ulrich is talking >>> about. Quote: >>> >>> Between the two phases of "git fsck" (checking directories and >>> checking objects) there was a break of several seconds where no >>> progress was indicated >>> >>> I.e. it's not about the pause you get with your testcase (which is >>> certainly another issue) but the break between the two progress bars. >> >> I think he's talking about both. What I said responds to this: > > Hi guys! > > Yes, I was wondering what git does between the two visible phases, and between > the lines I was suggesting another progress message between those phases. At > least the maximum unspecific three-dot-message "Thinking..." could be > displayed > ;-) Of course anything more appropriate would be welcome. > Also that message should only be displayed if it's foreseeable that the > operation will take significant time. In my case (I just repeated it a few > minutes ago) the delay is significant (at least 10 seconds). As noted earlier > I > was hoping to capture the timing in a screencast, but it seems all the delays > were just optimized away in the recording. > >> >>> >> During "git gc" the writing objects phase did not update for some >>> >> seconds, but then the percentage counter jumped like from 15% to 42%. >> >> But yeah, I missed that the fsck thing was specifically about a break >> between two meters. That's a separate problem, but also worth >> discussing (and hopefully much easier to address). >> >>> If you fsck this repository it'll take around (on my spinning rust >>> server) 30 seconds between 100% of "Checking object directories" before >>> you get any output from "Checking objects". >>> >>> The breakdown of that is (this is from approximate eyeballing): >>> >>> * We spend 1-3 seconds just on this: >>> >> > https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack > >> -check.c#L181 >> >> OK, so that's checking the sha1 over the .idx file. We could put a meter >> on that. I wouldn't expect it to generally be all that slow outside of >> pathological cases, since it scales with the number of objects (and 1s >> is our minimum update anyway, so that might be OK as-is). Your case has >> 13M objects, which is quite large. > > Sometimes an oldish CPU could bring performance surprises, maybe. Anyway my > CPU is question is an AMD Phenom2 quad-core with 3.2GHz nominal, and there is > a > classic spinning disk with 5400RPM built in... > >> >>> * We spend the majority of the ~30s on this: >>> >> > https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack > >> -check.c#L70-L79 >> >> This is hashing the actual packfile. This is potentially quite long, >> especially if you have a ton of big objects. > > That seems to apply. BTW: Is there a way go get some repository statistics > like a histogram of object sizes (or whatever that might be useful to help > making decisions)? The git-sizer program is really helpful in this regard: https://github.com/github/git-sizer >> >> I wonder if we need to do this as a separate step anyway, though. Our >> verification is based on index-pack these days, which means it's going >> to walk over the whole content as part of the "Indexing objects" step to >> expand base objects and mark deltas for later. Could we feed this hash >> as part of that walk over the data? It's not going to save us 30s, but >> it's likely to be more efficient. And it would fold the effort naturally >> into the existing progress meter. >> >>> * Wes spend another 3-5 seconds on this QSORT: >>> >> > https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack > >> -check.c#L105 >> >> That's a tough one. I'm not sure how we'd count it (how many compares we >> do?). And each item is doing so little work that hitting the progress >> code may make things noticeably slower. > > If it's sorting, maybe add some code like (wild guess): > > if (objects_to_sort > magic_number) >message("Sorting something..."); I think a good solution to these cases is to just introduce something to the progress.c mode where it learns how to display a counter where we don't know what the end-state will be. Something like your proposed magic_number can just be covered under the more general case where we don't show the progress bar unless it's been 1 second (which I believe is the default). >> >> Again, your case is pretty big. Just based on the number of objects, >> linux.git should be 1.5-2.5 seconds on your machine for the same >> operation. Which I think may be small enough to ignore (or even just >> print a generic before/after). It's really the 30s packfile hash that's >> making the whole thing so terrible. >> >> -Peff
Antw: Re: non-smooth progress indication for git fsck and git gc
>>> Jeff King schrieb am 16.08.2018 um 22:55 in Nachricht <20180816205556.ga8...@sigill.intra.peff.net>: > On Thu, Aug 16, 2018 at 10:35:53PM +0200, Ævar Arnfjörð Bjarmason wrote: > >> This is all interesting, but I think unrelated to what Ulrich is talking >> about. Quote: >> >> Between the two phases of "git fsck" (checking directories and >> checking objects) there was a break of several seconds where no >> progress was indicated >> >> I.e. it's not about the pause you get with your testcase (which is >> certainly another issue) but the break between the two progress bars. > > I think he's talking about both. What I said responds to this: Hi guys! Yes, I was wondering what git does between the two visible phases, and between the lines I was suggesting another progress message between those phases. At least the maximum unspecific three-dot-message "Thinking..." could be displayed ;-) Of course anything more appropriate would be welcome. Also that message should only be displayed if it's foreseeable that the operation will take significant time. In my case (I just repeated it a few minutes ago) the delay is significant (at least 10 seconds). As noted earlier I was hoping to capture the timing in a screencast, but it seems all the delays were just optimized away in the recording. > >> >> During "git gc" the writing objects phase did not update for some >> >> seconds, but then the percentage counter jumped like from 15% to 42%. > > But yeah, I missed that the fsck thing was specifically about a break > between two meters. That's a separate problem, but also worth > discussing (and hopefully much easier to address). > >> If you fsck this repository it'll take around (on my spinning rust >> server) 30 seconds between 100% of "Checking object directories" before >> you get any output from "Checking objects". >> >> The breakdown of that is (this is from approximate eyeballing): >> >> * We spend 1-3 seconds just on this: >> > https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack > -check.c#L181 > > OK, so that's checking the sha1 over the .idx file. We could put a meter > on that. I wouldn't expect it to generally be all that slow outside of > pathological cases, since it scales with the number of objects (and 1s > is our minimum update anyway, so that might be OK as-is). Your case has > 13M objects, which is quite large. Sometimes an oldish CPU could bring performance surprises, maybe. Anyway my CPU is question is an AMD Phenom2 quad-core with 3.2GHz nominal, and there is a classic spinning disk with 5400RPM built in... > >> * We spend the majority of the ~30s on this: >> > https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack > -check.c#L70-L79 > > This is hashing the actual packfile. This is potentially quite long, > especially if you have a ton of big objects. That seems to apply. BTW: Is there a way go get some repository statistics like a histogram of object sizes (or whatever that might be useful to help making decisions)? > > I wonder if we need to do this as a separate step anyway, though. Our > verification is based on index-pack these days, which means it's going > to walk over the whole content as part of the "Indexing objects" step to > expand base objects and mark deltas for later. Could we feed this hash > as part of that walk over the data? It's not going to save us 30s, but > it's likely to be more efficient. And it would fold the effort naturally > into the existing progress meter. > >> * Wes spend another 3-5 seconds on this QSORT: >> > https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack > -check.c#L105 > > That's a tough one. I'm not sure how we'd count it (how many compares we > do?). And each item is doing so little work that hitting the progress > code may make things noticeably slower. If it's sorting, maybe add some code like (wild guess): if (objects_to_sort > magic_number) message("Sorting something..."); > > Again, your case is pretty big. Just based on the number of objects, > linux.git should be 1.5-2.5 seconds on your machine for the same > operation. Which I think may be small enough to ignore (or even just > print a generic before/after). It's really the 30s packfile hash that's > making the whole thing so terrible. > > -Peff
Antw: Re: non-smooth progress indication for git fsck and git gc
>>> Duy Nguyen schrieb am 16.08.2018 um 17:18 in Nachricht : > On Thu, Aug 16, 2018 at 1:10 PM Ulrich Windl > wrote: >> >> Hi! >> >> I'd like to point out some minor issue observed while processing some > 5-object repository with many binary objects, but most are rather small: >> >> Between the two phases of "git fsck" (checking directories and checking > objects) there was a break of several seconds where no progress was > indicated. >> >> During "git gc" the writing objects phase did not update for some seconds, > but then the percentage counter jumped like from 15% to 42%. >> >> I understand that updating the progress output too often can be a > performance bottleneck, while upating it too rarely might only bore the > user... ;-) >> >> But maybe something can be done. My git version is 2.13.7 (openSUSE 42.3). > > Is it possible to make this repository public? You can also use "git > fast-export --anonymize" to make a repo with same structure but no > real content (but read the man page about that option first) Hi! Actually I tried that locally, but with the resulting repository both, fsck and gc are very fast. So I guess it won't be very useful. Also the original .git directory uses 5.3G, while the anonymous .git just used 4.3M... I tried to capture the behavior as screencast, but it seems the screencast optimized the little cahnges away, and in the result git almost had no delay on any operation 8-( Regards, Ulrich > >> Regards, >> Ulrich >> >> > > > -- > Duy
Re: non-smooth progress indication for git fsck and git gc
On Thu, Aug 16, 2018 at 11:08 PM Jeff King wrote: > > On Thu, Aug 16, 2018 at 04:55:56PM -0400, Jeff King wrote: > > > > * We spend the majority of the ~30s on this: > > > > > > https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L70-L79 > > > > This is hashing the actual packfile. This is potentially quite long, > > especially if you have a ton of big objects. > > > > I wonder if we need to do this as a separate step anyway, though. Our > > verification is based on index-pack these days, which means it's going > > to walk over the whole content as part of the "Indexing objects" step to > > expand base objects and mark deltas for later. Could we feed this hash > > as part of that walk over the data? It's not going to save us 30s, but > > it's likely to be more efficient. And it would fold the effort naturally > > into the existing progress meter. > > Actually, I take it back. That's the nice, modern way we do it in > git-verify-pack. But git-fsck uses the ancient "just walk over all of > the idx entries method". It at least sorts in pack order, which is good, > but: > > - it's not multi-threaded, like index-pack/verify-pack > > - the index-pack way is actually more efficient than pack-ordering for > the delta-base cache, because it actually walks the delta-graph in > the optimal order > I actually tried to make git-fsck use index-pack --verify at one point. The only thing that stopped it from working was index-pack automatically wrote the newer index version if I remember correctly, and that would fail the final hash check. fsck performance was not a big deal so I dropped it. Just saying it should be possible, if someone's interested in that direction. -- Duy
Re: non-smooth progress indication for git fsck and git gc
Jeff King writes: > On Thu, Aug 16, 2018 at 11:57:14AM -0400, Jeff King wrote: > >> The only way to solve that is to count bytes. We don't have a total byte >> count in most cases, and it wouldn't always make sense (e.g., the >> "Compressing objects" meter can show the same issue, but it's not really >> putting through bytes in a linear way). In some cases we do show >> transmitted size and throughput, but that's just for network operations. >> We could do the same for "gc" with the patch below. But usually >> throughput isn't all that interesting for a filesystem write, because >> bandwidth isn't the bottleneck. > > Just realized I forgot to include the patch. Here it is, for reference. I've been wondering when you'd realize the omission ;-) > Doing something similar for fsck would be quite a bit more invasive. Yeah, on that codepath there is no streaming write passing through a single chokepoint you can count bytes X-<. > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c > index 80c880e9ad..e1130b959d 100644 > --- a/builtin/pack-objects.c > +++ b/builtin/pack-objects.c > @@ -837,7 +837,7 @@ static void write_pack_file(void) > if (pack_to_stdout) > f = hashfd_throughput(1, "", progress_state); > else > - f = create_tmp_packfile(&pack_tmp_name); > + f = create_tmp_packfile(&pack_tmp_name, progress_state); > > offset = write_pack_header(f, nr_remaining); > > diff --git a/bulk-checkin.c b/bulk-checkin.c > index 9f3b644811..0df45b8f55 100644 > --- a/bulk-checkin.c > +++ b/bulk-checkin.c > @@ -178,7 +178,7 @@ static void prepare_to_stream(struct bulk_checkin_state > *state, > if (!(flags & HASH_WRITE_OBJECT) || state->f) > return; > > - state->f = create_tmp_packfile(&state->pack_tmp_name); > + state->f = create_tmp_packfile(&state->pack_tmp_name, NULL); > reset_pack_idx_option(&state->pack_idx_opts); > > /* Pretend we are going to write only one object */ > diff --git a/pack-write.c b/pack-write.c > index a9d46bc03f..b72480b440 100644 > --- a/pack-write.c > +++ b/pack-write.c > @@ -334,14 +334,15 @@ int encode_in_pack_object_header(unsigned char *hdr, > int hdr_len, > return n; > } > > -struct hashfile *create_tmp_packfile(char **pack_tmp_name) > +struct hashfile *create_tmp_packfile(char **pack_tmp_name, > + struct progress *progress) > { > struct strbuf tmpname = STRBUF_INIT; > int fd; > > fd = odb_mkstemp(&tmpname, "pack/tmp_pack_XX"); > *pack_tmp_name = strbuf_detach(&tmpname, NULL); > - return hashfd(fd, *pack_tmp_name); > + return hashfd_throughput(fd, *pack_tmp_name, progress); > } > > void finish_tmp_packfile(struct strbuf *name_buffer, > diff --git a/pack.h b/pack.h > index 34a9d458b4..c87628b093 100644 > --- a/pack.h > +++ b/pack.h > @@ -98,7 +98,8 @@ extern int encode_in_pack_object_header(unsigned char *hdr, > int hdr_len, > #define PH_ERROR_PROTOCOL(-3) > extern int read_pack_header(int fd, struct pack_header *); > > -extern struct hashfile *create_tmp_packfile(char **pack_tmp_name); > +extern struct hashfile *create_tmp_packfile(char **pack_tmp_name, > + struct progress *progress); > extern void finish_tmp_packfile(struct strbuf *name_buffer, const char > *pack_tmp_name, struct pack_idx_entry **written_list, uint32_t nr_written, > struct pack_idx_option *pack_idx_opts, unsigned char sha1[]); > > #endif
Re: non-smooth progress indication for git fsck and git gc
On Thu, Aug 16, 2018 at 04:55:56PM -0400, Jeff King wrote: > > * We spend the majority of the ~30s on this: > > > > https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L70-L79 > > This is hashing the actual packfile. This is potentially quite long, > especially if you have a ton of big objects. > > I wonder if we need to do this as a separate step anyway, though. Our > verification is based on index-pack these days, which means it's going > to walk over the whole content as part of the "Indexing objects" step to > expand base objects and mark deltas for later. Could we feed this hash > as part of that walk over the data? It's not going to save us 30s, but > it's likely to be more efficient. And it would fold the effort naturally > into the existing progress meter. Actually, I take it back. That's the nice, modern way we do it in git-verify-pack. But git-fsck uses the ancient "just walk over all of the idx entries method". It at least sorts in pack order, which is good, but: - it's not multi-threaded, like index-pack/verify-pack - the index-pack way is actually more efficient than pack-ordering for the delta-base cache, because it actually walks the delta-graph in the optimal order Once upon a time verify-pack used this same pack-check code, and we switched in 3de89c9d42 (verify-pack: use index-pack --verify, 2011-06-03). So I suspect the right thing to do here is for fsck to switch to that, too, and then delete pack-check.c entirely. That may well involve us switching the progress to a per-pack view (e.g., "checking pack 1/10 (10%)", followed by its own progress meter). But I don't think that would be a bad thing. It's a more realistic view of the work we're actually doing. Although perhaps it's misleading about the total work remaining, because not all packs are the same size (so you see that you're halfway through pack 2/10, but that's quite likely to be half of the total work if it's the one gigantic pack). -Peff
Re: non-smooth progress indication for git fsck and git gc
On Thu, Aug 16, 2018 at 10:35:53PM +0200, Ævar Arnfjörð Bjarmason wrote: > This is all interesting, but I think unrelated to what Ulrich is talking > about. Quote: > > Between the two phases of "git fsck" (checking directories and > checking objects) there was a break of several seconds where no > progress was indicated > > I.e. it's not about the pause you get with your testcase (which is > certainly another issue) but the break between the two progress bars. I think he's talking about both. What I said responds to this: > >> During "git gc" the writing objects phase did not update for some > >> seconds, but then the percentage counter jumped like from 15% to 42%. But yeah, I missed that the fsck thing was specifically about a break between two meters. That's a separate problem, but also worth discussing (and hopefully much easier to address). > If you fsck this repository it'll take around (on my spinning rust > server) 30 seconds between 100% of "Checking object directories" before > you get any output from "Checking objects". > > The breakdown of that is (this is from approximate eyeballing): > > * We spend 1-3 seconds just on this: > > https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L181 OK, so that's checking the sha1 over the .idx file. We could put a meter on that. I wouldn't expect it to generally be all that slow outside of pathological cases, since it scales with the number of objects (and 1s is our minimum update anyway, so that might be OK as-is). Your case has 13M objects, which is quite large. > * We spend the majority of the ~30s on this: > > https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L70-L79 This is hashing the actual packfile. This is potentially quite long, especially if you have a ton of big objects. I wonder if we need to do this as a separate step anyway, though. Our verification is based on index-pack these days, which means it's going to walk over the whole content as part of the "Indexing objects" step to expand base objects and mark deltas for later. Could we feed this hash as part of that walk over the data? It's not going to save us 30s, but it's likely to be more efficient. And it would fold the effort naturally into the existing progress meter. > * Wes spend another 3-5 seconds on this QSORT: > > https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L105 That's a tough one. I'm not sure how we'd count it (how many compares we do?). And each item is doing so little work that hitting the progress code may make things noticeably slower. Again, your case is pretty big. Just based on the number of objects, linux.git should be 1.5-2.5 seconds on your machine for the same operation. Which I think may be small enough to ignore (or even just print a generic before/after). It's really the 30s packfile hash that's making the whole thing so terrible. -Peff
Re: non-smooth progress indication for git fsck and git gc
On Thu, Aug 16 2018, Jeff King wrote: > On Thu, Aug 16, 2018 at 08:54:25AM +0200, Ulrich Windl wrote: > >> I'd like to point out some minor issue observed while processing some >> 5-object repository with many binary objects, but most are rather >> small: >> >> Between the two phases of "git fsck" (checking directories and >> checking objects) there was a break of several seconds where no >> progress was indicated. >> >> During "git gc" the writing objects phase did not update for some >> seconds, but then the percentage counter jumped like from 15% to 42%. >> >> I understand that updating the progress output too often can be a >> performance bottleneck, while upating it too rarely might only bore >> the user... ;-) > > We update the counter integer for every object we process, and then > actually update the display whenever the percentage increases or a > second has elapsed, whichever comes first. > > What you're seeing is likely an artifact of _what_ we're counting: > written objects. Not all objects are the same size, so it's not uncommon > to see thousands/sec when dealing with small ones, and then several > seconds for one giant blob. > > The only way to solve that is to count bytes. We don't have a total byte > count in most cases, and it wouldn't always make sense (e.g., the > "Compressing objects" meter can show the same issue, but it's not really > putting through bytes in a linear way). In some cases we do show > transmitted size and throughput, but that's just for network operations. > We could do the same for "gc" with the patch below. But usually > throughput isn't all that interesting for a filesystem write, because > bandwidth isn't the bottleneck. > > Possibly we could have a "half throughput" mode that counts up the total > size written, but omits the speed indicator. That's not an unreasonable > thing to show for a local pack, since you end up with the final pack > size. The object counter would still be jumpy, but you'd at least have > one number updated at least once per second as you put through a large > blob. > > If you really want a smooth percentage, then we'd have to start counting > bytes instead of objects. Two reasons we don't do that are: > > - we often don't know the total byte size exactly. E.g., for a > packfile write, it depends on the result of deflating each object. > You can make an approximation and just pretend at the end that you > hit 100%. Or you can count the pre-deflate sizes, but then your > meter doesn't match the bytes from the throughput counter. > > - for something like fsck, we're not actually writing out bytes. So I > guess you'd be measuring "here's how many bytes of objects I > fsck-ed". But is that on-disk compressed bytes, or decompressed > bytes? > > If the former, that's only marginally better as a measure of effort, > since delta compression means that a small number of on-disk bytes > may require a big effort (imagine processing a 100 byte blob versus > a 100 byte delta off of a 100MB blob). > > The latter is probably more accurate. But it's also not free to > pre-generate the total. We can get the number of objects or the size > of the packfile in constant-time, but totaling up the uncompressed > size of all objects is O(n). So that's extra computation, but it > also means a potential lag before we can start the progress meter. > > I'm also not sure how meaningful a byte count is for a user there. > > So there. That's probably more than you wanted to know about Git's > progress code. I think it probably _could_ be improved by counting > more/different things, but I also think it can be a bit of a rabbit > hole. Which is why AFAIK nobody's really looked too seriously into > changing it. > > -Peff This is all interesting, but I think unrelated to what Ulrich is talking about. Quote: Between the two phases of "git fsck" (checking directories and checking objects) there was a break of several seconds where no progress was indicated I.e. it's not about the pause you get with your testcase (which is certainly another issue) but the break between the two progress bars. Here's a test case you can clone: https://github.com/avar/2015-04-03-1M-git (or might already have "locally" :) If you fsck this repository it'll take around (on my spinning rust server) 30 seconds between 100% of "Checking object directories" before you get any output from "Checking objects". The breakdown of that is (this is from approximate eyeballing): * We spend 1-3 seconds just on this: https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L181 * We spend the majority of the ~30s on this: https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L70-L79 * Wes spend another 3-5 seconds on this QSORT: https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L105 I.e. it's not about
Re: non-smooth progress indication for git fsck and git gc
On Thu, Aug 16, 2018 at 11:57:14AM -0400, Jeff King wrote: > The only way to solve that is to count bytes. We don't have a total byte > count in most cases, and it wouldn't always make sense (e.g., the > "Compressing objects" meter can show the same issue, but it's not really > putting through bytes in a linear way). In some cases we do show > transmitted size and throughput, but that's just for network operations. > We could do the same for "gc" with the patch below. But usually > throughput isn't all that interesting for a filesystem write, because > bandwidth isn't the bottleneck. Just realized I forgot to include the patch. Here it is, for reference. Doing something similar for fsck would be quite a bit more invasive. diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 80c880e9ad..e1130b959d 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -837,7 +837,7 @@ static void write_pack_file(void) if (pack_to_stdout) f = hashfd_throughput(1, "", progress_state); else - f = create_tmp_packfile(&pack_tmp_name); + f = create_tmp_packfile(&pack_tmp_name, progress_state); offset = write_pack_header(f, nr_remaining); diff --git a/bulk-checkin.c b/bulk-checkin.c index 9f3b644811..0df45b8f55 100644 --- a/bulk-checkin.c +++ b/bulk-checkin.c @@ -178,7 +178,7 @@ static void prepare_to_stream(struct bulk_checkin_state *state, if (!(flags & HASH_WRITE_OBJECT) || state->f) return; - state->f = create_tmp_packfile(&state->pack_tmp_name); + state->f = create_tmp_packfile(&state->pack_tmp_name, NULL); reset_pack_idx_option(&state->pack_idx_opts); /* Pretend we are going to write only one object */ diff --git a/pack-write.c b/pack-write.c index a9d46bc03f..b72480b440 100644 --- a/pack-write.c +++ b/pack-write.c @@ -334,14 +334,15 @@ int encode_in_pack_object_header(unsigned char *hdr, int hdr_len, return n; } -struct hashfile *create_tmp_packfile(char **pack_tmp_name) +struct hashfile *create_tmp_packfile(char **pack_tmp_name, +struct progress *progress) { struct strbuf tmpname = STRBUF_INIT; int fd; fd = odb_mkstemp(&tmpname, "pack/tmp_pack_XX"); *pack_tmp_name = strbuf_detach(&tmpname, NULL); - return hashfd(fd, *pack_tmp_name); + return hashfd_throughput(fd, *pack_tmp_name, progress); } void finish_tmp_packfile(struct strbuf *name_buffer, diff --git a/pack.h b/pack.h index 34a9d458b4..c87628b093 100644 --- a/pack.h +++ b/pack.h @@ -98,7 +98,8 @@ extern int encode_in_pack_object_header(unsigned char *hdr, int hdr_len, #define PH_ERROR_PROTOCOL (-3) extern int read_pack_header(int fd, struct pack_header *); -extern struct hashfile *create_tmp_packfile(char **pack_tmp_name); +extern struct hashfile *create_tmp_packfile(char **pack_tmp_name, + struct progress *progress); extern void finish_tmp_packfile(struct strbuf *name_buffer, const char *pack_tmp_name, struct pack_idx_entry **written_list, uint32_t nr_written, struct pack_idx_option *pack_idx_opts, unsigned char sha1[]); #endif
Re: non-smooth progress indication for git fsck and git gc
On Thu, Aug 16, 2018 at 05:18:51PM +0200, Duy Nguyen wrote: > > During "git gc" the writing objects phase did not update for some > > seconds, but then the percentage counter jumped like from 15% to > > 42%. > [...] > > Is it possible to make this repository public? You can also use "git > fast-export --anonymize" to make a repo with same structure but no > real content (but read the man page about that option first) Try this: -- >8 -- git init repo cd repo # one big blob dd if=/dev/urandom of=big bs=1M count=100 git add big git commit -m big # several small blobs for i in $(seq 10); do echo $i >file git add file git commit -m $i done git gc -- >8 -- It "stalls" at 33%, and then jumps straight to 100%. -Peff
Re: non-smooth progress indication for git fsck and git gc
On Thu, Aug 16, 2018 at 08:54:25AM +0200, Ulrich Windl wrote: > I'd like to point out some minor issue observed while processing some > 5-object repository with many binary objects, but most are rather > small: > > Between the two phases of "git fsck" (checking directories and > checking objects) there was a break of several seconds where no > progress was indicated. > > During "git gc" the writing objects phase did not update for some > seconds, but then the percentage counter jumped like from 15% to 42%. > > I understand that updating the progress output too often can be a > performance bottleneck, while upating it too rarely might only bore > the user... ;-) We update the counter integer for every object we process, and then actually update the display whenever the percentage increases or a second has elapsed, whichever comes first. What you're seeing is likely an artifact of _what_ we're counting: written objects. Not all objects are the same size, so it's not uncommon to see thousands/sec when dealing with small ones, and then several seconds for one giant blob. The only way to solve that is to count bytes. We don't have a total byte count in most cases, and it wouldn't always make sense (e.g., the "Compressing objects" meter can show the same issue, but it's not really putting through bytes in a linear way). In some cases we do show transmitted size and throughput, but that's just for network operations. We could do the same for "gc" with the patch below. But usually throughput isn't all that interesting for a filesystem write, because bandwidth isn't the bottleneck. Possibly we could have a "half throughput" mode that counts up the total size written, but omits the speed indicator. That's not an unreasonable thing to show for a local pack, since you end up with the final pack size. The object counter would still be jumpy, but you'd at least have one number updated at least once per second as you put through a large blob. If you really want a smooth percentage, then we'd have to start counting bytes instead of objects. Two reasons we don't do that are: - we often don't know the total byte size exactly. E.g., for a packfile write, it depends on the result of deflating each object. You can make an approximation and just pretend at the end that you hit 100%. Or you can count the pre-deflate sizes, but then your meter doesn't match the bytes from the throughput counter. - for something like fsck, we're not actually writing out bytes. So I guess you'd be measuring "here's how many bytes of objects I fsck-ed". But is that on-disk compressed bytes, or decompressed bytes? If the former, that's only marginally better as a measure of effort, since delta compression means that a small number of on-disk bytes may require a big effort (imagine processing a 100 byte blob versus a 100 byte delta off of a 100MB blob). The latter is probably more accurate. But it's also not free to pre-generate the total. We can get the number of objects or the size of the packfile in constant-time, but totaling up the uncompressed size of all objects is O(n). So that's extra computation, but it also means a potential lag before we can start the progress meter. I'm also not sure how meaningful a byte count is for a user there. So there. That's probably more than you wanted to know about Git's progress code. I think it probably _could_ be improved by counting more/different things, but I also think it can be a bit of a rabbit hole. Which is why AFAIK nobody's really looked too seriously into changing it. -Peff
Re: non-smooth progress indication for git fsck and git gc
On Thu, Aug 16, 2018 at 1:10 PM Ulrich Windl wrote: > > Hi! > > I'd like to point out some minor issue observed while processing some > 5-object repository with many binary objects, but most are rather small: > > Between the two phases of "git fsck" (checking directories and checking > objects) there was a break of several seconds where no progress was indicated. > > During "git gc" the writing objects phase did not update for some seconds, > but then the percentage counter jumped like from 15% to 42%. > > I understand that updating the progress output too often can be a performance > bottleneck, while upating it too rarely might only bore the user... ;-) > > But maybe something can be done. My git version is 2.13.7 (openSUSE 42.3). Is it possible to make this repository public? You can also use "git fast-export --anonymize" to make a repo with same structure but no real content (but read the man page about that option first) > Regards, > Ulrich > > -- Duy
non-smooth progress indication for git fsck and git gc
Hi! I'd like to point out some minor issue observed while processing some 5-object repository with many binary objects, but most are rather small: Between the two phases of "git fsck" (checking directories and checking objects) there was a break of several seconds where no progress was indicated. During "git gc" the writing objects phase did not update for some seconds, but then the percentage counter jumped like from 15% to 42%. I understand that updating the progress output too often can be a performance bottleneck, while upating it too rarely might only bore the user... ;-) But maybe something can be done. My git version is 2.13.7 (openSUSE 42.3). Regards, Ulrich