subject:"non\-smooth progress indication for git fsck and git gc"

Re: non-smooth progress indication for git fsck and git gc

2018-09-06 Thread Jeff King

On Mon, Sep 03, 2018 at 06:48:54PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > And there are definitely a few nasty bits (like the way the progress is
> > ended). I'm not planning on taking this further for now, but maybe
> > you or somebody can find it interesting or useful.
> 
> I think it would be really nice if this were taken further. Using my
> perf test in
> https://public-inbox.org/git/20180903144928.30691-7-ava...@gmail.com/T/#u
> I get these results:
> 
> $ GIT_PERF_LARGE_REPO=/home/aearnfjord/g/linux GIT_PERF_REPEAT_COUNT=5 
> GIT_PERF_MAKE_OPTS='-j56 CFLAGS="-O3"' ./run HEAD~ HEAD p1450-fsck.sh
> [...]
> Test   HEAD~ HEAD
> 
> 1450.1: fsck   384.18(381.63+2.53)   301.52(508.28+38.34) -21.5%
> 
> 
> I.e. this gives a 20% speedup, although of course some of that might be
> because some of this might be skipping too much work, but looks really
> promising.

I'm pretty sure it's doing the correct thing, in terms of doing all the
right checks. But look at your CPU time. You're getting a 20% wall-clock
speedup, but spending a lot more CPU. So the main difference is really
the multi-threading in index-pack. It should be strictly worse in terms
of total CPU on a single-processor system because we're doing work in
the sub-process (so we pay for the process invocation, but also we
probably are unable to share things like in-memory commit structs,
wasting a little extra time).

So I'm on the fence on whether it is worth it. I like getting rid of the
duplicated code. But on the other hand it is not all that complex, and
maybe when it comes to things like fsck it is good to have a different
implementation than the one that writes the .idx out in the first place.

-Peff

Re: non-smooth progress indication for git fsck and git gc

2018-09-04 Thread Junio C Hamano

Jeff King  writes:

> That code isn't lib-ified enough to be run in process, but I think the
> patch below should give similar behavior to what fsck currently does.
> We'd need to tell index-pack to use our fsck.* config for its checks, I
> imagine. The progress here is still per-pack, but I think we could pass
> in sufficient information to have it do one continuous meter across all
> of the packs (see the in-code comment).
>
> And it makes the result multi-threaded, and lets us drop a bunch of
> duplicate code.
>
> ---
>  builtin/fsck.c |  53 +++--
>  pack-check.c   | 142 ---
>  pack.h |   1 -
>  3 files changed, 32 insertions(+), 164 deletions(-)

The numbers here are nice, even to readers who do not necessarily
care about the progress meter output ;-)

Re: non-smooth progress indication for git fsck and git gc

2018-09-03 Thread Ævar Arnfjörð Bjarmason



On Sun, Sep 02 2018, Jeff King wrote:

> On Sun, Sep 02, 2018 at 03:55:28AM -0400, Jeff King wrote:
>
>> I still think the more interesting long-term thing here is to reuse the
>> pack verification from index-pack, which actually hashes as it does the
>> per-object countup.
>>
>> That code isn't lib-ified enough to be run in process, but I think the
>> patch below should give similar behavior to what fsck currently does.
>> We'd need to tell index-pack to use our fsck.* config for its checks, I
>> imagine. The progress here is still per-pack, but I think we could pass
>> in sufficient information to have it do one continuous meter across all
>> of the packs (see the in-code comment).
>>
>> And it makes the result multi-threaded, and lets us drop a bunch of
>> duplicate code.
>
> Here's a little polish on that to pass enough progress data to
> index-pack to let it have one nice continuous meter. I'm not sure if
> it's worth all the hackery or not. The dual-meter that index-pack
> generally uses is actually more informative (since it meters the initial
> pass through the pack and the delta reconstruction separately).

Thanks!

> And there are definitely a few nasty bits (like the way the progress is
> ended). I'm not planning on taking this further for now, but maybe
> you or somebody can find it interesting or useful.

I think it would be really nice if this were taken further. Using my
perf test in
https://public-inbox.org/git/20180903144928.30691-7-ava...@gmail.com/T/#u
I get these results:

$ GIT_PERF_LARGE_REPO=/home/aearnfjord/g/linux GIT_PERF_REPEAT_COUNT=5 
GIT_PERF_MAKE_OPTS='-j56 CFLAGS="-O3"' ./run HEAD~ HEAD p1450-fsck.sh
[...]
Test   HEAD~ HEAD

1450.1: fsck   384.18(381.63+2.53)   301.52(508.28+38.34) -21.5%


I.e. this gives a 20% speedup, although of course some of that might be
because some of this might be skipping too much work, but looks really
promising.

I don't know if I'll have time for it, and besides, you wrote the code
so you already understand it *hint* *hint* :)

> ---
>  builtin/fsck.c   |  59 +--
>  builtin/index-pack.c |  43 ++-
>  pack-check.c | 142 ---
>  pack.h   |   1 -
>  4 files changed, 75 insertions(+), 170 deletions(-)
>
> diff --git a/builtin/fsck.c b/builtin/fsck.c
> index 250f5af118..2d774ea2e5 100644
> --- a/builtin/fsck.c
> +++ b/builtin/fsck.c
> @@ -386,25 +386,6 @@ static int fsck_obj(struct object *obj, void *buffer, 
> unsigned long size)
>   return err;
>  }
>
> -static int fsck_obj_buffer(const struct object_id *oid, enum object_type 
> type,
> -unsigned long size, void *buffer, int *eaten)
> -{
> - /*
> -  * Note, buffer may be NULL if type is OBJ_BLOB. See
> -  * verify_packfile(), data_valid variable for details.
> -  */
> - struct object *obj;
> - obj = parse_object_buffer(the_repository, oid, type, size, buffer,
> -   eaten);
> - if (!obj) {
> - errors_found |= ERROR_OBJECT;
> - return error("%s: object corrupt or missing", oid_to_hex(oid));
> - }
> - obj->flags &= ~(REACHABLE | SEEN);
> - obj->flags |= HAS_OBJ;
> - return fsck_obj(obj, buffer, size);
> -}
> -
>  static int default_refs;
>
>  static void fsck_handle_reflog_oid(const char *refname, struct object_id 
> *oid,
> @@ -662,6 +643,32 @@ static int mark_packed_for_connectivity(const struct 
> object_id *oid,
>   return 0;
>  }
>
> +static int verify_pack(struct packed_git *p,
> +unsigned long count, unsigned long total)
> +{
> + struct child_process index_pack = CHILD_PROCESS_INIT;
> +
> + if (is_pack_valid(p) < 0)
> + return -1;
> + for_each_object_in_pack(p, mark_packed_for_connectivity, NULL, 0);
> +
> + index_pack.git_cmd = 1;
> + argv_array_pushl(&index_pack.args,
> +  "index-pack",
> +  "--verify-fsck",
> +  NULL);
> + if (show_progress)
> + argv_array_pushf(&index_pack.args,
> +  "--fsck-progress=%lu,%lu,Checking pack %s",
> +  count, total, sha1_to_hex(p->sha1));
> + argv_array_push(&index_pack.args, p->pack_name);
> +
> + if (run_command(&index_pack))
> + return -1;
> +
> + return 0;
> +}
> +
>  static char const * const fsck_usage[] = {
>   N_("git fsck [] [...]"),
>   NULL
> @@ -737,7 +744,6 @@ int cmd_fsck(int argc, const char **argv, const char 
> *prefix)
>   if (check_full) {
>   struct packed_git *p;
>   uint32_t total = 0, count = 0;
> - struct progress *progress = NULL;
>
>   if (show_progress) {
>   for (p = get_packed_git(the_repository)

Re: non-smooth progress indication for git fsck and git gc

2018-09-02 Thread Jeff King

On Sun, Sep 02, 2018 at 03:55:28AM -0400, Jeff King wrote:

> I still think the more interesting long-term thing here is to reuse the
> pack verification from index-pack, which actually hashes as it does the
> per-object countup.
> 
> That code isn't lib-ified enough to be run in process, but I think the
> patch below should give similar behavior to what fsck currently does.
> We'd need to tell index-pack to use our fsck.* config for its checks, I
> imagine. The progress here is still per-pack, but I think we could pass
> in sufficient information to have it do one continuous meter across all
> of the packs (see the in-code comment).
> 
> And it makes the result multi-threaded, and lets us drop a bunch of
> duplicate code.

Here's a little polish on that to pass enough progress data to
index-pack to let it have one nice continuous meter. I'm not sure if
it's worth all the hackery or not. The dual-meter that index-pack
generally uses is actually more informative (since it meters the initial
pass through the pack and the delta reconstruction separately).

And there are definitely a few nasty bits (like the way the progress is
ended). I'm not planning on taking this further for now, but maybe
you or somebody can find it interesting or useful.

---
 builtin/fsck.c   |  59 +--
 builtin/index-pack.c |  43 ++-
 pack-check.c | 142 ---
 pack.h   |   1 -
 4 files changed, 75 insertions(+), 170 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 250f5af118..2d774ea2e5 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -386,25 +386,6 @@ static int fsck_obj(struct object *obj, void *buffer, 
unsigned long size)
return err;
 }
 
-static int fsck_obj_buffer(const struct object_id *oid, enum object_type type,
-  unsigned long size, void *buffer, int *eaten)
-{
-   /*
-* Note, buffer may be NULL if type is OBJ_BLOB. See
-* verify_packfile(), data_valid variable for details.
-*/
-   struct object *obj;
-   obj = parse_object_buffer(the_repository, oid, type, size, buffer,
- eaten);
-   if (!obj) {
-   errors_found |= ERROR_OBJECT;
-   return error("%s: object corrupt or missing", oid_to_hex(oid));
-   }
-   obj->flags &= ~(REACHABLE | SEEN);
-   obj->flags |= HAS_OBJ;
-   return fsck_obj(obj, buffer, size);
-}
-
 static int default_refs;
 
 static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid,
@@ -662,6 +643,32 @@ static int mark_packed_for_connectivity(const struct 
object_id *oid,
return 0;
 }
 
+static int verify_pack(struct packed_git *p,
+  unsigned long count, unsigned long total)
+{
+   struct child_process index_pack = CHILD_PROCESS_INIT;
+
+   if (is_pack_valid(p) < 0)
+   return -1;
+   for_each_object_in_pack(p, mark_packed_for_connectivity, NULL, 0);
+
+   index_pack.git_cmd = 1;
+   argv_array_pushl(&index_pack.args,
+"index-pack",
+"--verify-fsck",
+NULL);
+   if (show_progress)
+   argv_array_pushf(&index_pack.args,
+"--fsck-progress=%lu,%lu,Checking pack %s",
+count, total, sha1_to_hex(p->sha1));
+   argv_array_push(&index_pack.args, p->pack_name);
+
+   if (run_command(&index_pack))
+   return -1;
+
+   return 0;
+}
+
 static char const * const fsck_usage[] = {
N_("git fsck [] [...]"),
NULL
@@ -737,7 +744,6 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
if (check_full) {
struct packed_git *p;
uint32_t total = 0, count = 0;
-   struct progress *progress = NULL;
 
if (show_progress) {
for (p = get_packed_git(the_repository); p;
@@ -746,18 +752,21 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
continue;
total += p->num_objects;
}
-
-   progress = start_progress(_("Checking 
objects"), total);
}
for (p = get_packed_git(the_repository); p;
 p = p->next) {
/* verify gives error messages itself */
-   if (verify_pack(p, fsck_obj_buffer,
-   progress, count))
+   if (verify_pack(p, count, total))
errors_found |= ERROR_PACK;
count += p->num_objects;
}
-   stop_progress(&progres

Re: non-smooth progress indication for git fsck and git gc

2018-09-02 Thread Jeff King

On Sun, Sep 02, 2018 at 03:46:57AM -0400, Jeff King wrote:

> Something like this, which chunks it there, uses a per-packfile meter
> (though still does not give any clue how many packfiles there are), and
> shows a throughput meter.

Actually, in typical cases it would not matter how many packfiles there
are, because there's generally one big one, and then none of the small
ones would take long enough to trigger the delayed meter. So you'd only
see one such meter usually, and it would cover the majority of the time.
In theory, anyway.

I still think the more interesting long-term thing here is to reuse the
pack verification from index-pack, which actually hashes as it does the
per-object countup.

That code isn't lib-ified enough to be run in process, but I think the
patch below should give similar behavior to what fsck currently does.
We'd need to tell index-pack to use our fsck.* config for its checks, I
imagine. The progress here is still per-pack, but I think we could pass
in sufficient information to have it do one continuous meter across all
of the packs (see the in-code comment).

And it makes the result multi-threaded, and lets us drop a bunch of
duplicate code.

---
 builtin/fsck.c |  53 +++--
 pack-check.c   | 142 ---
 pack.h |   1 -
 3 files changed, 32 insertions(+), 164 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 250f5af118..643e16125b 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -386,25 +386,6 @@ static int fsck_obj(struct object *obj, void *buffer, 
unsigned long size)
return err;
 }
 
-static int fsck_obj_buffer(const struct object_id *oid, enum object_type type,
-  unsigned long size, void *buffer, int *eaten)
-{
-   /*
-* Note, buffer may be NULL if type is OBJ_BLOB. See
-* verify_packfile(), data_valid variable for details.
-*/
-   struct object *obj;
-   obj = parse_object_buffer(the_repository, oid, type, size, buffer,
- eaten);
-   if (!obj) {
-   errors_found |= ERROR_OBJECT;
-   return error("%s: object corrupt or missing", oid_to_hex(oid));
-   }
-   obj->flags &= ~(REACHABLE | SEEN);
-   obj->flags |= HAS_OBJ;
-   return fsck_obj(obj, buffer, size);
-}
-
 static int default_refs;
 
 static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid,
@@ -662,6 +643,37 @@ static int mark_packed_for_connectivity(const struct 
object_id *oid,
return 0;
 }
 
+static int verify_pack(struct packed_git *p)
+{
+   struct child_process index_pack = CHILD_PROCESS_INIT;
+
+   for_each_object_in_pack(p, mark_packed_for_connectivity, NULL, 0);
+
+   /*
+* This should probably be replaced with a command-line option
+* that teaches "index-pack --verify" to show a more compact and
+* fsck-oriented progress (see also the "-v" below).
+*/
+   if (show_progress)
+   fprintf(stderr, "Checking packfile %s...\n",
+   p->pack_name);
+
+   index_pack.git_cmd = 1;
+   argv_array_pushl(&index_pack.args,
+"index-pack",
+"--verify",
+"--strict",
+NULL);
+   if (show_progress)
+   argv_array_push(&index_pack.args, "-v");
+   argv_array_push(&index_pack.args, p->pack_name);
+
+   if (run_command(&index_pack))
+   return -1;
+
+   return 0;
+}
+
 static char const * const fsck_usage[] = {
N_("git fsck [] [...]"),
NULL
@@ -752,8 +764,7 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
for (p = get_packed_git(the_repository); p;
 p = p->next) {
/* verify gives error messages itself */
-   if (verify_pack(p, fsck_obj_buffer,
-   progress, count))
+   if (verify_pack(p))
errors_found |= ERROR_PACK;
count += p->num_objects;
}
diff --git a/pack-check.c b/pack-check.c
index d3a57df34f..ea1457ce53 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -15,17 +15,6 @@ struct idx_entry {
unsigned int nr;
 };
 
-static int compare_entries(const void *e1, const void *e2)
-{
-   const struct idx_entry *entry1 = e1;
-   const struct idx_entry *entry2 = e2;
-   if (entry1->offset < entry2->offset)
-   return -1;
-   if (entry1->offset > entry2->offset)
-   return 1;
-   return 0;
-}
-
 int check_pack_crc(struct packed_git *p, struct pack_window **w_curs,
   off_t offset, off_t len, unsigned int nr)
 {
@@ -48,121 +37,6 @@ int check_pack_crc(struct packed_git *p, struct pack_window 
**w_curs,
return

Re: non-smooth progress indication for git fsck and git gc

2018-09-02 Thread Jeff King

On Sat, Sep 01, 2018 at 02:53:28PM +0200, Ævar Arnfjörð Bjarmason wrote:

> With this we'll get output like:
> 
> $ ~/g/git/git  -C ~/g/2015-04-03-1M-git/  --exec-path=$PWD fsck
> Checking object directories: 100% (256/256), done.
> Hashing: 100% (452634108/452634108), done.
> Hashing: 100% (1073741824/1073741824), done.
> Hashing: 100% (1073741824/1073741824), done.
> Hashing: 100% (1008001572/1008001572), done.
> Checking objects:   2% (262144/13064614)
> ^C
> 
> All tests pass with this. Isn't it awesome? Except it's of course a
> massive hack, we wouldn't want to just hook into SHA1DC like this.

I still consider that output so-so; the byte counts are big and there's
no indication how many "hashing" lines we're going to see. It's also
broken up in a weird way (it's not one per file; it's one per giant
chunk we fed to sha1).

> The problem comes down to us needing to call git_hash_sha1_update() with
> some really huge input, that function is going to take a *long* time,
> and the only way we're getting incremental progress is:
> 
>  1) If we ourselves split the input into N chunks
>  2) If we hack into the SHA1 library itself
> 
> This patch does #2, but for this to be acceptable we'd need to do
> something like #1.

I think we could just do the chunking in verify_packfile(), couldn't we?
(And the .idx hash, if we really want to cover that case, but IMHO
that's way less interesting).

Something like this, which chunks it there, uses a per-packfile meter
(though still does not give any clue how many packfiles there are), and
shows a throughput meter.

diff --git a/pack-check.c b/pack-check.c
index d3a57df34f..c94223664f 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -62,10 +62,25 @@ static int verify_packfile(struct packed_git *p,
uint32_t nr_objects, i;
int err = 0;
struct idx_entry *entries;
+   struct progress *hashing_progress;
+   char *title;
+   off_t total_hashed = 0;
 
if (!is_pack_valid(p))
return error("packfile %s cannot be accessed", p->pack_name);
 
+   if (progress) {
+   /* Probably too long... */
+   title = xstrfmt("Hashing %s", p->pack_name);
+
+   /*
+* I don't think it actually works to have two progresses going
+* at the same time, because when the first one ends, we'll
+* cancel the alarm. But hey, this is a hacky proof of concept.
+*/
+   hashing_progress = start_progress(title, 0);
+   }
+
the_hash_algo->init_fn(&ctx);
do {
unsigned long remaining;
@@ -75,9 +90,25 @@ static int verify_packfile(struct packed_git *p,
pack_sig_ofs = p->pack_size - the_hash_algo->rawsz;
if (offset > pack_sig_ofs)
remaining -= (unsigned int)(offset - pack_sig_ofs);
-   the_hash_algo->update_fn(&ctx, in, remaining);
+   while (remaining) {
+   int chunk = remaining < 4096 ? remaining : 4096;
+   the_hash_algo->update_fn(&ctx, in, chunk);
+   in += chunk;
+   remaining -= chunk;
+   total_hashed += chunk;
+   /*
+* The progress code needs tweaking to show throughputs
+* better for open-ended meters.
+*/
+   display_throughput(hashing_progress, total_hashed);
+   display_progress(hashing_progress, 0);
+   }
} while (offset < pack_sig_ofs);
+
the_hash_algo->final_fn(hash, &ctx);
+   stop_progress(&hashing_progress);
+   free(title);
+
pack_sig = use_pack(p, w_curs, pack_sig_ofs, NULL);
if (hashcmp(hash, pack_sig))
err = error("%s pack checksum mismatch",

Re: non-smooth progress indication for git fsck and git gc

2018-09-01 Thread Ævar Arnfjörð Bjarmason



On Sat, Sep 01 2018, Ævar Arnfjörð Bjarmason wrote:

> On Thu, Aug 16 2018, Ævar Arnfjörð Bjarmason wrote:
>
>> On Thu, Aug 16 2018, Jeff King wrote:
>>
>>> On Thu, Aug 16, 2018 at 08:54:25AM +0200, Ulrich Windl wrote:
>>>
 I'd like to point out some minor issue observed while processing some
 5-object repository with many binary objects, but most are rather
 small:

 Between the two phases of "git fsck" (checking directories and
 checking objects) there was a break of several seconds where no
 progress was indicated.

 During "git gc" the writing objects phase did not update for some
 seconds, but then the percentage counter jumped like from 15% to 42%.

 I understand that updating the progress output too often can be a
 performance bottleneck, while upating it too rarely might only bore
 the user... ;-)
>>>
>>> We update the counter integer for every object we process, and then
>>> actually update the display whenever the percentage increases or a
>>> second has elapsed, whichever comes first.
>>>
>>> What you're seeing is likely an artifact of _what_ we're counting:
>>> written objects. Not all objects are the same size, so it's not uncommon
>>> to see thousands/sec when dealing with small ones, and then several
>>> seconds for one giant blob.
>>>
>>> The only way to solve that is to count bytes. We don't have a total byte
>>> count in most cases, and it wouldn't always make sense (e.g., the
>>> "Compressing objects" meter can show the same issue, but it's not really
>>> putting through bytes in a linear way).  In some cases we do show
>>> transmitted size and throughput, but that's just for network operations.
>>> We could do the same for "gc" with the patch below. But usually
>>> throughput isn't all that interesting for a filesystem write, because
>>> bandwidth isn't the bottleneck.
>>>
>>> Possibly we could have a "half throughput" mode that counts up the total
>>> size written, but omits the speed indicator. That's not an unreasonable
>>> thing to show for a local pack, since you end up with the final pack
>>> size. The object counter would still be jumpy, but you'd at least have
>>> one number updated at least once per second as you put through a large
>>> blob.
>>>
>>> If you really want a smooth percentage, then we'd have to start counting
>>> bytes instead of objects. Two reasons we don't do that are:
>>>
>>>   - we often don't know the total byte size exactly. E.g., for a
>>> packfile write, it depends on the result of deflating each object.
>>> You can make an approximation and just pretend at the end that you
>>> hit 100%.  Or you can count the pre-deflate sizes, but then your
>>> meter doesn't match the bytes from the throughput counter.
>>>
>>>   - for something like fsck, we're not actually writing out bytes.  So I
>>> guess you'd be measuring "here's how many bytes of objects I
>>> fsck-ed". But is that on-disk compressed bytes, or decompressed
>>> bytes?
>>>
>>> If the former, that's only marginally better as a measure of effort,
>>> since delta compression means that a small number of on-disk bytes
>>> may require a big effort (imagine processing a 100 byte blob versus
>>> a 100 byte delta off of a 100MB blob).
>>>
>>> The latter is probably more accurate. But it's also not free to
>>> pre-generate the total. We can get the number of objects or the size
>>> of the packfile in constant-time, but totaling up the uncompressed
>>> size of all objects is O(n). So that's extra computation, but it
>>> also means a potential lag before we can start the progress meter.
>>>
>>> I'm also not sure how meaningful a byte count is for a user there.
>>>
>>> So there. That's probably more than you wanted to know about Git's
>>> progress code. I think it probably _could_ be improved by counting
>>> more/different things, but I also think it can be a bit of a rabbit
>>> hole. Which is why AFAIK nobody's really looked too seriously into
>>> changing it.
>>>
>>> -Peff
>>
>> This is all interesting, but I think unrelated to what Ulrich is talking
>> about. Quote:
>>
>> Between the two phases of "git fsck" (checking directories and
>> checking objects) there was a break of several seconds where no
>> progress was indicated
>>
>> I.e. it's not about the pause you get with your testcase (which is
>> certainly another issue) but the break between the two progress bars.
>>
>> Here's a test case you can clone:
>> https://github.com/avar/2015-04-03-1M-git (or might already have
>> "locally" :)
>>
>> If you fsck this repository it'll take around (on my spinning rust
>> server) 30 seconds between 100% of "Checking object directories" before
>> you get any output from "Checking objects".
>>
>> The breakdown of that is (this is from approximate eyeballing):
>>
>>  * We spend 1-3 seconds just on this:
>>
>> https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21

Re: non-smooth progress indication for git fsck and git gc

2018-09-01 Thread Ævar Arnfjörð Bjarmason



On Thu, Aug 16 2018, Ævar Arnfjörð Bjarmason wrote:

> On Thu, Aug 16 2018, Jeff King wrote:
>
>> On Thu, Aug 16, 2018 at 08:54:25AM +0200, Ulrich Windl wrote:
>>
>>> I'd like to point out some minor issue observed while processing some
>>> 5-object repository with many binary objects, but most are rather
>>> small:
>>>
>>> Between the two phases of "git fsck" (checking directories and
>>> checking objects) there was a break of several seconds where no
>>> progress was indicated.
>>>
>>> During "git gc" the writing objects phase did not update for some
>>> seconds, but then the percentage counter jumped like from 15% to 42%.
>>>
>>> I understand that updating the progress output too often can be a
>>> performance bottleneck, while upating it too rarely might only bore
>>> the user... ;-)
>>
>> We update the counter integer for every object we process, and then
>> actually update the display whenever the percentage increases or a
>> second has elapsed, whichever comes first.
>>
>> What you're seeing is likely an artifact of _what_ we're counting:
>> written objects. Not all objects are the same size, so it's not uncommon
>> to see thousands/sec when dealing with small ones, and then several
>> seconds for one giant blob.
>>
>> The only way to solve that is to count bytes. We don't have a total byte
>> count in most cases, and it wouldn't always make sense (e.g., the
>> "Compressing objects" meter can show the same issue, but it's not really
>> putting through bytes in a linear way).  In some cases we do show
>> transmitted size and throughput, but that's just for network operations.
>> We could do the same for "gc" with the patch below. But usually
>> throughput isn't all that interesting for a filesystem write, because
>> bandwidth isn't the bottleneck.
>>
>> Possibly we could have a "half throughput" mode that counts up the total
>> size written, but omits the speed indicator. That's not an unreasonable
>> thing to show for a local pack, since you end up with the final pack
>> size. The object counter would still be jumpy, but you'd at least have
>> one number updated at least once per second as you put through a large
>> blob.
>>
>> If you really want a smooth percentage, then we'd have to start counting
>> bytes instead of objects. Two reasons we don't do that are:
>>
>>   - we often don't know the total byte size exactly. E.g., for a
>> packfile write, it depends on the result of deflating each object.
>> You can make an approximation and just pretend at the end that you
>> hit 100%.  Or you can count the pre-deflate sizes, but then your
>> meter doesn't match the bytes from the throughput counter.
>>
>>   - for something like fsck, we're not actually writing out bytes.  So I
>> guess you'd be measuring "here's how many bytes of objects I
>> fsck-ed". But is that on-disk compressed bytes, or decompressed
>> bytes?
>>
>> If the former, that's only marginally better as a measure of effort,
>> since delta compression means that a small number of on-disk bytes
>> may require a big effort (imagine processing a 100 byte blob versus
>> a 100 byte delta off of a 100MB blob).
>>
>> The latter is probably more accurate. But it's also not free to
>> pre-generate the total. We can get the number of objects or the size
>> of the packfile in constant-time, but totaling up the uncompressed
>> size of all objects is O(n). So that's extra computation, but it
>> also means a potential lag before we can start the progress meter.
>>
>> I'm also not sure how meaningful a byte count is for a user there.
>>
>> So there. That's probably more than you wanted to know about Git's
>> progress code. I think it probably _could_ be improved by counting
>> more/different things, but I also think it can be a bit of a rabbit
>> hole. Which is why AFAIK nobody's really looked too seriously into
>> changing it.
>>
>> -Peff
>
> This is all interesting, but I think unrelated to what Ulrich is talking
> about. Quote:
>
> Between the two phases of "git fsck" (checking directories and
> checking objects) there was a break of several seconds where no
> progress was indicated
>
> I.e. it's not about the pause you get with your testcase (which is
> certainly another issue) but the break between the two progress bars.
>
> Here's a test case you can clone:
> https://github.com/avar/2015-04-03-1M-git (or might already have
> "locally" :)
>
> If you fsck this repository it'll take around (on my spinning rust
> server) 30 seconds between 100% of "Checking object directories" before
> you get any output from "Checking objects".
>
> The breakdown of that is (this is from approximate eyeballing):
>
>  * We spend 1-3 seconds just on this:
>
> https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L181
>
>  * We spend the majority of the ~30s on this:
>
> https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-

Re: Antw: Re: non-smooth progress indication for git fsck and git gc

2018-08-21 Thread Duy Nguyen

On Tue, Aug 21, 2018 at 3:13 AM Jeff King  wrote:
> I _think_ they should work together OK without further modification.
> Once upon a time the caller had to say "don't show if we're past N%
> after M seconds", but I think with the current code we'd just show it if
> we're not completely finished after 2 seconds.
>
> So it really should just be a simple:
>
>   progress = start_delayed_progress("Hashing packfile", 0);
>
> That said, counting bytes would probably be ugly (just because the
> numbers get really big). We'd probably prefer to show a throughput or
> something.

Or just an ascii throbber. I think the important thing is communicate
"I'm still doing something, not hanging up". "Hashing packfile"
satisfies the "something" part, the throbber the "hanging".
-- 
Duy

Re: Antw: Re: non-smooth progress indication for git fsck and git gc

2018-08-20 Thread Ulrich Windl

>>> Jeff King  schrieb am 21.08.2018 um 03:07 in Nachricht
<20180821010712.ga32...@sigill.intra.peff.net>:
> On Mon, Aug 20, 2018 at 10:57:13AM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
[...]
> So it really should just be a simple:
> 
>   progress = start_delayed_progress("Hashing packfile", 0);
> 
> That said, counting bytes would probably be ugly (just because the
> numbers get really big). We'd probably prefer to show a throughput or

Hi!

With big numbers you could apply scaling (the usual ISO suffixes), but after
having read GBs, there wouldn't be much change visible unless you add
significant fractional digits.
So scaling seems good for absolute numbers, but not for growing numbers. Of
course one could recursively scale the fractional parts again, but then the
result will be lengthy again, then.

Regards,
Ulrich

> something. And as you noted, there's some refactoring to be done with
> pack-check for it to show multiple progress meters.
> 
> (I still think in the long run we would want to scrap that code, but
> that's a much bigger job; I'm fine if somebody wants to do incremental
> improvements in the meantime).
> 
> -Peff

Re: Antw: Re: non-smooth progress indication for git fsck and git gc

2018-08-20 Thread Jeff King

On Mon, Aug 20, 2018 at 10:57:13AM +0200, Ævar Arnfjörð Bjarmason wrote:

> > That seems to apply. BTW: Is there a way go get some repository statistics
> > like a histogram of object sizes (or whatever that might be useful to help
> > making decisions)?
> 
> The git-sizer program is really helpful in this regard:
> https://github.com/github/git-sizer

Yeah, I'd very much agree that's the best tool for a general overview of
the repository stats.

Ulrich, if you want to more analysis (like a histogram), probably
something like:

  git cat-file --batch-all-objects --batch-check='%(objectsize)'

might be a good starting place to dump information (see the "BATCH
OUTPUT" section of "git help cat-file" for more format items).

> > If it's sorting, maybe add some code like (wild guess):
> >
> > if (objects_to_sort > magic_number)
> >message("Sorting something...");
> 
> I think a good solution to these cases is to just introduce something to
> the progress.c mode where it learns how to display a counter where we
> don't know what the end-state will be. Something like your proposed
> magic_number can just be covered under the more general case where we
> don't show the progress bar unless it's been 1 second (which I believe
> is the default).

Yeah.  We already have open-ended counters (e.g., "counting objects"),
and delayed meters (we actually normalized the default to 2s recently).

I _think_ they should work together OK without further modification.
Once upon a time the caller had to say "don't show if we're past N%
after M seconds", but I think with the current code we'd just show it if
we're not completely finished after 2 seconds.

So it really should just be a simple:

  progress = start_delayed_progress("Hashing packfile", 0);

That said, counting bytes would probably be ugly (just because the
numbers get really big). We'd probably prefer to show a throughput or
something. And as you noted, there's some refactoring to be done with
pack-check for it to show multiple progress meters.

(I still think in the long run we would want to scrap that code, but
that's a much bigger job; I'm fine if somebody wants to do incremental
improvements in the meantime).

-Peff

Re: Antw: Re: non-smooth progress indication for git fsck and git gc

2018-08-20 Thread Ulrich Windl

Hi!

Here are some stats from the repository. First the fast import ones (which had
good performance, but probably all cached, also):

% git fast-import <../git-stream
/usr/lib/git/git-fast-import statistics:
-
Alloc'd objects:  55000
Total objects:51959 (56 duplicates  )
  blobs  :47991 ( 0 duplicates  0 deltas of   
  0 attempts)
  trees  : 3946 (56 duplicates994 deltas of  
3929 attempts)
  commits:   22 ( 0 duplicates  0 deltas of   
  0 attempts)
  tags   :0 ( 0 duplicates  0 deltas of   
  0 attempts)
Total branches:  15 (15 loads )
  marks:1048576 ( 48013 unique)
  atoms:  43335
Memory total:  9611 KiB
   pools:  7033 KiB
 objects:  2578 KiB
-
pack_report: getpagesize()=   4096
pack_report: core.packedGitWindowSize = 1073741824
pack_report: core.packedGitLimit  = 8589934592
pack_report: pack_used_ctr=   1780
pack_report: pack_mmap_calls  = 23
pack_report: pack_open_windows=  1 /  1
pack_report: pack_mapped  =2905751 /2905751
-

Then the output from git-sizer:

Processing blobs: 47991
Processing trees: 3946
Processing commits: 22
Matching commits to trees: 22
Processing annotated tags: 0
Processing references: 15
| Name | Value | Level of concern   |
|  | - | -- |
| Overall repository size  |   ||
| * Blobs  |   ||
|   * Total size   |  13.7 GiB | *  |
|  |   ||
| Biggest objects  |   ||
| * Trees  |   ||
|   * Maximum entries  [1] |  13.4 k   | *  |
| * Blobs  |   ||
|   * Maximum size [2] |   279 MiB | *  |
|  |   ||
| Biggest checkouts|   ||
| * Maximum path depth [3] |10 | *  |
| * Maximum path length[3] |   130 B   | *  |
| * Total size of files[3] |  8.63 GiB | *  |

[1]  b701345cbd4317276568b9d9890fd77f38933bdc
(refs/heads/master:Resources/default/CIFP)
[2]  19f54c4a7595667329c1be23200234f0fe50ac56
(refs/heads/master:Resources/default/apt.dat)
[3]  b0e3d3a2b7f2504117408f13486c905a8eb8fb1e (refs/heads/master^{tree})

Some notes:
[1] is a directory with many short (typically < 1kB) text files
[2] is a very large text file with many changes
[3] Yes, some paths are long

Regards,
Ulrich


>>> Ævar Arnfjörð Bjarmason  schrieb am 20.08.2018 um 10:57
in
Nachricht <87woslpg9i@evledraar.gmail.com>:

> On Mon, Aug 20 2018, Ulrich Windl wrote:
> 
> Jeff King  schrieb am 16.08.2018 um 22:55 in Nachricht
>> <20180816205556.ga8...@sigill.intra.peff.net>:
>>> On Thu, Aug 16, 2018 at 10:35:53PM +0200, Ævar Arnfjörð Bjarmason wrote:
>>>
 This is all interesting, but I think unrelated to what Ulrich is talking
 about. Quote:

 Between the two phases of "git fsck" (checking directories and
 checking objects) there was a break of several seconds where no
 progress was indicated

 I.e. it's not about the pause you get with your testcase (which is
 certainly another issue) but the break between the two progress bars.
>>>
>>> I think he's talking about both. What I said responds to this:
>>
>> Hi guys!
>>
>> Yes, I was wondering what git does between the two visible phases, and 
> between
>> the lines I was suggesting another progress message between those phases.
At
>> least the maximum unspecific three-dot-message "Thinking..." could be 
> displayed
>> ;-) Of course anything more appropriate would be welcome.
>> Also that message should only be displayed if it's foreseeable that the
>> operation will take significant time. In my case (I just repeated it a few
>> minutes ago) the delay is significant (at least 10 seconds). As noted 
> earlier I
>> was hoping to capture the timing in a screencast, but it seems all

Re: Antw: Re: non-smooth progress indication for git fsck and git gc

2018-08-20 Thread Ævar Arnfjörð Bjarmason



On Mon, Aug 20 2018, Ulrich Windl wrote:

 Jeff King  schrieb am 16.08.2018 um 22:55 in Nachricht
> <20180816205556.ga8...@sigill.intra.peff.net>:
>> On Thu, Aug 16, 2018 at 10:35:53PM +0200, Ævar Arnfjörð Bjarmason wrote:
>>
>>> This is all interesting, but I think unrelated to what Ulrich is talking
>>> about. Quote:
>>>
>>> Between the two phases of "git fsck" (checking directories and
>>> checking objects) there was a break of several seconds where no
>>> progress was indicated
>>>
>>> I.e. it's not about the pause you get with your testcase (which is
>>> certainly another issue) but the break between the two progress bars.
>>
>> I think he's talking about both. What I said responds to this:
>
> Hi guys!
>
> Yes, I was wondering what git does between the two visible phases, and between
> the lines I was suggesting another progress message between those phases. At
> least the maximum unspecific three-dot-message "Thinking..." could be 
> displayed
> ;-) Of course anything more appropriate would be welcome.
> Also that message should only be displayed if it's foreseeable that the
> operation will take significant time. In my case (I just repeated it a few
> minutes ago) the delay is significant (at least 10 seconds). As noted earlier 
> I
> was hoping to capture the timing in a screencast, but it seems all the delays
> were just optimized away in the recording.
>
>>
>>> >> During "git gc" the writing objects phase did not update for some
>>> >> seconds, but then the percentage counter jumped like from 15% to 42%.
>>
>> But yeah, I missed that the fsck thing was specifically about a break
>> between two meters. That's a separate problem, but also worth
>> discussing (and hopefully much easier to address).
>>
>>> If you fsck this repository it'll take around (on my spinning rust
>>> server) 30 seconds between 100% of "Checking object directories" before
>>> you get any output from "Checking objects".
>>>
>>> The breakdown of that is (this is from approximate eyeballing):
>>>
>>>  * We spend 1-3 seconds just on this:
>>>
>>
> https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack
>
>> -check.c#L181
>>
>> OK, so that's checking the sha1 over the .idx file. We could put a meter
>> on that. I wouldn't expect it to generally be all that slow outside of
>> pathological cases, since it scales with the number of objects (and 1s
>> is our minimum update anyway, so that might be OK as-is). Your case has
>> 13M objects, which is quite large.
>
> Sometimes an oldish CPU could bring performance surprises, maybe. Anyway my
> CPU is question is an AMD Phenom2 quad-core with 3.2GHz nominal, and there is 
> a
> classic spinning disk with 5400RPM built in...
>
>>
>>>  * We spend the majority of the ~30s on this:
>>>
>>
> https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack
>
>> -check.c#L70-L79
>>
>> This is hashing the actual packfile. This is potentially quite long,
>> especially if you have a ton of big objects.
>
> That seems to apply. BTW: Is there a way go get some repository statistics
> like a histogram of object sizes (or whatever that might be useful to help
> making decisions)?

The git-sizer program is really helpful in this regard:
https://github.com/github/git-sizer

>>
>> I wonder if we need to do this as a separate step anyway, though. Our
>> verification is based on index-pack these days, which means it's going
>> to walk over the whole content as part of the "Indexing objects" step to
>> expand base objects and mark deltas for later. Could we feed this hash
>> as part of that walk over the data? It's not going to save us 30s, but
>> it's likely to be more efficient. And it would fold the effort naturally
>> into the existing progress meter.
>>
>>>  * Wes spend another 3-5 seconds on this QSORT:
>>>
>>
> https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack
>
>> -check.c#L105
>>
>> That's a tough one. I'm not sure how we'd count it (how many compares we
>> do?). And each item is doing so little work that hitting the progress
>> code may make things noticeably slower.
>
> If it's sorting, maybe add some code like (wild guess):
>
> if (objects_to_sort > magic_number)
>message("Sorting something...");

I think a good solution to these cases is to just introduce something to
the progress.c mode where it learns how to display a counter where we
don't know what the end-state will be. Something like your proposed
magic_number can just be covered under the more general case where we
don't show the progress bar unless it's been 1 second (which I believe
is the default).

>>
>> Again, your case is pretty big. Just based on the number of objects,
>> linux.git should be 1.5-2.5 seconds on your machine for the same
>> operation. Which I think may be small enough to ignore (or even just
>> print a generic before/after). It's really the 30s packfile hash that's
>> making the whole thing so terrible.
>>
>> -Peff

Antw: Re: non-smooth progress indication for git fsck and git gc

2018-08-20 Thread Ulrich Windl

>>> Jeff King  schrieb am 16.08.2018 um 22:55 in Nachricht
<20180816205556.ga8...@sigill.intra.peff.net>:
> On Thu, Aug 16, 2018 at 10:35:53PM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
>> This is all interesting, but I think unrelated to what Ulrich is talking
>> about. Quote:
>> 
>> Between the two phases of "git fsck" (checking directories and
>> checking objects) there was a break of several seconds where no
>> progress was indicated
>> 
>> I.e. it's not about the pause you get with your testcase (which is
>> certainly another issue) but the break between the two progress bars.
> 
> I think he's talking about both. What I said responds to this:

Hi guys!

Yes, I was wondering what git does between the two visible phases, and between
the lines I was suggesting another progress message between those phases. At
least the maximum unspecific three-dot-message "Thinking..." could be displayed
;-) Of course anything more appropriate would be welcome.
Also that message should only be displayed if it's foreseeable that the
operation will take significant time. In my case (I just repeated it a few
minutes ago) the delay is significant (at least 10 seconds). As noted earlier I
was hoping to capture the timing in a screencast, but it seems all the delays
were just optimized away in the recording.

> 
>> >> During "git gc" the writing objects phase did not update for some
>> >> seconds, but then the percentage counter jumped like from 15% to 42%.
> 
> But yeah, I missed that the fsck thing was specifically about a break
> between two meters. That's a separate problem, but also worth
> discussing (and hopefully much easier to address).
> 
>> If you fsck this repository it'll take around (on my spinning rust
>> server) 30 seconds between 100% of "Checking object directories" before
>> you get any output from "Checking objects".
>> 
>> The breakdown of that is (this is from approximate eyeballing):
>> 
>>  * We spend 1-3 seconds just on this:
>>
>
https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack

> -check.c#L181
> 
> OK, so that's checking the sha1 over the .idx file. We could put a meter
> on that. I wouldn't expect it to generally be all that slow outside of
> pathological cases, since it scales with the number of objects (and 1s
> is our minimum update anyway, so that might be OK as-is). Your case has
> 13M objects, which is quite large.

Sometimes an oldish CPU could bring performance surprises, maybe. Anyway my
CPU is question is an AMD Phenom2 quad-core with 3.2GHz nominal, and there is a
classic spinning disk with 5400RPM built in...

> 
>>  * We spend the majority of the ~30s on this:
>>
>
https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack

> -check.c#L70-L79
> 
> This is hashing the actual packfile. This is potentially quite long,
> especially if you have a ton of big objects.

That seems to apply. BTW: Is there a way go get some repository statistics
like a histogram of object sizes (or whatever that might be useful to help
making decisions)?

> 
> I wonder if we need to do this as a separate step anyway, though. Our
> verification is based on index-pack these days, which means it's going
> to walk over the whole content as part of the "Indexing objects" step to
> expand base objects and mark deltas for later. Could we feed this hash
> as part of that walk over the data? It's not going to save us 30s, but
> it's likely to be more efficient. And it would fold the effort naturally
> into the existing progress meter.
> 
>>  * Wes spend another 3-5 seconds on this QSORT:
>>
>
https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack

> -check.c#L105
> 
> That's a tough one. I'm not sure how we'd count it (how many compares we
> do?). And each item is doing so little work that hitting the progress
> code may make things noticeably slower.

If it's sorting, maybe add some code like (wild guess):

if (objects_to_sort > magic_number)
   message("Sorting something...");

> 
> Again, your case is pretty big. Just based on the number of objects,
> linux.git should be 1.5-2.5 seconds on your machine for the same
> operation. Which I think may be small enough to ignore (or even just
> print a generic before/after). It's really the 30s packfile hash that's
> making the whole thing so terrible.
> 
> -Peff

Antw: Re: non-smooth progress indication for git fsck and git gc

2018-08-20 Thread Ulrich Windl

>>> Duy Nguyen  schrieb am 16.08.2018 um 17:18 in Nachricht
:
> On Thu, Aug 16, 2018 at 1:10 PM Ulrich Windl
>  wrote:
>>
>> Hi!
>>
>> I'd like to point out some minor issue observed while processing some 
> 5-object repository with many binary objects, but most are rather small:
>>
>> Between the two phases of "git fsck" (checking directories and checking 
> objects) there was a break of several seconds where no progress was 
> indicated.
>>
>> During "git gc" the writing objects phase did not update for some seconds, 
> but then the percentage counter jumped like from 15% to 42%.
>>
>> I understand that updating the progress output too often can be a 
> performance bottleneck, while upating it too rarely might only bore the 
> user... ;-)
>>
>> But maybe something can be done. My git version is 2.13.7 (openSUSE 42.3).
> 
> Is it possible to make this repository public? You can also use "git
> fast-export --anonymize" to make a repo with same structure but no
> real content (but read the man page about that option first)

Hi!

Actually I tried that locally, but with the resulting repository both, fsck and 
gc are very fast. So I guess it won't be very useful. Also the original .git 
directory uses 5.3G, while the anonymous .git just used 4.3M...

I tried to capture the behavior as screencast, but it seems the screencast 
optimized the little cahnges away, and in the result git almost had no delay on 
any operation 8-(

Regards,
Ulrich

> 
>> Regards,
>> Ulrich
>>
>>
> 
> 
> -- 
> Duy

Re: non-smooth progress indication for git fsck and git gc

2018-08-17 Thread Duy Nguyen

On Thu, Aug 16, 2018 at 11:08 PM Jeff King  wrote:
>
> On Thu, Aug 16, 2018 at 04:55:56PM -0400, Jeff King wrote:
>
> > >  * We spend the majority of the ~30s on this:
> > >
> > > https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L70-L79
> >
> > This is hashing the actual packfile. This is potentially quite long,
> > especially if you have a ton of big objects.
> >
> > I wonder if we need to do this as a separate step anyway, though. Our
> > verification is based on index-pack these days, which means it's going
> > to walk over the whole content as part of the "Indexing objects" step to
> > expand base objects and mark deltas for later. Could we feed this hash
> > as part of that walk over the data? It's not going to save us 30s, but
> > it's likely to be more efficient. And it would fold the effort naturally
> > into the existing progress meter.
>
> Actually, I take it back. That's the nice, modern way we do it in
> git-verify-pack. But git-fsck uses the ancient "just walk over all of
> the idx entries method". It at least sorts in pack order, which is good,
> but:
>
>   - it's not multi-threaded, like index-pack/verify-pack
>
>   - the index-pack way is actually more efficient than pack-ordering for
> the delta-base cache, because it actually walks the delta-graph in
> the optimal order
>

I actually tried to make git-fsck use index-pack --verify at one
point. The only thing that stopped it from working was index-pack
automatically wrote the newer index version if I remember correctly,
and that would fail the final hash check. fsck performance was not a
big deal so I dropped it. Just saying it should be possible, if
someone's interested in that direction.
-- 
Duy

Re: non-smooth progress indication for git fsck and git gc

2018-08-16 Thread Junio C Hamano

Jeff King  writes:

> On Thu, Aug 16, 2018 at 11:57:14AM -0400, Jeff King wrote:
>
>> The only way to solve that is to count bytes. We don't have a total byte
>> count in most cases, and it wouldn't always make sense (e.g., the
>> "Compressing objects" meter can show the same issue, but it's not really
>> putting through bytes in a linear way).  In some cases we do show
>> transmitted size and throughput, but that's just for network operations.
>> We could do the same for "gc" with the patch below. But usually
>> throughput isn't all that interesting for a filesystem write, because
>> bandwidth isn't the bottleneck.
>
> Just realized I forgot to include the patch. Here it is, for reference.

I've been wondering when you'd realize the omission ;-)

> Doing something similar for fsck would be quite a bit more invasive.

Yeah, on that codepath there is no streaming write passing through a
single chokepoint you can count bytes X-<.

> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 80c880e9ad..e1130b959d 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -837,7 +837,7 @@ static void write_pack_file(void)
>   if (pack_to_stdout)
>   f = hashfd_throughput(1, "", progress_state);
>   else
> - f = create_tmp_packfile(&pack_tmp_name);
> + f = create_tmp_packfile(&pack_tmp_name, progress_state);
>  
>   offset = write_pack_header(f, nr_remaining);
>  
> diff --git a/bulk-checkin.c b/bulk-checkin.c
> index 9f3b644811..0df45b8f55 100644
> --- a/bulk-checkin.c
> +++ b/bulk-checkin.c
> @@ -178,7 +178,7 @@ static void prepare_to_stream(struct bulk_checkin_state 
> *state,
>   if (!(flags & HASH_WRITE_OBJECT) || state->f)
>   return;
>  
> - state->f = create_tmp_packfile(&state->pack_tmp_name);
> + state->f = create_tmp_packfile(&state->pack_tmp_name, NULL);
>   reset_pack_idx_option(&state->pack_idx_opts);
>  
>   /* Pretend we are going to write only one object */
> diff --git a/pack-write.c b/pack-write.c
> index a9d46bc03f..b72480b440 100644
> --- a/pack-write.c
> +++ b/pack-write.c
> @@ -334,14 +334,15 @@ int encode_in_pack_object_header(unsigned char *hdr, 
> int hdr_len,
>   return n;
>  }
>  
> -struct hashfile *create_tmp_packfile(char **pack_tmp_name)
> +struct hashfile *create_tmp_packfile(char **pack_tmp_name,
> +  struct progress *progress)
>  {
>   struct strbuf tmpname = STRBUF_INIT;
>   int fd;
>  
>   fd = odb_mkstemp(&tmpname, "pack/tmp_pack_XX");
>   *pack_tmp_name = strbuf_detach(&tmpname, NULL);
> - return hashfd(fd, *pack_tmp_name);
> + return hashfd_throughput(fd, *pack_tmp_name, progress);
>  }
>  
>  void finish_tmp_packfile(struct strbuf *name_buffer,
> diff --git a/pack.h b/pack.h
> index 34a9d458b4..c87628b093 100644
> --- a/pack.h
> +++ b/pack.h
> @@ -98,7 +98,8 @@ extern int encode_in_pack_object_header(unsigned char *hdr, 
> int hdr_len,
>  #define PH_ERROR_PROTOCOL(-3)
>  extern int read_pack_header(int fd, struct pack_header *);
>  
> -extern struct hashfile *create_tmp_packfile(char **pack_tmp_name);
> +extern struct hashfile *create_tmp_packfile(char **pack_tmp_name,
> + struct progress *progress);
>  extern void finish_tmp_packfile(struct strbuf *name_buffer, const char 
> *pack_tmp_name, struct pack_idx_entry **written_list, uint32_t nr_written, 
> struct pack_idx_option *pack_idx_opts, unsigned char sha1[]);
>  
>  #endif

Re: non-smooth progress indication for git fsck and git gc

2018-08-16 Thread Jeff King

On Thu, Aug 16, 2018 at 04:55:56PM -0400, Jeff King wrote:

> >  * We spend the majority of the ~30s on this:
> >
> > https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L70-L79
> 
> This is hashing the actual packfile. This is potentially quite long,
> especially if you have a ton of big objects.
> 
> I wonder if we need to do this as a separate step anyway, though. Our
> verification is based on index-pack these days, which means it's going
> to walk over the whole content as part of the "Indexing objects" step to
> expand base objects and mark deltas for later. Could we feed this hash
> as part of that walk over the data? It's not going to save us 30s, but
> it's likely to be more efficient. And it would fold the effort naturally
> into the existing progress meter.

Actually, I take it back. That's the nice, modern way we do it in
git-verify-pack. But git-fsck uses the ancient "just walk over all of
the idx entries method". It at least sorts in pack order, which is good,
but:

  - it's not multi-threaded, like index-pack/verify-pack

  - the index-pack way is actually more efficient than pack-ordering for
the delta-base cache, because it actually walks the delta-graph in
the optimal order

Once upon a time verify-pack used this same pack-check code, and we
switched in 3de89c9d42 (verify-pack: use index-pack --verify,
2011-06-03). So I suspect the right thing to do here is for fsck to
switch to that, too, and then delete pack-check.c entirely.

That may well involve us switching the progress to a per-pack view
(e.g., "checking pack 1/10 (10%)", followed by its own progress meter).
But I don't think that would be a bad thing. It's a more realistic view
of the work we're actually doing. Although perhaps it's misleading about
the total work remaining, because not all packs are the same size (so
you see that you're halfway through pack 2/10, but that's quite likely
to be half of the total work if it's the one gigantic pack).

-Peff

Re: non-smooth progress indication for git fsck and git gc

2018-08-16 Thread Jeff King

On Thu, Aug 16, 2018 at 10:35:53PM +0200, Ævar Arnfjörð Bjarmason wrote:

> This is all interesting, but I think unrelated to what Ulrich is talking
> about. Quote:
> 
> Between the two phases of "git fsck" (checking directories and
> checking objects) there was a break of several seconds where no
> progress was indicated
> 
> I.e. it's not about the pause you get with your testcase (which is
> certainly another issue) but the break between the two progress bars.

I think he's talking about both. What I said responds to this:

> >> During "git gc" the writing objects phase did not update for some
> >> seconds, but then the percentage counter jumped like from 15% to 42%.

But yeah, I missed that the fsck thing was specifically about a break
between two meters. That's a separate problem, but also worth
discussing (and hopefully much easier to address).

> If you fsck this repository it'll take around (on my spinning rust
> server) 30 seconds between 100% of "Checking object directories" before
> you get any output from "Checking objects".
> 
> The breakdown of that is (this is from approximate eyeballing):
> 
>  * We spend 1-3 seconds just on this:
>
> https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L181

OK, so that's checking the sha1 over the .idx file. We could put a meter
on that. I wouldn't expect it to generally be all that slow outside of
pathological cases, since it scales with the number of objects (and 1s
is our minimum update anyway, so that might be OK as-is). Your case has
13M objects, which is quite large.

>  * We spend the majority of the ~30s on this:
>
> https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L70-L79

This is hashing the actual packfile. This is potentially quite long,
especially if you have a ton of big objects.

I wonder if we need to do this as a separate step anyway, though. Our
verification is based on index-pack these days, which means it's going
to walk over the whole content as part of the "Indexing objects" step to
expand base objects and mark deltas for later. Could we feed this hash
as part of that walk over the data? It's not going to save us 30s, but
it's likely to be more efficient. And it would fold the effort naturally
into the existing progress meter.

>  * Wes spend another 3-5 seconds on this QSORT:
>
> https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L105

That's a tough one. I'm not sure how we'd count it (how many compares we
do?). And each item is doing so little work that hitting the progress
code may make things noticeably slower.

Again, your case is pretty big. Just based on the number of objects,
linux.git should be 1.5-2.5 seconds on your machine for the same
operation. Which I think may be small enough to ignore (or even just
print a generic before/after). It's really the 30s packfile hash that's
making the whole thing so terrible.

-Peff

Re: non-smooth progress indication for git fsck and git gc

2018-08-16 Thread Ævar Arnfjörð Bjarmason



On Thu, Aug 16 2018, Jeff King wrote:

> On Thu, Aug 16, 2018 at 08:54:25AM +0200, Ulrich Windl wrote:
>
>> I'd like to point out some minor issue observed while processing some
>> 5-object repository with many binary objects, but most are rather
>> small:
>>
>> Between the two phases of "git fsck" (checking directories and
>> checking objects) there was a break of several seconds where no
>> progress was indicated.
>>
>> During "git gc" the writing objects phase did not update for some
>> seconds, but then the percentage counter jumped like from 15% to 42%.
>>
>> I understand that updating the progress output too often can be a
>> performance bottleneck, while upating it too rarely might only bore
>> the user... ;-)
>
> We update the counter integer for every object we process, and then
> actually update the display whenever the percentage increases or a
> second has elapsed, whichever comes first.
>
> What you're seeing is likely an artifact of _what_ we're counting:
> written objects. Not all objects are the same size, so it's not uncommon
> to see thousands/sec when dealing with small ones, and then several
> seconds for one giant blob.
>
> The only way to solve that is to count bytes. We don't have a total byte
> count in most cases, and it wouldn't always make sense (e.g., the
> "Compressing objects" meter can show the same issue, but it's not really
> putting through bytes in a linear way).  In some cases we do show
> transmitted size and throughput, but that's just for network operations.
> We could do the same for "gc" with the patch below. But usually
> throughput isn't all that interesting for a filesystem write, because
> bandwidth isn't the bottleneck.
>
> Possibly we could have a "half throughput" mode that counts up the total
> size written, but omits the speed indicator. That's not an unreasonable
> thing to show for a local pack, since you end up with the final pack
> size. The object counter would still be jumpy, but you'd at least have
> one number updated at least once per second as you put through a large
> blob.
>
> If you really want a smooth percentage, then we'd have to start counting
> bytes instead of objects. Two reasons we don't do that are:
>
>   - we often don't know the total byte size exactly. E.g., for a
> packfile write, it depends on the result of deflating each object.
> You can make an approximation and just pretend at the end that you
> hit 100%.  Or you can count the pre-deflate sizes, but then your
> meter doesn't match the bytes from the throughput counter.
>
>   - for something like fsck, we're not actually writing out bytes.  So I
> guess you'd be measuring "here's how many bytes of objects I
> fsck-ed". But is that on-disk compressed bytes, or decompressed
> bytes?
>
> If the former, that's only marginally better as a measure of effort,
> since delta compression means that a small number of on-disk bytes
> may require a big effort (imagine processing a 100 byte blob versus
> a 100 byte delta off of a 100MB blob).
>
> The latter is probably more accurate. But it's also not free to
> pre-generate the total. We can get the number of objects or the size
> of the packfile in constant-time, but totaling up the uncompressed
> size of all objects is O(n). So that's extra computation, but it
> also means a potential lag before we can start the progress meter.
>
> I'm also not sure how meaningful a byte count is for a user there.
>
> So there. That's probably more than you wanted to know about Git's
> progress code. I think it probably _could_ be improved by counting
> more/different things, but I also think it can be a bit of a rabbit
> hole. Which is why AFAIK nobody's really looked too seriously into
> changing it.
>
> -Peff

This is all interesting, but I think unrelated to what Ulrich is talking
about. Quote:

Between the two phases of "git fsck" (checking directories and
checking objects) there was a break of several seconds where no
progress was indicated

I.e. it's not about the pause you get with your testcase (which is
certainly another issue) but the break between the two progress bars.

Here's a test case you can clone:
https://github.com/avar/2015-04-03-1M-git (or might already have
"locally" :)

If you fsck this repository it'll take around (on my spinning rust
server) 30 seconds between 100% of "Checking object directories" before
you get any output from "Checking objects".

The breakdown of that is (this is from approximate eyeballing):

 * We spend 1-3 seconds just on this:
   
https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L181

 * We spend the majority of the ~30s on this:
   
https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L70-L79

 * Wes spend another 3-5 seconds on this QSORT:
   
https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L105

I.e. it's not about

Re: non-smooth progress indication for git fsck and git gc

2018-08-16 Thread Jeff King

On Thu, Aug 16, 2018 at 11:57:14AM -0400, Jeff King wrote:

> The only way to solve that is to count bytes. We don't have a total byte
> count in most cases, and it wouldn't always make sense (e.g., the
> "Compressing objects" meter can show the same issue, but it's not really
> putting through bytes in a linear way).  In some cases we do show
> transmitted size and throughput, but that's just for network operations.
> We could do the same for "gc" with the patch below. But usually
> throughput isn't all that interesting for a filesystem write, because
> bandwidth isn't the bottleneck.

Just realized I forgot to include the patch. Here it is, for reference.

Doing something similar for fsck would be quite a bit more invasive.

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 80c880e9ad..e1130b959d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -837,7 +837,7 @@ static void write_pack_file(void)
if (pack_to_stdout)
f = hashfd_throughput(1, "", progress_state);
else
-   f = create_tmp_packfile(&pack_tmp_name);
+   f = create_tmp_packfile(&pack_tmp_name, progress_state);
 
offset = write_pack_header(f, nr_remaining);
 
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 9f3b644811..0df45b8f55 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -178,7 +178,7 @@ static void prepare_to_stream(struct bulk_checkin_state 
*state,
if (!(flags & HASH_WRITE_OBJECT) || state->f)
return;
 
-   state->f = create_tmp_packfile(&state->pack_tmp_name);
+   state->f = create_tmp_packfile(&state->pack_tmp_name, NULL);
reset_pack_idx_option(&state->pack_idx_opts);
 
/* Pretend we are going to write only one object */
diff --git a/pack-write.c b/pack-write.c
index a9d46bc03f..b72480b440 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -334,14 +334,15 @@ int encode_in_pack_object_header(unsigned char *hdr, int 
hdr_len,
return n;
 }
 
-struct hashfile *create_tmp_packfile(char **pack_tmp_name)
+struct hashfile *create_tmp_packfile(char **pack_tmp_name,
+struct progress *progress)
 {
struct strbuf tmpname = STRBUF_INIT;
int fd;
 
fd = odb_mkstemp(&tmpname, "pack/tmp_pack_XX");
*pack_tmp_name = strbuf_detach(&tmpname, NULL);
-   return hashfd(fd, *pack_tmp_name);
+   return hashfd_throughput(fd, *pack_tmp_name, progress);
 }
 
 void finish_tmp_packfile(struct strbuf *name_buffer,
diff --git a/pack.h b/pack.h
index 34a9d458b4..c87628b093 100644
--- a/pack.h
+++ b/pack.h
@@ -98,7 +98,8 @@ extern int encode_in_pack_object_header(unsigned char *hdr, 
int hdr_len,
 #define PH_ERROR_PROTOCOL  (-3)
 extern int read_pack_header(int fd, struct pack_header *);
 
-extern struct hashfile *create_tmp_packfile(char **pack_tmp_name);
+extern struct hashfile *create_tmp_packfile(char **pack_tmp_name,
+   struct progress *progress);
 extern void finish_tmp_packfile(struct strbuf *name_buffer, const char 
*pack_tmp_name, struct pack_idx_entry **written_list, uint32_t nr_written, 
struct pack_idx_option *pack_idx_opts, unsigned char sha1[]);
 
 #endif

Re: non-smooth progress indication for git fsck and git gc

2018-08-16 Thread Jeff King

On Thu, Aug 16, 2018 at 05:18:51PM +0200, Duy Nguyen wrote:

> > During "git gc" the writing objects phase did not update for some
> > seconds, but then the percentage counter jumped like from 15% to
> > 42%.
> [...]
> 
> Is it possible to make this repository public? You can also use "git
> fast-export --anonymize" to make a repo with same structure but no
> real content (but read the man page about that option first)

Try this:

-- >8 --
git init repo
cd repo

# one big blob
dd if=/dev/urandom of=big bs=1M count=100
git add big
git commit -m big

# several small blobs
for i in $(seq 10); do
  echo $i >file
  git add file
  git commit -m $i
done

git gc
-- >8 --

It "stalls" at 33%, and then jumps straight to 100%.

-Peff

Re: non-smooth progress indication for git fsck and git gc

2018-08-16 Thread Jeff King

On Thu, Aug 16, 2018 at 08:54:25AM +0200, Ulrich Windl wrote:

> I'd like to point out some minor issue observed while processing some
> 5-object repository with many binary objects, but most are rather
> small:
> 
> Between the two phases of "git fsck" (checking directories and
> checking objects) there was a break of several seconds where no
> progress was indicated.
> 
> During "git gc" the writing objects phase did not update for some
> seconds, but then the percentage counter jumped like from 15% to 42%.
> 
> I understand that updating the progress output too often can be a
> performance bottleneck, while upating it too rarely might only bore
> the user... ;-)

We update the counter integer for every object we process, and then
actually update the display whenever the percentage increases or a
second has elapsed, whichever comes first.

What you're seeing is likely an artifact of _what_ we're counting:
written objects. Not all objects are the same size, so it's not uncommon
to see thousands/sec when dealing with small ones, and then several
seconds for one giant blob.

The only way to solve that is to count bytes. We don't have a total byte
count in most cases, and it wouldn't always make sense (e.g., the
"Compressing objects" meter can show the same issue, but it's not really
putting through bytes in a linear way).  In some cases we do show
transmitted size and throughput, but that's just for network operations.
We could do the same for "gc" with the patch below. But usually
throughput isn't all that interesting for a filesystem write, because
bandwidth isn't the bottleneck.

Possibly we could have a "half throughput" mode that counts up the total
size written, but omits the speed indicator. That's not an unreasonable
thing to show for a local pack, since you end up with the final pack
size. The object counter would still be jumpy, but you'd at least have
one number updated at least once per second as you put through a large
blob.

If you really want a smooth percentage, then we'd have to start counting
bytes instead of objects. Two reasons we don't do that are:

  - we often don't know the total byte size exactly. E.g., for a
packfile write, it depends on the result of deflating each object.
You can make an approximation and just pretend at the end that you
hit 100%.  Or you can count the pre-deflate sizes, but then your
meter doesn't match the bytes from the throughput counter.

  - for something like fsck, we're not actually writing out bytes.  So I
guess you'd be measuring "here's how many bytes of objects I
fsck-ed". But is that on-disk compressed bytes, or decompressed
bytes?

If the former, that's only marginally better as a measure of effort,
since delta compression means that a small number of on-disk bytes
may require a big effort (imagine processing a 100 byte blob versus
a 100 byte delta off of a 100MB blob).

The latter is probably more accurate. But it's also not free to
pre-generate the total. We can get the number of objects or the size
of the packfile in constant-time, but totaling up the uncompressed
size of all objects is O(n). So that's extra computation, but it
also means a potential lag before we can start the progress meter.

I'm also not sure how meaningful a byte count is for a user there.

So there. That's probably more than you wanted to know about Git's
progress code. I think it probably _could_ be improved by counting
more/different things, but I also think it can be a bit of a rabbit
hole. Which is why AFAIK nobody's really looked too seriously into
changing it.

-Peff

Re: non-smooth progress indication for git fsck and git gc

2018-08-16 Thread Duy Nguyen

On Thu, Aug 16, 2018 at 1:10 PM Ulrich Windl
 wrote:
>
> Hi!
>
> I'd like to point out some minor issue observed while processing some 
> 5-object repository with many binary objects, but most are rather small:
>
> Between the two phases of "git fsck" (checking directories and checking 
> objects) there was a break of several seconds where no progress was indicated.
>
> During "git gc" the writing objects phase did not update for some seconds, 
> but then the percentage counter jumped like from 15% to 42%.
>
> I understand that updating the progress output too often can be a performance 
> bottleneck, while upating it too rarely might only bore the user... ;-)
>
> But maybe something can be done. My git version is 2.13.7 (openSUSE 42.3).

Is it possible to make this repository public? You can also use "git
fast-export --anonymize" to make a repo with same structure but no
real content (but read the man page about that option first)

> Regards,
> Ulrich
>
>


-- 
Duy

non-smooth progress indication for git fsck and git gc

2018-08-15 Thread Ulrich Windl

Hi!

I'd like to point out some minor issue observed while processing some 
5-object repository with many binary objects, but most are rather small:

Between the two phases of "git fsck" (checking directories and checking 
objects) there was a break of several seconds where no progress was indicated.

During "git gc" the writing objects phase did not update for some seconds, but 
then the percentage counter jumped like from 15% to 42%.

I understand that updating the progress output too often can be a performance 
bottleneck, while upating it too rarely might only bore the user... ;-)

But maybe something can be done. My git version is 2.13.7 (openSUSE 42.3).

Regards,
Ulrich

Re: non-smooth progress indication for git fsck and git gc

Re: non-smooth progress indication for git fsck and git gc

Re: non-smooth progress indication for git fsck and git gc

Re: non-smooth progress indication for git fsck and git gc

Re: non-smooth progress indication for git fsck and git gc

Re: non-smooth progress indication for git fsck and git gc

Re: non-smooth progress indication for git fsck and git gc

Re: non-smooth progress indication for git fsck and git gc

Re: Antw: Re: non-smooth progress indication for git fsck and git gc

Re: Antw: Re: non-smooth progress indication for git fsck and git gc

Re: Antw: Re: non-smooth progress indication for git fsck and git gc

Re: Antw: Re: non-smooth progress indication for git fsck and git gc

Re: Antw: Re: non-smooth progress indication for git fsck and git gc

Antw: Re: non-smooth progress indication for git fsck and git gc

Antw: Re: non-smooth progress indication for git fsck and git gc

Re: non-smooth progress indication for git fsck and git gc

Re: non-smooth progress indication for git fsck and git gc

Re: non-smooth progress indication for git fsck and git gc

Re: non-smooth progress indication for git fsck and git gc

Re: non-smooth progress indication for git fsck and git gc

Re: non-smooth progress indication for git fsck and git gc

Re: non-smooth progress indication for git fsck and git gc

Re: non-smooth progress indication for git fsck and git gc

Re: non-smooth progress indication for git fsck and git gc

non-smooth progress indication for git fsck and git gc

25 matches

Site Navigation

Mail list logo

Footer information