Re: [PATCH 2/2] fsck: use oidset for skiplist

2018-10-02 Thread Jeff King
On Tue, Oct 02, 2018 at 09:05:32PM +0200, René Scharfe wrote:

> > The reason hashmap.c was added was to avoid open addressing. ;)
> Because efficient removal of elements is easier to implement with
> chaining, according to 6a364ced49 (add a hashtable implementation that
> supports O(1) removal).  khash.h deletes using its flags bitmap.  We
> didn't compare their performance when entries are removed so far.

I think it may depend on your workload. Open-addressing generally uses a
tombstone, so you're still dealing with the "deleted" entries until the
next table resize. I suspect that's fine in most cases, but I also am
sure you could find a benchmark that favors the chained approach (I
think in most cases we actually never delete at all -- we simply fill up
a table and then eventually clear it).
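To make the tombstone point concrete, here is a minimal open-addressed set with tombstone deletion. This is a toy sketch for illustration only (not git code): fixed table size, no resizing, and it assumes the table never fills. The key detail is that lookups must probe *past* tombstones, so deleted entries keep costing probe steps until a rehash.

```c
#include <assert.h>

#define NSLOTS 8	/* power of two; no resizing in this toy */

enum slot_state { EMPTY, USED, TOMB };

struct toyset {
	enum slot_state st[NSLOTS];
	unsigned key[NSLOTS];
};

static int toy_contains(const struct toyset *s, unsigned k)
{
	unsigned i = k & (NSLOTS - 1), n;

	for (n = 0; n < NSLOTS; n++, i = (i + 1) & (NSLOTS - 1)) {
		if (s->st[i] == USED && s->key[i] == k)
			return 1;
		if (s->st[i] == EMPTY)	/* a tombstone does NOT stop the probe */
			return 0;
	}
	return 0;
}

static void toy_insert(struct toyset *s, unsigned k)
{
	unsigned i = k & (NSLOTS - 1);

	if (toy_contains(s, k))
		return;
	while (s->st[i] == USED)	/* EMPTY and TOMB slots are both reusable */
		i = (i + 1) & (NSLOTS - 1);
	s->st[i] = USED;
	s->key[i] = k;
}

static int toy_remove(struct toyset *s, unsigned k)
{
	unsigned i = k & (NSLOTS - 1), n;

	for (n = 0; n < NSLOTS; n++, i = (i + 1) & (NSLOTS - 1)) {
		if (s->st[i] == USED && s->key[i] == k) {
			/* mark, don't empty: later keys may have probed past here */
			s->st[i] = TOMB;
			return 1;
		}
		if (s->st[i] == EMPTY)
			return 0;
	}
	return 0;
}
```

A chained table like hashmap.c, by contrast, can unlink a deleted node immediately and leave no residue behind, which is the trade-off 6a364ced49 was after.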

> > So yeah, I think it could perhaps be improved, but in my mind talking
> > about "hashmap.c" is fundamentally talking about chained buckets.
> 
> Admittedly I wouldn't touch hashmap.c, as I find its interface too
> complex to wrap my head around.  But perhaps I just didn't try hard
> enough, yet.

FWIW, it's not just you. ;)

> > Yeah. And if it really does perform better, I think we should stick with
> > it in the code base. I wonder if we could stand to clean up the
> > interfaces a little.  E.g., I had a hard time declaring a hash in one
> > place, and then defining it somewhere else.
> 
> You can't use KHASH_DECLARE and KHASH_INIT together, as both declare
> the same structs.  So I guess the idea is to have a header file with
> KHASH_DECLARE and a .c file with KHASH_INIT, the latter *not* including
> the former, but both including khash.h.  I didn't actually try that,
> though.

Yeah, that seems weird. You'd want to include one from the other to make
sure that they both match.
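For what it's worth, the arrangement René describes might be laid out like this (a hypothetical sketch of the file split, not a tested configuration; it assumes `oid_hash` and `oid_equal` are visible to the .c file):

```
/* oidset.h: declare the structs and function prototypes once */
#include "khash.h"
KHASH_DECLARE(oid, struct object_id, int)

/* oidset.c: emit the definitions; must NOT include oidset.h, because
 * KHASH_INIT declares the same structs that KHASH_DECLARE already did */
#include "khash.h"
KHASH_INIT(oid, struct object_id, int, 0, oid_hash, oid_equal)
```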

By the way, if you do want to pursue changes, I have no problem at all
hacking up khash into something that can't be merged with its upstream.
It's nice that it's a well-used and tested library, but I'd much rather
have something that we on this project understand (and that matches our
conventions and style).

> > This is kind of a layering violation, too. You're assuming that struct
> > assignment is sufficient to make one kh struct freeable from another
> > pointer. That's probably reasonable, since you're just destroying them
> > both (e.g., some of our FLEX structs point into their own struct memory,
> > making a hidden dependency; but they obviously would not need to free
> > such a field).
> 
> Fair enough.  How about this on top?  (The khash.h part would go in
> first in a separate patch in a proper series.)

Yes, much nicer, and the khash change wasn't too painful.

-Peff


Re: [PATCH 2/2] fsck: use oidset for skiplist

2018-10-02 Thread René Scharfe
Am 01.10.2018 um 22:26 schrieb Jeff King:
> On Mon, Oct 01, 2018 at 09:15:53PM +0200, René Scharfe wrote:
> The reason hashmap.c was added was to avoid open addressing. ;)
Because efficient removal of elements is easier to implement with
chaining, according to 6a364ced49 (add a hashtable implementation that
supports O(1) removal).  khash.h deletes using its flags bitmap.  We
didn't compare their performance when entries are removed so far.

> So yeah, I think it could perhaps be improved, but in my mind talking
> about "hashmap.c" is fundamentally talking about chained buckets.

Admittedly I wouldn't touch hashmap.c, as I find its interface too
complex to wrap my head around.  But perhaps I just didn't try hard
enough, yet.

>> But I like how khash.h is both already in the tree and also really easy
>> to deploy, as it's just a single header file.  It's a tasty low-hanging
>> fruit.
> 
> Yeah. And if it really does perform better, I think we should stick with
> it in the code base. I wonder if we could stand to clean up the
> interfaces a little.  E.g., I had a hard time declaring a hash in one
> place, and then defining it somewhere else.

You can't use KHASH_DECLARE and KHASH_INIT together, as both declare
the same structs.  So I guess the idea is to have a header file with
KHASH_DECLARE and a .c file with KHASH_INIT, the latter *not* including
the former, but both including khash.h.  I didn't actually try that,
though.

> And I think as you found
> that it insists on heap-allocating the hash-table struct itself, which
> does not match our usual style.

Perhaps we can fix that with little effort (see below).

>> This is straight-forward, except for oidset_clear(), which needs to
>> allocate a kh_oid_t on the heap in order to be able to feed it to
> >> kh_destroy_oid() to release it.  Alternatively we could open-code the

>> relevant parts of the latter, but that would be a layering violation.
> 
> This is kind of a layering violation, too. You're assuming that struct
> assignment is sufficient to make one kh struct freeable from another
> pointer. That's probably reasonable, since you're just destroying them
> both (e.g., some of our FLEX structs point into their own struct memory,
> making a hidden dependency; but they obviously would not need to free
> such a field).

Fair enough.  How about this on top?  (The khash.h part would go in
first in a separate patch in a proper series.)

NB: I stuck to the 4-spaces-tabs formatting in khash.h here.

---
 khash.h  | 9 +++--
 oidset.c | 4 +---
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/khash.h b/khash.h
index 07b4cc2e67..d10caa0c35 100644
--- a/khash.h
+++ b/khash.h
@@ -82,11 +82,16 @@ static const double __ac_HASH_UPPER = 0.77;
     SCOPE kh_##name##_t *kh_init_##name(void) {                          \
         return (kh_##name##_t*)xcalloc(1, sizeof(kh_##name##_t));        \
     }                                                                    \
+    SCOPE void kh_release_##name(kh_##name##_t *h)                       \
+    {                                                                    \
+        free(h->flags);                                                  \
+        free((void *)h->keys);                                           \
+        free((void *)h->vals);                                           \
+    }                                                                    \
     SCOPE void kh_destroy_##name(kh_##name##_t *h)                       \
     {                                                                    \
         if (h) {                                                         \
-            free((void *)h->keys); free(h->flags);                       \
-            free((void *)h->vals);                                       \
+            kh_release_##name(h);                                        \
             free(h);                                                     \
         }                                                                \
     }
  

Re: [PATCH 2/2] fsck: use oidset for skiplist

2018-10-01 Thread Jeff King
On Mon, Oct 01, 2018 at 09:15:53PM +0200, René Scharfe wrote:

> Am 28.08.2018 um 01:03 schrieb Jeff King:
> > On Sun, Aug 26, 2018 at 01:37:41PM +0200, René Scharfe wrote:
> >> So it seems worth it.
> > 
> > Hmm, that really does. Which is a shame, because I hoped that one day we
> > could get rid of the nasty macro-infestation that is khash.h. But it
> > really is a lot faster than the alternatives.
> 
> Well, we only compare it to hashmap.c here.  There might be better
> implementations out there.  Hash tables in plain old C seem to be much
> harder to find than e.g. in C++, though.

I think it may be either-or here. The speed benefit to using khash here
is that we stick the data into the hash table itself, rather than
incurring a separate malloc for each entry. Which is pretty hard to do
in C without macros (the alternative is lots of void pointers, and
telling the hash table "your size is X bytes").
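The void-pointer alternative Peff alludes to — telling a container its element size and letting it memcpy opaque bytes — looks roughly like this (an illustrative sketch, not a proposed git API, shown for a flat array rather than a full hash table):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * A macro-free "generic" container: the caller supplies the element
 * size up front, and the container shuffles bytes it cannot
 * type-check.  Every access goes through an untyped pointer, which is
 * exactly what khash's macro-generated, type-specialized code avoids.
 */
struct flatvec {
	char *data;
	size_t elt_size, nr, alloc;
};

static void flatvec_init(struct flatvec *v, size_t elt_size)
{
	memset(v, 0, sizeof(*v));
	v->elt_size = elt_size;
}

static void flatvec_push(struct flatvec *v, const void *elt)
{
	if (v->nr == v->alloc) {
		v->alloc = v->alloc ? 2 * v->alloc : 16;
		v->data = realloc(v->data, v->alloc * v->elt_size);
		if (!v->data)
			abort();	/* error handling elided in this sketch */
	}
	memcpy(v->data + v->nr++ * v->elt_size, elt, v->elt_size);
}

static void *flatvec_at(struct flatvec *v, size_t i)
{
	return v->data + i * v->elt_size;	/* caller must cast back */
}
```

The inline storage avoids the per-entry malloc, but the caller pays with casts on every access; khash's macros generate the same flat layout with real types instead.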

> And then there are several possible variations that affect
> performance -- perhaps hashmap.c could be improved by using open
> addressing, maybe with Robin Hood hashing, and/or by offering better
> support for sets, and/or by having a few macros to allow type
> specialization.

The reason hashmap.c was added was to avoid open addressing. ;)

So yeah, I think it could perhaps be improved, but in my mind talking
about "hashmap.c" is fundamentally talking about chained buckets.

But whatever you want to call it, I would be happy with a more type-safe
and performant hashmap.

> But I like how khash.h is both already in the tree and also really easy
> to deploy, as it's just a single header file.  It's a tasty low-hanging
> fruit.

Yeah. And if it really does perform better, I think we should stick with
it in the code base. I wonder if we could stand to clean up the
interfaces a little.  E.g., I had a hard time declaring a hash in one
place, and then defining it somewhere else. And I think as you found
that it insists on heap-allocating the hash-table struct itself, which
does not match our usual style.

> > Do you want to pick up this topic and carry it forward?
> 
> Well, I tried to simplify the implementation as much as possible and
> ended up doing the opposite of what I wrote earlier.  Hmm.
> 
> The patch below postpones struct allocation until cleanup time, which is
> a bit weird.  We can't avoid it fully without reimplementing kh_destroy,
> but we can use structs supplied by callers for basically all other
> operations.  That avoids NULL checks, and the main benefits of that are
> simplicity and safety; performance is not much better without them.

Your patch looks OK from a cursory view. I actually think that retaining
the extra NULL encapsulation is not the worst thing in the world. Most
callers should be going through the oid_* functions anyway.

But if we want to take the time to refactor khash (or even write our own
version, taking inspiration from what's there), the result would be
better still. If you're not planning to work on that in the near future,
though, I'd be OK with either of the alternates (the extra level of
pointer indirection from earlier, or this slight kh_destroy hackery
here).

> -- >8 --
> Subject: [PATCH] oidset: use khash
> 
> Reimplement struct oidset using khash.h in order to reduce its memory
> footprint and make it faster.
> 
> This is straight-forward, except for oidset_clear(), which needs to
> allocate a kh_oid_t on the heap in order to be able to feed it to
> kh_destroy_oid() to release it.  Alternatively we could open-code the
> relevant parts of the latter, but that would be a layering violation.

This is kind of a layering violation, too. You're assuming that struct
assignment is sufficient to make one kh struct freeable from another
pointer. That's probably reasonable, since you're just destroying them
both (e.g., some of our FLEX structs point into their own struct memory,
making a hidden dependency; but they obviously would not need to free
such a field).

-Peff


Re: [PATCH 2/2] fsck: use oidset for skiplist

2018-10-01 Thread René Scharfe
Am 28.08.2018 um 01:03 schrieb Jeff King:
> On Sun, Aug 26, 2018 at 01:37:41PM +0200, René Scharfe wrote:
>> So it seems worth it.
> 
> Hmm, that really does. Which is a shame, because I hoped that one day we
> could get rid of the nasty macro-infestation that is khash.h. But it
> really is a lot faster than the alternatives.

Well, we only compare it to hashmap.c here.  There might be better
implementations out there.  Hash tables in plain old C seem to be much
harder to find than e.g. in C++, though.

And then there are several possible variations that affect
performance -- perhaps hashmap.c could be improved by using open
addressing, maybe with Robin Hood hashing, and/or by offering better
support for sets, and/or by having a few macros to allow type
specialization.

But I like how khash.h is both already in the tree and also really easy
to deploy, as it's just a single header file.  It's a tasty low-hanging
fruit.

> Do you want to pick up this topic and carry it forward?

Well, I tried to simplify the implementation as much as possible and
ended up doing the opposite of what I wrote earlier.  Hmm.

The patch below postpones struct allocation until cleanup time, which is
a bit weird.  We can't avoid it fully without reimplementing kh_destroy,
but we can use structs supplied by callers for basically all other
operations.  That avoids NULL checks, and the main benefits of that are
simplicity and safety; performance is not much better without them.

This patch doesn't provide a _count (or _is_empty) method.  It would be
cleaner to have one added as the first step and just change its
implementation when switching to khash, but it's harder than expected
with hashmap.c; doing that later (if at all) is easier.

Using sha1hash instead of open-coding it may seem a bit anachronistic,
but is the best way to document the usage of that pattern (i.e. to
truncate a SHA1 hash to get a shorter one for a hash table).
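The pattern René refers to — truncating an object name to an unsigned int, which works because SHA-1 output is already uniformly distributed — can be sketched like this (a standalone illustration of the idea; git's own sha1hash() is defined elsewhere in the tree):

```c
#include <assert.h>
#include <string.h>

/*
 * Truncate a 20-byte SHA-1 to an unsigned int for use as a hash-table
 * code.  No mixing step is needed: the input bytes are already
 * uniformly distributed, so the first few bytes make a good hash.
 */
static unsigned int hash_from_sha1(const unsigned char *sha1)
{
	unsigned int hash;

	memcpy(&hash, sha1, sizeof(hash));	/* memcpy avoids alignment traps */
	return hash;
}
```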

-- >8 --
Subject: [PATCH] oidset: use khash

Reimplement struct oidset using khash.h in order to reduce its memory
footprint and make it faster.

This is straight-forward, except for oidset_clear(), which needs to
allocate a kh_oid_t on the heap in order to be able to feed it to
kh_destroy_oid() to release it.  Alternatively we could open-code the
relevant parts of the latter, but that would be a layering violation.

Performance of a command that mainly checks for duplicate objects
using an oidset, with master and Clang 6.0.1:

$ cmd="./git-cat-file --batch-all-objects --unordered --buffer --batch-check='%(objectname)'"

$ /usr/bin/time $cmd >/dev/null
0.22user 0.03system 0:00.25elapsed 99%CPU (0avgtext+0avgdata 48484maxresident)k
0inputs+0outputs (0major+11204minor)pagefaults 0swaps

$ hyperfine "$cmd"
Benchmark #1: ./git-cat-file --batch-all-objects --unordered --buffer --batch-check='%(objectname)'

  Time (mean ± σ):     250.0 ms ±   6.0 ms    [User: 225.9 ms, System: 23.6 ms]

  Range (min … max):   242.0 ms … 261.1 ms

And with this patch:

$ /usr/bin/time $cmd >/dev/null
0.14user 0.00system 0:00.15elapsed 100%CPU (0avgtext+0avgdata 41396maxresident)k
0inputs+0outputs (0major+8318minor)pagefaults 0swaps

$ hyperfine "$cmd"
Benchmark #1: ./git-cat-file --batch-all-objects --unordered --buffer --batch-check='%(objectname)'

  Time (mean ± σ):     151.9 ms ±   4.9 ms    [User: 130.5 ms, System: 21.2 ms]

  Range (min … max):   148.2 ms … 170.4 ms

Initial-patch-by: Jeff King 
Signed-off-by: Rene Scharfe 
---
 fetch-pack.c |  2 +-
 oidset.c | 36 ++--
 oidset.h | 36 
 3 files changed, 43 insertions(+), 31 deletions(-)

diff --git a/fetch-pack.c b/fetch-pack.c
index 75047a4b2a..a839315726 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -536,7 +536,7 @@ static int tip_oids_contain(struct oidset *tip_oids,
 * add to "newlist" between calls, the additions will always be for
 * oids that are already in the set.
 */
-   if (!tip_oids->map.map.tablesize) {
+   if (!tip_oids->set.n_buckets) {
add_refs_to_oidset(tip_oids, unmatched);
add_refs_to_oidset(tip_oids, newlist);
}
diff --git a/oidset.c b/oidset.c
index 454c54f933..d15b2b7a89 100644
--- a/oidset.c
+++ b/oidset.c
@@ -3,38 +3,30 @@
 
 int oidset_contains(const struct oidset *set, const struct object_id *oid)
 {
-	if (!set->map.map.tablesize)
-		return 0;
-	return !!oidmap_get(&set->map, oid);
+	khiter_t pos = kh_get_oid(&set->set, *oid);
+	return pos != kh_end(&set->set);
 }
 
 int oidset_insert(struct oidset *set, const struct object_id *oid)
 {
-	struct oidmap_entry *entry;
-
-	if (!set->map.map.tablesize)
-		oidmap_init(&set->map, 0);
-	else if (oidset_contains(set, oid))
-		return 1;
-
-	entry = xmalloc(sizeof(*entry));
-	oidcpy(&entry->oid, oid);
-
-	oidmap_put(&set->map, entry);
-	return 0;
+	int added;
+	

Re: [PATCH 2/2] fsck: use oidset for skiplist

2018-08-27 Thread Jeff King
On Sun, Aug 26, 2018 at 01:37:41PM +0200, René Scharfe wrote:

> Am 14.08.2018 um 03:58 schrieb Jeff King:
> > Your suggestion can be implemented using khash (my patch below).
> > 
> >> Before:
> >> Benchmark #1: ./git-cat-file --batch-all-objects --buffer --unordered --batch-check='%(objectname)'
> >>
> >>   Time (mean ± σ):     269.5 ms ±  26.7 ms    [User: 247.7 ms, System: 21.4 ms]
> >>
> >>   Range (min … max):   240.3 ms … 339.3 ms
> >>
> >> After:
> >> Benchmark #1: ./git-cat-file --batch-all-objects --buffer --unordered --batch-check='%(objectname)'
> >>
> >>   Time (mean ± σ):     224.2 ms ±  18.2 ms    [User: 201.7 ms, System: 22.1 ms]
> >>
> >>   Range (min … max):   205.0 ms … 259.0 ms
> > 
> > Yeah. My best-of-five dropped from 300ms to 247ms. That 300 was using
> > the memory pool, though khash's deletion strategy isn't all that
> > different (the wasted memory hangs around until the next hash resize,
> > but if you're evenly dropping and adding, you likely won't need to
> > resize).
> 
> With your khash patch:
> 
> Benchmark #1: ./git-cat-file --batch-all-objects --buffer --unordered --batch-check='%(objectname)'
> 
>   Time (mean ± σ):     159.1 ms ±  20.5 ms    [User: 140.3 ms, System: 18.5 ms]
> 
>   Range (min … max):   140.0 ms … 214.0 ms
> 
> So it seems worth it.

Hmm, that really does. Which is a shame, because I hoped that one day we
could get rid of the nasty macro-infestation that is khash.h. But it
really is a lot faster than the alternatives.

> [...]

I agree with all of your comments here. What I posted was just trying to
do the least amount of work to get something we could time.

Do you want to pick up this topic and carry it forward?

-Peff


Re: [PATCH 2/2] fsck: use oidset for skiplist

2018-08-26 Thread René Scharfe
Am 14.08.2018 um 03:58 schrieb Jeff King:
> Your suggestion can be implemented using khash (my patch below).
> 
>> Before:
>> Benchmark #1: ./git-cat-file --batch-all-objects --buffer --unordered --batch-check='%(objectname)'
>>
>>   Time (mean ± σ):     269.5 ms ±  26.7 ms    [User: 247.7 ms, System: 21.4 ms]
>>
>>   Range (min … max):   240.3 ms … 339.3 ms
>>
>> After:
>> Benchmark #1: ./git-cat-file --batch-all-objects --buffer --unordered --batch-check='%(objectname)'
>>
>>   Time (mean ± σ):     224.2 ms ±  18.2 ms    [User: 201.7 ms, System: 22.1 ms]
>>
>>   Range (min … max):   205.0 ms … 259.0 ms
> 
> Yeah. My best-of-five dropped from 300ms to 247ms. That 300 was using
> the memory pool, though khash's deletion strategy isn't all that
> different (the wasted memory hangs around until the next hash resize,
> but if you're evenly dropping and adding, you likely won't need to
> resize).

With your khash patch:

Benchmark #1: ./git-cat-file --batch-all-objects --buffer --unordered --batch-check='%(objectname)'

  Time (mean ± σ):     159.1 ms ±  20.5 ms    [User: 140.3 ms, System: 18.5 ms]

  Range (min … max):   140.0 ms … 214.0 ms

So it seems worth it.

> Anyway, here's the khash patch for reference.
> 
> diff --git a/fetch-pack.c b/fetch-pack.c
> index 5714bcbddd..5a86b10a5e 100644
> --- a/fetch-pack.c
> +++ b/fetch-pack.c
> @@ -534,7 +534,7 @@ static int tip_oids_contain(struct oidset *tip_oids,
>* add to "newlist" between calls, the additions will always be for
>* oids that are already in the set.
>*/
> - if (!tip_oids->map.map.tablesize) {
> + if (!tip_oids->map) {
>   add_refs_to_oidset(tip_oids, unmatched);
>   add_refs_to_oidset(tip_oids, newlist);
>   }

The caller shouldn't look at the private parts of the implementation
like this.  tablesize is the allocation count, which becomes non-zero
if at least one entry was added.  tip_oids is only inserted into, never
deleted from, so a count or size function could be used instead as a
cleaner interface.  (In a separate patch..)

> diff --git a/oidset.c b/oidset.c
> index 454c54f933..2964b43b2d 100644
> --- a/oidset.c
> +++ b/oidset.c
> @@ -3,38 +3,44 @@
>  
>  int oidset_contains(const struct oidset *set, const struct object_id *oid)
>  {
> - if (!set->map.map.tablesize)
> + khiter_t pos;
> +
> + if (!set->map)
>   return 0;
> - return !!oidmap_get(&set->map, oid);
> +
> + pos = kh_get_oid(set->map, *oid);
> + return pos < kh_end(set->map);
>  }
>  
>  int oidset_insert(struct oidset *set, const struct object_id *oid)
>  {
> - struct oidmap_entry *entry;
> + int hash_ret;
>  
> - if (!set->map.map.tablesize)
> - oidmap_init(&set->map, 0);
> - else if (oidset_contains(set, oid))
> - return 1;
> + if (!set->map)
> + set->map = kh_init_oid();
>  
> - entry = xmalloc(sizeof(*entry));
> - oidcpy(&entry->oid, oid);
> -
> - oidmap_put(&set->map, entry);
> - return 0;
> + kh_put_oid(set->map, *oid, &hash_ret);
> + return !hash_ret;
>  }

So initialization is deferred to the first insert, and the empty set
can be represented in two ways -- map == NULL and map->size == 0.

I wondered about the performance impact of all those NULL checks at
insert and lookup and converted it to stack storage, with a dirty
hand-rolled oidset_clear() implementation.  It wasn't any faster.

>  
>  int oidset_remove(struct oidset *set, const struct object_id *oid)
>  {
> - struct oidmap_entry *entry;
> + khiter_t pos;
>  
> - entry = oidmap_remove(&set->map, oid);
> - free(entry);
> + if (!set->map)
> + return 0;
> +
> + pos = kh_get_oid(set->map, *oid);
> + if (pos < kh_end(set->map)) {
> + kh_del_oid(set->map, pos);
> + return 1;
> + }
>  
> - return (entry != NULL);
> + return 0;
>  }
>  
>  void oidset_clear(struct oidset *set)
>  {
> - oidmap_free(&set->map, 1);
> + kh_destroy_oid(set->map);
> + set->map = NULL;
>  }
> diff --git a/oidset.h b/oidset.h
> index 40ec5f87fe..4c4c5a42fe 100644
> --- a/oidset.h
> +++ b/oidset.h
> @@ -2,6 +2,7 @@
>  #define OIDSET_H
>  
>  #include "oidmap.h"

This can go.

> +#include "khash.h"
>  
>  /**
>   * This API is similar to sha1-array, in that it maintains a set of object 
> ids
> @@ -15,19 +16,34 @@
>   *  table overhead.
>   */
>  
> +static inline unsigned int oid_hash(const struct object_id oid)
> +{
> + unsigned int hash;
> + memcpy(&hash, oid.hash, sizeof(hash));
> + return hash;
> +}
> +
> +static inline int oid_equal(const struct object_id a,
> + const struct object_id b)
> +{
> + return !oidcmp(&a, &b);
> +}

Look, it's oideq() from that other series in disguise! :)

> +
> +KHASH_INIT(oid, struct object_id, int, 0, oid_hash, oid_equal)

Note to self: The 0 is for kh_is_map, which means that no values are
kept and the given value type (int) doesn't matter.

Re: [PATCH 2/2] fsck: use oidset for skiplist

2018-08-25 Thread René Scharfe
Am 11.08.2018 um 18:54 schrieb Ævar Arnfjörð Bjarmason:
> 
> On Sat, Aug 11 2018, René Scharfe wrote:
> 
>> Object IDs to skip are stored in a shared static oid_array.  Lookups do
>> a binary search on the sorted array.  The code checks if the object IDs
>> are already in the correct order while loading and skips sorting in that
>> case.
> 
> I think this change makes sense, but it's missing an update to the
> relevant documentation in Documentation/config.txt:
> 
> fsck.skipList::
>   The path to a sorted list of object names (i.e. one SHA-1 per
>   line) that are known to be broken in a non-fatal way and should
>   be ignored. This feature is useful when an established project
>   should be accepted despite early commits containing errors that
>   can be safely ignored such as invalid committer email addresses.
>   Note: corrupt objects cannot be skipped with this setting.
> 
> Also, while I use the skipList feature it's for something on the order
> of 10-100 objects, so whatever algorithm the lookup uses isn't going to
> matter, but I think it's interesting to describe the trade-off in the
> commit message.
> 
> I.e. what if I have 100K objects listed in the skipList, is it only
> going to be read lazily during fsck if there's an issue, or on every
> object etc? What's the difference in performance?

The list is loaded once up-front and a lookup is done for each
reportable issue and .gitmodules file.  Repositories without them
won't be affected at all.

100K bad objects sounds a bit extreme, but a fast-import can create
such a repository relatively quickly.  Here are the numbers with and
without the two patches:

Test                                           origin/master     HEAD
----------------------------------------------------------------------------------
1450.3: fsck with 0 skipped bad commits        0.84(0.78+0.05)   0.83(0.80+0.03) -1.2%
1450.5: fsck with 1 skipped bad commits        0.84(0.77+0.07)   0.84(0.79+0.05) +0.0%
1450.7: fsck with 10 skipped bad commits       0.86(0.81+0.05)   0.84(0.78+0.06) -2.3%
1450.9: fsck with 100 skipped bad commits      0.85(0.78+0.07)   0.84(0.78+0.05) -1.2%
1450.11: fsck with 1000 skipped bad commits    0.85(0.80+0.05)   0.84(0.79+0.05) -1.2%
1450.13: fsck with 10000 skipped bad commits   0.85(0.78+0.07)   0.82(0.75+0.06) -3.5%
1450.15: fsck with 100000 skipped bad commits  0.73(0.64+0.09)   0.64(0.62+0.02) -12.3%

They look the same for most sizes, but with all 100000 bad objects in
the skiplist the oidset wins decisively.  Dialing it up a notch:

Test                                            origin/master     HEAD
-----------------------------------------------------------------------------------
1450.3: fsck with 0 skipped bad commits         8.61(7.94+0.66)   8.72(8.14+0.58) +1.3%
1450.5: fsck with 1 skipped bad commits         8.51(7.87+0.63)   8.53(7.93+0.60) +0.2%
1450.7: fsck with 10 skipped bad commits        8.56(7.98+0.57)   8.54(7.91+0.63) -0.2%
1450.9: fsck with 100 skipped bad commits       8.65(8.00+0.65)   8.47(7.93+0.53) -2.1%
1450.11: fsck with 1000 skipped bad commits     8.69(8.16+0.53)   8.49(8.00+0.49) -2.3%
1450.13: fsck with 10000 skipped bad commits    8.69(8.13+0.56)   8.50(7.93+0.56) -2.2%
1450.15: fsck with 100000 skipped bad commits   8.78(8.18+0.60)   8.36(7.85+0.50) -4.8%
1450.17: fsck with 1000000 skipped bad commits  7.83(7.07+0.76)   6.55(6.42+0.12) -16.3%

So it looks like successful lookups are faster in the oidset and
the oid_array is quicker with just a handful of entries.

An L1 cache line of 64 bytes holds 3 consecutive SHA1 hashes, which
might explain it.

Here's the perf test code:

---
 t/perf/p1450-fsck.sh | 40 
 1 file changed, 40 insertions(+)
 create mode 100755 t/perf/p1450-fsck.sh

diff --git a/t/perf/p1450-fsck.sh b/t/perf/p1450-fsck.sh
new file mode 100755
index 00..7c89690777
--- /dev/null
+++ b/t/perf/p1450-fsck.sh
@@ -0,0 +1,40 @@
+#!/bin/sh
+
+test_description='Test fsck skipList performance'
+
+. ./perf-lib.sh
+
+test_perf_fresh_repo
+
+n=10
+
+test_expect_success "setup $n bad commits" '
+   for i in $(test_seq 1 $n)
+   do
+   echo "commit refs/heads/master" &&
+   echo "committer C  1234567890 +" &&
+   echo "data 

Re: [PATCH 2/2] fsck: use oidset for skiplist

2018-08-25 Thread René Scharfe
Am 11.08.2018 um 22:48 schrieb Ramsay Jones:
> On 11/08/18 16:47, René Scharfe wrote:
>> @@ -34,12 +36,12 @@ struct fsck_options {
>>  fsck_error error_func;
>>  unsigned strict:1;
>>  int *msg_type;
>> -    struct oid_array *skiplist;
>> +    struct oidset skiplist;
>>  struct decoration *object_names;
>>  };
>>  
>> -#define FSCK_OPTIONS_DEFAULT { NULL, fsck_error_function, 0, NULL }
>> -#define FSCK_OPTIONS_STRICT { NULL, fsck_error_function, 1, NULL }
>> +#define FSCK_OPTIONS_DEFAULT { NULL, fsck_error_function, 0, NULL, 
>> OIDSET_INIT }
>> +#define FSCK_OPTIONS_STRICT { NULL, fsck_error_function, 1, NULL, 
>> OIDSET_INIT }
> 
> Note that a NULL initialiser, for the object_names field, is missing
> (not introduced by this patch). Since you have bumped into the 80th
> column, you may not want to add that NULL to the end of these macros
> (it is not _necessary_ after all). However, ... :-D

Exactly my thoughts -- except the "However" part. :)

I even thought about reordering the struct to move the NULL-initialized
elements to the end, allowing us to drop them from the initializer, but
felt that this would be a bit too much..

René


Re: [PATCH 2/2] fsck: use oidset for skiplist

2018-08-13 Thread Jeff King
On Mon, Aug 13, 2018 at 09:58:42PM -0400, Jeff King wrote:

> >  builtin/cat-file.c | 93 +++---
> >  1 file changed, 88 insertions(+), 5 deletions(-)
> 
> By the way, your patch seemed damaged (wouldn't apply, and "am -3"
> complained of hand-editing). It looks like maybe there's an extra space
> inserted in the context lines?

Oh nevermind, I see you guys dug to the bottom of it elsewhere in the
thread. :)

-Peff


Re: [PATCH 2/2] fsck: use oidset for skiplist

2018-08-13 Thread Jeff King
On Mon, Aug 13, 2018 at 07:15:19PM +0200, René Scharfe wrote:

> Getting sidetracked here, but the following patch helps both sides a bit:
> 
> -- >8 --
> Subject: [PATCH] cat-file: reuse strbuf in batch_object_write()
> 
> Avoid allocating and then releasing memory for constructing the output
> for each object by reusing the strbuf for all of them.

Thanks, this an easy and sensible optimization. I have a few patches to
send on top of my cat-file topic, so I'll pick this up there.

-Peff


Re: [PATCH 2/2] fsck: use oidset for skiplist

2018-08-13 Thread Jeff King
On Mon, Aug 13, 2018 at 07:15:23PM +0200, René Scharfe wrote:

> Am 11.08.2018 um 22:59 schrieb René Scharfe:
> > If the current oidset implementation is so bad, why not replace it with
> > one based on oid_array? ;-)
> > 
> > Intuitively I'd try a hashmap with no payload and open addressing via
> > sha1hash(), which should reduce memory allocations quite a bit -- no
> > need to store hash codes and next pointers, only an array of object IDs
> > with a fill rate of 50% or so.  Deletions are a bit awkward with that
> > scheme, though; they could perhaps be implemented as insertions into a
> > second hashmap.
> 
> Here's roughly what I had in mind, only with a free/occupied bitmap (or
> a one-bit payload, if you will).  I tried a variant that encoded empty
> slots as null_oid first, which has lower memory usage, but isn't any
> faster than the current code.

Hmph, I thought I had sent my version out last night, but it looks like
I didn't. I got similarly mediocre results.

Your suggestion can be implemented using khash (my patch below).

> Before:
> Benchmark #1: ./git-cat-file --batch-all-objects --buffer --unordered --batch-check='%(objectname)'
> 
>   Time (mean ± σ):     269.5 ms ±  26.7 ms    [User: 247.7 ms, System: 21.4 ms]
> 
>   Range (min … max):   240.3 ms … 339.3 ms
> 
> After:
> Benchmark #1: ./git-cat-file --batch-all-objects --buffer --unordered --batch-check='%(objectname)'
> 
>   Time (mean ± σ):     224.2 ms ±  18.2 ms    [User: 201.7 ms, System: 22.1 ms]
> 
>   Range (min … max):   205.0 ms … 259.0 ms

Yeah. My best-of-five dropped from 300ms to 247ms. That 300 was using
the memory pool, though khash's deletion strategy isn't all that
different (the wasted memory hangs around until the next hash resize,
but if you're evenly dropping and adding, you likely won't need to
resize).

Applying your patch, I get 337ms, worse than the hashmap with a memory
pool. I'm not sure why.

>  builtin/cat-file.c | 93 +++---
>  1 file changed, 88 insertions(+), 5 deletions(-)

By the way, your patch seemed damaged (wouldn't apply, and "am -3"
complained of hand-editing). It looks like maybe there's an extra space
inserted in the context lines?

Anyway, here's the khash patch for reference.

diff --git a/fetch-pack.c b/fetch-pack.c
index 5714bcbddd..5a86b10a5e 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -534,7 +534,7 @@ static int tip_oids_contain(struct oidset *tip_oids,
 * add to "newlist" between calls, the additions will always be for
 * oids that are already in the set.
 */
-   if (!tip_oids->map.map.tablesize) {
+   if (!tip_oids->map) {
add_refs_to_oidset(tip_oids, unmatched);
add_refs_to_oidset(tip_oids, newlist);
}
diff --git a/oidset.c b/oidset.c
index 454c54f933..2964b43b2d 100644
--- a/oidset.c
+++ b/oidset.c
@@ -3,38 +3,44 @@
 
 int oidset_contains(const struct oidset *set, const struct object_id *oid)
 {
-   if (!set->map.map.tablesize)
+   khiter_t pos;
+
+   if (!set->map)
return 0;
-   return !!oidmap_get(&set->map, oid);
+
+   pos = kh_get_oid(set->map, *oid);
+   return pos < kh_end(set->map);
 }
 
 int oidset_insert(struct oidset *set, const struct object_id *oid)
 {
-   struct oidmap_entry *entry;
+   int hash_ret;
 
-   if (!set->map.map.tablesize)
-   oidmap_init(&set->map, 0);
-   else if (oidset_contains(set, oid))
-   return 1;
+   if (!set->map)
+   set->map = kh_init_oid();
 
-   entry = xmalloc(sizeof(*entry));
-   oidcpy(&entry->oid, oid);
-
-   oidmap_put(&set->map, entry);
-   return 0;
+   kh_put_oid(set->map, *oid, &hash_ret);
+   return !hash_ret;
 }
 
 int oidset_remove(struct oidset *set, const struct object_id *oid)
 {
-   struct oidmap_entry *entry;
+   khiter_t pos;
 
-   entry = oidmap_remove(&set->map, oid);
-   free(entry);
+   if (!set->map)
+   return 0;
+
+   pos = kh_get_oid(set->map, *oid);
+   if (pos < kh_end(set->map)) {
+   kh_del_oid(set->map, pos);
+   return 1;
+   }
 
-   return (entry != NULL);
+   return 0;
 }
 
 void oidset_clear(struct oidset *set)
 {
-   oidmap_free(&set->map, 1);
+   kh_destroy_oid(set->map);
+   set->map = NULL;
 }
diff --git a/oidset.h b/oidset.h
index 40ec5f87fe..4c4c5a42fe 100644
--- a/oidset.h
+++ b/oidset.h
@@ -2,6 +2,7 @@
 #define OIDSET_H
 
 #include "oidmap.h"
+#include "khash.h"
 
 /**
  * This API is similar to sha1-array, in that it maintains a set of object ids
@@ -15,19 +16,34 @@
  *  table overhead.
  */
 
+static inline unsigned int oid_hash(const struct object_id oid)
+{
+   unsigned int hash;
+   memcpy(&hash, oid.hash, sizeof(hash));
+   return hash;
+}
+
+static inline int oid_equal(const struct object_id a,
+   const struct object_id b)
+{
+   return !oidcmp(&a, &b);
+}
+

Re: [PATCH 2/2] fsck: use oidset for skiplist

2018-08-13 Thread René Scharfe
Am 13.08.2018 um 23:07 schrieb Junio C Hamano:
> René Scharfe  writes:
> 
>> the mailing list [1], nor on the web interface [2].  The latter shows
>> extra spaces on the context lines of the first hunk, though, which I
>> can't see anywhere else.  All the lines look fine in the citation of
>> Ramsay's reply [3].  So I don't know where these extra spaces are
>> coming from. :-/
> 
> Hmph, interesting.
> 
> https://public-inbox.org/git/54a5367f-f832-402c-f51b-3225c92b4...@web.de/raw
> 
> has "Content-Type: text/plain; charset=utf-8; format=flowed".  That
> page's rendition is more faithful to the bare text.

That explains it: Thunderbird 60 disables most older Add-ons, among them
Toggle Word Wrap, which used to turn off format=flowed for me.  I did
that now using the config settings mailnews.send_plaintext_flowed and
mailnews.display.disable_format_flowed_support.

> The funky " -" one I showed was what Gnus/Emacs came up with as the
> result of its best effort to make the format=flawed into something
> closer to "text", I think X-<.  

Sorry. :(

> In any case, I do not think format=flowed can be reverted reliably
> (or can it be?  If so we should teach mailinfo to repair them).

RFC3676 gives me a headache, perhaps I should go to bed.  If we can
assume that lines don't have trailing spaces originally then we should
be able to reconstruct their contents, no?  "A generating agent SHOULD:
[...] Trim spaces before user-inserted hard line breaks.", i.e. lines
with trailing spaces are doomed to truncation without hope for repair.

René


Re: [PATCH 2/2] fsck: use oidset for skiplist

2018-08-13 Thread Junio C Hamano
René Scharfe  writes:

> the mailing list [1], nor on the web interface [2].  The latter shows
> extra spaces on the context lines of the first hunk, though, which I
> can't see anywhere else.  All the lines look fine in the citation of
> Ramsay's reply [3].  So I don't know where these extra spaces are
> coming from. :-/

Hmph, interesting.

https://public-inbox.org/git/54a5367f-f832-402c-f51b-3225c92b4...@web.de/raw

has "Content-Type: text/plain; charset=utf-8; format=flowed".  That
page's rendition is more faithful to the bare text.

The funky " -" one I showed was what Gnus/Emacs came up with as the
result of its best effort to make the format=flawed into something
closer to "text", I think X-<.  

In any case, I do not think format=flowed can be reverted reliably
(or can it be?  If so we should teach mailinfo to repair them).


Re: [PATCH 2/2] fsck: use oidset for skiplist

2018-08-13 Thread René Scharfe

Am 13.08.2018 um 20:43 schrieb Junio C Hamano:

René Scharfe  writes:


@@ -182,19 +181,10 @@ static int fsck_msg_type(enum fsck_msg_id msg_id,
  static void init_skiplist(struct fsck_options *options, const char
*path)
  {
-   static struct oid_array skiplist = OID_ARRAY_INIT;
-   int sorted;
FILE *fp;
struct strbuf sb = STRBUF_INIT;
struct object_id oid;
  - if (options->skiplist)
-   sorted = options->skiplist->sorted;
-   else {


That SP before '-' on the removed line is the most curious aspect of
this patch; I do not recall the last time I saw a corrupt patch from
you---changed where you read/write your mails recently?


Well, I updated Thunderbird to version 60 a few days ago, but I can't
see that extra space in my Sent folder, nor via the NNTP interface of
the mailing list [1], nor on the web interface [2].  The latter shows
extra spaces on the context lines of the first hunk, though, which I
can't see anywhere else.  All the lines look fine in the citation of
Ramsay's reply [3].  So I don't know where these extra spaces are
coming from. :-/

René


[1] news://news.public-inbox.org:119/54a5367f-f832-402c-f51b-3225c92b4...@web.de
[2] https://public-inbox.org/git/54a5367f-f832-402c-f51b-3225c92b4...@web.de/
[3] https://public-inbox.org/git/bc9f21c6-b362-2e3f-1820-7da93a76a...@ramsayjones.plus.com/



Re: [PATCH 2/2] fsck: use oidset for skiplist

2018-08-13 Thread Junio C Hamano
René Scharfe  writes:

> @@ -182,19 +181,10 @@ static int fsck_msg_type(enum fsck_msg_id msg_id,
>  static void init_skiplist(struct fsck_options *options, const char
> *path)
>  {
> - static struct oid_array skiplist = OID_ARRAY_INIT;
> - int sorted;
>   FILE *fp;
>   struct strbuf sb = STRBUF_INIT;
>   struct object_id oid;
>  -if (options->skiplist)
> - sorted = options->skiplist->sorted;
> - else {

That SP before '-' on the removed line is the most curious aspect of
this patch; I do not recall the last time I saw a corrupt patch from
you---changed where you read/write your mails recently?

> @@ -317,9 +300,7 @@ static void append_msg_id(struct strbuf *sb, const char 
> *msg_id)
>  static int object_on_skiplist(struct fsck_options *opts, struct
> object *obj)
>  {
> - if (opts && opts->skiplist && obj)
> - return oid_array_lookup(opts->skiplist, &obj->oid) >= 0;
> - return 0;
> + return opts && obj && oidset_contains(&opts->skiplist, &obj->oid);

OK ;-)



Re: [PATCH 2/2] fsck: use oidset for skiplist

2018-08-13 Thread René Scharfe

Am 11.08.2018 um 22:59 schrieb René Scharfe:

If the current oidset implementation is so bad, why not replace it with
one based on oid_array? ;-)

Intuitively I'd try a hashmap with no payload and open addressing via
sha1hash(), which should reduce memory allocations quite a bit -- no
need to store hash codes and next pointers, only an array of object IDs
with a fill rate of 50% or so.  Deletions are a bit awkward with that
scheme, though; they could perhaps be implemented as insertions into a
second hashmap.


Here's roughly what I had in mind, only with a free/occupied bitmap (or
a one-bit payload, if you will).  I tried a variant that encoded empty
slots as null_oid first, which has lower memory usage, but isn't any
faster than the current code.

# in git.git
$ hyperfine "./git-cat-file --batch-all-objects --buffer --unordered 
--batch-check='%(objectname)'"

Before:
Benchmark #1: ./git-cat-file --batch-all-objects --buffer --unordered 
--batch-check='%(objectname)'

  Time (mean ± σ): 269.5 ms ±  26.7 ms  [User: 247.7 ms, System: 21.4 ms]

  Range (min … max):   240.3 ms … 339.3 ms

After:
Benchmark #1: ./git-cat-file --batch-all-objects --buffer --unordered 
--batch-check='%(objectname)'

  Time (mean ± σ): 224.2 ms ±  18.2 ms  [User: 201.7 ms, System: 22.1 ms]

  Range (min … max):   205.0 ms … 259.0 ms

So that's only slightly faster. :-|

---
 builtin/cat-file.c | 93 +++---
 1 file changed, 88 insertions(+), 5 deletions(-)

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 45992c9be9..b197cca861 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -408,10 +408,93 @@ static void batch_one_object(const char *obj_name, struct 
batch_options *opt,
batch_object_write(obj_name, opt, data);
 }
 
+struct oidset2 {

+   size_t nr, alloc;
+   struct object_id *entries;
+   uint32_t *bitmap;
+};
+
+static int is_bit_set(const uint32_t *bitmap, size_t idx)
+{
+   uint32_t mask = 1 << (idx % bitsizeof(bitmap[0]));
+   return bitmap[idx / bitsizeof(bitmap[0])] & mask;
+}
+
+static void set_bit(uint32_t *bitmap, size_t idx)
+{
+   uint32_t mask = 1 << (idx % bitsizeof(bitmap[0]));
+   bitmap[idx / bitsizeof(bitmap[0])] |= mask;
+}
+
+static void oidset2_add(struct oidset2 *set, const struct object_id *oid)
+{
+   size_t idx;
+
+   for (idx = sha1hash(oid->hash) % set->alloc;;) {
+   if (!is_bit_set(set->bitmap, idx))
+   break;
+   if (!oidcmp(&set->entries[idx], oid))
+   return;
+   if (++idx >= set->alloc)
+   idx = 0;
+   }
+   oidcpy(&set->entries[idx], oid);
+   set_bit(set->bitmap, idx);
+   set->nr++;
+}
+
+static void oidset2_grow(struct oidset2 *set)
+{
+   struct oidset2 old_set = *set;
+   size_t idx;
+
+   set->alloc = (old_set.alloc + 1000) * 3 / 2;
+   set->nr = 0;
+   set->entries = xcalloc(set->alloc, sizeof(set->entries[0]));
+   set->bitmap = xcalloc(set->alloc / 32 + 1, sizeof(set->bitmap[0]));
+   for (idx = 0; idx < old_set.alloc; idx++) {
+   if (!is_bit_set(old_set.bitmap, idx))
+   continue;
+   oidset2_add(set, &old_set.entries[idx]);
+   }
+   free(old_set.entries);
+   free(old_set.bitmap);
+}
+
+static void oidset2_insert(struct oidset2 *set, const struct object_id *oid)
+{
+   if (set->nr + 1 > set->alloc * 2 / 3)
+   oidset2_grow(set);
+   oidset2_add(set, oid);
+}
+
+static int oidset2_contains(struct oidset2 *set, const struct object_id *oid)
+{
+   size_t idx;
+
+   if (!set->nr)
+   return 0;
+   for (idx = sha1hash(oid->hash) % set->alloc;;) {
+   if (!is_bit_set(set->bitmap, idx))
+   return 0;
+   if (!oidcmp(&set->entries[idx], oid))
+   return 1;
+   if (++idx >= set->alloc)
+   idx = 0;
+   }
+}
+
+static void oidset2_clear(struct oidset2 *set)
+{
+   FREE_AND_NULL(set->entries);
+   FREE_AND_NULL(set->bitmap);
+   set->alloc = set->nr = 0;
+}
+
 struct object_cb_data {
struct batch_options *opt;
struct expand_data *expand;
-   struct oidset *seen;
+   struct oidset2 *seen;
 };
 
 static int batch_object_cb(const struct object_id *oid, void *vdata)

@@ -443,9 +526,9 @@ static int batch_unordered_object(const struct object_id 
*oid, void *vdata)
 {
struct object_cb_data *data = vdata;
 
-	if (oidset_contains(data->seen, oid))

+   if (oidset2_contains(data->seen, oid))
return 0;
-   oidset_insert(data->seen, oid);
+   oidset2_insert(data->seen, oid);
 
 	return batch_object_cb(oid, data);

 }
@@ -510,7 +593,7 @@ static int batch_objects(struct batch_options *opt)
cb.expand = &data;
 
 		if (opt->unordered) {

-   struct oidset seen = OIDSET_INIT;

Re: [PATCH 2/2] fsck: use oidset for skiplist

2018-08-13 Thread René Scharfe

Am 11.08.2018 um 19:23 schrieb Jeff King:

On Sat, Aug 11, 2018 at 01:02:48PM -0400, Jeff King wrote:


   - we could probably improve the speed of oidset. Two things I notice
 about its implementation:

   - it has to malloc for each entry, which I suspect is the main
bottleneck. We could probably pool-allocate blocks, and when
entries get removed just leave the allocations in place until we
clear(). Most callers tend to build up a set and then query it a
lot, or possibly remove items from the set until it's empty. But
my guess is that few or none want a long-lived set that they add
and remove from randomly.

   - insertion lets you do check-and-insert as a single operation
(something I failed to notice in [1]). But it's not implemented
as efficiently as it could be, since the "contains" and "put"
operations do separate lookups. This doesn't matter for a set
that's queried a lot more, but for something like de-duping
(like I was doing in [1]) most operations are check-and-insert.
[...]
[1] https://public-inbox.org/git/20180810232457.gg19...@sigill.intra.peff.net/
 but note that it's buried pretty deep.


Some notes on this, based on the cat-file patch that I linked to.

Before any optimizations, my best-of-five timing for:

   git cat-file --batch-all-objects --unordered --buffer \
--batch-check='%(objectname)' >/dev/null

in git.git was:

   real 0m0.434s
   user 0m0.414s
   sys  0m0.020s

That's enumerating every object in the repo but not doing much more than
de-duping the names and printing them.

Applying this patch:

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 45992c9be9..04b5cda191 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -443,9 +443,8 @@ static int batch_unordered_object(const struct object_id 
*oid, void *vdata)
  {
struct object_cb_data *data = vdata;
  
-	if (oidset_contains(data->seen, oid))

+   if (oidset_insert(data->seen, oid))
return 0;
-   oidset_insert(data->seen, oid);
  
  	return batch_object_cb(oid, data);

  }

to use the single-call set-and-replace doesn't seem to help (any
improvement is within the run-to-run noise). So a single hash lookup per
object does not seem to be measurable. And thus teaching oidset_insert()
to do a single hash lookup for check-and-insert is unlikely to help us.

On top of that, I tried using a pool to store the set entries:

diff --git a/oidset.c b/oidset.c
index 454c54f933..504929f177 100644
--- a/oidset.c
+++ b/oidset.c
@@ -17,7 +17,10 @@ int oidset_insert(struct oidset *set, const struct object_id 
*oid)
else if (oidset_contains(set, oid))
return 1;
  
-	entry = xmalloc(sizeof(*entry));

+   if (!set->pool)
+   mem_pool_init(&set->pool, 0);
+
+   entry = mem_pool_alloc(set->pool, sizeof(*entry));
	oidcpy(&entry->oid, oid);
  
	oidmap_put(&set->map, entry);

@@ -29,12 +32,13 @@ int oidset_remove(struct oidset *set, const struct 
object_id *oid)
struct oidmap_entry *entry;
  
	entry = oidmap_remove(&set->map, oid);

-   free(entry);
+   /* abandon pool memory for "entry" */
  
  	return (entry != NULL);

  }
  
  void oidset_clear(struct oidset *set)

  {
-   oidmap_free(&set->map, 1);
+   oidmap_free(&set->map, 0);
+   mem_pool_discard(set->pool, 0);
  }
diff --git a/oidset.h b/oidset.h
index 40ec5f87fe..6b8b802987 100644
--- a/oidset.h
+++ b/oidset.h
@@ -20,6 +20,7 @@
   */
  struct oidset {
struct oidmap map;
+   struct mem_pool *pool;
  };
  
  #define OIDSET_INIT { OIDMAP_INIT }


That drops my best-of-five to:

   real 0m0.300s
   user 0m0.288s
   sys  0m0.012s

which is over a 25% speedup. So that does seem worth pursuing.

For reference, the oid_array code path for cat-file is still:

   real 0m0.161s
   user 0m0.157s
   sys  0m0.004s

but that's not completely apples to apples. The oidset code is also
iterating the packfiles in a different order and generating a revidx
(which I know is about 25ms in this repo). So a better test would
actually swap one data structure out for the other with no other changes
(I just happened to have this test handy, and really wanted to know
whether the mem_pool stuff would help).


Getting sidetracked here, but the following patch helps both sides a bit:

-- >8 --
Subject: [PATCH] cat-file: reuse strbuf in batch_object_write()

Avoid allocating and then releasing memory for constructing the output
for each object by reusing the strbuf for all of them.

Signed-off-by: Rene Scharfe 
---
# on git.git
$ hyperfine "./git-cat-file --batch-all-objects --buffer 
--batch-check='%(objectname)'"

Before:
Benchmark #1: ./git-cat-file --batch-all-objects --buffer 
--batch-check='%(objectname)'

  Time (mean ± σ): 139.3 ms ±  20.1 ms  [User: 124.2 ms, System: 14.8 ms]

  Range (min … max):   124.4 ms … 205.9 ms

After:
Benchmark #1: ./git-cat-file --batch-all-objects 

Re: [PATCH 2/2] fsck: use oidset for skiplist

2018-08-11 Thread René Scharfe

Am 11.08.2018 um 19:23 schrieb Jeff King:

On Sat, Aug 11, 2018 at 01:02:48PM -0400, Jeff King wrote:


   - we could probably improve the speed of oidset. Two things I notice
 about its implementation:



Before any optimizations, my best-of-five timing for:

   git cat-file --batch-all-objects --unordered --buffer \
--batch-check='%(objectname)' >/dev/null

in git.git was:

   real 0m0.434s
   user 0m0.414s
   sys  0m0.020s

That's enumerating every object in the repo but not doing much more than
de-duping the names and printing them.



So a single hash lookup per
object does not seem to be measurable. And thus teaching oidset_insert()
to do a single hash lookup for check-and-insert is unlikely to help us.



On top of that, I tried using a pool to store the set entries:



That drops my best-of-five to:

   real 0m0.300s
   user 0m0.288s
   sys  0m0.012s

which is over a 25% speedup. So that does seem worth pursuing.



For reference, the oid_array code path for cat-file is still:

   real 0m0.161s
   user 0m0.157s
   sys  0m0.004s

but that's not completely apples to apples. The oidset code is also
iterating the packfiles in a different order and generating a revidx
(which I know is about 25ms in this repo). So a better test would
actually swap one data structure out for the other with no other changes
(I just happened to have this test handy, and really wanted to know
whether the mem_pool stuff would help).


If the current oidset implementation is so bad, why not replace it with
one based on oid_array? ;-)

Intuitively I'd try a hashmap with no payload and open addressing via
sha1hash(), which should reduce memory allocations quite a bit -- no
need to store hash codes and next pointers, only an array of object IDs
with a fill rate of 50% or so.  Deletions are a bit awkward with that
scheme, though; they could perhaps be implemented as insertions into a
second hashmap.

Balancing might become a problem.  Your cat-file example should be fine,
but someone could decide to add only hashes with a certain prefix to
their skiplist, and lookups would be lopsided.

But first we'd need something like test-sha1-array for oidset and
some kind of performance tests based on these helpers, right?

René


Re: [PATCH 2/2] fsck: use oidset for skiplist

2018-08-11 Thread Ramsay Jones



On 11/08/18 16:47, René Scharfe wrote:
> Object IDs to skip are stored in a shared static oid_array.  Lookups do
> a binary search on the sorted array.  The code checks if the object IDs
> are already in the correct order while loading and skips sorting in that
> case.
> 
> Simplify the code by using an oidset instead.  Memory usage is a bit
> higher, but lookups are done in constant time and there is no need to
> worry about any sort order.
> 
> Embed the oidset into struct fsck_options to make its ownership clear
> (no hidden sharing) and avoid unnecessary pointer indirection.
> 
> Signed-off-by: Rene Scharfe 
> ---
>  fsck.c | 23 ++-
>  fsck.h |  8 +---
>  2 files changed, 7 insertions(+), 24 deletions(-)
> 
> diff --git a/fsck.c b/fsck.c
> index 83f4562390..9246afee22 100644
> --- a/fsck.c
> +++ b/fsck.c
> @@ -10,7 +10,6 @@
>  #include "fsck.h"
>  #include "refs.h"
>  #include "utf8.h"
> -#include "sha1-array.h"
>  #include "decorate.h"
>  #include "oidset.h"
>  #include "packfile.h"
> @@ -182,19 +181,10 @@ static int fsck_msg_type(enum fsck_msg_id msg_id,
>  
>  static void init_skiplist(struct fsck_options *options, const char *path)
>  {
> -    static struct oid_array skiplist = OID_ARRAY_INIT;
> -    int sorted;
>  FILE *fp;
>  struct strbuf sb = STRBUF_INIT;
>  struct object_id oid;
>  
> -    if (options->skiplist)
> -    sorted = options->skiplist->sorted;
> -    else {
> -    sorted = 1;
> -    options->skiplist = &skiplist;
> -    }
> -
>  fp = fopen(path, "r");
>  if (!fp)
>  die("Could not open skip list: %s", path);
> @@ -202,17 +192,10 @@ static void init_skiplist(struct fsck_options *options, 
> const char *path)
>  const char *p;
>  if (parse_oid_hex(sb.buf, &oid, &p) || *p != '\0')
>  die("Invalid SHA-1: %s", sb.buf);
> -    oid_array_append(&skiplist, &oid);
> -    if (sorted && skiplist.nr > 1 &&
> -    oidcmp(&skiplist.oid[skiplist.nr - 2],
> -   &oid) > 0)
> -    sorted = 0;
> +    oidset_insert(&options->skiplist, &oid);
>  }
>  fclose(fp);
>  strbuf_release(&sb);
> -
> -    if (sorted)
> -    skiplist.sorted = 1;
>  }
>  
>  static int parse_msg_type(const char *str)
> @@ -317,9 +300,7 @@ static void append_msg_id(struct strbuf *sb, const char 
> *msg_id)
>  
>  static int object_on_skiplist(struct fsck_options *opts, struct object *obj)
>  {
> -    if (opts && opts->skiplist && obj)
> -    return oid_array_lookup(opts->skiplist, &obj->oid) >= 0;
> -    return 0;
> +    return opts && obj && oidset_contains(&opts->skiplist, &obj->oid);
>  }
>  
>  __attribute__((format (printf, 4, 5)))
> diff --git a/fsck.h b/fsck.h
> index c3cf5e0034..26120e6186 100644
> --- a/fsck.h
> +++ b/fsck.h
> @@ -1,6 +1,8 @@
>  #ifndef GIT_FSCK_H
>  #define GIT_FSCK_H
>  
> +#include "oidset.h"
> +
>  #define FSCK_ERROR 1
>  #define FSCK_WARN 2
>  #define FSCK_IGNORE 3
> @@ -34,12 +36,12 @@ struct fsck_options {
>  fsck_error error_func;
>  unsigned strict:1;
>  int *msg_type;
> -    struct oid_array *skiplist;
> +    struct oidset skiplist;
>  struct decoration *object_names;
>  };
>  
> -#define FSCK_OPTIONS_DEFAULT { NULL, fsck_error_function, 0, NULL }
> -#define FSCK_OPTIONS_STRICT { NULL, fsck_error_function, 1, NULL }
> +#define FSCK_OPTIONS_DEFAULT { NULL, fsck_error_function, 0, NULL, 
> OIDSET_INIT }
> +#define FSCK_OPTIONS_STRICT { NULL, fsck_error_function, 1, NULL, 
> OIDSET_INIT }

Note that a NULL initialiser, for the object_names field, is missing
(not introduced by this patch). Since you have bumped into the 80th
column, you may not want to add that NULL to the end of these macros
(it is not _necessary_ after all). However, ... :-D

Otherwise, LGTM.

Thanks!

ATB,
Ramsay Jones

>  
>  /* descend in all linked child objects
>   * the return value is:


Re: [PATCH 2/2] fsck: use oidset for skiplist

2018-08-11 Thread Jeff King
On Sat, Aug 11, 2018 at 01:02:48PM -0400, Jeff King wrote:

>   - we could probably improve the speed of oidset. Two things I notice
> about its implementation:
> 
>   - it has to malloc for each entry, which I suspect is the main
>   bottleneck. We could probably pool-allocate blocks, and when
>   entries get removed just leave the allocations in place until we
>   clear(). Most callers tend to build up a set and then query it a
>   lot, or possibly remove items from the set until it's empty. But
>   my guess is that few or none want a long-lived set that they add
>   and remove from randomly.
> 
>   - insertion lets you do check-and-insert as a single operation
>   (something I failed to notice in [1]). But it's not implemented
>   as efficiently as it could be, since the "contains" and "put"
>   operations do separate lookups. This doesn't matter for a set
>   that's queried a lot more, but for something like de-duping
>   (like I was doing in [1]) most operations are check-and-insert.
> [...]
> [1] https://public-inbox.org/git/20180810232457.gg19...@sigill.intra.peff.net/
> but note that it's buried pretty deep.

Some notes on this, based on the cat-file patch that I linked to.

Before any optimizations, my best-of-five timing for:

  git cat-file --batch-all-objects --unordered --buffer \
   --batch-check='%(objectname)' >/dev/null

in git.git was:

  real  0m0.434s
  user  0m0.414s
  sys   0m0.020s

That's enumerating every object in the repo but not doing much more than
de-duping the names and printing them.

Applying this patch:

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 45992c9be9..04b5cda191 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -443,9 +443,8 @@ static int batch_unordered_object(const struct object_id 
*oid, void *vdata)
 {
struct object_cb_data *data = vdata;
 
-   if (oidset_contains(data->seen, oid))
+   if (oidset_insert(data->seen, oid))
return 0;
-   oidset_insert(data->seen, oid);
 
return batch_object_cb(oid, data);
 }

to use the single-call set-and-replace doesn't seem to help (any
improvement is within the run-to-run noise). So a single hash lookup per
object does not seem to be measurable. And thus teaching oidset_insert()
to do a single hash lookup for check-and-insert is unlikely to help us.

On top of that, I tried using a pool to store the set entries:

diff --git a/oidset.c b/oidset.c
index 454c54f933..504929f177 100644
--- a/oidset.c
+++ b/oidset.c
@@ -17,7 +17,10 @@ int oidset_insert(struct oidset *set, const struct object_id 
*oid)
else if (oidset_contains(set, oid))
return 1;
 
-   entry = xmalloc(sizeof(*entry));
+   if (!set->pool)
+   mem_pool_init(&set->pool, 0);
+
+   entry = mem_pool_alloc(set->pool, sizeof(*entry));
	oidcpy(&entry->oid, oid);
 
	oidmap_put(&set->map, entry);
@@ -29,12 +32,13 @@ int oidset_remove(struct oidset *set, const struct 
object_id *oid)
struct oidmap_entry *entry;
 
	entry = oidmap_remove(&set->map, oid);
-   free(entry);
+   /* abandon pool memory for "entry" */
 
return (entry != NULL);
 }
 
 void oidset_clear(struct oidset *set)
 {
-   oidmap_free(&set->map, 1);
+   oidmap_free(&set->map, 0);
+   mem_pool_discard(set->pool, 0);
 }
diff --git a/oidset.h b/oidset.h
index 40ec5f87fe..6b8b802987 100644
--- a/oidset.h
+++ b/oidset.h
@@ -20,6 +20,7 @@
  */
 struct oidset {
struct oidmap map;
+   struct mem_pool *pool;
 };
 
 #define OIDSET_INIT { OIDMAP_INIT }

That drops my best-of-five to:

  real  0m0.300s
  user  0m0.288s
  sys   0m0.012s

which is over a 25% speedup. So that does seem worth pursuing.

For reference, the oid_array code path for cat-file is still:

  real  0m0.161s
  user  0m0.157s
  sys   0m0.004s

but that's not completely apples to apples. The oidset code is also
iterating the packfiles in a different order and generating a revidx
(which I know is about 25ms in this repo). So a better test would
actually swap one data structure out for the other with no other changes
(I just happened to have this test handy, and really wanted to know
whether the mem_pool stuff would help).

-Peff


Re: [PATCH 2/2] fsck: use oidset for skiplist

2018-08-11 Thread Ævar Arnfjörð Bjarmason


On Sat, Aug 11 2018, René Scharfe wrote:

> Object IDs to skip are stored in a shared static oid_array.  Lookups do
> a binary search on the sorted array.  The code checks if the object IDs
> are already in the correct order while loading and skips sorting in that
> case.

I think this change makes sense, but it's missing an update to the
relevant documentation in Documentation/config.txt:

fsck.skipList::
The path to a sorted list of object names (i.e. one SHA-1 per
line) that are known to be broken in a non-fatal way and should
be ignored. This feature is useful when an established project
should be accepted despite early commits containing errors that
can be safely ignored such as invalid committer email addresses.
Note: corrupt objects cannot be skipped with this setting.

Also, while I use the skipList feature it's for something on the order
of 10-100 objects, so whatever algorithm the lookup uses isn't going to
matter, but I think it's interesting to describe the trade-off in the
commit message.

I.e. what if I have 100K objects listed in the skipList, is it only
going to be read lazily during fsck if there's an issue, or on every
object etc? What's the difference in performance?

Before this change, I wanted to follow-up my ab/fsck-transfer-updates
with something where we'd die if we found the skipList wasn't ordered as
we read it, but from a UI POV this is even better.


Re: [PATCH 2/2] fsck: use oidset for skiplist

2018-08-11 Thread Jeff King
On Sat, Aug 11, 2018 at 05:47:56PM +0200, René Scharfe wrote:

> Object IDs to skip are stored in a shared static oid_array.  Lookups do
> a binary search on the sorted array.  The code checks if the object IDs
> are already in the correct order while loading and skips sorting in that
> case.
> 
> Simplify the code by using an oidset instead.  Memory usage is a bit
> higher, but lookups are done in constant time and there is no need to
> worry about any sort order.
> 
> Embed the oidset into struct fsck_options to make its ownership clear
> (no hidden sharing) and avoid unnecessary pointer indirection.

I actually had a case[1] yesterday where it seems like oidset is a fair
bit slower than oid_array for a large set.

But:

  - loading the skiplist into memory has pretty lousy performance
anyway. If we really care about performance of large lists, we
should define a sorted on-disk format that can be mmap'd and
searched directly.  Or if people are willing to tolerate false
positives, even a bloom filter.

I've never really used a big skiplist myself, so I haven't done any
work towards those things.

  - we could probably improve the speed of oidset. Two things I notice
about its implementation:

  - it has to malloc for each entry, which I suspect is the main
bottleneck. We could probably pool-allocate blocks, and when
entries get removed just leave the allocations in place until we
clear(). Most callers tend to build up a set and then query it a
lot, or possibly remove items from the set until it's empty. But
my guess is that few or none want a long-lived set that they add
and remove from randomly.

  - insertion lets you do check-and-insert as a single operation
(something I failed to notice in [1]). But it's not implemented
as efficiently as it could be, since the "contains" and "put"
operations do separate lookups. This doesn't matter for a set
that's queried a lot more, but for something like de-duping
(like I was doing in [1]) most operations are check-and-insert.

Most of that is just food for thought, but it possibly argues that we
should not care about performance characteristics for swapping out
oid_array and oidset here (i.e., that your patch is fine, and the
simplicity benefit is the most important thing).

[1] https://public-inbox.org/git/20180810232457.gg19...@sigill.intra.peff.net/
but note that it's buried pretty deep.

> ---
>  fsck.c | 23 ++-
>  fsck.h |  8 +---
>  2 files changed, 7 insertions(+), 24 deletions(-)

Again, I didn't see anything wrong with the patch itself.

-Peff


[PATCH 2/2] fsck: use oidset for skiplist

2018-08-11 Thread René Scharfe

Object IDs to skip are stored in a shared static oid_array.  Lookups do
a binary search on the sorted array.  The code checks if the object IDs
are already in the correct order while loading and skips sorting in that
case.

Simplify the code by using an oidset instead.  Memory usage is a bit
higher, but lookups are done in constant time and there is no need to
worry about any sort order.

Embed the oidset into struct fsck_options to make its ownership clear
(no hidden sharing) and avoid unnecessary pointer indirection.

Signed-off-by: Rene Scharfe 
---
 fsck.c | 23 ++-
 fsck.h |  8 +---
 2 files changed, 7 insertions(+), 24 deletions(-)

diff --git a/fsck.c b/fsck.c
index 83f4562390..9246afee22 100644
--- a/fsck.c
+++ b/fsck.c
@@ -10,7 +10,6 @@
 #include "fsck.h"
 #include "refs.h"
 #include "utf8.h"
-#include "sha1-array.h"
 #include "decorate.h"
 #include "oidset.h"
 #include "packfile.h"
@@ -182,19 +181,10 @@ static int fsck_msg_type(enum fsck_msg_id msg_id,
 
 static void init_skiplist(struct fsck_options *options, const char *path)

 {
-   static struct oid_array skiplist = OID_ARRAY_INIT;
-   int sorted;
FILE *fp;
struct strbuf sb = STRBUF_INIT;
struct object_id oid;
 
-	if (options->skiplist)

-   sorted = options->skiplist->sorted;
-   else {
-   sorted = 1;
-   options->skiplist = &skiplist;
-   }
-
fp = fopen(path, "r");
if (!fp)
die("Could not open skip list: %s", path);
@@ -202,17 +192,10 @@ static void init_skiplist(struct fsck_options *options, 
const char *path)
const char *p;
if (parse_oid_hex(sb.buf, &oid, &p) || *p != '\0')
die("Invalid SHA-1: %s", sb.buf);
-   oid_array_append(&skiplist, &oid);
-   if (sorted && skiplist.nr > 1 &&
-   oidcmp(&skiplist.oid[skiplist.nr - 2],
-  &oid) > 0)
-   sorted = 0;
+   oidset_insert(&options->skiplist, &oid);
}
fclose(fp);
strbuf_release(&sb);
-
-   if (sorted)
-   skiplist.sorted = 1;
 }
 
 static int parse_msg_type(const char *str)

@@ -317,9 +300,7 @@ static void append_msg_id(struct strbuf *sb, const char 
*msg_id)
 
 static int object_on_skiplist(struct fsck_options *opts, struct object *obj)

 {
-   if (opts && opts->skiplist && obj)
-   return oid_array_lookup(opts->skiplist, &obj->oid) >= 0;
-   return 0;
+   return opts && obj && oidset_contains(&opts->skiplist, &obj->oid);
 }
 
 __attribute__((format (printf, 4, 5)))

diff --git a/fsck.h b/fsck.h
index c3cf5e0034..26120e6186 100644
--- a/fsck.h
+++ b/fsck.h
@@ -1,6 +1,8 @@
 #ifndef GIT_FSCK_H
 #define GIT_FSCK_H
 
+#include "oidset.h"
+
 #define FSCK_ERROR 1
 #define FSCK_WARN 2
 #define FSCK_IGNORE 3
@@ -34,12 +36,12 @@ struct fsck_options {
fsck_error error_func;
unsigned strict:1;
int *msg_type;
-   struct oid_array *skiplist;
+   struct oidset skiplist;
struct decoration *object_names;
 };
 
-#define FSCK_OPTIONS_DEFAULT { NULL, fsck_error_function, 0, NULL }

-#define FSCK_OPTIONS_STRICT { NULL, fsck_error_function, 1, NULL }
+#define FSCK_OPTIONS_DEFAULT { NULL, fsck_error_function, 0, NULL, OIDSET_INIT 
}
+#define FSCK_OPTIONS_STRICT { NULL, fsck_error_function, 1, NULL, OIDSET_INIT }
 
 /* descend in all linked child objects

  * the return value is:
--
2.18.0