Re: [PATCH] lookup_object: prioritize recently found objects

2013-05-02 Thread Jeff King
On Fri, May 03, 2013 at 02:02:44AM -0400, Jeff King wrote:

> > Can we be sure that the function is never invoked in concurrently from
> > different threads? I attempted to audit code paths, but quickly gave up
> > because I know too little about this machinery.
> 
> I didn't check explicitly, but in general such a program would already
> need a mutex to synchronize object lookup. Not for lookup_object
> specifically, but because lookup_object is mostly used to back
> lookup_commit, lookup_tree, etc, which are already not re-entrant
> (because they lazily insert into the hash behind the scenes).

I just looked through the 25 existing calls to lookup_object. All of
them should be OK. Most of them are coupled with immediate calls to
update the hash table and/or to call read_sha1_file (which is also very
not-thread-safe). So I don't think the patch introduces any bug into the
current code.

It does introduce a potential for future bugs in concurrent code, but I
don't know that it makes a significant dent in the current codebase. We
already have a lot of non-reentrant functions, including all of the
other lookup_* functions.  Our concurrent code is typically very careful
to use a small known-safe subset of functions.

I did look originally at updating the ordering when the hash is already
being updated (i.e., to insert a new entry at the front of a collision
chain, rather than the back). However, that didn't yield nearly as much
improvement. I think partially because the locality of an insert/lookup
pair is not as good as a lookup/lookup pair. And partially because we
destroy that ordering for existing entries whenever we grow the hash
table.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] lookup_object: prioritize recently found objects

2013-05-02 Thread Jeff King
On Fri, May 03, 2013 at 07:59:09AM +0200, Johannes Sixt wrote:

> Am 5/2/2013 17:46, schrieb Jeff King:
> > On Thu, May 02, 2013 at 09:05:01AM +0200, Johannes Sixt wrote:
> >> BTW, do you notice that the function is now modifying an object (the hash
> >> table) even though this is rather unexpected from a "lookup" function?
> > 
> > I think this is fine. The function is conceptually constant from the
> > outside; callers don't even know about the hash table. They just know
> > that there is some mapping. It's similar to the way that lookup_commit
> > will lazily allocate the "struct commit". The callers do not care
> > whether it exists already or not; they care that at the end of the
> > function, they have a pointer to the commit. Everything else is an
> > implementation detail.
> 
> Can we be sure that the function is never invoked in concurrently from
> different threads? I attempted to audit code paths, but quickly gave up
> because I know too little about this machinery.

I didn't check explicitly, but in general such a program would already
need a mutex to synchronize object lookup. Not for lookup_object
specifically, but because lookup_object is mostly used to back
lookup_commit, lookup_tree, etc, which are already not re-entrant
(because they lazily insert into the hash behind the scenes).

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] lookup_object: prioritize recently found objects

2013-05-02 Thread Johannes Sixt
Am 5/2/2013 17:46, schrieb Jeff King:
> On Thu, May 02, 2013 at 09:05:01AM +0200, Johannes Sixt wrote:
>> BTW, do you notice that the function is now modifying an object (the hash
>> table) even though this is rather unexpected from a "lookup" function?
> 
> I think this is fine. The function is conceptually constant from the
> outside; callers don't even know about the hash table. They just know
> that there is some mapping. It's similar to the way that lookup_commit
> will lazily allocate the "struct commit". The callers do not care
> whether it exists already or not; they care that at the end of the
> function, they have a pointer to the commit. Everything else is an
> implementation detail.

Can we be sure that the function is never invoked in concurrently from
different threads? I attempted to audit code paths, but quickly gave up
because I know too little about this machinery.

-- Hannes
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] lookup_object: prioritize recently found objects

2013-05-02 Thread Jeff King
On Thu, May 02, 2013 at 08:46:08AM -0700, Junio C Hamano wrote:

> Johannes Sixt  writes:
> 
> > BTW, do you notice that the function is now modifying an object (the hash
> > table) even though this is rather unexpected from a "lookup" function?
> 
> At the philosophical level, "lookup" ought to be operating on a
> "const" table.  But at the implementation level, the table does not
> have to be "const" in the sense the implemenation language and data
> structure defines.
> 
> I think this patch is a good example of that.

Very much agreed.

> I am kind of surprised that Jeff's attempt to do a full LRU made
> things worse, though.  The only additional code compared to swap is
> a single memmove(): are we colliding too often (having to move a
> lot)?

I actually didn't use memmove, but a hand-rolled loop. I wonder if that
would have made the difference. It's a little tricky because you may
have to cross the mod(obj_hash_size) wrap-around. The patch I tested
was:

diff --git a/object.c b/object.c
index 1ba2083..8b2412c 100644
--- a/object.c
+++ b/object.c
@@ -85,10 +85,13 @@ struct object *lookup_object(const unsigned char *sha1)
if (i == obj_hash_size)
i = 0;
}
-   if (obj && i != first) {
-   struct object *tmp = obj_hash[i];
-   obj_hash[i] = obj_hash[first];
-   obj_hash[first] = tmp;
+   if (obj) {
+   while (i != first) {
+   int prev = i ? i - 1 : obj_hash_size - 1;
+   obj_hash[i] = obj_hash[prev];
+   i = prev;
+   }
+   obj_hash[i] = obj;
}
return obj;
 }

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] lookup_object: prioritize recently found objects

2013-05-02 Thread Jeff King
On Thu, May 02, 2013 at 09:05:01AM +0200, Johannes Sixt wrote:

> > I figured the lengthy description in the commit message would be
> > sufficient,
> 
> It's absolutely sufficient *if* one reads the commit message. In this
> case, though it goes more like "this function should be trivial, and it is
> -- up to this if statement; what the heck is it good for?" and the reader
> is forced to dig the history.

What I really didn't want to do was to rewrite the whole rationale
again. I think something like:

  /*
   * Move the found object to the front of the collision chain;
   * this is a pure optimization that can make future lookups cheaper,
   * assuming there is some time-locality to which objects are looked
   * up.
   */

would be enough, and interested parties can go to the commit history (I
tend to go there pretty rapidly if there is a chunk of code that sticks
out, but I guess not everybody does).

> BTW, do you notice that the function is now modifying an object (the hash
> table) even though this is rather unexpected from a "lookup" function?

I think this is fine. The function is conceptually constant from the
outside; callers don't even know about the hash table. They just know
that there is some mapping. It's similar to the way that lookup_commit
will lazily allocate the "struct commit". The callers do not care
whether it exists already or not; they care that at the end of the
function, they have a pointer to the commit. Everything else is an
implementation detail.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] lookup_object: prioritize recently found objects

2013-05-02 Thread Junio C Hamano
Johannes Sixt  writes:

> BTW, do you notice that the function is now modifying an object (the hash
> table) even though this is rather unexpected from a "lookup" function?

At the philosophical level, "lookup" ought to be operating on a
"const" table.  But at the implementation level, the table does not
have to be "const" in the sense the implemenation language and data
structure defines.

I think this patch is a good example of that.

I am kind of surprised that Jeff's attempt to do a full LRU made
things worse, though.  The only additional code compared to swap is
a single memmove(): are we colliding too often (having to move a
lot)?

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] lookup_object: prioritize recently found objects

2013-05-02 Thread Junio C Hamano
Jeff King  writes:

> I figured the lengthy description in the commit message would be
> sufficient, but I don't mind adding something like your suggestion to
> point readers of the code in the right direction when they see it.

Yeah, I'll squash J6t's comment in and requeue.

If somebody is learning the internals by reading code, having a
comment there would certainly help.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] lookup_object: prioritize recently found objects

2013-05-02 Thread Johannes Sixt
Am 5/2/2013 8:46, schrieb Jeff King:
> On Thu, May 02, 2013 at 08:44:07AM +0200, Johannes Sixt wrote:
>> Am 5/1/2013 22:34, schrieb Jeff King:
>>>  struct object *lookup_object(const unsigned char *sha1)
>>>  {
>>> -   unsigned int i;
>>> +   unsigned int i, first;
>>> struct object *obj;
>>>  
>>> if (!obj_hash)
>>> return NULL;
>>>  
>>> -   i = hashtable_index(sha1);
>>> +   first = i = hashtable_index(sha1);
>>> while ((obj = obj_hash[i]) != NULL) {
>>> if (!hashcmp(sha1, obj->sha1))
>>> break;
>>> @@ -85,6 +85,11 @@ struct object *lookup_object(const unsigned char *sha1)
>>> if (i == obj_hash_size)
>>> i = 0;
>>> }
>>> +   if (obj && i != first) {
>>> +   struct object *tmp = obj_hash[i];
>>> +   obj_hash[i] = obj_hash[first];
>>> +   obj_hash[first] = tmp;
>>> +   }
>>> return obj;
>>>  }
>>
>> This is one of the places where I think the code does not speak for itself
>> and a comment is warranted: The new if statement is not about correctness,
>> but about optimization:
> 
> I figured the lengthy description in the commit message would be
> sufficient,

It's absolutely sufficient *if* one reads the commit message. In this
case, though it goes more like "this function should be trivial, and it is
-- up to this if statement; what the heck is it good for?" and the reader
is forced to dig the history.

BTW, do you notice that the function is now modifying an object (the hash
table) even though this is rather unexpected from a "lookup" function?

-- Hannes
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] lookup_object: prioritize recently found objects

2013-05-01 Thread Jeff King
On Thu, May 02, 2013 at 08:44:07AM +0200, Johannes Sixt wrote:

> Am 5/1/2013 22:34, schrieb Jeff King:
> >  struct object *lookup_object(const unsigned char *sha1)
> >  {
> > -   unsigned int i;
> > +   unsigned int i, first;
> > struct object *obj;
> >  
> > if (!obj_hash)
> > return NULL;
> >  
> > -   i = hashtable_index(sha1);
> > +   first = i = hashtable_index(sha1);
> > while ((obj = obj_hash[i]) != NULL) {
> > if (!hashcmp(sha1, obj->sha1))
> > break;
> > @@ -85,6 +85,11 @@ struct object *lookup_object(const unsigned char *sha1)
> > if (i == obj_hash_size)
> > i = 0;
> > }
> > +   if (obj && i != first) {
> > +   struct object *tmp = obj_hash[i];
> > +   obj_hash[i] = obj_hash[first];
> > +   obj_hash[first] = tmp;
> > +   }
> > return obj;
> >  }
> 
> This is one of the places where I think the code does not speak for itself
> and a comment is warranted: The new if statement is not about correctness,
> but about optimization:

I figured the lengthy description in the commit message would be
sufficient, but I don't mind adding something like your suggestion to
point readers of the code in the right direction when they see it.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] lookup_object: prioritize recently found objects

2013-05-01 Thread Johannes Sixt
Am 5/1/2013 22:34, schrieb Jeff King:
>  struct object *lookup_object(const unsigned char *sha1)
>  {
> - unsigned int i;
> + unsigned int i, first;
>   struct object *obj;
>  
>   if (!obj_hash)
>   return NULL;
>  
> - i = hashtable_index(sha1);
> + first = i = hashtable_index(sha1);
>   while ((obj = obj_hash[i]) != NULL) {
>   if (!hashcmp(sha1, obj->sha1))
>   break;
> @@ -85,6 +85,11 @@ struct object *lookup_object(const unsigned char *sha1)
>   if (i == obj_hash_size)
>   i = 0;
>   }
> + if (obj && i != first) {
> + struct object *tmp = obj_hash[i];
> + obj_hash[i] = obj_hash[first];
> + obj_hash[first] = tmp;
> + }
>   return obj;
>  }

This is one of the places where I think the code does not speak for itself
and a comment is warranted: The new if statement is not about correctness,
but about optimization:

/*
 * Move object to where we started to look for it
 * so that we do not need to walk the hash table
 * the next time we look for it.
 */

or something.

-- Hannes
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] lookup_object: prioritize recently found objects

2013-05-01 Thread Junio C Hamano
Jeff King  writes:

> We could instead bump X into the `i` slot, and then shift
> the whole contiguous chain down by one, resulting in:
>
> index   | i-1 |  i  | i+1 | i+2 |
>---
> entry   ... |  A  |  X  |  B  |  C  | ...
>---
>
> That puts our chain in true most-recently-used order.
> However, experiments show that it is not any faster (and in
> fact, is slightly slower due to the extra manipulation).

Makes quite a lot of sense to me.  Very simple, clean and nice.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] lookup_object: prioritize recently found objects

2013-05-01 Thread Jeff King
The lookup_object function is backed by a hash table of all
objects we have seen in the program. We manage collisions
with a linear walk over the colliding entries, checking each
with hashcmp(). The main cost of lookup is in these
hashcmp() calls; finding our item in the first slot is
cheaper than finding it in the second slot, which is cheaper
than the third, and so on.

If we assume that there is some locality to the object
lookups (e.g., if X and Y collide, and we have just looked
up X, the next lookup is more likely to be for X than for
Y), then we can improve our average lookup speed by checking
X before Y.

This patch does so by swapping a found item to the front of
the collision chain. The p0001 perf test reveals that this
does indeed exploit locality in the case of "rev-list --all
--objects":

Test   origin  this tree
-
0001.1: rev-list --all 0.40(0.38+0.02) 0.40(0.36+0.03) +0.0%
0001.2: rev-list --all --objects   2.24(2.17+0.05) 1.86(1.79+0.05) -17.0%

This is not surprising, as the full object traversal will
hit the same tree entries over and over (e.g., for every
commit that doesn't change "Documentation/", we will have to
look up the same sha1 just to find out that we already
processed it).

The reason why this technique works (and does not violate
any properties of the hash table) is subtle and bears some
explanation. Let's imagine we get a lookup for sha1 `X`, and
it hashes to bucket `i` in our table. That stretch of the
table may look like:

index   | i-1 |  i  | i+1 | i+2 |
   ---
entry   ... |  A  |  B  |  C  |  X  | ...
   ---

We start our probe at i, see that B does not match, nor does
C, and finally find X. There may be multiple C's in the
middle, but we know that there are no empty slots (or else
we would not find X at all).

We do not know the original index of B; it may be `i`, or it
may be less than i (e.g., if it were `i-1`, it would collide
with A and spill over into the `i` bucket). So it is
acceptable for us to move it to the right of a contiguous
stretch of entries (because we will find it from a linear
walk starting anywhere at `i` or before), but never to the
left (if we moved it to `i-1`, we would miss it when
starting our walk at `i`).

We do know the original index of X; it is `i`, so it is safe
to place it anywhere in the contiguous stretch between `i`
and where we found it (`i+2` in the this case).

This patch does a pure swap; after finding X in the
situation above, we would end with:

index   | i-1 |  i  | i+1 | i+2 |
   ---
entry   ... |  A  |  X  |  C  |  B  | ...
   ---

We could instead bump X into the `i` slot, and then shift
the whole contiguous chain down by one, resulting in:

index   | i-1 |  i  | i+1 | i+2 |
   ---
entry   ... |  A  |  X  |  B  |  C  | ...
   ---

That puts our chain in true most-recently-used order.
However, experiments show that it is not any faster (and in
fact, is slightly slower due to the extra manipulation).

Signed-off-by: Jeff King 
---
The speedup on linux.git is less spectacular; I got about 9% improvement
versus stock git. However, when I combine this with the commit cache we
were looking at a few months ago, at which point the lookup_object time
starts to dominate, then doing this nets more like 20% on linux.git.

It may speed up other tree-oriented loads, too, but I didn't measure. At
the very least, it should generally not be making other loads worse.

 object.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/object.c b/object.c
index 20703f5..1ba2083 100644
--- a/object.c
+++ b/object.c
@@ -71,13 +71,13 @@ struct object *lookup_object(const unsigned char *sha1)
 
 struct object *lookup_object(const unsigned char *sha1)
 {
-   unsigned int i;
+   unsigned int i, first;
struct object *obj;
 
if (!obj_hash)
return NULL;
 
-   i = hashtable_index(sha1);
+   first = i = hashtable_index(sha1);
while ((obj = obj_hash[i]) != NULL) {
if (!hashcmp(sha1, obj->sha1))
break;
@@ -85,6 +85,11 @@ struct object *lookup_object(const unsigned char *sha1)
if (i == obj_hash_size)
i = 0;
}
+   if (obj && i != first) {
+   struct object *tmp = obj_hash[i];
+   obj_hash[i] = obj_hash[first];
+   obj_hash[first] = tmp;
+   }
return obj;
 }
 
-- 
1.8.2.2.10.geb24d2a
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html