Re: [PATCH] lookup_object: prioritize recently found objects
On Fri, May 03, 2013 at 02:02:44AM -0400, Jeff King wrote: > > Can we be sure that the function is never invoked in concurrently from > > different threads? I attempted to audit code paths, but quickly gave up > > because I know too little about this machinery. > > I didn't check explicitly, but in general such a program would already > need a mutex to synchronize object lookup. Not for lookup_object > specifically, but because lookup_object is mostly used to back > lookup_commit, lookup_tree, etc, which are already not re-entrant > (because they lazily insert into the hash behind the scenes). I just looked through the 25 existing calls to lookup_object. All of them should be OK. Most of them are coupled with immediate calls to update the hash table and/or to call read_sha1_file (which is also very not-thread-safe). So I don't think the patch introduces any bug into the current code. It does introduce a potential for future bugs in concurrent code, but I don't know that it makes a significant dent in the current codebase. We already have a lot of non-reentrant functions, including all of the other lookup_* functions. Our concurrent code is typically very careful to use a small known-safe subset of functions. I did look originally at updating the ordering when the hash is already being updated (i.e., to insert a new entry at the front of a collision chain, rather than the back). However, that didn't yield nearly as much improvement. I think partially because the locality of an insert/lookup pair is not as good as a lookup/lookup pair. And partially because we destroy that ordering for existing entries whenever we grow the hash table. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] lookup_object: prioritize recently found objects
On Fri, May 03, 2013 at 07:59:09AM +0200, Johannes Sixt wrote: > Am 5/2/2013 17:46, schrieb Jeff King: > > On Thu, May 02, 2013 at 09:05:01AM +0200, Johannes Sixt wrote: > >> BTW, do you notice that the function is now modifying an object (the hash > >> table) even though this is rather unexpected from a "lookup" function? > > > > I think this is fine. The function is conceptually constant from the > > outside; callers don't even know about the hash table. They just know > > that there is some mapping. It's similar to the way that lookup_commit > > will lazily allocate the "struct commit". The callers do not care > > whether it exists already or not; they care that at the end of the > > function, they have a pointer to the commit. Everything else is an > > implementation detail. > > Can we be sure that the function is never invoked in concurrently from > different threads? I attempted to audit code paths, but quickly gave up > because I know too little about this machinery. I didn't check explicitly, but in general such a program would already need a mutex to synchronize object lookup. Not for lookup_object specifically, but because lookup_object is mostly used to back lookup_commit, lookup_tree, etc, which are already not re-entrant (because they lazily insert into the hash behind the scenes). -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] lookup_object: prioritize recently found objects
Am 5/2/2013 17:46, schrieb Jeff King: > On Thu, May 02, 2013 at 09:05:01AM +0200, Johannes Sixt wrote: >> BTW, do you notice that the function is now modifying an object (the hash >> table) even though this is rather unexpected from a "lookup" function? > > I think this is fine. The function is conceptually constant from the > outside; callers don't even know about the hash table. They just know > that there is some mapping. It's similar to the way that lookup_commit > will lazily allocate the "struct commit". The callers do not care > whether it exists already or not; they care that at the end of the > function, they have a pointer to the commit. Everything else is an > implementation detail. Can we be sure that the function is never invoked in concurrently from different threads? I attempted to audit code paths, but quickly gave up because I know too little about this machinery. -- Hannes -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] lookup_object: prioritize recently found objects
On Thu, May 02, 2013 at 08:46:08AM -0700, Junio C Hamano wrote: > Johannes Sixt writes: > > > BTW, do you notice that the function is now modifying an object (the hash > > table) even though this is rather unexpected from a "lookup" function? > > At the philosophical level, "lookup" ought to be operating on a > "const" table. But at the implementation level, the table does not > have to be "const" in the sense the implemenation language and data > structure defines. > > I think this patch is a good example of that. Very much agreed. > I am kind of surprised that Jeff's attempt to do a full LRU made > things worse, though. The only additional code compared to swap is > a single memmove(): are we colliding too often (having to move a > lot)? I actually didn't use memmove, but a hand-rolled loop. I wonder if that would have made the difference. It's a little tricky because you may have to cross the mod(obj_hash_size) wrap-around. The patch I tested was: diff --git a/object.c b/object.c index 1ba2083..8b2412c 100644 --- a/object.c +++ b/object.c @@ -85,10 +85,13 @@ struct object *lookup_object(const unsigned char *sha1) if (i == obj_hash_size) i = 0; } - if (obj && i != first) { - struct object *tmp = obj_hash[i]; - obj_hash[i] = obj_hash[first]; - obj_hash[first] = tmp; + if (obj) { + while (i != first) { + int prev = i ? i - 1 : obj_hash_size - 1; + obj_hash[i] = obj_hash[prev]; + i = prev; + } + obj_hash[i] = obj; } return obj; } -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] lookup_object: prioritize recently found objects
On Thu, May 02, 2013 at 09:05:01AM +0200, Johannes Sixt wrote: > > I figured the lengthy description in the commit message would be > > sufficient, > > It's absolutely sufficient *if* one reads the commit message. In this > case, though it goes more like "this function should be trivial, and it is > -- up to this if statement; what the heck is it good for?" and the reader > is forced to dig the history. What I really didn't want to do was to rewrite the whole rationale again. I think something like: /* * Move the found object to the front of the collision chain; * this is a pure optimization that can make future lookups cheaper, * assuming there is some time-locality to which objects are looked * up. */ would be enough, and interested parties can go to the commit history (I tend to go there pretty rapidly if there is a chunk of code that sticks out, but I guess not everybody does). > BTW, do you notice that the function is now modifying an object (the hash > table) even though this is rather unexpected from a "lookup" function? I think this is fine. The function is conceptually constant from the outside; callers don't even know about the hash table. They just know that there is some mapping. It's similar to the way that lookup_commit will lazily allocate the "struct commit". The callers do not care whether it exists already or not; they care that at the end of the function, they have a pointer to the commit. Everything else is an implementation detail. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] lookup_object: prioritize recently found objects
Johannes Sixt writes: > BTW, do you notice that the function is now modifying an object (the hash > table) even though this is rather unexpected from a "lookup" function? At the philosophical level, "lookup" ought to be operating on a "const" table. But at the implementation level, the table does not have to be "const" in the sense the implemenation language and data structure defines. I think this patch is a good example of that. I am kind of surprised that Jeff's attempt to do a full LRU made things worse, though. The only additional code compared to swap is a single memmove(): are we colliding too often (having to move a lot)? -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] lookup_object: prioritize recently found objects
Jeff King writes: > I figured the lengthy description in the commit message would be > sufficient, but I don't mind adding something like your suggestion to > point readers of the code in the right direction when they see it. Yeah, I'll squash J6t's comment in and requeue. If somebody is learning the internals by reading code, having a comment there would certainly help. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] lookup_object: prioritize recently found objects
Am 5/2/2013 8:46, schrieb Jeff King: > On Thu, May 02, 2013 at 08:44:07AM +0200, Johannes Sixt wrote: >> Am 5/1/2013 22:34, schrieb Jeff King: >>> struct object *lookup_object(const unsigned char *sha1) >>> { >>> - unsigned int i; >>> + unsigned int i, first; >>> struct object *obj; >>> >>> if (!obj_hash) >>> return NULL; >>> >>> - i = hashtable_index(sha1); >>> + first = i = hashtable_index(sha1); >>> while ((obj = obj_hash[i]) != NULL) { >>> if (!hashcmp(sha1, obj->sha1)) >>> break; >>> @@ -85,6 +85,11 @@ struct object *lookup_object(const unsigned char *sha1) >>> if (i == obj_hash_size) >>> i = 0; >>> } >>> + if (obj && i != first) { >>> + struct object *tmp = obj_hash[i]; >>> + obj_hash[i] = obj_hash[first]; >>> + obj_hash[first] = tmp; >>> + } >>> return obj; >>> } >> >> This is one of the places where I think the code does not speak for itself >> and a comment is warranted: The new if statement is not about correctness, >> but about optimization: > > I figured the lengthy description in the commit message would be > sufficient, It's absolutely sufficient *if* one reads the commit message. In this case, though it goes more like "this function should be trivial, and it is -- up to this if statement; what the heck is it good for?" and the reader is forced to dig the history. BTW, do you notice that the function is now modifying an object (the hash table) even though this is rather unexpected from a "lookup" function? -- Hannes -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] lookup_object: prioritize recently found objects
On Thu, May 02, 2013 at 08:44:07AM +0200, Johannes Sixt wrote: > Am 5/1/2013 22:34, schrieb Jeff King: > > struct object *lookup_object(const unsigned char *sha1) > > { > > - unsigned int i; > > + unsigned int i, first; > > struct object *obj; > > > > if (!obj_hash) > > return NULL; > > > > - i = hashtable_index(sha1); > > + first = i = hashtable_index(sha1); > > while ((obj = obj_hash[i]) != NULL) { > > if (!hashcmp(sha1, obj->sha1)) > > break; > > @@ -85,6 +85,11 @@ struct object *lookup_object(const unsigned char *sha1) > > if (i == obj_hash_size) > > i = 0; > > } > > + if (obj && i != first) { > > + struct object *tmp = obj_hash[i]; > > + obj_hash[i] = obj_hash[first]; > > + obj_hash[first] = tmp; > > + } > > return obj; > > } > > This is one of the places where I think the code does not speak for itself > and a comment is warranted: The new if statement is not about correctness, > but about optimization: I figured the lengthy description in the commit message would be sufficient, but I don't mind adding something like your suggestion to point readers of the code in the right direction when they see it. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] lookup_object: prioritize recently found objects
Am 5/1/2013 22:34, schrieb Jeff King: > struct object *lookup_object(const unsigned char *sha1) > { > - unsigned int i; > + unsigned int i, first; > struct object *obj; > > if (!obj_hash) > return NULL; > > - i = hashtable_index(sha1); > + first = i = hashtable_index(sha1); > while ((obj = obj_hash[i]) != NULL) { > if (!hashcmp(sha1, obj->sha1)) > break; > @@ -85,6 +85,11 @@ struct object *lookup_object(const unsigned char *sha1) > if (i == obj_hash_size) > i = 0; > } > + if (obj && i != first) { > + struct object *tmp = obj_hash[i]; > + obj_hash[i] = obj_hash[first]; > + obj_hash[first] = tmp; > + } > return obj; > } This is one of the places where I think the code does not speak for itself and a comment is warranted: The new if statement is not about correctness, but about optimization: /* * Move object to where we started to look for it * so that we do not need to walk the hash table * the next time we look for it. */ or something. -- Hannes -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] lookup_object: prioritize recently found objects
Jeff King writes: > We could instead bump X into the `i` slot, and then shift > the whole contiguous chain down by one, resulting in: > > index | i-1 | i | i+1 | i+2 | >--- > entry ... | A | X | B | C | ... >--- > > That puts our chain in true most-recently-used order. > However, experiments show that it is not any faster (and in > fact, is slightly slower due to the extra manipulation). Makes quite a lot of sense to me. Very simple, clean and nice. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] lookup_object: prioritize recently found objects
The lookup_object function is backed by a hash table of all objects we have seen in the program. We manage collisions with a linear walk over the colliding entries, checking each with hashcmp(). The main cost of lookup is in these hashcmp() calls; finding our item in the first slot is cheaper than finding it in the second slot, which is cheaper than the third, and so on. If we assume that there is some locality to the object lookups (e.g., if X and Y collide, and we have just looked up X, the next lookup is more likely to be for X than for Y), then we can improve our average lookup speed by checking X before Y. This patch does so by swapping a found item to the front of the collision chain. The p0001 perf test reveals that this does indeed exploit locality in the case of "rev-list --all --objects": Test origin this tree - 0001.1: rev-list --all 0.40(0.38+0.02) 0.40(0.36+0.03) +0.0% 0001.2: rev-list --all --objects 2.24(2.17+0.05) 1.86(1.79+0.05) -17.0% This is not surprising, as the full object traversal will hit the same tree entries over and over (e.g., for every commit that doesn't change "Documentation/", we will have to look up the same sha1 just to find out that we already processed it). The reason why this technique works (and does not violate any properties of the hash table) is subtle and bears some explanation. Let's imagine we get a lookup for sha1 `X`, and it hashes to bucket `i` in our table. That stretch of the table may look like: index | i-1 | i | i+1 | i+2 | --- entry ... | A | B | C | X | ... --- We start our probe at i, see that B does not match, nor does C, and finally find X. There may be multiple C's in the middle, but we know that there are no empty slots (or else we would not find X at all). We do not know the original index of B; it may be `i`, or it may be less than i (e.g., if it were `i-1`, it would collide with A and spill over into the `i` bucket). So it is acceptable for us to move it to the right of a contiguous stretch of entries (because we will find it from a linear walk starting anywhere at `i` or before), but never to the left (if we moved it to `i-1`, we would miss it when starting our walk at `i`). We do know the original index of X; it is `i`, so it is safe to place it anywhere in the contiguous stretch between `i` and where we found it (`i+2` in the this case). This patch does a pure swap; after finding X in the situation above, we would end with: index | i-1 | i | i+1 | i+2 | --- entry ... | A | X | C | B | ... --- We could instead bump X into the `i` slot, and then shift the whole contiguous chain down by one, resulting in: index | i-1 | i | i+1 | i+2 | --- entry ... | A | X | B | C | ... --- That puts our chain in true most-recently-used order. However, experiments show that it is not any faster (and in fact, is slightly slower due to the extra manipulation). Signed-off-by: Jeff King --- The speedup on linux.git is less spectacular; I got about 9% improvement versus stock git. However, when I combine this with the commit cache we were looking at a few months ago, at which point the lookup_object time starts to dominate, then doing this nets more like 20% on linux.git. It may speed up other tree-oriented loads, too, but I didn't measure. At the very least, it should generally not be making other loads worse. object.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/object.c b/object.c index 20703f5..1ba2083 100644 --- a/object.c +++ b/object.c @@ -71,13 +71,13 @@ struct object *lookup_object(const unsigned char *sha1) struct object *lookup_object(const unsigned char *sha1) { - unsigned int i; + unsigned int i, first; struct object *obj; if (!obj_hash) return NULL; - i = hashtable_index(sha1); + first = i = hashtable_index(sha1); while ((obj = obj_hash[i]) != NULL) { if (!hashcmp(sha1, obj->sha1)) break; @@ -85,6 +85,11 @@ struct object *lookup_object(const unsigned char *sha1) if (i == obj_hash_size) i = 0; } + if (obj && i != first) { + struct object *tmp = obj_hash[i]; + obj_hash[i] = obj_hash[first]; + obj_hash[first] = tmp; + } return obj; } -- 1.8.2.2.10.geb24d2a -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html