Re: [PATCH v3] unpack-trees: avoid duplicate ODB lookups during checkout
On 4/13/2017 8:59 PM, Jeff King wrote: On Thu, Apr 13, 2017 at 04:11:31PM -0700, Junio C Hamano wrote: g...@jeffhostetler.com writes: + /* +* Fetch the tree from the ODB for each peer directory in the +* n commits. +* +* For 2- and 3-way traversals, we try to avoid hitting the +* ODB twice for the same OID. This should yield a nice speed +* up in checkouts and merges when the commits are similar. +* +* We don't bother doing the full O(n^2) search for larger n, +* because wider traversals don't happen that often and we +* avoid the search setup. +* +* When 2 peer OIDs are the same, we just copy the tree +* descriptor data. This implicitly borrows the buffer +* data from the earlier cell. This "borrowing" made me worried, but it turns out that this is perfectly OK. fill_tree_descriptor() uses read_sha1_file() to give a tree_desc its own copy of the tree object data, so the code that calls into the tree traversal API is responsible for freeing the buffer returned from fill_tree_descriptor(). The updated code avoids double freeing by setting buf[i] to NULL for borrowing [i]. Yeah, I think that logic is fine. We could also keep a separate counter for the size of buf, like: buf[nr_buf++] = fill_tree_descriptor(t+i, sha1); and then just count from zero to nr_buf in the freeing loop. I don't think it matters much either way (the free(NULL) calls are presumably cheap). Agreed. I don't think this matters one way or the other, but I'll go ahead and do this while it is fresh. Thanks, Jeff
Re: [PATCH v3] unpack-trees: avoid duplicate ODB lookups during checkout
On Thu, Apr 13, 2017 at 04:11:31PM -0700, Junio C Hamano wrote: > g...@jeffhostetler.com writes: > > > + /* > > +* Fetch the tree from the ODB for each peer directory in the > > +* n commits. > > +* > > +* For 2- and 3-way traversals, we try to avoid hitting the > > +* ODB twice for the same OID. This should yield a nice speed > > +* up in checkouts and merges when the commits are similar. > > +* > > +* We don't bother doing the full O(n^2) search for larger n, > > +* because wider traversals don't happen that often and we > > +* avoid the search setup. > > +* > > +* When 2 peer OIDs are the same, we just copy the tree > > +* descriptor data. This implicitly borrows the buffer > > +* data from the earlier cell. > > This "borrowing" made me worried, but it turns out that this is > perfectly OK. > > fill_tree_descriptor() uses read_sha1_file() to give a tree_desc its > own copy of the tree object data, so the code that calls into the > tree traversal API is responsible for freeing the buffer returned > from fill_tree_descriptor(). The updated code avoids double freeing > by setting buf[i] to NULL for borrowing [i]. Yeah, I think that logic is fine. We could also keep a separate counter for the size of buf, like: buf[nr_buf++] = fill_tree_descriptor(t+i, sha1); and then just count from zero to nr_buf in the freeing loop. I don't think it matters much either way (the free(NULL) calls are presumably cheap). -Peff
Re: [PATCH v3] unpack-trees: avoid duplicate ODB lookups during checkout
g...@jeffhostetler.com writes: > + /* > + * Fetch the tree from the ODB for each peer directory in the > + * n commits. > + * > + * For 2- and 3-way traversals, we try to avoid hitting the > + * ODB twice for the same OID. This should yield a nice speed > + * up in checkouts and merges when the commits are similar. > + * > + * We don't bother doing the full O(n^2) search for larger n, > + * because wider traversals don't happen that often and we > + * avoid the search setup. > + * > + * When 2 peer OIDs are the same, we just copy the tree > + * descriptor data. This implicitly borrows the buffer > + * data from the earlier cell. This "borrowing" made me worried, but it turns out that this is perfectly OK. fill_tree_descriptor() uses read_sha1_file() to give a tree_desc its own copy of the tree object data, so the code that calls into the tree traversal API is responsible for freeing the buffer returned from fill_tree_descriptor(). The updated code avoids double freeing by setting buf[i] to NULL for borrowing [i]. > + */ > for (i = 0; i < n; i++, dirmask >>= 1) { > - const unsigned char *sha1 = NULL; > - if (dirmask & 1) > - sha1 = names[i].oid->hash; > - buf[i] = fill_tree_descriptor(t+i, sha1); > + if (i > 0 && are_same_oid([i], [i - 1])) { > + t[i] = t[i - 1]; > + buf[i] = NULL; > + } else if (i > 1 && are_same_oid([i], [i - 2])) { > + t[i] = t[i - 2]; > + buf[i] = NULL; > + } else { > + const unsigned char *sha1 = NULL; > + if (dirmask & 1) > + sha1 = names[i].oid->hash; > + buf[i] = fill_tree_descriptor(t+i, sha1); > + } > } > > bottom = switch_cache_bottom();
[PATCH v3] unpack-trees: avoid duplicate ODB lookups during checkout
From: Jeff HostetlerVersion 3 uses a structure copy rather than memcpy and adds a comment. Jeff Hostetler (1): unpack-trees: avoid duplicate ODB lookups during checkout unpack-trees.c | 37 + 1 file changed, 33 insertions(+), 4 deletions(-) -- 2.9.3
[PATCH v3] unpack-trees: avoid duplicate ODB lookups during checkout
From: Jeff HostetlerTeach traverse_trees_recursive() to not do redundant ODB lookups when both directories refer to the same OID. In operations such as read-tree and checkout, there will likely be many peer directories that have the same OID when the differences between the commits are relatively small. In these cases we can avoid hitting the ODB multiple times for the same OID. This patch handles n=2 and n=3 cases and simply copies the data rather than repeating the fill_tree_descriptor(). On the Windows repo (500K trees, 3.1M files, 450MB index), this reduced the overall time by 0.75 seconds when cycling between 2 commits with a single file difference. (avg) before: 22.699 (avg) after: 21.955 === On Linux using p0006-read-tree-checkout.sh with linux.git: Test HEAD^ HEAD --- 0006.2: read-tree br_base br_ballast (57994) 0.24(0.20+0.03) 0.24(0.22+0.01) +0.0% 0006.3: switch between br_base br_ballast (57994) 10.58(6.23+2.86) 10.67(5.94+2.87) +0.9% 0006.4: switch between br_ballast br_ballast_plus_1 (57994) 0.60(0.44+0.17) 0.57(0.44+0.14) -5.0% 0006.5: switch between aliases (57994)0.59(0.48+0.13) 0.57(0.44+0.15) -3.4% Signed-off-by: Jeff Hostetler --- unpack-trees.c | 37 + 1 file changed, 33 insertions(+), 4 deletions(-) diff --git a/unpack-trees.c b/unpack-trees.c index 3a8ee19..a674423 100644 --- a/unpack-trees.c +++ b/unpack-trees.c @@ -531,6 +531,11 @@ static int switch_cache_bottom(struct traverse_info *info) return ret; } +static inline int are_same_oid(struct name_entry *name_j, struct name_entry *name_k) +{ + return name_j->oid && name_k->oid && !oidcmp(name_j->oid, name_k->oid); +} + static int traverse_trees_recursive(int n, unsigned long dirmask, unsigned long df_conflicts, struct name_entry *names, @@ -553,11 +558,35 @@ static int traverse_trees_recursive(int n, unsigned long dirmask, newinfo.pathlen += tree_entry_len(p) + 1; newinfo.df_conflicts |= df_conflicts; + /* +* Fetch the tree from the ODB for each peer directory in the +* n commits. +* +* For 2- and 3-way traversals, we try to avoid hitting the +* ODB twice for the same OID. This should yield a nice speed +* up in checkouts and merges when the commits are similar. +* +* We don't bother doing the full O(n^2) search for larger n, +* because wider traversals don't happen that often and we +* avoid the search setup. +* +* When 2 peer OIDs are the same, we just copy the tree +* descriptor data. This implicitly borrows the buffer +* data from the earlier cell. +*/ for (i = 0; i < n; i++, dirmask >>= 1) { - const unsigned char *sha1 = NULL; - if (dirmask & 1) - sha1 = names[i].oid->hash; - buf[i] = fill_tree_descriptor(t+i, sha1); + if (i > 0 && are_same_oid([i], [i - 1])) { + t[i] = t[i - 1]; + buf[i] = NULL; + } else if (i > 1 && are_same_oid([i], [i - 2])) { + t[i] = t[i - 2]; + buf[i] = NULL; + } else { + const unsigned char *sha1 = NULL; + if (dirmask & 1) + sha1 = names[i].oid->hash; + buf[i] = fill_tree_descriptor(t+i, sha1); + } } bottom = switch_cache_bottom(); -- 2.9.3