Re: [RFC/PATCH] patch-ids: check modified paths before calculating diff

2013-05-29 Thread John Keeping
On Wed, May 29, 2013 at 11:48:53AM -0700, Junio C Hamano wrote:
> Jeff King  writes:
> 
> > On Wed, May 29, 2013 at 11:08:46AM -0700, Junio C Hamano wrote:
> >
> >> This has rather interesting ramifications on cherry-pick and rebase,
> >> though.  Both command can handle changes that come from an old tree
> >> before some paths were renamed, but strict patch-id would not spot
> >> equivalent changes we already have in our history if our change
> >> happened after a rename, i.e.
> >> 
> >>Z
> >>   /
> >>  O---R---X---Y
> >> 
> >> where Z updates path F, R moves F to G and X changes G the same way
> >> as Z changes F, and we are trying to cherry-pick Z on top of Y.  The
> >> cherry-pick filter will see different patch-id for Z and X.
> >
> > True. That is a problem with the current patch-id system, no?
> 
> Correct.  That is why I suggested not to change the external
> interface (i.e. what is shown from patch-id command) as people may
> have kept them, but wondered if a possible improvement may be to
> exclude the name from ids when we internally generate two sets of
> Ids and intersect them, i.e. log --cherry.

Would that not rely on the files still being in the same order?  Since
we essentially hash the content of the patch, it will presumably be
quite different if the order of the files in the patch changes.

I expect it's possible to work around that by being a bit more clever
though.  Perhaps hash each file individually and then sort those and
hash the concatenated result?
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH] patch-ids: check modified paths before calculating diff

2013-05-29 Thread Junio C Hamano
Jeff King  writes:

> On Wed, May 29, 2013 at 11:08:46AM -0700, Junio C Hamano wrote:
>
>> This has rather interesting ramifications on cherry-pick and rebase,
>> though.  Both command can handle changes that come from an old tree
>> before some paths were renamed, but strict patch-id would not spot
>> equivalent changes we already have in our history if our change
>> happened after a rename, i.e.
>> 
>>Z
>>   /
>>  O---R---X---Y
>> 
>> where Z updates path F, R moves F to G and X changes G the same way
>> as Z changes F, and we are trying to cherry-pick Z on top of Y.  The
>> cherry-pick filter will see different patch-id for Z and X.
>
> True. That is a problem with the current patch-id system, no?

Correct.  That is why I suggested not to change the external
interface (i.e. what is shown from patch-id command) as people may
have kept them, but wondered if a possible improvement may be to
exclude the name from ids when we internally generate two sets of
Ids and intersect them, i.e. log --cherry.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH] patch-ids: check modified paths before calculating diff

2013-05-29 Thread Jeff King
On Wed, May 29, 2013 at 11:08:46AM -0700, Junio C Hamano wrote:

> This has rather interesting ramifications on cherry-pick and rebase,
> though.  Both command can handle changes that come from an old tree
> before some paths were renamed, but strict patch-id would not spot
> equivalent changes we already have in our history if our change
> happened after a rename, i.e.
> 
>Z
>   /
>  O---R---X---Y
> 
> where Z updates path F, R moves F to G and X changes G the same way
> as Z changes F, and we are trying to cherry-pick Z on top of Y.  The
> cherry-pick filter will see different patch-id for Z and X.

True. That is a problem with the current patch-id system, no? Though my
proposal would make it harder to change it in the future (as does
John's).

With mine, I wonder if you could use a different "loose" definition that
does not take the filenames into account.  Using something like the
changes in file sizes would be unique, but would not properly map to
strict cases (a patch-id that adds 5 lines to the end of a 100-byte file
may match one that adds the same five lines to the end of a 200-byte
file).

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH] patch-ids: check modified paths before calculating diff

2013-05-29 Thread Junio C Hamano
Jeff King  writes:

> I think such a loose patch-id could just be a hash of the filenames that
> were changed by the patch (e.g., the first 32-bits of the sha1 of the
> concatenated filenames). Computing that should be about as expensive as
> a tree-diff. Per observation 2 above, if two commits do not have the
> same loose id, we know that they cannot possibly have the same strict
> id.

Because the "strict" one already hashes the filenames, if files that
are touched by a patch is different from that of another patch, we
judge them being different.

> Then we can forget about the smaller-side and bigger-side entirely, and
> just do something like:
>
>   1. Make a sorted list (or hash table) of loose ids for one side.
>
>   2. For each commit on the other side, calculate its loose id and look
>  that up in the sorted list. If no hits, we know that there is no
>  match. For any hits, lazily calculate (and cache) the strict patch
>  id for both sides and compare as usual.
>
> In the best case, we compute no patch-ids at all. And even for the
> average case, I'd expect our lazy calculation to only have to compute a
> handful of ids.

Correct.

This has rather interesting ramifications on cherry-pick and rebase,
though.  Both command can handle changes that come from an old tree
before some paths were renamed, but strict patch-id would not spot
equivalent changes we already have in our history if our change
happened after a rename, i.e.

   Z
  /
 O---R---X---Y

where Z updates path F, R moves F to G and X changes G the same way
as Z changes F, and we are trying to cherry-pick Z on top of Y.  The
cherry-pick filter will see different patch-id for Z and X.

We will likely to notice that "patch already applied" (if using am-3
machinery) or "already up-to-date" (if using merge machinery) even
when we missed this equivalency and drop the duplicate from the
result, so it is not a big loss, but we might want to consider
removing the filename from patch-id computation, at least for the
ones we internally use and discard for revs->cherry_pick filtering.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH] patch-ids: check modified paths before calculating diff

2013-05-29 Thread Jeff King
On Wed, May 29, 2013 at 03:22:25AM -0400, Jeff King wrote:

>   revs=origin/master...origin/jk/submodule-subdirectory-ok
> stock|you   |me
> ---
>   real  0m0.501s | 0m0.078s | 0m0.098s
>   user  0m0.480s | 0m0.056s | 0m0.084s
>   sys   0m0.016s | 0m0.020s | 0m0.012s
> 
>   revs=origin/next...origin/pu
> stock|you   |me
> ---
>   real  0m0.857s | 0m0.847s | 0m0.519s
>   user  0m0.828s | 0m0.812s | 0m0.488s
>   sys   0m0.024s | 0m0.028s | 0m0.024s
> 
> So it performs slightly less well on the small case, and a bit better on
> the large case. I think part of the problem is that when we do have a
> "loose" hit, we end up re-doing the tree diff to find the strict, which
> involves re-inflating the trees. It's unavoidable for the lazy entries
> unless we want to cache the whole diff_queued_diff, but for the non-lazy
> entries we should be able to do the strict patch-id incrementally. We
> just need to improve the function interfaces.

The (somewhat hacky) patch below, on top of my previous one, does that
optimization.  But the timings aren't improved by much. My best-of-five
for the second case went down to:

  real0m0.495s
  user0m0.488s
  sys 0m0.004s

However, the actual time to just do "git log --raw $revs" is:

  real0m0.333s
  user0m0.292s
  sys 0m0.032s

which provides a lower-bound for how well we can do, as it is just doing
a single tree diff for each commit. So I think we have reaped most of
the benefits of this approach already (and we will generally have to do
_some_ true patch-id calculations, so we can never meet that lower
bound).

---
 diff.c  | 11 +++---
 diff.h  |  3 ++-
 patch-ids.c | 39 ++-
 3 files changed, 34 insertions(+), 19 deletions(-)

diff --git a/diff.c b/diff.c
index 3b55788..161c5bf 100644
--- a/diff.c
+++ b/diff.c
@@ -4233,8 +4233,8 @@ static void patch_id_consume(void *priv, char *line, 
unsigned long len)
 }
 
 /* returns 0 upon success, and writes result into sha1 */
-static int diff_get_patch_id(struct diff_options *options, unsigned char *sha1,
-int loose)
+int diff_get_patch_id(struct diff_options *options, unsigned char *sha1,
+ int loose)
 {
struct diff_queue_struct *q = &diff_queued_diff;
int i;
@@ -4330,21 +4330,16 @@ int diff_flush_patch_id(struct diff_options *options,
return 0;
 }
 
-int diff_flush_patch_id(struct diff_options *options,
-   unsigned char *sha1,
-   int loose)
+void diff_queue_clear(void)
 {
struct diff_queue_struct *q = &diff_queued_diff;
int i;
-   int result = diff_get_patch_id(options, sha1, loose);
 
for (i = 0; i < q->nr; i++)
diff_free_filepair(q->queue[i]);
 
free(q->queue);
DIFF_QUEUE_CLEAR(q);
-
-   return result;
 }
 
 static int is_summary_empty(const struct diff_queue_struct *q)
diff --git a/diff.h b/diff.h
index b018aaf..7207d4b 100644
--- a/diff.h
+++ b/diff.h
@@ -320,7 +320,8 @@ extern int do_diff_cache(const unsigned char *, struct 
diff_options *);
 extern int run_diff_index(struct rev_info *revs, int cached);
 
 extern int do_diff_cache(const unsigned char *, struct diff_options *);
-extern int diff_flush_patch_id(struct diff_options *, unsigned char *, int 
loose);
+extern int diff_get_patch_id(struct diff_options *, unsigned char *, int 
loose);
+extern void diff_queue_clear(void);
 
 extern int diff_result_code(struct diff_options *, int);
 
diff --git a/patch-ids.c b/patch-ids.c
index 3a83ee6..83cda92 100644
--- a/patch-ids.c
+++ b/patch-ids.c
@@ -4,8 +4,7 @@
 #include "sha1-lookup.h"
 #include "patch-ids.h"
 
-static int commit_patch_id(struct commit *commit, struct diff_options *options,
-  unsigned char *sha1, int loose)
+static void start_patch_id(struct commit *commit, struct diff_options *options)
 {
if (commit->parents)
diff_tree_sha1(commit->parents->item->object.sha1,
@@ -13,7 +12,6 @@ static int commit_patch_id(struct commit *commit, struct 
diff_options *options,
else
diff_root_tree_sha1(commit->object.sha1, "", options);
diffcore_std(options);
-   return diff_flush_patch_id(options, sha1, loose);
 }
 
 int init_patch_ids(struct patch_ids *ids)
@@ -50,12 +48,16 @@ static struct patch_id *add_commit(struct commit *commit,
   int no_add)
 {
struct patch_id *p;
-   unsigned char loose[20], strict[20];
+   unsigned char loose[20], strict[20] = {0};
unsigned long hash;
void **pos;
 
-   if (commit_patch_id(commit, &ids->diffopts, loose, 1))
+   start_patch_id(commit, &ids->diffopts);
+   if (diff_get_patch_id(&ids->diffopts, loose, 1)) {
+   diff_queue_clear();
return NULL;
+   }
+
   

Re: [RFC/PATCH] patch-ids: check modified paths before calculating diff

2013-05-29 Thread Jeff King
On Wed, May 29, 2013 at 02:20:07AM -0400, Jeff King wrote:

> In the best case, we compute no patch-ids at all. And even for the
> average case, I'd expect our lazy calculation to only have to compute a
> handful of ids.

Here is a not-well-tested version of the idea. I tried to contain the
changes to patch-ids.c, though we may also be able to simplify the
--cherry code, too.

Here are my timings compared to stock git and to yours
(all are best-of-five, "git log --cherry $revs"):

  revs=origin/master...origin/jk/submodule-subdirectory-ok
stock|you   |me
---
  real  0m0.501s | 0m0.078s | 0m0.098s
  user  0m0.480s | 0m0.056s | 0m0.084s
  sys   0m0.016s | 0m0.020s | 0m0.012s

  revs=origin/next...origin/pu
stock|you   |me
---
  real  0m0.857s | 0m0.847s | 0m0.519s
  user  0m0.828s | 0m0.812s | 0m0.488s
  sys   0m0.024s | 0m0.028s | 0m0.024s

So it performs slightly less well on the small case, and a bit better on
the large case. I think part of the problem is that when we do have a
"loose" hit, we end up re-doing the tree diff to find the strict, which
involves re-inflating the trees. It's unavoidable for the lazy entries
unless we want to cache the whole diff_queued_diff, but for the non-lazy
entries we should be able to do the strict patch-id incrementally. We
just need to improve the function interfaces.

---
 diff.c  |  15 -
 diff.h  |   2 +-
 patch-ids.c | 117 +++
 patch-ids.h |  11 ++--
 4 files changed, 82 insertions(+), 63 deletions(-)

diff --git a/diff.c b/diff.c
index f0b3e7c..3b55788 100644
--- a/diff.c
+++ b/diff.c
@@ -4233,7 +4233,8 @@ static void patch_id_consume(void *priv, char *line, 
unsigned long len)
 }
 
 /* returns 0 upon success, and writes result into sha1 */
-static int diff_get_patch_id(struct diff_options *options, unsigned char *sha1)
+static int diff_get_patch_id(struct diff_options *options, unsigned char *sha1,
+int loose)
 {
struct diff_queue_struct *q = &diff_queued_diff;
int i;
@@ -4266,6 +4267,12 @@ static int diff_get_patch_id(struct diff_options 
*options, unsigned char *sha1)
if (DIFF_PAIR_UNMERGED(p))
continue;
 
+   if (loose) {
+   git_SHA1_Update(&ctx, p->one->path, 
strlen(p->one->path));
+   git_SHA1_Update(&ctx, p->two->path, 
strlen(p->two->path));
+   continue;
+   }
+
diff_fill_sha1_info(p->one);
diff_fill_sha1_info(p->two);
if (fill_mmfile(&mf1, p->one) < 0 ||
@@ -4323,11 +4330,13 @@ int diff_flush_patch_id(struct diff_options *options, 
unsigned char *sha1)
return 0;
 }
 
-int diff_flush_patch_id(struct diff_options *options, unsigned char *sha1)
+int diff_flush_patch_id(struct diff_options *options,
+   unsigned char *sha1,
+   int loose)
 {
struct diff_queue_struct *q = &diff_queued_diff;
int i;
-   int result = diff_get_patch_id(options, sha1);
+   int result = diff_get_patch_id(options, sha1, loose);
 
for (i = 0; i < q->nr; i++)
diff_free_filepair(q->queue[i]);
diff --git a/diff.h b/diff.h
index 78b4091..b018aaf 100644
--- a/diff.h
+++ b/diff.h
@@ -320,7 +320,7 @@ extern int do_diff_cache(const unsigned char *, struct 
diff_options *);
 extern int run_diff_index(struct rev_info *revs, int cached);
 
 extern int do_diff_cache(const unsigned char *, struct diff_options *);
-extern int diff_flush_patch_id(struct diff_options *, unsigned char *);
+extern int diff_flush_patch_id(struct diff_options *, unsigned char *, int 
loose);
 
 extern int diff_result_code(struct diff_options *, int);
 
diff --git a/patch-ids.c b/patch-ids.c
index bc8a28f..3a83ee6 100644
--- a/patch-ids.c
+++ b/patch-ids.c
@@ -5,7 +5,7 @@ static int commit_patch_id(struct commit *commit, struct 
diff_options *options,
 #include "patch-ids.h"
 
 static int commit_patch_id(struct commit *commit, struct diff_options *options,
-   unsigned char *sha1)
+  unsigned char *sha1, int loose)
 {
if (commit->parents)
diff_tree_sha1(commit->parents->item->object.sha1,
@@ -13,27 +13,9 @@ static int commit_patch_id(struct commit *commit, struct 
diff_options *options,
else
diff_root_tree_sha1(commit->object.sha1, "", options);
diffcore_std(options);
-   return diff_flush_patch_id(options, sha1);
+   return diff_flush_patch_id(options, sha1, loose);
 }
 
-static const unsigned char *patch_id_access(size_t index, void *table)
-{
-   struct patch_id **id_table = table;
-   return id_table[index]->patch_id;
-}
-
-static int patch_pos(struct patch_id **table, int nr, const unsigned char *id)
-{
-   return sha1_pos(id, table, nr, patc

Re: [RFC/PATCH] patch-ids: check modified paths before calculating diff

2013-05-28 Thread Jeff King
On Sun, May 19, 2013 at 02:17:35PM +0100, John Keeping wrote:

> When using "git cherry" or "git log --cherry-pick" we often have a small
> number of commits on one side and a large number on the other.  In
> revision.c::cherry_pick_list we store the patch IDs for the small side
> before comparing the large side to this.
> 
> In this case, it is likely that only a small number of paths are touched
> by the commits on the smaller side of the range and by storing these we
> can ignore many commits on the other side before unpacking blobs and
> diffing them.

There are two observations that make your scheme work:

  1. For something like --cherry-pick, we do not even care about the
 actual patch-ids, only whether there are matches from the left and
 right sides.

  2. Comparing the sets of paths touched by two commits is often
 sufficient to realize they do not have the same patch-ids.

But I think those observations mean we can do even better than
calculating the real patch-id for the small side at all.

Imagine we have both "loose" and "strict" patch-ids, where the loose one
is much faster to compute, and has that property that a match _might_
mean we have the same strict patch-id, but a non-match means we
definitely do not have the same strict patch-id.

I think such a loose patch-id could just be a hash of the filenames that
were changed by the patch (e.g., the first 32-bits of the sha1 of the
concatenated filenames). Computing that should be about as expensive as
a tree-diff. Per observation 2 above, if two commits do not have the
same loose id, we know that they cannot possibly have the same strict
id.

Then we can forget about the smaller-side and bigger-side entirely, and
just do something like:

  1. Make a sorted list (or hash table) of loose ids for one side.

  2. For each commit on the other side, calculate its loose id and look
 that up in the sorted list. If no hits, we know that there is no
 match. For any hits, lazily calculate (and cache) the strict patch
 id for both sides and compare as usual.

In the best case, we compute no patch-ids at all. And even for the
average case, I'd expect our lazy calculation to only have to compute a
handful of ids.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH] patch-ids: check modified paths before calculating diff

2013-05-20 Thread John Keeping
On Sun, May 19, 2013 at 11:36:23PM -0700, Jonathan Nieder wrote:
> I don't know what it should mean to try to use --cherry without
> --no-merges or --first-parent, so I guess this is harmless.

Currently --no-merges doesn't actually get passed down this far.  We do
the patch ID calculations without taking that into account.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH] patch-ids: check modified paths before calculating diff

2013-05-19 Thread Jonathan Nieder
John Keeping wrote:
> On Sun, May 19, 2013 at 12:36:12PM -0700, Jonathan Nieder wrote:

>>Who is responsible for freeing
>> "path"?  Would it make sense to use a strbuf here?
>>
>>  strbuf_setlen(&buf, traverse_path_len(info, n));
>>  make_traverse_path(&buf.buf, info, n);
>
> Perhaps alloc_traverse_path?  I'm not sure the strbuf buys us much for
> this case, since we only use the result for a few lines before freeing
> it.

Good idea.  Generally strbufs are nice for this kind of thing because
they make it easy to reuse buffers, but in this case there's no
opportunity for that.

[...]
> then this is "process_changed_path".

Sounds good.

[...]
>> What should happen for commits with more than 1 parent?  If they're
>> supposed to not enter this codepath because of a previous check, a
>> die("BUG: ...") could be a good way to make reading easier.
>
> Currently the patch ID code compares the commit with its first parent,
> so I think this should do the same.

Ok.  I guess a comment nearby would help future readers avoid the same
question.

I don't know what it should mean to try to use --cherry without
--no-merges or --first-parent, so I guess this is harmless.

[...]
> Thanks for the review.

No problem.  Thanks for a pleasant read.

Ciao,
Jonathan
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH] patch-ids: check modified paths before calculating diff

2013-05-19 Thread Junio C Hamano
Jonathan Nieder  writes:

>> @@ -64,6 +199,13 @@ static struct patch_id *add_commit(struct commit *commit,
>>  unsigned char sha1[20];
>>  int pos;
>>  
>> +if (no_add) {
>> +if (collect_touched_paths(commit, ids, 1) < 0)
>> +return NULL;
>
> Ah, so this is what the "searching" does.
>
> The existing "no_add" argument is very confusing (what does it mean to
> add a commit without adding?),

Such a mode of operation is usually called "dry-run", no?
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH] patch-ids: check modified paths before calculating diff

2013-05-19 Thread John Keeping
On Sun, May 19, 2013 at 12:36:12PM -0700, Jonathan Nieder wrote:
> John Keeping wrote:
> 
> > In this case, it is likely that only a small number of paths are touched
> > by the commits on the smaller side of the range and by storing these we
> > can ignore many commits on the other side before unpacking blobs and
> > diffing them.
> 
> I like this idea a lot.
> 
> > --- a/patch-ids.c
> > +++ b/patch-ids.c
> > @@ -1,5 +1,6 @@
> >  #include "cache.h"
> >  #include "diff.h"
> > +#include "diffcore.h"
> >  #include "commit.h"
> >  #include "sha1-lookup.h"
> >  #include "patch-ids.h"
> > @@ -16,6 +17,137 @@ static int commit_patch_id(struct commit *commit, 
> > struct diff_options *options,
> > return diff_flush_patch_id(options, sha1);
> >  }
> >  
> > +struct collect_paths_info {
> > +   struct string_list *paths;
> > +   int searching;
> > +};
> > +
> > +static int collect_paths_recursive(int n, struct tree_desc *t,
> > +   const char *base, struct pathspec *pathspec,
> > +   struct collect_paths_info *data);
> 
> Can you say a little about how this function works (e.g., in a
> comment)?  What are the preconditions and postconditions?  How does
> the state evolve (e.g, when is "searching" set)?

As you figure out below, the patch ID API has two modes, "compute and
store" and "have we seen before".  The "searching" flag passes this
information down as we recursively compare the before and after trees
for a commit.  So when we hit a file that is changed by the commit we
either add it to the list of touched paths or tests if it is in the list
and stops comparing the trees if it isn't.

> > +
> > +static int same_entry(struct name_entry *a, struct name_entry *b)
> > +{
> > +   if (!a->sha1 || !b->sha1)
> > +   return a->sha1 == b->sha1;
> > +   return  !hashcmp(a->sha1, b->sha1) &&
> > +   a->mode == b->mode;
> 
> Style: unusual whitespace (the tab after "return").  I'd just do
> 
>   if (!a->sha1 || ...)
>   return ...
>   return !hashcmp(a->sha1, b->sha1) && a->mode == b->mode;
> 
> since it is not too long.

Makes sense, I copied from the function of the same name in
builtin/merge-tree.c, although I then changed the semantics for
non-existent entries.

> [...]
> > +static char *traverse_path(const struct traverse_info *info,
> > +   const struct name_entry *n)
> > +{
> > +   char *path = xmalloc(traverse_path_len(info, n) + 1);
> > +   return make_traverse_path(path, info, n);
> > +}
> 
> This function seems to be the same as one of the same name in
> builtin/merge-tree.c.  Should it be put somewhere more public, for
> example as part of the tree-walk API?  Who is responsible for freeing
> "path"?  Would it make sense to use a strbuf here?
> 
>   strbuf_setlen(&buf, traverse_path_len(info, n));
>   make_traverse_path(&buf.buf, info, n);

Perhaps alloc_traverse_path?  I'm not sure the strbuf buys us much for
this case, since we only use the result for a few lines before freeing
it.

> > +
> > +static int add_touched_path(struct collect_paths_info *info, const char 
> > *path)
> > +{
> > +   if (info->searching) {
> > +   if (!string_list_has_string(info->paths, path))
> > +   return -1;
> > +   }
> > +   string_list_insert(info->paths, path);
> > +   return 0;
> > +}
> 
> Same question about internal API: when I see
> 
>   add_touched_path(info, path)
> 
> what should I expect it to do?

Yeah, this patch suffers a lot from writing separate "collect" and
"check" codepaths and refactoring out common functionality later.

Perhaps all of the collect_* functions should be changed_paths_* and
then this is "process_changed_path".

> In the info->searching case, "string_list_insert(info->paths, path)"
> will always be a no-op, right?  What does it mean that this is adding
> a touched path in that case?
> 
> [...]
> > +static int collect_touched_paths_cb(int n, unsigned long mask,
> > +   unsigned long dirmask, struct name_entry *entry,
> > +   struct traverse_info *info)
> > +{
> 
> Same question about contracts.  Ideally a reader in a rush should be
> able to read an opening comment about what the function does and
> then return to the caller without delving into the details of how
> it does its work.
> 
> > +   struct collect_paths_info *collect_info = info->data;
> > +   if (n == 1) {
> > +   /* We're handling a root commit - add all the paths. */
> [...]
> > +   if ((entry[0].sha1 && S_ISDIR(entry[0].mode)) ||
> > +   (entry[1].sha1 && S_ISDIR(entry[1].mode))) {
> 
> At this point I'm not sure what two trees are being traversed in
> parallel, so it's not obvious to me what the code's about.  Are
> they the "before" and "after" trees for commits being rebased?
> 
> [...]
> > +static int collect_touched_paths(struct commit *commit, struct patch_ids 
> > *ids,
> > +   int searching)
> > +{
> > +   struct tree_desc trees[2];
> > +   struct collect_paths_info info = { &ids->touched_paths, searching }

Re: [RFC/PATCH] patch-ids: check modified paths before calculating diff

2013-05-19 Thread Jonathan Nieder
John Keeping wrote:

> In this case, it is likely that only a small number of paths are touched
> by the commits on the smaller side of the range and by storing these we
> can ignore many commits on the other side before unpacking blobs and
> diffing them.

I like this idea a lot.

> --- a/patch-ids.c
> +++ b/patch-ids.c
> @@ -1,5 +1,6 @@
>  #include "cache.h"
>  #include "diff.h"
> +#include "diffcore.h"
>  #include "commit.h"
>  #include "sha1-lookup.h"
>  #include "patch-ids.h"
> @@ -16,6 +17,137 @@ static int commit_patch_id(struct commit *commit, struct 
> diff_options *options,
>   return diff_flush_patch_id(options, sha1);
>  }
>  
> +struct collect_paths_info {
> + struct string_list *paths;
> + int searching;
> +};
> +
> +static int collect_paths_recursive(int n, struct tree_desc *t,
> + const char *base, struct pathspec *pathspec,
> + struct collect_paths_info *data);

Can you say a little about how this function works (e.g., in a
comment)?  What are the preconditions and postconditions?  How does
the state evolve (e.g, when is "searching" set)?

> +
> +static int same_entry(struct name_entry *a, struct name_entry *b)
> +{
> + if (!a->sha1 || !b->sha1)
> + return a->sha1 == b->sha1;
> + return  !hashcmp(a->sha1, b->sha1) &&
> + a->mode == b->mode;

Style: unusual whitespace (the tab after "return").  I'd just do

if (!a->sha1 || ...)
return ...
return !hashcmp(a->sha1, b->sha1) && a->mode == b->mode;

since it is not too long.

[...]
> +static char *traverse_path(const struct traverse_info *info,
> + const struct name_entry *n)
> +{
> + char *path = xmalloc(traverse_path_len(info, n) + 1);
> + return make_traverse_path(path, info, n);
> +}

This function seems to be the same as one of the same name in
builtin/merge-tree.c.  Should it be put somewhere more public, for
example as part of the tree-walk API?  Who is responsible for freeing
"path"?  Would it make sense to use a strbuf here?

strbuf_setlen(&buf, traverse_path_len(info, n));
make_traverse_path(&buf.buf, info, n);

> +
> +static int add_touched_path(struct collect_paths_info *info, const char 
> *path)
> +{
> + if (info->searching) {
> + if (!string_list_has_string(info->paths, path))
> + return -1;
> + }
> + string_list_insert(info->paths, path);
> + return 0;
> +}

Same question about internal API: when I see

add_touched_path(info, path)

what should I expect it to do?

In the info->searching case, "string_list_insert(info->paths, path)"
will always be a no-op, right?  What does it mean that this is adding
a touched path in that case?

[...]
> +static int collect_touched_paths_cb(int n, unsigned long mask,
> + unsigned long dirmask, struct name_entry *entry,
> + struct traverse_info *info)
> +{

Same question about contracts.  Ideally a reader in a rush should be
able to read an opening comment about what the function does and
then return to the caller without delving into the details of how
it does its work.

> + struct collect_paths_info *collect_info = info->data;
> + if (n == 1) {
> + /* We're handling a root commit - add all the paths. */
[...]
> + if ((entry[0].sha1 && S_ISDIR(entry[0].mode)) ||
> + (entry[1].sha1 && S_ISDIR(entry[1].mode))) {

At this point I'm not sure what two trees are being traversed in
parallel, so it's not obvious to me what the code's about.  Are
they the "before" and "after" trees for commits being rebased?

[...]
> +static int collect_touched_paths(struct commit *commit, struct patch_ids 
> *ids,
> + int searching)
> +{
> + struct tree_desc trees[2];
> + struct collect_paths_info info = { &ids->touched_paths, searching };
> + void *commitbuf;
> + int result;
> +
> + commitbuf = fill_tree_descriptor(trees + 1, commit->object.sha1);
> + if (commit->parents) {
> + void *parentbuf = fill_tree_descriptor(trees + 0,
> + commit->parents->item->object.sha1);

Looks like yes.

What should happen for commits with more than 1 parent?  If they're
supposed to not enter this codepath because of a previous check, a
die("BUG: ...") could be a good way to make reading easier.

[...]
> @@ -40,6 +172,7 @@ int init_patch_ids(struct patch_ids *ids)
>   diff_setup(&ids->diffopts);
>   DIFF_OPT_SET(&ids->diffopts, RECURSIVE);
>   diff_setup_done(&ids->diffopts);
> + ids->touched_paths.strdup_strings = 1;

Good.

[...]
> @@ -64,6 +199,13 @@ static struct patch_id *add_commit(struct commit *commit,
>   unsigned char sha1[20];
>   int pos;
>  
> + if (no_add) {
> + if (collect_touched_paths(commit, ids, 1) < 0)
> + return NULL;

Ah, so this is what the "searching" does.

The existing "no_add" argument is very confusing (what does it mean to
add a commit witho