Re: [PATCH v4 04/10] commit: use generations in paint_down_to_common()

2018-04-29 Thread Jakub Narebski
Derrick Stolee  writes:

> Define compare_commits_by_gen_then_commit_date(), which uses generation
> numbers as a primary comparison and commit date to break ties (or as a
> comparison when both commits do not have computed generation numbers).

All right, this looks reasonable thing to do when we have access to
commit generation numbers..

> Since the commit-graph file is closed under reachability, we know that
> all commits in the file have generation at most GENERATION_NUMBER_MAX
> which is less than GENERATION_NUMBER_INFINITY.

Thus the condition that if B is reachable from A, then gen(A) >= gen(B),
even if they have generation numbers _INFINITY, _MAX or _ZERO.

We use generation numbers, if possible, to choose closest commit; if
not, we use dates.

>
> This change does not affect the number of commits that are walked during
> the execution of paint_down_to_common(), only the order that those
> commits are inspected. In the case that commit dates violate topological
> order (i.e. a parent is "newer" than a child), the previous code could
> walk a commit twice: if a commit is reached with the PARENT1 bit, but
> later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
> propagated to its parents. Using generation numbers avoids this extra
> effort, even if it is somewhat rare.

Actually the ordering of commits walked does not affect the correctness
of the result.  Better ordering means that commits do not need to be
walked twice; I think it would be possible to craft repository in which
unlucky clock skew would lead to depth-first walk of commits later part
of walk would mark STALE.

Pedantry aside, I think it is a good description of analysis of change
results.

>
> Signed-off-by: Derrick Stolee 
> ---
>  commit.c | 20 +++-
>  commit.h |  1 +
>  2 files changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/commit.c b/commit.c
> index 711f674c18..4d00b0a1d6 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void 
> *a_, const void *b_,
>   return 0;
>  }
>  
> +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
> void *unused)
> +{
> + const struct commit *a = a_, *b = b_;
> +
> + /* newer commits first */

To be pedantic, larger generation number does not necessary mean that
commit was created later (is newer), only that it is on longer chain
since common ancestor or root commit.

> + if (a->generation < b->generation)
> + return 1;
> + else if (a->generation > b->generation)
> + return -1;

If the commit-graph feature is not available, or is disabled, all
commits would have the same generation number (_INFINITY), then this
block would be always practically no-op.

This is not very costly: 2 access to data which should be in cache, and
1 to 2 comparison operations.  But I wonder if we wouldn't want to avoid
this nano-cost if possible...

> +
> + /* use date as a heuristic when generations are equal */
> + if (a->date < b->date)
> + return 1;
> + else if (a->date > b->date)
> + return -1;
> + return 0;

The above is the same code as in compare_commits_by_commit_date(), but
there it is with "newer commits with larger date first" as comment
instead.

All right: we need inlining for speed.

> +}
> +
>  int compare_commits_by_commit_date(const void *a_, const void *b_, void 
> *unused)
>  {
>   const struct commit *a = a_, *b = b_;
> @@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
>  /* all input commits in one and twos[] must have been parsed! */
>  static struct commit_list *paint_down_to_common(struct commit *one, int n, 
> struct commit **twos)
>  {
> - struct prio_queue queue = { compare_commits_by_commit_date };
> + struct prio_queue queue = { compare_commits_by_gen_then_commit_date };

I wonder if it would be worth it to avoid comparing by generation
numbers without commit graph data:

  + struct prio_queue queue;
  [...]
  + if (commit_graph)
  + queue.compare = compare_commits_by_gen_then_commit_date;
  + else
  + queue.compare = compare_commits_by_commit_date;

Or something like that.  But perhaps this nano-optimization is not worth
it (it is not that complicated, though).


Sidenote: when I searched for compare_commits_by_commit_date use, I have
noticed that it is used, I think as heuristics, for packfile creation in
upload-pack.c and fetch-pack.c.  Would they similarly improve with
compare_commits_by_gen_then_commit_date?

This is of course not something that this commit, or this patch series,
needs to address now.

>   struct commit_list *result = NULL;
>   int i;
>  
> diff --git a/commit.h b/commit.h
> index aac3b8c56f..64436ff44e 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf);
>  extern int 

Re: [PATCH v4 04/10] commit: use generations in paint_down_to_common()

2018-04-28 Thread Jakub Narebski
Jakub Narebski  writes:
> Junio C Hamano  writes:
>> Derrick Stolee  writes:
[...]
>>> +int compare_commits_by_gen_then_commit_date(const void *a_, const void 
>>> *b_, void *unused)
>>> +{
>>> +   const struct commit *a = a_, *b = b_;
>>> +
>>> +   /* newer commits first */
>>> +   if (a->generation < b->generation)
>>> +   return 1;
>>> +   else if (a->generation > b->generation)
>>> +   return -1;
>>
>> ... this does not check if a->generation is _ZERO or _INF.  
>>
>> Both being _MAX is OK (the control will fall through and use the
>> dates below).  One being _MAX and the other being a normal value is
>> also OK (the above comparisons will declare the commit with _MAX is
>> farther than less-than-max one from a root).
>>
>> Or is the assumption that if one has _ZERO, that must have come from
>> an ancient commit-graph file and none of the commits have anything
>> but _ZERO?
>
> There is stronger and weaker version of the negative-cut criteria based
> on generation numbers.
>
> The strong criteria:
>
>   if A != B and gen(A) <= gen(B), then A cannot reach B
>
> The weaker criteria:
>
>   if gen(A) < gen(B), then A cannot reach B
>
>
> Because commit-graph is closed under reachability, this means that
>
>   if A is in commit graph, and B is outside of it, then A cannot reach B
>
> If A is in commit graph, then either _MAX >= gen(A) >= 1,
> or gen(A) == _ZERO.  Because _INFINITY > _MAX > _ZERO, then we have
>
>   if _MAX >= gen(A) >= 1 || gen(A) == 0, and gen(B) == _INFINITY
>   then A cannot reach B
>
> which also fullfils the weaker criteria
>
>   if gen(A) < gen(B), then A cannot reach B
>
>
> If both A and B are outside commit-graph, i.e. gen(A) = gen(B) = _INFINITY,
> or if both A and B have gen(A) = gen(B) = _MAX,
> or if both A and B come from old commit graph with gen(A) = gen(B) =_ZERO,
> then we cannot say anything about reachability... and weak criteria
> also does not say anything about reachability.
>
>
> Maybe the following ASCII table would make it clear.
>
>  |  gen(B)
>  | :::
> gen(A)   | _INFINITY | _MAX | larger   | smaller  | _ZERO
> -+---+--+--+--+
> _INFINITY| = | >| >| >| >
> _MAX | < Nn  | =| >| >| >
> larger   | < Nn  | < Nn | = n  | >| >
> smaller  | < Nn  | < Nn | < Nn | = n  | >
> _ZERO| < Nn  | < Nn | < Nn | < Nn | =
>
> Here "n" denotes stronger condition, and "N" denotes weaker condition.
> We have _INFINITY > _MAX > larger > smaller > _ZERO.
>
>
> NOTE however that it is a *tradeoff*.  Using weaker criteria, with
> strict inequality, means that we don't need to handle _INFINITY, _MAX
> and _ZERO corner-cases in a special way; but it also means that we would
> walk slightly more commits than if we used stronger criteria, with less
> or equals.

Actually, if we look at the table above, it turns out that we can use
the stronger version of negative-cut criteria without special-casing all
the possible combinations.  Just use stronger criteria on normal range,
weaker criteria if any of generation numbers is special generation
number.

  if _MAX > gen(A) > _ZERO and
 _MAX > gen(B) > _ZERO then

if A != B and gen(A) <= gen(B) then
  A cannot reach B
else
  A can reach B

  else /* at least one special case */

if gen(A) < gen(B) then
  A cannot reach B
else
  A can reach B


NOTE that it specifically does not matter for created here
compare_commits_by_gen_then_commit_date(), as it requires strict
inequality for sorting - which using weak criteria explains why we don't
need any special cases in the code here.

Best,
-- 
Jakub Narębski


Re: [PATCH v4 04/10] commit: use generations in paint_down_to_common()

2018-04-26 Thread Jakub Narebski
Junio C Hamano  writes:
> Derrick Stolee  writes:
>
>> Define compare_commits_by_gen_then_commit_date(), which uses generation
>> numbers as a primary comparison and commit date to break ties (or as a
>> comparison when both commits do not have computed generation numbers).
>>
>> Since the commit-graph file is closed under reachability, we know that
>> all commits in the file have generation at most GENERATION_NUMBER_MAX
>> which is less than GENERATION_NUMBER_INFINITY.
>
> I suspect that my puzzlement may be coming from my not "getting"
> what you meant by "closed under reachability",

It means that if commit A is in the commit graph, then all of its
ancestors (all commits reachable from A) are also in the commit graph.

>but could you also
> explain how _INF and _ZERO interact with commits with normal
> generation numbers?  I've always assumed that genno will be used
> only when comparing two commits with valid genno and otherwise we'd
> fall back to the traditional date based one, but...
>
>> +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
>> void *unused)
>> +{
>> +const struct commit *a = a_, *b = b_;
>> +
>> +/* newer commits first */
>> +if (a->generation < b->generation)
>> +return 1;
>> +else if (a->generation > b->generation)
>> +return -1;
>
> ... this does not check if a->generation is _ZERO or _INF.  
>
> Both being _MAX is OK (the control will fall through and use the
> dates below).  One being _MAX and the other being a normal value is
> also OK (the above comparisons will declare the commit with _MAX is
> farther than less-than-max one from a root).
>
> Or is the assumption that if one has _ZERO, that must have come from
> an ancient commit-graph file and none of the commits have anything
> but _ZERO?

There is stronger and weaker version of the negative-cut criteria based
on generation numbers.

The strong criteria:

  if A != B and gen(A) <= gen(B), then A cannot reach B

The weaker criteria:

  if gen(A) < gen(B), then A cannot reach B


Because commit-graph is closed under reachability, this means that

  if A is in commit graph, and B is outside of it, then A cannot reach B

If A is in commit graph, then either _MAX >= gen(A) >= 1,
or gen(A) == _ZERO.  Because _INFINITY > _MAX > _ZERO, then we have

  if _MAX >= gen(A) >= 1 || gen(A) == 0, and gen(B) == _INFINITY
  then A cannot reach B

which also fullfils the weaker criteria

  if gen(A) < gen(B), then A cannot reach B


If both A and B are outside commit-graph, i.e. gen(A) = gen(B) = _INFINITY,
or if both A and B have gen(A) = gen(B) = _MAX,
or if both A and B come from old commit graph with gen(A) = gen(B) =_ZERO,
then we cannot say anything about reachability... and weak criteria
also does not say anything about reachability.


Maybe the following ASCII table would make it clear.

 |  gen(B)
 | :::
gen(A)   | _INFINITY | _MAX | larger   | smaller  | _ZERO
-+---+--+--+--+
_INFINITY| = | >| >| >| >
_MAX | < Nn  | =| >| >| >
larger   | < Nn  | < Nn | = n  | >| >
smaller  | < Nn  | < Nn | < Nn | = n  | >
_ZERO| < Nn  | < Nn | < Nn | < Nn | =

Here "n" denotes stronger condition, and "N" denotes weaker condition.
We have _INFINITY > _MAX > larger > smaller > _ZERO.


NOTE however that it is a *tradeoff*.  Using weaker criteria, with
strict inequality, means that we don't need to handle _INFINITY, _MAX
and _ZERO corner-cases in a special way; but it also means that we would
walk slightly more commits than if we used stronger criteria, with less
or equals.

For Linux kernel public repository commit graph[1] we have maximum of 512
commits sharing the same level, 5.43 sharing the same commit on average,
and 50% of time only 2 commits sharing the same level (median, or 2nd
quartile, or 50% percentile).  This is roughly the amount of commits we
walk more with weaker cut-off condition.

[1]: with 750k commits, but which is not largest commit graph any more :-0

>> +/* use date as a heuristic when generations are equal */
>> +if (a->date < b->date)
>> +return 1;
>> +else if (a->date > b->date)
>> +return -1;
>> +return 0;
>> +}

HTH
-- 
Jakub Narębski


Re: [PATCH v4 04/10] commit: use generations in paint_down_to_common()

2018-04-25 Thread Junio C Hamano
Derrick Stolee  writes:

> Define compare_commits_by_gen_then_commit_date(), which uses generation
> numbers as a primary comparison and commit date to break ties (or as a
> comparison when both commits do not have computed generation numbers).
>
> Since the commit-graph file is closed under reachability, we know that
> all commits in the file have generation at most GENERATION_NUMBER_MAX
> which is less than GENERATION_NUMBER_INFINITY.

I suspect that my puzzlement may be coming from my not "getting"
what you meant by "closed under reachability", but could you also
explain how _INF and _ZERO interact with commits with normal
generation numbers?  I've always assumed that genno will be used
only when comparing two commits with valid genno and otherwise we'd
fall back to the traditional date based one, but...

> +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
> void *unused)
> +{
> + const struct commit *a = a_, *b = b_;
> +
> + /* newer commits first */
> + if (a->generation < b->generation)
> + return 1;
> + else if (a->generation > b->generation)
> + return -1;

... this does not check if a->generation is _ZERO or _INF.  

Both being _MAX is OK (the control will fall through and use the
dates below).  One being _MAX and the other being a normal value is
also OK (the above comparisons will declare the commit with _MAX is
farther than less-than-max one from a root).

Or is the assumption that if one has _ZERO, that must have come from
an ancient commit-graph file and none of the commits have anything
but _ZERO?

> + /* use date as a heuristic when generations are equal */
> + if (a->date < b->date)
> + return 1;
> + else if (a->date > b->date)
> + return -1;
> + return 0;
> +}
> +
>  int compare_commits_by_commit_date(const void *a_, const void *b_, void 
> *unused)
>  {
>   const struct commit *a = a_, *b = b_;
> @@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
>  /* all input commits in one and twos[] must have been parsed! */
>  static struct commit_list *paint_down_to_common(struct commit *one, int n, 
> struct commit **twos)
>  {
> - struct prio_queue queue = { compare_commits_by_commit_date };
> + struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
>   struct commit_list *result = NULL;
>   int i;
>  
> diff --git a/commit.h b/commit.h
> index aac3b8c56f..64436ff44e 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf);
>  extern int check_commit_signature(const struct commit *commit, struct 
> signature_check *sigc);
>  
>  int compare_commits_by_commit_date(const void *a_, const void *b_, void 
> *unused);
> +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
> void *unused);
>  
>  LAST_ARG_MUST_BE_NULL
>  extern int run_commit_hook(int editor_is_used, const char *index_file, const 
> char *name, ...);


[PATCH v4 04/10] commit: use generations in paint_down_to_common()

2018-04-25 Thread Derrick Stolee
Define compare_commits_by_gen_then_commit_date(), which uses generation
numbers as a primary comparison and commit date to break ties (or as a
comparison when both commits do not have computed generation numbers).

Since the commit-graph file is closed under reachability, we know that
all commits in the file have generation at most GENERATION_NUMBER_MAX
which is less than GENERATION_NUMBER_INFINITY.

This change does not affect the number of commits that are walked during
the execution of paint_down_to_common(), only the order that those
commits are inspected. In the case that commit dates violate topological
order (i.e. a parent is "newer" than a child), the previous code could
walk a commit twice: if a commit is reached with the PARENT1 bit, but
later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
propagated to its parents. Using generation numbers avoids this extra
effort, even if it is somewhat rare.

Signed-off-by: Derrick Stolee 
---
 commit.c | 20 +++-
 commit.h |  1 +
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 711f674c18..4d00b0a1d6 100644
--- a/commit.c
+++ b/commit.c
@@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, 
const void *b_,
return 0;
 }
 
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
void *unused)
+{
+   const struct commit *a = a_, *b = b_;
+
+   /* newer commits first */
+   if (a->generation < b->generation)
+   return 1;
+   else if (a->generation > b->generation)
+   return -1;
+
+   /* use date as a heuristic when generations are equal */
+   if (a->date < b->date)
+   return 1;
+   else if (a->date > b->date)
+   return -1;
+   return 0;
+}
+
 int compare_commits_by_commit_date(const void *a_, const void *b_, void 
*unused)
 {
const struct commit *a = a_, *b = b_;
@@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
 /* all input commits in one and twos[] must have been parsed! */
 static struct commit_list *paint_down_to_common(struct commit *one, int n, 
struct commit **twos)
 {
-   struct prio_queue queue = { compare_commits_by_commit_date };
+   struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct commit_list *result = NULL;
int i;
 
diff --git a/commit.h b/commit.h
index aac3b8c56f..64436ff44e 100644
--- a/commit.h
+++ b/commit.h
@@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf);
 extern int check_commit_signature(const struct commit *commit, struct 
signature_check *sigc);
 
 int compare_commits_by_commit_date(const void *a_, const void *b_, void 
*unused);
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
void *unused);
 
 LAST_ARG_MUST_BE_NULL
 extern int run_commit_hook(int editor_is_used, const char *index_file, const 
char *name, ...);
-- 
2.17.0.39.g685157f7fb