Re: [PATCH 1/2] commit-graph write: add progress output

2018-09-07 Thread Derrick Stolee

On 9/7/2018 8:40 AM, Ævar Arnfjörð Bjarmason wrote:

On Tue, Sep 04 2018, Ævar Arnfjörð Bjarmason wrote:


Before this change the "commit-graph write" command didn't report any
progress. On my machine this command takes more than 10 seconds to
write the graph for linux.git, and around 1m30s on the
2015-04-03-1M-git.git[1] test repository, which is a test case for
larger monorepos.

There's a fun issue with this code that I'll fix, but thought was
informative to send a mail about.

Because the graph verification happens in the main "git gc" process, as
opposed to everything else via external commands, so all this progress
output gets written to .git/gc.log.

Then next time we do a "gc --auto" we vomit out a couple of KB of
progress bar output at the user, since spot that the gc.log isn't empty.
Good catch! (I do want to clarify that the graph _writing_ happens 
during 'git gc' since 'git commit-graph verify' is a different thing.)


Re: [PATCH 09/11] multi-pack-index: verify object offsets

2018-09-07 Thread Derrick Stolee

On 9/6/2018 8:34 PM, Eric Sunshine wrote:

On Wed, Sep 5, 2018 at 10:46 AM Derrick Stolee via GitGitGadget
 wrote:

Replace the BUG() statement with a die() statement, now that we
may hit a bad pack-int-id during a 'verify' command on a corrupt
multi-pack-index, and it is covered by a test.

Signed-off-by: Derrick Stolee 
---
diff --git a/midx.c b/midx.c
@@ -197,7 +197,7 @@ int prepare_midx_pack(struct multi_pack_index *m, uint32_t 
pack_int_id)
 if (pack_int_id >= m->num_packs)
-   BUG("bad pack-int-id");
+   die(_("bad pack-int-id"));

For someone trying to diagnose this issue, would it be helpful to know
(that is, print out) the values of pack_int_id and num_packs?

Can do! Thanks.


[RFC PATCH 6/6] commit-reach: fix first-parent heuristic

2018-09-06 Thread Derrick Stolee
The algorithm in can_all_from_reach_with_flags() performs a depth-
first-search, terminated by generation number, intending to use
a hueristic that "important" commits are found in the first-parent
history. This heuristic is valuable in scenarios like fetch
negotiation.

However, there is a problem! After the search finds a target commit,
it should pop all commits off the stack and mark them as "can reach".
This logic is incorrect, so the algorithm instead walks all reachable
commits above the generation-number cutoff.

The existing algorithm is still an improvement over the previous
algorithm, as the worst-case complexity went from quadratic to linear.
The performance measurement at the time was good, but not dramatic.

By fixing this heuristic, we can see in t6600-test-reach.sh that we
reduce the number of walked commits. This test will prevent a future
performance regression.

We can also re-run the performance tests from commit 4fbcca4e
"commit-reach: make can_all_from_reach... linear".

Performance was measured on the Linux repository using
'test-tool reach can_all_from_reach'. The input included rows seeded by
tag values. The "small" case included X-rows as v4.[0-9]* and Y-rows as
v3.[0-9]*. This mimics a (very large) fetch that says "I have all major
v3 releases and want all major v4 releases." The "large" case included
X-rows as "v4.*" and Y-rows as "v3.*". This adds all release-candidate
tags to the set, which does not greatly increase the number of objects
that are considered, but does increase the number of 'from' commits,
demonstrating the quadratic nature of the previous code.

Small Case:

4fbcca4e~1: 0.85 s
  4fbcca4e: 0.26 s (num_walked: 1,011,035)
  HEAD: 0.14 s (num_walked: 8,601)

Large Case:

4fbcca4e~1: 24.0 s
  4fbcca4e:  0.12 s (num_walked:  503,925)
  HEAD:  0.06 s (num_walked:  217,243)

Signed-off-by: Derrick Stolee 
---
 commit-reach.c| 4 +++-
 t/t6600-test-reach.sh | 2 +-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/commit-reach.c b/commit-reach.c
index 0a75644653..bd009260b0 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -588,8 +588,10 @@ int can_all_from_reach_with_flag(struct object_array *from,
while (stack) {
struct commit_list *parent;
 
-   if (stack->item->object.flags & with_flag) {
+   if (stack->item->object.flags & (with_flag | RESULT)) {
pop_commit();
+   if (stack)
+   stack->item->object.flags |= RESULT;
continue;
}
 
diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index 98ad25bb45..5e231a5955 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -185,7 +185,7 @@ test_expect_success 'can_all_from_reach:hit' '
 
 test_expect_success 'can_all_from_reach:perf' '
cp commit-graph-full .git/objects/info/commit-graph &&
-   run_and_check_trace2 can_all_from_reach_with_flag num_walked 40 input \
+   run_and_check_trace2 can_all_from_reach_with_flag num_walked 19 input \
"test-tool reach can_all_from_reach"
 '
 
-- 
2.19.0.rc2



[RFC PATCH 3/6] commit-reach: use trace2 in can_all_from_reach

2018-09-06 Thread Derrick Stolee
Signed-off-by: Derrick Stolee 
---
 commit-reach.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/commit-reach.c b/commit-reach.c
index 0fc3b1ac18..0a75644653 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -563,6 +563,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
struct commit **list = NULL;
int i;
int result = 1;
+   uint32_t num_walked = 0;
 
ALLOC_ARRAY(list, from->nr);
for (i = 0; i < from->nr; i++) {
@@ -573,6 +574,8 @@ int can_all_from_reach_with_flag(struct object_array *from,
return 0;
}
 
+   trace2_region_enter("can_all_from_reach_with_flag");
+
QSORT(list, from->nr, compare_commits_by_gen);
 
for (i = 0; i < from->nr; i++) {
@@ -590,6 +593,8 @@ int can_all_from_reach_with_flag(struct object_array *from,
continue;
}
 
+   num_walked++;
+
for (parent = stack->item->parents; parent; parent = 
parent->next) {
if (parent->item->object.flags & (with_flag | 
RESULT))
stack->item->object.flags |= RESULT;
@@ -622,6 +627,9 @@ int can_all_from_reach_with_flag(struct object_array *from,
clear_commit_marks(list[i], RESULT);
clear_commit_marks(list[i], assign_flag);
}
+
+   trace2_data_intmax("can_all_from_reach_with_flag", "num_walked", 
num_walked);
+   trace2_region_leave("can_all_from_reach_with_flag");
return result;
 }
 
-- 
2.19.0.rc2



[RFC PATCH 2/6] comit-reach: use trace2 for commit_contains_tag_algo

2018-09-06 Thread Derrick Stolee
Signed-off-by: Derrick Stolee 
---
 commit-reach.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/commit-reach.c b/commit-reach.c
index ee374dce20..0fc3b1ac18 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -481,6 +481,7 @@ static enum contains_result contains_tag_algo(struct commit 
*candidate,
enum contains_result result;
uint32_t cutoff = GENERATION_NUMBER_INFINITY;
const struct commit_list *p;
+   uint32_t num_walked = 0;
 
for (p = want; p; p = p->next) {
struct commit *c = p->item;
@@ -493,12 +494,15 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
if (result != CONTAINS_UNKNOWN)
return result;
 
+   trace2_region_enter("contains_tag_algo");
push_to_contains_stack(candidate, _stack);
while (contains_stack.nr) {
struct contains_stack_entry *entry = 
_stack.contains_stack[contains_stack.nr - 1];
struct commit *commit = entry->commit;
struct commit_list *parents = entry->parents;
 
+   num_walked++;
+
if (!parents) {
*contains_cache_at(cache, commit) = CONTAINS_NO;
contains_stack.nr--;
@@ -521,7 +525,13 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
}
}
free(contains_stack.contains_stack);
-   return contains_test(candidate, want, cache, cutoff);
+
+   result = contains_test(candidate, want, cache, cutoff);
+
+   trace2_data_intmax("contains_tag_algo", "num_walked", num_walked);
+   trace2_region_leave("contains_tag_algo");
+
+   return result;
 }
 
 int commit_contains(struct ref_filter *filter, struct commit *commit,
-- 
2.19.0.rc2



[RFC PATCH 4/6] test-tool: start trace2 environment

2018-09-06 Thread Derrick Stolee
Signed-off-by: Derrick Stolee 
---
 t/helper/test-tool.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 7566b0786a..f70d5d74f8 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -1,5 +1,6 @@
 #include "git-compat-util.h"
 #include "test-tool.h"
+#include "trace2.h"
 
 struct test_cmd {
const char *name;
@@ -55,6 +56,8 @@ int cmd_main(int argc, const char **argv)
if (argc < 2)
die("I need a test name!");
 
+   trace2_start(argv);
+
for (i = 0; i < ARRAY_SIZE(cmds); i++) {
if (!strcmp(cmds[i].name, argv[1])) {
argv++;
-- 
2.19.0.rc2



[RFC PATCH 5/6] test-lib: add run_and_check_trace2

2018-09-06 Thread Derrick Stolee
The trace2 facility allows tracing category-key-value triples that
we can use to communicate runtime information to a side channel.
One use is to track the number of commits that are walked by a
graph algorithm.

Add run_and_check_trace2 test function to run a given command with
GIT_TR2_PERFORMANCE running. Then, check the output for the
expected category-key-value triple.

Use this function in t6600-test-reach.sh to verify can_all_from_reach
only visits 11 commits in the example.

Signed-off-by: Derrick Stolee 
---
 t/t6600-test-reach.sh |  6 ++
 t/test-lib.sh | 14 ++
 2 files changed, 20 insertions(+)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index d139a00d1d..98ad25bb45 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -183,6 +183,12 @@ test_expect_success 'can_all_from_reach:hit' '
test_three_modes can_all_from_reach
 '
 
+test_expect_success 'can_all_from_reach:perf' '
+   cp commit-graph-full .git/objects/info/commit-graph &&
+   run_and_check_trace2 can_all_from_reach_with_flag num_walked 40 input \
+   "test-tool reach can_all_from_reach"
+'
+
 test_expect_success 'can_all_from_reach:miss' '
cat >input <<-\EOF &&
X:commit-2-10
diff --git a/t/test-lib.sh b/t/test-lib.sh
index 8bb0f4348e..9b9f68f324 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -1231,3 +1231,17 @@ test_lazy_prereq CURL '
 test_lazy_prereq SHA1 '
test $(git hash-object /dev/null) = 
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
 '
+
+# Useage: run_and_check_trace2 
+# Run "command 

[RFC PATCH 1/6] commit-reach: add trace2 telemetry and walk count

2018-09-06 Thread Derrick Stolee
paint_down_to_common() is used by many Git commands, and sometimes
multiple times in a single call. It is important to measure
performance of this method, but the actual time it takes can vary
due to interactions outside Git's control (file system, CPU
contention, etc.). Instead, count how many times we execute the
while loop, which is consistent between runs.

Signed-off-by: Derrick Stolee 
---
 commit-reach.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/commit-reach.c b/commit-reach.c
index 86715c103c..ee374dce20 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -38,6 +38,9 @@ static struct commit_list *paint_down_to_common(struct commit 
*one, int n,
struct commit_list *result = NULL;
int i;
uint32_t last_gen = GENERATION_NUMBER_INFINITY;
+   uint32_t num_walked = 0;
+
+   trace2_region_enter("paint_down_to_common");
 
one->object.flags |= PARENT1;
if (!n) {
@@ -55,6 +58,7 @@ static struct commit_list *paint_down_to_common(struct commit 
*one, int n,
struct commit *commit = prio_queue_get();
struct commit_list *parents;
int flags;
+   num_walked++;
 
if (commit->generation > last_gen)
BUG("bad generation skip %8x > %8x at %s",
@@ -88,6 +92,10 @@ static struct commit_list *paint_down_to_common(struct 
commit *one, int n,
}
 
clear_prio_queue();
+
+   trace2_data_intmax("paint_down_to_common", "num_walked", num_walked);
+   trace2_region_leave("paint_down_to_common");
+
return result;
 }
 
-- 
2.19.0.rc2



[RFC PATCH 0/6] Use trace2 in commit-reach

2018-09-06 Thread Derrick Stolee
As promised, here is the direction I took when applying the trace2
feature to the walking code in commit-reach.c. Hence, this depends
on Jeff's trace2 patches and ds/reachable.

There are multiple benefits to the approach I take here:

1. If a user has performance problems, we can rerun the command
   with tracing on and get the region enter/leave notifications
   and the extra data like "num_walked".

2. While I was testing this series against real-world examples,
   I found a number I didn't expect. The numbers for
   can_all_from_reach_with_flags were much higher than I expected.
   Turns out the heuristic I wrote was not working correctly. With
   the trace2 library, I was able to add a "run_and_check_trace2"
   function in test-lib.sh so I could make the number of walked
   commits be a condition we check in the test. Then, the benefit
   we expect is demonstrated by the test suite when I fix the
   bug.

Thanks,
-Stolee

P.S. I'm sending this RFC from gmail because I'm having SMTP issues
with my work email.

Derrick Stolee (6):
  commit-reach: add trace2 telemetry and walk count
  comit-reach: use trace2 for commit_contains_tag_algo
  commit-reach: use trace2 in can_all_from_reach
  test-tool: start trace2 environment
  test-lib: add run_and_check_trace2
  commit-reach: fix first-parent heuristic

 commit-reach.c| 32 ++--
 t/helper/test-tool.c  |  3 +++
 t/t6600-test-reach.sh |  6 ++
 t/test-lib.sh | 14 ++
 4 files changed, 53 insertions(+), 2 deletions(-)

-- 
2.19.0-rc2



Re: [PATCH 1/2] commit-graph write: add progress output

2018-09-05 Thread Derrick Stolee

On 9/5/2018 5:46 PM, Junio C Hamano wrote:

Derrick Stolee  writes:


for (i = 0; i < commits->nr; i++) {
+   display_progress(progress, i);
if (commits->list[i]->generation != GENERATION_NUMBER_INFINITY 
&&
commits->list[i]->generation != GENERATION_NUMBER_ZERO)
continue;

I am wondering if the progress call should be moved after this
conditional continue; would we want to count the entry whose
generation is already known here?  Of course, as we give commits->nr
as the 100% ceiling, we cannot avoid doing so, but it somehow smells
wrong.

If we wanted to be completely right, we would count the commits in the
list that do not have a generation number and report that as the 100%
ceiling.

Yeah, but I realize that the definition of "right" really depends on
what we consider a task being accomplished is in this loop.  If we
define the task to "we have some number of commits that lack
generation numbers and our task is to assign numbers to them.", then
yes counting the ones without generation number and culling the ones
that already have generation number is outside the work and we need
another loop to count them.  But the position the posted patch takes
is also a valid one: we have some commits and we are making sure
each and every one of them has assigned a generation number.

So I do not think it is necessary to introduce another loop just for
counting.

Thanks.

Makes sense to me. Thanks!


Re: [PATCH 01/11] multi-pack-index: add 'verify' verb

2018-09-05 Thread Derrick Stolee

On 9/5/2018 2:59 PM, Eric Sunshine wrote:

On Wed, Sep 5, 2018 at 10:46 AM Derrick Stolee via GitGitGadget
 wrote:

The multi-pack-index builtin writes multi-pack-index files, and
uses a 'write' verb to do so. Add a 'verify' verb that checks this
file matches the contents of the pack-indexes it replaces.
[...]
Signed-off-by: Derrick Stolee 
---
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
@@ -5,7 +5,7 @@
  static char const * const builtin_multi_pack_index_usage[] = {
-   N_("git multi-pack-index [--object-dir=] write"),
+   N_("git multi-pack-index [--object-dir=] [write|verify]"),

Nit: The square brackets make the verb optional. You probably want
parenthesis to indicate that one or the other is required:

 git multi-pack-index [--object-dir=] (write|verify)
Thanks for catching this mistake! This would be easy to miss and keep 
around forever.


Re: [PATCH 04/11] multi-pack-index: verify packname order

2018-09-05 Thread Derrick Stolee

On 9/5/2018 3:14 PM, Stefan Beller wrote:

On Wed, Sep 5, 2018 at 12:11 PM Derrick Stolee  wrote:

On 9/5/2018 2:15 PM, Stefan Beller wrote:

On Wed, Sep 5, 2018 at 7:46 AM Derrick Stolee via GitGitGadget
 wrote:

From: Derrick Stolee 

The final check we make while loading a multi-pack-index is that
the packfile names are in lexicographical order. Make this error
be a die() instead.

What is the advantage of having a die() here?
Would the midx work under normal operation when
not sorted correctly?

This sounds like a hack for easy testability in this context,
so could you clarify why this helps the regular user?

The multi-pack-index will not work as expected, since
midx_contains_pack() may provide incorrect results, and thus we will add
the "missing" packfiles to the packed_git linked list.

This _should_ be a die(), as something unexpected occurred (the file
doesn't match the format specification).

Thanks for the clarification.

So this patch actually hits 2 birds with one stone, as it fixes
a bug as well as adds the check for such corruption to the
verify step?
This is a common occurrence in this series (replacing error() with 
die()) as we are now exercising those conditions and clarifying what 
should happen.


Re: [PATCH 04/11] multi-pack-index: verify packname order

2018-09-05 Thread Derrick Stolee

On 9/5/2018 2:15 PM, Stefan Beller wrote:

On Wed, Sep 5, 2018 at 7:46 AM Derrick Stolee via GitGitGadget
 wrote:

From: Derrick Stolee 

The final check we make while loading a multi-pack-index is that
the packfile names are in lexicographical order. Make this error
be a die() instead.

What is the advantage of having a die() here?
Would the midx work under normal operation when
not sorted correctly?

This sounds like a hack for easy testability in this context,
so could you clarify why this helps the regular user?


The multi-pack-index will not work as expected, since 
midx_contains_pack() may provide incorrect results, and thus we will add 
the "missing" packfiles to the packed_git linked list.


This _should_ be a die(), as something unexpected occurred (the file 
doesn't match the format specification).




[PATCH 11/11] fsck: verify multi-pack-index

2018-09-05 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When core.multiPackIndex is true, we may have a multi-pack-index
in our object directory. Add calls to 'git multi-pack-index verify'
at the end of 'git fsck' if so.

Signed-off-by: Derrick Stolee 
---
 builtin/fsck.c  | 18 ++
 t/t5319-multi-pack-index.sh | 13 -
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 63c8578cc1..06eb421720 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -848,5 +848,23 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
}
}
 
+   if (!git_config_get_bool("core.multipackindex", ) && i) {
+   struct child_process midx_verify = CHILD_PROCESS_INIT;
+   const char *midx_argv[] = { "multi-pack-index", "verify", NULL, 
NULL, NULL };
+
+   midx_verify.argv = midx_argv;
+   midx_verify.git_cmd = 1;
+   if (run_command(_verify))
+   errors_found |= ERROR_COMMIT_GRAPH;
+
+   prepare_alt_odb(the_repository);
+   for (alt =  the_repository->objects->alt_odb_list; alt; alt = 
alt->next) {
+   midx_argv[2] = "--object-dir";
+   midx_argv[3] = alt->path;
+   if (run_command(_verify))
+   errors_found |= ERROR_COMMIT_GRAPH;
+   }
+   }
+
return errors_found;
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index b6c34631d3..63f8718fd7 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -160,12 +160,17 @@ corrupt_midx_and_verify() {
DATA="${2:-\0}"
OBJDIR=$3
GREPSTR="$4"
+   COMMAND="$5"
+   if test -z "$COMMAND"
+   then
+   COMMAND="git multi-pack-index verify --object-dir=$OBJDIR"
+   fi
FILE=$OBJDIR/pack/multi-pack-index &&
chmod a+w $FILE &&
test_when_finished mv midx-backup $FILE &&
cp $FILE midx-backup &&
printf "$DATA" | dd of="$FILE" bs=1 seek="$POS" conv=notrunc &&
-   test_must_fail git multi-pack-index verify --object-dir=$OBJDIR 
2>test_err &&
+   test_must_fail $COMMAND 2>test_err &&
grep -v "^+" test_err >err &&
test_i18ngrep "$GREPSTR" err
 }
@@ -258,6 +263,12 @@ test_expect_success 'verify incorrect offset' '
"incorrect object offset"
 '
 
+test_expect_success 'git-fsck incorrect offset' '
+   corrupt_midx_and_verify $MIDX_BYTE_OFFSET "\07" $objdir \
+   "incorrect object offset" \
+   "git -c core.multipackindex=true fsck"
+'
+
 test_expect_success 'repack removes multi-pack-index' '
test_path_is_file $objdir/pack/multi-pack-index &&
git repack -adf &&
-- 
gitgitgadget


[PATCH 10/11] multi-pack-index: report progress during 'verify'

2018-09-05 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When verifying a multi-pack-index, the only action that takes
significant time is checking the object offsets. For example,
to verify a multi-pack-index containing 6.2 million objects in
the Linux kernel repository takes 1.3 seconds on my machine.
99% of that time is spent looking up object offsets in each of
the packfiles and comparing them to the multi-pack-index offset.

Add a progress indicator for that section of the 'verify' verb.

Signed-off-by: Derrick Stolee 
---
 midx.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/midx.c b/midx.c
index 87708dc367..489180a599 100644
--- a/midx.c
+++ b/midx.c
@@ -7,6 +7,7 @@
 #include "object-store.h"
 #include "sha1-lookup.h"
 #include "midx.h"
+#include "progress.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -939,6 +940,7 @@ static void midx_report(const char *fmt, ...)
 int verify_midx_file(const char *object_dir)
 {
uint32_t i;
+   struct progress *progress = NULL;
struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
verify_midx_error = 0;
 
@@ -970,6 +972,7 @@ int verify_midx_file(const char *object_dir)
i, oid_to_hex(), oid_to_hex(), i 
+ 1);
}
 
+   progress = start_progress(_("Verifying object offsets"), 
m->num_objects);
for (i = 0; i < m->num_objects; i++) {
struct object_id oid;
struct pack_entry e;
@@ -994,7 +997,10 @@ int verify_midx_file(const char *object_dir)
if (m_offset != p_offset)
midx_report(_("incorrect object offset for oid[%d] = 
%s: %"PRIx64" != %"PRIx64),
i, oid_to_hex(), m_offset, p_offset);
+
+   display_progress(progress, i + 1);
}
+   stop_progress();
 
return verify_midx_error;
 }
-- 
gitgitgadget



[PATCH 03/11] multi-pack-index: verify corrupt chunk lookup table

2018-09-05 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

Signed-off-by: Derrick Stolee 
---
 midx.c  |  3 +++
 t/t5319-multi-pack-index.sh | 13 +
 2 files changed, 16 insertions(+)

diff --git a/midx.c b/midx.c
index ec78254bb6..8b054b39ab 100644
--- a/midx.c
+++ b/midx.c
@@ -100,6 +100,9 @@ struct multi_pack_index *load_multi_pack_index(const char 
*object_dir, int local
uint64_t chunk_offset = get_be64(m->data + MIDX_HEADER_SIZE + 4 
+
 MIDX_CHUNKLOOKUP_WIDTH * i);
 
+   if (chunk_offset >= m->data_len)
+   die(_("invalid chunk offset (too large)"));
+
switch (chunk_id) {
case MIDX_CHUNKID_PACKNAMES:
m->chunk_pack_names = m->data + chunk_offset;
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 836d2bdb53..5c9c499aa6 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -178,6 +178,9 @@ test_expect_success 'verify bad signature' '
 MIDX_BYTE_VERSION=4
 MIDX_BYTE_OID_VERSION=5
 MIDX_BYTE_CHUNK_COUNT=6
+MIDX_HEADER_SIZE=12
+MIDX_BYTE_CHUNK_ID=$MIDX_HEADER_SIZE
+MIDX_BYTE_CHUNK_OFFSET=$(($MIDX_HEADER_SIZE + 4))
 
 test_expect_success 'verify bad version' '
corrupt_midx_and_verify $MIDX_BYTE_VERSION "\00" $objdir \
@@ -199,6 +202,16 @@ test_expect_success 'verify extended chunk count' '
"terminating multi-pack-index chunk id appears earlier than 
expected"
 '
 
+test_expect_success 'verify missing required chunk' '
+   corrupt_midx_and_verify $MIDX_BYTE_CHUNK_ID "\01" $objdir \
+   "missing required"
+'
+
+test_expect_success 'verify invalid chunk offset' '
+   corrupt_midx_and_verify $MIDX_BYTE_CHUNK_OFFSET "\01" $objdir \
+   "invalid chunk offset (too large)"
+'
+
 test_expect_success 'repack removes multi-pack-index' '
test_path_is_file $objdir/pack/multi-pack-index &&
git repack -adf &&
-- 
gitgitgadget



[PATCH 07/11] multi-pack-index: verify oid lookup order

2018-09-05 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

Signed-off-by: Derrick Stolee 
---
 midx.c  | 11 +++
 t/t5319-multi-pack-index.sh |  8 
 2 files changed, 19 insertions(+)

diff --git a/midx.c b/midx.c
index dfd26b4d74..06d5cfc826 100644
--- a/midx.c
+++ b/midx.c
@@ -959,5 +959,16 @@ int verify_midx_file(const char *object_dir)
i, oid_fanout1, oid_fanout2, i + 1);
}
 
+   for (i = 0; i < m->num_objects - 1; i++) {
+   struct object_id oid1, oid2;
+
+   nth_midxed_object_oid(, m, i);
+   nth_midxed_object_oid(, m, i + 1);
+
+   if (oidcmp(, ) >= 0)
+   midx_report(_("oid lookup out of order: oid[%d] = %s >= 
%s = oid[%d]"),
+   i, oid_to_hex(), oid_to_hex(), i 
+ 1);
+   }
+
return verify_midx_error;
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 95ac7a6edd..072b1d1a74 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -175,6 +175,7 @@ test_expect_success 'verify bad signature' '
"multi-pack-index signature"
 '
 
+HASH_LEN=20
 MIDX_BYTE_VERSION=4
 MIDX_BYTE_OID_VERSION=5
 MIDX_BYTE_CHUNK_COUNT=6
@@ -189,6 +190,8 @@ MIDX_BYTE_PACKNAME_ORDER=$(($MIDX_OFFSET_PACKNAMES + 2))
 MIDX_OFFSET_OID_FANOUT=$(($MIDX_OFFSET_PACKNAMES + 652))
 MIDX_OID_FANOUT_WIDTH=4
 MIDX_BYTE_OID_FANOUT_ORDER=$((MIDX_OFFSET_OID_FANOUT + 250 * 
$MIDX_OID_FANOUT_WIDTH + 1))
+MIDX_OFFSET_OID_LOOKUP=$(($MIDX_OFFSET_OID_FANOUT + 256 * 
$MIDX_OID_FANOUT_WIDTH))
+MIDX_BYTE_OID_LOOKUP=$(($MIDX_OFFSET_OID_LOOKUP + 16 * $HASH_LEN))
 
 test_expect_success 'verify bad version' '
corrupt_midx_and_verify $MIDX_BYTE_VERSION "\00" $objdir \
@@ -235,6 +238,11 @@ test_expect_success 'verify oid fanout out of order' '
"oid fanout out of order"
 '
 
+test_expect_success 'verify oid lookup out of order' '
+   corrupt_midx_and_verify $MIDX_BYTE_OID_LOOKUP "\00" $objdir \
+   "oid lookup out of order"
+'
+
 test_expect_success 'repack removes multi-pack-index' '
test_path_is_file $objdir/pack/multi-pack-index &&
git repack -adf &&
-- 
gitgitgadget



[PATCH 09/11] multi-pack-index: verify object offsets

2018-09-05 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The 'git multi-pack-index verify' command must verify the object
offsets stored in the multi-pack-index are correct. There are two
ways the offset chunk can be incorrect: the pack-int-id and the
object offset.

Replace the BUG() statement with a die() statement, now that we
may hit a bad pack-int-id during a 'verify' command on a corrupt
multi-pack-index, and it is covered by a test.

Signed-off-by: Derrick Stolee 
---
 midx.c  | 28 +++-
 t/t5319-multi-pack-index.sh | 27 +++
 2 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 80094c02a7..87708dc367 100644
--- a/midx.c
+++ b/midx.c
@@ -197,7 +197,7 @@ int prepare_midx_pack(struct multi_pack_index *m, uint32_t 
pack_int_id)
struct strbuf pack_name = STRBUF_INIT;
 
if (pack_int_id >= m->num_packs)
-   BUG("bad pack-int-id");
+   die(_("bad pack-int-id"));
 
if (m->packs[pack_int_id])
return 0;
@@ -970,5 +970,31 @@ int verify_midx_file(const char *object_dir)
i, oid_to_hex(), oid_to_hex(), i 
+ 1);
}
 
+   for (i = 0; i < m->num_objects; i++) {
+   struct object_id oid;
+   struct pack_entry e;
+   off_t m_offset, p_offset;
+
+   nth_midxed_object_oid(, m, i);
+   if (!fill_midx_entry(, , m)) {
+   midx_report(_("failed to load pack entry for oid[%d] = 
%s"),
+   i, oid_to_hex());
+   continue;
+   }
+
+   if (open_pack_index(e.p)) {
+   midx_report(_("failed to load pack-index for packfile 
%s"),
+   e.p->pack_name);
+   break;
+   }
+
+   m_offset = e.offset;
+   p_offset = find_pack_entry_one(oid.hash, e.p);
+
+   if (m_offset != p_offset)
+   midx_report(_("incorrect object offset for oid[%d] = 
%s: %"PRIx64" != %"PRIx64),
+   i, oid_to_hex(), m_offset, p_offset);
+   }
+
return verify_midx_error;
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 072b1d1a74..b6c34631d3 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -176,6 +176,7 @@ test_expect_success 'verify bad signature' '
 '
 
 HASH_LEN=20
+NUM_OBJECTS=74
 MIDX_BYTE_VERSION=4
 MIDX_BYTE_OID_VERSION=5
 MIDX_BYTE_CHUNK_COUNT=6
@@ -192,6 +193,10 @@ MIDX_OID_FANOUT_WIDTH=4
 MIDX_BYTE_OID_FANOUT_ORDER=$((MIDX_OFFSET_OID_FANOUT + 250 * 
$MIDX_OID_FANOUT_WIDTH + 1))
 MIDX_OFFSET_OID_LOOKUP=$(($MIDX_OFFSET_OID_FANOUT + 256 * 
$MIDX_OID_FANOUT_WIDTH))
 MIDX_BYTE_OID_LOOKUP=$(($MIDX_OFFSET_OID_LOOKUP + 16 * $HASH_LEN))
+MIDX_OFFSET_OBJECT_OFFSETS=$(($MIDX_OFFSET_OID_LOOKUP + $NUM_OBJECTS * 
$HASH_LEN))
+MIDX_OFFSET_WIDTH=8
+MIDX_BYTE_PACK_INT_ID=$(($MIDX_OFFSET_OBJECT_OFFSETS + 16 * $MIDX_OFFSET_WIDTH 
+ 2))
+MIDX_BYTE_OFFSET=$(($MIDX_OFFSET_OBJECT_OFFSETS + 16 * $MIDX_OFFSET_WIDTH + 6))
 
 test_expect_success 'verify bad version' '
corrupt_midx_and_verify $MIDX_BYTE_VERSION "\00" $objdir \
@@ -243,6 +248,16 @@ test_expect_success 'verify oid lookup out of order' '
"oid lookup out of order"
 '
 
+test_expect_success 'verify incorrect pack-int-id' '
+   corrupt_midx_and_verify $MIDX_BYTE_PACK_INT_ID "\07" $objdir \
+   "bad pack-int-id"
+'
+
+test_expect_success 'verify incorrect offset' '
+   corrupt_midx_and_verify $MIDX_BYTE_OFFSET "\07" $objdir \
+   "incorrect object offset"
+'
+
 test_expect_success 'repack removes multi-pack-index' '
test_path_is_file $objdir/pack/multi-pack-index &&
git repack -adf &&
@@ -310,4 +325,16 @@ test_expect_success 'verify multi-pack-index with 64-bit 
offsets' '
git multi-pack-index verify --object-dir=objects64
 '
 
+NUM_OBJECTS=63
+MIDX_OFFSET_OID_FANOUT=$((MIDX_OFFSET_PACKNAMES + 54))
+MIDX_OFFSET_OID_LOOKUP=$((MIDX_OFFSET_OID_FANOUT + 256 * 
$MIDX_OID_FANOUT_WIDTH))
+MIDX_OFFSET_OBJECT_OFFSETS=$(($MIDX_OFFSET_OID_LOOKUP + $NUM_OBJECTS * 
$HASH_LEN))
+MIDX_OFFSET_LARGE_OFFSETS=$(($MIDX_OFFSET_OBJECT_OFFSETS + $NUM_OBJECTS * 
$MIDX_OFFSET_WIDTH))
+MIDX_BYTE_LARGE_OFFSET=$(($MIDX_OFFSET_LARGE_OFFSETS + 3))
+
+test_expect_success 'verify incorrect 64-bit offset' '
+   corrupt_midx_and_verify $MIDX_BYTE_LARGE_OFFSET "\07" objects64 \
+   "incorrect object offset"
+'
+
 test_done
-- 
gitgitgadget



[PATCH 08/11] multi-pack-index: fix 32-bit vs 64-bit size check

2018-09-05 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When loading a 64-bit offset, we intend to check that off_t can store
the resulting offset. However, the condition accidentally checks the
32-bit offset to see if it is smaller than a 64-bit value. Fix it,
and this will be covered by a test in the 'git multi-pack-index verify'
command in a later commit.

Signed-off-by: Derrick Stolee 
---
 midx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 06d5cfc826..80094c02a7 100644
--- a/midx.c
+++ b/midx.c
@@ -236,7 +236,7 @@ static off_t nth_midxed_offset(struct multi_pack_index *m, 
uint32_t pos)
offset32 = get_be32(offset_data + sizeof(uint32_t));
 
if (m->chunk_large_offsets && offset32 & MIDX_LARGE_OFFSET_NEEDED) {
-   if (sizeof(offset32) < sizeof(uint64_t))
+   if (sizeof(off_t) < sizeof(uint64_t))
die(_("multi-pack-index stores a 64-bit offset, but 
off_t is too small"));
 
offset32 ^= MIDX_LARGE_OFFSET_NEEDED;
-- 
gitgitgadget



[PATCH 06/11] multi-pack-index: verify oid fanout order

2018-09-05 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

Signed-off-by: Derrick Stolee 
---
 midx.c  | 9 +
 t/t5319-multi-pack-index.sh | 8 
 2 files changed, 17 insertions(+)

diff --git a/midx.c b/midx.c
index a02b19efc1..dfd26b4d74 100644
--- a/midx.c
+++ b/midx.c
@@ -950,5 +950,14 @@ int verify_midx_file(const char *object_dir)
midx_report("failed to load pack in position %d", i);
}
 
+   for (i = 0; i < 255; i++) {
+   uint32_t oid_fanout1 = ntohl(m->chunk_oid_fanout[i]);
+   uint32_t oid_fanout2 = ntohl(m->chunk_oid_fanout[i + 1]);
+
+   if (oid_fanout1 > oid_fanout2)
+   midx_report(_("oid fanout out of order: fanout[%d] = 
%"PRIx32" > %"PRIx32" = fanout[%d]"),
+   i, oid_fanout1, oid_fanout2, i + 1);
+   }
+
return verify_midx_error;
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 52b7f7905b..95ac7a6edd 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -186,6 +186,9 @@ MIDX_CHUNK_LOOKUP_WIDTH=12
 MIDX_OFFSET_PACKNAMES=$(($MIDX_HEADER_SIZE + \
 $MIDX_NUM_CHUNKS * $MIDX_CHUNK_LOOKUP_WIDTH))
 MIDX_BYTE_PACKNAME_ORDER=$(($MIDX_OFFSET_PACKNAMES + 2))
+MIDX_OFFSET_OID_FANOUT=$(($MIDX_OFFSET_PACKNAMES + 652))
+MIDX_OID_FANOUT_WIDTH=4
+MIDX_BYTE_OID_FANOUT_ORDER=$((MIDX_OFFSET_OID_FANOUT + 250 * 
$MIDX_OID_FANOUT_WIDTH + 1))
 
 test_expect_success 'verify bad version' '
corrupt_midx_and_verify $MIDX_BYTE_VERSION "\00" $objdir \
@@ -227,6 +230,11 @@ test_expect_success 'verify packnames out of order' '
"failed to load pack"
 '
 
+test_expect_success 'verify oid fanout out of order' '
+   corrupt_midx_and_verify $MIDX_BYTE_OID_FANOUT_ORDER "\01" $objdir \
+   "oid fanout out of order"
+'
+
 test_expect_success 'repack removes multi-pack-index' '
test_path_is_file $objdir/pack/multi-pack-index &&
git repack -adf &&
-- 
gitgitgadget



[PATCH 05/11] multi-pack-index: verify missing pack

2018-09-05 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

Signed-off-by: Derrick Stolee 
---
 midx.c  | 16 
 t/t5319-multi-pack-index.sh |  5 +
 2 files changed, 21 insertions(+)

diff --git a/midx.c b/midx.c
index e655a15aed..a02b19efc1 100644
--- a/midx.c
+++ b/midx.c
@@ -926,13 +926,29 @@ void clear_midx_file(const char *object_dir)
 
 int verify_midx_error;
 
+static void midx_report(const char *fmt, ...)
+{
+   va_list ap;
+   verify_midx_error = 1;
+   va_start(ap, fmt);
+   vfprintf(stderr, fmt, ap);
+   fprintf(stderr, "\n");
+   va_end(ap);
+}
+
 int verify_midx_file(const char *object_dir)
 {
+   uint32_t i;
struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
verify_midx_error = 0;
 
if (!m)
return 0;
 
+   for (i = 0; i < m->num_packs; i++) {
+   if (prepare_midx_pack(m, i))
+   midx_report("failed to load pack in position %d", i);
+   }
+
return verify_midx_error;
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 4a8f231d93..52b7f7905b 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -222,6 +222,11 @@ test_expect_success 'verify packnames out of order' '
"pack names out of order"
 '
 
+test_expect_success 'verify packnames out of order' '
+   corrupt_midx_and_verify $MIDX_BYTE_PACKNAME_ORDER "a" $objdir \
+   "failed to load pack"
+'
+
 test_expect_success 'repack removes multi-pack-index' '
test_path_is_file $objdir/pack/multi-pack-index &&
git repack -adf &&
-- 
gitgitgadget



[PATCH 04/11] multi-pack-index: verify packname order

2018-09-05 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The final check we make while loading a multi-pack-index is that
the packfile names are in lexicographical order. Make this error
be a die() instead.

In order to test this condition, we need multiple packfiles.
Earlier in t5319-multi-pack-index.sh, we tested the interaction with
'git repack' but this limits us to one packfile in our object dir.
Move these repack tests until after the 'verify' tests.

Signed-off-by: Derrick Stolee 
---
 midx.c  |  6 ++
 t/t5319-multi-pack-index.sh | 10 ++
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/midx.c b/midx.c
index 8b054b39ab..e655a15aed 100644
--- a/midx.c
+++ b/midx.c
@@ -157,12 +157,10 @@ struct multi_pack_index *load_multi_pack_index(const char 
*object_dir, int local
 
cur_pack_name += strlen(cur_pack_name) + 1;
 
-   if (i && strcmp(m->pack_names[i], m->pack_names[i - 1]) <= 0) {
-   error(_("multi-pack-index pack names out of order: '%s' 
before '%s'"),
+   if (i && strcmp(m->pack_names[i], m->pack_names[i - 1]) <= 0)
+   die(_("multi-pack-index pack names out of order: '%s' 
before '%s'"),
  m->pack_names[i - 1],
  m->pack_names[i]);
-   goto cleanup_fail;
-   }
}
 
return m;
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 5c9c499aa6..4a8f231d93 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -181,6 +181,11 @@ MIDX_BYTE_CHUNK_COUNT=6
 MIDX_HEADER_SIZE=12
 MIDX_BYTE_CHUNK_ID=$MIDX_HEADER_SIZE
 MIDX_BYTE_CHUNK_OFFSET=$(($MIDX_HEADER_SIZE + 4))
+MIDX_NUM_CHUNKS=5
+MIDX_CHUNK_LOOKUP_WIDTH=12
+MIDX_OFFSET_PACKNAMES=$(($MIDX_HEADER_SIZE + \
+$MIDX_NUM_CHUNKS * $MIDX_CHUNK_LOOKUP_WIDTH))
+MIDX_BYTE_PACKNAME_ORDER=$(($MIDX_OFFSET_PACKNAMES + 2))
 
 test_expect_success 'verify bad version' '
corrupt_midx_and_verify $MIDX_BYTE_VERSION "\00" $objdir \
@@ -212,6 +217,11 @@ test_expect_success 'verify invalid chunk offset' '
"invalid chunk offset (too large)"
 '
 
+test_expect_success 'verify packnames out of order' '
+   corrupt_midx_and_verify $MIDX_BYTE_PACKNAME_ORDER "z" $objdir \
+   "pack names out of order"
+'
+
 test_expect_success 'repack removes multi-pack-index' '
test_path_is_file $objdir/pack/multi-pack-index &&
git repack -adf &&
-- 
gitgitgadget



[PATCH 01/11] multi-pack-index: add 'verify' verb

2018-09-05 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The multi-pack-index builtin writes multi-pack-index files, and
uses a 'write' verb to do so. Add a 'verify' verb that checks this
file matches the contents of the pack-indexes it replaces.

The current implementation is a no-op, but will be extended in
small increments in later commits.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-multi-pack-index.txt | 10 ++
 builtin/multi-pack-index.c |  4 +++-
 midx.c | 13 +
 midx.h |  1 +
 t/t5319-multi-pack-index.sh|  8 
 5 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-multi-pack-index.txt 
b/Documentation/git-multi-pack-index.txt
index 1f97e79912..f7778a2c85 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -27,6 +27,10 @@ write::
When given as the verb, write a new MIDX file to
`/packs/multi-pack-index`.
 
+verify::
+   When given as the verb, verify the contents of the MIDX file
+   at `/packs/multi-pack-index`.
+
 
 EXAMPLES
 
@@ -43,6 +47,12 @@ $ git multi-pack-index write
 $ git multi-pack-index --object-dir  write
 ---
 
+* Verify the MIDX file for the packfiles in the current .git folder.
++
+---
+$ git multi-pack-index verify
+---
+
 
 SEE ALSO
 
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 2633efd95d..7d567dafbc 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -5,7 +5,7 @@
 #include "midx.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
-   N_("git multi-pack-index [--object-dir=] write"),
+   N_("git multi-pack-index [--object-dir=] [write|verify]"),
NULL
 };
 
@@ -42,6 +42,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
 
if (!strcmp(argv[0], "write"))
return write_midx_file(opts.object_dir);
+   if (!strcmp(argv[0], "verify"))
+   return verify_midx_file(opts.object_dir);
 
die(_("unrecognized verb: %s"), argv[0]);
 }
diff --git a/midx.c b/midx.c
index f3e8dbc108..b253bed517 100644
--- a/midx.c
+++ b/midx.c
@@ -928,3 +928,16 @@ void clear_midx_file(const char *object_dir)
 
free(midx);
 }
+
+int verify_midx_error;
+
+int verify_midx_file(const char *object_dir)
+{
+   struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
+   verify_midx_error = 0;
+
+   if (!m)
+   return 0;
+
+   return verify_midx_error;
+}
diff --git a/midx.h b/midx.h
index a210f1af2a..ce80b91c68 100644
--- a/midx.h
+++ b/midx.h
@@ -43,5 +43,6 @@ int prepare_multi_pack_index_one(struct repository *r, const 
char *object_dir, i
 
 int write_midx_file(const char *object_dir);
 void clear_midx_file(const char *object_dir);
+int verify_midx_file(const char *object_dir);
 
 #endif
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 6f56b38674..1c4e0e6d31 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -150,6 +150,10 @@ test_expect_success 'write midx with twelve packs' '
 
 compare_results_with_midx "twelve packs"
 
+test_expect_success 'verify multi-pack-index success' '
+   git multi-pack-index verify --object-dir=$objdir
+'
+
 test_expect_success 'repack removes multi-pack-index' '
test_path_is_file $objdir/pack/multi-pack-index &&
git repack -adf &&
@@ -214,4 +218,8 @@ test_expect_success 'force some 64-bit offsets with 
pack-objects' '
midx_read_expect 1 63 5 objects64 " large-offsets"
 '
 
+test_expect_success 'verify multi-pack-index with 64-bit offsets' '
+   git multi-pack-index verify --object-dir=objects64
+'
+
 test_done
-- 
gitgitgadget



[PATCH 02/11] multi-pack-index: verify bad header

2018-09-05 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When verifying if a multi-pack-index file is valid, we want the
command to fail to signal an invalid file. Previously, we wrote
an error to stderr and continued as if we had no multi-pack-index.
Now, die() instead of error().

Add tests that check corrupted headers in a few ways:

* Bad signature
* Bad file version
* Bad hash version
* Truncated hash count
* Extended hash count

Signed-off-by: Derrick Stolee 
---
 midx.c  | 18 +--
 t/t5319-multi-pack-index.sh | 46 -
 2 files changed, 51 insertions(+), 13 deletions(-)

diff --git a/midx.c b/midx.c
index b253bed517..ec78254bb6 100644
--- a/midx.c
+++ b/midx.c
@@ -76,24 +76,18 @@ struct multi_pack_index *load_multi_pack_index(const char 
*object_dir, int local
m->local = local;
 
m->signature = get_be32(m->data);
-   if (m->signature != MIDX_SIGNATURE) {
-   error(_("multi-pack-index signature 0x%08x does not match 
signature 0x%08x"),
+   if (m->signature != MIDX_SIGNATURE)
+   die(_("multi-pack-index signature 0x%08x does not match 
signature 0x%08x"),
  m->signature, MIDX_SIGNATURE);
-   goto cleanup_fail;
-   }
 
m->version = m->data[MIDX_BYTE_FILE_VERSION];
-   if (m->version != MIDX_VERSION) {
-   error(_("multi-pack-index version %d not recognized"),
+   if (m->version != MIDX_VERSION)
+   die(_("multi-pack-index version %d not recognized"),
  m->version);
-   goto cleanup_fail;
-   }
 
hash_version = m->data[MIDX_BYTE_HASH_VERSION];
-   if (hash_version != MIDX_HASH_VERSION) {
-   error(_("hash version %u does not match"), hash_version);
-   goto cleanup_fail;
-   }
+   if (hash_version != MIDX_HASH_VERSION)
+   die(_("hash version %u does not match"), hash_version);
m->hash_len = MIDX_HASH_LEN;
 
m->num_chunks = m->data[MIDX_BYTE_NUM_CHUNKS];
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 1c4e0e6d31..836d2bdb53 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -154,6 +154,51 @@ test_expect_success 'verify multi-pack-index success' '
git multi-pack-index verify --object-dir=$objdir
 '
 
+# usage: corrupt_midx_and_verify
+corrupt_midx_and_verify() {
+   POS=$1
+   DATA="${2:-\0}"
+   OBJDIR=$3
+   GREPSTR="$4"
+   FILE=$OBJDIR/pack/multi-pack-index &&
+   chmod a+w $FILE &&
+   test_when_finished mv midx-backup $FILE &&
+   cp $FILE midx-backup &&
+   printf "$DATA" | dd of="$FILE" bs=1 seek="$POS" conv=notrunc &&
+   test_must_fail git multi-pack-index verify --object-dir=$OBJDIR 
2>test_err &&
+   grep -v "^+" test_err >err &&
+   test_i18ngrep "$GREPSTR" err
+}
+
+test_expect_success 'verify bad signature' '
+   corrupt_midx_and_verify 0 "\00" $objdir \
+   "multi-pack-index signature"
+'
+
+MIDX_BYTE_VERSION=4
+MIDX_BYTE_OID_VERSION=5
+MIDX_BYTE_CHUNK_COUNT=6
+
+test_expect_success 'verify bad version' '
+   corrupt_midx_and_verify $MIDX_BYTE_VERSION "\00" $objdir \
+   "multi-pack-index version"
+'
+
+test_expect_success 'verify bad OID version' '
+   corrupt_midx_and_verify $MIDX_BYTE_OID_VERSION "\02" $objdir \
+   "hash version"
+'
+
+test_expect_success 'verify truncated chunk count' '
+   corrupt_midx_and_verify $MIDX_BYTE_CHUNK_COUNT "\01" $objdir \
+   "missing required"
+'
+
+test_expect_success 'verify extended chunk count' '
+   corrupt_midx_and_verify $MIDX_BYTE_CHUNK_COUNT "\07" $objdir \
+   "terminating multi-pack-index chunk id appears earlier than 
expected"
+'
+
 test_expect_success 'repack removes multi-pack-index' '
test_path_is_file $objdir/pack/multi-pack-index &&
git repack -adf &&
@@ -191,7 +236,6 @@ test_expect_success 'multi-pack-index in an alternate' '
 
 compare_results_with_midx "with alternate (remote midx)"
 
-
 # usage: corrupt_data   []
 corrupt_data () {
file=$1
-- 
gitgitgadget



[PATCH 00/11] Add 'git multi-pack-index verify' command

2018-09-05 Thread Derrick Stolee via GitGitGadget
The multi-pack-index file provides faster lookups in repos with many
packfiles by duplicating the information from multiple pack-indexes into a
single file. This series allows us to verify a multi-pack-index using 'git
multi-pack-index verify' and 'git fsck' (when core.multiPackIndex is true).

The pattern for the tests is similar to that found in t5318-commit-graph.sh.

During testing, I found a bug in how we check for the size of off_t (we are
not actually checking off_t, but instead uint32_t). See "multi-pack-index:
fix 32-bit vs 64-bit size check".

Thanks to Ævar [1], I added a commit that provides progress updates when
checking object offsets.

Based on ds/multi-pack-index

[1] 
https://public-inbox.org/git/20180904202729.13900-1-ava...@gmail.com/T/#u

Derrick Stolee (11):
  multi-pack-index: add 'verify' verb
  multi-pack-index: verify bad header
  multi-pack-index: verify corrupt chunk lookup table
  multi-pack-index: verify packname order
  multi-pack-index: verify missing pack
  multi-pack-index: verify oid fanout order
  multi-pack-index: verify oid lookup order
  multi-pack-index: fix 32-bit vs 64-bit size check
  multi-pack-index: verify object offsets
  multi-pack-index: report progress during 'verify'
  fsck: verify multi-pack-index

 Documentation/git-multi-pack-index.txt |  10 ++
 builtin/fsck.c |  18 
 builtin/multi-pack-index.c |   4 +-
 midx.c | 112 
 midx.h |   1 +
 t/t5319-multi-pack-index.sh| 136 -
 6 files changed, 261 insertions(+), 20 deletions(-)


base-commit: 6a22d521260f86dff8fe6f23ab329cebb62ba4f0
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-34%2Fderrickstolee%2Fmidx%2Fverify-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-34/derrickstolee/midx/verify-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/34
-- 
gitgitgadget


Re: [PATCH 0/2] commit-graph: add progress output

2018-09-05 Thread Derrick Stolee

On 9/4/2018 4:27 PM, Ævar Arnfjörð Bjarmason wrote:

This series adds progress output to the commit-graph command, so that
when it's called by "git gc" or "git fsck" we can see what's going on
with it.

Ævar Arnfjörð Bjarmason (2):
   commit-graph write: add progress output
   commit-graph verify: add progress output

  commit-graph.c | 44 +++-
  1 file changed, 43 insertions(+), 1 deletion(-)


Thanks for writing this, Ævar. I appreciate that you took the time to 
fill in an important UX gap.


I had a couple of comments, but generally this is a good change.

-Stolee



Re: [PATCH 1/2] commit-graph write: add progress output

2018-09-05 Thread Derrick Stolee

On 9/4/2018 4:27 PM, Ævar Arnfjörð Bjarmason wrote:

@@ -591,8 +597,13 @@ static void close_reachable(struct packed_oid_list *oids)
  {
int i;
struct commit *commit;
+   struct progress *progress = NULL;
+   int j = 0;


The change below over-counts the number of commits we are processing (by 
at least double, possibly triple).




+   progress = start_delayed_progress(
+   _("Annotating commits in commit graph"), 0);
for (i = 0; i < oids->nr; i++) {
+   display_progress(progress, ++j);
commit = lookup_commit(the_repository, >list[i]);
if (commit)
commit->object.flags |= UNINTERESTING;
This count is the number of oids given to the method. For 'git 
commit-graph write --reachable', this will be the number of refs.

@@ -604,6 +615,7 @@ static void close_reachable(struct packed_oid_list *oids)
 * closure.
 */
for (i = 0; i < oids->nr; i++) {
+   display_progress(progress, ++j);
commit = lookup_commit(the_repository, >list[i]);
  
  		if (commit && !parse_commit(commit))


This is the important count, since we will be parsing commits and adding 
their parents to the list. The bulk of the work happens here.



@@ -611,19 +623,25 @@ static void close_reachable(struct packed_oid_list *oids)
}
  
  	for (i = 0; i < oids->nr; i++) {

+   display_progress(progress, ++j);
commit = lookup_commit(the_repository, >list[i]);
This iterates through the commits a second time and removes the 
UNINTERESTING flag.
  
  		if (commit)

commit->object.flags &= ~UNINTERESTING;
}
+   stop_progress();
  }


I think it is good to have the progress start before the first loop and 
end after the third loop, but the middle loop has the important count.


I tried deleting the first and third display_progress() methods and 
re-ran the process on the Linux repo and did not notice a delay at the 
0% and 100% progress spots. The count matches the number of commits.


Thanks,

-Stolee



Re: [PATCH 1/2] commit-graph write: add progress output

2018-09-05 Thread Derrick Stolee

On 9/4/2018 6:07 PM, Junio C Hamano wrote:

Ævar Arnfjörð Bjarmason   writes:


With --stdin-packs we don't show any estimation of how much is left to
do. This is because we might be processing more than one pack. We
could be less lazy here and show progress, either detect by detecting
that we're only processing one pack, or by first looping over the
packs to discover how many commits they have. I don't see the point in

I do not know if there is no point, but if we were to do it, I think
slurping the list of packs and computing the number of objects is
not all that bad.


If you want to do that, I have nothing against it. However, I don't 
expect users to use that option directly. That option is used by VFS for 
Git to compute the commit-graph in the background after receiving a pack 
of commits and trees, but not by 'git gc' which I expect is how most 
users will compute commit-graphs.



  static void compute_generation_numbers(struct packed_commit_list* commits)
  {
int i;
struct commit_list *list = NULL;
+   struct progress *progress = NULL;
  
+	progress = start_progress(

+   _("Computing commit graph generation numbers"), commits->nr);
for (i = 0; i < commits->nr; i++) {
+   display_progress(progress, i);
if (commits->list[i]->generation != GENERATION_NUMBER_INFINITY 
&&
commits->list[i]->generation != GENERATION_NUMBER_ZERO)
continue;

I am wondering if the progress call should be moved after this
conditional continue; would we want to count the entry whose
generation is already known here?  Of course, as we give commits->nr
as the 100% ceiling, we cannot avoid doing so, but it somehow smells
wrong.


If we wanted to be completely right, we would count the commits in the 
list that do not have a generation number and report that as the 100% 
ceiling.


Something like the diff below would work. I tested it in Linux by first 
deleting my commit-graph and running the following:


stolee@stolee-linux:~/linux$ rm .git/objects/info/commit-graph
stolee@stolee-linux:~/linux$ git rev-parse v4.6 | ~/git/git commit-graph 
write --stdin-commits

Annotating commits in commit graph: 1180333, done.
Computing commit graph generation numbers: 100% (590166/590166), done.
stolee@stolee-linux:~/linux$ ~/git/git commit-graph write --reachable
Annotating commits in commit graph: 1564087, done.
Computing commit graph generation numbers: 100% (191590/191590), done.

-->8--

From: Derrick Stolee 
Date: Wed, 5 Sep 2018 11:55:42 +
Subject: [PATCH] fixup! commit-graph write: add progress output

Signed-off-by: Derrick Stolee 
---
 commit-graph.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 1a02fe019a..b933bc9f00 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -634,14 +634,20 @@ static void close_reachable(struct packed_oid_list 
*oids)


 static void compute_generation_numbers(struct packed_commit_list* commits)
 {
-   int i;
+   int i, count_uncomputed = 0;
    struct commit_list *list = NULL;
    struct progress *progress = NULL;

+   for (i = 0; i < commits->nr; i++)
+   if (commits->list[i]->generation == 
GENERATION_NUMBER_INFINITY ||

+   commits->list[i]->generation == GENERATION_NUMBER_ZERO)
+   count_uncomputed++;
+
    progress = start_progress(
-   _("Computing commit graph generation numbers"), 
commits->nr);
+   _("Computing commit graph generation numbers"), 
count_uncomputed);

+   count_uncomputed = 0;
+
    for (i = 0; i < commits->nr; i++) {
-   display_progress(progress, i);
    if (commits->list[i]->generation != 
GENERATION_NUMBER_INFINITY &&

    commits->list[i]->generation != GENERATION_NUMBER_ZERO)
    continue;
@@ -670,10 +676,11 @@ static void compute_generation_numbers(struct 
packed_commit_list* commits)


    if (current->generation > 
GENERATION_NUMBER_MAX)
    current->generation = 
GENERATION_NUMBER_MAX;

+
+   display_progress(progress, 
++count_uncomputed);

    }
    }
    }
-   display_progress(progress, i);
    stop_progress();
 }

--
2.19.0.rc2



Re: [PATCH v2 0/1] Define GIT_TEST_COMMIT_GRAPH for commit-graph test coverage

2018-09-04 Thread Derrick Stolee

On 9/4/2018 12:49 PM, Duy Nguyen wrote:

On Wed, Aug 29, 2018 at 2:49 PM Derrick Stolee via GitGitGadget
 wrote:

The commit-graph (and multi-pack-index) features are optional data
structures that can make Git operations faster. Since they are optional, we
do not enable them in most Git tests. The commit-graph is tested in
t5318-commit-graph.sh (and t6600-test-reach.sh in ds/reachable), but that
one script cannot cover the data shapes present in the rest of the test
suite.

This patch introduces a new test environment variable, GIT_TEST_COMMIT_GRAPH
. Similar to GIT_TEST_SPLIT_INDEX, it enables the commit-graph and writes it
with every git commit command. Thanks, Duy, for pointing out this direction
[1].

Any reason to not add this new flag in ci/run-build-and-tests.sh
(which is used by Travis)? I see your note about VSTS but I don't see
why it has to be exclusive to VSTS.


I only wanted to volunteer resources that I know are available. I am 
looking into an additional test run in Travis CI builds, but didn't want 
to presume the extra resources would be available.




Re: [PATCH v4 12/12] t5318: use test_oid for HASH_LEN

2018-09-04 Thread Derrick Stolee

On 9/3/2018 7:25 PM, brian m. carlson wrote:

From: Derrick Stolee 

Signed-off-by: Derrick Stolee 
Signed-off-by: brian m. carlson 
---
  t/t5318-commit-graph.sh | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 3c1ffad491..d286516f0e 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -8,7 +8,8 @@ test_expect_success 'setup full repo' '
cd "$TRASH_DIRECTORY/full" &&
git init &&
git config core.commitGraph true &&
-   objdir=".git/objects"
+   objdir=".git/objects" &&
+   test_oid_init
  '


Ah yes, thanks for adding this.

  
  test_expect_success 'verify graph with no graph file' '

@@ -273,7 +274,7 @@ test_expect_success 'git commit-graph verify' '
  
  NUM_COMMITS=9

  NUM_OCTOPUS_EDGES=2
-HASH_LEN=20
+HASH_LEN="$(test_oid rawsz)"
  GRAPH_BYTE_VERSION=4
  GRAPH_BYTE_HASH=5
  GRAPH_BYTE_CHUNK_COUNT=6


Re: [PATCH 1/8] trace2: create new combined trace facility

2018-08-31 Thread Derrick Stolee

On 8/31/2018 12:49 PM, Jeff Hostetler via GitGitGadget wrote:

+   if (tr2key_trace_want(_event)) {
+   struct tr2tls_thread_ctx *ctx = tr2tls_get_self();
+   if (ctx->nr_open_regions <= tr2env_event_depth_wanted) {


This should also compare to TR2_MAX_REGION_NESTING to avoid memory 
bounds issues. It may even be worth sending a message that the depth 
went beyond the limits.


Thanks,

-Stolee



Re: [PATCH 0/8] WIP: trace2: a new trace facility

2018-08-31 Thread Derrick Stolee

On 8/31/2018 12:49 PM, Jeff Hostetler via GitGitGadget wrote:

This patch series contains a new trace2 facility that hopefully addresses
the recent trace- and structured-logging-related discussions. The intent is
to eventually replace the existing trace_ routines (or to route them to the
new trace2_ routines) as time permits.


I haven't been part of the recent discussions on these logging efforts, 
but I wanted to mention that I'm excited about the potential here. I 
want to use this new tracing model to trace how many commits are walked 
by graph algorithms like paint_down_to_common().


I'm playing with some efforts here, and found one issue when the API is 
used incorrectly (I'll respond to the patch with the issue).


Thanks,

-Stolee



[PATCH 1/1] commit: don't use generation numbers if not needed

2018-08-30 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

In 3afc679b "commit: use generations in paint_down_to_common()",
the queue in paint_down_to_common() was changed to use a priority
order based on generation number before commit date. This served
two purposes:

 1. When generation numbers are present, the walk guarantees
correct topological relationships, regardless of clock skew in
commit dates.

 2. It enables short-circuiting the walk when the min_generation
parameter is added in d7c1ec3e "commit: add short-circuit to
paint_down_to_common()". This short-circuit helps commands
like 'git branch --contains' from needing to walk to a merge
base when we know the result is false.

The commit message for 3afc679b includes the following sentence:

This change does not affect the number of commits that are
walked during the execution of paint_down_to_common(), only
the order that those commits are inspected.

This statement is incorrect. Because it changes the order in which
the commits are inspected, it changes the order they are added to
the queue, and hence can change the number of loops before the
queue_has_nonstale() method returns true.

This change makes a concrete difference depending on the topology
of the commit graph. For instance, computing the merge-base between
consecutive versions of the Linux kernel has no effect for versions
after v4.9, but 'git merge-base v4.8 v4.9' presents a performance
regression:

v2.18.0: 0.122s
v2.19.0-rc1: 0.547s
   HEAD: 0.127s

To determine that this was simply an ordering issue, I inserted
a counter within the while loop of paint_down_to_common() and
found that the loop runs 167,468 times in v2.18.0 and 635,579
times in v2.19.0-rc1.

The topology of this case can be described in a simplified way
here:

  v4.9
   |  \
   |   \
  v4.8  \
   | \   \
   |  \   |
  ...  A  B
   |  /  /
   | /  /
   |/__/
   C

Here, the "..." means "a very long line of commits". By generation
number, A and B have generation one more than C. However, A and B
have commit date higher than most of the commits reachable from
v4.8. When the walk reaches v4.8, we realize that it has PARENT1
and PARENT2 flags, so everything it can reach is marked as STALE,
including A. B has only the PARENT1 flag, so is not STALE.

When paint_down_to_common() is run using
compare_commits_by_commit_date, A and B are removed from the queue
early and C is inserted into the queue. At this point, C and the
rest of the queue entries are marked as STALE. The loop then
terminates.

When paint_down_to_common() is run using
compare_commits_by_gen_then_commit_date, B is removed from the
queue only after the many commits reachable from v4.8 are explored.
This causes the loop to run longer. The reason for this regression
is simple: the queue order is intended to not explore a commit
until everything that _could_ reach that commit is explored. From
the information gathered by the original ordering, we have no
guarantee that there is not a commit D reachable from v4.8 that
can also reach B. We gained absolute correctness in exchange for
a performance regression.

The performance regression is probably the worse option, since
these incorrect results in paint_down_to_common() are rare. The
topology required for the performance regression are less rare,
but still require multiple merge commits where the parents differ
greatly in generation number. In our example above, the commit A
is as important as the commit B to demonstrate the problem, since
otherwise the commit C will sit in the queue as non-stale just as
long in both orders.

The solution provided uses the min_generation parameter to decide
if we should use generation numbers in our ordering. When
min_generation is equal to zero, it means that the caller has no
known cutoff for the walk, so we should rely on our commit-date
heuristic as before; this is the case with merge_bases_many().
When min_generation is non-zero, then the caller knows a valuable
cutoff for the short-circuit mechanism; this is the case with
remove_redundant() and in_merge_bases_many().

Signed-off-by: Derrick Stolee 
---
 commit.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 1a6e632185..449c1f4920 100644
--- a/commit.c
+++ b/commit.c
@@ -874,6 +874,9 @@ static struct commit_list *paint_down_to_common(struct 
commit *one, int n,
int i;
uint32_t last_gen = GENERATION_NUMBER_INFINITY;
 
+   if (!min_generation)
+   queue.compare = compare_commits_by_commit_date;
+
one->object.flags |= PARENT1;
if (!n) {
commit_list_append(one, );
@@ -891,7 +894,7 @@ static struct commit_list *paint_down_to_common(struct 
commit *one, int n,
struct commit_list *parents;
int flags;
 
-   if (commit->generation > last_gen)
+   if (min_generation && commit->generation > last_

[PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base'

2018-08-30 Thread Derrick Stolee via GitGitGadget
As I was testing the release candidate, I stumbled across a regression in
'git merge-base' as a result of the switch to generation numbers. The commit
message in [PATCH 1/1] describes the topology involved, but you can test it
yourself by comparing 'git merge-base v4.8 v4.9' in the Linux kernel. The
regression does not show up when running merge-base for tags at least v4.9,
which is why I didn't see it when I was testing earlier.

The solution is simple, but also will conflict with ds/reachable in next. I
can send a similar patch that applies the same diff into commit-reach.c.

With the integration of generation numbers into most commit walks coming to
a close [1], it will be time to re-investigate other options for
reachability indexes [2]. As I was digging into the issue with this
regression, I discovered a way we can modify our generation numbers and pair
them with commit dates to give us a simple-to-compute, immutable
two-dimensional reachability index that would be immune to this regression.
I will investigate that more and report back, but it is more important to
fix this regression now.

Thanks, -Stolee

[1] https://public-inbox.org/git/pull.25.git.gitgitgad...@gmail.com/[PATCH
0/6] Use generation numbers for --topo-order

[2] https://public-inbox.org/git/86muxcuyod@gmail.com/[RFC] Other chunks
for commit-graph, part 2 - reachability indexes

Cc: gitster@pobox.comCc: p...@peff.net

Derrick Stolee (1):
  commit: don't use generation numbers if not needed

 commit.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)


base-commit: 2f743933341f27603550fbf383a34dfcfd38
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-28%2Fderrickstolee%2Fmerge-base-regression-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-28/derrickstolee/merge-base-regression-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/28
-- 
gitgitgadget


Re: Questions about the hash function transition

2018-08-29 Thread Derrick Stolee

On 8/29/2018 9:27 AM, Derrick Stolee wrote:

On 8/29/2018 9:09 AM, Johannes Schindelin wrote:


What I meant was to leverage the midx code, not the .midx files.

My comment was motivated by my realizing that both the SHA-1 <-> SHA-256
mapping and the MIDX code have to look up (in a *fast* way) information
with hash values as keys. *And* this information is immutable. *And* the
amount of information should grow with new objects being added to the
database.


I'm unsure what this means, as the multi-pack-index simply uses 
bsearch_hash() to find hashes in the list. The same method is used for 
IDX lookups.


I talked with Johannes privately, and we found differences in our 
understanding of the current multi-pack-index feature. Johannes thought 
the feature was farther along than it is, specifically related to how 
much we value the data in the multi-pack-index when adding objects to 
pack-files or repacking. Some of this misunderstanding is due to how the 
equivalent feature works in VSTS (where there is no IDX-file equivalent, 
every object in the repo is tracked by a multi-pack-index).


I'd like to point out a few things about how the multi-pack-index works 
now, and how we hope to extend it in the future.


Currently:

1. Objects are added to the multi-pack-index by adding a new set of 
.idx/.pack file pairs. We scan the .idx file for the objects and offsets 
to add.


2. We re-use the information in the multi-pack-index only to write the 
new one without re-reading the .pack files that are already covered.


3. If a 'git repack' command deletes a pack-file, then we delete the 
multi-pack-index. It must be regenerated by 'git multi-pack-index write' 
later.


In the current world, the multi-pack-index is completely secondary to 
the .idx files.


In the future, I hope these features exist in the multi-pack-index:

1. A stable object order. As objects are added to the multi-pack-index, 
we assign a distinct integer value to each. As we add objects, those 
integers values do not change. We can then pair the reachability bitmap 
to the multi-pack-index instead of a specific pack-file (allowing repack 
and bitmap computations to happen asynchronously). The data required to 
store this object order is very similar to storing the bijection between 
SHA-1 and SHA-256 hashes.


2. Incremental multi-pack-index: Currently, we have only one 
multi-pack-index file per object directory. We can use a mechanism 
similar to the split-index to keep a small number of multi-pack-index 
files (at most 3, probably) such that the 
'.git/objects/pack/multi-pack-index' file is small and easy to rewrite, 
while it refers to larger '.git/objects/pack/*.midx' files that change 
infrequently.


3. Multi-pack-index-aware repack: The repacker only knows about the 
multi-pack-index enough to delete it. We could instead directly 
manipulate the multi-pack-index during repack, and we could decide to do 
more incremental repacks based on data stored in the multi-pack-index.


In conclusion: please keep the multi-pack-index in mind as we implement 
the transition plan. I'll continue building the feature as planned (the 
next thing to do after the current series of cleanups is 'git 
multi-pack-index verify') but am happy to look into other applications 
as we need it.


Thanks,

-Stolee



Re: Questions about the hash function transition

2018-08-29 Thread Derrick Stolee

On 8/29/2018 9:09 AM, Johannes Schindelin wrote:

Hi Jonathan,

On Tue, 28 Aug 2018, Jonathan Nieder wrote:


Johannes Schindelin wrote:

On Thu, 23 Aug 2018, Jonathan Nieder wrote:

Ævar Arnfjörð Bjarmason wrote:

Are we going to need a midx version of these mapping files? How does
midx fit into this picture? Perhaps it's too obscure to worry
about...

That's a great question!  I think the simplest answer is to have a
midx only for the primary object format and fall back to using
ordinary idx files for the others.

The midx format already has a field for hash function (thanks,
Derrick!).

Related: I wondered whether we could simply leverage the midx code for
the bidirectional SHA-1 <-> SHA-256 mapping, as it strikes me as very
similar in concept and challenges.

Interesting: tell me more.

My first instinct is to prefer the idx-based design that is already
described in the design doc.  If we want to change that, we should
have a motivating reason.

Midx is designed to be optional and to not necessarily cover all
objects, so it doesn't seem like a good fit.


It is optional, but shouldn't this mode where a Git repo that needs to 
know about two different versions of all files be optional? Or at least 
temporary?


The multi-pack-index is intended to cover all packed objects, so covers 
the same number of objects as an IDX-based strategy. If we are 
rebuilding the repo from scratch by translating the hashes, then "being 
too big to repack" is probably not a problem, so we would expect a 
single IDX file anyway.


In my opinion, whatever we do for the IDX-based approach will need to be 
duplicated in the multi-pack-index. The multi-pack-index does have a 
natural mechanism (optional chunks) for inserting this data without 
incrementing the version number.



Right.

What I meant was to leverage the midx code, not the .midx files.

My comment was motivated by my realizing that both the SHA-1 <-> SHA-256
mapping and the MIDX code have to look up (in a *fast* way) information
with hash values as keys. *And* this information is immutable. *And* the
amount of information should grow with new objects being added to the
database.


I'm unsure what this means, as the multi-pack-index simply uses 
bsearch_hash() to find hashes in the list. The same method is used for 
IDX lookups.



I know that Stolee performed a bit of performance testing regarding
different data structures to use in MIDX. We could benefit from that
testing by using not only the results from those tests, but also the code.


I did test ways to use something other than bsearch_hash(), such as 
using a 65,536-entry fanout table for lookups using the first two bytes 
of a hash (tl;dr: it speeds things up a bit, but the super-small 
improvement is probably not worth the space and complexity). I've also 
toyed with the idea of using interpolation search inside bsearch_hash(), 
but I haven't had time to do that.



IIRC one of the insights was that packs are a natural structure that
can be used for the MIDX mapping, too (you could, for example, store the
SHA-1 <-> SHA-256 mapping *only* for objects inside packs, and re-generate
them on the fly for loose objects all the time).

Stolee can speak with much more competence and confidence about this,
though, whereas all of what I said above is me waving my hands quite
frantically.


I understand the hesitation to pair such an important feature (hash 
transition) to a feature that hasn't even shipped. We will need to see 
how things progress on both fronts to see how mature the 
multi-pack-index is when we need this transition table.


Thanks,

-Stolee



Re: [RFC PATCH 09/12] Add a base implementation of SHA-256 support

2018-08-29 Thread Derrick Stolee

On 8/28/2018 8:58 PM, brian m. carlson wrote:

SHA-256 is somewhat slower to compute than SHA-1 in software.  However,
since our default SHA-1 implementation is collision-detecting, a
reasonable cryptographic library implementation of SHA-256 will actually
be faster than SHA-256.


Nit: do you mean "faster than SHA-1DC"?



Re: [RFC PATCH 08/12] commit-graph: convert to using the_hash_algo

2018-08-29 Thread Derrick Stolee

On 8/28/2018 8:58 PM, brian m. carlson wrote:

Instead of using hard-coded constants for object sizes, use
the_hash_algo to look them up.  In addition, use a function call to look
up the object ID version and produce the correct value.


The C code in this patch looks good to me. The only issue is that I 
predict failure in the 'git commit-graph verify' tests in 
t5318-commit-graph.sh. Squashing in this commit should help (assuming 
that test_oid works, it doesn't at my current branch):


-->8--

Subject: [PATCH] t5318-commit-graph.sh: use test_oid for HASH_LEN

Signed-off-by: Derrick Stolee 
---
 t/t5318-commit-graph.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 6aee861f78..676c1a9ae0 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -333,7 +333,7 @@ test_expect_success 'git commit-graph verify' '

 NUM_COMMITS=9
 NUM_OCTOPUS_EDGES=2
-HASH_LEN=20
+HASH_LEN="$(test_oid rawsz)"
 GRAPH_BYTE_VERSION=4
 GRAPH_BYTE_HASH=5
 GRAPH_BYTE_CHUNK_COUNT=6
--
2.19.0.rc1


[PATCH v2 1/1] commit-graph: define GIT_TEST_COMMIT_GRAPH

2018-08-29 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The commit-graph feature is tested in isolation by
t5318-commit-graph.sh and t6600-test-reach.sh, but there are many
more interesting scenarios involving commit walks. Many of these
scenarios are covered by the existing test suite, but we need to
maintain coverage when the optional commit-graph structure is not
present.

To allow running the full test suite with the commit-graph present,
add a new test environment variable, GIT_TEST_COMMIT_GRAPH. Similar
to GIT_TEST_SPLIT_INDEX, this variable makes every Git command try
to load the commit-graph when parsing commits, and writes the
commit-graph file after every 'git commit' command.

There are a few tests that rely on commits not existing in
pack-files to trigger important events, so manually set
GIT_TEST_COMMIT_GRAPH to false for the necessary commands.

There is one test in t6024-recursive-merge.sh that relies on the
merge-base algorithm picking one of two ambiguous merge-bases, and
the commit-graph feature changes which merge-base is picked.

Helped-by: Eric Sunshine 
Signed-off-by: Derrick Stolee 
---
 builtin/commit.c| 4 
 commit-graph.c  | 5 +++--
 commit-graph.h  | 2 ++
 t/README| 4 
 t/t0410-partial-clone.sh| 2 +-
 t/t5307-pack-missing-commit.sh  | 4 ++--
 t/t6011-rev-list-with-bad-commit.sh | 7 +++
 t/t6024-recursive-merge.sh  | 6 +++---
 8 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/builtin/commit.c b/builtin/commit.c
index 158e3f843a..a25e652102 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -33,6 +33,7 @@
 #include "sequencer.h"
 #include "mailmap.h"
 #include "help.h"
+#include "commit-graph.h"
 
 static const char * const builtin_commit_usage[] = {
N_("git commit [] [--] ..."),
@@ -1651,6 +1652,9 @@ int cmd_commit(int argc, const char **argv, const char 
*prefix)
 "new_index file. Check that disk is not full and quota 
is\n"
 "not exceeded, and then \"git reset HEAD\" to recover."));
 
+   if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0))
+   write_commit_graph_reachable(get_object_directory(), 0);
+
rerere(0);
run_command_v_opt(argv_gc_auto, RUN_GIT_CMD);
run_commit_hook(use_editor, get_index_file(), "post-commit", NULL);
diff --git a/commit-graph.c b/commit-graph.c
index 4bd1a4abbf..5cae83fe03 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -237,8 +237,9 @@ static int prepare_commit_graph(struct repository *r)
return !!r->objects->commit_graph;
r->objects->commit_graph_attempted = 1;
 
-   if (repo_config_get_bool(r, "core.commitgraph", _value) ||
-   !config_value)
+   if (!git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
+   (repo_config_get_bool(r, "core.commitgraph", _value) ||
+   !config_value))
/*
 * This repository is not configured to use commit graphs, so
 * do not load one. (But report commit_graph_attempted anyway
diff --git a/commit-graph.h b/commit-graph.h
index 13d736cdde..cf9141f356 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -5,6 +5,8 @@
 #include "repository.h"
 #include "string-list.h"
 
+#define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
+
 struct commit;
 
 char *get_commit_graph_filename(const char *obj_dir);
diff --git a/t/README b/t/README
index 8373a27fea..2b90c433f5 100644
--- a/t/README
+++ b/t/README
@@ -315,6 +315,10 @@ packs on demand. This normally only happens when the 
object size is
 over 2GB. This variable forces the code path on any object larger than
  bytes.
 
+GIT_TEST_COMMIT_GRAPH=, when true, forces the commit-graph to
+be written after every 'git commit' command, and overrides the
+'core.commitGraph' setting to true.
+
 Naming Tests
 
 
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 4984ca583d..73d5284a91 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -181,7 +181,7 @@ test_expect_success 'rev-list stops traversal at missing 
and promised commit' '
 
git -C repo config core.repositoryformatversion 1 &&
git -C repo config extensions.partialclone "arbitrary string" &&
-   git -C repo rev-list --exclude-promisor-objects --objects bar >out &&
+   GIT_TEST_COMMIT_GRAPH=0 git -C repo rev-list --exclude-promisor-objects 
--objects bar >out &&
grep $(git -C repo rev-parse bar) out &&
! grep $FOO out
 '
diff --git a/t/t5307-pack-missing-commit.sh b/t/t5307-pack-missing-commit.sh
index ae52a1882d..dacb440b27 100755
--- a/t/t5307-pack-missing-commit.sh
+++ b/t/t5307-pack-missing-commit.sh
@@ -24,11 +24,11 @@ test_expect_success 'check 

[PATCH v2 0/1] Define GIT_TEST_COMMIT_GRAPH for commit-graph test coverage

2018-08-29 Thread Derrick Stolee via GitGitGadget
The commit-graph (and multi-pack-index) features are optional data
structures that can make Git operations faster. Since they are optional, we
do not enable them in most Git tests. The commit-graph is tested in
t5318-commit-graph.sh (and t6600-test-reach.sh in ds/reachable), but that
one script cannot cover the data shapes present in the rest of the test
suite.

This patch introduces a new test environment variable, GIT_TEST_COMMIT_GRAPH
. Similar to GIT_TEST_SPLIT_INDEX, it enables the commit-graph and writes it
with every git commit command. Thanks, Duy, for pointing out this direction
[1].

A few tests needed to be modified. These are the same tests that were
mentioned in my previous example patch [2]. Thanks, Eric, for providing the
correct way to override the settings [3].

When this merges down, I'll create a CI build in VSTS that runs the test
suite with this option enabled.

Thanks, -Stolee

[1] 
https://public-inbox.org/git/CACsJy8CKnXVJYkKM_W=N=Vq-TVXf+YCqZP_uP7B-dN_6xddB=g...@mail.gmail.com/
Re: [PATCH 0/9] multi-pack-index cleanups (Discussing test environment
variables)

[2] 
https://public-inbox.org/git/20180718152244.45513-1-dsto...@microsoft.com/
[PATCH] DO-NOT-MERGE: write and read commit-graph always

[3] 
https://public-inbox.org/git/CAPig+cSjanDi=jv75pdzypajwvgd4suh3uyvy+vy7yehauy...@mail.gmail.com/

Based-On: ds/commit-graph-with-grafts Cc: jnareb@gmail.comCc: 
sbeller@google.comCc: sunsh...@sunshineco.com

Derrick Stolee (1):
  commit-graph: define GIT_TEST_COMMIT_GRAPH

 builtin/commit.c| 4 
 commit-graph.c  | 5 +++--
 commit-graph.h  | 2 ++
 t/README| 4 
 t/t0410-partial-clone.sh| 2 +-
 t/t5307-pack-missing-commit.sh  | 4 ++--
 t/t6011-rev-list-with-bad-commit.sh | 7 +++
 t/t6024-recursive-merge.sh  | 6 +++---
 8 files changed, 22 insertions(+), 12 deletions(-)


base-commit: 829a321569d8e8f2c582aef9f0c990df976ab842
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-26%2Fderrickstolee%2Fshallow%2Ftest-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-26/derrickstolee/shallow/test-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/26

Range-diff vs v1:

 1:  85d02ac8d8 ! 1:  4ff6695c7e commit-graph: define GIT_TEST_COMMIT_GRAPH
 @@ -23,6 +23,7 @@
  merge-base algorithm picking one of two ambiguous merge-bases, and
  the commit-graph feature changes which merge-base is picked.
  
 +Helped-by: Eric Sunshine 
  Signed-off-by: Derrick Stolee 
  
  diff --git a/builtin/commit.c b/builtin/commit.c
 @@ -112,18 +113,12 @@
   
   test_expect_success 'rev-list notices corruption (1)' '
  - test_must_fail git rev-list HEAD
 -+ (
 -+ GIT_TEST_COMMIT_GRAPH=0 &&
 -+ test_must_fail git rev-list HEAD
 -+ )
 ++ test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git rev-list HEAD
   '
   
   test_expect_success 'rev-list notices corruption (2)' '
  - test_must_fail git rev-list --objects HEAD
 -+ (
 -+ GIT_TEST_COMMIT_GRAPH=0 &&
 -+ test_must_fail git rev-list --objects HEAD
 -+ )
 ++ test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git rev-list --objects HEAD
   '
   
   test_expect_success 'pack-objects notices corruption' '
 @@ -140,10 +135,7 @@
  -   test_must_fail git rev-list --all > /dev/null
  -   '
  +test_expect_success 'rev-list should fail' '
 -+ (
 -+ GIT_TEST_COMMIT_GRAPH=0 &&
 -+ test_must_fail git rev-list --all > /dev/null
 -+ )
 ++ test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git rev-list --all > 
/dev/null
  +'
   
   test_expect_success 'git repack _MUST_ fail' \
 @@ -160,10 +152,7 @@
  - test_must_fail git merge -m final G
  -"
  +test_expect_success 'combined merge conflicts' '
 -+ (
 -+ GIT_TEST_COMMIT_GRAPH=0 &&
 -+ test_must_fail git merge -m final G
 -+ )
 ++ test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G
  +'
   
   cat > expect << EOF

-- 
gitgitgadget


Re: [PATCH v3 01/11] t: add tool to translate hash-related values

2018-08-29 Thread Derrick Stolee

On 8/28/2018 8:56 PM, brian m. carlson wrote:

+   rawsz="$(test_oid rawsz)" &&
+   hexsz="$(test_oid hexsz)" &&


These are neat helpers! The 'git commit-graph verify' tests in 
t5318-commit-graph.sh should automatically work if we use these for 
HASH_LEN instead of 20. I'll use a similar pattern when writing 'git 
multi-pack-index verify'.


Thanks,

-Stolee



Re: [PATCH 1/1] commit-graph: define GIT_TEST_COMMIT_GRAPH

2018-08-29 Thread Derrick Stolee

On 8/28/2018 5:59 PM, Eric Sunshine wrote:

On Tue, Aug 28, 2018 at 5:31 PM Derrick Stolee  wrote:

On 8/28/2018 4:41 PM, Stefan Beller wrote:

On Tue, Aug 28, 2018 at 1:33 PM Derrick Stolee via GitGitGadget
 wrote:

+   GIT_TEST_COMMIT_GRAPH=0 &&
+   test_must_fail git merge -m final G

This could go on the same line without the && in between, setting the
variable as a prefix.

It cannot! The Linux build I ran complained that you can't put
environment variables through test_must_fail.

Is GIT_TEST_COMMIT_GRAPH exported? If not, it won't have an impact on
git-merge anyhow.


In my testing this changed the behavior from fail to pass when passing 
GIT_TEST_COMMIT_GRAPH=1 from the command.




As for the special case of one-shot environment variable and
test_must_fail(), you'll find "env" used as a workaround in a number
of tests:

 test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge ... &&


Thanks for this! This is clearly the better solution.

-Stolee



Re: [PATCH] commit-reach: correct accidental #include of C file

2018-08-28 Thread Derrick Stolee

On 8/28/2018 5:36 PM, Jonathan Nieder wrote:

Without this change, the build breaks with clang:
For some reason, it didn't fail with GCC for me, but this is an 
obviously correct change to make. Thanks!

  libgit/ref-filter.pic.o: multiple definition of 'filter_refs'
  libgit/commit-reach.pic.o: previous definition here

Signed-off-by: Jonathan Nieder 
---
Jonathan Nieder wrote:

Derrick Stolee wrote:

--- a/commit-reach.c
+++ b/commit-reach.c
@@ -1,8 +1,10 @@
  #include "cache.h"
  #include "commit.h"
+#include "commit-graph.h"
  #include "decorate.h"
  #include "prio-queue.h"
  #include "tree.h"
+#include "ref-filter.c"

Did you mean "ref-filter.h"?

This broke the build here.  Is there some check that we can use to
prevent it happening again?  I don't think we ever intentionally
#include a .c file.

Here's what I'm applying locally.

Thanks,
Jonathan

  commit-reach.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/commit-reach.c b/commit-reach.c
index c996524032..86715c103c 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -4,7 +4,7 @@
  #include "decorate.h"
  #include "prio-queue.h"
  #include "tree.h"
-#include "ref-filter.c"
+#include "ref-filter.h"
  #include "revision.h"
  #include "tag.h"
  #include "commit-reach.h"


Re: [PATCH 0/9] introducing oideq()

2018-08-28 Thread Derrick Stolee

On 8/28/2018 5:21 PM, Jeff King wrote:

On Sun, Aug 26, 2018 at 08:56:21PM +, brian m. carlson wrote:


Due to the simplicity of the current code and our inlining, the compiler
can usually figure this out for now. So I wouldn't expect this patch to
actually improve performance right away. But as that discussion shows,
we are likely to take a performance hit as we move to more runtime
determination of the_hash_algo parameters. Having these callers use the
more strict form will potentially help us recover that.

So in that sense we _could_ simply punt on this series until then (and
it's certainly post-v2.19 material). But I think it's worth doing now,
simply from a readability/annotation standpoint. IMHO the resulting code
is more clear (though I've long since taught myself to read !foocmp() as
equality).

I would quite like to see this series picked up for v2.20.  If we want
to minimize performance regressions with the SHA-256 work, I think it's
important.

Thanks. One of the things I was worried about was causing unnecessary
conflicts with existing topics, including your work. But if everybody is
on board, I'd be happy to see this go in early in the next release cycle
(the longer we wait, the more annoying conflicts Junio has to resolve).


I'm happy to take this change whenever. In my opinion, right after 
v2.19.0 is cut would be a great time to merge it into master.


This v2 is good.

Reviewed-by: Derrick Stolee 



Re: [PATCH v2 04/18] commit-reach: move commit_contains from ref-filter

2018-08-28 Thread Derrick Stolee

On 8/28/2018 5:24 PM, Jonathan Nieder wrote:

Hi,

Derrick Stolee wrote:


There are several commit walks in the codebase. Group them together into
a new commit-reach.c file and corresponding header. After we group these
walks into one place, we can reduce duplicate logic by calling
equivalent methods.

All methods are direct moves, except we also make the commit_contains()
method public so its consumers in ref-filter.c can still call it. We can
also test this method in a test-tool in a later commit.

Signed-off-by: Derrick Stolee 
---
  commit-reach.c | 121 +
  commit-reach.h |  20 ++-
  ref-filter.c   | 145 +++--
  3 files changed, 147 insertions(+), 139 deletions(-)

diff --git a/commit-reach.c b/commit-reach.c
index a6bc4781a6..01d796f011 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -1,8 +1,10 @@
  #include "cache.h"
  #include "commit.h"
+#include "commit-graph.h"
  #include "decorate.h"
  #include "prio-queue.h"
  #include "tree.h"
+#include "ref-filter.c"

Did you mean "ref-filter.h"?

This broke the build here.  Is there some check that we can use to
prevent it happening again?  I don't think we ever intentionally
#include a .c file.

Woah! How did that ever work? I definitely built this locally.


Re: [PATCH 0/1] Define GIT_TEST_COMMIT_GRAPH for commit-graph test coverage

2018-08-28 Thread Derrick Stolee

On 8/28/2018 4:37 PM, Stefan Beller wrote:

On Tue, Aug 28, 2018 at 1:33 PM Derrick Stolee via GitGitGadget
 wrote:

The commit-graph (and multi-pack-index) features are optional data
structures that can make Git operations faster. Since they are optional, we
do not enable them in most Git tests. The commit-graph is tested in
t5318-commit-graph.sh (and t6600-test-reach.sh in ds/reachable), but that
one script cannot cover the data shapes present in the rest of the test
suite.

This patch introduces a new test environment variable, GIT_TEST_COMMIT_GRAPH
. Similar to GIT_TEST_SPLIT_INDEX, it enables the commit-graph and writes it
with every git commit command.
Thanks, Duy, for pointing out this direction

Did you mean to cc Duy (instead of me)?
(I'll happily review the patch, too... just asking)


I just added you because you've been on a lot of commit-graph things 
lately. But yes, I forgot to add Duy to the CC list. Added to this message.


Thanks,

-Stolee



Re: [PATCH 1/1] commit-graph: define GIT_TEST_COMMIT_GRAPH

2018-08-28 Thread Derrick Stolee

On 8/28/2018 4:41 PM, Stefan Beller wrote:

On Tue, Aug 28, 2018 at 1:33 PM Derrick Stolee via GitGitGadget
 wrote:

From: Derrick Stolee 

The commit-graph feature is tested in isolation by
t5318-commit-graph.sh and t6600-test-reach.sh, but there are many
more interesting scenarios involving commit walks. Many of these
scenarios are covered by the existing test suite, but we need to
maintain coverage when the optional commit-graph structure is not
present.

To allow running the full test suite with the commit-graph present,
add a new test environment variable, GIT_TEST_COMMIT_GRAPH. Similar
to GIT_TEST_SPLIT_INDEX, this variable makes every Git command try
to load the commit-graph when parsing commits, and writes the
commit-graph file after every 'git commit' command.

There are a few tests that rely on commits not existing in
pack-files to trigger important events, so manually set
GIT_TEST_COMMIT_GRAPH to false for the necessary commands.

So the plan is to turn on the commit graph for the whole test suite
excluding these selected tests?

Excluding these specific _steps_, but yes.



+   GIT_TEST_COMMIT_GRAPH=0 &&
+   test_must_fail git merge -m final G

This could go on the same line without the && in between, setting the
variable as a prefix.


It cannot! The Linux build I ran complained that you can't put 
environment variables through test_must_fail.


-Stolee



[PATCH 1/1] commit-graph: define GIT_TEST_COMMIT_GRAPH

2018-08-28 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The commit-graph feature is tested in isolation by
t5318-commit-graph.sh and t6600-test-reach.sh, but there are many
more interesting scenarios involving commit walks. Many of these
scenarios are covered by the existing test suite, but we need to
maintain coverage when the optional commit-graph structure is not
present.

To allow running the full test suite with the commit-graph present,
add a new test environment variable, GIT_TEST_COMMIT_GRAPH. Similar
to GIT_TEST_SPLIT_INDEX, this variable makes every Git command try
to load the commit-graph when parsing commits, and writes the
commit-graph file after every 'git commit' command.

There are a few tests that rely on commits not existing in
pack-files to trigger important events, so manually set
GIT_TEST_COMMIT_GRAPH to false for the necessary commands.

There is one test in t6024-recursive-merge.sh that relies on the
merge-base algorithm picking one of two ambiguous merge-bases, and
the commit-graph feature changes which merge-base is picked.

Signed-off-by: Derrick Stolee 
---
 builtin/commit.c|  4 
 commit-graph.c  |  5 +++--
 commit-graph.h  |  2 ++
 t/README|  4 
 t/t0410-partial-clone.sh|  2 +-
 t/t5307-pack-missing-commit.sh  | 10 --
 t/t6011-rev-list-with-bad-commit.sh | 10 ++
 t/t6024-recursive-merge.sh  |  9 ++---
 8 files changed, 34 insertions(+), 12 deletions(-)

diff --git a/builtin/commit.c b/builtin/commit.c
index 158e3f843a..a25e652102 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -33,6 +33,7 @@
 #include "sequencer.h"
 #include "mailmap.h"
 #include "help.h"
+#include "commit-graph.h"
 
 static const char * const builtin_commit_usage[] = {
N_("git commit [] [--] ..."),
@@ -1651,6 +1652,9 @@ int cmd_commit(int argc, const char **argv, const char 
*prefix)
 "new_index file. Check that disk is not full and quota 
is\n"
 "not exceeded, and then \"git reset HEAD\" to recover."));
 
+   if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0))
+   write_commit_graph_reachable(get_object_directory(), 0);
+
rerere(0);
run_command_v_opt(argv_gc_auto, RUN_GIT_CMD);
run_commit_hook(use_editor, get_index_file(), "post-commit", NULL);
diff --git a/commit-graph.c b/commit-graph.c
index 4bd1a4abbf..5cae83fe03 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -237,8 +237,9 @@ static int prepare_commit_graph(struct repository *r)
return !!r->objects->commit_graph;
r->objects->commit_graph_attempted = 1;
 
-   if (repo_config_get_bool(r, "core.commitgraph", _value) ||
-   !config_value)
+   if (!git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
+   (repo_config_get_bool(r, "core.commitgraph", _value) ||
+   !config_value))
/*
 * This repository is not configured to use commit graphs, so
 * do not load one. (But report commit_graph_attempted anyway
diff --git a/commit-graph.h b/commit-graph.h
index 13d736cdde..cf9141f356 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -5,6 +5,8 @@
 #include "repository.h"
 #include "string-list.h"
 
+#define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
+
 struct commit;
 
 char *get_commit_graph_filename(const char *obj_dir);
diff --git a/t/README b/t/README
index 8373a27fea..2b90c433f5 100644
--- a/t/README
+++ b/t/README
@@ -315,6 +315,10 @@ packs on demand. This normally only happens when the 
object size is
 over 2GB. This variable forces the code path on any object larger than
  bytes.
 
+GIT_TEST_COMMIT_GRAPH=, when true, forces the commit-graph to
+be written after every 'git commit' command, and overrides the
+'core.commitGraph' setting to true.
+
 Naming Tests
 
 
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 4984ca583d..73d5284a91 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -181,7 +181,7 @@ test_expect_success 'rev-list stops traversal at missing 
and promised commit' '
 
git -C repo config core.repositoryformatversion 1 &&
git -C repo config extensions.partialclone "arbitrary string" &&
-   git -C repo rev-list --exclude-promisor-objects --objects bar >out &&
+   GIT_TEST_COMMIT_GRAPH=0 git -C repo rev-list --exclude-promisor-objects 
--objects bar >out &&
grep $(git -C repo rev-parse bar) out &&
! grep $FOO out
 '
diff --git a/t/t5307-pack-missing-commit.sh b/t/t5307-pack-missing-commit.sh
index ae52a1882d..15df19a040 100755
--- a/t/t5307-pack-missing-commit.sh
+++ b/t/t5307-pack-missing-commit.sh
@@ -24,11 +24,17 @@ test_expect_success 'check co

Re: Git clean removing tracked files semi-regularly

2018-08-28 Thread Derrick Stolee

On 8/28/2018 2:29 PM, Brennan Conroy wrote:

I'm using windows. Have had it happen on 8.1 and 10.

"git status" shows "nothing to commit, working tree clean". I am using git 
version 2.17.1.windows.2


I tested on my machine and could not reproduce this issue. Could you 
find a full repro (first command 'git clone 
https://github.com/aspnet/SignalR') and create an Issue at 
https://github.com/git-for-windows/git/issues ?


Thanks,

-Stolee



Re: Questions about the hash function transition

2018-08-28 Thread Derrick Stolee

On 8/28/2018 8:04 AM, Johannes Schindelin wrote:

Hi,

On Thu, 23 Aug 2018, Jonathan Nieder wrote:


Ævar Arnfjörð Bjarmason wrote:

[...]

Since all operations that make new objects (e.g., "git commit") add
the new objects to the corresponding index, this mapping is possible
for all objects in the object store.

Are we going to need a midx version of these mapping files? How does
midx fit into this picture? Perhaps it's too obscure to worry about...

That's a great question!  I think the simplest answer is to have a
midx only for the primary object format and fall back to using
ordinary idx files for the others.

The midx format already has a field for hash function (thanks,
Derrick!).

Related: I wondered whether we could simply leverage the midx code for the
bidirectional SHA-1 <-> SHA-256 mapping, as it strikes me as very similar
in concept and challenges.


If we would like such a mapping, then I would propose the following:

1. The object store has everything in SHA-256, so the HASH_LEN parameter 
of the multi-pack-index is 32.


2. We create an optional chunk to add to the multi-pack-index that 
stores the SHA-1 for each object. This list would be in lex order.


3. We create two optional chunks that store the bijection between 
SHA-256 and SHA-1: the first is a list of integers i_0, i_1, ..., 
i_{N-1} such that i_k is the position in the SHA-1 list corresponding to 
the kth SHA-256. The second is a list of integers j_0, j_1, ..., j_{N-1} 
such that j_k is the position in the SHA-256 list of the kth SHA-1.


I'm not super-familiar with how the transition plan specifically needs 
this mapping, but it seems like a good place to put it.


Thanks,

-Stolee



[PATCH 1/6] prio-queue: add 'peek' operation

2018-08-27 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When consuming a priority queue, it can be convenient to inspect
the next object that will be dequeued without actually dequeueing
it. Our existing library did not have such a 'peek' operation, so
add it as prio_queue_peek().

Add a reference-level comparison in t/helper/test-prio-queue.c
so this method is exercised by t0009-prio-queue.sh.

Signed-off-by: Derrick Stolee 
---
 prio-queue.c   |  9 +
 prio-queue.h   |  6 ++
 t/helper/test-prio-queue.c | 10 +++---
 3 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/prio-queue.c b/prio-queue.c
index a078451872..d3f488cb05 100644
--- a/prio-queue.c
+++ b/prio-queue.c
@@ -85,3 +85,12 @@ void *prio_queue_get(struct prio_queue *queue)
}
return result;
 }
+
+void *prio_queue_peek(struct prio_queue *queue)
+{
+   if (!queue->nr)
+   return NULL;
+   if (!queue->compare)
+   return queue->array[queue->nr - 1].data;
+   return queue->array[0].data;
+}
diff --git a/prio-queue.h b/prio-queue.h
index d030ec9dd6..682e51867a 100644
--- a/prio-queue.h
+++ b/prio-queue.h
@@ -46,6 +46,12 @@ extern void prio_queue_put(struct prio_queue *, void *thing);
  */
 extern void *prio_queue_get(struct prio_queue *);
 
+/*
+ * Gain access to the "thing" that would be returned by
+ * prio_queue_get, but do not remove it from the queue.
+ */
+extern void *prio_queue_peek(struct prio_queue *);
+
 extern void clear_prio_queue(struct prio_queue *);
 
 /* Reverse the LIFO elements */
diff --git a/t/helper/test-prio-queue.c b/t/helper/test-prio-queue.c
index 9807b649b1..e817bbf464 100644
--- a/t/helper/test-prio-queue.c
+++ b/t/helper/test-prio-queue.c
@@ -22,9 +22,13 @@ int cmd__prio_queue(int argc, const char **argv)
struct prio_queue pq = { intcmp };
 
while (*++argv) {
-   if (!strcmp(*argv, "get"))
-   show(prio_queue_get());
-   else if (!strcmp(*argv, "dump")) {
+   if (!strcmp(*argv, "get")) {
+   void *peek = prio_queue_peek();
+   void *get = prio_queue_get();
+   if (peek != get)
+   BUG("peek and get results do not match");
+   show(get);
+   } else if (!strcmp(*argv, "dump")) {
int *v;
while ((v = prio_queue_get()))
   show(v);
-- 
gitgitgadget



[PATCH 2/6] test-reach: add run_three_modes method

2018-08-27 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The 'test_three_modes' method assumes we are using the 'test-tool
reach' command for our test. However, we may want to use the data
shape of our commit graph and the three modes (no commit-graph,
full commit-graph, partial commit-graph) for other git commands.

Split test_three_modes to be a simple translation on a more general
run_three_modes method that executes the given command and tests
the actual output to the expected output.

While inspecting this code, I realized that the final test for
'commit_contains --tag' is silently dropping the '--tag' argument.
It should be quoted to include both.

Signed-off-by: Derrick Stolee 
---
 t/t6600-test-reach.sh | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index d139a00d1d..1b18e12a4e 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -53,18 +53,22 @@ test_expect_success 'setup' '
git config core.commitGraph true
 '
 
-test_three_modes () {
+run_three_modes () {
test_when_finished rm -rf .git/objects/info/commit-graph &&
-   test-tool reach $1 actual &&
+   $1 actual &&
test_cmp expect actual &&
cp commit-graph-full .git/objects/info/commit-graph &&
-   test-tool reach $1 actual &&
+   $1 actual &&
test_cmp expect actual &&
cp commit-graph-half .git/objects/info/commit-graph &&
-   test-tool reach $1 actual &&
+   $1 actual &&
test_cmp expect actual
 }
 
+test_three_modes () {
+   run_three_modes "test-tool reach $1"
+}
+
 test_expect_success 'ref_newer:miss' '
cat >input <<-\EOF &&
A:commit-5-7
@@ -219,7 +223,7 @@ test_expect_success 'commit_contains:hit' '
EOF
echo "commit_contains(_,A,X,_):1" >expect &&
test_three_modes commit_contains &&
-   test_three_modes commit_contains --tag
+   test_three_modes "commit_contains --tag"
 '
 
 test_expect_success 'commit_contains:miss' '
-- 
gitgitgadget



[PATCH 4/6] revision.c: begin refactoring --topo-order logic

2018-08-27 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():

* revs->limited implies we run limit_list() to walk the entire
  reachable set. There are some short-cuts here, such as if we
  perform a range query like 'git rev-list COMPARE..HEAD' and we
  can stop limit_list() when all queued commits are uninteresting.

* revs->topo_order implies we run sort_in_topological_order(). See
  the implementation of that method in commit.c. It implies that
  the full set of commits to order is in the given commit_list.

These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.

If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.

In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.

When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.

In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:

1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
   struct. We use the presence of this struct as a signal to use the
   new methods during our walk. In this patch, this method simply
   calls limit_list() and sort_in_topological_order(). In the future,
   this method will set up a new data structure to perform that logic
   in-line.

2. next_topo_commit() provides get_revision_1() with the next topo-
   ordered commit in the list. Currently, this simply pops the commit
   from revs->commits.

3. expand_topo_walk() provides get_revision_1() with a way to signal
   walking beyond the latest commit. Currently, this calls
   add_parents_to_list() exactly like the old logic.

While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.

Signed-off-by: Derrick Stolee 
---
 revision.c | 42 ++
 revision.h |  4 
 2 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/revision.c b/revision.c
index 3205a3947a..1db70dc951 100644
--- a/revision.c
+++ b/revision.c
@@ -25,6 +25,7 @@
 #include "worktree.h"
 #include "argv-array.h"
 #include "commit-reach.h"
+#include "commit-graph.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -2451,7 +2452,7 @@ int setup_revisions(int argc, const char **argv, struct 
rev_info *revs, struct s
if (revs->diffopt.objfind)
revs->simplify_history = 0;
 
-   if (revs->topo_order)
+   if (revs->topo_order && !generation_numbers_enabled(the_repository))
revs->limited = 1;
 
if (revs->prune_data.nr) {
@@ -2889,6 +2890,33 @@ static int mark_uninteresting(const struct object_id 
*oid,
return 0;
 }
 
+struct topo_walk_info {};
+
+static void init_topo_walk(struct rev_info *revs)
+{
+   struct topo_walk_info *info;
+   revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
+   info = revs->topo_walk_info;
+   memset(info, 0, sizeof(struct topo_walk_info));
+
+   limit_list(revs);
+   sort_in_topological_order(>commits, revs->sort_order);
+}
+
+static struct commit *next_topo_commit(struct rev_info *revs)
+{
+   return pop_commit(>commits);
+}
+
+static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
+{
+   if (add_parents_to_list(revs, commit, >commits, NULL) < 0) {
+   if (!revs->ignore_missing_links)
+   die("Failed to traverse parents of commit %s",
+   oid_to_hex(>object.oid));
+   }
+}
+
 int prepare_revision_walk(struct rev_info *revs)
 {
int i;
@@ -2925,11 +2953,13 @@ int prepare_revision_walk(struct rev_info *revs)
commit_list_sort_by_date(>commits);
if (revs->no_walk)
return 0;
-   if (revs->limited)
+   if (revs->limited) {
if (limit_list(revs) < 0)
return -1;
-   if (revs->topo_order)
-   sort_in_topological_order(>commits, revs->sort_order);
+   if (revs->topo_order)
+   

[PATCH 3/6] test-reach: add rev-list tests

2018-08-27 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The rev-list command is critical to Git's functionality. Ensure it
works in the three commit-graph environments constructed in
t6600-test-reach.sh. Here are a few important types of rev-list
operations:

* Basic: git rev-list --topo-order HEAD
* Range: git rev-list --topo-order compare..HEAD
* Ancestry: git rev-list --topo-order --ancestry-path compare..HEAD
* Symmetric Difference: git rev-list --topo-order compare...HEAD

Signed-off-by: Derrick Stolee 
---
 t/t6600-test-reach.sh | 84 +++
 1 file changed, 84 insertions(+)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index 1b18e12a4e..2fcaa39077 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -243,4 +243,88 @@ test_expect_success 'commit_contains:miss' '
test_three_modes commit_contains --tag
 '
 
+test_expect_success 'rev-list: basic topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 
commit-1-6 \
+   commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 
commit-1-5 \
+   commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 
commit-1-4 \
+   commit-6-3 commit-5-3 commit-4-3 commit-3-3 commit-2-3 
commit-1-3 \
+   commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 
commit-1-2 \
+   commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 
commit-1-1 \
+   >expect &&
+   run_three_modes "git rev-list --topo-order commit-6-6"
+'
+
+test_expect_success 'rev-list: first-parent topo-order' '
+   git rev-parse \
+   commit-6-6 \
+   commit-6-5 \
+   commit-6-4 \
+   commit-6-3 \
+   commit-6-2 \
+   commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 
commit-1-1 \
+   >expect &&
+   run_three_modes "git rev-list --first-parent --topo-order commit-6-6"
+'
+
+test_expect_success 'rev-list: range topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 
commit-1-6 \
+   commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 
commit-1-5 \
+   commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 
commit-1-4 \
+   commit-6-3 commit-5-3 commit-4-3 \
+   commit-6-2 commit-5-2 commit-4-2 \
+   commit-6-1 commit-5-1 commit-4-1 \
+   >expect &&
+   run_three_modes "git rev-list --topo-order commit-3-3..commit-6-6"
+'
+
+test_expect_success 'rev-list: range topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 \
+   commit-6-5 commit-5-5 commit-4-5 \
+   commit-6-4 commit-5-4 commit-4-4 \
+   commit-6-3 commit-5-3 commit-4-3 \
+   commit-6-2 commit-5-2 commit-4-2 \
+   commit-6-1 commit-5-1 commit-4-1 \
+   >expect &&
+   run_three_modes "git rev-list --topo-order commit-3-8..commit-6-6"
+'
+
+test_expect_success 'rev-list: first-parent range topo-order' '
+   git rev-parse \
+   commit-6-6 \
+   commit-6-5 \
+   commit-6-4 \
+   commit-6-3 \
+   commit-6-2 \
+   commit-6-1 commit-5-1 commit-4-1 \
+   >expect &&
+   run_three_modes "git rev-list --first-parent --topo-order 
commit-3-8..commit-6-6"
+'
+
+test_expect_success 'rev-list: ancestry-path topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 commit-3-6 \
+   commit-6-5 commit-5-5 commit-4-5 commit-3-5 \
+   commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
+   commit-6-3 commit-5-3 commit-4-3 \
+   >expect &&
+   run_three_modes "git rev-list --topo-order --ancestry-path 
commit-3-3..commit-6-6"
+'
+
+test_expect_success 'rev-list: symmetric difference topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 \
+   commit-6-5 commit-5-5 commit-4-5 \
+   commit-6-4 commit-5-4 commit-4-4 \
+   commit-6-3 commit-5-3 commit-4-3 \
+   commit-6-2 commit-5-2 commit-4-2 \
+   commit-6-1 commit-5-1 commit-4-1 \
+   commit-3-8 commit-2-8 commit-1-8 \
+   commit-3-7 commit-2-7 commit-1-7 \
+   >expect &&
+   run_three_modes "git rev-list --topo-order commit-3-8...commit-6-6"
+'
+
 test_done
-- 
gitgitgadget



[PATCH 5/6] commit/revisions: bookkeeping before refactoring

2018-08-27 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

There are a few things that need to move around a little before
making a big refactoring in the topo-order logic:

1. We need access to record_author_date() and
   compare_commits_by_author_date() in revision.c. These are used
   currently by sort_in_topological_order() in commit.c.

2. Moving these methods to commit.h requires adding the author_slab
   definition to commit.h.

3. The add_parents_to_list() method in revision.c performs logic
   around the UNINTERESTING flag and other special cases depending
   on the struct rev_info. Allow this method to ignore a NULL 'list'
   parameter, as we will not be populating the list for our walk.

Signed-off-by: Derrick Stolee 
---
 commit.c   | 11 ---
 commit.h   |  8 
 revision.c |  6 --
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/commit.c b/commit.c
index 32d1234bd7..2dbe187b8c 100644
--- a/commit.c
+++ b/commit.c
@@ -655,11 +655,8 @@ struct commit *pop_commit(struct commit_list **stack)
 /* count number of children that have not been emitted */
 define_commit_slab(indegree_slab, int);
 
-/* record author-date for each commit object */
-define_commit_slab(author_date_slab, unsigned long);
-
-static void record_author_date(struct author_date_slab *author_date,
-  struct commit *commit)
+void record_author_date(struct author_date_slab *author_date,
+   struct commit *commit)
 {
const char *buffer = get_commit_buffer(commit, NULL);
struct ident_split ident;
@@ -684,8 +681,8 @@ fail_exit:
unuse_commit_buffer(commit, buffer);
 }
 
-static int compare_commits_by_author_date(const void *a_, const void *b_,
- void *cb_data)
+int compare_commits_by_author_date(const void *a_, const void *b_,
+  void *cb_data)
 {
const struct commit *a = a_, *b = b_;
struct author_date_slab *author_date = cb_data;
diff --git a/commit.h b/commit.h
index e2c99d9b04..51de10e698 100644
--- a/commit.h
+++ b/commit.h
@@ -8,6 +8,7 @@
 #include "gpg-interface.h"
 #include "string-list.h"
 #include "pretty.h"
+#include "commit-slab.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0x
 #define GENERATION_NUMBER_INFINITY 0x
@@ -328,6 +329,13 @@ extern int remove_signature(struct strbuf *buf);
  */
 extern int check_commit_signature(const struct commit *commit, struct 
signature_check *sigc);
 
+/* record author-date for each commit object */
+define_commit_slab(author_date_slab, timestamp_t);
+
+void record_author_date(struct author_date_slab *author_date,
+   struct commit *commit);
+
+int compare_commits_by_author_date(const void *a_, const void *b_, void 
*unused);
 int compare_commits_by_commit_date(const void *a_, const void *b_, void 
*unused);
 int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
void *unused);
 
diff --git a/revision.c b/revision.c
index 1db70dc951..565f903e46 100644
--- a/revision.c
+++ b/revision.c
@@ -804,7 +804,8 @@ static int add_parents_to_list(struct rev_info *revs, 
struct commit *commit,
if (p->object.flags & SEEN)
continue;
p->object.flags |= SEEN;
-   commit_list_insert_by_date_cached(p, list, cached_base, 
cache_ptr);
+   if (list)
+   commit_list_insert_by_date_cached(p, list, 
cached_base, cache_ptr);
}
return 0;
}
@@ -843,7 +844,8 @@ static int add_parents_to_list(struct rev_info *revs, 
struct commit *commit,
p->object.flags |= left_flag;
if (!(p->object.flags & SEEN)) {
p->object.flags |= SEEN;
-   commit_list_insert_by_date_cached(p, list, cached_base, 
cache_ptr);
+   if (list)
+   commit_list_insert_by_date_cached(p, list, 
cached_base, cache_ptr);
}
if (revs->first_parent_only)
break;
-- 
gitgitgadget



[PATCH 0/6] Use generation numbers for --topo-order

2018-08-27 Thread Derrick Stolee via GitGitGadget
This patch series performs a decently-sized refactoring of the revision-walk
machinery. Well, "refactoring" is probably the wrong word, as I don't
actually remove the old code. Instead, when we see certain options in the
'rev_info' struct, we redirect the commit-walk logic to a new set of methods
that distribute the workload differently. By using generation numbers in the
commit-graph, we can significantly improve 'git log --graph' commands (and
the underlying 'git rev-list --topo-order').

On the Linux repository, I got the following performance results when
comparing to the previous version with or without a commit-graph:

Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
  HEAD, w/ commit-graph: 0.02 s

Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
  HEAD, w/ commit-graph: 0.06 s

If you want to read this series but are unfamiliar with the commit-graph and
generation numbers, then I recommend reading 
Documentation/technical/commit-graph.txt or a blob post [1] I wrote on the
subject. In particular, the three-part walk described in "revision.c:
refactor basic topo-order logic" is present (but underexplained) as an
animated PNG [2].

Since revision.c is an incredibly important (and old) portion of the
codebase -- and because there are so many orthogonal options in 'struct
rev_info' -- I consider this submission to be "RFC quality". That is, I am
not confident that I am not missing anything, or that my solution is the
best it can be. I did merge this branch with ds/commit-graph-with-grafts and
the "DO-NOT-MERGE: write and read commit-graph always" commit that computes
a commit-graph with every 'git commit' command. The test suite passed with
that change, available on GitHub [3]. To ensure that I cover at least the
case I think are interesting, I added tests to t6600-test-reach.sh to verify
the walks report the correct results for the three cases there (no
commit-graph, full commit-graph, and a partial commit-graph so the walk
starts at GENERATION_NUMBER_INFINITY).

One notable case that is not included in this series is the case of a
history comparison such as 'git rev-list --topo-order A..B'. The existing
code in limit_list() has ways to cut the walk short when all pending commits
are UNINTERESTING. Since this code depends on commit_list instead of the
prio_queue we are using here, I chose to leave it untouched for now. We can
revisit it in a separate series later. Since handle_commit() turns on
revs->limited when a commit is UNINTERESTING, we do not hit the new code in
this case. Removing this 'revs->limited = 1;' line yields correct results,
but the performance is worse.

This series is based on ds/reachable.

Thanks, -Stolee

[1] 
https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/
Supercharging the Git Commit Graph III: Generations and Graph Algorithms

[2] 
https://msdnshared.blob.core.windows.net/media/2018/06/commit-graph-topo-order-b-a.png
Animation showing three-part walk

[3] https://github.com/derrickstolee/git/tree/topo-order/testA branch
containing this series along with commits to compute commit-graph in entire
test suite.

Derrick Stolee (6):
  prio-queue: add 'peek' operation
  test-reach: add run_three_modes method
  test-reach: add rev-list tests
  revision.c: begin refactoring --topo-order logic
  commit/revisions: bookkeeping before refactoring
  revision.c: refactor basic topo-order logic

 commit.c   |  11 +-
 commit.h   |   8 ++
 object.h   |   4 +-
 prio-queue.c   |   9 ++
 prio-queue.h   |   6 +
 revision.c | 232 -
 revision.h |   6 +
 t/helper/test-prio-queue.c |  10 +-
 t/t6600-test-reach.sh  |  98 +++-
 9 files changed, 361 insertions(+), 23 deletions(-)


base-commit: 6cc017431c1c48f80d1c6512fdcc9866cf4b7f55
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-25%2Fderrickstolee%2Ftopo-order%2Fprogress-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-25/derrickstolee/topo-order/progress-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/25
-- 
gitgitgadget


[PATCH 6/6] revision.c: refactor basic topo-order logic

2018-08-27 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:

1. Run limit_list(), which parses all reachable commits,
   adds them to a linked list, and distributes UNINTERESTING
   flags. If all unprocessed commits are UNINTERESTING, then
   it may terminate without walking all reachable commits.
   This does not occur if we do not specify UNINTERESTING
   commits.

2. Run sort_in_topological_order(), which is an implementation
   of Kahn's algorithm. It first iterates through the entire
   set of important commits and computes the in-degree of each
   (plus one, as we use 'zero' as a special value here). Then,
   we walk the commits in priority order, adding them to the
   priority queue if and only if their in-degree is one. As
   we remove commits from this priority queue, we decrement the
   in-degree of their parents.

3. While we are peeling commits for output, get_revision_1()
   uses pop_commit on the full list of commits computed by
   sort_in_topological_order().

In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.

Recall that the generation number of a commit satisfies:

* If the commit has at least one parent, then the generation
  number is one more than the maximum generation number among
  its parents.

* If the commit has no parent, then the generation number is one.

There are two special generation numbers:

* GENERATION_NUMBER_INFINITY: this value is 0x and
  indicates that the commit is not stored in the commit-graph and
  the generation number was not previously calculated.

* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
  to say that the commit-graph was generated by a version of Git
  that does not compute generation numbers (such as v2.18.0).

Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:

If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.

Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.

The walks are as follows:

1. EXPLORE: using the explore_queue priority queue (ordered by
   maximizing the generation number), parse each reachable
   commit until all commits in the queue have generation
   number strictly lower than needed. During this walk, update
   the UNINTERESTING flags as necessary.

2. INDEGREE: using the indegree_queue priority queue (ordered
   by maximizing the generation number), add one to the in-
   degree of each parent for each commit that is walked. Since
   we walk in order of decreasing generation number, we know
   that discovering an in-degree value of 0 means the value for
   that commit was not initialized, so should be initialized to
   two. (Recall that in-degree value "1" is what we use to say a
   commit is ready for output.) As we iterate the parents of a
   commit during this walk, ensure the EXPLORE walk has walked
   beyond their generation numbers.

3. TOPO: using the topo_queue priority queue (ordered based on
   the sort_order given, which could be commit-date, author-
   date, or typical topo-order which treats the queue as a LIFO
   stack), remove a commit from the queue and decrement the
   in-degree of each parent. If a parent has an in-degree of
   one, then we add it to the topo_queue. Before we decrement
   the in-degree, however, ensure the INDEGREE walk has walked
   beyond that generation number.

The implementations of these walks are in the following methods:

* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk

These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.

One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.

In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.

Test: git rev-list

Re: Contributor Summit planning

2018-08-27 Thread Derrick Stolee

On 8/27/2018 9:22 AM, Johannes Schindelin wrote:

Point in favor of the pure-online meeting: the informal standup on IRC
every second Friday. I really try to attend it (it is a bit awkward
because it is on a Friday evening in my timezone, right at the time when I
want to unwind from the work week), as it does have a similar effect to
in-person standups: surprising collaborations spring up, unexpected help,
and a general sense of belonging.

Of course, the value of these standups comes from the makeup of the
participants: Stefan, Brandon, Stolee, JeffH, Jonathan and other *very*
active core contributors hang out for roughly half an hour, sharing what
they are working on, exchanging ideas, etc.


A focused aside, since you brought up the online "standup": it seems the 
IRC channel has been less than ideal, with people trying to participate 
but having nickname issues or being muted. You also describe another 
issue: the timing. Having a real-time discussion has its benefits, but 
also it leaves many people out.


One idea to try next time is to create a mailing list thread asking for 
statuses, and each person can chime in asynchronously and spawn a new 
discussion based on that status. Perhaps we can try that next time.


Thanks,

-Stolee



Re: [PATCH 0/9] introducing oideq()

2018-08-27 Thread Derrick Stolee

On 8/26/2018 4:56 PM, brian m. carlson wrote:

On Sat, Aug 25, 2018 at 04:00:31AM -0400, Jeff King wrote:

This is a follow-up to the discussion in:

   https://public-inbox.org/git/20180822030344.ga14...@sigill.intra.peff.net/

The general idea is that the majority of callers don't care about actual
plus/minus ordering from oidcmp() and hashcmp(); they just want to know
if two oids are equal. The stricter equality check can be optimized
better by the compiler.

Due to the simplicity of the current code and our inlining, the compiler
can usually figure this out for now. So I wouldn't expect this patch to
actually improve performance right away. But as that discussion shows,
we are likely to take a performance hit as we move to more runtime
determination of the_hash_algo parameters. Having these callers use the
more strict form will potentially help us recover that.

So in that sense we _could_ simply punt on this series until then (and
it's certainly post-v2.19 material). But I think it's worth doing now,
simply from a readability/annotation standpoint. IMHO the resulting code
is more clear (though I've long since taught myself to read !foocmp() as
equality).

I would quite like to see this series picked up for v2.20.  If we want
to minimize performance regressions with the SHA-256 work, I think it's
important.


Seconded.


Applying the following patch on top of this series causes gcc to inline
both branches, which is pretty much the best we can expect.  I haven't
benchmarked it, though, so I can't say what the actual performance
consequence is.


We should definitely measure the cost here, but when we have the option 
to move to newhash, that cost should be acceptable.


Thanks,

-Stolee



Re: [PATCH 3/9] convert "oidcmp() == 0" to oideq()

2018-08-27 Thread Derrick Stolee

On 8/27/2018 8:31 AM, Derrick Stolee wrote:

On 8/25/2018 4:36 AM, Jeff King wrote:

On Sat, Aug 25, 2018 at 04:07:15AM -0400, Jeff King wrote:

diff --git a/contrib/coccinelle/object_id.cocci 
b/contrib/coccinelle/object_id.cocci

index 5869979be7..548c02336d 100644
--- a/contrib/coccinelle/object_id.cocci
+++ b/contrib/coccinelle/object_id.cocci
@@ -108,3 +108,9 @@ expression E1, E2;
  @@
  - hashcpy(E1.hash, E2->hash)
  + oidcpy(, E2)


Is this change intended? It doesn't seem to match the intention of the 
rest of the patch.


Ignore me. It's just confusing to read the +/- notation from a cocci 
script alongside the file diff.




Re: [PATCH 3/9] convert "oidcmp() == 0" to oideq()

2018-08-27 Thread Derrick Stolee

On 8/25/2018 4:36 AM, Jeff King wrote:

On Sat, Aug 25, 2018 at 04:07:15AM -0400, Jeff King wrote:


diff --git a/contrib/coccinelle/object_id.cocci 
b/contrib/coccinelle/object_id.cocci
index 5869979be7..548c02336d 100644
--- a/contrib/coccinelle/object_id.cocci
+++ b/contrib/coccinelle/object_id.cocci
@@ -108,3 +108,9 @@ expression E1, E2;
  @@
  - hashcpy(E1.hash, E2->hash)
  + oidcpy(, E2)


Is this change intended? It doesn't seem to match the intention of the 
rest of the patch.



+
+@@
+expression E1, E2;
+@@
+- oidcmp(E1, E2) == 0
++ oideq(E1, E2)

[...]

diff --git a/cache.h b/cache.h
index f6d227fac7..d090f71706 100644
--- a/cache.h
+++ b/cache.h
@@ -1100,7 +1100,7 @@ static inline int is_empty_blob_sha1(const unsigned char 
*sha1)
  
  static inline int is_empty_blob_oid(const struct object_id *oid)

  {
-   return !oidcmp(oid, the_hash_algo->empty_blob);
+   return oideq(oid, the_hash_algo->empty_blob);
  }

By the way, one interesting thing I noticed is that coccinelle generates
the hunk for cache.h for _every_ file that includes it, and the
resulting patch is annoying to apply. I think this is related to the
discussion we were having in:

   https://public-inbox.org/git/20180802115522.16107-1-szeder@gmail.com/

Namely that we might do better to invoke one big spatch (and let it
parallelize internally) instead of one perfile. Presumably that would
also avoid looking at the headers repeatedly (which would be both faster
and produce better output).

I'm not planning to pursue that immediately, so just food for thought at
this point.


The more we use Coccinelle, the more we will learn how to improve the 
workflow.


Thanks,

-Stolee



Re: [ANNOUNCE] Git v2.19.0-rc0

2018-08-24 Thread Derrick Stolee

On 8/23/2018 4:59 PM, Derrick Stolee wrote:


When I choose my own metrics for performance tests, I like to run at 
least 10 runs, remove the largest AND smallest runs from the samples, 
and then take the average. I did this manually for 'git rev-list --all 
--objects' on git/git and got the following results:


v2.18.0    v2.19.0-rc0   HEAD

3.126 s    3.308 s   3.170 s

For full disclosure, here is a full table including all samples:

|  | v2.18.0 | v2.19.0-rc0 | HEAD    |
|--|-|-|-|
|  | 4.58    | 3.302   | 3.239   |
|  | 3.13    | 3.337   | 3.133   |
|  | 3.213   | 3.291   | 3.159   |
|  | 3.219   | 3.318   | 3.131   |
|  | 3.077   | 3.302   | 3.163   |
|  | 3.074   | 3.328   | 3.119   |
|  | 3.022   | 3.277   | 3.125   |
|  | 3.083   | 3.259   | 3.203   |
|  | 3.057   | 3.311   | 3.223   |
|  | 3.155   | 3.413   | 3.225   |
| Max  | 4.58    | 3.413   | 3.239   |
| Min  | 3.022   | 3.259   | 3.119   |
| Avg* | 3.126   | 3.30825 | 3.17025 |

(Note that the largest one was the first run, on v2.18.0, which is due 
to a cold disk.)


I just kicked off a script that will run this test on the Linux repo 
while I drive home. I'll be able to report a similar table of data 
easily.


Here are the numbers for Linux:

|  | v2.18.0  | v2.19.0-rc0 | HEAD   |
|--|--|-||
|  | 86.5 | 70.739  | 57.266 |
|  | 60.582   | 101.928 | 56.641 |
|  | 58.964   | 60.139  | 60.258 |
|  | 59.47    | 61.141  | 58.213 |
|  | 62.554   | 60.73   | 84.54  |
|  | 59.139   | 85.424  | 57.745 |
|  | 58.487   | 59.31   | 59.979 |
|  | 58.653   | 69.845  | 60.181 |
|  | 58.085   | 102.777 | 61.455 |
|  | 58.304   | 60.459  | 62.551 |
| Max  | 86.5 | 102.777 | 84.54  |
| Min  | 58.085   | 59.31   | 56.641 |
| Avg* | 59.51913 | 71.30063    | 59.706 |
| Med  | 59.0515  | 65.493  | 60.08  |

Thanks,

-Stolee



Re: [PATCH v3 5/5] tests: fix and add lint for non-portable grep --file

2018-08-24 Thread Derrick Stolee

On 8/23/2018 4:44 PM, Junio C Hamano wrote:

Ævar Arnfjörð Bjarmason   writes:


The --file option to grep isn't in POSIX[1], but -f is[1]. Let's check
for that in the future, and fix the portability regression in
f237c8b6fe ("commit-graph: implement git-commit-graph write",
2018-04-02) that broke e.g. AIX.

1. http://pubs.opengroup.org/onlinepubs/009695399/utilities/grep.html

Signed-off-by: Ævar Arnfjörð Bjarmason 
---

Thanks.


Yes, thanks! I'll keep this in mind for the future.


  t/check-non-portable-shell.pl | 1 +
  t/t5318-commit-graph.sh   | 2 +-
  2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/t/check-non-portable-shell.pl b/t/check-non-portable-shell.pl
index 75f38298d7..b45bdac688 100755
--- a/t/check-non-portable-shell.pl
+++ b/t/check-non-portable-shell.pl
@@ -43,6 +43,7 @@ sub err {
/\bwc -l.*"\s*=/ and err '`"$(wc -l)"` is not portable (use 
test_line_count)';
/\bhead\s+-c\b/ and err 'head -c is not portable (use test_copy_bytes BYTES 
out)';
/(?:\$\(seq|^\s*seq\b)/ and err 'seq is not portable (use test_seq)';
+   /\bgrep\b.*--file\b/ and err 'grep --file FILE is not portable (use 
grep -f FILE)';
/\bexport\s+[A-Za-z0-9_]*=/ and err '"export FOO=bar" is not portable (use 
FOO=bar && export FOO)';
/^\s*([A-Z0-9_]+=(\w+|(["']).*?\3)\s+)+(\w+)/ and exists($func{$4}) and
err '"FOO=bar shell_func" assignment extends beyond 
"shell_func"';
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 3c1ffad491..0c500f7ca2 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -134,7 +134,7 @@ test_expect_success 'Add one more commit' '
git branch commits/8 &&
ls $objdir/pack | grep idx >existing-idx &&
git repack &&
-   ls $objdir/pack| grep idx | grep -v --file=existing-idx >new-idx
+   ls $objdir/pack| grep idx | grep -v -f existing-idx >new-idx
  '
  
  # Current graph structure:


Re: [ANNOUNCE] Git v2.19.0-rc0

2018-08-24 Thread Derrick Stolee

On 8/24/2018 2:45 AM, Jeff King wrote:

On Thu, Aug 23, 2018 at 10:59:55PM -0400, Jeff King wrote:


So I think we have a winner. I'll polish that up into patches and send
it out later tonight.

Oof. This rabbit hole keeps going deeper and deeper. I wrote up my
coccinelle findings separately in:

   https://public-inbox.org/git/20180824064228.ga3...@sigill.intra.peff.net/

which is possibly a coccinelle bug (there I talked about oidcmp, since
it can be demonstrated with the existing transformations, but the same
thing happens with my hasheq patches). I'll wait to see how that
discussion plays out, but I do otherwise have hasheq() patches ready to
go, so it's probably not worth anybody else digging in further.

I was trying this myself yesterday, and found a doc _somewhere_ that 
said what we should do is this:


@@
expression e1, e2;
@@
if (
- oidcmp(e1, e2)
+ !oideq(e1, e2)
 )

(Don't forget the space before the last end-paren!)

-Stolee


Re: [ANNOUNCE] Git v2.19.0-rc0

2018-08-23 Thread Derrick Stolee

On 8/23/2018 2:53 PM, Jeff King wrote:

On Thu, Aug 23, 2018 at 06:26:58AM -0400, Derrick Stolee wrote:


I think you can safely
ignore the rest of it if you are otherwise occupied. Even if v2.19 ships
without some mitigation, I don't know that it's all that big a deal,
given the numbers I generated (which for some reason are less dramatic
than Stolee's).

My numbers may be more dramatic because my Linux environment is a virtual
machine.

If you have a chance, can you run p0001 on my patch (compared to
2.19-rc0, or to both v2.18 and v2.19-rc0)? It would be nice to double
check that it really is fixing the problem you saw.


Sure. Note: I had to create a new Linux VM on a different machine 
between Tuesday and today, so the absolute numbers are different.


Using git/git:

Test  v2.18.0   v2.19.0-rc0 HEAD
-
0001.2:   3.10(3.02+0.08)   3.27(3.17+0.09) +5.5% 3.14(3.02+0.11) +1.3%


Using torvalds/linux:

Test v2.18.0 v2.19.0-rc0   HEAD
--
0001.2:  56.08(45.91+1.50)   56.60(46.62+1.50) +0.9% 54.61(45.47+1.46) -2.6%


Now here is where I get on my soapbox (and create a TODO for myself 
later). I ran the above with GIT_PERF_REPEAT_COUNT=10, which intuitively 
suggests that the results should be _more_ accurate than the default of 
3. However, I then remember that we only report the *minimum* time from 
all the runs, which is likely to select an outlier from the 
distribution. To test this, I ran a few tests manually and found the 
variation between runs to be larger than 3%.


When I choose my own metrics for performance tests, I like to run at 
least 10 runs, remove the largest AND smallest runs from the samples, 
and then take the average. I did this manually for 'git rev-list --all 
--objects' on git/git and got the following results:


v2.18.0    v2.19.0-rc0   HEAD

3.126 s    3.308 s   3.170 s

For full disclosure, here is a full table including all samples:

|  | v2.18.0 | v2.19.0-rc0 | HEAD    |
|--|-|-|-|
|  | 4.58    | 3.302   | 3.239   |
|  | 3.13    | 3.337   | 3.133   |
|  | 3.213   | 3.291   | 3.159   |
|  | 3.219   | 3.318   | 3.131   |
|  | 3.077   | 3.302   | 3.163   |
|  | 3.074   | 3.328   | 3.119   |
|  | 3.022   | 3.277   | 3.125   |
|  | 3.083   | 3.259   | 3.203   |
|  | 3.057   | 3.311   | 3.223   |
|  | 3.155   | 3.413   | 3.225   |
| Max  | 4.58    | 3.413   | 3.239   |
| Min  | 3.022   | 3.259   | 3.119   |
| Avg* | 3.126   | 3.30825 | 3.17025 |

(Note that the largest one was the first run, on v2.18.0, which is due 
to a cold disk.)


I just kicked off a script that will run this test on the Linux repo 
while I drive home. I'll be able to report a similar table of data easily.


My TODO is to consider aggregating the data this way (or with a median) 
instead of reporting the minimum.


Thanks,

-Stolee



[PATCH v2] config: fix commit-graph related config docs

2018-08-23 Thread Derrick Stolee
The core.commitGraph config setting was accidentally removed from
the config documentation. In that same patch, the config setting
that writes a commit-graph during garbage collection was incorrectly
written to the doc as "gc.commitGraph" instead of "gc.writeCommitGraph".

Reported-by: Szeder Gábor 
Signed-off-by: Derrick Stolee 
---
 Documentation/config.txt | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 1c42364988..6ee1890984 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -917,12 +917,10 @@ core.notesRef::
 This setting defaults to "refs/notes/commits", and it can be overridden by
 the `GIT_NOTES_REF` environment variable.  See linkgit:git-notes[1].
 
-gc.commitGraph::
-   If true, then gc will rewrite the commit-graph file when
-   linkgit:git-gc[1] is run. When using linkgit:git-gc[1]
-   '--auto' the commit-graph will be updated if housekeeping is
-   required. Default is false. See linkgit:git-commit-graph[1]
-   for details.
+core.commitGraph::
+   If true, then git will read the commit-graph file (if it exists)
+   to parse the graph structure of commits. Defaults to false. See
+   linkgit:git-commit-graph[1] for more information.
 
 core.useReplaceRefs::
If set to `false`, behave as if the `--no-replace-objects`
@@ -1763,6 +1761,13 @@ this configuration variable is ignored, all packs except 
the base pack
 will be repacked. After this the number of packs should go below
 gc.autoPackLimit and gc.bigPackThreshold should be respected again.
 
+gc.writeCommitGraph::
+   If true, then gc will rewrite the commit-graph file when
+   linkgit:git-gc[1] is run. When using linkgit:git-gc[1]
+   '--auto' the commit-graph will be updated if housekeeping is
+   required. Default is false. See linkgit:git-commit-graph[1]
+   for details.
+
 gc.logExpiry::
If the file gc.log exists, then `git gc --auto` won't run
unless that file is more than 'gc.logExpiry' old.  Default is
-- 
2.19.0.rc0



Re: [ANNOUNCE] Git v2.19.0-rc0

2018-08-23 Thread Derrick Stolee

On 8/23/2018 1:04 AM, Jeff King wrote:

On Thu, Aug 23, 2018 at 03:47:07AM +, brian m. carlson wrote:


I expect that's going to be the case as well.  I have patches that
wire up actual SHA-256 support in my hash-impl branch.

However, having said that, I'm happy to defer to whatever everyone else
thinks is best for 2.19.  The assert solution would be fine with me in
this situation, and if we need to pull it out in the future, that's okay
with me.

I don't really have a strong opinion on this either way, so if someone
else does, please say so.  I have somewhat more limited availability
over the next couple days, as I'm travelling on business, but I'm happy
to review a patch (and it seems like Peff has one minus the actual
commit message).

I just posted the patch elsewhere in the thread.


Thank you for that!


I think you can safely
ignore the rest of it if you are otherwise occupied. Even if v2.19 ships
without some mitigation, I don't know that it's all that big a deal,
given the numbers I generated (which for some reason are less dramatic
than Stolee's).
My numbers may be more dramatic because my Linux environment is a 
virtual machine.


I was thinking that having a mitigation for 2.19 is best, and then we 
can focus as part of the 2.20 cycle how we can properly avoid this cost, 
especially when 32 is a valid option.


Around the time that my proposed approaches were getting vetoed for 
alignment issues, I figured I was out of my depth here. I reached out to 
Daniel Lemire (of EWAH bitmap fame) on Twitter [1]. His blog is full of 
posts of word-based approaches to different problems, so I thought he 
might know something off the top of his head that would be applicable. 
His conclusion (after looking only a short time) was to take a 'hasheq' 
approach [2] like Peff suggested [3]. Since that requires auditing all 
callers of hashcmp to see if hasheq is appropriate, it is not a good 
solution for 2.19 but (in my opinion) should be evaluated as part of the 
2.20 cycle.


Of course, if someone with knowledge of word-alignment issues across the 
platforms we support knows how to enforce an alignment for object_id, 
then something word-based like [4] could be reconsidered.


Thanks, everyone!
-Stolee

[1] https://twitter.com/stolee/status/1032312965754748930

[2] 
https://lemire.me/blog/2018/08/22/avoid-lexicographical-comparisons-when-testing-for-string-equality/


[3] 
https://public-inbox.org/git/20180822030344.ga14...@sigill.intra.peff.net/


[4] 
https://public-inbox.org/git/7ea416cf-b043-1274-e161-85a8780b8...@gmail.com/


Re: [ANNOUNCE] Git v2.19.0-rc0

2018-08-22 Thread Derrick Stolee

On 8/22/2018 12:58 PM, Duy Nguyen wrote:

On Wed, Aug 22, 2018 at 6:49 PM Derrick Stolee  wrote:

On 8/22/2018 12:26 PM, Jeff King wrote:

On Wed, Aug 22, 2018 at 06:14:24PM +0200, Duy Nguyen wrote:


On Wed, Aug 22, 2018 at 6:08 PM Duy Nguyen  wrote:

On Wed, Aug 22, 2018 at 6:03 PM Jeff King  wrote:

On Wed, Aug 22, 2018 at 07:14:42AM -0400, Derrick Stolee wrote:


The other thing I was going to recommend (and I'll try to test this out
myself later) is to see if 'the_hash_algo->rawsz' is being treated as a
volatile variable, since it is being referenced through a pointer. Perhaps
storing the value locally and then casing on it would help?

I tried various sprinkling of "const" around the declarations to make it
clear that the values wouldn't change once we saw them. But I couldn't
detect any difference. At most I think that would let us hoist the "if"
out of the loop, but gcc still seems unwilling to expand the memcmp when
there are other branches.

I think if that's the thing we want to have happen, we really do need to
just write it out on that branch rather than saying "memcmp".

This reminds me of an old discussion about memcpy() vs doing explicit
compare loop with lots of performance measurements..

Ah found it. Not sure if it is still relevant in light of multiple hash support

https://public-inbox.org/git/20110427225114.ga16...@elte.hu/

Yes, that was what I meant. We actually did switch to that hand-rolled
loop, but later we went back to memcmp in 0b006014c8 (hashcmp: use
memcmp instead of open-coded loop, 2017-08-09).

Looking at that commit, I'm surprised the old logic was just a for loop, 
instead of a word-based approach, such as the following:

Might work on x86 but it breaks on cpu architectures with stricter
alignment. I don't think we have a guarantee that object_id is always
8 byte aligned.
You (and Peff) are probably correct here, which is unfortunate. I'm not 
familiar with alignment constraints, but assume that such a word-based 
approach is best.


Thanks,
-Stolee



Re: [ANNOUNCE] Git v2.19.0-rc0

2018-08-22 Thread Derrick Stolee

On 8/22/2018 12:26 PM, Jeff King wrote:

On Wed, Aug 22, 2018 at 06:14:24PM +0200, Duy Nguyen wrote:


On Wed, Aug 22, 2018 at 6:08 PM Duy Nguyen  wrote:

On Wed, Aug 22, 2018 at 6:03 PM Jeff King  wrote:

On Wed, Aug 22, 2018 at 07:14:42AM -0400, Derrick Stolee wrote:


The other thing I was going to recommend (and I'll try to test this out
myself later) is to see if 'the_hash_algo->rawsz' is being treated as a
volatile variable, since it is being referenced through a pointer. Perhaps
storing the value locally and then casing on it would help?

I tried various sprinkling of "const" around the declarations to make it
clear that the values wouldn't change once we saw them. But I couldn't
detect any difference. At most I think that would let us hoist the "if"
out of the loop, but gcc still seems unwilling to expand the memcmp when
there are other branches.

I think if that's the thing we want to have happen, we really do need to
just write it out on that branch rather than saying "memcmp".

This reminds me of an old discussion about memcpy() vs doing explicit
compare loop with lots of performance measurements..

Ah found it. Not sure if it is still relevant in light of multiple hash support

https://public-inbox.org/git/20110427225114.ga16...@elte.hu/

Yes, that was what I meant. We actually did switch to that hand-rolled
loop, but later we went back to memcmp in 0b006014c8 (hashcmp: use
memcmp instead of open-coded loop, 2017-08-09).


Looking at that commit, I'm surprised the old logic was just a for loop, 
instead of a word-based approach, such as the following:

diff --git a/cache.h b/cache.h
index b1fd3d58ab..5e5819ad49 100644
--- a/cache.h
+++ b/cache.h
@@ -1021,9 +1021,41 @@ extern int find_unique_abbrev_r(char *hex, const 
struct object_id *oid, int len)

 extern const unsigned char null_sha1[GIT_MAX_RAWSZ];
 extern const struct object_id null_oid;

+static inline int word_cmp_32(uint32_t a, uint32_t b)
+{
+   return memcmp(, , sizeof(uint32_t));
+}
+
+static inline int word_cmp_64(uint64_t a, uint64_t b)
+{
+   return memcmp(, , sizeof(uint64_t));
+}
+
+struct object_id_20 {
+   uint64_t data0;
+   uint64_t data1;
+   uint32_t data2;
+};
+
 static inline int hashcmp(const unsigned char *sha1, const unsigned 
char *sha2)

 {
-   return memcmp(sha1, sha2, the_hash_algo->rawsz);
+   if (the_hash_algo->rawsz == 20) {
+   struct object_id_20 *obj1 = (struct object_id_20 *)sha1;
+   struct object_id_20 *obj2 = (struct object_id_20 *)sha2;
+
+   if (obj1->data0 == obj2->data0) {
+   if (obj1->data1 == obj2->data1) {
+   if (obj1->data2 == obj2->data2) {
+   return 0;
+   }
+   return word_cmp_32(obj1->data2, 
obj2->data2);

+   }
+   return word_cmp_64(obj1->data1, obj2->data1);
+   }
+   return word_cmp_64(obj1->data0, obj2->data0);
+   }
+
+   assert(0);
 }

 static inline int oidcmp(const struct object_id *oid1, const struct 
object_id *oid2)





Re: [ANNOUNCE] Git v2.19.0-rc0

2018-08-22 Thread Derrick Stolee

On 8/22/2018 1:36 AM, brian m. carlson wrote:

On Tue, Aug 21, 2018 at 11:03:44PM -0400, Jeff King wrote:

So I wonder if there's some other way to tell the compiler that we'll
only have a few values. An enum comes to mind, though I don't think the
enum rules are strict enough to make this guarantee (after all, it's OK
to bitwise-OR enums, so they clearly don't specify all possible values).

I was thinking about this:

diff --git a/cache.h b/cache.h
index 1398b2a4e4..1f5c6e9319 100644
--- a/cache.h
+++ b/cache.h
@@ -1033,7 +1033,14 @@ extern const struct object_id null_oid;
  
  static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)

  {
-   return memcmp(sha1, sha2, the_hash_algo->rawsz);
+   switch (the_hash_algo->rawsz) {
+   case 20:
+   return memcmp(sha1, sha2, 20);
+   case 32:
+   return memcmp(sha1, sha2, 32);
+   default:
+   assert(0);
+   }
  }
  
  static inline int oidcmp(const struct object_id *oid1, const struct object_id *oid2)


That would make it obvious that there are at most two options.
Unfortunately, gcc for me determines that the buffer in walker.c is 20
bytes in size and steadfastly refuses to compile because it doesn't know
that the value will never be 32 in our codebase currently.  I'd need to
send in more patches before it would compile.

I don't know if something like this is an improvement or now, but this
seems to at least compile:

diff --git a/cache.h b/cache.h
index 1398b2a4e4..3207f74771 100644
--- a/cache.h
+++ b/cache.h
@@ -1033,7 +1033,13 @@ extern const struct object_id null_oid;
  
  static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)

  {
-   return memcmp(sha1, sha2, the_hash_algo->rawsz);
+   switch (the_hash_algo->rawsz) {
+   case 20:
+   case 32:
+   return memcmp(sha1, sha2, the_hash_algo->rawsz);
+   default:
+   assert(0);
+   }
  }

In my testing, I've had the best luck with this change:

diff --git a/cache.h b/cache.h
index b1fd3d58ab..6c8b51c390 100644
--- a/cache.h
+++ b/cache.h
@@ -1023,7 +1023,14 @@ extern const struct object_id null_oid;

 static inline int hashcmp(const unsigned char *sha1, const unsigned 
char *sha2)

 {
-   return memcmp(sha1, sha2, the_hash_algo->rawsz);
+   switch (the_hash_algo->rawsz) {
+   case 20:
+   return memcmp(sha1, sha2, 20);
+   case 32:
+   return memcmp(sha1, sha2, 32);
+   default:
+   assert(0);
+   }
 }

The fact that '20' and '32' are constants here may be helpful to the 
compiler. Can someone else test the perf?


Thanks,
-Stolee


Re: [PATCH 3/6] t/perf: add infrastructure for measuring sizes

2018-08-22 Thread Derrick Stolee

On 8/21/2018 3:06 PM, Jeff King wrote:

The main objective of scripts in the perf framework is to
run "test_perf", which measures the time it takes to run
some operation. However, it can also be interesting to see
the change in the output size of certain operations.

This patch introduces test_size, which records a single
numeric output from the test and shows it in the aggregated
output (with pretty printing and relative size comparison).


I'm interested in exploring this test_size mechanism. The other area 
that could benefit from size testing is 'git repack', but I don't have 
any plans to change our compression or delta strategies. If we _did_ 
look into that, then using test_size would be a natural fit.


Overall, I really like this series!

Thanks
-Stolee




[PATCH] config: fix commit-graph related config docs

2018-08-22 Thread Derrick Stolee
The core.commitGraph config setting was accidentally removed from
the config documentation. In that same patch, the config setting
that writes a commit-graph during garbage collection was incorrectly
written to the doc as "gc.commitGraph" instead of "gc.writeCommitGraph".

Reported-by: Szeder Gábor 
Signed-off-by: Derrick Stolee 
---

Szeder reported this in [1] and I forgot until now to come back to it.

[1] https://public-inbox.org/git/20180803093909.2853-1-szeder@gmail.com/

 Documentation/config.txt | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 1c42364988..f846543414 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -917,7 +917,11 @@ core.notesRef::
 This setting defaults to "refs/notes/commits", and it can be overridden by
 the `GIT_NOTES_REF` environment variable.  See linkgit:git-notes[1].
 
-gc.commitGraph::
+core.commitGraph::
+   Enable git commit graph feature. Allows reading from the
+   commit-graph file.
+
+gc.writeCommitGraph::
If true, then gc will rewrite the commit-graph file when
linkgit:git-gc[1] is run. When using linkgit:git-gc[1]
'--auto' the commit-graph will be updated if housekeeping is

base-commit: 7e8bfb0412581daf8f3c89909f1d37844e8610dd
-- 
2.19.0.rc0



Re: [ANNOUNCE] Git v2.19.0-rc0

2018-08-22 Thread Derrick Stolee



On 8/22/2018 3:39 AM, Ævar Arnfjörð Bjarmason wrote:

On Wed, Aug 22, 2018 at 8:20 AM Jeff King  wrote:

On Wed, Aug 22, 2018 at 05:36:26AM +, brian m. carlson wrote:


On Tue, Aug 21, 2018 at 11:03:44PM -0400, Jeff King wrote:
I don't know if something like this is an improvement or now, but this
seems to at least compile:

diff --git a/cache.h b/cache.h
index 1398b2a4e4..3207f74771 100644
--- a/cache.h
+++ b/cache.h
@@ -1033,7 +1033,13 @@ extern const struct object_id null_oid;

  static inline int hashcmp(const unsigned char *sha1, const unsigned char 
*sha2)
  {
- return memcmp(sha1, sha2, the_hash_algo->rawsz);
+ switch (the_hash_algo->rawsz) {
+ case 20:
+ case 32:
+ return memcmp(sha1, sha2, the_hash_algo->rawsz);
+ default:
+ assert(0);
+ }

I think that would end up with the same slow code, as gcc would rather
call memcmp than expand out the two sets of asm.


I won't have time to sit down and test this out until tomorrow afternoon
at the earliest.  If you want to send in something in the mean time,
even if that limits things to just 20 for now, that's fine.

I don't have a good option. The assert() thing works until I add in the
"32" branch, but that's just punting the issue off until you add support
for the new hash.

Hand-rolling our own asm or C is a portability headache, and we need to
change all of the callsites to use a new hasheq().

Hiding it behind a per-hash function is conceptually cleanest, but not
quite as fast. And it also requires hasheq().

So all of the solutions seem non-trivial.  Again, I'm starting to wonder
if it's worth chasing this few percent.

Did you try __builtin_expect? It's a GCC builtin for these sorts of
situations, and sometimes helps:
https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

I.e. you'd tell GCC we expect to have the 20 there with:

 if (__builtin_expect(the_hash_algo->rawsz == 20, 1)) { ... }

The perl codebase has LIKELY() and UNLIKELY() macros for this which if
the feature isn't available fall back on just plain C code:
https://github.com/Perl/perl5/blob/v5.27.7/perl.h#L3335-L3344
The other thing I was going to recommend (and I'll try to test this out 
myself later) is to see if 'the_hash_algo->rawsz' is being treated as a 
volatile variable, since it is being referenced through a pointer. 
Perhaps storing the value locally and then casing on it would help?


Re: [ANNOUNCE] Git v2.19.0-rc0

2018-08-22 Thread Derrick Stolee

On 8/21/2018 11:36 PM, Jeff King wrote:

On Tue, Aug 21, 2018 at 11:03:44PM -0400, Jeff King wrote:


with the obvious "oideq()" implementation added, that seems to get me to
2-3%. Not _quite_ as good as the original branching version I showed.
And we had to touch all the callsites (although arguably that kind of
"eq" function is a better interface anyway, since it obviously allows
for more optimization.

So maybe the branching thing is actually not so insane. It makes new
hash_algo's Just Work; they just won't be optimized. And the change is
very localized.

Hmph. So I went back to double-check my measurements on that branching
version, and I couldn't replicate it!

I'm actually relieved to see this, as I couldn't either.


It turns out what I showed (and measured) before has a bug. Can you see
it?
I had rewritten the section from scratch instead of applying your diff, 
so I didn't get the sha1-sha1 error. I decided to sleep on it instead of 
sending my email.



So the assert() version really is the fastest. I didn't test, but I
suspect we could "trick" the compiler by having the fallback call an
opaque wrapper around memcmp(). That would prevent it from combining the
two paths, and presumably it would still optimize the constant-20 side.
Or maybe it would eventually decide our inline function is getting too
big and scrap it. Which probably crosses a line of craziness (if I
didn't already cross it two emails ago).

I appreciate your effort here.

Thanks
-Stolee


Re: [PATCH v2 2/2] commit-graph.txt: improve formatting for asciidoc

2018-08-21 Thread Derrick Stolee

On 8/21/2018 5:29 PM, Junio C Hamano wrote:

"Derrick Stolee via GitGitGadget"  writes:


From: Derrick Stolee 

When viewing commit-graph.txt as a plain-text document, it makes
sense to keep paragraphs left-padded between bullet points.
However, asciidoc converts these left-padded paragraphs as monospace
fonts, creating an unpleasant document. Remove the padding.

That's completely backwards.

These indented two paragraphs that follow "... the following
property:" do not fall into the same classes of paragraphs as
others.  The way the text version makes them stand out by indenting
clearly show these two are what "... the following ..." refers to.

Perhaps these two would want to become a bulleted list or
something.


The "Future Work" section includes a bulleted list of items, and one
item has sub-items. These do not render properly in asciidoc, so
remove the sub-list and incorporate them into the paragraph.

Signed-off-by: Derrick Stolee 
---
  Documentation/technical/commit-graph.txt | 43 +++-

I said I didn't check if commit-graph-format.txt doc format
correctly, but that does not mean you do not have to ;-).  I suspect
that most of the contents would become monospaced wall of text,
which is no better than the original text and because only the
headings are typeset in different font, the result actually would be
harder to read than the original.


I compared the results of commit-graph-format.txt to pack-format.txt. 
Since I had based the format of commit-graph-format.txt on that file, 
the output was remarkably similar (and hence I couldn't find anything to 
change).



We are not in a hurry to do this during the pre-release period, are
we?  My understanding is that the original text documentation files
will be shipped and installed, and they are adequately readable
(correct me if it is not the case).

Unless we are making the result a lot more readable as the original
text, let's not distract ourselves by rerolling this in too quick
cycles without giving us sufficiently big improvements.
No rush! I'll wait until the release calms down for a v3, and read the 
helpful docs on asciidoc format that Eric shared earlier.


Thanks,
-Stolee


[PATCH] commit: use timestamp_t for author_date_slab

2018-08-21 Thread Derrick Stolee
The author_date_slab is used to store the author date of a commit
when walking with the --author-date flag in rev-list or log. This
was added as an 'unsigned long' in

81c6b38b "log: --author-date-order"

Since 'unsigned long' is ambiguous in its bit-ness across platforms
(64-bit in Linux, 32-bit in Windows, for example), most references
to the author dates in commit.c were converted to timestamp_t in

dddbad72 "timestamp_t: a new data type for timestamps"

However, the slab definition was missed, leading to a mismatch in
the data types in Windows. This would not reveal itself as a bug
unless someone authors a commit after February 2106, but commits
can store anything as their author date.

Signed-off-by: Derrick Stolee 
---

I found this while reading up on the revision-walk machinery. This code
hasn't been touched in years, so could apply to 'maint'.

 commit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 0030e79940..2ce5052b3e 100644
--- a/commit.c
+++ b/commit.c
@@ -609,7 +609,7 @@ struct commit *pop_commit(struct commit_list **stack)
 define_commit_slab(indegree_slab, int);
 
 /* record author-date for each commit object */
-define_commit_slab(author_date_slab, unsigned long);
+define_commit_slab(author_date_slab, timestamp_t);
 
 static void record_author_date(struct author_date_slab *author_date,
   struct commit *commit)
-- 
2.19.0.rc0



Re: [ANNOUNCE] Git v2.19.0-rc0

2018-08-21 Thread Derrick Stolee

On 8/20/2018 6:13 PM, Junio C Hamano wrote:

An early preview release Git v2.19.0-rc0 is now available for
testing at the usual places.


As part of testing the release candidate, I ran the performance suite 
against a fresh clone of the Linux repository using v2.18.0 and 
v2.19.0-rc0 (also: GIT_PERF_REPEAT_COUNT=10). I found a few nice 
improvements, but I also found a possible regression in tree walking. I 
say "tree walking" because it was revealed using p0001-rev-list.sh, but 
only with the "--objects" flag. I also saw some similar numbers on 'git 
log --raw'.


Test v2.18.0 v2.19.0-rc0

0001.1: rev-list --all 6.69(6.33+0.35) 6.52(6.20+0.31) -2.5%
0001.2: rev-list --all --objects 52.14(47.43+1.02)   57.15(51.09+1.18) +9.6%

To me, 9.6% seems out of the range of just noise for this length of a 
command, but I could be wrong. Could anyone else try to repro these results?


(This may also not just be tree-walking, but general pack-file loading 
and decompression, since I computed and stored a commit-graph file. 
Hence, commits are not being parsed from the pack-file by either command.)


Aside: the perf results were not all bad. Here was an interesting 
improvement:


Test v2.18.0 v2.19.0-rc0

0002.1: read_cache/discard_cache 1000 times 5.63(5.30+0.32)   
3.34(3.03+0.30) -40.7%


Thanks,

-Stolee



Re: [PATCH 5/6] pack-bitmap: save "have" bitmap from walk

2018-08-21 Thread Derrick Stolee

On 8/21/2018 3:07 PM, Jeff King wrote:

When we do a bitmap walk, we save the result, which
represents (WANTs & ~HAVEs); i.e., every object we care
about visiting in our walk. However, we throw away the
haves bitmap, which can sometimes be useful, too. Save it
and provide an access function so code which has performed a
walk can query it.


This makes a lot of sense. Based on the amount of time the "Counting 
Objects" blog post [1] spent talking about delta bases, I would have 
assumed this "haves" bitmap was already part of it. But, I can also see 
how it could be dropped if you are focusing on performance for 'git clone'.




A few notes on the accessor interface:

  - the bitmap code calls these "haves" because it grew out
of the want/have negotiation for fetches. But really,
these are simply the objects that would be flagged
UNINTERESTING in a regular traversal. Let's use that
more universal nomenclature for the external module
interface. We may want to change the internal naming
inside the bitmap code, but that's outside the scope of
this patch.


I saw the uninteresting-vs-haves name confusion in the patch below, but 
I agree with your logic here.


Sorry that I'm late to the party, but I was interested in the topic.

Thanks,

-Stolee

[1] https://githubengineering.com/counting-objects/

    "Counting Objects" by Vincent Martí



Re: [PATCH] SubmittingPatches: mention doc-diff

2018-08-21 Thread Derrick Stolee

On 8/21/2018 3:23 PM, Jeff King wrote:

We already advise people to make sure their documentation
formats correctly. Let's point them at the doc-diff script,
which can help with that.

Let's also put a brief note in the script about its purpose,
since that otherwise can only be found in the original
commit message. Along with the existing -h/usage text,
that's hopefully enough for developers to make use of it.


This is helpful, thanks!


Signed-off-by: Jeff King 
---
Just a finishing touch on the jk/diff-rendered-docs topic.

  Documentation/SubmittingPatches | 4 +++-
  Documentation/doc-diff  | 8 
  2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches
index b44fd51f27..ec8b205145 100644
--- a/Documentation/SubmittingPatches
+++ b/Documentation/SubmittingPatches
@@ -80,7 +80,9 @@ GitHub-Travis CI hints section for details.
  
  Do not forget to update the documentation to describe the updated

  behavior and make sure that the resulting documentation set formats
-well. It is currently a liberal mixture of US and UK English norms for
+well (try the Documentation/doc-diff script).
+
+We currently have a liberal mixture of US and UK English norms for
  spelling and grammar, which is somewhat unfortunate.  A huge patch that
  touches the files all over the place only to correct the inconsistency
  is not welcome, though.  Potential clashes with other changes that can
diff --git a/Documentation/doc-diff b/Documentation/doc-diff
index f483fe427c..6e285e648c 100755
--- a/Documentation/doc-diff
+++ b/Documentation/doc-diff
@@ -1,4 +1,12 @@
  #!/bin/sh
+#
+# Build two documentation trees and diff the resulting formatted output.
+# Compared to a source diff, this can reveal mistakes in the formatting.
+# For example:
+#
+#   ./doc-diff origin/master HEAD
+#
+# would show the differences introduced by a branch based on master.
  
  OPTIONS_SPEC="\

  doc-diff [options]   [-- ]


[PATCH v2 2/2] commit-graph.txt: improve formatting for asciidoc

2018-08-21 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When viewing commit-graph.txt as a plain-text document, it makes
sense to keep paragraphs left-padded between bullet points.
However, asciidoc converts these left-padded paragraphs as monospace
fonts, creating an unpleasant document. Remove the padding.

The "Future Work" section includes a bulleted list of items, and one
item has sub-items. These do not render properly in asciidoc, so
remove the sub-list and incorporate them into the paragraph.

Signed-off-by: Derrick Stolee 
---
 Documentation/technical/commit-graph.txt | 43 +++-
 1 file changed, 20 insertions(+), 23 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt 
b/Documentation/technical/commit-graph.txt
index c664acbd7..bdc3eab9e 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -49,23 +49,23 @@ Equivalently, the generation number of a commit A is one 
more than the
 length of a longest path from A to a root commit. The recursive definition
 is easier to use for computation and observing the following property:
 
-If A and B are commits with generation numbers N and M, respectively,
-and N <= M, then A cannot reach B. That is, we know without searching
-that B is not an ancestor of A because it is further from a root commit
-than A.
+If A and B are commits with generation numbers N and M, respectively,
+and N <= M, then A cannot reach B. That is, we know without searching
+that B is not an ancestor of A because it is further from a root commit
+than A.
 
-Conversely, when checking if A is an ancestor of B, then we only need
-to walk commits until all commits on the walk boundary have generation
-number at most N. If we walk commits using a priority queue seeded by
-generation numbers, then we always expand the boundary commit with highest
-generation number and can easily detect the stopping condition.
+Conversely, when checking if A is an ancestor of B, then we only need
+to walk commits until all commits on the walk boundary have generation
+number at most N. If we walk commits using a priority queue seeded by
+generation numbers, then we always expand the boundary commit with highest
+generation number and can easily detect the stopping condition.
 
 This property can be used to significantly reduce the time it takes to
 walk commits and determine topological relationships. Without generation
 numbers, the general heuristic is the following:
 
-If A and B are commits with commit time X and Y, respectively, and
-X < Y, then A _probably_ cannot reach B.
+If A and B are commits with commit time X and Y, respectively, and
+X < Y, then A _probably_ cannot reach B.
 
 This heuristic is currently used whenever the computation is allowed to
 violate topological relationships due to clock skew (such as "git log"
@@ -121,11 +121,8 @@ Future Work
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
-  priority queue with one ordered by generation number. The following
-  operations are important candidates:
-
-- 'log --topo-order'
-- 'tag --merged'
+  priority queue with one ordered by generation number. Commands that could
+  improve include 'git log --topo-order' and 'git tag --merged'.
 
 - A server could provide a commit graph file as part of the network protocol
   to avoid extra calculations by clients. This feature is only of benefit if
@@ -148,13 +145,13 @@ Related Links
 More discussion about generation numbers and not storing them inside
 commit objects. A valuable quote:
 
-"I think we should be moving more in the direction of keeping
- repo-local caches for optimizations. Reachability bitmaps have been
- a big performance win. I think we should be doing the same with our
- properties of commits. Not just generation numbers, but making it
- cheap to access the graph structure without zlib-inflating whole
- commit objects (i.e., packv4 or something like the "metapacks" I
- proposed a few years ago)."
+"I think we should be moving more in the direction of keeping
+ repo-local caches for optimizations. Reachability bitmaps have been
+ a big performance win. I think we should be doing the same with our
+ properties of commits. Not just generation numbers, but making it
+ cheap to access the graph structure without zlib-inflating whole
+ commit objects (i.e., packv4 or something like the "metapacks" I
+ proposed a few years ago)."
 
 [4] 
https://public-inbox.org/git/20180108154822.54829-1-...@jeffhostetler.com/T/#u
 A patch to remove the ahead-behind calculation from 'status'.
-- 
gitgitgadget


[PATCH v2 1/2] Docs: Add commit-graph tech docs to Makefile

2018-08-21 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

Ensure that the commit-graph.txt and commit-graph-format.txt files
are compiled to HTML using ASCIIDOC.

Signed-off-by: Derrick Stolee 
---
 Documentation/Makefile | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/Documentation/Makefile b/Documentation/Makefile
index d079d7c73..841e4f705 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -69,6 +69,8 @@ API_DOCS = $(patsubst %.txt,%,$(filter-out 
technical/api-index-skel.txt technica
 SP_ARTICLES += $(API_DOCS)
 
 TECH_DOCS += SubmittingPatches
+TECH_DOCS += technical/commit-graph
+TECH_DOCS += technical/commit-graph-format
 TECH_DOCS += technical/hash-function-transition
 TECH_DOCS += technical/http-protocol
 TECH_DOCS += technical/index-format
-- 
gitgitgadget



[PATCH v2 0/2] Docs: Add commit-graph tech docs to Makefile

2018-08-21 Thread Derrick Stolee via GitGitGadget
Similar to [1], add the commit-graph and commit-graph-format technical docs
to Documentation/Makefile so they are automatically converted to HTML when
needed.

I compiled the docs and inspected the HTML manually in the browser. As
suggested, I modified the documents to format a bit better. See the commit
messages for details. Since the files had been modified since 'maint', this
version is based on 'master'.

[1] 
https://public-inbox.org/git/20180814222846.gg142...@aiede.svl.corp.google.com/
[PATCH] partial-clone: render design doc using asciidoc

Derrick Stolee (2):
  Docs: Add commit-graph tech docs to Makefile
  commit-graph.txt: improve formatting for asciidoc

 Documentation/Makefile   |  2 ++
 Documentation/technical/commit-graph.txt | 43 +++-
 2 files changed, 22 insertions(+), 23 deletions(-)


base-commit: fa03cdc39b951d1cfbfd690fe6f3ac6c57ab6a44
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-22%2Fderrickstolee%2Fmake-docs-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-22/derrickstolee/make-docs-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/22

Range-diff vs v1:

 1:  ef5af2ccc = 1:  4c66af626 Docs: Add commit-graph tech docs to Makefile
 -:  - > 2:  6cf253c2a commit-graph.txt: improve formatting for asciidoc

-- 
gitgitgadget


Re: [PATCH v2 5/8] commit-graph: not compatible with replace objects

2018-08-21 Thread Derrick Stolee

On 8/21/2018 1:45 PM, Junio C Hamano wrote:

Derrick Stolee  writes:


Create new method commit_graph_compatible(r) to check if a given
repository r is compatible with the commit-graph feature. Fill the
method with a check to see if replace-objects exist. Test this
interaction succeeds, including ignoring an existing commit-graph and
failing to write a new commit-graph. However, we do ensure that
we write a new commit-graph by setting read_replace_refs to 0, thereby
ignoring the replace refs.

Signed-off-by: Derrick Stolee 
---
  builtin/commit-graph.c  |  4 
  commit-graph.c  | 21 +
  replace-object.c|  2 +-
  replace-object.h|  2 ++
  t/t5318-commit-graph.sh | 22 ++
  5 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 0bf0c48657..da737df321 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -120,6 +120,8 @@ static int graph_read(int argc, const char **argv)
return 0;
  }
  
+extern int read_replace_refs;

+

Why do you need this (and also in commit-graph.c)?  I thought
cache.h includes it, which you can just make use of it.



You're right. I don't know how I missed that.



+static int commit_graph_compatible(struct repository *r)
+{
+   if (read_replace_refs) {
+   prepare_replace_object(r);
+   if (hashmap_get_size(>objects->replace_map->map))
+   return 0;
+   }
+
+   return 1;
+}
diff --git a/replace-object.c b/replace-object.c
index 3c17864eb7..9821f1477e 100644
--- a/replace-object.c
+++ b/replace-object.c
@@ -32,7 +32,7 @@ static int register_replace_ref(struct repository *r,
return 0;
  }
  
-static void prepare_replace_object(struct repository *r)

+void prepare_replace_object(struct repository *r)
  {
if (r->objects->replace_map)
return;

The way the new caller is written, I am wondering if this function
should return "we are (or, are not) using the replace object
feature", making it unnecessary for callers on the reading side to
know about "read-replace-refs" external variable, for example.

/*
 * To be called on-demand from codepaths that want to make
 * sure that replacement objects can be found as necessary.
 *
 * Return number of replacement defined for the repository, or
 * -1 when !read_replace_refs tells us not to use replacement
 * mechanism at all.
 */
int prepare_replace_object(struct repository *r)
{
if (!read_replace_refs)
return -1;

if (!r->objects->replace_map) {
r->objects->replace_map =
xmalloc(...);
oidmap_init(r->objects->replace_map, 0);
for_each_refplace_ref(r, register_...);
}
return hashmap_get_size(>objects->replace_map->map);
}


This is a good idea. I can incorporate it into v3.


Then, the caller side can simply become something like:

#define cgraph_compat(r) (prepare_replace_object(r) <= 0)


With the next two patches that add more conditions to 
commit_graph_compatible(), I'd prefer keeping it a method instead of a 
macro.



There are various writers to read_replace_refs variable, but I think
they should first be replaced with calls to something like:

void use_replace_refs(struct repository *r, int enable);

which allows us to hide the global variable and later make it per
repository if we wanted to.


I will incorporate this into v3 as well.

Thanks,

-Stolee



Re: [PATCH v2 0/8] Clarify commit-graph and grafts/replace/shallow incompatibilities

2018-08-21 Thread Derrick Stolee

On 8/21/2018 1:51 PM, Junio C Hamano wrote:

Stefan Beller  writes:


On Mon, Aug 20, 2018 at 11:24 AM Derrick Stolee  wrote:

Because of the change of base, it is hard to provide a side-by-side diff
from v1.

I thought that was the whole point of range-diff. ;-)

I thought so, too.  Was there any change that confused range-diff
machinery?


I've just been out of the loop and didn't realize what range-diff did. 
Here is the range-diff:


1:  43ddcc9ef9 = 1:  c8edae4179 refs.c: migrate internal ref iteration 
to pass thru repository argument
2:  22dc9ce836 = 2:  f89451e884 refs.c: upgrade for_each_replace_ref to 
be a each_repo_ref_fn callback

3:  5e90d36f84 ! 3:  d95aa3472c commit-graph: update design document
    @@ -19,7 +19,7 @@
    so a future change of hash algorithm does not require a change 
in format.


 +- Commit grafts and replace objects can change the shape of the 
commit

    -+  history. These can also be enabled/disabled on the fly using
    ++  history. The latter can also be enabled/disabled on the fly using
 +  `--no-replace-objects`. This leads to difficultly storing both 
possible
 +  interpretations of a commit id, especially when computing 
generation

 +  numbers. The commit-graph will not be read or written when
4:  88711a3cf4 = 4:  ec89c88e14 test-repository: properly init repo
5:  7f596c1718 ! 5:  0321a3cf10 commit-graph: not compatible with 
replace objects

    @@ -6,10 +6,34 @@
 repository r is compatible with the commit-graph feature. Fill the
 method with a check to see if replace-objects exist. Test this
 interaction succeeds, including ignoring an existing 
commit-graph and

    -    failing to write a new commit-graph.
    +    failing to write a new commit-graph. However, we do ensure that
    +    we write a new commit-graph by setting read_replace_refs to 0, 
thereby

    +    ignoring the replace refs.

 Signed-off-by: Derrick Stolee 

    +diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
    +--- a/builtin/commit-graph.c
     b/builtin/commit-graph.c
    +@@
    +   return 0;
    + }
    +
    ++extern int read_replace_refs;
    ++
    + static int graph_write(int argc, const char **argv)
    + {
    +   struct string_list *pack_indexes = NULL;
    +@@
    +   if (!opts.obj_dir)
    +   opts.obj_dir = get_object_directory();
    +
    ++  read_replace_refs = 0;
    ++
    +   if (opts.reachable) {
    +   write_commit_graph_reachable(opts.obj_dir, opts.append);
    +   return 0;
    +
 diff --git a/commit-graph.c b/commit-graph.c
 --- a/commit-graph.c
 +++ b/commit-graph.c
    @@ -26,11 +50,15 @@
    return g;
  }

    ++extern int read_replace_refs;
    ++
 +static int commit_graph_compatible(struct repository *r)
 +{
    -+  prepare_replace_object(r);
    -+  if (hashmap_get_size(>objects->replace_map->map))
    -+  return 0;
    ++  if (read_replace_refs) {
    ++  prepare_replace_object(r);
    ++  if (hashmap_get_size(>objects->replace_map->map))
    ++  return 0;
    ++  }
 +
 +  return 1;
 +}
    @@ -110,7 +138,7 @@
 +  test_cmp expect actual &&
 +  rm -rf .git/objects/info/commit-graph &&
 +  git commit-graph write --reachable &&
    -+  test_path_is_missing .git/objects/info/commit-graph
    ++  test_path_is_file .git/objects/info/commit-graph
 +  )
 +'
 +
6:  94dd91ac35 ! 6:  d18ecc0124 commit-graph: not compatible with grafts
    @@ -7,24 +7,25 @@
 situations we ignore existing commit-graph files and we do not 
write new

 commit-graph files.

    +    Helped-by: Jakub Narebski 
 Signed-off-by: Derrick Stolee 

 diff --git a/commit-graph.c b/commit-graph.c
 --- a/commit-graph.c
 +++ b/commit-graph.c
 @@
    +   return 0;
    +   }

    - static int commit_graph_compatible(struct repository *r)
    - {
 +  prepare_commit_graft(r);
 +  if (r->parsed_objects && r->parsed_objects->grafts_nr)
 +  return 0;
 +  if (is_repository_shallow(r))
 +  return 0;
 +
    -   prepare_replace_object(r);
    -   if (hashmap_get_size(>objects->replace_map->map))
    -   return 0;
    +   return 1;
    + }
    +

 diff --git a/commit.c b/commit.c
 --- a/commit.c
    @@ -66,7 +67,9 @@
 +  cd graft &&
 +  git commit-graph write --reachable &&
 +  test_path_is_file .git/objects/info/commit-graph &&
    -+  git replace --graft HEAD~1 HEAD~3 &&
    ++  H1=$(git rev-parse --verify HEAD~1) &&
    ++  H3=$(git rev-parse --verify HEAD~3) &&
    ++  echo "$H1 $H3" >.git/info/grafts &&
 +  git -c core.commitGraph=false log >exp

Re: [PATCH 0/9] multi-pack-index cleanups

2018-08-21 Thread Derrick Stolee

On 8/21/2018 10:34 AM, Duy Nguyen wrote:

On Mon, Aug 20, 2018 at 6:53 PM Derrick Stolee  wrote:

To better understand the implications of the multi-pack-index with
other scenarios, I ran the test suite after adding a step to 'git repack'
to write a multi-pack-index, and to default core.multiPackIndex to 'true'.
This commit is available as [1].

...

[1] 
https://github.com/derrickstolee/git/commit/098dd1d515b592fb165a276241d7d68d1cde0036
 DO-NOT-MERGE: compute multi-pack-index on repack.

It should be able to _merge_ this patch, but only activate it when
some test environment variable is set. On Travis x86 we run the test
suite twice, once normal, and once with special features activated
(see "if test "$jobname" = "linux-gcc"" in ci/run-build-and-tests.sh).
I don't think midx is incompatible with any of those features, so we
could activate and run the whole test suite with it too.


Interesting. I'll look into this. A similar approach may work for the 
commit-graph.


Thanks,

-Stolee



Re: [PATCH 7/9] treewide: use get_all_packs

2018-08-21 Thread Derrick Stolee

On 8/20/2018 6:01 PM, Stefan Beller wrote:

On Mon, Aug 20, 2018 at 9:52 AM Derrick Stolee  wrote:

There are many places in the codebase that want to iterate over
all packfiles known to Git. The purposes are wide-ranging, and
those that can take advantage of the multi-pack-index already
do. So, use get_all_packs() instead of get_packed_git() to be
sure we are iterating over all packfiles.

So get_packed_git shouold not be used any further?
Do we want to annotate it as deprecated, to be deleted
in the future? Or even remove it as part of this series
(that might be an issue with other series in flight).


Perhaps the right thing to do would be to put a big warning over the 
definition to say "only use this if you really don't want 
get_all_packs()" or something. The reason I didn't remove it from 
packfile.h is that it is used in sha1-name.c alongside 
get_multi_pack_index() (so making it 'static' would be incorrect).




Re: [PATCH 3/9] midx: mark bad packed objects

2018-08-21 Thread Derrick Stolee

On 8/20/2018 5:23 PM, Stefan Beller wrote:

On Mon, Aug 20, 2018 at 9:52 AM Derrick Stolee  wrote:

When an object fails to decompress from a pack-file, we mark the object
as 'bad' so we can retry with a different copy of the object (if such a
copy exists).

Before now, the multi-pack-index did not update the bad objects list for
the pack-files it contains, and we did not check the bad objects list
when reading an object. Now, do both.

This sounds like a bug fix unlike patches 1 & 2 that sound like
feature work(2) or making code more readable(1).
(After studying the code, this doesn't sound like a bug fix any more,
but a safety thing)


It is a safety thing. One codepath that needs this includes this comment:

            /*
             * We're probably in deep shit, but let's try to fetch
             * the required base anyway from another pack or loose.
             * This is costly but should happen only in the presence
             * of a corrupted pack, and is better than failing outright.
             */

This goes back to 8eca0b47: implement some resilience against pack 
corruptions. The logic is tested in t5303-pack-corruption-resilience.sh, 
with the test title '... and a redundant pack allows for full recovery too'.



Is it worth having this on a separate track coming in
faster than the rest of this series?


ds/multi-pack-index is cooking in 'next' until after 2.19.0, so I'm not 
sure it's worth splitting things up at this point.



Signed-off-by: Derrick Stolee 
---
  midx.c | 10 ++
  1 file changed, 10 insertions(+)

diff --git a/midx.c b/midx.c
index 6824acf5f8..7fa75a37a3 100644
--- a/midx.c
+++ b/midx.c
@@ -280,6 +280,16 @@ static int nth_midxed_pack_entry(struct multi_pack_index 
*m, struct pack_entry *
 if (!is_pack_valid(p))
 return 0;

+   if (p->num_bad_objects) {
+   uint32_t i;

Is there a reason that i needs to be if 32 bits?
Would size_t (for iterating) or 'int' (as a default
like in many for loops) be ok, too?


i is bounded by p->num_bad_objects, which is also uint32_t.

Thanks,

-Stolee



[PATCH 1/1] Docs: Add commit-graph tech docs to Makefile

2018-08-21 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

Ensure that the commit-graph.txt and commit-graph-format.txt files
are compiled to HTML using ASCIIDOC.

Signed-off-by: Derrick Stolee 
---
 Documentation/Makefile | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/Documentation/Makefile b/Documentation/Makefile
index d079d7c73..841e4f705 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -69,6 +69,8 @@ API_DOCS = $(patsubst %.txt,%,$(filter-out 
technical/api-index-skel.txt technica
 SP_ARTICLES += $(API_DOCS)
 
 TECH_DOCS += SubmittingPatches
+TECH_DOCS += technical/commit-graph
+TECH_DOCS += technical/commit-graph-format
 TECH_DOCS += technical/hash-function-transition
 TECH_DOCS += technical/http-protocol
 TECH_DOCS += technical/index-format
-- 
gitgitgadget


[PATCH 0/1] Docs: Add commit-graph tech docs to Makefile

2018-08-21 Thread Derrick Stolee via GitGitGadget
Similar to [1], add the commit-graph and commit-graph-format technical docs
to Documentation/Makefile so they are automatically converted to HTML when
needed.

I compiled the docs and inspected the HTML manually in the browser. Nothing
looked strange, so I don't think the docs themselves need any editing for
format.

[1] 
https://public-inbox.org/git/20180814222846.gg142...@aiede.svl.corp.google.com/
[PATCH] partial-clone: render design doc using asciidoc

Derrick Stolee (1):
  Docs: Add commit-graph tech docs to Makefile

 Documentation/Makefile | 2 ++
 1 file changed, 2 insertions(+)


base-commit: 53f9a3e157dbbc901a02ac2c73346d375e24978c
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-22%2Fderrickstolee%2Fmake-docs-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-22/derrickstolee/make-docs-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/22
-- 
gitgitgadget


Re: [PATCH v2 0/8] Clarify commit-graph and grafts/replace/shallow incompatibilities

2018-08-20 Thread Derrick Stolee

On 8/20/2018 3:37 PM, Stefan Beller wrote:

On Mon, Aug 20, 2018 at 11:24 AM Derrick Stolee  wrote:

One unresolved issue with the commit-graph feature is that it can cause
issues when combined with replace objects, commit grafts, or shallow
clones. These are not 100% incompatible, as one could be reasonably
successful writing a commit-graph after replacing some objects and not
have issues. The problems happen when commits that are already in the
commit-graph file are replaced, or when git is run with the
`--no-replace-objects` option; this can cause incorrect parents or
incorrect generation numbers. Similar things occur with commit grafts
and shallow clones, especially when running `git fetch --unshallow` in a
shallow repo.

Instead of trying (and probably failing) to make these features work
together, default to making the commit-graph feature unavailable in these
situations. Create a new method 'commit_graph_compatible(r)' that checks
if the repository 'r' has any of these features enabled.

CHANGES IN V2:

* The first two commits regarding the ref iterators are unchanged, despite
   a lot of discussion on the subject [1].

* I included Peff's changes in jk/core-use-replace-refs, changing the base
   commit for the series to 1689c22c1c328e9135ed51458e9f9a5d224c5057 (the merge
   that brought that topic into 'msater').

* I fixed the tests for the interactions with the graft feature.

Because of the change of base, it is hard to provide a side-by-side diff
from v1.

Thanks,
-Stolee

[1] 
https://public-inbox.org/git/CAGZ79kZ3PzqpGzXWcmxjzi98gA+LT2MBOf8KaA89hOa-Qig=o...@mail.gmail.com/
 Stefan's response recommending we keep the first two commits.


After reviewing my own patches, I flipped again (Sorry!) and would
rather not see my patches be merged, but the very original solution
by you, that you proposed at [1]. That said, I will not insist on it, and
if this is merged as is, we can fix it up later.

With that said, I just read through the remaining patches, I think
they are a valuable addition to Git and could be merged as-is.

[1] 
https://github.com/gitgitgadget/git/pull/11/commits/300db80140dacc927db0d46c804ca0ef4dcc1be1

Thanks,
Stefan


Just to keep things on-list as possible, here is the patch for the 
commit linked above. It would replace patches 1 & 2 from this series.


Junio: I can send a v3 that includes this commit if you need it, or if 
there are other edits to be made in this series.


commit 300db80140dacc927db0d46c804ca0ef4dcc1be1
Author: Derrick Stolee 
Date:   Fri Jul 20 15:39:06 2018 -0400

    replace-objects: use arbitrary repositories

    This is the smallest possible change that makes prepare_replace_objects
    work properly with arbitrary repositories. By supplying the repository
    as the cb_data, we do not need to modify any code in the ref iterator
    logic. We will likely want to do a full replacement of the ref iterator
    logic to provide a repository struct as a concrete parameter.

    Signed-off-by: Derrick Stolee 

diff --git a/replace-object.c b/replace-object.c
index 801b5c1678..e99fcd1ff6 100644
--- a/replace-object.c
+++ b/replace-object.c
@@ -14,6 +14,7 @@ static int register_replace_ref(const char *refname,
    const char *slash = strrchr(refname, '/');
    const char *hash = slash ? slash + 1 : refname;
    struct replace_object *repl_obj = xmalloc(sizeof(*repl_obj));
+   struct repository *r = (struct repository *)cb_data;

    if (get_oid_hex(hash, _obj->original.oid)) {
    free(repl_obj);
@@ -25,7 +26,7 @@ static int register_replace_ref(const char *refname,
    oidcpy(_obj->replacement, oid);

    /* Register new object */
-   if (oidmap_put(the_repository->objects->replace_map, repl_obj))
+   if (oidmap_put(r->objects->replace_map, repl_obj))
    die("duplicate replace ref: %s", refname);

    return 0;
@@ -40,7 +41,7 @@ static void prepare_replace_object(struct repository *r)
    xmalloc(sizeof(*r->objects->replace_map));
    oidmap_init(r->objects->replace_map, 0);

-   for_each_replace_ref(r, register_replace_ref, NULL);
+   for_each_replace_ref(r, register_replace_ref, r);
 }

 /* We allow "recursive" replacement. Only within reason, though */



[PATCH v2 1/8] refs.c: migrate internal ref iteration to pass thru repository argument

2018-08-20 Thread Derrick Stolee
From: Stefan Beller 

In 60ce76d3581 (refs: add repository argument to for_each_replace_ref,
2018-04-11) and 0d296c57aec (refs: allow for_each_replace_ref to handle
arbitrary repositories, 2018-04-11), for_each_replace_ref learned how
to iterate over refs by a given arbitrary repository.
New attempts in the object store conversion have shown that it is useful
to have the repository handle available that the refs iteration is
currently iterating over.

To achieve this goal we will need to add a repository argument to
each_ref_fn in refs.h. However as many callers rely on the signature
such a patch would be too large.

So convert the internals of the ref subsystem first to pass through a
repository argument without exposing the change to the user. Assume
the_repository for the passed through repository, although it is not
used anywhere yet.

Signed-off-by: Stefan Beller 
Signed-off-by: Derrick Stolee 
---
 refs.c   | 39 +--
 refs.h   | 10 ++
 refs/iterator.c  |  6 +++---
 refs/refs-internal.h |  5 +++--
 4 files changed, 53 insertions(+), 7 deletions(-)

diff --git a/refs.c b/refs.c
index 457fb78057..7cd76f72d2 100644
--- a/refs.c
+++ b/refs.c
@@ -1386,17 +1386,50 @@ struct ref_iterator *refs_ref_iterator_begin(
  * non-zero value, stop the iteration and return that value;
  * otherwise, return 0.
  */
+static int do_for_each_repo_ref(struct repository *r, const char *prefix,
+   each_repo_ref_fn fn, int trim, int flags,
+   void *cb_data)
+{
+   struct ref_iterator *iter;
+   struct ref_store *refs = get_main_ref_store(r);
+
+   if (!refs)
+   return 0;
+
+   iter = refs_ref_iterator_begin(refs, prefix, trim, flags);
+
+   return do_for_each_repo_ref_iterator(r, iter, fn, cb_data);
+}
+
+struct do_for_each_ref_help {
+   each_ref_fn *fn;
+   void *cb_data;
+};
+
+static int do_for_each_ref_helper(struct repository *r,
+ const char *refname,
+ const struct object_id *oid,
+ int flags,
+ void *cb_data)
+{
+   struct do_for_each_ref_help *hp = cb_data;
+
+   return hp->fn(refname, oid, flags, hp->cb_data);
+}
+
 static int do_for_each_ref(struct ref_store *refs, const char *prefix,
   each_ref_fn fn, int trim, int flags, void *cb_data)
 {
struct ref_iterator *iter;
+   struct do_for_each_ref_help hp = { fn, cb_data };
 
if (!refs)
return 0;
 
iter = refs_ref_iterator_begin(refs, prefix, trim, flags);
 
-   return do_for_each_ref_iterator(iter, fn, cb_data);
+   return do_for_each_repo_ref_iterator(the_repository, iter,
+   do_for_each_ref_helper, );
 }
 
 int refs_for_each_ref(struct ref_store *refs, each_ref_fn fn, void *cb_data)
@@ -2025,10 +2058,12 @@ int refs_verify_refname_available(struct ref_store 
*refs,
 int refs_for_each_reflog(struct ref_store *refs, each_ref_fn fn, void *cb_data)
 {
struct ref_iterator *iter;
+   struct do_for_each_ref_help hp = { fn, cb_data };
 
iter = refs->be->reflog_iterator_begin(refs);
 
-   return do_for_each_ref_iterator(iter, fn, cb_data);
+   return do_for_each_repo_ref_iterator(the_repository, iter,
+do_for_each_ref_helper, );
 }
 
 int for_each_reflog(each_ref_fn fn, void *cb_data)
diff --git a/refs.h b/refs.h
index cc2fb4c68c..80eec8bbc6 100644
--- a/refs.h
+++ b/refs.h
@@ -274,6 +274,16 @@ struct ref_transaction;
 typedef int each_ref_fn(const char *refname,
const struct object_id *oid, int flags, void *cb_data);
 
+/*
+ * The same as each_ref_fn, but also with a repository argument that
+ * contains the repository associated with the callback.
+ */
+typedef int each_repo_ref_fn(struct repository *r,
+const char *refname,
+const struct object_id *oid,
+int flags,
+void *cb_data);
+
 /*
  * The following functions invoke the specified callback function for
  * each reference indicated.  If the function ever returns a nonzero
diff --git a/refs/iterator.c b/refs/iterator.c
index 2ac91ac340..629e00a122 100644
--- a/refs/iterator.c
+++ b/refs/iterator.c
@@ -407,15 +407,15 @@ struct ref_iterator *prefix_ref_iterator_begin(struct 
ref_iterator *iter0,
 
 struct ref_iterator *current_ref_iter = NULL;
 
-int do_for_each_ref_iterator(struct ref_iterator *iter,
-each_ref_fn fn, void *cb_data)
+int do_for_each_repo_ref_iterator(struct repository *r, struct ref_iterator 
*iter,
+ each_repo_ref_fn fn, void *cb_data)
 {
int retval = 0, ok;
struct ref_iterator *old_ref_iter = curr

[PATCH v2 8/8] commit-graph: close_commit_graph before shallow walk

2018-08-20 Thread Derrick Stolee
Call close_commit_graph() when about to start a rev-list walk that
includes shallow commits. This is necessary in code paths that "fake"
shallow commits for the sake of fetch. Specifically, test 351 in
t5500-fetch-pack.sh runs

git fetch --shallow-exclude one origin

with a file-based transfer. When the "remote" has a commit-graph, we do
not prevent the commit-graph from being loaded, but then the commits are
intended to be dynamically transferred into shallow commits during
get_shallow_commits_by_rev_list(). By closing the commit-graph before
this call, we prevent this interaction.

Signed-off-by: Derrick Stolee 
---
 commit-graph.c | 8 
 commit-graph.h | 1 +
 upload-pack.c  | 2 ++
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index cee2caab5c..4bd1a4abbf 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -260,10 +260,10 @@ static int prepare_commit_graph(struct repository *r)
return !!r->objects->commit_graph;
 }
 
-static void close_commit_graph(void)
+void close_commit_graph(struct repository *r)
 {
-   free_commit_graph(the_repository->objects->commit_graph);
-   the_repository->objects->commit_graph = NULL;
+   free_commit_graph(r->objects->commit_graph);
+   r->objects->commit_graph = NULL;
 }
 
 static int bsearch_graph(struct commit_graph *g, struct object_id *oid, 
uint32_t *pos)
@@ -875,7 +875,7 @@ void write_commit_graph(const char *obj_dir,
write_graph_chunk_data(f, GRAPH_OID_LEN, commits.list, commits.nr);
write_graph_chunk_large_edges(f, commits.list, commits.nr);
 
-   close_commit_graph();
+   close_commit_graph(the_repository);
finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
commit_lock_file();
 
diff --git a/commit-graph.h b/commit-graph.h
index 76e098934a..13d736cdde 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -59,6 +59,7 @@ void write_commit_graph(const char *obj_dir,
 
 int verify_commit_graph(struct repository *r, struct commit_graph *g);
 
+void close_commit_graph(struct repository *);
 void free_commit_graph(struct commit_graph *);
 
 #endif
diff --git a/upload-pack.c b/upload-pack.c
index 82b393ec31..2ae9d9bb47 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -24,6 +24,7 @@
 #include "quote.h"
 #include "upload-pack.h"
 #include "serve.h"
+#include "commit-graph.h"
 
 /* Remember to update object flag allocation in object.h */
 #define THEY_HAVE  (1u << 11)
@@ -740,6 +741,7 @@ static void deepen_by_rev_list(int ac, const char **av,
 {
struct commit_list *result;
 
+   close_commit_graph(the_repository);
result = get_shallow_commits_by_rev_list(ac, av, SHALLOW, NOT_SHALLOW);
send_shallow(result);
free_commit_list(result);
-- 
2.18.0.118.gd4f65b8d14



[PATCH v2 2/8] refs.c: upgrade for_each_replace_ref to be a each_repo_ref_fn callback

2018-08-20 Thread Derrick Stolee
From: Stefan Beller 

Signed-off-by: Stefan Beller 
Signed-off-by: Derrick Stolee 
---
 builtin/replace.c | 8 
 refs.c| 9 -
 refs.h| 2 +-
 replace-object.c  | 5 +++--
 4 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/builtin/replace.c b/builtin/replace.c
index e997a562f0..b5861a0ee9 100644
--- a/builtin/replace.c
+++ b/builtin/replace.c
@@ -39,7 +39,8 @@ struct show_data {
enum replace_format format;
 };
 
-static int show_reference(const char *refname, const struct object_id *oid,
+static int show_reference(struct repository *r, const char *refname,
+ const struct object_id *oid,
  int flag, void *cb_data)
 {
struct show_data *data = cb_data;
@@ -56,9 +57,8 @@ static int show_reference(const char *refname, const struct 
object_id *oid,
if (get_oid(refname, ))
return error("Failed to resolve '%s' as a valid 
ref.", refname);
 
-   obj_type = oid_object_info(the_repository, ,
-  NULL);
-   repl_type = oid_object_info(the_repository, oid, NULL);
+   obj_type = oid_object_info(r, , NULL);
+   repl_type = oid_object_info(r, oid, NULL);
 
printf("%s (%s) -> %s (%s)\n", refname, 
type_name(obj_type),
   oid_to_hex(oid), type_name(repl_type));
diff --git a/refs.c b/refs.c
index 7cd76f72d2..c5a5f727e8 100644
--- a/refs.c
+++ b/refs.c
@@ -1474,12 +1474,11 @@ int refs_for_each_fullref_in(struct ref_store *refs, 
const char *prefix,
return do_for_each_ref(refs, prefix, fn, 0, flag, cb_data);
 }
 
-int for_each_replace_ref(struct repository *r, each_ref_fn fn, void *cb_data)
+int for_each_replace_ref(struct repository *r, each_repo_ref_fn fn, void 
*cb_data)
 {
-   return do_for_each_ref(get_main_ref_store(r),
-  git_replace_ref_base, fn,
-  strlen(git_replace_ref_base),
-  DO_FOR_EACH_INCLUDE_BROKEN, cb_data);
+   return do_for_each_repo_ref(r, git_replace_ref_base, fn,
+   strlen(git_replace_ref_base),
+   DO_FOR_EACH_INCLUDE_BROKEN, cb_data);
 }
 
 int for_each_namespaced_ref(each_ref_fn fn, void *cb_data)
diff --git a/refs.h b/refs.h
index 80eec8bbc6..a0a18223a1 100644
--- a/refs.h
+++ b/refs.h
@@ -317,7 +317,7 @@ int for_each_fullref_in(const char *prefix, each_ref_fn fn, 
void *cb_data,
 int for_each_tag_ref(each_ref_fn fn, void *cb_data);
 int for_each_branch_ref(each_ref_fn fn, void *cb_data);
 int for_each_remote_ref(each_ref_fn fn, void *cb_data);
-int for_each_replace_ref(struct repository *r, each_ref_fn fn, void *cb_data);
+int for_each_replace_ref(struct repository *r, each_repo_ref_fn fn, void 
*cb_data);
 int for_each_glob_ref(each_ref_fn fn, const char *pattern, void *cb_data);
 int for_each_glob_ref_in(each_ref_fn fn, const char *pattern,
 const char *prefix, void *cb_data);
diff --git a/replace-object.c b/replace-object.c
index 4162df6324..3c17864eb7 100644
--- a/replace-object.c
+++ b/replace-object.c
@@ -6,7 +6,8 @@
 #include "repository.h"
 #include "commit.h"
 
-static int register_replace_ref(const char *refname,
+static int register_replace_ref(struct repository *r,
+   const char *refname,
const struct object_id *oid,
int flag, void *cb_data)
 {
@@ -25,7 +26,7 @@ static int register_replace_ref(const char *refname,
oidcpy(_obj->replacement, oid);
 
/* Register new object */
-   if (oidmap_put(the_repository->objects->replace_map, repl_obj))
+   if (oidmap_put(r->objects->replace_map, repl_obj))
die("duplicate replace ref: %s", refname);
 
return 0;
-- 
2.18.0.118.gd4f65b8d14



[PATCH v2 5/8] commit-graph: not compatible with replace objects

2018-08-20 Thread Derrick Stolee
Create new method commit_graph_compatible(r) to check if a given
repository r is compatible with the commit-graph feature. Fill the
method with a check to see if replace-objects exist. Test this
interaction succeeds, including ignoring an existing commit-graph and
failing to write a new commit-graph. However, we do ensure that
we write a new commit-graph by setting read_replace_refs to 0, thereby
ignoring the replace refs.

Signed-off-by: Derrick Stolee 
---
 builtin/commit-graph.c  |  4 
 commit-graph.c  | 21 +
 replace-object.c|  2 +-
 replace-object.h|  2 ++
 t/t5318-commit-graph.sh | 22 ++
 5 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 0bf0c48657..da737df321 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -120,6 +120,8 @@ static int graph_read(int argc, const char **argv)
return 0;
 }
 
+extern int read_replace_refs;
+
 static int graph_write(int argc, const char **argv)
 {
struct string_list *pack_indexes = NULL;
@@ -150,6 +152,8 @@ static int graph_write(int argc, const char **argv)
if (!opts.obj_dir)
opts.obj_dir = get_object_directory();
 
+   read_replace_refs = 0;
+
if (opts.reachable) {
write_commit_graph_reachable(opts.obj_dir, opts.append);
return 0;
diff --git a/commit-graph.c b/commit-graph.c
index b0a55ad128..2c01fa433f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -13,6 +13,8 @@
 #include "commit-graph.h"
 #include "object-store.h"
 #include "alloc.h"
+#include "hashmap.h"
+#include "replace-object.h"
 
 #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
@@ -56,6 +58,19 @@ static struct commit_graph *alloc_commit_graph(void)
return g;
 }
 
+extern int read_replace_refs;
+
+static int commit_graph_compatible(struct repository *r)
+{
+   if (read_replace_refs) {
+   prepare_replace_object(r);
+   if (hashmap_get_size(>objects->replace_map->map))
+   return 0;
+   }
+
+   return 1;
+}
+
 struct commit_graph *load_commit_graph_one(const char *graph_file)
 {
void *graph_map;
@@ -223,6 +238,9 @@ static int prepare_commit_graph(struct repository *r)
 */
return 0;
 
+   if (!commit_graph_compatible(r))
+   return 0;
+
obj_dir = r->objects->objectdir;
prepare_commit_graph_one(r, obj_dir);
prepare_alt_odb(r);
@@ -693,6 +711,9 @@ void write_commit_graph(const char *obj_dir,
int num_extra_edges;
struct commit_list *parent;
 
+   if (!commit_graph_compatible(the_repository))
+   return;
+
oids.nr = 0;
oids.alloc = approximate_object_count() / 4;
 
diff --git a/replace-object.c b/replace-object.c
index 3c17864eb7..9821f1477e 100644
--- a/replace-object.c
+++ b/replace-object.c
@@ -32,7 +32,7 @@ static int register_replace_ref(struct repository *r,
return 0;
 }
 
-static void prepare_replace_object(struct repository *r)
+void prepare_replace_object(struct repository *r)
 {
if (r->objects->replace_map)
return;
diff --git a/replace-object.h b/replace-object.h
index 9345e105dd..16528df942 100644
--- a/replace-object.h
+++ b/replace-object.h
@@ -10,6 +10,8 @@ struct replace_object {
struct object_id replacement;
 };
 
+void prepare_replace_object(struct repository *r);
+
 /*
  * This internal function is only declared here for the benefit of
  * lookup_replace_object().  Please do not call it directly.
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 4f17d7701e..e0c3c60f66 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -259,6 +259,28 @@ test_expect_success 'check that gc computes commit-graph' '
test_cmp commit-graph-after-gc $objdir/info/commit-graph
 '
 
+test_expect_success 'replace-objects invalidates commit-graph' '
+   cd "$TRASH_DIRECTORY" &&
+   test_when_finished rm -rf replace &&
+   git clone full replace &&
+   (
+   cd replace &&
+   git commit-graph write --reachable &&
+   test_path_is_file .git/objects/info/commit-graph &&
+   git replace HEAD~1 HEAD~2 &&
+   git -c core.commitGraph=false log >expect &&
+   git -c core.commitGraph=true log >actual &&
+   test_cmp expect actual &&
+   git commit-graph write --reachable &&
+   git -c core.commitGraph=false --no-replace-objects log >expect 
&&
+   git -c core.commitGraph=true --no-replace-objects log >actual &&

[PATCH v2 3/8] commit-graph: update design document

2018-08-20 Thread Derrick Stolee
As it exists right now, the commit-graph feature may provide
inconsistent results when combined with commit grafts, replace objects,
and shallow clones. Update the design document to discuss why these
interactions are difficult to reconcile and how we will avoid errors by
preventing updates to and reads from the commit-graph file when these
other features exist.

Signed-off-by: Derrick Stolee 
---
 Documentation/technical/commit-graph.txt | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt 
b/Documentation/technical/commit-graph.txt
index c664acbd76..001395e950 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -112,12 +112,24 @@ Design Details
 - The file format includes parameters for the object ID hash function,
   so a future change of hash algorithm does not require a change in format.
 
+- Commit grafts and replace objects can change the shape of the commit
+  history. The latter can also be enabled/disabled on the fly using
+  `--no-replace-objects`. This leads to difficultly storing both possible
+  interpretations of a commit id, especially when computing generation
+  numbers. The commit-graph will not be read or written when
+  replace-objects or grafts are present.
+
+- Shallow clones create grafts of commits by dropping their parents. This
+  leads the commit-graph to think those commits have generation number 1.
+  If and when those commits are made unshallow, those generation numbers
+  become invalid. Since shallow clones are intended to restrict the commit
+  history to a very small set of commits, the commit-graph feature is less
+  helpful for these clones, anyway. The commit-graph will not be read or
+  written when shallow commits are present.
+
 Future Work
 ---
 
-- The commit graph feature currently does not honor commit grafts. This can
-  be remedied by duplicating or refactoring the current graft logic.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
-- 
2.18.0.118.gd4f65b8d14



[PATCH v2 4/8] test-repository: properly init repo

2018-08-20 Thread Derrick Stolee
Signed-off-by: Derrick Stolee 
---
 t/helper/test-repository.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/t/helper/test-repository.c b/t/helper/test-repository.c
index 2762ca6562..6a84a53efb 100644
--- a/t/helper/test-repository.c
+++ b/t/helper/test-repository.c
@@ -15,7 +15,10 @@ static void test_parse_commit_in_graph(const char *gitdir, 
const char *worktree,
struct commit *c;
struct commit_list *parent;
 
-   repo_init(, gitdir, worktree);
+   setup_git_env(gitdir);
+
+   if (repo_init(, gitdir, worktree))
+   die("Couldn't init repo");
 
c = lookup_commit(, commit_oid);
 
@@ -38,7 +41,10 @@ static void test_get_commit_tree_in_graph(const char *gitdir,
struct commit *c;
struct tree *tree;
 
-   repo_init(, gitdir, worktree);
+   setup_git_env(gitdir);
+
+   if (repo_init(, gitdir, worktree))
+   die("Couldn't init repo");
 
c = lookup_commit(, commit_oid);
 
-- 
2.18.0.118.gd4f65b8d14



[PATCH v2 7/8] commit-graph: not compatible with uninitialized repo

2018-08-20 Thread Derrick Stolee
Signed-off-by: Derrick Stolee 
---
 commit-graph.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index c4eaedf4e5..cee2caab5c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -62,6 +62,9 @@ extern int read_replace_refs;
 
 static int commit_graph_compatible(struct repository *r)
 {
+   if (!r->gitdir)
+   return 0;
+
if (read_replace_refs) {
prepare_replace_object(r);
if (hashmap_get_size(>objects->replace_map->map))
-- 
2.18.0.118.gd4f65b8d14



[PATCH v2 6/8] commit-graph: not compatible with grafts

2018-08-20 Thread Derrick Stolee
Augment commit_graph_compatible(r) to return false when the given
repository r has commit grafts or is a shallow clone. Test that in these
situations we ignore existing commit-graph files and we do not write new
commit-graph files.

Helped-by: Jakub Narebski 
Signed-off-by: Derrick Stolee 
---
 commit-graph.c  |  6 ++
 commit.c|  2 +-
 commit.h|  1 +
 t/t5318-commit-graph.sh | 38 ++
 4 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index 2c01fa433f..c4eaedf4e5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -68,6 +68,12 @@ static int commit_graph_compatible(struct repository *r)
return 0;
}
 
+   prepare_commit_graft(r);
+   if (r->parsed_objects && r->parsed_objects->grafts_nr)
+   return 0;
+   if (is_repository_shallow(r))
+   return 0;
+
return 1;
 }
 
diff --git a/commit.c b/commit.c
index 30d1af2b20..ef9a2cbb23 100644
--- a/commit.c
+++ b/commit.c
@@ -209,7 +209,7 @@ static int read_graft_file(struct repository *r, const char 
*graft_file)
return 0;
 }
 
-static void prepare_commit_graft(struct repository *r)
+void prepare_commit_graft(struct repository *r)
 {
char *graft_file;
 
diff --git a/commit.h b/commit.h
index da0db36eba..5459e279fe 100644
--- a/commit.h
+++ b/commit.h
@@ -202,6 +202,7 @@ typedef int (*each_commit_graft_fn)(const struct 
commit_graft *, void *);
 
 struct commit_graft *read_graft_line(struct strbuf *line);
 int register_commit_graft(struct repository *r, struct commit_graft *, int);
+void prepare_commit_graft(struct repository *r);
 struct commit_graft *lookup_commit_graft(struct repository *r, const struct 
object_id *oid);
 
 extern struct commit_list *get_merge_bases(struct commit *rev1, struct commit 
*rev2);
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index e0c3c60f66..6aee861f78 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -281,6 +281,44 @@ test_expect_success 'replace-objects invalidates 
commit-graph' '
)
 '
 
+test_expect_success 'commit grafts invalidate commit-graph' '
+   cd "$TRASH_DIRECTORY" &&
+   test_when_finished rm -rf graft &&
+   git clone full graft &&
+   (
+   cd graft &&
+   git commit-graph write --reachable &&
+   test_path_is_file .git/objects/info/commit-graph &&
+   H1=$(git rev-parse --verify HEAD~1) &&
+   H3=$(git rev-parse --verify HEAD~3) &&
+   echo "$H1 $H3" >.git/info/grafts &&
+   git -c core.commitGraph=false log >expect &&
+   git -c core.commitGraph=true log >actual &&
+   test_cmp expect actual &&
+   git commit-graph write --reachable &&
+   git -c core.commitGraph=false --no-replace-objects log >expect 
&&
+   git -c core.commitGraph=true --no-replace-objects log >actual &&
+   test_cmp expect actual &&
+   rm -rf .git/objects/info/commit-graph &&
+   git commit-graph write --reachable &&
+   test_path_is_missing .git/objects/info/commit-graph
+   )
+'
+
+test_expect_success 'replace-objects invalidates commit-graph' '
+   cd "$TRASH_DIRECTORY" &&
+   test_when_finished rm -rf shallow &&
+   git clone --depth 2 "file://$TRASH_DIRECTORY/full" shallow &&
+   (
+   cd shallow &&
+   git commit-graph write --reachable &&
+   test_path_is_missing .git/objects/info/commit-graph &&
+   git fetch origin --unshallow &&
+   git commit-graph write --reachable &&
+   test_path_is_file .git/objects/info/commit-graph
+   )
+'
+
 # the verify tests below expect the commit-graph to contain
 # exactly the commits reachable from the commits/8 branch.
 # If the file changes the set of commits in the list, then the
-- 
2.18.0.118.gd4f65b8d14



[PATCH v2 0/8] Clarify commit-graph and grafts/replace/shallow incompatibilities

2018-08-20 Thread Derrick Stolee
One unresolved issue with the commit-graph feature is that it can cause
issues when combined with replace objects, commit grafts, or shallow
clones. These are not 100% incompatible, as one could be reasonably
successful writing a commit-graph after replacing some objects and not
have issues. The problems happen when commits that are already in the
commit-graph file are replaced, or when git is run with the
`--no-replace-objects` option; this can cause incorrect parents or
incorrect generation numbers. Similar things occur with commit grafts
and shallow clones, especially when running `git fetch --unshallow` in a
shallow repo.

Instead of trying (and probably failing) to make these features work
together, default to making the commit-graph feature unavailable in these
situations. Create a new method 'commit_graph_compatible(r)' that checks
if the repository 'r' has any of these features enabled.

CHANGES IN V2:

* The first two commits regarding the ref iterators are unchanged, despite
  a lot of discussion on the subject [1].

* I included Peff's changes in jk/core-use-replace-refs, changing the base
  commit for the series to 1689c22c1c328e9135ed51458e9f9a5d224c5057 (the merge
  that brought that topic into 'msater').

* I fixed the tests for the interactions with the graft feature.

Because of the change of base, it is hard to provide a side-by-side diff
from v1.

Thanks,
-Stolee

[1] 
https://public-inbox.org/git/CAGZ79kZ3PzqpGzXWcmxjzi98gA+LT2MBOf8KaA89hOa-Qig=o...@mail.gmail.com/
Stefan's response recommending we keep the first two commits.

Derrick Stolee (6):
  commit-graph: update design document
  test-repository: properly init repo
  commit-graph: not compatible with replace objects
  commit-graph: not compatible with grafts
  commit-graph: not compatible with uninitialized repo
  commit-graph: close_commit_graph before shallow walk

Stefan Beller (2):
  refs.c: migrate internal ref iteration to pass thru repository
argument
  refs.c: upgrade for_each_replace_ref to be a each_repo_ref_fn callback

 Documentation/technical/commit-graph.txt | 18 +--
 builtin/commit-graph.c   |  4 ++
 builtin/replace.c|  8 ++--
 commit-graph.c   | 38 +--
 commit-graph.h   |  1 +
 commit.c |  2 +-
 commit.h |  1 +
 refs.c   | 48 ---
 refs.h   | 12 -
 refs/iterator.c  |  6 +--
 refs/refs-internal.h |  5 +-
 replace-object.c |  7 +--
 replace-object.h |  2 +
 t/helper/test-repository.c   | 10 +++-
 t/t5318-commit-graph.sh  | 60 
 upload-pack.c|  2 +
 16 files changed, 194 insertions(+), 30 deletions(-)


base-commit: 1689c22c1c328e9135ed51458e9f9a5d224c5057
-- 
2.18.0.118.gd4f65b8d14



Re: What's cooking in git.git (Aug 2018, #04; Fri, 17)

2018-08-20 Thread Derrick Stolee

On 8/20/2018 1:26 PM, Junio C Hamano wrote:

Jonathan Nieder  writes:


Usually, I refrain from merging larger topics in 'next' down to
'master' when we get close to -rc0, but I am wondering if it is
better to merge all of them to 'master', even the ones on the larger
and possibly undercooked side, expecting that we collectively spend
effort on hunting and fixing bugs in them during the pre-release
freeze period.  If we were to go that route, I'd want everybody's
buy-in and I'll promise to ignore any shiny new toys that appear on
list that are not regression fixes to topics merged to 'master'
since the end of the previous cycle to make sure people are not
distracted.

Based on what I see in 'next' (midx, range-diff, etc), I quite like
this idea.

The comment above was about the ones that was marked as "Will merge
to 'master'" that are in 'next', not the ones marked as "Will cook
in 'next'", like the midx ones.

I am not worried about range-diff, as I do not think it came close
enough to the core-ish code to break other things; the potential
damage is fairly isolated and the worst would be that we'd realize
that we shipped with a new command that was buggy after the release,
which is not the end of the world.  'midx' and 'reachable' are quite
different stories, as bugs in them would in the worst case lead to
an immediate and unrecoverable repository corruption.

So I am still on the fence, even though I am leaning toward merging
them down to 'master'.


I'm happy to wait until 2.19 is cut for ds/multi-pack-index to merge 
down. I sent a new series today [1] that improves the series and 
corrects some bugs I found while making the full Git test suite work 
when writing a multi-pack-index on every repack. Nothing was 
"unrecoverable repository corruption," but I understand the concern 
there. Making the safest decision is probably the best decision.


I'm happy to contribute to a "pre-2.19 bug bash" after you merge things 
into master.


Thanks,

-Stolee

[1] 
https://public-inbox.org/git/20180820165124.152146-1-dsto...@microsoft.com/T/#u


    [PATCH 0/9] multi-pack-index cleanups



<    1   2   3   4   5   6   7   8   9   10   >