Re: Git Test Coverage Report (Saturday, Oct 27)

2018-10-27 Thread Derrick Stolee

On 10/27/2018 9:55 AM, Junio C Hamano wrote:

Derrick Stolee  writes:



Uncovered in mater not in master@{1}


Does this typo indicate that some part of the process to produce and
send out this report involve manual editing?
I kick off four builds with on Azure Pipelines [1], wait for the builds 
to finish, then copy the appropriate sections out of the output 
(trimming the timestamps). If I were more familiar with the Pipelines 
features, I'm pretty sure I could make four parallel jobs inside the 
same build that ran the tests on each branch, export the logs as 
artifacts, then auto-assemble the email at the end. I plan to do that 
soon, but I'm still keeping a close eye on the results to see when and 
where failures occur.


Thanks,
-Stolee

[1] https://dev.azure.com/git/git/_build?definitionId=5


Git Test Coverage Report (Saturday, Oct 27)

2018-10-27 Thread Derrick Stolee
d not write '%s''"), todo_file);
6424061be4 110) return error(_("could not copy '%s' to '%s'."), todo_file,
b74a37a5a7 174) goto leave_check;

refs.c
3a3b9d8cde  657) return 0;

refs/files-backend.c
revision.c
a5dda2204a  726) if (repo_parse_commit(the_repository, p) < 0)

sequencer.c
a5dda2204a 1597) repo_unuse_commit_buffer(the_repository, head_commit,
a5dda2204a 3821) repo_unuse_commit_buffer(the_repository,
b5d6062402 4486) strbuf_insert(buf, todo_list->items[insert].offset_in_buf +
b5d6062402 4498) int sequencer_add_exec_commands(const char *commands)
06d8136126 4505) return error_errno(_("could not read '%s'."), todo_file);
b5d6062402 4507) if (todo_list_parse_insn_buffer(todo_list.buf.buf, 
&todo_list)) {

b5d6062402 4512) todo_list_add_exec_commands(&todo_list, commands);
b5d6062402 4513) res = write_message(todo_list.buf.buf, 
todo_list.buf.len, todo_file, 0);

0cce4a2756 4514) todo_list_release(&todo_list);
b5d6062402 4516) return res;
b74a37a5a7 4576) goto out;
b74a37a5a7 4581) goto out;
b8dac44d10 4721) todo_list_release(&new_todo);
009173ed7b 4726) todo_list_release(&new_todo);
009173ed7b 4727) return error_errno(_("could not write '%s'"), todo_file);
9787d17d40 4921) int rearrange_squash_in_todo_file(void)
9787d17d40 4923) const char *todo_file = rebase_path_todo();
9787d17d40 4924) struct todo_list todo_list = TODO_LIST_INIT;
9787d17d40 4925) int res = 0;
9787d17d40 4927) if (strbuf_read_file_or_whine(&todo_list.buf, 
todo_file) < 0)

9787d17d40 4928) return -1;
9787d17d40 4929) if (todo_list_parse_insn_buffer(todo_list.buf.buf, 
&todo_list) < 0) {

9787d17d40 4930) todo_list_release(&todo_list);
9787d17d40 4931) return -1;
9787d17d40 4934) res = todo_list_rearrange_squash(&todo_list);
9787d17d40 4935) if (!res)
9787d17d40 4936) res = rewrite_file(todo_file, todo_list.buf.buf, 
todo_list.buf.len);

9787d17d40 4938) todo_list_release(&todo_list);

setup.c
58b284a2e9  413) return config_error_nonbool(var);

sha1-array.c
7fdf90d541 91) oidcpy(&oids[dst], &oids[src]);

submodule-config.c
bcbc780d14 740) return CONFIG_INVALID_KEY;
45f5ef3d77 755) warning(_("Could not update .gitmodules entry %s"), key);

submodule.c
b303ef65e7  524) the_repository->submodule_prefix :
060675d4fc 1378) strbuf_release(&gitdir);
183be9660a 1501) struct get_next_submodule_task *task = task_cb;
183be9660a 1505) get_next_submodule_task_release(task);
183be9660a 1532) return 0;
183be9660a 1536) goto out;
183be9660a 1551) return 0;

tree.c
a5dda2204a 108) if (repo_parse_commit(the_repository, commit))

worktree.c
3a3b9d8cde 495) return -1;
3a3b9d8cde 508) return -1;
3a3b9d8cde 517) return -1;
ab3e1f78ae 537) break;

wrapper.c
5efde212fc  70) die("Out of memory, malloc failed (tried to allocate %" 
PRIuMAX " bytes)",
5efde212fc  73) error("Out of memory, malloc failed (tried to allocate 
%" PRIuMAX " bytes)",


Commits introducing uncovered code:
Alban Gruin  009173ed7: sequencer: change complete_action() to use 
the refactored functions
Alban Gruin  06d813612: sequencer: fix a call to error() in 
transform_todo_file()
Alban Gruin  6424061be: rebase-interactive: rewrite edit_todo_list() 
to handle the initial edit
Alban Gruin  7ccfac40b: rebase--interactive: move 
transform_todo_file() to rebase--interactive.c
Alban Gruin  9787d17d4: sequencer: refactor rearrange_squash() to 
work on a todo_list
Alban Gruin  b5d606240: sequencer: refactor 
sequencer_add_exec_commands() to work on a todo_list
Alban Gruin  b74a37a5a: sequencer: refactor check_todo_list() to 
work on a todo_list
Alban Gruin  b8dac44d1: sequencer: refactor skip_unnecessary_picks() 
to work on a todo_list
Antonio Ospite  45f5ef3d7: submodule: factor out a 
config_set_in_gitmodules_file_gently function
Antonio Ospite  bcbc780d1: submodule: add a 
print_config_from_gitmodules() helper
Antonio Ospite  c22b82014: submodule: support reading .gitmodules 
when it's not in the working tree

Derrick Stolee  1dcd9f204: midx: close multi-pack-index on repack
Derrick Stolee  dc7d66433: packfile: close multi-pack-index in 
close_all_packs
Elijah Newren  387361a6a: merge-recursive: improve 
rename/rename(1to2)/add[/add] handling
Elijah Newren  4cdc48e41: merge-recursive: new function for better 
colliding conflict resolutions
Elijah Newren  b58ae691c: merge-recursive: fix rename/add conflict 
handling

Junio C Hamano  a5dda2204: treewide: apply cocci patch
Liam Beguin  0cce4a275: rebase -i -x: add exec commands via the 
rebase--helper

Linus Torvalds  74e8221b5: Add 'human' date format
Martin Koegler  5efde212f: zlib.c: use size_t for size
Nguyễn Thái Ngọc Duy  3a3b9d8cd: refs: new ref types to make 
per-worktree refs visible to all worktrees

Nguyễn Thái Ngọc Duy  58b2

Re: [PATCH v3 7/7] revision.c: refactor basic topo-order logic

2018-10-25 Thread Derrick Stolee

On 10/25/2018 5:43 AM, Jeff King wrote:

On Thu, Oct 11, 2018 at 12:21:44PM -0400, Derrick Stolee wrote:


2. INDEGREE: using the indegree_queue priority queue (ordered
 by maximizing the generation number), add one to the in-
 degree of each parent for each commit that is walked. Since
 we walk in order of decreasing generation number, we know
 that discovering an in-degree value of 0 means the value for
 that commit was not initialized, so should be initialized to
 two. (Recall that in-degree value "1" is what we use to say a
 commit is ready for output.) As we iterate the parents of a
 commit during this walk, ensure the EXPLORE walk has walked
 beyond their generation numbers.

I wondered how this would work for INFINITY. We can't know the order of
a bunch of INFINITY nodes at all, so we never know when their in-degree
values are "done". But if I understand the EXPLORE walk, we'd basically
walk all of INFINITY down to something with a real generation number. Is
that right?

But after that, I'm not totally clear on why we need this INDEGREE walk.

The INDEGREE walk is an important element for Kahn's algorithm. The final
output order is dictated by peeling commits of "indegree zero" to ensure all
children are output before their parents. (Note: since we use literal 0 to
mean "uninitialized", we peel commits when the indegree slab has value 1.)

This walk replaces the indegree logic from sort_in_topological_order(). That
method performs one walk that fills the indegree slab, then another walk
that peels the commits with indegree 0 and inserts them into a list.

I guess my big question here was: if we have generation numbers, do we
need Kahn's algorithm? That is, in a fully populated set of generation
numbers (i.e., no INFINITY), we could always just pick a commit with the
highest generation number to show.

So if we EXPLORE down to a real generation number in phase 1, why do we
need to care about INDEGREE anymore? Or am I wrong that we always have a
real generation number (i.e., not INFINITY) after EXPLORE? (And if so,
why is exploring to a real generation number a bad idea; presumably
it's due to a worst-case that goes deeper than we'd otherwise need to
here).


The issue is that we our final order (used by level 3) is unrelated to 
generation number. Yes, if we prioritized by generation number then we 
could output the commits in _some_ order that doesn't violate 
topological constraints. However, we are asking for a different 
priority, which is different than the generation number priority.


In the case of "--topo-order", we want to output the commits reachable 
from the second parent of a merge before the commits reachable from the 
first parent. However, in most cases the generation number of the first 
parent is higher than the second parent (there are more things in the 
merge chain than in a small topic that got merged). The INDEGREE is what 
allows us to know when we can peel these commits while still respecting 
the priority we want at the end.


Thanks,
-Stolee


[PATCH] packfile: close multi-pack-index in close_all_packs

2018-10-25 Thread Derrick Stolee
Whenever we delete pack-files from the object directory, we need
to also delete the multi-pack-index that may refer to those
objects. Sometimes, this connection is obvious, like during a
repack. Other times, this is less obvious, like when gc calls
a repack command and then does other actions on the objects, like
write a commit-graph file.

The pattern we use to avoid out-of-date in-memory packed_git
structs is to call close_all_packs(). This should also call
close_midx(). Since we already pass an object store to
close_all_packs(), this is a nicely scoped operation.

This fixes a test failure when running t6500-gc.sh with
GIT_TEST_MULTI_PACK_INDEX=1.

Reported-by: Szeder Gábor 
Signed-off-by: Derrick Stolee 
---

Thanks for the report, Szeder! I think this is the correct fix,
and more likely to be permanent across all builtins that run
auto-GC. I based it on ds/test-multi-pack-index so it could easily
be added on top.

-Stolee

 packfile.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/packfile.c b/packfile.c
index 841b36182f..37fcd8f136 100644
--- a/packfile.c
+++ b/packfile.c
@@ -339,6 +339,11 @@ void close_all_packs(struct raw_object_store *o)
BUG("want to close pack marked 'do-not-close'");
else
close_pack(p);
+
+   if (o->multi_pack_index) {
+   close_midx(o->multi_pack_index);
+   o->multi_pack_index = NULL;
+   }
 }
 
 /*

base-commit: 0465a50506023df0932fe0534fe6ac6712c0d854
-- 
2.17.1



Recommended configurations (was Re: [PATCH v1 2/2] reset: add new reset.quietDefault config setting)

2018-10-24 Thread Derrick Stolee

On 10/23/2018 4:03 PM, Ævar Arnfjörð Bjarmason wrote:

[snip]
The --ahead-behind config setting stalled on-list before:
https://public-inbox.org/git/36e3a9c3-f7e2-4100-1bfc-647b809a0...@jeffhostetler.com/

Now we have this similarly themed thing.

I think we need to be mindful of how changes like this can add up to
very confusing UI. I.e. in this case I can see a "how take make git fast
on large repos" post on stackoverflow in our future where the answer is
setting a bunch of seemingly irrelevant config options like reset.quiet
and status.aheadbehind=false etc.

So maybe we should take a step back and consider if the real thing we
want is just some way for the user to tell git "don't work so hard at
coming up with these values".

That can also be smart, e.g. some "auto" setting that tweaks it based on
estimated repo size so even with the same config your tiny dotfiles.git
will get "ahead/behind" reporting, but not when you cd into windows.git.


Generally, there are a lot of config settings that are likely in the "if 
you have a big repo, then you should use this" category. However, there 
is rarely a one-size-fits-all solution to these problems, just like 
there are different ways a repo can be "big" (working directory? number 
of commits? submodules?).


I would typically expect that users with _really_ big repos have the 
resources to have an expert tweak the settings that are best for that 
data shape and share those settings with all the users. In VFS for Git, 
we turn certain config settings on by default when mounting the repo 
[1], but others are either signaled through warning messages (like the 
status.aheadBehind config setting [2]).


We never upstreamed the status.aheadBehind config setting [2], but we 
_did_ upstream the command-line option as fd9b544 "status: add 
--[no-]ahead-behind to status and commit for V2 format". We didn't want 
to change the expected output permanently, so we didn't add the config 
setting to our list of "required" settings, but instead created a list 
of optional settings [3]; these settings don't override the existing 
settings so users can opt-out. (Now that we have the commit-graph 
enabled and kept up-to-date, it may be time to revisit the importance of 
this setting.)


All of this is to say: it is probably a good idea to have some 
"recommended configuration" for big repos, but there will always be 
power users who want to tweak each and every one of these settings. I'm 
open to design ideas of how to store a list of recommended 
configurations and how to set a group of config settings with one 
command (say, a "git recommended-config [small|large|submodules]" 
builtin that fills the local config with the important settings).


Thanks,
-Stolee

[1] 
https://github.com/Microsoft/VFSForGit/blob/7daa9f1133764a4e4bd87014833fc2091e6702c1/GVFS/GVFS/CommandLine/GVFSVerb.cs#L79-L104

    Code in VFS for Git that enables "required" config settings.

[2] 
https://github.com/Microsoft/git/commit/0cbe9e6b23e4d9008d4a1676e1dd6a87bdcd6ed5

    status: add status.aheadbehind setting

[3] 
https://github.com/Microsoft/VFSForGit/blob/7daa9f1133764a4e4bd87014833fc2091e6702c1/GVFS/GVFS/CommandLine/GVFSVerb.cs#L120-L123

    Code in VFS for Git that enables "optional" config settings.


Re: [PATCH v3 09/12] commit-graph: convert to using the_hash_algo

2018-10-24 Thread Derrick Stolee

On 10/21/2018 10:43 PM, brian m. carlson wrote:

Instead of using hard-coded constants for object sizes, use
the_hash_algo to look them up.  In addition, use a function call to look
up the object ID version and produce the correct value.  For now, we use
version 1, which means to use the default algorithm used in the rest of
the repository.


This patch looks good to me. Later, we will want to delete the 
commit-graph file during the "upgrade the repo to newhash" process, but 
that can wait.


Thanks,
-Stolee


Re: [PATCH] commit-reach: fix sorting commits by generation

2018-10-24 Thread Derrick Stolee

On 10/23/2018 4:32 PM, Thomas Gummerer wrote:

On 10/22, René Scharfe wrote:

Am 22.10.2018 um 23:10 schrieb Thomas Gummerer:

Anyway, your implied question was discussed back then.  Derrick wrote:

The reason to sort is to hopefully minimize the amount we walk by
exploring the "lower" commits first. This is a performance-only thing,
not a correctness issue (which is why the bug exists). Even then, it is
just a heuristic.

Thanks for pointing that out!


Does b6723e4671 in pu (commit-reach: fix first-parent heuristic) change
that picture?  Did a quick test and found no performance difference with
and without the fix on top, i.e. proper sorting didn't seem to matter.

I just gave 'test-tool reach can_all_from_reach' a try and got the
same results, with or without the fix the times are very similar.  I
haven't had time to follow the commit-graph series though, so I'm not
sure I used it correctly.


Thanks for your attention here. I've been thinking a lot about this 
heuristic and have concluded the following two things are true:


(1) When we return 1, the order that we explore the 'from' commits does 
not change the explored set of commits.


(2) When we return 0, the order that we explore the 'to' commits will 
change the explored set, but it is difficult to say that the heuristic 
helps more than it hurts.


Item (1) is contrary to what I had thought when I first created the 
heuristic.


The details are tricky, but essentially each DFS starting at a 'from' 
commit may be short-circuited due to a prior walk, but swapping the 
order of those two 'from' commits would lead to the same set of commits 
to be explored (with the short-circuit happening in the other commit). 
The only change is that we can terminate early if we fully explore a 
'from' commit and do not find a commit marked with 'with_flag'. In this 
sense, it would be best to explore the commits that are "closest" to the 
generation number cutoff first, as we can maybe find a negative answer 
earlier in the search.


In this sense, we could remove the sort entirely and probably not have 
much performance hit. But since the set of 'from' commits is probably 
much smaller than the set of commits to explore, the sort is likely 
inexpensive.


In conclusion: I cannot say that this sort is super-important. As for 
the potential benefits in (2), I'll leave that to people who run git as 
a server who may have telemetry around fetch negotiation. How often do 
we actually say we need more rounds of negotiation? What kinds of data 
shapes matter there?


Thanks,
-Stolee


Re: [PATCH v4 6/7] revision.c: generation-based topo-order algorithm

2018-10-23 Thread Derrick Stolee

On 10/22/2018 9:37 AM, Jakub Narebski wrote:

"Derrick Stolee via GitGitGadget"  writes:


From: Derrick Stolee 

The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n " or filling the first page of a pager.

When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:

1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.

2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.

Because in-degree has very specific defined meaning of number of
children, i.e. the number of _incoming_ edges, I would say "if and only
if their in-degree-plus-one is one".  It is more exact, even if it looks
a bit funny.


3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().

All right, so those are separate steps (separate walks): prepare and
parse commits, topologically sort list of commits from previous step,
output sorted list of commits from previous step.


I would rephrase your explanation above as: prepare and parse commits, 
compute in-degrees, and peel commits of in-degree zero.



In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk.

What does 'higher order' walk means: steps 3, 2, 1, in this order,
i.e. output being the highest order, or something different?


Yes. We only walk "level 2" in order to satisfy how far we are in "level 3".


Sidenote: the new algorithm looks a bit like Unix pipeline, where each
step of pipeline does not output much more than next step needs /
requests.


That's essentially the idea.


  We know when we can
pause each walk by using generation numbers from the commit-
graph feature.

Do I understand it correctly that this is mainly used in Kahn's
algorithm to find out through the negative-cut index of generation
number which commits in the to-be-sorted list cannot have an in-degree
of zero (or otherise cannot be next commit to be shown in output)?


In each step of the algorithm, we operate under the assumption that 
certain vertices have "all necessary information".


In the case of "level 3", we need to know that all descendants were 
walked and our in-degree calculation is correct. We guarantee this by 
ensuring that "level 2" has walked beyond that commit's generation number.


In the case of "level 2", we need to know that we have parsed all 
descendants and determined their simplifications (if necessary, such as 
in file-history) and if they are UNINTERESTING. We guarantee this by 
ensuring that "level 1" has walked beyond that commit's generation number.


In the previous algorithm, these guarantees were handled by doing each 
step on all reachable commits before moving to the next level.



Recall that the generation number of a commit satisfies:

* If the commit has at least one parent, then the generation
   number is one more than the maximum generation number among
   its parents.

* If the commit has no parent, then the generation number is one.

There are two special generation numbers:

* GENERATION_NUMBER_INFINITY: this value is 0x and
   indicates that the commit is not stored in the commit-graph and
   the generation number was not previously calculated.

* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
   to say that the commit-graph was generated by a version of Git
   that does not compute generation numbers (such as v2.18.0).

Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement t

[RFC PATCH] revision.c: use new algorithm in A..B case

2018-10-21 Thread Derrick Stolee
Signed-off-by: Derrick Stolee 
---

I just wanted to mention that in order to use the new logic for 'git log
--topo-order A..B', we just need the following patch. It is an extra
time that sets 'revs->limited' to 1, triggering the old logic.

You can use this for comparison purposes, but I'm not ready to do this
until more performance testing is ready in this case. Since these
comparison commands are already pretty fast when the diff is small,
there is less urgency to improve performance here.

Thanks,
-Stolee

 revision.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/revision.c b/revision.c
index 472f3994e3..8e5656f7b4 100644
--- a/revision.c
+++ b/revision.c
@@ -278,10 +278,8 @@ static struct commit *handle_commit(struct rev_info *revs,
 
if (parse_commit(commit) < 0)
die("unable to parse commit %s", name);
-   if (flags & UNINTERESTING) {
+   if (flags & UNINTERESTING)
mark_parents_uninteresting(commit);
-   revs->limited = 1;
-   }
if (revs->sources) {
char **slot = revision_sources_at(revs->sources, 
commit);
 
-- 
2.19.1



Re: [PATCH v4 4/7] revision.c: begin refactoring --topo-order logic

2018-10-21 Thread Derrick Stolee

On 10/21/2018 9:12 PM, Junio C Hamano wrote:

Jakub Narebski  writes:


So if revs->limited is set (but not because revs->topo_order is set),
which means A..B queries, we will be still using the old algorithm.
All right, though I wonder if it could be improved in the future
(perhaps with the help of other graph labelling / indices than
generation numbers, maybe a positive-cut index).

Do you have an idea why there is no improvement with the new code in
this case?

I didn't get the impression that it would not be possible to improve
the "--topo A..B" case by using generation numbers from this series.
Isn't it just because the necessary code has not been written yet?
In addition to what is needed for "--topo P1 P2 P3..." (all
positive), limited walk needs to notice the bottom boundary and stop
traversal.  Having generation numbers would make it slightly easier
than without, as you know that a positive commit you have will not
be marked UNINTERESTING due to a negative commit whose ancestors
have not been explored, as long as that negative commit has a higher
generation number.  But you still need to adjust the traversal logic
to properly terminate upon hitting UNINTERESTING one, and also
propagate the bit down the history, which is not needed at all if
you only want to support the "positive only" case.


Actually, the code has been written, but the problem is the same as the 
performance issue when I made paint_down_to_common() use generation 
numbers: the algorithm for deciding what is in the set "reachable from A 
but not reachable from B" uses commit-date order as a heuristic to avoid 
walking the entire graph. Yes, the revision parameters specify "limited" 
in order to call "limit_list()", but it uses the same algorithm to 
determine the reachable set difference.


You can test this yourself! Run the following two commands in the Git 
repository using v2.19.1:


    time git log --topo-order -10 master >/dev/null

    time git log --topo-order -10 maint..master >/dev/null

I get 0.39s for the first call and 0.01s for the second. (Note: I 
specified "-10" to ensure we are only writing 10 commits and the output 
size does not factor into the time.) This is because the first walks the 
entire history, while the second uses the heuristic walk to identify a 
much smaller subgraph that the topo-order algorithm uses.


Just as before, by using this algorithm for the B..A case, we are adding 
an extra restriction on the algorithm: always be correct. This results 
in us walking a larger set (everything reachable from B or A with 
generation number at least the smallest generation of a commit reachable 
from only one).


I believe this can be handled by using a smarter generation number (one 
that relies on commit date as a heuristic, but still have enough 
information to guarantee topological relationships), and I've already 
started testing a few of these directions. It is possible now that we 
have concrete graph algorithms to use on real repositories. I hope to 
share a report on my findings in a couple weeks. I'll include how using 
this algorithm compares to the old algorithm in the B..A case.


Thanks,

-Stolee



Re: [PATCH v4 3/7] test-reach: add rev-list tests

2018-10-21 Thread Derrick Stolee

On 10/21/2018 6:21 AM, Jakub Narebski wrote:

"Derrick Stolee via GitGitGadget"  writes:


From: Derrick Stolee 

The rev-list command is critical to Git's functionality. Ensure it
works in the three commit-graph environments constructed in
t6600-test-reach.sh. Here are a few important types of rev-list
operations:

* Basic: git rev-list --topo-order HEAD
* Range: git rev-list --topo-order compare..HEAD
* Ancestry: git rev-list --topo-order --ancestry-path compare..HEAD
* Symmetric Difference: git rev-list --topo-order compare...HEAD

Could you remind us here which of those operations will be using
generation numbers after this patch series?


For this series, we are focused only on the --topo-order with a single 
start position. The versions that use a compare branch still use the old 
logic. In the future, I would like to use the new logic for these other 
modes.


  
+test_expect_success 'rev-list: basic topo-order' '

+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 
commit-1-6 \
+   commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 
commit-1-5 \
+   commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 
commit-1-4 \
+   commit-6-3 commit-5-3 commit-4-3 commit-3-3 commit-2-3 
commit-1-3 \
+   commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 
commit-1-2 \
+   commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 
commit-1-1 \
+   >expect &&
+   run_three_modes git rev-list --topo-order commit-6-6
+'

I wonder if this test could be make easier to write and less error
prone, e.g. creating it from ASCII-art graphics.

But it is good enough.


I did lay out the branch names in a grid layout similar to the 
commit-graph layout. It's easier to see the purposeful layout in the 
comparison sections where some commits don't appear in the output.


Thanks,

-Stolee



Re: What's cooking in git.git (Oct 2018, #04; Fri, 19)

2018-10-19 Thread Derrick Stolee

On 10/19/2018 2:02 AM, Junio C Hamano wrote:


* ds/ci-commit-graph-and-midx (2018-10-19) 1 commit
  - ci: add optional test variables

  One of our CI tests to run with "unusual/experimental/random"
  settings now also uses commti-graph and midx.

  Will merge to 'next'.


s/commti-graph/commit-graph/


* ds/test-multi-pack-index (2018-10-09) 3 commits
  - multi-pack-index: define GIT_TEST_MULTI_PACK_INDEX
  - midx: close multi-pack-index on repack
  - midx: fix broken free() in close_midx()

  Tests for the recently introduced multi-pack index machinery.

  Expecting a reroll.
  cf. <8b5dbe3d-b382-bf48-b524-d9e8a074a...@gmail.com>


A reroll exists: 
https://public-inbox.org/git/pull.27.v2.git.gitgitgad...@gmail.com/


Thanks,
-Stolee



Re: [PATCH 1/1] commit-reach: fix first-parent heuristic

2018-10-19 Thread Derrick Stolee

On 10/19/2018 1:24 AM, Junio C Hamano wrote:

"Derrick Stolee via GitGitGadget"  writes:


We can also re-run the performance tests from commit 4fbcca4e
"commit-reach: make can_all_from_reach... linear".

Performance was measured on the Linux repository using
'test-tool reach can_all_from_reach'. The input included rows seeded by
tag values. The "small" case included X-rows as v4.[0-9]* and Y-rows as
v3.[0-9]*. This mimics a (very large) fetch that says "I have all major
v3 releases and want all major v4 releases." The "large" case included
X-rows as "v4.*" and Y-rows as "v3.*". This adds all release-candidate
tags to the set, which does not greatly increase the number of objects
that are considered, but does increase the number of 'from' commits,
demonstrating the quadratic nature of the previous code.

These micro-benchmarks are interesting, but we should also remember
to describe the impact of the bug being fixed in the larger picture.
What end-user visible operations are affected?  Computing merge-base?
Finding if a push fast-forwards?  Something else?


Sorry about that. Here is a description that could be inserted into the 
commit message:


The can_all_from_reach() method synthesizes the logic for one iteration 
of can_all_from_reach_with_flags() which is used by the server during 
fetch negotiation to determine if more haves/wants are needed. The logic 
is also used by is_descendant_of() which is used to check if a 
force-push is required and in 'git branch --contains'.


Thanks,
-Stolee


Re: [PATCH] multi-pack-index: avoid dead store for struct progress

2018-10-18 Thread Derrick Stolee

On 10/18/2018 2:59 PM, Carlo Marcelo Arenas Belón wrote:

it is initialized unconditionally by a call to start_progress
below.

Signed-off-by: Carlo Marcelo Arenas Belón 
---
  midx.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index ea2f3ffe2e..4fac0cd08a 100644
--- a/midx.c
+++ b/midx.c
@@ -941,7 +941,7 @@ static void midx_report(const char *fmt, ...)
  int verify_midx_file(const char *object_dir)
  {
uint32_t i;
-   struct progress *progress = NULL;
+   struct progress *progress;
struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
verify_midx_error = 0;
  
This makes sense as a cleanup. Is there a tool that reports a wasted 
initialization that you used to find this?


-Stolee


[PATCH 1/1] commit-reach: fix first-parent heuristic

2018-10-18 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The algorithm in can_all_from_reach_with_flags() performs a depth-
first-search, terminated by generation number, intending to use
a hueristic that "important" commits are found in the first-parent
history. This heuristic is valuable in scenarios like fetch
negotiation.

However, there is a problem! After the search finds a target commit,
it should pop all commits off the stack and mark them as "can reach".
This logic is incorrect, so the algorithm instead walks all reachable
commits above the generation-number cutoff.

The existing algorithm is still an improvement over the previous
algorithm, as the worst-case complexity went from quadratic to linear.
The performance measurement at the time was good, but not dramatic.
By fixing this heuristic, we reduce the number of walked commits.

We can also re-run the performance tests from commit 4fbcca4e
"commit-reach: make can_all_from_reach... linear".

Performance was measured on the Linux repository using
'test-tool reach can_all_from_reach'. The input included rows seeded by
tag values. The "small" case included X-rows as v4.[0-9]* and Y-rows as
v3.[0-9]*. This mimics a (very large) fetch that says "I have all major
v3 releases and want all major v4 releases." The "large" case included
X-rows as "v4.*" and Y-rows as "v3.*". This adds all release-candidate
tags to the set, which does not greatly increase the number of objects
that are considered, but does increase the number of 'from' commits,
demonstrating the quadratic nature of the previous code.

Small Case:

4fbcca4e~1: 0.85 s
  4fbcca4e: 0.26 s (num_walked: 1,011,035)
  HEAD: 0.14 s (num_walked: 8,601)

Large Case:

4fbcca4e~1: 24.0  s
  4fbcca4e:  0.12 s (num_walked:  503,925)
  HEAD:  0.06 s (num_walked:  217,243)

Signed-off-by: Derrick Stolee 
---
 commit-reach.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/commit-reach.c b/commit-reach.c
index 9f79ce0a22..79419be8af 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -593,8 +593,10 @@ int can_all_from_reach_with_flag(struct object_array *from,
while (stack) {
struct commit_list *parent;
 
-   if (stack->item->object.flags & with_flag) {
+   if (stack->item->object.flags & (with_flag | RESULT)) {
pop_commit(&stack);
+   if (stack)
+   stack->item->object.flags |= RESULT;
continue;
}
 
-- 
gitgitgadget


[PATCH 0/1] commit-reach: fix first-parent heuristic

2018-10-18 Thread Derrick Stolee via GitGitGadget
I originally reported this fix [1] after playing around with the trace2
series for measuring performance. Since trace2 isn't merging quickly, I
pulled the performance fix patch out and am sending it on its own. The only
difference here is that we don't have the tracing to verify the performance
fix in the test script.

See the patch message for details about the fix.

Thanks, -Stolee

[1] 
https://public-inbox.org/git/20180906151309.66712-7-dsto...@microsoft.com/

[RFC PATCH 6/6] commit-reach: fix first-parent heuristic

Derrick Stolee (1):
  commit-reach: fix first-parent heuristic

 commit-reach.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)


base-commit: a4b8ab5363a32f283a61ef3a962853556d136c0e
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-51%2Fderrickstolee%2Ffirst-parent-heuristic-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-51/derrickstolee/first-parent-heuristic-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/51
-- 
gitgitgadget


Re: [PATCH v2 13/13] commit-graph: specify OID version for SHA-256

2018-10-18 Thread Derrick Stolee

On 10/17/2018 8:06 PM, brian m. carlson wrote:

On Wed, Oct 17, 2018 at 04:31:19PM +0200, Duy Nguyen wrote:

On Wed, Oct 17, 2018 at 12:44 AM brian m. carlson
 wrote:

Honestly, anything in the .git directory that is not the v3 pack indexes
or the loose object file should be in exactly one hash algorithm.  We
could simply just leave this value at 1 all the time and ignore the
field, since we already know what algorithm it will use.

In this particular case, I agree, but not as a general principle. It's
nice to have independence for fsck-like tools. I don't know if we have
a tool that simply validates commit-graph file format (and not trying
to access any real object). But for such a tool, I guess we can just
pass the hash algorithm from command line. The user would have to
guess a bit.

I'm going to drop this patch for now.  I'll send a follow-up series
later which bumps the format version for this and the multi-pack index
and serializes them with the four-byte value.  I probably should have
caught this earlier, but unfortunately I don't always have the time to
look at every series that hits the list.
We should coordinate before incrementing the version number. I was 
working on making the file formats incremental (think split-index) but 
couldn't come up with a way to do it without incrementing the file 
format. It would be best to combine these features into v2 of each format.


Thanks,
-Stolee


Re: [PATCH 0/3] Use commit-graph by default

2018-10-18 Thread Derrick Stolee

On 10/17/2018 11:47 PM, Junio C Hamano wrote:

If I recall correctly, one more task that was discussed but hasn't
been addressed well is how the generation and incremental update of
it should integrate with the normal repository maintenance workflow
(perhaps "gc --auto").  If we are going to turn it on by default, it
would be good to see if we can avoid multiple independent walks done
over the same history graph by repack, prune and now commit-graph,
before such a change happens.
I don't remember a discussion on this, but it is an interesting point. 
I'll give it some thought to see if we can combine these walks.


Thanks,
-Stolee


Re: [PATCH 1/3] t6501: use --quiet when testing gc stderr

2018-10-18 Thread Derrick Stolee




On 10/18/2018 1:23 AM, Junio C Hamano wrote:

"Derrick Stolee via GitGitGadget"  writes:


From: Derrick Stolee 

The test script t6501-freshen-objects.sh has some tests that care
if 'git gc' has any output to stderr. This is intended to say that
no warnings occurred related to broken links. However, when we
have operations that output progress (like writing the commit-graph)
this causes the test to fail.

I see that the descriptor #2 is redirected into a regular file.  Why
should we be writing the progress indicator in that case in the
first place?  Shoudln't we be doing the usual "are we showing this
to an interactive terminal?" test?


This code from builtin/gc.c makes it look like we are doing that:

    if (gc_write_commit_graph)
    write_commit_graph_reachable(get_object_directory(), 0,
 !quiet && !daemonized);

But really, daemonized is only for when running in the background.

Do you have an example of a builtin that checks this interactive 
terminal behavior?


Thanks,
-Stolee


[PATCH 3/3] commit-graph: Use commit-graph by default

2018-10-17 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The config setting "core.commitGraph" enables using the commit-graph
file to accelerate commit walks through parsing speed and generation
numbers. The setting "gc.writeCommitGraph" enables writing the
commit-graph file on every non-trivial 'git gc' operation. Together,
they help users automatically improve their performance.

By setting these config variables to true by default, we make the
commit-graph feature an "opt-out" feature instead of "opt-in".

Signed-off-by: Derrick Stolee 
---
 Documentation/config.txt | 4 ++--
 builtin/gc.c | 2 +-
 commit-graph.c   | 6 +++---
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index f6f4c21a54..dc5ee7c145 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -923,7 +923,7 @@ the `GIT_NOTES_REF` environment variable.  See 
linkgit:git-notes[1].
 
 core.commitGraph::
If true, then git will read the commit-graph file (if it exists)
-   to parse the graph structure of commits. Defaults to false. See
+   to parse the graph structure of commits. Defaults to true. See
linkgit:git-commit-graph[1] for more information.
 
 core.useReplaceRefs::
@@ -1639,7 +1639,7 @@ gc.writeCommitGraph::
If true, then gc will rewrite the commit-graph file when
linkgit:git-gc[1] is run. When using linkgit:git-gc[1]
'--auto' the commit-graph will be updated if housekeeping is
-   required. Default is false. See linkgit:git-commit-graph[1]
+   required. Default is true. See linkgit:git-commit-graph[1]
for details.
 
 gc.logExpiry::
diff --git a/builtin/gc.c b/builtin/gc.c
index 871a56f1c5..77e7413f94 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -41,7 +41,7 @@ static int aggressive_depth = 50;
 static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
-static int gc_write_commit_graph;
+static int gc_write_commit_graph = 1;
 static int detach_auto = 1;
 static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
diff --git a/commit-graph.c b/commit-graph.c
index a686758603..a459272466 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -232,15 +232,15 @@ static int prepare_commit_graph(struct repository *r)
 {
struct alternate_object_database *alt;
char *obj_dir;
-   int config_value;
+   int config_value = 1;
 
if (r->objects->commit_graph_attempted)
return !!r->objects->commit_graph;
r->objects->commit_graph_attempted = 1;
 
+   repo_config_get_bool(r, "core.commitgraph", &config_value);
if (!git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
-   (repo_config_get_bool(r, "core.commitgraph", &config_value) ||
-   !config_value))
+   !config_value)
/*
 * This repository is not configured to use commit graphs, so
 * do not load one. (But report commit_graph_attempted anyway
-- 
gitgitgadget


[PATCH 1/3] t6501: use --quiet when testing gc stderr

2018-10-17 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The test script t6501-freshen-objects.sh has some tests that care
if 'git gc' has any output to stderr. This is intended to say that
no warnings occurred related to broken links. However, when we
have operations that output progress (like writing the commit-graph)
this causes the test to fail.

Use 'git gc --quiet' to avoid these progress indicators from causing
a test failure.

Signed-off-by: Derrick Stolee 
---
 t/t6501-freshen-objects.sh | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/t/t6501-freshen-objects.sh b/t/t6501-freshen-objects.sh
index 033871ee5f..0973130f06 100755
--- a/t/t6501-freshen-objects.sh
+++ b/t/t6501-freshen-objects.sh
@@ -137,7 +137,7 @@ test_expect_success 'do not complain about existing broken 
links (commit)' '
some message
EOF
commit=$(git hash-object -t commit -w broken-commit) &&
-   git gc 2>stderr &&
+   git gc --quiet 2>stderr &&
verbose git cat-file -e $commit &&
test_must_be_empty stderr
 '
@@ -147,7 +147,7 @@ test_expect_success 'do not complain about existing broken 
links (tree)' '
100644 blob 0003foo
EOF
tree=$(git mktree --missing stderr &&
+   git gc --quiet 2>stderr &&
git cat-file -e $tree &&
test_must_be_empty stderr
 '
@@ -162,7 +162,7 @@ test_expect_success 'do not complain about existing broken 
links (tag)' '
this is a broken tag
EOF
tag=$(git hash-object -t tag -w broken-tag) &&
-   git gc 2>stderr &&
+   git gc --quiet 2>stderr &&
git cat-file -e $tag &&
test_must_be_empty stderr
 '
-- 
gitgitgadget



[PATCH 0/3] Use commit-graph by default

2018-10-17 Thread Derrick Stolee via GitGitGadget
The commit-graph feature is starting to stabilize. Based on what is in
master right now, we have:

Git 2.18:

 * Ability to write commit-graph (requires user interaction).
   
   
 * Commit parsing is faster when commit-graph exists.
   
   
 * Must have core.commitGraph true to use.
   
   

Git 2.19:

 * Ability to write commit-graph on GC with gc.writeCommitGraph.
   
   
 * Generation numbers written in commit-graph
   
   
 * A few reachability algorithms make use of generation numbers.
   
   

(queued for) master:

 * The test suite passes with GIT_TEST_COMMIT_GRAPH=1
   
   
 * 'git commit-graph write' has progress indicators.
   
   
 * The commit-graph is automatically disabled when grafts or replace-objects
   exist.
   
   

There are some other things coming that are in review (like 'git log
--graph' speedups), but it is probably time to consider enabling the
commit-graph by default. This series does that.

For timing, I'm happy to leave this queued for a merge after the Git 2.20
release. There are enough things in master to justify not enabling this by
default until that goes out and more people use it.

PATCH 3/3 is rather simple, and is the obvious thing to do to achieve
enabling these config values by default.

PATCH 1/3 is a required change to make the test suite work with this change.
This change isn't needed with GIT_TEST_COMMIT_GRAPH=1 because the
commit-graph is up-to-date for these 'git gc' calls, so no progress is
output.

PATCH 2/3 is also a necessary evil, since we already had to disable
GIT_TEST_COMMIT_GRAPH for some tests, we now also need to turn off
core.commitGraph.

Thanks, -Stolee

Derrick Stolee (3):
  t6501: use --quiet when testing gc stderr
  t: explicitly turn off core.commitGraph as needed
  commit-graph: Use commit-graph by default

 Documentation/config.txt| 4 ++--
 builtin/gc.c| 2 +-
 commit-graph.c  | 6 +++---
 t/t0410-partial-clone.sh| 3 ++-
 t/t5307-pack-missing-commit.sh  | 3 ++-
 t/t6011-rev-list-with-bad-commit.sh | 3 ++-
 t/t6024-recursive-merge.sh  | 3 ++-
 t/t6501-freshen-objects.sh  | 6 +++---
 8 files changed, 17 insertions(+), 13 deletions(-)


base-commit: a4b8ab5363a32f283a61ef3a962853556d136c0e
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-50%2Fderrickstolee%2Fcommit-graph-default-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-50/derrickstolee/commit-graph-default-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/50
-- 
gitgitgadget


[PATCH 2/3] t: explicitly turn off core.commitGraph as needed

2018-10-17 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

There are a few tests that already require GIT_TEST_COMMIT_GRAPH=0
as they rely on an interaction with the commits in the object store
that is circumvented by parsing commit information from the
commit-graph instead. Before enabling core.commitGraph as true by
default, explicitly turn the setting off for these tests.

Signed-off-by: Derrick Stolee 
---
 t/t0410-partial-clone.sh| 3 ++-
 t/t5307-pack-missing-commit.sh  | 3 ++-
 t/t6011-rev-list-with-bad-commit.sh | 3 ++-
 t/t6024-recursive-merge.sh  | 3 ++-
 4 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index cfd0655ea1..f5277fafb1 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -193,7 +193,8 @@ test_expect_success 'rev-list stops traversal at missing 
and promised commit' '
 
git -C repo config core.repositoryformatversion 1 &&
git -C repo config extensions.partialclone "arbitrary string" &&
-   GIT_TEST_COMMIT_GRAPH=0 git -C repo rev-list --exclude-promisor-objects 
--objects bar >out &&
+   GIT_TEST_COMMIT_GRAPH=0 git -c core.commitGraph=false \
+   -C repo rev-list --exclude-promisor-objects --objects bar >out 
&&
grep $(git -C repo rev-parse bar) out &&
! grep $FOO out
 '
diff --git a/t/t5307-pack-missing-commit.sh b/t/t5307-pack-missing-commit.sh
index dacb440b27..dc4c19d0aa 100755
--- a/t/t5307-pack-missing-commit.sh
+++ b/t/t5307-pack-missing-commit.sh
@@ -16,7 +16,8 @@ test_expect_success setup '
obj=$(git rev-parse --verify tag3) &&
fanout=$(expr "$obj" : "\(..\)") &&
remainder=$(expr "$obj" : "..\(.*\)") &&
-   rm -f ".git/objects/$fanout/$remainder"
+   rm -f ".git/objects/$fanout/$remainder" &&
+   git config core.commitGraph false
 '
 
 test_expect_success 'check corruption' '
diff --git a/t/t6011-rev-list-with-bad-commit.sh 
b/t/t6011-rev-list-with-bad-commit.sh
index 545b461e51..da6949743d 100755
--- a/t/t6011-rev-list-with-bad-commit.sh
+++ b/t/t6011-rev-list-with-bad-commit.sh
@@ -42,7 +42,8 @@ test_expect_success 'corrupt second commit object' \
'
 
 test_expect_success 'rev-list should fail' '
-   test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git rev-list --all > 
/dev/null
+   test_must_fail env GIT_TEST_COMMIT_GRAPH=0 \
+   git -c core.commitGraph=false rev-list --all > /dev/null
 '
 
 test_expect_success 'git repack _MUST_ fail' \
diff --git a/t/t6024-recursive-merge.sh b/t/t6024-recursive-merge.sh
index 27c7de90ce..fccdf96f13 100755
--- a/t/t6024-recursive-merge.sh
+++ b/t/t6024-recursive-merge.sh
@@ -61,7 +61,8 @@ GIT_AUTHOR_DATE="2006-12-12 23:00:08" git commit -m F
 '
 
 test_expect_success 'combined merge conflicts' '
-   test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G
+   test_must_fail env GIT_TEST_COMMIT_GRAPH=0 \
+   git -c core.commitGraph=false merge -m final G
 '
 
 cat > expect << EOF
-- 
gitgitgadget



Re: commit-graph is cool (overcoming add_missing_tags() perf issues)

2018-10-17 Thread Derrick Stolee

On 10/17/2018 2:00 PM, Elijah Newren wrote:

Hi,

Just wanted to give a shout-out for the commit-graph work and how
impressive it is.  I had an internal report from a user that git
pushes containing only one new tiny commit were taking over a minute
(in a moderate size repo with good network connectivity). After
digging for a while, I noticed three unusual things about the repo[1]:
   * he had push.followTags set to true
   * upstream repo had about 20k tags (despite only 55k commits)
   * his repo had an additional 2.5k tags, but none of these were in
 the history of the branches he was pushing and thus would not be
 included in any pushes.

Digging in, almost all the time was CPU-bound and spent in
add_missing_tags()[2].  If I'm reading the code correctly, it appears
that function loops over each tag, calling in_merge_bases_many() once
per tag.  Thus, for his case, we were potentially walking all of
history of the main branch 2.5k times.  That seemed rather suboptimal.


Thanks for the report. I made a note to inspect add_missing_tags() for 
more improvement in the future.



Before attempting to optimize, I decided to try out the commit-graph
with a version of git from pu.  While I expected a speed-up, I was a
bit suprised that it was a factor of over 100; dropping the time for
local dry-run push[2] to sub-second.  A quick look suggests that
commit-graph doesn't fix the fact that we call in_merge_bases_many() N
times from add_missing_tags() and thus likely need to do N merge base
computations, it just makes each of the N much faster.  So, perhaps
there's still another scaling issue we'll eventually need to address,
but for now, I'm pretty excited about commit-graph.


Without the commit-graph, you are getting a quadratic problem (N commits 
* T tags), but with the commit-graph you are also getting the benefit of 
generation numbers, so the "N commits" is actually likely _zero_ for 
most tags, because the tags have strictly lower generation number. In 
those cases, we can terminate without any walk at all.


Thanks!
-Stolee



Git Test Coverage Report (Wednesday, Oct 17)

2018-10-17 Thread Derrick Stolee
513)    ieot->nr = nr;
3255089ada 3514)    for (i = 0; i < nr; i++) {
3255089ada 3515)    ieot->entries[i].offset = get_be32(index);
3255089ada 3516)    index += sizeof(uint32_t);
3255089ada 3517)    ieot->entries[i].nr = get_be32(index);
3255089ada 3518)    index += sizeof(uint32_t);
3255089ada 3521)    return ieot;
3255089ada 3524) static void write_ieot_extension(struct strbuf *sb, 
struct index_entry_offset_table *ieot)

3255089ada 3530)    put_be32(&buffer, IEOT_VERSION);
3255089ada 3531)    strbuf_add(sb, &buffer, sizeof(uint32_t));
3255089ada 3534)    for (i = 0; i < ieot->nr; i++) {
3255089ada 3537)    put_be32(&buffer, ieot->entries[i].offset);
3255089ada 3538)    strbuf_add(sb, &buffer, sizeof(uint32_t));
3255089ada 3541)    put_be32(&buffer, ieot->entries[i].nr);
3255089ada 3542)    strbuf_add(sb, &buffer, sizeof(uint32_t));
3255089ada 3544) }

revision.c
2abf350385 1525) if (ce_path_match(istate, ce, &revs->prune_data, NULL)) {
2abf350385 1531) while ((i+1 < istate->cache_nr) &&
2abf350385 1532)    ce_same_name(ce, istate->cache[i+1]))

split-index.c
e3d837989e 335) ce->ce_flags |= CE_UPDATE_IN_BASE;

transport-helper.c
fb19d32f05  643) if (!data->connect && !data->stateless_connect)

transport.c
wt-status.c
f3bd35fa0d  671) s->committable = 1;
73ba5d78b4 1953) if (s->state.rebase_in_progress ||
73ba5d78b4 1954) s->state.rebase_interactive_in_progress)
73ba5d78b4 1955) branch_name = s->state.onto;
73ba5d78b4 1956) else if (s->state.detached_from)
73ba5d78b4 1957) branch_name = s->state.detached_from;

Commits introducing uncovered code:
Ben Peart  3255089ad: ieot: add Index Entry Offset Table (IEOT) 
extension

Ben Peart  3b1d9e045: eoie: add End of Index Entry (EOIE) extension
Ben Peart  77ff1127a: read-cache: load cache entries on worker threads
Ben Peart  abb4bb838: read-cache: load cache extensions on a worker 
thread

Ben Peart  c780b9cfe: config: add new index.threads config setting
Josh Steadmon  e001fd3a5: archive: implement protocol v2 archive command
Josh Steadmon  fb19d32f0: archive: allow archive over HTTP(S) with 
proto v2
Matthew DeVore  7c0fe330d: rev-list: handle missing tree objects 
properly

Matthew DeVore  bc5975d24: list-objects-filter: implement filter tree:0
Matthew DeVore  cc0b05a4c: list-objects-filter-options: do not 
over-strbuf_init
Matthew DeVore  f447a499d: list-objects: store common func args in 
struct
Nguyễn Thái Ngọc Duy  252d079cb: read-cache.c: optimize reading 
index format v4
Nguyễn Thái Ngọc Duy  26c7d0678: help -a: improve and make --verbose 
default
Nguyễn Thái Ngọc Duy  2abf35038: revision.c: remove implicit 
dependency on the_index

Nguyễn Thái Ngọc Duy  a470beea3: blame.c: rename "repo" argument to "r"
Nguyễn Thái Ngọc Duy  ae9af1228: status: show progress bar if 
refreshing the index takes too long
Nguyễn Thái Ngọc Duy  b78ea5fc3: diff.c: reduce implicit dependency 
on the_index

René Scharfe  8b2f8cbcb: oidset: use khash
Stephen P. Smith  73ba5d78b: roll wt_status_state into wt_status and 
populate in the collect phase
Stephen P. Smith  f3bd35fa0: wt-status.c: set the committable flag 
in the collect phase
SZEDER Gábor  e3d837989: split-index: don't compare cached data of 
entries already marked for split index



Uncovered Code in 'master' (a4b8ab5) but not in 5a0cc8a
---

builtin/gc.c
3029970275 builtin/gc.c 461) ret = error_errno(_("cannot stat '%s'"), 
gc_log_path);
3029970275 builtin/gc.c 470) ret = error_errno(_("cannot read '%s'"), 
gc_log_path);

fec2ed2187 builtin/gc.c 495) die(FAILED_RUN, pack_refs_cmd.argv[0]);
fec2ed2187 builtin/gc.c 498) die(FAILED_RUN, reflog.argv[0]);
3029970275 builtin/gc.c 585) exit(128);
fec2ed2187 builtin/gc.c 637) die(FAILED_RUN, repack.argv[0]);
fec2ed2187 builtin/gc.c 647) die(FAILED_RUN, prune.argv[0]);
fec2ed2187 builtin/gc.c 654) die(FAILED_RUN, prune_worktrees.argv[0]);
fec2ed2187 builtin/gc.c 658) die(FAILED_RUN, rerere.argv[0]);

builtin/pack-objects.c
2fa233a554 builtin/pack-objects.c 1512) hashcpy(base_oid.hash, base_sha1);
2fa233a554 builtin/pack-objects.c 1513) if 
(!in_same_island(&delta->idx.oid, &base_oid))

2fa233a554 builtin/pack-objects.c 1514) return 0;

commit-graph.c
5cef295f28   67) return 0;
20fd6d5799   79) return 0;

midx.c
e43d2dcce1  285) if (hasheq(oid.hash,
e43d2dcce1  286)    p->bad_object_sha1 + the_hash_algo->rawsz * i))

refs.c
4a6067cda5 1405) return 0;

Commits introducing uncovered code:
Derrick Stolee  20fd6d579: commit-graph: not compatible with grafts
Derrick Stolee  5cef295f2: commit-graph: not compatible with 
uninitialized repo
Jeff King  2fa233a55: pack-objects: handle island check for 
"external" delta base

Jeff King  e43d2dcce: more oideq/hasheq conversions
Jonathan Nieder  302997027: gc: do not return error for prior errors 
in daemonized mode

Jonathan Nieder  fec2ed218: gc: exit with status 128 on failure
Stefan Beller  4a6067cda: refs.c: migrate internal ref iteration to 
pass thru repository argument


Re: [PATCH 0/1] Run GIT_TEST_COMMIT_GRAPH and GIT_TEST_MULTI_PACK_INDEX during CI

2018-10-17 Thread Derrick Stolee

On 10/17/2018 9:00 AM, Derrick Stolee via GitGitGadget wrote:

[1] https://git.visualstudio.com/git/_build?definitionId=4
Newlines are hard. Sorry for the formatting issues when translating from 
a PR description.

Build definition
that tests Git with different arrangements of GIT_TEST_* variables.

Thanks,
-Stolee


Re: [PATCH] test-tool: show tool list on error

2018-10-17 Thread Derrick Stolee

On 10/17/2018 5:25 AM, Jeff King wrote:

Before we switched to one big test-tool binary, if you
forgot the name of a tool, you could use tab-completion in
the shell to get a hint. But these days, all you get is:

   $ t/helper/test-tool approxidate
   fatal: There is no test named 'approxidate'

and you're stuck reading the source code to find it. Let's
print a list of the available tools in this case.

Signed-off-by: Jeff King 
---
Not really user-facing, but this bugged me enough earlier to write the
patch. ;)

Some of the individual tools have nice help, too (try
"t/helper/test-tool date", which shows the approxidate command I was
looking for), but some of them could probably stand to improve their
friendliness (try "t/helper/test-tool config"). I think it's fine for
people to improve them over time if and when they get annoyed.

This will improve contributors' lives! Thanks.

Reviewed-by: Derrick Stolee 


[PATCH 0/1] Run GIT_TEST_COMMIT_GRAPH and GIT_TEST_MULTI_PACK_INDEX during CI

2018-10-17 Thread Derrick Stolee via GitGitGadget
Our CI scripts include a step to run the test suite with certain optional
variables enabled. Now that all branches should build and have tests succeed
with GIT_TEST_COMMIT_GRAPH and GIT_TEST_MULTI_PACK_INDEX enabled, add those
variables to that stage.

Note: the GIT_TEST_MULTI_PACK_INDEX variable has not merged all the way
down, so will be ignored if this series is merged faster than that one.
However, it is safe to make these changes orthogonally as all (known) test
breaks with GIT_TEST_MULTI_PACK_INDEX=1 are fixed in the topic that
introduces the variable.

I also created a build definition on Azure Pipelines that runs the test
suite with different subsets of the test variables, split by the following
types:

1) Index options 2) Commit-graph 3) Multi-pack-index

These builds are found at [1]. There are benefits to testing the variables
all together but also separately. I didn't want to create new stages in the
CI scripts to avoid consuming extra resources.

This series is based on js/vsts-ci to avoid conflicts with that series, but
is not necessarily a hard dependence.

Thanks, -Stolee

[1] https://git.visualstudio.com/git/_build?definitionId=4Build definition
that tests Git with different arrangements of GIT_TEST_* variables.

Derrick Stolee (1):
  ci: add optional test variables

 ci/run-build-and-tests.sh | 2 ++
 1 file changed, 2 insertions(+)


base-commit: d82963f34cf6921ed29d1fc2d96b16acf9005159
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-49%2Fderrickstolee%2Fci-vars-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-49/derrickstolee/ci-vars-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/49
-- 
gitgitgadget


[PATCH 1/1] ci: add optional test variables

2018-10-17 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The commit-graph and multi-pack-index features introduce optional
data structures that are not required for normal Git operations.
It is important to run the normal test suite without them enabled,
but it is helpful to also run the test suite using them.

Our continuous integration scripts include a second test stage that
runs with optional GIT_TEST_* variables enabled. Add the following
two variables to that stage:

  GIT_TEST_COMMIT_GRAPH
  GIT_TEST_MULTI_PACK_INDEX

This will slow down the operation, as we build a commit-graph file
after every 'git commit' operation and build a multi-pack-index
during every 'git repack' operation. However, it is important that
future changes are compatible with these features.

Signed-off-by: Derrick Stolee 
---
 ci/run-build-and-tests.sh | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
index e28ac2fb9a..db342bb6a8 100755
--- a/ci/run-build-and-tests.sh
+++ b/ci/run-build-and-tests.sh
@@ -15,6 +15,8 @@ then
export GIT_TEST_FULL_IN_PACK_ARRAY=true
export GIT_TEST_OE_SIZE=10
export GIT_TEST_OE_DELTA_SIZE=5
+   export GIT_TEST_COMMIT_GRAPH=1
+   export GIT_TEST_MULTI_PACK_INDEX=1
make --quiet test
 fi
 
-- 
gitgitgadget


Re: [PATCH 00/19] Bring more repository handles into our code base

2018-10-17 Thread Derrick Stolee

On 10/16/2018 7:35 PM, Stefan Beller wrote:
 
 This series takes another approach as it doesn't change the signature of

 functions, but introduces new functions that can deal with arbitrary
 repositories, keeping the old function signature around using a shallow 
wrapper.

I think this is a good direction, and the changes look good to me.


 Additionally each patch adds a semantic patch, that would port from the 
old to
 the new function. These semantic patches are all applied in the very last 
patch,
 but we could omit applying the last patch if it causes too many merge 
conflicts
 and trickl in the semantic patches over time when there are no merge 
conflicts.


The semantic patches are a good idea. At some point in the future, we 
can submit a series that applies the patches to any leftover calls and 
removes the old callers. We can hopefully rely on review (and the 
semantic patches warning that there is work to do) to prevent new 
callers from being introduced in future topics. That doesn't count 
topics that come around while this one isn't merged down.


I had one high-level question: How are we testing that these "arbitrary 
repository" changes are safe? I just remember the issue we had with the 
commit-graph code relying on arbitrary repositories, but then adding a 
reference to the replace objects broke the code (because replace-objects 
wasn't using arbitrary repos correctly, but the commit-graph was tested 
with arbitrary repositories using "test-tool repository"). It would be 
nice to introduce more method calls in t/helper/test-repository.c that 
help us know these are safe conversions. Otherwise, we are essentially 
waiting until we try to take submodule things in-process and find out 
what breaks. As we discovered with the refstore, we can't just ensure 
that all references to the_repository are removed.


Thanks,
-Stolee


Re: [PATCH v2 13/13] commit-graph: specify OID version for SHA-256

2018-10-17 Thread Derrick Stolee

On 10/14/2018 10:19 PM, brian m. carlson wrote:

Since the commit-graph code wants to serialize the hash algorithm into
the data store, specify a version number for each supported algorithm.
Note that we don't use the values of the constants themselves, as they
are internal and could change in the future.

Signed-off-by: brian m. carlson 
---
  commit-graph.c | 9 -
  1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index 7a28fbb03f..e587c21bb6 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -45,7 +45,14 @@ char *get_commit_graph_filename(const char *obj_dir)
  
  static uint8_t oid_version(void)

  {
-   return 1;
+   switch (hash_algo_by_ptr(the_hash_algo)) {
hash_algo_by_ptr is specified as an inline function in hash.h, so this 
leads to a linking error in jch and pu right now.


I think the fix is simply to add '#include "hash.h"' to the list of 
includes.


Thanks,
-Stolee


[PATCH v4 4/7] revision.c: begin refactoring --topo-order logic

2018-10-16 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():

* revs->limited implies we run limit_list() to walk the entire
  reachable set. There are some short-cuts here, such as if we
  perform a range query like 'git rev-list COMPARE..HEAD' and we
  can stop limit_list() when all queued commits are uninteresting.

* revs->topo_order implies we run sort_in_topological_order(). See
  the implementation of that method in commit.c. It implies that
  the full set of commits to order is in the given commit_list.

These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.

If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.

In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.

When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.

In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:

1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
   struct. We use the presence of this struct as a signal to use the
   new methods during our walk. In this patch, this method simply
   calls limit_list() and sort_in_topological_order(). In the future,
   this method will set up a new data structure to perform that logic
   in-line.

2. next_topo_commit() provides get_revision_1() with the next topo-
   ordered commit in the list. Currently, this simply pops the commit
   from revs->commits.

3. expand_topo_walk() provides get_revision_1() with a way to signal
   walking beyond the latest commit. Currently, this calls
   add_parents_to_list() exactly like the old logic.

While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.

Signed-off-by: Derrick Stolee 
---
 revision.c | 42 ++
 revision.h |  4 
 2 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/revision.c b/revision.c
index e18bd530e4..2dcde8a8ac 100644
--- a/revision.c
+++ b/revision.c
@@ -25,6 +25,7 @@
 #include "worktree.h"
 #include "argv-array.h"
 #include "commit-reach.h"
+#include "commit-graph.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -2454,7 +2455,7 @@ int setup_revisions(int argc, const char **argv, struct 
rev_info *revs, struct s
if (revs->diffopt.objfind)
revs->simplify_history = 0;
 
-   if (revs->topo_order)
+   if (revs->topo_order && !generation_numbers_enabled(the_repository))
revs->limited = 1;
 
if (revs->prune_data.nr) {
@@ -2892,6 +2893,33 @@ static int mark_uninteresting(const struct object_id 
*oid,
return 0;
 }
 
+struct topo_walk_info {};
+
+static void init_topo_walk(struct rev_info *revs)
+{
+   struct topo_walk_info *info;
+   revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
+   info = revs->topo_walk_info;
+   memset(info, 0, sizeof(struct topo_walk_info));
+
+   limit_list(revs);
+   sort_in_topological_order(&revs->commits, revs->sort_order);
+}
+
+static struct commit *next_topo_commit(struct rev_info *revs)
+{
+   return pop_commit(&revs->commits);
+}
+
+static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
+{
+   if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
+   if (!revs->ignore_missing_links)
+   die("Failed to traverse parents of commit %s",
+   oid_to_hex(&commit->object.oid));
+   }
+}
+
 int prepare_revision_walk(struct rev_info *revs)
 {
int i;
@@ -2928,11 +2956,13 @@ int prepare_revision_walk(struct rev_info *revs)
commit_list_sort_by_date(&revs->commits);
if (revs->no_walk)
return 0;
-   if (revs->limited)
+   if (revs->limited) {
if (limit_list(revs) < 0)
return -1;
-   if (revs->topo_order)
-   sort_in_topological_order(&a

[PATCH v4 6/7] revision.c: generation-based topo-order algorithm

2018-10-16 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n " or filling the first page of a pager.

When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:

1. Run limit_list(), which parses all reachable commits,
   adds them to a linked list, and distributes UNINTERESTING
   flags. If all unprocessed commits are UNINTERESTING, then
   it may terminate without walking all reachable commits.
   This does not occur if we do not specify UNINTERESTING
   commits.

2. Run sort_in_topological_order(), which is an implementation
   of Kahn's algorithm. It first iterates through the entire
   set of important commits and computes the in-degree of each
   (plus one, as we use 'zero' as a special value here). Then,
   we walk the commits in priority order, adding them to the
   priority queue if and only if their in-degree is one. As
   we remove commits from this priority queue, we decrement the
   in-degree of their parents.

3. While we are peeling commits for output, get_revision_1()
   uses pop_commit on the full list of commits computed by
   sort_in_topological_order().

In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.

Recall that the generation number of a commit satisfies:

* If the commit has at least one parent, then the generation
  number is one more than the maximum generation number among
  its parents.

* If the commit has no parent, then the generation number is one.

There are two special generation numbers:

* GENERATION_NUMBER_INFINITY: this value is 0x and
  indicates that the commit is not stored in the commit-graph and
  the generation number was not previously calculated.

* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
  to say that the commit-graph was generated by a version of Git
  that does not compute generation numbers (such as v2.18.0).

Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:

If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.

Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.

The walks are as follows:

1. EXPLORE: using the explore_queue priority queue (ordered by
   maximizing the generation number), parse each reachable
   commit until all commits in the queue have generation
   number strictly lower than needed. During this walk, update
   the UNINTERESTING flags as necessary.

2. INDEGREE: using the indegree_queue priority queue (ordered
   by maximizing the generation number), add one to the in-
   degree of each parent for each commit that is walked. Since
   we walk in order of decreasing generation number, we know
   that discovering an in-degree value of 0 means the value for
   that commit was not initialized, so should be initialized to
   two. (Recall that in-degree value "1" is what we use to say a
   commit is ready for output.) As we iterate the parents of a
   commit during this walk, ensure the EXPLORE walk has walked
   beyond their generation numbers.

3. TOPO: using the topo_queue priority queue (ordered based on
   the sort_order given, which could be commit-date, author-
   date, or typical topo-order which treats the queue as a LIFO
   stack), remove a commit from the queue and decrement the
   in-degree of each parent. If a parent has an in-degree of
   one, then we add it to the topo_queue. Before we decrement
   the in-degree, however, ensure the INDEGREE walk has walked
   beyond that generation number.

The implementations of these walks are in the following methods:

* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk

These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.

One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this imp

[PATCH v4 3/7] test-reach: add rev-list tests

2018-10-16 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The rev-list command is critical to Git's functionality. Ensure it
works in the three commit-graph environments constructed in
t6600-test-reach.sh. Here are a few important types of rev-list
operations:

* Basic: git rev-list --topo-order HEAD
* Range: git rev-list --topo-order compare..HEAD
* Ancestry: git rev-list --topo-order --ancestry-path compare..HEAD
* Symmetric Difference: git rev-list --topo-order compare...HEAD

Signed-off-by: Derrick Stolee 
---
 t/t6600-test-reach.sh | 84 +++
 1 file changed, 84 insertions(+)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index 9d65b8b946..288f703b7b 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -243,4 +243,88 @@ test_expect_success 'commit_contains:miss' '
test_three_modes commit_contains --tag
 '
 
+test_expect_success 'rev-list: basic topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 
commit-1-6 \
+   commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 
commit-1-5 \
+   commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 
commit-1-4 \
+   commit-6-3 commit-5-3 commit-4-3 commit-3-3 commit-2-3 
commit-1-3 \
+   commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 
commit-1-2 \
+   commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 
commit-1-1 \
+   >expect &&
+   run_three_modes git rev-list --topo-order commit-6-6
+'
+
+test_expect_success 'rev-list: first-parent topo-order' '
+   git rev-parse \
+   commit-6-6 \
+   commit-6-5 \
+   commit-6-4 \
+   commit-6-3 \
+   commit-6-2 \
+   commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 
commit-1-1 \
+   >expect &&
+   run_three_modes git rev-list --first-parent --topo-order commit-6-6
+'
+
+test_expect_success 'rev-list: range topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 
commit-1-6 \
+   commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 
commit-1-5 \
+   commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 
commit-1-4 \
+   commit-6-3 commit-5-3 commit-4-3 \
+   commit-6-2 commit-5-2 commit-4-2 \
+   commit-6-1 commit-5-1 commit-4-1 \
+   >expect &&
+   run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
+'
+
+test_expect_success 'rev-list: range topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 \
+   commit-6-5 commit-5-5 commit-4-5 \
+   commit-6-4 commit-5-4 commit-4-4 \
+   commit-6-3 commit-5-3 commit-4-3 \
+   commit-6-2 commit-5-2 commit-4-2 \
+   commit-6-1 commit-5-1 commit-4-1 \
+   >expect &&
+   run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
+'
+
+test_expect_success 'rev-list: first-parent range topo-order' '
+   git rev-parse \
+   commit-6-6 \
+   commit-6-5 \
+   commit-6-4 \
+   commit-6-3 \
+   commit-6-2 \
+   commit-6-1 commit-5-1 commit-4-1 \
+   >expect &&
+   run_three_modes git rev-list --first-parent --topo-order 
commit-3-8..commit-6-6
+'
+
+test_expect_success 'rev-list: ancestry-path topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 commit-3-6 \
+   commit-6-5 commit-5-5 commit-4-5 commit-3-5 \
+   commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
+   commit-6-3 commit-5-3 commit-4-3 \
+   >expect &&
+   run_three_modes git rev-list --topo-order --ancestry-path 
commit-3-3..commit-6-6
+'
+
+test_expect_success 'rev-list: symmetric difference topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 \
+   commit-6-5 commit-5-5 commit-4-5 \
+   commit-6-4 commit-5-4 commit-4-4 \
+   commit-6-3 commit-5-3 commit-4-3 \
+   commit-6-2 commit-5-2 commit-4-2 \
+   commit-6-1 commit-5-1 commit-4-1 \
+   commit-3-8 commit-2-8 commit-1-8 \
+   commit-3-7 commit-2-7 commit-1-7 \
+   >expect &&
+   run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
+'
+
 test_done
-- 
gitgitgadget



[PATCH v4 1/7] prio-queue: add 'peek' operation

2018-10-16 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When consuming a priority queue, it can be convenient to inspect
the next object that will be dequeued without actually dequeueing
it. Our existing library did not have such a 'peek' operation, so
add it as prio_queue_peek().

Add a reference-level comparison in t/helper/test-prio-queue.c
so this method is exercised by t0009-prio-queue.sh. Further, add
a test that checks the behavior when the compare function is NULL
(i.e. the queue becomes a stack).

Signed-off-by: Derrick Stolee 
---
 prio-queue.c   |  9 +
 prio-queue.h   |  6 ++
 t/helper/test-prio-queue.c | 26 ++
 t/t0009-prio-queue.sh  | 14 ++
 4 files changed, 47 insertions(+), 8 deletions(-)

diff --git a/prio-queue.c b/prio-queue.c
index a078451872..d3f488cb05 100644
--- a/prio-queue.c
+++ b/prio-queue.c
@@ -85,3 +85,12 @@ void *prio_queue_get(struct prio_queue *queue)
}
return result;
 }
+
+void *prio_queue_peek(struct prio_queue *queue)
+{
+   if (!queue->nr)
+   return NULL;
+   if (!queue->compare)
+   return queue->array[queue->nr - 1].data;
+   return queue->array[0].data;
+}
diff --git a/prio-queue.h b/prio-queue.h
index d030ec9dd6..682e51867a 100644
--- a/prio-queue.h
+++ b/prio-queue.h
@@ -46,6 +46,12 @@ extern void prio_queue_put(struct prio_queue *, void *thing);
  */
 extern void *prio_queue_get(struct prio_queue *);
 
+/*
+ * Gain access to the "thing" that would be returned by
+ * prio_queue_get, but do not remove it from the queue.
+ */
+extern void *prio_queue_peek(struct prio_queue *);
+
 extern void clear_prio_queue(struct prio_queue *);
 
 /* Reverse the LIFO elements */
diff --git a/t/helper/test-prio-queue.c b/t/helper/test-prio-queue.c
index 9807b649b1..5bc9c46ea5 100644
--- a/t/helper/test-prio-queue.c
+++ b/t/helper/test-prio-queue.c
@@ -22,14 +22,24 @@ int cmd__prio_queue(int argc, const char **argv)
struct prio_queue pq = { intcmp };
 
while (*++argv) {
-   if (!strcmp(*argv, "get"))
-   show(prio_queue_get(&pq));
-   else if (!strcmp(*argv, "dump")) {
-   int *v;
-   while ((v = prio_queue_get(&pq)))
-  show(v);
-   }
-   else {
+   if (!strcmp(*argv, "get")) {
+   void *peek = prio_queue_peek(&pq);
+   void *get = prio_queue_get(&pq);
+   if (peek != get)
+   BUG("peek and get results do not match");
+   show(get);
+   } else if (!strcmp(*argv, "dump")) {
+   void *peek;
+   void *get;
+   while ((peek = prio_queue_peek(&pq))) {
+   get = prio_queue_get(&pq);
+   if (peek != get)
+   BUG("peek and get results do not 
match");
+   show(get);
+   }
+   } else if (!strcmp(*argv, "stack")) {
+   pq.compare = NULL;
+   } else {
int *v = malloc(sizeof(*v));
*v = atoi(*argv);
prio_queue_put(&pq, v);
diff --git a/t/t0009-prio-queue.sh b/t/t0009-prio-queue.sh
index e56dfce668..3941ad2528 100755
--- a/t/t0009-prio-queue.sh
+++ b/t/t0009-prio-queue.sh
@@ -47,4 +47,18 @@ test_expect_success 'notice empty queue' '
test_cmp expect actual
 '
 
+cat >expect <<'EOF'
+3
+2
+6
+4
+5
+1
+8
+EOF
+test_expect_success 'stack order' '
+   test-tool prio-queue stack 8 1 5 4 6 2 3 dump >actual &&
+   test_cmp expect actual
+'
+
 test_done
-- 
gitgitgadget



[PATCH v4 7/7] t6012: make rev-list tests more interesting

2018-10-16 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

As we are working to rewrite some of the revision-walk machinery,
there could easily be some interesting interactions between the
options that force topological constraints (--topo-order,
--date-order, and --author-date-order) along with specifying a
path.

Add extra tests to t6012-rev-list-simplify.sh to add coverage of
these interactions. To ensure interesting things occur, alter the
repo data shape to have different orders depending on topo-, date-,
or author-date-order.

When testing using GIT_TEST_COMMIT_GRAPH, this assists in covering
the new logic for topo-order walks using generation numbers. The
extra tests can be added indepently.

Signed-off-by: Derrick Stolee 
---
 t/t6012-rev-list-simplify.sh | 45 
 1 file changed, 36 insertions(+), 9 deletions(-)

diff --git a/t/t6012-rev-list-simplify.sh b/t/t6012-rev-list-simplify.sh
index b5a1190ffe..a10f0df02b 100755
--- a/t/t6012-rev-list-simplify.sh
+++ b/t/t6012-rev-list-simplify.sh
@@ -12,6 +12,22 @@ unnote () {
git name-rev --tags --stdin | sed -e "s|$OID_REGEX (tags/\([^)]*\)) |\1 
|g"
 }
 
+#
+# Create a test repo with interesting commit graph:
+#
+# A--B--G--H--I--K--L
+#  \  \   / /
+#   \  \ / /
+#C--E---F J
+#\_/
+#
+# The commits are laid out from left-to-right starting with
+# the root commit A and terminating at the tip commit L.
+#
+# There are a few places where we adjust the commit date or
+# author date to make the --topo-order, --date-order, and
+# --author-date-order flags produce different output.
+
 test_expect_success setup '
echo "Hi there" >file &&
echo "initial" >lost &&
@@ -21,10 +37,18 @@ test_expect_success setup '
 
git branch other-branch &&
 
+   git symbolic-ref HEAD refs/heads/unrelated &&
+   git rm -f "*" &&
+   echo "Unrelated branch" >side &&
+   git add side &&
+   test_tick && git commit -m "Side root" &&
+   note J &&
+   git checkout master &&
+
echo "Hello" >file &&
echo "second" >lost &&
git add file lost &&
-   test_tick && git commit -m "Modified file and lost" &&
+   test_tick && GIT_AUTHOR_DATE=$(($test_tick + 120)) git commit -m 
"Modified file and lost" &&
note B &&
 
git checkout other-branch &&
@@ -63,13 +87,6 @@ test_expect_success setup '
test_tick && git commit -a -m "Final change" &&
note I &&
 
-   git symbolic-ref HEAD refs/heads/unrelated &&
-   git rm -f "*" &&
-   echo "Unrelated branch" >side &&
-   git add side &&
-   test_tick && git commit -m "Side root" &&
-   note J &&
-
git checkout master &&
test_tick && git merge --allow-unrelated-histories -m "Coolest" 
unrelated &&
note K &&
@@ -103,14 +120,24 @@ check_result () {
check_outcome success "$@"
 }
 
-check_result 'L K J I H G F E D C B A' --full-history
+check_result 'L K J I H F E D C G B A' --full-history --topo-order
+check_result 'L K I H G F E D C B J A' --full-history
+check_result 'L K I H G F E D C B J A' --full-history --date-order
+check_result 'L K I H G F E D B C J A' --full-history --author-date-order
 check_result 'K I H E C B A' --full-history -- file
 check_result 'K I H E C B A' --full-history --topo-order -- file
 check_result 'K I H E C B A' --full-history --date-order -- file
+check_result 'K I H E B C A' --full-history --author-date-order -- file
 check_result 'I E C B A' --simplify-merges -- file
+check_result 'I E C B A' --simplify-merges --topo-order -- file
+check_result 'I E C B A' --simplify-merges --date-order -- file
+check_result 'I E B C A' --simplify-merges --author-date-order -- file
 check_result 'I B A' -- file
 check_result 'I B A' --topo-order -- file
+check_result 'I B A' --date-order -- file
+check_result 'I B A' --author-date-order -- file
 check_result 'H' --first-parent -- another-file
+check_result 'H' --first-parent --topo-order -- another-file
 
 check_result 'E C B A' --full-history E -- lost
 test_expect_success 'full history simplification without parent' '
-- 
gitgitgadget


[PATCH v4 5/7] commit/revisions: bookkeeping before refactoring

2018-10-16 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

There are a few things that need to move around a little before
making a big refactoring in the topo-order logic:

1. We need access to record_author_date() and
   compare_commits_by_author_date() in revision.c. These are used
   currently by sort_in_topological_order() in commit.c.

2. Moving these methods to commit.h requires adding the author_slab
   definition to commit.h.

3. The add_parents_to_list() method in revision.c performs logic
   around the UNINTERESTING flag and other special cases depending
   on the struct rev_info. Allow this method to ignore a NULL 'list'
   parameter, as we will not be populating the list for our walk.
   Also rename the method to the slightly more generic name
   process_parents() to make clear that this method does more than
   add to a list (and no list is required anymore).

Helped-by: Jeff King 
Signed-off-by: Derrick Stolee 
---
 commit.c   | 11 +--
 commit.h   |  8 
 revision.c | 18 ++
 3 files changed, 23 insertions(+), 14 deletions(-)

diff --git a/commit.c b/commit.c
index d0f199e122..861a485e93 100644
--- a/commit.c
+++ b/commit.c
@@ -655,11 +655,10 @@ struct commit *pop_commit(struct commit_list **stack)
 /* count number of children that have not been emitted */
 define_commit_slab(indegree_slab, int);
 
-/* record author-date for each commit object */
-define_commit_slab(author_date_slab, timestamp_t);
+implement_shared_commit_slab(author_date_slab, timestamp_t);
 
-static void record_author_date(struct author_date_slab *author_date,
-  struct commit *commit)
+void record_author_date(struct author_date_slab *author_date,
+   struct commit *commit)
 {
const char *buffer = get_commit_buffer(commit, NULL);
struct ident_split ident;
@@ -684,8 +683,8 @@ fail_exit:
unuse_commit_buffer(commit, buffer);
 }
 
-static int compare_commits_by_author_date(const void *a_, const void *b_,
- void *cb_data)
+int compare_commits_by_author_date(const void *a_, const void *b_,
+  void *cb_data)
 {
const struct commit *a = a_, *b = b_;
struct author_date_slab *author_date = cb_data;
diff --git a/commit.h b/commit.h
index 2b1a734388..977d397356 100644
--- a/commit.h
+++ b/commit.h
@@ -8,6 +8,7 @@
 #include "gpg-interface.h"
 #include "string-list.h"
 #include "pretty.h"
+#include "commit-slab.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0x
 #define GENERATION_NUMBER_INFINITY 0x
@@ -328,6 +329,13 @@ extern int remove_signature(struct strbuf *buf);
  */
 extern int check_commit_signature(const struct commit *commit, struct 
signature_check *sigc);
 
+/* record author-date for each commit object */
+define_shared_commit_slab(author_date_slab, timestamp_t);
+
+void record_author_date(struct author_date_slab *author_date,
+   struct commit *commit);
+
+int compare_commits_by_author_date(const void *a_, const void *b_, void 
*unused);
 int compare_commits_by_commit_date(const void *a_, const void *b_, void 
*unused);
 int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
void *unused);
 
diff --git a/revision.c b/revision.c
index 2dcde8a8ac..36458265a0 100644
--- a/revision.c
+++ b/revision.c
@@ -768,8 +768,8 @@ static void commit_list_insert_by_date_cached(struct commit 
*p, struct commit_li
*cache = new_entry;
 }
 
-static int add_parents_to_list(struct rev_info *revs, struct commit *commit,
-   struct commit_list **list, struct commit_list **cache_ptr)
+static int process_parents(struct rev_info *revs, struct commit *commit,
+  struct commit_list **list, struct commit_list 
**cache_ptr)
 {
struct commit_list *parent = commit->parents;
unsigned left_flag;
@@ -808,7 +808,8 @@ static int add_parents_to_list(struct rev_info *revs, 
struct commit *commit,
if (p->object.flags & SEEN)
continue;
p->object.flags |= SEEN;
-   commit_list_insert_by_date_cached(p, list, cached_base, 
cache_ptr);
+   if (list)
+   commit_list_insert_by_date_cached(p, list, 
cached_base, cache_ptr);
}
return 0;
}
@@ -847,7 +848,8 @@ static int add_parents_to_list(struct rev_info *revs, 
struct commit *commit,
p->object.flags |= left_flag;
if (!(p->object.flags & SEEN)) {
p->object.flags |= SEEN;
-   commit_list_insert_by_date_cached(p, list, cached_base, 
cache_ptr);
+   if (list)
+   commit_list_insert_by_date_cached(p, list, 
cached_base, cache_ptr);
}
if (revs->

[PATCH v4 0/7] Use generation numbers for --topo-order

2018-10-16 Thread Derrick Stolee via GitGitGadget
This patch series performs a decently-sized refactoring of the revision-walk
machinery. Well, "refactoring" is probably the wrong word, as I don't
actually remove the old code. Instead, when we see certain options in the
'rev_info' struct, we redirect the commit-walk logic to a new set of methods
that distribute the workload differently. By using generation numbers in the
commit-graph, we can significantly improve 'git log --graph' commands (and
the underlying 'git rev-list --topo-order').

On the Linux repository, I got the following performance results when
comparing to the previous version with or without a commit-graph:

Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
  HEAD, w/ commit-graph: 0.02 s

Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
  HEAD, w/ commit-graph: 0.06 s

If you want to read this series but are unfamiliar with the commit-graph and
generation numbers, then I recommend reading 
Documentation/technical/commit-graph.txt or a blob post [1] I wrote on the
subject. In particular, the three-part walk described in "revision.c:
refactor basic topo-order logic" is present (but underexplained) as an
animated PNG [2].

Since revision.c is an incredibly important (and old) portion of the
codebase -- and because there are so many orthogonal options in 'struct
rev_info' -- I consider this submission to be "RFC quality". That is, I am
not confident that I am not missing anything, or that my solution is the
best it can be. I did merge this branch with ds/commit-graph-with-grafts and
the "DO-NOT-MERGE: write and read commit-graph always" commit that computes
a commit-graph with every 'git commit' command. The test suite passed with
that change, available on GitHub [3]. To ensure that I cover at least the
case I think are interesting, I added tests to t6600-test-reach.sh to verify
the walks report the correct results for the three cases there (no
commit-graph, full commit-graph, and a partial commit-graph so the walk
starts at GENERATION_NUMBER_INFINITY).

One notable case that is not included in this series is the case of a
history comparison such as 'git rev-list --topo-order A..B'. The existing
code in limit_list() has ways to cut the walk short when all pending commits
are UNINTERESTING. Since this code depends on commit_list instead of the
prio_queue we are using here, I chose to leave it untouched for now. We can
revisit it in a separate series later. Since handle_commit() turns on
revs->limited when a commit is UNINTERESTING, we do not hit the new code in
this case. Removing this 'revs->limited = 1;' line yields correct results,
but the performance is worse.

This series was based on ds/reachable, but is now based on 'master' to not
conflict with 182070 "commit: use timestamp_t for author_date_slab". There
is a small conflict with md/filter-trees, because it renamed a flag in
revisions.h in the line before I add new flags. Hopefully this conflict is
not too difficult to resolve.

Changes in V3: I added a new patch that updates the tab-alignment for flags
in revision.h before adding new ones (Thanks, Ævar!). Also, I squashed the
recommended changes to run_three_modes and test_three_modes from Szeder and
Junio. Thanks!

Changes in V4: I'm sending a V4 to respond to the feedback so far. Still
looking forward to more on the really big commit!

 * Removed the whitespace changes to the flags in revision.c that caused
   merge pain. 
   
   
 * The prio-queue peek function is now covered by tests when in "stack"
   mode.
   
   
 * The "add_parents_to_list()" function is now renamed to
   "process_parents()"
   
   
 * Added a new commit that expands test coverage with alternate orders and
   file history (use GIT_TEST_COMMIT_GRAPH to have
   t6012-rev-list-simplify.sh cover the new logic). These tests found a
   problem with author dates (I forgot to record them during the explore
   walk).
   
   
 * Commit message edits.
   
   

Thanks, -Stolee

[1] 
https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/
Supercharging the Git Commit Graph III: Generations and Graph Algorithms

[2] 
https://msdnshared.blob.core.windows.net/media/2018/06/commit-graph-topo-order-b-a.png
Animation showing three-part walk

[3] https://github.com/derrickstolee/git/tree/topo-order/testA branch
containing this series along with commits to compute commit-graph in entire
test suite.

Cc: avarab@gmail.comCc: szeder@gmail.com

Derrick Stolee (7):
  prio-queue: add 'peek' operation
  test-reach: add run_three_modes method
  test-reach: add rev-list tests
  revision.c: begin refactoring --topo-order logic
  commit/revisions: bookkeeping before refactoring

[PATCH v4 2/7] test-reach: add run_three_modes method

2018-10-16 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The 'test_three_modes' method assumes we are using the 'test-tool
reach' command for our test. However, we may want to use the data
shape of our commit graph and the three modes (no commit-graph,
full commit-graph, partial commit-graph) for other git commands.

Split test_three_modes to be a simple translation on a more general
run_three_modes method that executes the given command and tests
the actual output to the expected output.

Signed-off-by: Derrick Stolee 
---
 t/t6600-test-reach.sh | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index d139a00d1d..9d65b8b946 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -53,18 +53,22 @@ test_expect_success 'setup' '
git config core.commitGraph true
 '
 
-test_three_modes () {
+run_three_modes () {
test_when_finished rm -rf .git/objects/info/commit-graph &&
-   test-tool reach $1 actual &&
+   "$@" actual &&
test_cmp expect actual &&
cp commit-graph-full .git/objects/info/commit-graph &&
-   test-tool reach $1 actual &&
+   "$@" actual &&
test_cmp expect actual &&
cp commit-graph-half .git/objects/info/commit-graph &&
-   test-tool reach $1 actual &&
+   "$@" actual &&
test_cmp expect actual
 }
 
+test_three_modes () {
+   run_three_modes test-tool reach "$@"
+}
+
 test_expect_success 'ref_newer:miss' '
cat >input <<-\EOF &&
A:commit-5-7
-- 
gitgitgadget



Re: [PATCH v2 13/13] commit-graph: specify OID version for SHA-256

2018-10-16 Thread Derrick Stolee

On 10/16/2018 11:35 AM, Duy Nguyen wrote:

On Mon, Oct 15, 2018 at 4:23 AM brian m. carlson
 wrote:

Since the commit-graph code wants to serialize the hash algorithm into
the data store, specify a version number for each supported algorithm.
Note that we don't use the values of the constants themselves, as they
are internal and could change in the future.

Signed-off-by: brian m. carlson 
---
  commit-graph.c | 9 -
  1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index 7a28fbb03f..e587c21bb6 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -45,7 +45,14 @@ char *get_commit_graph_filename(const char *obj_dir)

  static uint8_t oid_version(void)
  {
-   return 1;
+   switch (hash_algo_by_ptr(the_hash_algo)) {
+   case GIT_HASH_SHA1:
+   return 1;
+   case GIT_HASH_SHA256:
+   return 2;

Should we just increase this field to uint32_t and store format_id
instead? That will keep oid version unique in all data formats.
Both the commit-graph and multi-pack-index store a single byte for the 
hash version, so that ship has sailed (without incrementing the full 
file version number in each format).


It may be good to make this method accessible to both formats. I'm not 
sure if Brian's branch is built on top of the multi-pack-index code. 
Probably best to see if ds/multi-pack-verify is in the history.


Thanks,
-Stolee


Re: [PATCH 0/4] Bloom filter experiment

2018-10-16 Thread Derrick Stolee

On 10/16/2018 8:57 AM, Ævar Arnfjörð Bjarmason wrote:

On Tue, Oct 16 2018, Derrick Stolee wrote:


On 10/16/2018 12:45 AM, Junio C Hamano wrote:

Derrick Stolee  writes:


2. The filters are sized according to the number of changes in each
commit, with a minimum of one 64-bit word.
...
6. When we compute the Bloom filters, we don't store a filter for
commits whose first-parent diff has more than 512 paths.

Just being curious but was 512 taken out of thin air or is there
some math behind it, e.g. to limit false positive rate down to
certain threshold?  With a wide-enough bitset, you could store
arbitrary large number of paths with low enough false positive, I
guess, but is there a point where there is too many paths in the
change that gives us diminishing returns and not worth having a
filter in the first place?

512 is somewhat arbitrary, but having a maximum size is not.

In a normal source-code-control context, the set of paths modified
by any single commit ought to be a small subset of the entire paths,
and whole-tree changes ought to be fairly rare.  In a project for
which that assumption does not hold, it might help to have a
negative bloom filter (i.e. "git log -- A" asks "does the commit
modify A?" and the filter would say "we know it does not, because we
threw all the paths that are not touched to the bloom filter"), but
I think that would optimize for a wrong case.

A commit with many changed paths is very rare. The 512 I picked above
is enough to cover 99% of commits in each of the repos I sampled when
first investigating Bloom filters.

When a Bloom filter response says "maybe yes" (in our case, "maybe not
TREESAME"), then we need to verify that it is correct. In the extreme
case that every path is changed, then the Bloom filter does nothing
but add extra work.

These extreme cases are also not unprecedented: in our Azure Repos
codebase, we were using core.autocrlf to smudge CRLFs to LFs, but
when it was time to dogfood VFS for Git, we needed to turn off the
smudge filter. So, there is one commit that converts every LF to a
CRLF in every text file. Storing a Bloom filter for those ~250,000
entries would take ~256KB for essentially no value. By not storing a
filter for this commit, we go immediately to the regular TREESAME
check, which would happen for most pathspecs.

This is all to say: having a maximum size is good. 512 is big enough
to cover _most_ commits, but not so big that we may store _really_ big
filters.

Makes sense. 512 is good enough to hardcode initially, but I couldn't
tell from briefly skimming the patches if it was possible to make this
size dynamic per-repo when the graph/filter is written.
My proof-of-concept has it as a constant, but part of my plan is to make 
these all config options, as of this item in my message:


>>> 2. We need config values for writing and consuming bloom filters, 
but also to override the default settings.



I.e. we might later add some discovery step where we look at N number of
commits at random, until we're satisfied that we've come up with some
average/median number of total (recursive) tree entries & how many tend
to be changed per-commit.

I.e. I can imagine repositories (with some automated changes) where we
have 10k files and tend to change 1k per commit, or ones with 10k files
where we tend to change just 1-10 per commit, which would mean a
larger/smaller filter would be needed / would do.
I'm not sure a dynamic approach would be worth the effort, but I'm open 
to hearing the results of an experiment.


Thanks,
-Stolee


Re: [PATCH 0/4] Bloom filter experiment

2018-10-16 Thread Derrick Stolee

On 10/16/2018 12:45 AM, Junio C Hamano wrote:

Derrick Stolee  writes:


2. The filters are sized according to the number of changes in each
commit, with a minimum of one 64-bit word.
...
6. When we compute the Bloom filters, we don't store a filter for
commits whose first-parent diff has more than 512 paths.

Just being curious but was 512 taken out of thin air or is there
some math behind it, e.g. to limit false positive rate down to
certain threshold?  With a wide-enough bitset, you could store
arbitrary large number of paths with low enough false positive, I
guess, but is there a point where there is too many paths in the
change that gives us diminishing returns and not worth having a
filter in the first place?

512 is somewhat arbitrary, but having a maximum size is not.

In a normal source-code-control context, the set of paths modified
by any single commit ought to be a small subset of the entire paths,
and whole-tree changes ought to be fairly rare.  In a project for
which that assumption does not hold, it might help to have a
negative bloom filter (i.e. "git log -- A" asks "does the commit
modify A?" and the filter would say "we know it does not, because we
threw all the paths that are not touched to the bloom filter"), but
I think that would optimize for a wrong case.


A commit with many changed paths is very rare. The 512 I picked above is 
enough to cover 99% of commits in each of the repos I sampled when first 
investigating Bloom filters.


When a Bloom filter response says "maybe yes" (in our case, "maybe not 
TREESAME"), then we need to verify that it is correct. In the extreme 
case that every path is changed, then the Bloom filter does nothing but 
add extra work.


These extreme cases are also not unprecedented: in our Azure Repos 
codebase, we were using core.autocrlf  to smudge CRLFs to LFs, but when 
it was time to dogfood VFS for Git, we needed to turn off the smudge 
filter. So, there is one commit that converts every LF to a CRLF in 
every text file. Storing a Bloom filter for those ~250,000 entries would 
take ~256KB for essentially no value. By not storing a filter for this 
commit, we go immediately to the regular TREESAME check, which would 
happen for most pathspecs.


This is all to say: having a maximum size is good. 512 is big enough to 
cover _most_ commits, but not so big that we may store _really_ big filters.


Thanks,
-Stolee



Re: Git Test Coverage Report (Monday, Oct 15)

2018-10-15 Thread Derrick Stolee

On 10/15/2018 12:24 PM, Derrick Stolee wrote:


Uncovered code in 'jch' (22f2f0f) and not in 'next' (152ad8e)
-

prio-queue.c
2d181390f3 94) return queue->array[queue->nr - 1].data;

(I have a fix to cover this in my private branch for this topic.)



revision.c
4943d28849 2931) return;
4943d28849 2934) return;
4943d28849 2937) c->object.flags |= UNINTERESTING;
4943d28849 2940) return;
4943d28849 2943) mark_parents_uninteresting(c);
4943d28849 2966) return;
4943d28849 2969) return;
4943d28849 2974) return;
4943d28849 3022) init_author_date_slab(&info->author_date);
4943d28849 3023) info->topo_queue.compare = 
compare_commits_by_author_date;

4943d28849 3024) info->topo_queue.cb_data = &info->author_date;
4943d28849 3025) break;
4943d28849 3038) continue;
4943d28849 3048) record_author_date(&info->author_date, c);
6c04ff3001 3086) if (!revs->ignore_missing_links)
6c04ff3001 3087) die("Failed to traverse parents of commit %s",
4943d28849 3088) oid_to_hex(&commit->object.oid));
4943d28849 3096) continue;
Looks like a number of these lines are important to cover, but are not 
covered by tests that _also_ specify '--topo-order'. I bet I can cover 
more of these by overriding the sort logic to use the new algorithm if 
GIT_TEST_COMMIT_GRAPH is specified. Or, should I create yet another test 
variable to cover these cases?


(Note: I run these coverage reports with a variety of optional test 
variables.)



Uncovered code in 'next' (152ad8e) and not in 'master' (5a0cc8a)

builtin/rev-list.c
7c0fe330d5 builtin/rev-list.c 227) die("unexpected missing %s object 
'%s'",
7c0fe330d5 builtin/rev-list.c 228) type_name(obj->type), 
oid_to_hex(&obj->oid));


commit-graph.c
5cef295f28   67) return 0;
20fd6d5799   79) return 0;
These are two ways to say the commit-graph should not be used, but are 
not covered by tests currently. One is if we say "is the repo shallow?" 
which happens to return when the repo has grafts (but we keep the check 
here in case the way shallows are implemented changes) and the other is 
if the repo is not initialized, but I fixed the test-helpers that had 
not initialized the repo yet.


Uncovered code in 'master' (5a0cc8a) and not in (fe8321ec05)
-
builtin/fsck.c
66ec0390e7 builtin/fsck.c 862) midx_argv[2] = "--object-dir";
66ec0390e7 builtin/fsck.c 863) midx_argv[3] = alt->path;
66ec0390e7 builtin/fsck.c 864) if (run_command(&midx_verify))
66ec0390e7 builtin/fsck.c 865) errors_found |= ERROR_COMMIT_GRAPH;
These are underneath the "for all alternates" loop, and _should_ be 
covered with the coming GIT_TEST_MULTI_PACK_INDEX test variable.


Thanks,
-Stolee


Git Test Coverage Report (Monday, Oct 15)

2018-10-15 Thread Derrick Stolee
not have "

9f630e7480 builtin/stash--helper.c 1137) ret = 1;
9f630e7480 builtin/stash--helper.c 1138) goto done;
c2cc69f192 builtin/stash--helper.c 1154) if (!quiet)
c2cc69f192 builtin/stash--helper.c 1155) fprintf_ln(stderr, _("Cannot 
save the current "

9f630e7480 builtin/stash--helper.c 1157) ret = -1;
9f630e7480 builtin/stash--helper.c 1158) goto done;
c2cc69f192 builtin/stash--helper.c 1163) if (!quiet)
c2cc69f192 builtin/stash--helper.c 1164) fprintf_ln(stderr, _("Cannot save "
9f630e7480 builtin/stash--helper.c 1166) ret = -1;
9f630e7480 builtin/stash--helper.c 1167) goto done;
c2cc69f192 builtin/stash--helper.c 1174) if (!quiet)
c2cc69f192 builtin/stash--helper.c 1175) fprintf_ln(stderr, _("Cannot 
save the current "

9f630e7480 builtin/stash--helper.c 1177) goto done;
c2cc69f192 builtin/stash--helper.c 1183) if (!quiet)
c2cc69f192 builtin/stash--helper.c 1184) fprintf_ln(stderr, _("Cannot 
save the current "

9f630e7480 builtin/stash--helper.c 1186) ret = -1;
9f630e7480 builtin/stash--helper.c 1187) goto done;
c2cc69f192 builtin/stash--helper.c 1213) if (!quiet)
c2cc69f192 builtin/stash--helper.c 1214) fprintf_ln(stderr, _("Cannot 
record "

9f630e7480 builtin/stash--helper.c 1216) ret = -1;
9f630e7480 builtin/stash--helper.c 1217) goto done;
1a0f0409a7 builtin/stash--helper.c 1289) ret = -1;
1a0f0409a7 builtin/stash--helper.c 1290) goto done;
1a0f0409a7 builtin/stash--helper.c 1300) ret = -1;
c2cc69f192 builtin/stash--helper.c 1301) if (!quiet)
c2cc69f192 builtin/stash--helper.c 1302) fprintf_ln(stderr, _("Cannot 
initialize stash"));

1a0f0409a7 builtin/stash--helper.c 1303) goto done;
1a0f0409a7 builtin/stash--helper.c 1313) ret = -1;
c2cc69f192 builtin/stash--helper.c 1314) if (!quiet)
c2cc69f192 builtin/stash--helper.c 1315) fprintf_ln(stderr, _("Cannot 
save the current status"));

1a0f0409a7 builtin/stash--helper.c 1316) goto done;
1a0f0409a7 builtin/stash--helper.c 1333) ret = -1;
1a0f0409a7 builtin/stash--helper.c 1352) ret = -1;
1a0f0409a7 builtin/stash--helper.c 1353) goto done;
1a0f0409a7 builtin/stash--helper.c 1362) ret = -1;
1a0f0409a7 builtin/stash--helper.c 1363) goto done;
1a0f0409a7 builtin/stash--helper.c 1371) ret = -1;
1a0f0409a7 builtin/stash--helper.c 1380) ret = -1;
1a0f0409a7 builtin/stash--helper.c 1391) ret = -1;
1a0f0409a7 builtin/stash--helper.c 1392) goto done;
1a0f0409a7 builtin/stash--helper.c 1401) ret = -1;
1a0f0409a7 builtin/stash--helper.c 1402) goto done;
1a0f0409a7 builtin/stash--helper.c 1410) ret = -1;
1a0f0409a7 builtin/stash--helper.c 1436) ret = -1;
193c3e3516 builtin/stash.c 1568) 
usage_msg_opt(xstrfmt(_("unknown subcommand: %s"), argv[0]),

193c3e3516 builtin/stash.c 1596) continue;

packfile.c
1127a98cce  117) return error("index file %s is too small", path);
1127a98cce  119) return error("empty data");

prio-queue.c
2d181390f3 94) return queue->array[queue->nr - 1].data;

rebase-interactive.c
64a43cbd5d 62) return error_errno(_("could not read '%s'."), todo_file);
64a43cbd5d 66) strbuf_release(&buf);
64a43cbd5d 67) return -1;
a9f5476fbc 75) return error_errno(_("could not read '%s'."), todo_file);
a9f5476fbc 79) strbuf_release(&buf);
a9f5476fbc 80) return -1;
64a43cbd5d 86) return -1;

revision.c
4943d28849 2931) return;
4943d28849 2934) return;
4943d28849 2937) c->object.flags |= UNINTERESTING;
4943d28849 2940) return;
4943d28849 2943) mark_parents_uninteresting(c);
4943d28849 2966) return;
4943d28849 2969) return;
4943d28849 2974) return;
4943d28849 3022) init_author_date_slab(&info->author_date);
4943d28849 3023) info->topo_queue.compare = compare_commits_by_author_date;
4943d28849 3024) info->topo_queue.cb_data = &info->author_date;
4943d28849 3025) break;
4943d28849 3038) continue;
4943d28849 3048) record_author_date(&info->author_date, c);
6c04ff3001 3086) if (!revs->ignore_missing_links)
6c04ff3001 3087) die("Failed to traverse parents of commit %s",
4943d28849 3088) oid_to_hex(&commit->object.oid));
4943d28849 3096) continue;

sequencer.c
65850686cf 2278) return;
65850686cf 2375) write_file(rebase_path_quiet(), "%s\n", quiet);
2c58483a59 3373) return error(_("could not checkout %s"), commit);
4df66c40b0 3387) return error(_("%s: not a valid OID"), orig_head);
71f82465b1 3407) fprintf(stderr, _("Stopped at HEAD\n"));
b97e187364 4771) return -1;
b97e187364 4774) return -1;
b97e187364 4780) return error_errno(_("could not read '%s'."), todo_file);
b97e187364 4783) todo_list_release(&todo_list);
b97e187364 4784) return error(_("unusable todo list: '%s'"), todo_file);
b97e187364 4803) todo_list_release(&todo_list);
b97e187364 4804) return -1;
b97e187364 4808) return error(_("could not copy '%s' to '%s'."), todo

Re: [PATCH 15/16] commit-reach: make can_all_from_reach... linear

2018-10-15 Thread Derrick Stolee

On 10/14/2018 10:29 AM, René Scharfe wrote:

It still has some repetition, converted code is a bit longer than the
current one, and I don't know how to build a Coccinelle rule that would
do that conversion.

Looked for a possibility to at least leave QSORT call-sites alone by
enhancing that macro, but didn't find any.  Found a few websites
showing off mindblowing macros, thouhg, this one in particular:

https://github.com/pfultz2/Cloak/wiki/C-Preprocessor-tricks,-tips,-and-idioms

Anyway, drove the generative approach a bit further, and came up with
the new DEFINE_SORT below.  I'm unsure about the name; perhaps it should
be called DEFINE_SORT_BY_COMPARE_FUNCTION_BODY, but that's a bit long.
It handles casts and const attributes behind the scenes and avoids
repetition, but looks a bit weird, as it is placed where a function
signature would go.

Apart from that the macro is simple and doesn't use any tricks or
added checks.  It just sets up boilerplate functions to offer type-safe
sorting.

diffcore-rename.c and refs/packed-backend.c receive special treatment in
the patch because their compare functions are used outside of sorting as
well.  I made them take typed pointers nevertheless and used them from
DEFINE_SORT; the wrapper generated by that macro is supposed to be
private.  Given that such reuse is rare and I think we don't need a way
to make it public.

What do y'all think about this direction?

---

[snip]

diff --git a/git-compat-util.h b/git-compat-util.h
index 5f2e90932f..491230fc57 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -1066,6 +1066,21 @@ static inline void sane_qsort(void *base, size_t nmemb, 
size_t size,
qsort(base, nmemb, size, compar);
  }
  
+#define DECLARE_SORT(scope, name, elemtype) \

+scope void name(elemtype, size_t)
+
+#define DEFINE_SORT(scope, name, elemtype, one, two)   \
+static int name##_compare(const elemtype, const elemtype); \
+static int name##_compare_void(const void *a, const void *b)   \
+{  \
+   return name##_compare(a, b);\
+}  \
+scope void name(elemtype base, size_t nmemb)   \
+{  \
+   QSORT(base, nmemb, name##_compare_void);\
+}  \
+static int name##_compare(const elemtype one, const elemtype two)
+


Since you were worried about the "private" name of the compare function, 
maybe split this macro into two: DEFINE_COMPARE and DEFINE_SORT. Then, 
if someone wants direct access to the compare function, they could use 
the DEFINE_COMPARE to ensure the typing is correct, and use QSORT as 
normal with name##_compare_void.


As I think about this, I think this is less of a problem than is worth 
this split. The commit-slab definitions generate a lot of methods using 
the "name##" convention, so perhaps we should just trust developers 
using the macros to look up the macro definition or similar examples. In 
that sense, including a conversion that consumes the compare function 
directly can be a signpost for future callers.


I would say that maybe the times where you need to do something special 
should be pulled out into their own patches, so we can call attention to 
them directly.


[snip]

diff --git a/midx.c b/midx.c
index 713d6f9dde..4407db7949 100644
--- a/midx.c
+++ b/midx.c
@@ -419,10 +419,8 @@ struct pack_pair {
char *pack_name;
  };
  
-static int pack_pair_compare(const void *_a, const void *_b)

+DEFINE_SORT(static, sort_by_pack_name, struct pack_pair *, a, b)
  {
-   struct pack_pair *a = (struct pack_pair *)_a;
-   struct pack_pair *b = (struct pack_pair *)_b;
return strcmp(a->pack_name, b->pack_name);
  }
  
@@ -438,7 +436,7 @@ static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *p

pairs[i].pack_name = pack_names[i];
}
  
-	QSORT(pairs, nr_packs, pack_pair_compare);

+   sort_by_pack_name(pairs, nr_packs);


I like this "sort_by_" convention..

  
  	for (i = 0; i < nr_packs; i++) {

pack_names[i] = pairs[i].pack_name;
@@ -455,10 +453,8 @@ struct pack_midx_entry {
uint64_t offset;
  };
  
-static int midx_oid_compare(const void *_a, const void *_b)

+DEFINE_SORT(static, sort_midx, struct pack_midx_entry *, a, b)
  {
-   const struct pack_midx_entry *a = (const struct pack_midx_entry *)_a;
-   const struct pack_midx_entry *b = (const struct pack_midx_entry *)_b;
int cmp = oidcmp(&a->oid, &b->oid);
  
  	if (cmp)

@@ -573,7 +569,7 @@ static struct pack_midx_entry *get_sorted_entries(struct 
multi_pack_index *m,
}
}
  
-		QSORT(entries_by_fanout, nr_fanout, midx_oid_compare)

Re: [PATCH v2 13/13] commit-graph: specify OID version for SHA-256

2018-10-15 Thread Derrick Stolee

On 10/14/2018 10:19 PM, brian m. carlson wrote:

Since the commit-graph code wants to serialize the hash algorithm into
the data store, specify a version number for each supported algorithm.
Note that we don't use the values of the constants themselves, as they
are internal and could change in the future.

Signed-off-by: brian m. carlson 
---
  commit-graph.c | 9 -
  1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index 7a28fbb03f..e587c21bb6 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -45,7 +45,14 @@ char *get_commit_graph_filename(const char *obj_dir)
  
  static uint8_t oid_version(void)

  {
-   return 1;
+   switch (hash_algo_by_ptr(the_hash_algo)) {
+   case GIT_HASH_SHA1:
+   return 1;
+   case GIT_HASH_SHA256:
+   return 2;
+   default:
+   BUG("unknown hash algorithm");
+   }
  }
  
  static struct commit_graph *alloc_commit_graph(void)

Simple and easy!

Thanks,
-Stolee


Re: [PATCH v2 09/13] commit-graph: convert to using the_hash_algo

2018-10-15 Thread Derrick Stolee

On 10/14/2018 10:18 PM, brian m. carlson wrote:

Instead of using hard-coded constants for object sizes, use
the_hash_algo to look them up.  In addition, use a function call to look
up the object ID version and produce the correct value.
This looks good and I can see already how this new organization will 
help make the hash size more flexible.


Thanks!
-Stolee



Re: [PATCH 0/4] Bloom filter experiment

2018-10-15 Thread Derrick Stolee

On 10/9/2018 3:34 PM, SZEDER Gábor wrote:

To keep the ball rolling, here is my proof of concept in a somewhat
cleaned-up form, with still plenty of rough edges.


Peff, Szeder, and Jonathan,

Thanks for giving me the kick in the pants to finally write a proof of 
concept for my personal take on how this should work. My implementation 
borrows things from both Szeder and Jonathan's series. You can find my 
commits for all of the versions on GitHub (it's a bit too messy to share 
as a patch series right now, I think):


Repo: https://github.com/derrickstolee/git
Branches: bloom/* (includes bloom/stolee, bloom/peff, bloom/szeder, and 
bloom/tan for the respective implementations, and bloom/base as the 
common ancestor)


My implementation uses the following scheme:

1. Bloom filters are computed and stored on a commit-by-commit basis.

2. The filters are sized according to the number of changes in each 
commit, with a minimum of one 64-bit word.


3. The filters are stored in the commit-graph using two new optional 
chunks: one stores a single 32-bit integer for each commit that provides 
the end of its Bloom filter in the second "data" chunk. The data chunk 
also stores the magic constants (hash version, num hash keys, and num 
bits per entry).


4. We fill the Bloom filters as (const char *data, int len) pairs as 
"struct bloom_filter"s in a commit slab.


5. In order to evaluate containment, we need the struct bloom_filter, 
but also struct bloom_settings (stores the magic constants in one 
place), and struct bloom_key (stores the _k_ hash values). This allows 
us to hash a path once and test the same path against many Bloom filters.


6. When we compute the Bloom filters, we don't store a filter for 
commits whose first-parent diff has more than 512 paths.


7. When we compute the commit-graph, we can re-use the pre-existing 
filters without needing to recompute diffs. (Caveat: the current 
implementation will re-compute diffs for the commits with diffs that 
were too large.)


You can build the Bloom filters in my implementation this way:

GIT_TEST_BLOOM_FILTERS=1 ./git commit-graph write --reachable


You can play around with it like this:

   $ GIT_USE_POC_BLOOM_FILTER=$((8*1024*1024*8)) git commit-graph write
   Computing commit graph generation numbers: 100% (52801/52801), done.
   Computing bloom filter: 100% (52801/52801), done.
   # Yeah, I even added progress indicator! :)
   $ GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y git rev-list --count 
--full-history HEAD -- t/valgrind/valgrind.sh
   886
   20:40:24.783699 revision.c:486  bloom filter total queries: 66095 
definitely not: 64953 maybe: 1142 false positives: 256 fp ratio: 0.003873


Jonathan used this same test, so will I. Here is a summary table:

| Implementation | Queries | Maybe | FP # | FP %  |
||-|---|--|---|
| Szeder | 66095   | 1142  | 256  | 0.38% |
| Jonathan   | 66459   | 107   | 89   | 0.16% |
| Stolee | 53025   | 492   | 479  | 0.90% |

(Note that we must have used different starting points, which is why my 
"Queries" is so much smaller.)


The increase in false-positive percentage is expected in my 
implementation. I'm using the correct filter sizes to hit a <1% FP 
ratio. This could be lowered by changing the settings, and the size 
would dynamically grow. For my Git repo (which contains 
git-for-windows/git and microsoft/git) this implementation grows the 
commmit-graph file from 5.8 MB to 7.3 MB (1.5 MB total, compared to 
Szeder's 8MB filter). For 105,260 commits, that rounds out to less than 
20 bytes per commit (compared to Jonathan's 256 bytes per commit).


Related stats for my Linux repo: 781,756 commits, commit-graph grows 
from 43.8 to 55.6 MB (~12 MB additional, ~16 bytes per commit).


I haven't done a side-by-side performance test for these 
implementations, but it would be interesting to do so.


Despite writing a lot of code in a short amount of time, there is a lot 
of work to be done before this is submittable:


1. There are three different environment variables right now. It would 
be better to have one GIT_TEST_ variable and rely on existing tracing 
for logs (trace2 values would be good here).


2. We need config values for writing and consuming bloom filters, but 
also to override the default settings.


3. My bloom.c/bloom.h is too coupled to the commit-graph. I want to 
harden that interface to let Bloom filters live as their own thing, but 
that the commit-graph could load a bloom filter from the file instead of 
from the slab.


4. Tests, tests, and more tests.

We'll see how much time I have to do this polish, but I think the 
benefit is proven.


Thanks,
-Stolee


Re: Git Test Coverage Report (Friday, Oct 12)

2018-10-15 Thread Derrick Stolee

On 10/15/2018 4:09 AM, Johannes Schindelin wrote:

Hi Stolee,

On Fri, 12 Oct 2018, Derrick Stolee wrote:


In an effort to ensure new code is reasonably covered by the test suite, we
now have contrib/coverage-diff.sh to combine the gcov output from 'make
coverage-test ; make coverage-report' with the output from 'git diff A B' to
discover _new_ lines of code that are not covered.

Thanks for doing this. I do notice, though, that there are a few mentions
of BUG() lines, e.g.


0af129b2ed builtin/rebase--interactive2.c 267) BUG("invalid command '%d'",
command);

I do not think that we will ever want to ensure that all of these lines
get code coverage ;-) Maybe it would be possible to exclude those lines
from the analysis?
I'll incorporate that into my build script, but leave it out of 
contrib/coverage-diff.sh in case people really want to see those lines 
when the inspect a single topic.


Thanks,
-Stolee


Re: [PATCH v2 0/3] Add GIT_TEST_MULTI_PACK_INDEX environment variable

2018-10-12 Thread Derrick Stolee

On 10/12/2018 1:34 PM, Derrick Stolee via GitGitGadget wrote:

To increase coverage of the multi-pack-index feature, add a
GIT_TEST_MULTI_PACK_INDEX environment variable similar to other GIT_TEST_*
variables.

After creating the environment variable and running the test suite with it
enabled, I found a few bugs in the multi-pack-index implementation. These
are handled by the first two patches.

I have set up a CI build on Azure Pipelines [1] that runs the test suite
with a few optional features enabled, including GIT_TEST_MULTI_PACK_INDEX
and GIT_TEST_COMMIT_GRAPH. I'll use this to watch the features and ensure
they work well with the rest of the ongoing work. Eventually, we can add
these variables to the Travis CI scripts.

[1] https://git.visualstudio.com/git/_build?definitionId=4

Derrick Stolee (3):
   midx: fix broken free() in close_midx()
   midx: close multi-pack-index on repack
   multi-pack-index: define GIT_TEST_MULTI_PACK_INDEX

  builtin/repack.c|  7 +--
  midx.c  | 26 --
  midx.h  |  6 +-
  t/README|  4 
  t/t5310-pack-bitmaps.sh |  1 +
  t/t5319-multi-pack-index.sh |  2 +-
  t/t9300-fast-import.sh  |  2 +-
  7 files changed, 37 insertions(+), 11 deletions(-)


base-commit: 5a0cc8aca797dbd7d2be3b67458ff880ed45cddf
I should explicitly mention that this base commit is different as 
otherwise I will conflict with ds/multi-pack-verify with the new 
prototype in midx.h.


Thanks,
-Stolee


[PATCH v2 2/3] midx: close multi-pack-index on repack

2018-10-12 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When repacking, we may remove pack-files. This invalidates the
multi-pack-index (if it exists). Previously, we removed the
multi-pack-index file before removing any pack-file. In some cases,
the repack command may load the multi-pack-index into memory. This
may lead to later in-memory references to the non-existent pack-
files.

Signed-off-by: Derrick Stolee 
---
 builtin/repack.c |  3 +--
 midx.c   | 15 ---
 midx.h   |  4 +++-
 3 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index c6a7943d5c..44965cbaa3 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -431,8 +431,7 @@ int cmd_repack(int argc, const char **argv, const char 
*prefix)
char *fname, *fname_old;
 
if (!midx_cleared) {
-   /* if we move a packfile, it will invalidated 
the midx */
-   clear_midx_file(get_object_directory());
+   clear_midx_file(the_repository);
midx_cleared = 1;
}
 
diff --git a/midx.c b/midx.c
index bf1f511862..22247a30ab 100644
--- a/midx.c
+++ b/midx.c
@@ -176,9 +176,13 @@ cleanup_fail:
return NULL;
 }
 
-static void close_midx(struct multi_pack_index *m)
+void close_midx(struct multi_pack_index *m)
 {
uint32_t i;
+
+   if (!m)
+   return;
+
munmap((unsigned char *)m->data, m->data_len);
close(m->fd);
m->fd = -1;
@@ -914,9 +918,14 @@ cleanup:
return 0;
 }
 
-void clear_midx_file(const char *object_dir)
+void clear_midx_file(struct repository *r)
 {
-   char *midx = get_midx_filename(object_dir);
+   char *midx = get_midx_filename(r->objects->objectdir);
+
+   if (r->objects && r->objects->multi_pack_index) {
+   close_midx(r->objects->multi_pack_index);
+   r->objects->multi_pack_index = NULL;
+   }
 
if (remove_path(midx)) {
UNLEAK(midx);
diff --git a/midx.h b/midx.h
index ce80b91c68..0f68bccdd5 100644
--- a/midx.h
+++ b/midx.h
@@ -42,7 +42,9 @@ int midx_contains_pack(struct multi_pack_index *m, const char 
*idx_name);
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, 
int local);
 
 int write_midx_file(const char *object_dir);
-void clear_midx_file(const char *object_dir);
+void clear_midx_file(struct repository *r);
 int verify_midx_file(const char *object_dir);
 
+void close_midx(struct multi_pack_index *m);
+
 #endif
-- 
gitgitgadget



[PATCH v2 3/3] multi-pack-index: define GIT_TEST_MULTI_PACK_INDEX

2018-10-12 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The multi-pack-index feature is tested in isolation by
t5319-multi-pack-index.sh, but there are many more interesting
scenarios in the test suite surrounding pack-file data shapes
and interactions. Since the multi-pack-index is an optional
data structure, it does not make sense to include it by default
in those tests.

Instead, add a new GIT_TEST_MULTI_PACK_INDEX environment variable
that enables core.multiPackIndex and writes a multi-pack-index
after each 'git repack' command. This adds extra test coverage
when needed.

There are a few spots in the test suite that need to react to this
change:

* t5319-multi-pack-index.sh: there is a test that checks that
  'git repack' deletes the multi-pack-index. Disable the environment
  variable to ensure this still happens.

* t5310-pack-bitmaps.sh: One test moves a pack-file from the object
  directory to an alternate. This breaks the multi-pack-index, so
  delete the multi-pack-index at this point, if it exists.

* t9300-fast-import.sh: One test verifies the number of files in
  the .git/objects/pack directory is exactly 8. Exclude the
  multi-pack-index from this count so it is still 8 in all cases.

Signed-off-by: Derrick Stolee 
---
 builtin/repack.c| 4 
 midx.c  | 9 +++--
 midx.h  | 2 ++
 t/README| 4 
 t/t5310-pack-bitmaps.sh | 1 +
 t/t5319-multi-pack-index.sh | 2 +-
 t/t9300-fast-import.sh  | 2 +-
 7 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 44965cbaa3..26dcccdafc 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -553,6 +553,10 @@ int cmd_repack(int argc, const char **argv, const char 
*prefix)
if (!no_update_server_info)
update_server_info(0);
remove_temporary_files();
+
+   if (git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0))
+   write_midx_file(get_object_directory());
+
string_list_clear(&names, 0);
string_list_clear(&rollback, 0);
string_list_clear(&existing_packs, 0);
diff --git a/midx.c b/midx.c
index 22247a30ab..02b2211e31 100644
--- a/midx.c
+++ b/midx.c
@@ -335,9 +335,14 @@ int prepare_multi_pack_index_one(struct repository *r, 
const char *object_dir, i
struct multi_pack_index *m;
struct multi_pack_index *m_search;
int config_value;
+   static int env_value = -1;
 
-   if (repo_config_get_bool(r, "core.multipackindex", &config_value) ||
-   !config_value)
+   if (env_value < 0)
+   env_value = git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0);
+
+   if (!env_value &&
+   (repo_config_get_bool(r, "core.multipackindex", &config_value) ||
+   !config_value))
return 0;
 
for (m_search = r->objects->multi_pack_index; m_search; m_search = 
m_search->next)
diff --git a/midx.h b/midx.h
index 0f68bccdd5..f2bb7e681c 100644
--- a/midx.h
+++ b/midx.h
@@ -3,6 +3,8 @@
 
 #include "repository.h"
 
+#define GIT_TEST_MULTI_PACK_INDEX "GIT_TEST_MULTI_PACK_INDEX"
+
 struct multi_pack_index {
struct multi_pack_index *next;
 
diff --git a/t/README b/t/README
index 5e48a043ce..9bfdd3004c 100644
--- a/t/README
+++ b/t/README
@@ -327,6 +327,10 @@ GIT_TEST_COMMIT_GRAPH=, when true, forces the 
commit-graph to
 be written after every 'git commit' command, and overrides the
 'core.commitGraph' setting to true.
 
+GIT_TEST_MULTI_PACK_INDEX=, when true, forces the multi-pack-
+index to be written after every 'git repack' command, and overrides the
+'core.multiPackIndex' setting to true.
+
 Naming Tests
 
 
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 1be3459c5b..82d7f7f6a5 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -191,6 +191,7 @@ test_expect_success 'pack-objects respects 
--honor-pack-keep (local bitmapped pa
 
 test_expect_success 'pack-objects respects --local (non-local bitmapped pack)' 
'
mv .git/objects/pack/$packbitmap.* alt.git/objects/pack/ &&
+   rm -f .git/objects/pack/multi-pack-index &&
test_when_finished "mv alt.git/objects/pack/$packbitmap.* 
.git/objects/pack/" &&
echo HEAD | git pack-objects --local --stdout --revs >3b.pack &&
git index-pack 3b.pack &&
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index bd8e841b81..70926b5bc0 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -271,7 +271,7 @@ test_expect_success 'git-fsck incorrect offset' '
 
 test_expect_success 'repack removes multi-pack-index' '
test_path_is_file $objdir/pack/multi-pack-index &&
-   git repack -adf &&
+   GIT_TEST_MULTI_PA

[PATCH v2 1/3] midx: fix broken free() in close_midx()

2018-10-12 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When closing a multi-pack-index, we intend to close each pack-file
and free the struct packed_git that represents it. However, this
line was previously freeing the array of pointers, not the
pointer itself. This leads to a double-free issue.

Signed-off-by: Derrick Stolee 
---
 midx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 713d6f9dde..bf1f511862 100644
--- a/midx.c
+++ b/midx.c
@@ -186,7 +186,7 @@ static void close_midx(struct multi_pack_index *m)
for (i = 0; i < m->num_packs; i++) {
if (m->packs[i]) {
close_pack(m->packs[i]);
-   free(m->packs);
+   free(m->packs[i]);
}
}
FREE_AND_NULL(m->packs);
-- 
gitgitgadget



[PATCH v2 0/3] Add GIT_TEST_MULTI_PACK_INDEX environment variable

2018-10-12 Thread Derrick Stolee via GitGitGadget
To increase coverage of the multi-pack-index feature, add a
GIT_TEST_MULTI_PACK_INDEX environment variable similar to other GIT_TEST_*
variables.

After creating the environment variable and running the test suite with it
enabled, I found a few bugs in the multi-pack-index implementation. These
are handled by the first two patches.

I have set up a CI build on Azure Pipelines [1] that runs the test suite
with a few optional features enabled, including GIT_TEST_MULTI_PACK_INDEX
and GIT_TEST_COMMIT_GRAPH. I'll use this to watch the features and ensure
they work well with the rest of the ongoing work. Eventually, we can add
these variables to the Travis CI scripts.

[1] https://git.visualstudio.com/git/_build?definitionId=4

Derrick Stolee (3):
  midx: fix broken free() in close_midx()
  midx: close multi-pack-index on repack
  multi-pack-index: define GIT_TEST_MULTI_PACK_INDEX

 builtin/repack.c|  7 +--
 midx.c  | 26 --
 midx.h  |  6 +-
 t/README|  4 
 t/t5310-pack-bitmaps.sh |  1 +
 t/t5319-multi-pack-index.sh |  2 +-
 t/t9300-fast-import.sh  |  2 +-
 7 files changed, 37 insertions(+), 11 deletions(-)


base-commit: 5a0cc8aca797dbd7d2be3b67458ff880ed45cddf
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-27%2Fderrickstolee%2Fmidx-test%2Fupstream-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-27/derrickstolee/midx-test/upstream-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/27

Range-diff vs v1:

 1:  9fcbbe336d = 1:  8bd672fe26 midx: fix broken free() in close_midx()
 2:  725ebadc92 ! 2:  2d8f26679d midx: close multi-pack-index on repack
 @@ -15,16 +15,15 @@
  --- a/builtin/repack.c
  +++ b/builtin/repack.c
  @@
 +  char *fname, *fname_old;
   
if (!midx_cleared) {
 -  /* if we move a packfile, it will invalidated 
the midx */
 -+ if (the_repository->objects) {
 -+ 
close_midx(the_repository->objects->multi_pack_index);
 -+ 
the_repository->objects->multi_pack_index = NULL;
 -+ }
 -  clear_midx_file(get_object_directory());
 +- /* if we move a packfile, it will invalidated 
the midx */
 +- clear_midx_file(get_object_directory());
 ++ clear_midx_file(the_repository);
midx_cleared = 1;
}
 + 
  
  diff --git a/midx.c b/midx.c
  --- a/midx.c
 @@ -44,13 +43,34 @@
munmap((unsigned char *)m->data, m->data_len);
close(m->fd);
m->fd = -1;
 +@@
 +  return 0;
 + }
 + 
 +-void clear_midx_file(const char *object_dir)
 ++void clear_midx_file(struct repository *r)
 + {
 +- char *midx = get_midx_filename(object_dir);
 ++ char *midx = get_midx_filename(r->objects->objectdir);
 ++
 ++ if (r->objects && r->objects->multi_pack_index) {
 ++ close_midx(r->objects->multi_pack_index);
 ++ r->objects->multi_pack_index = NULL;
 ++ }
 + 
 +  if (remove_path(midx)) {
 +  UNLEAK(midx);
  
  diff --git a/midx.h b/midx.h
  --- a/midx.h
  +++ b/midx.h
  @@
 + int prepare_multi_pack_index_one(struct repository *r, const char 
*object_dir, int local);
 + 
   int write_midx_file(const char *object_dir);
 - void clear_midx_file(const char *object_dir);
 +-void clear_midx_file(const char *object_dir);
 ++void clear_midx_file(struct repository *r);
 + int verify_midx_file(const char *object_dir);
   
  +void close_midx(struct multi_pack_index *m);
  +
 3:  04e3e91082 = 3:  57c64e814c multi-pack-index: define 
GIT_TEST_MULTI_PACK_INDEX

-- 
gitgitgadget


Re: [PATCH v3 1/2] commit-graph write: add progress output

2018-10-12 Thread Derrick Stolee

On 10/12/2018 11:07 AM, Ævar Arnfjörð Bjarmason wrote:

On Fri, Oct 12 2018, Junio C Hamano wrote:


Makes sense. If this second iteration were also time consuming,
then it probably is a good idea to split these into two separate
phases?  "Counting 1...N" followed by "Inspecting 1...N" or
something like that.  Of course, if the latter does not take much
time, then doing without any progress indicator is also fine.

That's a good point. Derrick: If the three loops in close_reachable()
had to be split up into different progress stages and given different
names what do you think they should be? Now it's "Annotating commits in
commit graph" for all of them.


The following is the best I can think of right now:

1. Loading known commits.
2. Expanding reachable commits.
3. Clearing commit marks.

-Stolee


Git Test Coverage Report (Friday, Oct 12)

2018-10-12 Thread Derrick Stolee
i < nr; i++) {
3255089ada 3515)    ieot->entries[i].offset = get_be32(index);
3255089ada 3516)    index += sizeof(uint32_t);
3255089ada 3517)    ieot->entries[i].nr = get_be32(index);
3255089ada 3518)    index += sizeof(uint32_t);
3255089ada 3521)    return ieot;
3255089ada 3524) static void write_ieot_extension(struct strbuf *sb, 
struct index_entry_offset_table *ieot)

3255089ada 3530)    put_be32(&buffer, IEOT_VERSION);
3255089ada 3531)    strbuf_add(sb, &buffer, sizeof(uint32_t));
3255089ada 3534)    for (i = 0; i < ieot->nr; i++) {
3255089ada 3537)    put_be32(&buffer, ieot->entries[i].offset);
3255089ada 3538)    strbuf_add(sb, &buffer, sizeof(uint32_t));
3255089ada 3541)    put_be32(&buffer, ieot->entries[i].nr);
3255089ada 3542)    strbuf_add(sb, &buffer, sizeof(uint32_t));
3255089ada 3544) }

rebase-interactive.c
64a43cbd5d 61) return error_errno(_("could not read '%s'."), todo_file);
64a43cbd5d 65) strbuf_release(&buf);
64a43cbd5d 66) return -1;
a9f5476fbc 74) return error_errno(_("could not read '%s'."), todo_file);
a9f5476fbc 78) strbuf_release(&buf);
a9f5476fbc 79) return -1;
64a43cbd5d 85) return -1;

revision.c
4943d28849 2931) return;
4943d28849 2934) return;
4943d28849 2937) c->object.flags |= UNINTERESTING;
4943d28849 2940) return;
4943d28849 2943) mark_parents_uninteresting(c);
4943d28849 2966) return;
4943d28849 2969) return;
4943d28849 2974) return;
4943d28849 3019) info->topo_queue.compare = compare_commits_by_commit_date;
4943d28849 3020) break;
4943d28849 3022) init_author_date_slab(&info->author_date);
4943d28849 3023) info->topo_queue.compare = compare_commits_by_author_date;
4943d28849 3024) info->topo_queue.cb_data = &info->author_date;
4943d28849 3025) break;
4943d28849 3038) continue;
4943d28849 3048) record_author_date(&info->author_date, c);
6c04ff3001 3086) if (!revs->ignore_missing_links)
6c04ff3001 3087) die("Failed to traverse parents of commit %s",
4943d28849 3088) oid_to_hex(&commit->object.oid));
4943d28849 3096) continue;

sequencer.c
65850686cf 2276) return;
65850686cf 2373) write_file(rebase_path_quiet(), "%s\n", quiet);
2c58483a59 3371) return error(_("could not checkout %s"), commit);
4df66c40b0 3385) return error(_("%s: not a valid OID"), orig_head);
b97e187364 4748) return -1;
b97e187364 4751) return -1;
b97e187364 4757) return error_errno(_("could not read '%s'."), todo_file);
b97e187364 4760) todo_list_release(&todo_list);
b97e187364 4761) return error(_("unusable todo list: '%s'"), todo_file);
b97e187364 4780) todo_list_release(&todo_list);
b97e187364 4781) return -1;
b97e187364 4785) return error(_("could not copy '%s' to '%s'."), todo_file,
b97e187364 4789) return error(_("could not transform the todo list"));
b97e187364 4818) return error(_("could not transform the todo list"));
b97e187364 4821) return error(_("could not skip unnecessary pick 
commands"));

b97e187364 4827) return -1;

split-index.c
568f3a6073 310) const unsigned int ondisk_flags =
568f3a6073 314) ce_flags = ce->ce_flags;
568f3a6073 315) base_flags = base->ce_flags;
568f3a6073 317) ce->ce_flags   &= ondisk_flags;
568f3a6073 318) base->ce_flags &= ondisk_flags;
568f3a6073 319) ret = memcmp(&ce->ce_stat_data, &base->ce_stat_data,
568f3a6073 322) ce->ce_flags = ce_flags;
568f3a6073 323) base->ce_flags = base_flags;
568f3a6073 324) if (ret)
568f3a6073 325) ce->ce_flags |= CE_UPDATE_IN_BASE;

strbuf.c
f95736288a  127) --sb->len;

transport-helper.c
fb19d32f05  643) if (!data->connect && !data->stateless_connect)

transport.c
99bcb883cb  299) BUG("buffer must be empty at the end of handshake()");

Commits introducing uncovered code:
Alban Gruin  0af129b2e: rebase--interactive2: rewrite the submodes 
of interactive rebase in C

Alban Gruin  2c58483a5: rebase -i: rewrite setup_reflog_action() in C
Alban Gruin  34b47315d: rebase -i: move rebase--helper modes to 
rebase--interactive

Alban Gruin  4df66c40b: rebase -i: rewrite checkout_onto() in C
Alban Gruin  53bbcfbde: rebase -i: implement the main part of 
interactive rebase as a builtin
Alban Gruin  64a43cbd5: rebase -i: rewrite the edit-todo 
functionality in C

Alban Gruin  65850686c: rebase -i: rewrite write_basic_state() in C
Alban Gruin  a9f5476fb: sequencer: refactor append_todo_help() to 
write its message to a buffer

Alban Gruin  b97e18736: rebase -i: rewrite complete_action() in C
Ben Peart  3255089ad: ieot: add Index Entry Offset Table (IEOT) 
extension

Ben Peart  3b1d9e045: eoie: add End of Index Entry (EOIE) extension
Ben Peart  77ff1127a: read-cache: load cache entries on worker threads
Ben 

Re: [PATCH v3 4/7] revision.c: begin refactoring --topo-order logic

2018-10-12 Thread Derrick Stolee

On 10/12/2018 2:33 AM, Junio C Hamano wrote:

"Derrick Stolee via GitGitGadget"  writes:


* revs->limited implies we run limit_list() to walk the entire
   reachable set. There are some short-cuts here, such as if we
   perform a range query like 'git rev-list COMPARE..HEAD' and we
   can stop limit_list() when all queued commits are uninteresting.

* revs->topo_order implies we run sort_in_topological_order(). See
   the implementation of that method in commit.c. It implies that
   the full set of commits to order is in the given commit_list.

These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.

With or without "--topo-order", running rev-list without any
negative commit means we must dig down to the roots that can be
reached from the positive commits we have.
If we use default order in 'git log', we don't walk all the way to the 
root commits, and instead trust the commit-date. (This is different than 
--date-order, which does guarantee parents after children.) In this 
case, revs->limited is false.

I am to sure if having to run the "sort" of order N counts as "walk
the entire reachable set once" (in addition to the enumeration that
must be done to prepare that N commits, performed in limit_list()).


sort_in_topological_order() does actually _two_ walks (the in-degree 
computation plus the walk that peels commits of in-degree zero), but 
those walks are cheaper because we've already parsed the commits in 
limit_list().

3. expand_topo_walk() provides get_revision_1() with a way to signal
walking beyond the latest commit. Currently, this calls
add_parents_to_list() exactly like the old logic.

"latest"?  We dig down the history from newer to older, so at some
point we hit an old commit and need to find the parents to keep
walking towards even older parts of the history.  Did you mean
"earliest" instead?
I mean "latest" in terms of the algorithm, so "the commit that was 
returned by get_revision_1() most recently". This could use some 
rewriting for clarity.
  
@@ -2454,7 +2455,7 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s

if (revs->diffopt.objfind)
revs->simplify_history = 0;
  
-	if (revs->topo_order)

+   if (revs->topo_order && !generation_numbers_enabled(the_repository))
revs->limited = 1;

Are we expecting that this is always a bool?  Can there be new
commits for which generation numbers are not computed and stored
while all the old, stable and packed commits have generation
numbers?


For this algorithm to work, we only care that _some_ commits have 
generation numbers. We expect that if a commit-graph file exists with 
generation numbers, then the majority of commits have generation 
numbers. The commits that were added or fetched since the commit-graph 
was written will have generation number INFINITY, but the topo-order 
algorithm will still work and be efficient in those cases. (This is also 
why we have the "half graph" case in test_three_modes.)



@@ -2892,6 +2893,33 @@ static int mark_uninteresting(const struct object_id 
*oid,
return 0;
  }
  
+struct topo_walk_info {};

+
+static void init_topo_walk(struct rev_info *revs)
+{
+   struct topo_walk_info *info;
+   revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
+   info = revs->topo_walk_info;
+   memset(info, 0, sizeof(struct topo_walk_info));

There is no member in the struct at this point.  Are we sure this is
safe?  Just being curious.  I know xmalloc() gives us at least one
byte and info won't be NULL.  I just do not know offhand if we have
a guarantee that memset() acts sensibly to fill the first 0 bytes.
This is a good question. It seems to work for me when I check out your 
version of this commit (6c04ff30 "revision.c: begin refactoring 
--topo-order logic") and run all tests.

+   limit_list(revs);
+   sort_in_topological_order(&revs->commits, revs->sort_order);
+}
+
+static struct commit *next_topo_commit(struct rev_info *revs)
+{
+   return pop_commit(&revs->commits);
+}
+
+static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
+{
+   if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
+   if (!revs->ignore_missing_links)
+   die("Failed to traverse parents of commit %s",
+   oid_to_hex(&commit->object.oid));
+   }
+}
+
  int prepare_revision_walk(struct rev_info *revs)
  {
int i;
@@ -2928,11 +2956,13 @@ int prepare_revision_walk(struct rev_info *revs)
commit_list_sort_by_date(&revs->commits);
if (revs->no_wa

Re: [PATCH v4 0/1] contrib: Add script to show uncovered "new" lines

2018-10-12 Thread Derrick Stolee

On 10/11/2018 11:01 PM, Junio C Hamano wrote:

"Derrick Stolee via GitGitGadget"  writes:


CHANGES IN V4: I reduced the blame output using -s which decreases the
width. I include a summary of the commit authors at the end to help people
see the lines they wrote. This version is also copied into a build
definition in the public Git project on Azure Pipelines [1]. I'll use this
build definition to generate the coverage report after each "What's Cooking"
email.

[1] https://git.visualstudio.com/git/_build?definitionId=5

Thanks.  Let's move this forward by merging it down to 'next'.
Sounds good. When it moves into 'master' I can update my build to call 
the script from source.


Thanks,
-Stolee


Re: [PATCH v3 7/7] revision.c: refactor basic topo-order logic

2018-10-11 Thread Derrick Stolee

On 10/11/2018 11:35 AM, Jeff King wrote:

On Fri, Sep 21, 2018 at 10:39:36AM -0700, Derrick Stolee via GitGitGadget wrote:


From: Derrick Stolee 

When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
[...]
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,

A minor nit, but this commit message doesn't mention the most basic
thing up front: that its main purpose is to introduce a new algorithm
for topo-order. ;)

It's obvious in the context of reviewing the series, but somebody
reading "git log" later may want a bit more. Perhaps:

   revision.c: implement generation-based topo-order algorithm

as a subject, and/or an introductory paragraph like:

   The current --topo-order algorithm requires walking all commits we
   are going to output up front, topo-sorting them, all before
   outputting the first value. This patch introduces a new algorithm
   which uses stored generation numbers to incrementally walk in
   topo-order, outputting commits as we go.

Other than that, I find this to be a wonderfully explanatory commit
message. :)


Good idea. I'll make that change.




The walks are as follows:

1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.

OK, this makes sense. If we know that everybody else in our queue is at
generation X, then it is safe to output a commit at generation greater
than X.

I think this by itself would allow us to implement "show no parents
before all of its children are shown", right? But --topo-order promises
a bit more: "avoid showing commits no multiple lines of history
intermixed".

I guess also INFINITY generation numbers need more. For a real
generation number, we know that "gen(A) == gen(B)" implies that there is
no ancestry relationship between the two. But not so for INFINITY.


Yeah, to deal with INFINITY (and ZERO, but that won't happen if 
generation_numbers_enabled() returns true), we treat gen(A) == gen(B) as 
a "no information" state. So, to output a commit at generation X, we 
need to have our maximum generation number in the unexplored area to be 
at most X - 1. You'll see strict inequality when checking generations.




2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.

I wondered how this would work for INFINITY. We can't know the order of
a bunch of INFINITY nodes at all, so we never know when their in-degree
values are "done". But if I understand the EXPLORE walk, we'd basically
walk all of INFINITY down to something with a real generation number. Is
that right?

But after that, I'm not totally clear on why we need this INDEGREE walk.


The INDEGREE walk is an important element for Kahn's algorithm. The 
final output order is dictated by peeling commits of "indegree zero" to 
ensure all children are output before their parents. (Note: since we use 
literal 0 to mean "uninitialized", we peel commits when the indegree 
slab has value 1.)


This walk replaces the indegree logic from sort_in_topological_order(). 
That method performs one walk that fills the indegree slab, then another 
walk that peels the commits with indegree 0 and inserts them into a list.



3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.

OK, this makes sense to make --author-date-order, etc, work. Potentially
those numbers might have no relationship at all with the graph
structure, but we promise "no parent before its children are shown", so
this is really just a tie-breaker after the topo-sort anyway. As long as
steps 1 and 2 are correct and produce a complete set of commits for one
"layer", this should be OK.

I guess 

Re: [PATCH 1/2] One filter per commit

2018-10-11 Thread Derrick Stolee

On 10/10/2018 9:21 PM, Jonathan Tan wrote:

diff --git a/commit-graph.c b/commit-graph.c
index f415d3b41f..90b0b3df90 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -715,13 +715,11 @@ static int add_ref_to_list(const char *refname,
  static void add_changes_to_bloom_filter(struct bloom_filter *bf,
struct commit *parent,
struct commit *commit,
+   int index,
struct diff_options *diffopt)
  {
-   unsigned char p_c_hash[GIT_MAX_RAWSZ];
int i;
  
-	hashxor(parent->object.oid.hash, commit->object.oid.hash, p_c_hash);

-
diff_tree_oid(&parent->object.oid, &commit->object.oid, "", diffopt);
diffcore_std(diffopt);
  
@@ -756,8 +754,8 @@ static void add_changes_to_bloom_filter(struct bloom_filter *bf,

the_hash_algo->update_fn(&ctx, path, p - path);
the_hash_algo->final_fn(name_hash, &ctx);
  
-			hashxor(name_hash, p_c_hash, hash);

-   bloom_filter_add_hash(bf, hash);
+   hashxor(name_hash, parent->object.oid.hash, hash);
+   bloom_filter_add_hash(bf, index, hash);
} while (*p);
  
  		diff_free_filepair(diff_queued_diff.queue[i]);

[snip]

@@ -768,11 +766,10 @@ static void add_changes_to_bloom_filter(struct 
bloom_filter *bf,
  }
  
  static void fill_bloom_filter(struct bloom_filter *bf,

-   struct progress *progress)
+   struct progress *progress, struct commit 
**commits, int commit_nr)
  {
struct rev_info revs;
const char *revs_argv[] = {NULL, "--all", NULL};
-   struct commit *commit;
int i = 0;
  
  	/* We (re-)create the bloom filter from scratch every time for now. */

@@ -783,18 +780,19 @@ static void fill_bloom_filter(struct bloom_filter *bf,
if (prepare_revision_walk(&revs))
die("revision walk setup failed while preparing bloom filter");
  
-	while ((commit = get_revision(&revs))) {

+   for (i = 0; i < commit_nr; i++) {
+   struct commit *commit = commits[i];
struct commit_list *parent;
  
  		for (parent = commit->parents; parent; parent = parent->next)

-   add_changes_to_bloom_filter(bf, parent->item, commit,
+   add_changes_to_bloom_filter(bf, parent->item, commit, i,
&revs.diffopt);
  

[snip]
  
-		hashxor(pi->name_hash, p_c_hash, hash);

-   if (bloom_filter_check_hash(&bf, hash)) {
+   hashxor(pi->name_hash, parent->object.oid.hash, hash);
+   if (bloom_filter_check_hash(&bf, commit->graph_pos, hash)) {
/*
 * At least one of the interesting pathspecs differs,
 * so we can return early and let the diff machinery
One main benefit of storing on Bloom filter per commit is to avoid 
recomputing hashes at every commit. Currently, this patch only improves 
locality when checking membership at the cost of taking up more space. 
Drop the dependence on the parent oid and then we can save the time 
spent hashing during history queries.


-Stolee


Re: Bloom Filters

2018-10-11 Thread Derrick Stolee

On 10/9/2018 7:12 PM, Jeff King wrote:

On Tue, Oct 09, 2018 at 05:14:50PM -0400, Jeff King wrote:


Hmph. It really sounds like we could do better with a custom RLE
solution. But that makes me feel like I'm missing something, because
surely I can't invent something better than the state of the art in a
simple thought experiment, right?

I know what I'm proposing would be quite bad for random access, but my
impression is that EWAH is the same. For the scale of bitmaps we're
talking about, I think linear/streaming access through the bitmap would
be OK.

Thinking on it more, what I was missing is that for truly dense random
bitmaps, this will perform much worse. Because it will use a byte to say
"there's one 1", rather than a bit.

But I think it does OK in practice for the very sparse bitmaps we tend
to see in this application.  I was able to generate a complete output
that can reproduce "log --name-status -t" for linux.git in 32MB. But:

   - 15MB of that is commit sha1s, which will be stored elsewhere in a
 "real" system

   - 5MB of that is path list (which should shrink by a factor of 10 with
 prefix compression, and is really a function of a tree size less
 than history depth)

So the per-commit cost is not too bad. That's still not counting merges,
though, which would add another 10-15% (or maybe more; their bitmaps are
less sparse).

I don't know if this is a fruitful path at all or not. I was mostly just
satisfying my own curiosity on the bitmap encoding question. But I'll
post the patches, just to show my work. The first one is the same
initial proof of concept I showed earlier.

   [1/3]: initial tree-bitmap proof of concept
   [2/3]: test-tree-bitmap: add "dump" mode
   [3/3]: test-tree-bitmap: replace ewah with custom rle encoding

  Makefile|   1 +
  t/helper/test-tree-bitmap.c | 344 
  2 files changed, 345 insertions(+)
  create mode 100644 t/helper/test-tree-bitmap.c
I'm trying to test this out myself, and am having trouble reverse 
engineering how I'm supposed to test it.


Looks like running "t/helper/test-tree-bitmap gen" will output a lot of 
binary data. Where should I store that? Does any file work?


Is this series just for the storage costs, assuming that we would 
replace all TREESAME checks with a query into this database? Or do you 
have a way to test how much this would improve a "git log -- " query?


Thanks,
-Stolee


Re: What's cooking in git.git (Oct 2018, #01; Wed, 10)

2018-10-11 Thread Derrick Stolee

On 10/10/2018 1:43 AM, Junio C Hamano wrote:


* ds/reachable-topo-order (2018-09-21) 7 commits
  - revision.c: refactor basic topo-order logic
  - revision.h: add whitespace in flag definitions
  - commit/revisions: bookkeeping before refactoring
  - revision.c: begin refactoring --topo-order logic
  - test-reach: add rev-list tests
  - test-reach: add run_three_modes method
  - prio-queue: add 'peek' operation

  The revision walker machinery learned to take advantage of the
  commit generation numbers stored in the commit-graph file.

  What's the status of this topic?
I've reached out for review, especially on the rather large patch 
"revision.c: refactor basic topo-order logic" but have received no 
messages about it. I don't think anyone has even done a cursory style 
review.


I really think that this is a valuable topic, and it would be nice to 
have in 2.20, but I'm not pushing to merge something that no one has 
reviewed.



* ds/format-commit-graph-docs (2018-08-21) 2 commits
  - commit-graph.txt: improve formatting for asciidoc
  - Docs: Add commit-graph tech docs to Makefile

  Design docs for the commit-graph machinery is now made into HTML as
  well as text.

  Will discard.
  I am inclined to drop these, as I do not see much clarity in HTML
  output over the text source.  Opinions?
These have been marked for discard for a few weeks. I agree they should 
be discarded.


Thanks,
-Stolee


Re: [RFC PATCH 6/6] commit-reach: fix first-parent heuristic

2018-10-11 Thread Derrick Stolee

On 10/10/2018 9:50 PM, Jonathan Nieder wrote:

Hi,

Derrick Stolee wrote:


  commit-reach.c| 4 +++-
  t/t6600-test-reach.sh | 2 +-
  2 files changed, 4 insertions(+), 2 deletions(-)

I like this testing technique, and the test passes for me.

Except: if I put

CC = cc -m32
NO_OPENSSL = YesPlease
NO_CURL = YesPlease

in config.mak (the first line to force 32-bit pointers, the others
to avoid some dependencies on libs that I don't have 32-bit versions
of), then the test fails for me:

  $ ./t6600-test-reach.sh -v -x -i
  [...]
  expecting success:
  cp commit-graph-full .git/objects/info/commit-graph &&
  run_and_check_trace2 can_all_from_reach_with_flag num_walked 19 input 
\
  "test-tool reach can_all_from_reach"

  ++ cp commit-graph-full .git/objects/info/commit-graph
  ++ run_and_check_trace2 can_all_from_reach_with_flag num_walked 19 input 
'test-tool reach can_all_from_r
  each'
  ++ CATEGORY=can_all_from_reach_with_flag
  ++ KEY=num_walked
  ++ VALUE=19
  ++ INPUT=input
  ++ COMMAND='test-tool reach can_all_from_reach'
  +++ pwd
  ++ GIT_TR2_PERFORMANCE='/usr/local/google/home/jrn/src/git/t/trash 
directory.t6600-test-reach/perf-log.t
  xt'
  ++ test-tool reach can_all_from_reach
  can_all_from_reach(X,Y):1
  ++ cat perf-log.txt
  ++ grep 'category:can_all_from_reach_with_flag key:num_walked value:19'
  error: last command exited with $?=1
  not ok 11 - can_all_from_reach:perf
  #
  #   cp commit-graph-full .git/objects/info/commit-graph &&
  #   run_and_check_trace2 can_all_from_reach_with_flag num_walked 
19 input \
  #   "test-tool reach can_all_from_reach"
  #

When I cat perf-log.txt, I see

   ..category:can_all_from_reach_with_flag key:num_walked value:20

instead of the expected 19.
It is possible this is related to the sort-order problem reported and 
fixed by Rene [1]. I'll look into it in any case.


[1] 
https://public-inbox.org/git/dca35e44-a763-bcf0-f457-b8dab5381...@web.de/


Re: [PATCH 0/4] Bloom filter experiment

2018-10-09 Thread Derrick Stolee

On 10/9/2018 3:34 PM, SZEDER Gábor wrote:

To keep the ball rolling, here is my proof of concept in a somewhat
cleaned-up form, with still plenty of rough edges.

You can play around with it like this:

   $ GIT_USE_POC_BLOOM_FILTER=$((8*1024*1024*8)) git commit-graph write
   Computing commit graph generation numbers: 100% (52801/52801), done.
   Computing bloom filter: 100% (52801/52801), done.
   # Yeah, I even added progress indicator! :)
   $ GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y git rev-list --count 
--full-history HEAD -- t/valgrind/valgrind.sh
   886
   20:40:24.783699 revision.c:486  bloom filter total queries: 66095 
definitely not: 64953 maybe: 1142 false positives: 256 fp ratio: 0.003873

The value of $GIT_USE_POC_BLOOM_FILTER only really matters when writing
the Bloom filter, and it specifies the number of bits in the filter's
bitmap, IOW the above command creates a 8MB Bloom filter.  To make use
of the filter the variable can be anything non-empty.

Writing the Bloom filter is very slow as it is (yeah, that's why
bothered with the progress indicator ;).  I wrote about it in patch 2's
commit message: the cause for about half of the slowness is rather
obvious, but I don't (yet) know what's responsible for the other half.


Not a single test...  but I've run loops over all files in git.git
comparing 'git rev-list HEAD -- $file's output with and without the
Bloom filter, and, surprisingly, they match.  My quick'n'dirty
experiments usually don't fare this well...


It's also available at:

   https://github.com/szeder/git bloom-filter-experiment


Thanks! I will take a close look at this tomorrow and start playing with it.

-Stolee



Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)

2018-10-09 Thread Derrick Stolee

On 10/9/2018 2:46 PM, Jeff King wrote:

On Tue, Oct 09, 2018 at 09:48:20AM -0400, Derrick Stolee wrote:


[I snipped all of the parts about bloom filters that seemed entirely
  reasonable to me ;) ]

Imagine we have that list. Is a bloom filter still the best data
structure for each commit? At the point that we have the complete
universe of paths, we could give each commit a bitmap of changed paths.
That lets us ask "did this commit touch these paths" (collect the bits
from the list of paths, then check for 1's), as well as "what paths did
we touch" (collect the 1 bits, and then index the path list).  Those
bitmaps should compress very well via EWAH or similar (most of them
would be huge stretches of 0's punctuated by short runs of 1's).

I'm not convinced we would frequently have runs of 1's, and the bitmap would
not compress much better than simply listing the positions. For example, a
path "foo/bar" that resolves to a tree would only start a run if the next
changes are the initial section of entries in that tree (sorted
lexicographically) such as "foo/bar/a, foo/bar/b". If we deepen into a tree,
then we will break the run of 1's unless we changed every path deeper than
that tree.

Yeah, I doubt we'd really have runs of 1's (by short, I just mean 1 or
2). I agree that listing the positions could work, though I sort of
assumed that was more or less what a decent compressed bitmap would
turn into. E.g., if bit N is set, we should be able to say "N-1
zeroes, 1 one" in about the same size as we could say "position N".

EWAH seems pretty awful in that regard. Or at least its serialized
format is (or maybe it's our implementation that is bad).

The patch below generates a bitmap for each commit in a repository (it
doesn't output the total list of paths; I've left that as an exercise
for the reader). On linux.git, the result is 57MB. But when I look at
the individual bitmap sizes (via GIT_TRACE), they're quite silly.
Storing a single set bit takes 28 bytes in serialized form!

There are only around 120k unique paths (including prefix trees).
Naively using run-length encoding and varints, our worst case should be
something like 18-20 bits to say "120k zeroes, then a 1, then all
zeroes".  And the average case should be better (you don't even need to
say "120k", but some smaller number).

I wonder if Roaring does better here.


In these sparse cases, usually Roaring will organize the data as "array 
chunks" which are simply lists of the values. The thing that makes this 
still compressible is that we store two bytes per entry, as the entries 
are grouped by a common most-significant two bytes. SInce you say ~120k 
unique paths, the Roaring bitmap would have two or three chunks per 
bitmap (and those chunks could be empty). The overhead to store the 
chunk positions, types, and lengths does come at a cost, but it's more 
like 32 bytes _per commit_.



Gzipping the resulting bitmaps drops the total size to about 7.5MB.
That's not a particularly important number, but I think it shows that
the built-in ewah compression is far from ideal.

Just listing the positions with a series of varints would generally be
fine, since we expect sparse bitmaps. I just hoped that a good
RLE scheme would degrade to roughly that for the sparse case, but also
perform well for more dense cases.

So at any rate, I do think it would not be out of the question to store
bitmaps like this. I'm much more worried about the maintenance cost of
adding new entries incrementally. I think it's only feasible if we give
up sorting, and then I wonder what other problems that might cause.
The patch below gives me a starting point to try the Bloom filter 
approach and see what the numbers are like. You did all the "git" stuff 
like computing the changed paths, so thanks!


-- >8 --
diff --git a/Makefile b/Makefile
index 13e1c52478..f6e823f2d6 100644
--- a/Makefile
+++ b/Makefile
@@ -751,6 +751,7 @@ TEST_PROGRAMS_NEED_X += test-parse-options
  TEST_PROGRAMS_NEED_X += test-pkt-line
  TEST_PROGRAMS_NEED_X += test-svn-fe
  TEST_PROGRAMS_NEED_X += test-tool
+TEST_PROGRAMS_NEED_X += test-tree-bitmap
  
  TEST_PROGRAMS = $(patsubst %,t/helper/%$X,$(TEST_PROGRAMS_NEED_X))
  
diff --git a/t/helper/test-tree-bitmap.c b/t/helper/test-tree-bitmap.c

new file mode 100644
index 00..bc5cf0e514
--- /dev/null
+++ b/t/helper/test-tree-bitmap.c
@@ -0,0 +1,167 @@
+#include "cache.h"
+#include "revision.h"
+#include "diffcore.h"
+#include "argv-array.h"
+#include "ewah/ewok.h"
+
+/* map of pathnames to bit positions */
+struct pathmap_entry {
+   struct hashmap_entry ent;
+   unsigned pos;
+   char path[FLEX_ARRAY];
+};
+
+static int pathmap_entry_hashcmp(const void *unused_cmp_data,
+

Re: [PATCH 2/3] midx: close multi-pack-index on repack

2018-10-09 Thread Derrick Stolee

On 10/9/2018 5:10 AM, Junio C Hamano wrote:

"Derrick Stolee via GitGitGadget"  writes:


diff --git a/builtin/repack.c b/builtin/repack.c
index c6a7943d5c..7925bb976e 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -432,6 +432,10 @@ int cmd_repack(int argc, const char **argv, const char 
*prefix)
  
  			if (!midx_cleared) {

/* if we move a packfile, it will invalidated 
the midx */
+   if (the_repository->objects) {
+   
close_midx(the_repository->objects->multi_pack_index);
+   
the_repository->objects->multi_pack_index = NULL;
+   }
clear_midx_file(get_object_directory());
midx_cleared = 1;

It somehow looks like a bit of layering violation, doesn't it?  When
we are clearing a midx file, don't we always want to do this as well?


You're right. It did feel a bit wrong. In v2, I'll replace this commit 
with a refactor of clear_midx_file() to do that. One tricky part is that 
we need to clear the file even if the in-memory struct hasn't been 
initialized, but I think passing a repository will suffice for that.


CC Stefan: Is there a plan to make get_object_directory() take a 
repository parameter?


Thanks,

-Stolee



Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)

2018-10-09 Thread Derrick Stolee

(Changing title to reflect the new topic.)

On 10/8/2018 11:08 PM, Jeff King wrote:

On Mon, Oct 08, 2018 at 02:29:47PM -0400, Derrick Stolee wrote:

There are two questions that I was hoping to answer by looking at 
your code:

1. How do you store your Bloom filter? Is it connected to the commit-graph
and split on a commit-by-commit basis (storing "$path" as a key), or is it
one huge Bloom filter (storing "$commitid:$path" as key)?

I guess you've probably thought all of this through for your
implementation, but let me pontificate.

I'd have done it as one fixed-size filter per commit. Then you should be
able to hash the path keys once, and apply the result as a bitwise query
to each individual commit (I'm assuming that it's constant-time to
access the filter for each, as an index into an mmap'd array, with the
offset coming from a commit-graph entry we'd be able to look up anyway).


You're right that we want to hash the path a constant number of times. 
Add in that we want to re-use information already serialized when 
updating the file, and we see that having a commit-by-commit Bloom 
filter is a good idea. Using (commit, path) pairs requires lots of 
re-hashing, repeated work when extending the filter, and poor locality 
when evaluating membership of a single key.


The nice thing is that you can generate k 32-bit hash values based on 
two 32-bit hash values that are "independent enough" (see [1]). We used 
Murmur3 with two different seed values to generate hashes a & b, then 
used the arithmetic progression a, a + b, a + 2b, ..., a + (k-1)b as our 
k hash values. These can be computed up front and then dropped into any 
size filter using a simple modulo operation. This allows flexible sizes 
in our filters.


I don't think fixed size filters are a good idea, and instead would opt 
for flex-sized filters with a maximum size. The typical parameters use 7 
hash functions and a filter size of (at least) 10 bits per entry. For 
most commits (say 60-70%), 256 bits (32 bytes) is enough. Using a 
maximum of 512 bytes covers 99% of commits. We will want these bounds to 
be configurable via config. If we had a fixed size, then we either make 
it too small (and don't have sufficient coverage of commits) or too 
large (and waste a lot of space on the commits that change very little).


We can store these flex-sized filters in the commit-graph using two 
columns of data (two new optional chunks):


* Bloom filter data: stores the binary data of each commit's Bloom 
filter, concatenated together in the same order as the commits appear in 
the commit-graph.


* Bloom filter positions: The ith position of this column stores the 
start of the (i+1)th Bloom filter (The 0th filter starts at byte 0). A 
Bloom filter of size 0 is intended to mean "we didn't store this filter 
because it would be too large". We can compute the lengths of the filter 
by inspecting adjacent values.


In order to be flexible, we will want to encode some basic information 
into the Bloom filter data chunk, such as a tuple of (hash version, num 
hash bits, num bits per entry). This allows us to change the parameters 
in config but still be able to read a serialized filter. Here I assume 
that all filters share the same parameters. The "hash version" here is 
different than the_hash_algo, because we don't care about cryptographic 
security, only a uniform distrubution (hence, Murmur3 is a good, fast 
option).


[1] 
https://web.archive.org/web/20090131053735/http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf




I think it would also be easier to deal with maintenance, since each
filter is independent (IIRC, you cannot delete from a bloom filter
without re-adding all of the other keys).

The obvious downside is that it's O(commits) storage instead of O(1).
It would always be O(changes), as the Bloom filter needs to grow in size 
as the number of entries grows.

Now let me ponder a bit further afield. A bloom filter lets us answer
the question "did this commit (probably) touch these paths?". But it
does not let us answer "which paths did this commit touch?".

That second one is less useful than you might think, because we almost
always care about not just the names of the paths, but their actual
object ids. Think about a --raw diff, or even traversing for
reachability (where if we knew the tree-diff cheaply, we could avoid
asking "have we seen this yet?" on most of the tree entries). The names
alone can make that a bit faster, but in the worst case you still have
to walk the whole tree to find their entries.

But there's also a related question: how do we match pathspec patterns?
For a changed path like "foo/bar/baz", I imagine a bloom filter would
mark all of "foo", "foo/bar", and "foo/bar/baz". But what about "*.c"? I
don't think a

Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-08 Thread Derrick Stolee

On 10/8/2018 2:10 PM, SZEDER Gábor wrote:

On Mon, Oct 08, 2018 at 12:57:34PM -0400, Derrick Stolee wrote:

Nice! These numbers make sense to me, in terms of how many TREESAME queries
we actually need to perform for such a query.

Yeah...  because you didn't notice that I deliberately cheated :)

As it turned out, it's not just about the number of diff queries that
we can spare, but, for the speedup _ratio_, it's more about how
expensive those diff queries are.

git.git has a rather flat hierarchy, and 't/' is the 372th entry in
the current root tree object, while 'valgrind/' is the 923th entry,
and the diff machinery spends considerable time wading through the
previous entries.  Notice the "carefully chosen path" remark in my
previous email; I think this particular path has the highest number of
preceeding tree entries, and, in addition, 't/' changes rather
frequently, so the diff machinery often has to scan two relatively big
tree objects.  Had I chosen 'Documentation/RelNotes/1.5.0.1.txt'
instead, i.e. another path two directories deep, but whose leading
path components are both near the beginning of the tree objects, the
speedup would be much less impressive: 0.282s vs. 0.049s, i.e. "only"
~5.7x instead of ~24.8x.


This is expected. The performance ratio is better when the path is any 
of the following:


1. A very deep path (need to walk multiple trees to answer TREESAME)

2. An entry is late in a very wide tree (need to spend extra time 
parsing tree object)


3. The path doesn't change very often (need to inspect many TREESAME 
pairs before finding enough interesting commits)


4. Some sub-path changes often (so the TREESAME comparison needs to 
parse beyond that sub-path often)


Our standard examples (Git and Linux repos) don't have many paths that 
have these properties. But: they do exist. In other projects, this is 
actually typical. Think about Java projects that frequently have ~5 
levels of folders before actually touching a code file.


When I was implementing the Bloom filter feature for Azure Repos, I ran 
performance tests on the Linux repo using a random sampling of paths. 
The typical speedup was 5x while some outliers were in the 25x range.





But I'm afraid it will take a while until I get around to turn it into
something presentable...

Do you have the code pushed somewhere public where one could take a look? I
Do you have the code pushed somewhere public where one could take a
look? I could provide some early feedback.

Nah, definitely not...  I know full well how embarassingly broken this
implementation is, I don't need others to tell me that ;)

There are two questions that I was hoping to answer by looking at your code:

1. How do you store your Bloom filter? Is it connected to the 
commit-graph and split on a commit-by-commit basis (storing "$path" as a 
key), or is it one huge Bloom filter (storing "$commitid:$path" as key)?


2. Where does your Bloom filter check plug into the TREESAME logic? I 
haven't investigated this part, but hopefully it isn't too complicated.


Thanks,
-Stolee


Re: [PATCH][Outreachy] remove all the inclusions of git-compat-util.h in header files

2018-10-08 Thread Derrick Stolee

On 10/8/2018 1:05 PM, Ananya Krishna Maram wrote:

Hi All,

Hello, Ananya! Welcome.


I was searching through #leftovers and found this.
https://public-inbox.org/git/cabpp-bgvvxcbzx44er6to-pusfen_6gnyj1u5cuon9deaa4...@mail.gmail.com/

This patch address the task discussed in the above link.
The discussion above seems to not be intended for your commit message, 
but it does show up when I run `git am` and provide your email as input. 
The typical way to avoid this is to place all commentary below the "---" 
that signifies the commit message is over.

From: Ananya Krishan Maram 

skip the #include of git-compat-util.h since all .c files include it.

Signed-off-by: Ananya Krishna Maram 
---
  advice.h | 1 -
  commit-graph.h   | 1 -
  hash.h   | 1 -
  pkt-line.h   | 1 -
  t/helper/test-tool.h | 1 -
  5 files changed, 5 deletions(-)

diff --git a/advice.h b/advice.h
index ab24df0fd..09148baa6 100644
--- a/advice.h
+++ b/advice.h
@@ -1,7 +1,6 @@
  #ifndef ADVICE_H
  #define ADVICE_H
  
-#include "git-compat-util.h"
  
  extern int advice_push_update_rejected;

  extern int advice_push_non_ff_current;
diff --git a/commit-graph.h b/commit-graph.h
index b05047676..0e93c2bed 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -1,7 +1,6 @@
  #ifndef COMMIT_GRAPH_H
  #define COMMIT_GRAPH_H
  
-#include "git-compat-util.h"

  #include "repository.h"
  #include "string-list.h"
  #include "cache.h"
diff --git a/hash.h b/hash.h
index 7c8238bc2..9a4334c5d 100644
--- a/hash.h
+++ b/hash.h
@@ -1,7 +1,6 @@
  #ifndef HASH_H
  #define HASH_H
  
-#include "git-compat-util.h"
  
  #if defined(SHA1_PPC)

  #include "ppc/sha1.h"
diff --git a/pkt-line.h b/pkt-line.h
index 5b28d4347..fdd316494 100644
--- a/pkt-line.h
+++ b/pkt-line.h
@@ -1,7 +1,6 @@
  #ifndef PKTLINE_H
  #define PKTLINE_H
  
-#include "git-compat-util.h"

  #include "strbuf.h"
  
  /*

diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index e07495727..24e0a1589 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -1,7 +1,6 @@
  #ifndef __TEST_TOOL_H__
  #define __TEST_TOOL_H__
  
-#include "git-compat-util.h"
  
  int cmd__chmtime(int argc, const char **argv);

  int cmd__config(int argc, const char **argv);
I applied these changes locally and confirmed the code compiles, so all 
.c files including these _do_ include git-compat-util.h properly.


Thanks,
-Stolee



Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-08 Thread Derrick Stolee

On 10/8/2018 12:41 PM, SZEDER Gábor wrote:

On Wed, Oct 03, 2018 at 03:18:05PM -0400, Jeff King wrote:

I'm still excited about the prospect of a bloom filter for paths which
each commit touches. I think that's the next big frontier in getting
things like "git log -- path" to a reasonable run-time.

There is certainly potential there.  With a (very) rough PoC
experiment, a 8MB bloom filter, and a carefully choosen path I can
achieve a nice, almost 25x speedup:

   $ time git rev-list --count HEAD -- t/valgrind/valgrind.sh
   6

   real0m1.563s
   user0m1.519s
   sys 0m0.045s

   $ time GIT_USE_POC_BLOOM_FILTER=y ~/src/git/git rev-list --count HEAD -- 
t/valgrind/valgrind.sh
   6

   real0m0.063s
   user0m0.043s
   sys 0m0.020s

   bloom filter total queries: 16269 definitely not: 16195 maybe: 74 false 
positives: 64 fp ratio: 0.003934
Nice! These numbers make sense to me, in terms of how many TREESAME 
queries we actually need to perform for such a query.

But I'm afraid it will take a while until I get around to turn it into
something presentable...
Do you have the code pushed somewhere public where one could take a 
look? I could provide some early feedback.


Thanks,
-Stolee


[PATCH 1/3] midx: fix broken free() in close_midx()

2018-10-08 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When closing a multi-pack-index, we intend to close each pack-file
and free the struct packed_git that represents it. However, this
line was previously freeing the array of pointers, not the
pointer itself. This leads to a double-free issue.

Signed-off-by: Derrick Stolee 
---
 midx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index f3e8dbc108..999717b96f 100644
--- a/midx.c
+++ b/midx.c
@@ -190,7 +190,7 @@ static void close_midx(struct multi_pack_index *m)
for (i = 0; i < m->num_packs; i++) {
if (m->packs[i]) {
close_pack(m->packs[i]);
-   free(m->packs);
+   free(m->packs[i]);
}
}
FREE_AND_NULL(m->packs);
-- 
gitgitgadget



[PATCH 0/3] Add GIT_TEST_MULTI_PACK_INDEX environment variable

2018-10-08 Thread Derrick Stolee via GitGitGadget
To increase coverage of the multi-pack-index feature, add a
GIT_TEST_MULTI_PACK_INDEX environment variable similar to other GIT_TEST_*
variables.

After creating the environment variable and running the test suite with it
enabled, I found a few bugs in the multi-pack-index implementation. These
are handled by the first two patches.

I have set up a CI build on Azure Pipelines [1] that runs the test suite
with a few optional features enabled, including GIT_TEST_MULTI_PACK_INDEX
and GIT_TEST_COMMIT_GRAPH. I'll use this to watch the features and ensure
they work well with the rest of the ongoing work. Eventually, we can add
these variables to the Travis CI scripts.

[1] https://git.visualstudio.com/git/_build?definitionId=4

Derrick Stolee (3):
  midx: fix broken free() in close_midx()
  midx: close multi-pack-index on repack
  multi-pack-index: define GIT_TEST_MULTI_PACK_INDEX

 builtin/repack.c|  8 
 midx.c  | 17 +
 midx.h  |  4 
 t/README|  4 
 t/t5310-pack-bitmaps.sh |  1 +
 t/t5319-multi-pack-index.sh |  2 +-
 t/t9300-fast-import.sh  |  2 +-
 7 files changed, 32 insertions(+), 6 deletions(-)


base-commit: f84b9b09d40408cf91bbc500d9f190a7866c3e0f
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-27%2Fderrickstolee%2Fmidx-test%2Fupstream-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-27/derrickstolee/midx-test/upstream-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/27
-- 
gitgitgadget


[PATCH 2/3] midx: close multi-pack-index on repack

2018-10-08 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When repacking, we may remove pack-files. This invalidates the
multi-pack-index (if it exists). Previously, we removed the
multi-pack-index file before removing any pack-file. In some cases,
the repack command may load the multi-pack-index into memory. This
may lead to later in-memory references to the non-existent pack-
files.

Signed-off-by: Derrick Stolee 
---
 builtin/repack.c | 4 
 midx.c   | 6 +-
 midx.h   | 2 ++
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index c6a7943d5c..7925bb976e 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -432,6 +432,10 @@ int cmd_repack(int argc, const char **argv, const char 
*prefix)
 
if (!midx_cleared) {
/* if we move a packfile, it will invalidated 
the midx */
+   if (the_repository->objects) {
+   
close_midx(the_repository->objects->multi_pack_index);
+   
the_repository->objects->multi_pack_index = NULL;
+   }
clear_midx_file(get_object_directory());
midx_cleared = 1;
}
diff --git a/midx.c b/midx.c
index 999717b96f..fe8532a9d1 100644
--- a/midx.c
+++ b/midx.c
@@ -180,9 +180,13 @@ cleanup_fail:
return NULL;
 }
 
-static void close_midx(struct multi_pack_index *m)
+void close_midx(struct multi_pack_index *m)
 {
uint32_t i;
+
+   if (!m)
+   return;
+
munmap((unsigned char *)m->data, m->data_len);
close(m->fd);
m->fd = -1;
diff --git a/midx.h b/midx.h
index a210f1af2a..af6b5cb58f 100644
--- a/midx.h
+++ b/midx.h
@@ -44,4 +44,6 @@ int prepare_multi_pack_index_one(struct repository *r, const 
char *object_dir, i
 int write_midx_file(const char *object_dir);
 void clear_midx_file(const char *object_dir);
 
+void close_midx(struct multi_pack_index *m);
+
 #endif
-- 
gitgitgadget



[PATCH 3/3] multi-pack-index: define GIT_TEST_MULTI_PACK_INDEX

2018-10-08 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The multi-pack-index feature is tested in isolation by
t5319-multi-pack-index.sh, but there are many more interesting
scenarios in the test suite surrounding pack-file data shapes
and interactions. Since the multi-pack-index is an optional
data structure, it does not make sense to include it by default
in those tests.

Instead, add a new GIT_TEST_MULTI_PACK_INDEX environment variable
that enables core.multiPackIndex and writes a multi-pack-index
after each 'git repack' command. This adds extra test coverage
when needed.

There are a few spots in the test suite that need to react to this
change:

* t5319-multi-pack-index.sh: there is a test that checks that
  'git repack' deletes the multi-pack-index. Disable the environment
  variable to ensure this still happens.

* t5310-pack-bitmaps.sh: One test moves a pack-file from the object
  directory to an alternate. This breaks the multi-pack-index, so
  delete the multi-pack-index at this point, if it exists.

* t9300-fast-import.sh: One test verifies the number of files in
  the .git/objects/pack directory is exactly 8. Exclude the
  multi-pack-index from this count so it is still 8 in all cases.

Signed-off-by: Derrick Stolee 
---
 builtin/repack.c| 4 
 midx.c  | 9 +++--
 midx.h  | 2 ++
 t/README| 4 
 t/t5310-pack-bitmaps.sh | 1 +
 t/t5319-multi-pack-index.sh | 2 +-
 t/t9300-fast-import.sh  | 2 +-
 7 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 7925bb976e..418442bfe2 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -558,6 +558,10 @@ int cmd_repack(int argc, const char **argv, const char 
*prefix)
if (!no_update_server_info)
update_server_info(0);
remove_temporary_files();
+
+   if (git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0))
+   write_midx_file(get_object_directory());
+
string_list_clear(&names, 0);
string_list_clear(&rollback, 0);
string_list_clear(&existing_packs, 0);
diff --git a/midx.c b/midx.c
index fe8532a9d1..aeafb58fa3 100644
--- a/midx.c
+++ b/midx.c
@@ -338,9 +338,14 @@ int prepare_multi_pack_index_one(struct repository *r, 
const char *object_dir, i
struct multi_pack_index *m;
struct multi_pack_index *m_search;
int config_value;
+   static int env_value = -1;
 
-   if (repo_config_get_bool(r, "core.multipackindex", &config_value) ||
-   !config_value)
+   if (env_value < 0)
+   env_value = git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0);
+
+   if (!env_value &&
+   (repo_config_get_bool(r, "core.multipackindex", &config_value) ||
+   !config_value))
return 0;
 
for (m_search = r->objects->multi_pack_index; m_search; m_search = 
m_search->next)
diff --git a/midx.h b/midx.h
index af6b5cb58f..bec8f73d28 100644
--- a/midx.h
+++ b/midx.h
@@ -3,6 +3,8 @@
 
 #include "repository.h"
 
+#define GIT_TEST_MULTI_PACK_INDEX "GIT_TEST_MULTI_PACK_INDEX"
+
 struct multi_pack_index {
struct multi_pack_index *next;
 
diff --git a/t/README b/t/README
index 3ea6c85460..9d0277c338 100644
--- a/t/README
+++ b/t/README
@@ -327,6 +327,10 @@ GIT_TEST_COMMIT_GRAPH=, when true, forces the 
commit-graph to
 be written after every 'git commit' command, and overrides the
 'core.commitGraph' setting to true.
 
+GIT_TEST_MULTI_PACK_INDEX=, when true, forces the multi-pack-
+index to be written after every 'git repack' command, and overrides the
+'core.multiPackIndex' setting to true.
+
 Naming Tests
 
 
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 1be3459c5b..82d7f7f6a5 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -191,6 +191,7 @@ test_expect_success 'pack-objects respects 
--honor-pack-keep (local bitmapped pa
 
 test_expect_success 'pack-objects respects --local (non-local bitmapped pack)' 
'
mv .git/objects/pack/$packbitmap.* alt.git/objects/pack/ &&
+   rm -f .git/objects/pack/multi-pack-index &&
test_when_finished "mv alt.git/objects/pack/$packbitmap.* 
.git/objects/pack/" &&
echo HEAD | git pack-objects --local --stdout --revs >3b.pack &&
git index-pack 3b.pack &&
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 6f56b38674..4024ff9a39 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -152,7 +152,7 @@ compare_results_with_midx "twelve packs"
 
 test_expect_success 'repack removes multi-pack-index' '
test_path_is_file $objdir/pack/multi-pack-index &&
-   git repack -adf &&
+   GIT_TEST_MULTI_PACK_INDEX=0 gi

Re: [PATCH 1/1] commit-graph: define GIT_TEST_COMMIT_GRAPH

2018-10-08 Thread Derrick Stolee

On 10/8/2018 10:58 AM, Ævar Arnfjörð Bjarmason wrote:

On Mon, Oct 08 2018, Derrick Stolee wrote:


On 10/8/2018 9:43 AM, Ævar Arnfjörð Bjarmason wrote:

On Tue, Aug 28 2018, Derrick Stolee via GitGitGadget wrote:


From: Derrick Stolee 

The commit-graph feature is tested in isolation by
t5318-commit-graph.sh and t6600-test-reach.sh, but there are many
more interesting scenarios involving commit walks. Many of these
scenarios are covered by the existing test suite, but we need to
maintain coverage when the optional commit-graph structure is not
present.

To allow running the full test suite with the commit-graph present,
add a new test environment variable, GIT_TEST_COMMIT_GRAPH. Similar
to GIT_TEST_SPLIT_INDEX, this variable makes every Git command try
to load the commit-graph when parsing commits, and writes the
commit-graph file after every 'git commit' command.

There are a few tests that rely on commits not existing in
pack-files to trigger important events, so manually set
GIT_TEST_COMMIT_GRAPH to false for the necessary commands.

There is one test in t6024-recursive-merge.sh that relies on the
merge-base algorithm picking one of two ambiguous merge-bases, and
the commit-graph feature changes which merge-base is picked.


The test feature itself seems fine, but this consistently fails ever
since it got introduced (a reset --hard on the commit merged to msater
in git.git):

  GIT_TEST_COMMIT_GRAPH=true prove -j$(parallel --number-of-cores) 
t5500-fetch-pack.sh t6001-rev-list-graft.sh t6050-replace.sh
  Test Summary Report
  ---
  t6001-rev-list-graft.sh (Wstat: 256 Tests: 14 Failed: 6)
Failed tests:  3, 5, 7, 9, 11, 13
Non-zero exit status: 1
  t6050-replace.sh   (Wstat: 256 Tests: 35 Failed: 9)
Failed tests:  12-16, 24-25, 30, 35
Non-zero exit status: 1
  t5500-fetch-pack.sh(Wstat: 256 Tests: 357 Failed: 1)
Failed test:  351
Non-zero exit status: 1

This is on Linux/Debian 4.17.0-1-amd64. Can you reproduce this? If not I
can provide more info (-x output etc..).

I see these failures, too, but I believe they are due to
ds/commit-graph-with-grafts not being merged to 'next' yet. The
purpose of that branch is to fix these test breaks. The environment
variable got merged a lot faster.

I just built & tested the 'jch' branch at 515d82d9 with
GIT_TEST_COMMIT_GRAPH=1 and they all passed.

I should have tested "pu" first. These failures are indeed fixed
there. Thanks, and sorry about the noise.
Thanks for testing with the optional features! It's good to keep them 
exercised.


[PATCH v4 1/1] contrib: add coverage-diff script

2018-10-08 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

We have coverage targets in our Makefile for using gcov to display line
coverage based on our test suite. The way I like to do it is to run:

make coverage-test
make coverage-report

This leaves the repo in a state where every X.c file that was covered has
an X.c.gcov file containing the coverage counts for every line, and "#"
at every uncovered line.

There have been a few bugs in recent patches what would have been caught
if the test suite covered those blocks (including a few of mine). I want
to work towards a "sensible" amount of coverage on new topics. In my opinion,
this means that any logic should be covered, but the 'die()' blocks covering
very unlikely (or near-impossible) situations may not warrant coverage.

It is important to not measure the coverage of the codebase by what old code
is not covered. To help, I created the 'contrib/coverage-diff.sh' script.
After creating the coverage statistics at a version (say, 'topic') you can
then run

contrib/coverage-diff.sh base topic

to see the lines added between 'base' and 'topic' that are not covered by the
test suite. The output uses 'git blame -s' format so you can find the commits
responsible and view the line numbers for quick access to the context, but
trims leading tabs in the file contents to reduce output width.

Signed-off-by: Derrick Stolee 
---
 contrib/coverage-diff.sh | 108 +++
 1 file changed, 108 insertions(+)
 create mode 100755 contrib/coverage-diff.sh

diff --git a/contrib/coverage-diff.sh b/contrib/coverage-diff.sh
new file mode 100755
index 00..4ec419f900
--- /dev/null
+++ b/contrib/coverage-diff.sh
@@ -0,0 +1,108 @@
+#!/bin/sh
+
+# Usage: Run 'contrib/coverage-diff.sh  ' from source-root
+# after running
+#
+# make coverage-test
+# make coverage-report
+#
+# while checked out at . This script combines the *.gcov files
+# generated by the 'make' commands above with 'git diff  '
+# to report new lines that are not covered by the test suite.
+
+V1=$1
+V2=$2
+
+diff_lines () {
+   perl -e '
+   my $line_num;
+   while (<>) {
+   # Hunk header?  Grab the beginning in postimage.
+   if (/^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@/) {
+   $line_num = $1;
+   next;
+   }
+
+   # Have we seen a hunk?  Ignore "diff --git" etc.
+   next unless defined $line_num;
+
+   # Deleted line? Ignore.
+   if (/^-/) {
+   next;
+   }
+
+   # Show only the line number of added lines.
+   if (/^\+/) {
+   print "$line_num\n";
+   }
+   # Either common context or added line appear in
+   # the postimage.  Count it.
+   $line_num++;
+   }
+   '
+}
+
+files=$(git diff --name-only "$V1" "$V2" -- \*.c)
+
+# create empty file
+>coverage-data.txt
+
+for file in $files
+do
+   git diff "$V1" "$V2" -- "$file" |
+   diff_lines |
+   sort >new_lines.txt
+
+   if ! test -s new_lines.txt
+   then
+   continue
+   fi
+
+   hash_file=$(echo $file | sed "s/\//\#/")
+
+   if ! test -s "$hash_file.gcov"
+   then
+   continue
+   fi
+
+   sed -ne '/#:/{
+   s/#://
+   s/:.*//
+   s/ //g
+   p
+   }' "$hash_file.gcov" |
+   sort >uncovered_lines.txt
+
+   comm -12 uncovered_lines.txt new_lines.txt |
+   sed -e 's/$/\)/' |
+   sed -e 's/^/ /' >uncovered_new_lines.txt
+
+   grep -q '[^[:space:]]' >coverage-data.txt &&
+   git blame -s "$V2" -- "$file" |
+   sed 's/\t//g' |
+   grep -f uncovered_new_lines.txt >>coverage-data.txt &&
+   echo >>coverage-data.txt
+
+   rm -f new_lines.txt uncovered_lines.txt uncovered_new_lines.txt
+done
+
+cat coverage-data.txt
+
+echo "Commits introducing uncovered code:"
+
+commit_list=$(cat coverage-data.txt |
+   grep -E '^[0-9a-f]{7,} ' |
+   awk '{print $1;}' |
+   sort |
+   uniq)
+
+(
+   for commit in $commit_list
+   do
+   git log --no-decorate --pretty=format:'%an  %h: %s' -1 
$commit
+   echo
+   done
+) | sort
+
+rm coverage-data.txt
-- 
gitgitgadget


[PATCH v4 0/1] contrib: Add script to show uncovered "new" lines

2018-10-08 Thread Derrick Stolee via GitGitGadget
We have coverage targets in our Makefile for using gcov to display line
coverage based on our test suite. The way I like to do it is to run:

make coverage-test
make coverage-report

This leaves the repo in a state where every X.c file that was covered has an
X.c.gcov file containing the coverage counts for every line, and "#" at
every uncovered line.

There have been a few bugs in recent patches what would have been caught if
the test suite covered those blocks (including a few of mine). I want to
work towards a "sensible" amount of coverage on new topics. In my opinion,
this means that any logic should be covered, but the 'die()' blocks in error
cases do not need to be covered.

It is important to not measure the coverage of the codebase by what old code
is not covered. To help, I created the 'contrib/coverage-diff.sh' script.
After creating the coverage statistics at a version (say, 'topic') you can
then run

contrib/coverage-diff.sh base topic

to see the lines added between 'base' and 'topic' that are not covered by
the test suite. For example, I ran this against the 'next' branch (e82ca0)
versus 'master' (f84b9b) and got the following output:

builtin/commit.c
76f2f5c1e3 builtin/commit.c 1657) 
write_commit_graph_reachable(get_object_directory(), 0, 0);

builtin/fsck.c
66ec0390e7 builtin/fsck.c 862) midx_argv[2] = "--object-dir";
66ec0390e7 builtin/fsck.c 863) midx_argv[3] = alt->path;
66ec0390e7 builtin/fsck.c 864) if (run_command(&midx_verify))
66ec0390e7 builtin/fsck.c 865) errors_found |= ERROR_COMMIT_GRAPH;

fsck.c
fb8952077d  214) die_errno("Could not read '%s'", path);

midx.c
56ee7ff156  949) return 0;
cc6af73c02  990) midx_report(_("failed to load pack-index for packfile %s"),
cc6af73c02  991) e.p->pack_name);
cc6af73c02  992) break;

Commits introducing uncovered code:
Derrick Stolee  56ee7ff15: multi-pack-index: add 'verify' verb
Derrick Stolee  66ec0390e: fsck: verify multi-pack-index
Derrick Stolee  cc6af73c0: multi-pack-index: verify object offsets
Junio C Hamano  76f2f5c1e: Merge branch 'ab/commit-graph-progress' into next
René Scharfe  fb8952077: fsck: use strbuf_getline() to read skiplist file

Thanks, -Stolee

CHANGES IN V3: I took Junio's perl script verbatim, which speeds up the
performance greatly. Some of the other sed commands needed some massaging,
but also added extra cleanup. Thanks for the help!

CHANGES IN V4: I reduced the blame output using -s which decreases the
width. I include a summary of the commit authors at the end to help people
see the lines they wrote. This version is also copied into a build
definition in the public Git project on Azure Pipelines [1]. I'll use this
build definition to generate the coverage report after each "What's Cooking"
email.

[1] https://git.visualstudio.com/git/_build?definitionId=5

Derrick Stolee (1):
  contrib: add coverage-diff script

 contrib/coverage-diff.sh | 108 +++
 1 file changed, 108 insertions(+)
 create mode 100755 contrib/coverage-diff.sh


base-commit: 1d4361b0f344188ab5eec6dcea01f61a3a3a1670
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-40%2Fderrickstolee%2Fcoverage-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-40/derrickstolee/coverage-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/40

Range-diff vs v3:

 1:  21214cc321 ! 1:  6daf310a43 contrib: add coverage-diff script
 @@ -26,10 +26,10 @@
  contrib/coverage-diff.sh base topic
  
  to see the lines added between 'base' and 'topic' that are not 
covered by the
 -test suite. The output uses 'git blame -c' format so you can find the 
commits
 -responsible and view the line numbers for quick access to the context.
 +test suite. The output uses 'git blame -s' format so you can find the 
commits
 +responsible and view the line numbers for quick access to the 
context, but
 +trims leading tabs in the file contents to reduce output width.
  
 -Helped-by: Junio C Hamano 
  Signed-off-by: Derrick Stolee 
  
  diff --git a/contrib/coverage-diff.sh b/contrib/coverage-diff.sh
 @@ -81,13 +81,16 @@
  + '
  +}
  +
 -+files=$(git diff --name-only $V1 $V2 -- *.c)
 ++files=$(git diff --name-only "$V1" "$V2" -- \*.c)
 ++
 ++# create empty file
 ++>coverage-data.txt
  +
  +for file in $files
  +do
 -+ git diff $V1 $V2 -- $file \
 -+ | diff_lines \
 -+ | sort >new_lines.txt
 ++ git diff "$V1" "$V2" -- "$file" |
 ++ diff_lines |
 ++ sort >new_lines.txt
  +
  + if ! test -s new_lines.txt
  + then

Re: [PATCH 1/1] commit-graph: define GIT_TEST_COMMIT_GRAPH

2018-10-08 Thread Derrick Stolee

On 10/8/2018 9:43 AM, Ævar Arnfjörð Bjarmason wrote:

On Tue, Aug 28 2018, Derrick Stolee via GitGitGadget wrote:


From: Derrick Stolee 

The commit-graph feature is tested in isolation by
t5318-commit-graph.sh and t6600-test-reach.sh, but there are many
more interesting scenarios involving commit walks. Many of these
scenarios are covered by the existing test suite, but we need to
maintain coverage when the optional commit-graph structure is not
present.

To allow running the full test suite with the commit-graph present,
add a new test environment variable, GIT_TEST_COMMIT_GRAPH. Similar
to GIT_TEST_SPLIT_INDEX, this variable makes every Git command try
to load the commit-graph when parsing commits, and writes the
commit-graph file after every 'git commit' command.

There are a few tests that rely on commits not existing in
pack-files to trigger important events, so manually set
GIT_TEST_COMMIT_GRAPH to false for the necessary commands.

There is one test in t6024-recursive-merge.sh that relies on the
merge-base algorithm picking one of two ambiguous merge-bases, and
the commit-graph feature changes which merge-base is picked.


The test feature itself seems fine, but this consistently fails ever
since it got introduced (a reset --hard on the commit merged to msater
in git.git):

 GIT_TEST_COMMIT_GRAPH=true prove -j$(parallel --number-of-cores) 
t5500-fetch-pack.sh t6001-rev-list-graft.sh t6050-replace.sh
 Test Summary Report
 ---
 t6001-rev-list-graft.sh (Wstat: 256 Tests: 14 Failed: 6)
   Failed tests:  3, 5, 7, 9, 11, 13
   Non-zero exit status: 1
 t6050-replace.sh   (Wstat: 256 Tests: 35 Failed: 9)
   Failed tests:  12-16, 24-25, 30, 35
   Non-zero exit status: 1
 t5500-fetch-pack.sh(Wstat: 256 Tests: 357 Failed: 1)
   Failed test:  351
   Non-zero exit status: 1

This is on Linux/Debian 4.17.0-1-amd64. Can you reproduce this? If not I
can provide more info (-x output etc..).
I see these failures, too, but I believe they are due to 
ds/commit-graph-with-grafts not being merged to 'next' yet. The purpose 
of that branch is to fix these test breaks. The environment variable got 
merged a lot faster.


I just built & tested the 'jch' branch at 515d82d9 with 
GIT_TEST_COMMIT_GRAPH=1 and they all passed.


Thanks,
-Stolee


Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-05 Thread Derrick Stolee

On 10/5/2018 3:47 PM, Jeff King wrote:

On Fri, Oct 05, 2018 at 03:41:40PM -0400, Derrick Stolee wrote:


So can we really just take (total_objects - commit_graph_objects) and
compare it to some threshold?

The commit-graph only stores the number of _commits_, not total objects.

Oh, right, of course. That does throw a monkey wrench in that line of
thought. ;)

There's unfortunately not a fast way of doing that. One option would be
to keep a counter of "ungraphed commit objects", and have callers update
it. Anybody admitting a pack via index-pack or unpack-objects can easily
get this information. Commands like fast-import can do likewise, and
"git commit" obviously increments it by one.

I'm not excited about adding a new global on-disk data structure (and
the accompanying lock).


If we want, then we can add an optional chunk to the commit-graph file 
that stores the object count.




Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-05 Thread Derrick Stolee

On 10/5/2018 3:21 PM, Jeff King wrote:

On Fri, Oct 05, 2018 at 09:45:47AM -0400, Derrick Stolee wrote:


My misunderstanding was that your proposed change to gc computes the
commit-graph in either of these two cases:

(1) The auto-GC threshold is met.

(2) There is no commit-graph file.

And what I hope to have instead of (2) is (3):

(3) The commit-graph file is "sufficiently behind" the tip refs.

This condition is intentionally vague at the moment. It could be that we
hint that (3) holds by saying "--post-fetch" (i.e. "We just downloaded a
pack, and it probably contains a lot of new commits") or we could create
some more complicated condition based on counting reachable commits with
infinite generation number (the number of commits not in the commit-graph
file).

I like that you are moving forward to make the commit-graph be written more
frequently, but I'm trying to push us in a direction of writing it even more
often than your proposed strategy. We should avoid creating too many
orthogonal conditions that trigger the commit-graph write, which is why I'm
pushing on your design here.

Anyone else have thoughts on this direction?

Yes, I think measuring "sufficiently behind" is the right thing.
Everything else is a proxy or heuristic, and will run into corner cases.
E.g., I have some small number of objects and then do a huge fetch, and
now my commit-graph only covers 5% of what's available.

We know how many objects are in the graph already. And it's not too
expensive to get the number of objects in the repository. We can do the
same sampling for loose objects that "gc --auto" does, and counting
packed objects just involves opening up the .idx files (that can be slow
if you have a ton of packs, but you'd want to either repack or use a
.midx in that case anyway, either of which would help here).

So can we really just take (total_objects - commit_graph_objects) and
compare it to some threshold?


The commit-graph only stores the number of _commits_, not total objects.

Azure Repos' commit-graph does store the total number of objects, and 
that is how we trigger updating the graph, so it is not unreasonable to 
use that as a heuristic.


Thanks,

-Stolee



Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-05 Thread Derrick Stolee

On 10/5/2018 9:05 AM, Ævar Arnfjörð Bjarmason wrote:

On Fri, Oct 05 2018, Derrick Stolee wrote:


On 10/4/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:

I don't have time to polish this up for submission now, but here's a WIP
patch that implements this, highlights:

   * There's a gc.clone.autoDetach=false default setting which overrides
 gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a
 --cloning option to indicate this).

I'll repeat that it could make sense to do the same thing on clone
_and_ fetch. Perhaps a "--post-fetch" flag would be good here to
communicate that we just downloaded a pack from a remote.

I don't think that makes sense, but let's talk about why, because maybe
I've missed something, you're certainly more familiar with the
commit-graph than I am.

The reason to do it on clone as a special-case or when the file is
missing, is because we know the file is desired (via the GC config), and
presumably is expected to help performance, and we have 0% of it. So by
going from 0% to 100% on clone we'll get fast --contains and other
goodies the graph helps with.

But when we're doing a fetch, or really anything else that runs "git gc
--auto" we can safely assume that we have a recent enough graph, because
it will have been run whenever auto-gc kicked in.

I.e.:

 # Slow, if we assume background forked commit-graph generation
 # (which I'm avoiding)
 git clone x && cd x && git tag --contains
 # Fast enough, since we have an existing commit-graph
 cd x && git fetch && git tag --contains

I *do* think it might make sense to in general split off parts of "gc
--auto" that we'd like to be more aggressive about, simply because the
ratio of how long it takes to do, and how much it helps with performance
makes more sense than a full repack, which is what the current heuristic
is based on.

And maybe when we run in that mode we should run in the foreground, but
I don't see why git-fetch should be a special case there, and in this
regard, the gc.clone.autoDetach=false setting I've made doesn't make
much sence. I.e. maybe we should also skip forking to the background in
such a mode when we trigger such a "mini gc" via git-commit or whatever.


My misunderstanding was that your proposed change to gc computes the 
commit-graph in either of these two cases:


(1) The auto-GC threshold is met.

(2) There is no commit-graph file.

And what I hope to have instead of (2) is (3):

(3) The commit-graph file is "sufficiently behind" the tip refs.

This condition is intentionally vague at the moment. It could be that we 
hint that (3) holds by saying "--post-fetch" (i.e. "We just downloaded a 
pack, and it probably contains a lot of new commits") or we could create 
some more complicated condition based on counting reachable commits with 
infinite generation number (the number of commits not in the 
commit-graph file).


I like that you are moving forward to make the commit-graph be written 
more frequently, but I'm trying to push us in a direction of writing it 
even more often than your proposed strategy. We should avoid creating 
too many orthogonal conditions that trigger the commit-graph write, 
which is why I'm pushing on your design here.


Anyone else have thoughts on this direction?

Thanks,

-Stolee



Re: [PATCH 15/16] commit-reach: make can_all_from_reach... linear

2018-10-05 Thread Derrick Stolee

On 10/4/2018 6:59 PM, René Scharfe wrote:

Am 01.10.2018 um 22:37 schrieb René Scharfe:

Am 01.10.2018 um 21:26 schrieb Derrick Stolee:

Good catch! I'm disappointed that we couldn't use type-checking here, as
it is quite difficult to discover that the types are wrong here.

Generics in C are hard, and type checking traditionally falls by the
wayside.  You could use macros for that, like klib [*] does, but that
has its own downsides (more object text, debugging the sort macros
themselves is harder).

[*] https://github.com/attractivechaos/klib/blob/master/ksort.h

We could also do something like this to reduce the amount of manual
casting, but do we want to?  (Macro at the bottom, three semi-random
examples at the top.)


I like the idea! It certainly can assist in some of the repeat work when 
preparing to QSORT, and make it less error-prone.



---
  bisect.c  | 11 +++
  commit-graph.c|  9 ++---
  commit-reach.c| 12 +---
  git-compat-util.h | 12 
  4 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/bisect.c b/bisect.c
index e8b17cf7e1..06be3a3c15 100644
--- a/bisect.c
+++ b/bisect.c
@@ -192,16 +192,11 @@ struct commit_dist {
int distance;
  };
  
-static int compare_commit_dist(const void *a_, const void *b_)

-{
-   struct commit_dist *a, *b;
-
-   a = (struct commit_dist *)a_;
-   b = (struct commit_dist *)b_;
+DEFINE_SORT(sort_by_commit_dist, struct commit_dist, a, b, {
if (a->distance != b->distance)
return b->distance - a->distance; /* desc sort */
return oidcmp(&a->commit->object.oid, &b->commit->object.oid);
-}
+})
  
  static struct commit_list *best_bisection_sorted(struct commit_list *list, int nr)

  {
@@ -223,7 +218,7 @@ static struct commit_list *best_bisection_sorted(struct 
commit_list *list, int n
array[cnt].distance = distance;
cnt++;
}
-   QSORT(array, cnt, compare_commit_dist);
+   sort_by_commit_dist(array, cnt);
for (p = list, i = 0; i < cnt; i++) {
struct object *obj = &(array[i].commit->object);
  
diff --git a/commit-graph.c b/commit-graph.c

index 7f4519ec3b..a2202414e0 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -550,12 +550,7 @@ static void write_graph_chunk_large_edges(struct hashfile 
*f,
}
  }
  
-static int commit_compare(const void *_a, const void *_b)

-{
-   const struct object_id *a = (const struct object_id *)_a;
-   const struct object_id *b = (const struct object_id *)_b;
-   return oidcmp(a, b);
-}
+DEFINE_SORT(sort_oids, struct object_id, a, b, return oidcmp(a, b))
  
  struct packed_commit_list {

struct commit **list;
@@ -780,7 +775,7 @@ void write_commit_graph(const char *obj_dir,
  
  	close_reachable(&oids);
  
-	QSORT(oids.list, oids.nr, commit_compare);

+   sort_oids(oids.list, oids.nr);
  
  	count_distinct = 1;

for (i = 1; i < oids.nr; i++) {
diff --git a/commit-reach.c b/commit-reach.c
index 2f5e592d16..3aef47c3dd 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -527,17 +527,15 @@ int commit_contains(struct ref_filter *filter, struct 
commit *commit,
return is_descendant_of(commit, list);
  }
  
-static int compare_commits_by_gen(const void *_a, const void *_b)

-{
-   const struct commit *a = *(const struct commit * const *)_a;
-   const struct commit *b = *(const struct commit * const *)_b;
-
+DEFINE_SORT(sort_commits_by_gen, struct commit *, ap, bp, {
+   const struct commit *a = *ap;
+   const struct commit *b = *bp;
if (a->generation < b->generation)
return -1;
if (a->generation > b->generation)
return 1;
return 0;
-}
+})


Here, to make the macro version compile you need to cast ap and bp, 
which gives us a level of type-safety that wasn't there before. That can 
help us find errors at compile-time!


  
  int can_all_from_reach_with_flag(struct object_array *from,

 unsigned int with_flag,
@@ -580,7 +578,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
nr_commits++;
}
  
-	QSORT(list, nr_commits, compare_commits_by_gen);

+   sort_commits_by_gen(list, nr_commits);
  
  	for (i = 0; i < nr_commits; i++) {

/* DFS from list[i] */
diff --git a/git-compat-util.h b/git-compat-util.h
index 5f2e90932f..f9e78d69a2 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -1066,6 +1066,18 @@ static inline void sane_qsort(void *base, size_t nmemb, 
size_t size,
qsort(base, nmemb, size, compar);
  }
  
+#define DEFINE_SORT(name, elemtype, one, two, code)			\

+static int name##_compare(const void *one##_v_, const void *two##_v_)  \
+{  \
+   elemtype const *one

Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-05 Thread Derrick Stolee

On 10/4/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:

I don't have time to polish this up for submission now, but here's a WIP
patch that implements this, highlights:

  * There's a gc.clone.autoDetach=false default setting which overrides
gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a
--cloning option to indicate this).


I'll repeat that it could make sense to do the same thing on clone _and_ 
fetch. Perhaps a "--post-fetch" flag would be good here to communicate 
that we just downloaded a pack from a remote.



  * A clone of say git.git with gc.writeCommitGraph=true looks like:

[...]
Receiving objects: 100% (255262/255262), 100.49 MiB | 17.78 MiB/s, done.
Resolving deltas: 100% (188947/188947), done.
Computing commit graph generation numbers: 100% (55210/55210), done.


This looks like good UX. Thanks for the progress here!


  * The 'git gc --auto' command also knows to (only) run the commit-graph
(and space is left for future optimization steps) if general GC isn't
needed, but we need "optimization":

$ rm .git/objects/info/commit-graph; ~/g/git/git --exec-path=$PWD -c 
gc.writeCommitGraph=true -c gc.autoDetach=false gc --auto;
Annotating commits in commit graph: 341229, done.
Computing commit graph generation numbers: 100% (165969/165969), done.
$


Will this also trigger a full commit-graph rewrite on every 'git commit' 
command? Or is there some way we can compute the staleness of the 
commit-graph in order to only update if we get too far ahead? 
Previously, this was solved by relying on the auto-GC threshold.



  * The patch to gc.c looks less scary with -w, most of it is indenting
the existing pack-refs etc. with a "!auto_gc || should_gc" condition.

  * I added a commit_graph_exists() exists function and only care if I
get ENOENT for the purposes of this gc mode. This would need to be
tweaked for the incremental mode Derrick talks about, but if we just
set "should_optimize" that'll also work as far as gc --auto is
concerned (e.g. on fetch, am etc.)


The incremental mode would operate the same as split-index, which means 
we will still look for .git/objects/info/commit-graph. That file may 
point us to more files.



+int commit_graph_exists(const char *graph_file)
+{
+   struct stat st;
+   if (stat(graph_file, &st)) {
+   if (errno == ENOENT)
+   return 0;
+   else
+   return -1;
+   }
+   return 1;
+}
+


This method serves a very similar purpose to 
generation_numbers_enabled(), except your method only cares about the 
file existing. It ignores information like `core.commitGraph`, which 
should keep us from doing anything with the commit-graph file if false.


Nothing about your method is specific to the commit-graph file, since 
you provide a filename as a parameter. It could easily be "int 
file_exists(const char *filename)".


Thanks,

-Stolee




Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-03 Thread Derrick Stolee

On 10/3/2018 2:51 PM, Jeff King wrote:

On Wed, Oct 03, 2018 at 08:47:11PM +0200, Ævar Arnfjörð Bjarmason wrote:


On Wed, Oct 03 2018, Stefan Beller wrote:


So we wouldn't be spending 5 minutes repacking linux.git right after
cloning it, just ~10s generating the commit graph, and the same would
happen if you rm'd .git/objects/info/commit-graph and ran "git commit",
which would kick of "gc --auto" in the background and do the same thing.

Or generating local bitmaps or pack idx files as well?

I'm less familiar with this area, but when I clone I get a pack *.idx
file, why does it need to be regenerated?

But yeah, in principle this would be a sensible addition, but I'm not
aware of cases where clients get significant benefits from bitmaps (see
https://githubengineering.com/counting-objects/), and I don't turn it on
for clients, but maybe I've missed something.

They don't help yet, and there's no good reason to enable bitmaps for
clients. I have a few patches that use bitmaps for things like
ahead/behind and --contains checks, but the utility of those may be
lessened quite a bit by Stolee's commit-graph work.  And if it isn't,
I'm mildly in favor of replacing the existing .bitmap format with
something better integrated with commit-graphs (which would give us an
opportunity to clean up some of the rough edges).


If the commit-graph doesn't improve enough on those applications, then 
we could consider adding a commit-to-commit reachability bitmap inside 
the commit-graph. ;)


-Stolee


[PATCH v2 1/3] commit-graph: clean up leaked memory during write

2018-10-03 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The write_commit_graph() method in commit-graph.c leaks some lits
and strings during execution. In addition, a list of strings is
leaked in write_commit_graph_reachable(). Clean these up so our
memory checking is cleaner.

Further, if we use a list of pack-files to find the commits, we
can leak the packed_git structs after scanning them for commits.

Running the following commands demonstrates the leak before and
the fix after:

* valgrind --leak-check=full ./git commit-graph write --reachable
* valgrind --leak-check=full ./git commit-graph write --stdin-packs

Signed-off-by: Martin Ågren 
Signed-off-by: Derrick Stolee 
---
 commit-graph.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 2a24eb8b5a..ceca6026b0 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -693,11 +693,12 @@ static int add_ref_to_list(const char *refname,
 void write_commit_graph_reachable(const char *obj_dir, int append,
  int report_progress)
 {
-   struct string_list list;
+   struct string_list list = STRING_LIST_INIT_DUP;
 
-   string_list_init(&list, 1);
for_each_ref(add_ref_to_list, &list);
write_commit_graph(obj_dir, NULL, &list, append, report_progress);
+
+   string_list_clear(&list, 0);
 }
 
 void write_commit_graph(const char *obj_dir,
@@ -764,6 +765,7 @@ void write_commit_graph(const char *obj_dir,
die(_("error opening index for %s"), 
packname.buf);
for_each_object_in_pack(p, add_packed_commits, &oids, 
0);
close_pack(p);
+   free(p);
}
stop_progress(&oids.progress);
strbuf_release(&packname);
@@ -846,9 +848,11 @@ void write_commit_graph(const char *obj_dir,
compute_generation_numbers(&commits, report_progress);
 
graph_name = get_commit_graph_filename(obj_dir);
-   if (safe_create_leading_directories(graph_name))
+   if (safe_create_leading_directories(graph_name)) {
+   UNLEAK(graph_name);
die_errno(_("unable to create leading directories of %s"),
  graph_name);
+   }
 
hold_lock_file_for_update(&lk, graph_name, LOCK_DIE_ON_ERROR);
f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
@@ -893,9 +897,9 @@ void write_commit_graph(const char *obj_dir,
finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
commit_lock_file(&lk);
 
+   free(graph_name);
+   free(commits.list);
free(oids.list);
-   oids.alloc = 0;
-   oids.nr = 0;
 }
 
 #define VERIFY_COMMIT_GRAPH_ERROR_HASH 2
-- 
gitgitgadget



[PATCH v2 3/3] commit-graph: reduce initial oid allocation

2018-10-03 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

While writing a commit-graph file, we store the full list of
commits in a flat list. We use this list for sorting and ensuring
we are closed under reachability.

The initial allocation assumed that (at most) one in four objects
is a commit. This is a dramatic over-count for many repos,
especially large ones. Since we grow the repo dynamically, reduce
this count by a factor of eight. We still set it to a minimum of
1024 before allocating.

Signed-off-by: Derrick Stolee 
---
 commit-graph.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index ceca6026b0..e773703e1d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -720,7 +720,7 @@ void write_commit_graph(const char *obj_dir,
struct progress *progress = NULL;
 
oids.nr = 0;
-   oids.alloc = approximate_object_count() / 4;
+   oids.alloc = approximate_object_count() / 32;
oids.progress = NULL;
oids.progress_done = 0;
 
-- 
gitgitgadget


[PATCH v2 0/3] Clean up leaks in commit-graph.c

2018-10-03 Thread Derrick Stolee via GitGitGadget
While looking at the commit-graph code, I noticed some memory leaks. These
can be found by running

valgrind --leak-check=full ./git commit-graph write --reachable

The impact of these leaks are small, as we never call write_commit_graph
_reachable in a loop, but it is best to be diligent here.

While looking at memory consumption within write_commit_graph(), I noticed
that we initialize our oid list with "object count / 4", which seems to be a
huge over-count. I reduce this by a factor of eight.

I built off of ab/commit-graph-progress, because my patch involves lines
close to those changes.

V2 includes feedback from V1 along with Martin's additional patches.

Thanks, -Stolee

Derrick Stolee (2):
  commit-graph: clean up leaked memory during write
  commit-graph: reduce initial oid allocation

Martin Ågren (1):
  builtin/commit-graph.c: UNLEAK variables

 builtin/commit-graph.c | 11 ++-
 commit-graph.c | 16 ++--
 2 files changed, 16 insertions(+), 11 deletions(-)


base-commit: 6b89a34c89fc763292f06012318b852b74825619
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-42%2Fderrickstolee%2Fcommit-graph-leak-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-42/derrickstolee/commit-graph-leak-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/42

Range-diff vs v1:

 1:  6906c25415 ! 1:  ba65680b3d commit-graph: clean up leaked memory during 
write
 @@ -7,17 +7,29 @@
  leaked in write_commit_graph_reachable(). Clean these up so our
  memory checking is cleaner.
  
 -Running 'valgrind --leak-check=full git commit-graph write
 ---reachable' demonstrates these leaks and how they are fixed after
 -this change.
 +Further, if we use a list of pack-files to find the commits, we
 +can leak the packed_git structs after scanning them for commits.
  
 +Running the following commands demonstrates the leak before and
 +the fix after:
 +
 +* valgrind --leak-check=full ./git commit-graph write --reachable
 +* valgrind --leak-check=full ./git commit-graph write --stdin-packs
 +
 +Signed-off-by: Martin Ågren 
      Signed-off-by: Derrick Stolee 
  
  diff --git a/commit-graph.c b/commit-graph.c
  --- a/commit-graph.c
  +++ b/commit-graph.c
  @@
 -  string_list_init(&list, 1);
 + void write_commit_graph_reachable(const char *obj_dir, int append,
 +int report_progress)
 + {
 +- struct string_list list;
 ++ struct string_list list = STRING_LIST_INIT_DUP;
 + 
 +- string_list_init(&list, 1);
for_each_ref(add_ref_to_list, &list);
write_commit_graph(obj_dir, NULL, &list, append, report_progress);
  +
 @@ -25,6 +37,14 @@
   }
   
   void write_commit_graph(const char *obj_dir,
 +@@
 +  die(_("error opening index for %s"), 
packname.buf);
 +  for_each_object_in_pack(p, add_packed_commits, &oids, 
0);
 +  close_pack(p);
 ++ free(p);
 +  }
 +  stop_progress(&oids.progress);
 +  strbuf_release(&packname);
  @@
compute_generation_numbers(&commits, report_progress);
   
 @@ -45,5 +65,8 @@
  + free(graph_name);
  + free(commits.list);
free(oids.list);
 -  oids.alloc = 0;
 -  oids.nr = 0;
 +- oids.alloc = 0;
 +- oids.nr = 0;
 + }
 + 
 + #define VERIFY_COMMIT_GRAPH_ERROR_HASH 2
 -:  -- > 2:  13032d8475 builtin/commit-graph.c: UNLEAK variables
 2:  e29a0eaf03 = 3:  1002fd34fc commit-graph: reduce initial oid allocation

-- 
gitgitgadget


Re: [PATCH 0/2] commit-graph: more leak fixes

2018-10-03 Thread Derrick Stolee

On 10/3/2018 11:36 AM, Martin Ågren wrote:

Hi Derrick,

These two patches on top of yours make the test suite (i.e., the subset
of it that I run) leak-free with respect to builtin/commit-graph.c and
commit-graph.c.


Thanks!


The first could be squashed into your patch 1/2. It touches the same
function, but it requires a different usage to trigger, so squashing it
in would require broadening the scope. I understand if you don't want to
do that.
I'm fine with squashing it in with both our sign-offs. It is the same 
idea, it just requires a different set of arguments to hit it. I'll 
adjust the commit message as necessary.



If you want to pick these up as part of your re-roll in any way, shape
or form, go ahead. If not, they can go in separately, either in parallel
or after your series lands. Whatever the destiny of this posting, I'll
follow through as appropriate.


I'll add your PATCH 2/2 to my v2. Thanks!


Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-03 Thread Derrick Stolee

On 10/3/2018 9:36 AM, SZEDER Gábor wrote:

On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote:

Don't have time to patch this now, but thought I'd send a note / RFC
about this.

Now that we have the commit graph it's nice to be able to set
e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or
/etc/gitconfig to apply them to all repos.

But when I clone e.g. linux.git stuff like 'tag --contains' will be slow
until whenever my first "gc" kicks in, which may be quite some time if
I'm just using it passively.

So we should make "git gc --auto" be run on clone,

There is no garbage after 'git clone'...


And since there is no garbage, the gc will not write the commit-graph.




and change the
need_to_gc() / cmd_gc() behavior so that we detect that the
gc.writeCommitGraph=true setting is on, but we have no commit graph, and
then just generate that without doing a full repack.

Or just teach 'git clone' to run 'git commit-graph write ...'


I plan to add a 'fetch.writeCommitGraph' config setting. I was waiting 
until the file is incremental (on my to-do list soon), so the write is 
fast when only adding a few commits at a time. This would cover the 
clone case, too.


Thanks,
-Stolee


Re: [PATCH 1/2] commit-graph: clean up leaked memory during write

2018-10-03 Thread Derrick Stolee

On 10/2/2018 6:44 PM, Stefan Beller wrote:

My preference is to avoid them in the name of simplicity. If you're
using "make SANITIZE=leak test" to check for leaks, it will skip these
cases. If you're using valgrind, I think these may be reported as
"reachable". But that number already isn't useful for finding real
leaks, because it includes cases like the "foo" above as well as
program-lifetime globals.

The only argument (IMHO) for such an UNLEAK() is that it annotates the
location for when somebody later changes the function to "return -1"
instead of dying. But if we are going to do such annotation, we may as
well actually call free(), which is what the "return" version will
ultimately have to do.

Heh, that was part of my reasoning why we'd want to have *something*.


I'd actually be _more_ favorable to calling free() instead of UNLEAK()
there, but I'm still mildly negative, just because it may go stale (and
our leak-checking wouldn't usefully notice these cases). Anybody
converting that die() to a return needs to re-analyze the function for
what might need to be released (and that includes non-memory bits like
descriptors, too).

Sounds reasonable, so then the consensus (between Martin, you and me)
is to drop the UNLEAK.
Thanks for the discussion here. I'll drop the UNLEAK for now and think 
about how to remove the die() calls from commit-graph.c in a later series.


Thanks,
-Stolee


Re: [PATCH] more oideq/hasheq conversions

2018-10-02 Thread Derrick Stolee

On 10/2/2018 5:19 PM, Jeff King wrote:

We added faster equality-comparison functions for hashes in
14438c4497 (introduce hasheq() and oideq(), 2018-08-28). A
few topics were in-flight at the time, and can now be
converted. This covers all spots found by "make coccicheck"
in master (the coccicheck results were tweaked by hand for
style).

Signed-off-by: Jeff King 
---
Jake: I was surprised that this was not a "patch 2" on top of your
earlier coccicheck patch. Apologies if you were planning to send it out.

This doesn't conflict with anything in "pu", so it's a reasonable time
to apply it. There are a few lingering cases in pu, so another option is
to wait a week or two and see if they get merged.

These conversions look good to me!

Reviewed-by: Derrick Stolee 


[PATCH 2/2] commit-graph: reduce initial oid allocation

2018-10-02 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

While writing a commit-graph file, we store the full list of
commits in a flat list. We use this list for sorting and ensuring
we are closed under reachability.

The initial allocation assumed that (at most) one in four objects
is a commit. This is a dramatic over-count for many repos,
especially large ones. Since we grow the repo dynamically, reduce
this count by a factor of eight. We still set it to a minimum of
1024 before allocating.

Signed-off-by: Derrick Stolee 
---
 commit-graph.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index 7226bd6b58..a24cceb55f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -721,7 +721,7 @@ void write_commit_graph(const char *obj_dir,
struct progress *progress = NULL;
 
oids.nr = 0;
-   oids.alloc = approximate_object_count() / 4;
+   oids.alloc = approximate_object_count() / 32;
oids.progress = NULL;
oids.progress_done = 0;
 
-- 
gitgitgadget


[PATCH 1/2] commit-graph: clean up leaked memory during write

2018-10-02 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The write_commit_graph() method in commit-graph.c leaks some lits
and strings during execution. In addition, a list of strings is
leaked in write_commit_graph_reachable(). Clean these up so our
memory checking is cleaner.

Running 'valgrind --leak-check=full git commit-graph write
--reachable' demonstrates these leaks and how they are fixed after
this change.

Signed-off-by: Derrick Stolee 
---
 commit-graph.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index 2a24eb8b5a..7226bd6b58 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -698,6 +698,8 @@ void write_commit_graph_reachable(const char *obj_dir, int 
append,
string_list_init(&list, 1);
for_each_ref(add_ref_to_list, &list);
write_commit_graph(obj_dir, NULL, &list, append, report_progress);
+
+   string_list_clear(&list, 0);
 }
 
 void write_commit_graph(const char *obj_dir,
@@ -846,9 +848,11 @@ void write_commit_graph(const char *obj_dir,
compute_generation_numbers(&commits, report_progress);
 
graph_name = get_commit_graph_filename(obj_dir);
-   if (safe_create_leading_directories(graph_name))
+   if (safe_create_leading_directories(graph_name)) {
+   UNLEAK(graph_name);
die_errno(_("unable to create leading directories of %s"),
  graph_name);
+   }
 
hold_lock_file_for_update(&lk, graph_name, LOCK_DIE_ON_ERROR);
f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
@@ -893,6 +897,8 @@ void write_commit_graph(const char *obj_dir,
finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
commit_lock_file(&lk);
 
+   free(graph_name);
+   free(commits.list);
free(oids.list);
oids.alloc = 0;
oids.nr = 0;
-- 
gitgitgadget



[PATCH 0/2] Clean up leaks in commit-graph.c

2018-10-02 Thread Derrick Stolee via GitGitGadget
While looking at the commit-graph code, I noticed some memory leaks. These
can be found by running

valgrind --leak-check=full ./git commit-graph write --reachable

The impact of these leaks are small, as we never call write_commit_graph
_reachable in a loop, but it is best to be diligent here.

While looking at memory consumption within write_commit_graph(), I noticed
that we initialize our oid list with "object count / 4", which seems to be a
huge over-count. I reduce this by a factor of eight.

I built off of ab/commit-graph-progress, because my patch involves lines
close to those changes.

Thanks, -Stolee

Derrick Stolee (2):
  commit-graph: clean up leaked memory during write
  commit-graph: reduce initial oid allocation

 commit-graph.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)


base-commit: 6b89a34c89fc763292f06012318b852b74825619
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-42%2Fderrickstolee%2Fcommit-graph-leak-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-42/derrickstolee/commit-graph-leak-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/42
-- 
gitgitgadget


Re: [PATCH 15/16] commit-reach: make can_all_from_reach... linear

2018-10-01 Thread Derrick Stolee

On 10/1/2018 3:16 PM, René Scharfe wrote:

Am 28.06.2018 um 14:31 schrieb Derrick Stolee via GitGitGadget:

diff --git a/commit-reach.c b/commit-reach.c
index c58e50fbb..ac132c8e4 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -513,65 +513,88 @@ int commit_contains(struct ref_filter *filter, struct 
commit *commit,
return is_descendant_of(commit, list);
  }
  
-int reachable(struct commit *from, int with_flag, int assign_flag,

- time_t min_commit_date)
+static int compare_commits_by_gen(const void *_a, const void *_b)
  {
-   struct prio_queue work = { compare_commits_by_commit_date };
+   const struct commit *a = (const struct commit *)_a;
+   const struct commit *b = (const struct commit *)_b;

This cast is bogus.  QSORT gets handed a struct commit **, i.e. an array
of pointers, and qsort(1) passes references to those pointers to the
compare function, and not the pointer values.


Good catch! I'm disappointed that we couldn't use type-checking here, as 
it is quite difficult to discover that the types are wrong here.




As a result it's unlikely that the array is sorted in the intended
order.  Given that, a silly question: Is sorting even necessary here?


The reason to sort is to hopefully minimize the amount we walk by 
exploring the "lower" commits first. This is a performance-only thing, 
not a correctness issue (which is why the bug exists). Even then, it is 
just a heuristic.

Anyway, the patch below should fix it.

-- >8 --
Subject: [PATCH] commit-reach: fix cast in compare_commits_by_gen()

The elements of the array to be sorted are commit pointers, so the
comparison function gets handed references to these pointers, not
pointers to commit objects.  Cast to the right type and dereference
once to correctly get the commit reference.

Found using Clang's ASan and t5500.

Signed-off-by: Rene Scharfe 
---
Has this patch a performance impact?

  commit-reach.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/commit-reach.c b/commit-reach.c
index 00e5ceee6f..2f5e592d16 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -529,8 +529,8 @@ int commit_contains(struct ref_filter *filter, struct 
commit *commit,
  
  static int compare_commits_by_gen(const void *_a, const void *_b)

  {
-   const struct commit *a = (const struct commit *)_a;
-   const struct commit *b = (const struct commit *)_b;
+   const struct commit *a = *(const struct commit * const *)_a;
+   const struct commit *b = *(const struct commit * const *)_b;


I would expect s/* const */**/ here, but I'm guessing your formulation 
is a bit extra careful about types.


Thanks!

Reviewed-by: Derrick Stolee 


Re: [PATCH] read-cache: fix division by zero core-dump

2018-10-01 Thread Derrick Stolee

On 9/28/2018 11:31 AM, Ramsay Jones wrote:

Also, this is not the first time some multi-threaded code in git
has 'failed' by assuming more than one cpu, so ...

I wonder if this is a good time to create a GIT_TEST_CPU_COUNT variable 
so we can mock out single-processor environments instead of relying on 
old hardware or custom VMs.


Thanks,
-Stolee


Re: Git Evolve

2018-10-01 Thread Derrick Stolee

On 9/29/2018 7:00 PM, Stefan Xenos wrote:

Hello, List!


Hello! Welcome.


I'm interested in porting something like Mercurial's evolve command to
Git. I'll be following up with a formal proposal shortly, but before I
do I thought I'd introduce myself to the list and find out if anyone
else is interested in this topic.


I'm CC'ing some contributors who have also expressed interest in this topic.


What is the evolve command?


I'm snipping the rest of your thread because I'm vaguely familiar with 
how this is used in hg, but I haven't used it myself. Instead, I'm going 
to ask you the same questions I asked the last time I had a conversation 
about this with someone. In my opinion, these questions should have good 
answers before we start working on the solution, or else we could paint 
ourselves into a corner as we build the first pieces.


---

What would the command-line experience look like for this workflow? Be 
specific, including examples!


How does one create a commit that obsoletes another? Are we in the 
middle of an interactive rebase, or are we simply checking out the 
commit? How does a use keep track of their progress in a topic?


How do I view which commits in my topic are obsolete, and to what commits?

If I want to obsolete commits on one machine and then finish the work on 
another machine, how do I do that? Similarly: how can I share obsolete 
commits with other users so they can apply them (or not)?


Do obsolescence markers live in the ref space? (This is one way to help 
answer the question above.)


Can I make multiple commits obsolete into one commit (merge patches)? 
Can I make my commit obsolete in multiple ways (split a patch)? How is 
this communicated to the user?


---

In my opinion, a good way to move forward is to create a patch that adds 
a design document to Documentation/technical that answers these 
questions. Then we can dig in more to make the vision clearer.


I'm not a UX expert, but this seems like a place where we could use 
someone with expertise in the area. If we are not careful, we could end 
up with something even harder to use than interactive rebase. The main 
goal here is to make it easy to rewrite a topic (plus, the ability to do 
so in stages).


I look forward to see what goes on in this space. Count me in to review 
the technical details.


Thanks,
-Stolee


Re: [PATCH v2 0/4] git-commit-graph.txt: various cleanups

2018-09-27 Thread Derrick Stolee

On 9/27/2018 3:12 PM, Martin Ågren wrote:

This v2 starts with the same two patches as v1 did, then goes on to
change "[commit] graph file" to "commit-graph file" with a dash, to
match other instances as well as Derrick's feedback.

Thanks! This version satisfies my concerns and looks good to me.

Reviewed-by: Derrick Stolee 


Re: [PATCH v3 7/7] revision.c: refactor basic topo-order logic

2018-09-27 Thread Derrick Stolee

On 9/21/2018 1:39 PM, Derrick Stolee via GitGitGadget wrote:

From: Derrick Stolee 

When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:

1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.

2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.

3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().

In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.

Hello, Git contributors.

I understand that this commit message and patch are pretty daunting. 
There is a lot to read and digest. I would like to see if anyone is 
willing to put the work in to review this patch, as I quite like what it 
does, and the performance numbers below.

In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.

Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
   HEAD, w/ commit-graph: 0.02 s

Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
   HEAD, w/ commit-graph: 0.06 s


If there is something I can do to make this easier to review, then 
please let me know.


Thanks,
-Stolee


Re: Git for Windows for Unix?

2018-09-27 Thread Derrick Stolee

On 9/27/2018 12:01 PM, Ævar Arnfjörð Bjarmason wrote:

I had an IRC conversation with Johannes saying I didn't know Git For
Windows builds perfectly well for Linux, this just isn't advertised in
the ANNOUNCE E-Mails, so I hadn't tried.


We run CI to ensure it builds and tests on Mac OSX, too. This is 
important to the VFS for Git group, as we work on making that work for 
Mac clients. We have our fork of Git for Windows at 
https://github.com/microsoft/git.



GFW is a "friendly fork", but a permanent one it seems. The diff between
it and 2.19.0 proper is ~10k lines, and e.g. this last release had
experimental stash/rebase in C that 2.19.0 didn't.


Hopefully we can learn from having this experimental feature in the wild 
and improve the patches on-list by fixing issues.


We do have a desire to move as much as possible upstream. It's difficult 
to find time to pay down that "fork debt".


Thanks,

-Stolee



<    5   6   7   8   9   10   11   12   13   14   >