Re: [PATCH 5/6] commit-graph: implement file format version 2

2019-01-24 Thread Derrick Stolee

On 1/24/2019 4:40 AM, Ævar Arnfjörð Bjarmason wrote:

On Wed, Jan 23 2019, Derrick Stolee via GitGitGadget wrote:


The new format modifies the header of the commit-graph to solve
these problems. We use the 4-byte hash format id, freeing up a byte
in our 32-bit alignment to introduce a reachability index version.
We can also fail to read the commit-graph if the eighth byte is
non-zero.

I haven't tested, but it looks from the patch like we can transparently
read existing v1 data and then will write v2 the next time. Would be
helpful for reviewers if this was noted explicitly in the commit
message.


Can do.



Should there be a GIT_TEST_COMMIT_GRAPH_VERSION=[12] going forward to
test the non-default version, or do you feel confident the tests added
here test the upgrade path & old code well enough?


You're right that we should have an explicit "upgrade" test:

 1. Write a v1 commit-graph
 2. Add a commit
 3. Write a v2 commit-graph

As for a new GIT_TEST_ variable, we should only need to continue relying 
on GIT_TEST_COMMIT_GRAPH to test v2. I can add a 'graph_git_behavior' 
call on an explicitly v1 commit-graph file to get most of the coverage 
we need.



The 'git commit-graph read' subcommand needs updating to show the
new data.

Let's say "The ... subcommand has been updated to show the new
data". This sounds like a later patch is going to do that, but in fact
it's done here.


Will clean up.


Set the default file format version to 2, and adjust the tests to
expect the new 'git commit-graph read' output.

Signed-off-by: Derrick Stolee 
---
  .../technical/commit-graph-format.txt | 26 +-
  builtin/commit-graph.c|  9 
  commit-graph.c| 47 ---
  commit-graph.h|  1 +
  t/t5318-commit-graph.sh   |  9 +++-
  5 files changed, 83 insertions(+), 9 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt 
b/Documentation/technical/commit-graph-format.txt
index 16452a0504..e367aa94b1 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -31,13 +31,22 @@ and hash type.

  All 4-byte numbers are in network order.

+There are two versions available, 1 and 2. These currently differ only in
+the header.

Shouldn't this be s/currently/ / ? Won't we add a version 3 if we make
new changes?


When we add a new reachability index version, then the content of the 
data chunk will change. Since we have a separate byte for versioning 
that data, we do not need a v3 for the file format as a whole. A similar 
statement applies to the unused byte reserved for the incremental file 
format: we won't need to increase the file format version as we will 
make that number non-zero and add a chunk with extra data.
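As a minimal illustration of that reply (hypothetical C with invented names, not Git's code): each feature-specific header byte gates readability on its own, so neither a new reachability index nor an incremental format requires a file-format v3.

```c
/*
 * Hypothetical sketch: with the file format fixed at v2, a reader only
 * needs to check the per-feature header bytes. All names are invented.
 */
#define REACH_INDEX_GENERATION 1	/* the only value clients accept today */

static int can_read_v2_graph(unsigned char reach_index_version,
			     unsigned char incremental_byte)
{
	if (reach_index_version != REACH_INDEX_GENERATION)
		return 0;	/* unknown reachability index: refuse the file */
	if (incremental_byte)
		return 0;	/* reserved byte non-zero: a future incremental format */
	return 1;
}
```

A client that predates a future reachability index simply refuses such files, while the overall format version stays at 2.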


Thanks,
-Stolee


Re: [PATCH 0/9] drop some unused parameters

2019-01-24 Thread Derrick Stolee

On 1/24/2019 8:11 AM, Jeff King wrote:

   2. Patches that drop unused parameters (i.e., code cleanup).
I've read the patches and they do exactly this, along with explanations 
of why they don't fit in the first category.


Thanks,
-Stolee


Re: [PATCH 5/8] show_date_relative(): drop unused "tz" parameter

2019-01-24 Thread Derrick Stolee

On 1/24/2019 8:12 AM, Jeff King wrote:

The timestamp we receive is in epoch time, so there's no need for a
timezone parameter to interpret it. The matching show_date() uses "tz"
to show dates in author local time, but relative dates show only the
absolute time difference. The author's location is irrelevant, barring
relativistic effects from using Git close to the speed of light.

I fully support more humor in our commit messages.

-Stolee


[PATCH 6/6] commit-graph: test verifying a corrupt v2 header

2019-01-23 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The commit-graph file format v2 changes the v1 data only in the
header information. Add tests that check the 'verify' subcommand
catches corruption in the v2 header.

Signed-off-by: Derrick Stolee 
---
 t/t5318-commit-graph.sh | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 3ff5e3b48d..be7bbf911a 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -497,6 +497,37 @@ test_expect_success 'git fsck (checks commit-graph)' '
test_must_fail git fsck
 '
 
+test_expect_success 'rewrite commit-graph with version 2' '
+   rm -f .git/objects/info/commit-graph &&
+   git commit-graph write --reachable --version=2 &&
+   git commit-graph verify
+'
+
+GRAPH_BYTE_CHUNK_COUNT=5
+GRAPH_BYTE_REACH_INDEX=6
+GRAPH_BYTE_UNUSED=7
+GRAPH_BYTE_HASH=8
+
+test_expect_success 'detect low chunk count (v2)' '
+   corrupt_graph_and_verify $GRAPH_BYTE_CHUNK_COUNT "\02" \
+   "missing the .* chunk"
+'
+
+test_expect_success 'detect incorrect reachability index' '
+   corrupt_graph_and_verify $GRAPH_BYTE_REACH_INDEX "\03" \
+   "reachability index version"
+'
+
+test_expect_success 'detect non-zero unused byte' '
+   corrupt_graph_and_verify $GRAPH_BYTE_UNUSED "\01" \
+   "unsupported value"
+'
+
+test_expect_success 'detect bad hash version (v2)' '
+   corrupt_graph_and_verify $GRAPH_BYTE_HASH "\00" \
+   "hash algorithm"
+'
+
 test_expect_success 'setup non-the_repository tests' '
rm -rf repo &&
git init repo &&
-- 
gitgitgadget


[PATCH 3/6] commit-graph: create new version flags

2019-01-23 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

In anticipation of a new commit-graph file format version, create
a flag for the write_commit_graph() and write_commit_graph_reachable()
methods to take a version number.

When there is no specified version, the implementation selects a
default value. Currently, the only valid value is 1.

The file format will change the header information, so place the
existing header logic inside a switch statement with only one case.

Signed-off-by: Derrick Stolee 
---
 commit-graph.c | 58 +-
 commit-graph.h |  1 +
 2 files changed, 40 insertions(+), 19 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 28fe2378be..f7f45893fd 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -25,9 +25,6 @@
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
-#define GRAPH_VERSION_1 0x1
-#define GRAPH_VERSION GRAPH_VERSION_1
-
 #define GRAPH_EXTRA_EDGES_NEEDED 0x8000
 #define GRAPH_PARENT_MISSING 0x7fff
 #define GRAPH_EDGE_LAST_MASK 0x7fff
@@ -118,30 +115,35 @@ struct commit_graph *load_commit_graph_one(const char 
*graph_file)
}
 
graph_version = *(unsigned char*)(data + 4);
-   if (graph_version != GRAPH_VERSION) {
+   if (graph_version != 1) {
error(_("graph version %X does not match version %X"),
- graph_version, GRAPH_VERSION);
-   goto cleanup_fail;
-   }
-
-   hash_version = *(unsigned char*)(data + 5);
-   if (hash_version != oid_version()) {
-   error(_("hash version %X does not match version %X"),
- hash_version, oid_version());
+ graph_version, 1);
goto cleanup_fail;
}
 
graph = alloc_commit_graph();
 
+   switch (graph_version) {
+   case 1:
+   hash_version = *(unsigned char*)(data + 5);
+   if (hash_version != oid_version()) {
+   error(_("hash version %X does not match version %X"),
+ hash_version, oid_version());
+   goto cleanup_fail;
+   }
+
+   graph->num_chunks = *(unsigned char*)(data + 6);
+   chunk_lookup = data + 8;
+   break;
+   }
+
graph->hash_len = the_hash_algo->rawsz;
-   graph->num_chunks = *(unsigned char*)(data + 6);
graph->graph_fd = fd;
graph->data = graph_map;
graph->data_len = graph_size;
 
last_chunk_id = 0;
last_chunk_offset = 8;
-   chunk_lookup = data + 8;
for (i = 0; i < graph->num_chunks; i++) {
uint32_t chunk_id = get_be32(chunk_lookup + 0);
uint64_t chunk_offset = get_be64(chunk_lookup + 4);
@@ -792,10 +794,22 @@ int write_commit_graph(const char *obj_dir,
int res = 0;
int append = flags & COMMIT_GRAPH_APPEND;
int report_progress = flags & COMMIT_GRAPH_PROGRESS;
+   int version = 0;
+   int header_size = 0;
 
if (!commit_graph_compatible(the_repository))
return 0;
 
+   if (flags & COMMIT_GRAPH_VERSION_1)
+   version = 1;
+   if (!version)
+   version = 1;
+   if (version != 1) {
+   error(_("unsupported commit-graph version %d"),
+ version);
+   return 1;
+   }
+
oids.nr = 0;
approx_nr_objects = approximate_object_count();
oids.alloc = approx_nr_objects / 32;
@@ -980,10 +994,16 @@ int write_commit_graph(const char *obj_dir,
 
hashwrite_be32(f, GRAPH_SIGNATURE);
 
-   hashwrite_u8(f, GRAPH_VERSION);
-   hashwrite_u8(f, oid_version());
-   hashwrite_u8(f, num_chunks);
-   hashwrite_u8(f, 0); /* unused padding byte */
+   hashwrite_u8(f, version);
+
+   switch (version) {
+   case 1:
+   hashwrite_u8(f, oid_version());
+   hashwrite_u8(f, num_chunks);
+   hashwrite_u8(f, 0); /* unused padding byte */
+   header_size = 8;
+   break;
+   }
 
chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
@@ -994,7 +1014,7 @@ int write_commit_graph(const char *obj_dir,
chunk_ids[3] = 0;
chunk_ids[4] = 0;
 
-   chunk_offsets[0] = 8 + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
+   chunk_offsets[0] = header_size + (num_chunks + 1) * 
GRAPH_CHUNKLOOKUP_WIDTH;
chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
chunk_offsets[2] = chunk_offsets[1] + hashsz * commits.nr;
chunk_offsets[3] = chunk_offsets[2] + (hashsz + 16) * commits.nr;
diff --git a/commit-graph.h b/commit-graph.h
index 83fa548138..e03df54e33 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -62,6 +62,7 @@ int generation_numbers_enabled(struct repository *r);
 
 #define COMMIT_GRAPH_APPEND 

[PATCH 5/6] commit-graph: implement file format version 2

2019-01-23 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The commit-graph file format had some shortcomings which we now
correct:

  1. The hash algorithm was determined by a single byte, instead
 of the 4-byte format identifier.

  2. There was no way to update the reachability index we used.
 We currently only support generation numbers, but that will
 change in the future.

  3. Git did not fail with error if the unused eighth byte was
 non-zero, so we could not use that to indicate an incremental
 file format without breaking compatibility across versions.

The new format modifies the header of the commit-graph to solve
these problems. We use the 4-byte hash format id, freeing up a byte
in our 32-bit alignment to introduce a reachability index version.
We can also fail to read the commit-graph if the eighth byte is
non-zero.

The 'git commit-graph read' subcommand needs updating to show the
new data.

Set the default file format version to 2, and adjust the tests to
expect the new 'git commit-graph read' output.

Signed-off-by: Derrick Stolee 
---
 .../technical/commit-graph-format.txt | 26 +-
 builtin/commit-graph.c|  9 
 commit-graph.c| 47 ---
 commit-graph.h|  1 +
 t/t5318-commit-graph.sh   |  9 +++-
 5 files changed, 83 insertions(+), 9 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt 
b/Documentation/technical/commit-graph-format.txt
index 16452a0504..e367aa94b1 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -31,13 +31,22 @@ and hash type.
 
 All 4-byte numbers are in network order.
 
+There are two versions available, 1 and 2. These currently differ only in
+the header.
+
 HEADER:
 
+All commit-graph files use the first five bytes for the same purpose.
+
   4-byte signature:
   The signature is: {'C', 'G', 'P', 'H'}
 
   1-byte version number:
-  Currently, the only valid version is 1.
+  Currently, the valid version numbers are 1 and 2.
+
+The remainder of the header changes depending on the version.
+
+Version 1:
 
   1-byte Hash Version (1 = SHA-1)
   We infer the hash length (H) from this value.
@@ -47,6 +56,21 @@ HEADER:
   1-byte (reserved for later use)
  Current clients should ignore this value.
 
+Version 2:
+
+  1-byte number (C) of "chunks"
+
+  1-byte reachability index version number:
+  Currently, the only valid number is 1.
+
+  1-byte (reserved for later use)
+  Current clients expect this value to be zero, and will not
+  try to read the commit-graph file if it is non-zero.
+
+  4-byte format identifier for the hash algorithm:
+  If this identifier does not agree with the repository's current
+  hash algorithm, then the client will not read the commit graph.
+
 CHUNK LOOKUP:
 
   (C + 1) * 12 bytes listing the table of contents for the chunks:
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index b1bed84260..28787d0c9c 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -102,6 +102,11 @@ static int graph_read(int argc, const char **argv)
*(unsigned char*)(graph->data + 5),
*(unsigned char*)(graph->data + 6),
*(unsigned char*)(graph->data + 7));
+
+   if (*(unsigned char *)(graph->data + 4) == 2)
+   printf("hash algorithm: %X\n",
+  get_be32(graph->data + 8));
+
printf("num_commits: %u\n", graph->num_commits);
printf("chunks:");
 
@@ -162,6 +167,10 @@ static int graph_write(int argc, const char **argv)
case 1:
flags |= COMMIT_GRAPH_VERSION_1;
break;
+
+   case 2:
+   flags |= COMMIT_GRAPH_VERSION_2;
+   break;
}
 
read_replace_refs = 0;
diff --git a/commit-graph.c b/commit-graph.c
index f7f45893fd..aeb6cae656 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -90,7 +90,8 @@ struct commit_graph *load_commit_graph_one(const char 
*graph_file)
uint64_t last_chunk_offset;
uint32_t last_chunk_id;
uint32_t graph_signature;
-   unsigned char graph_version, hash_version;
+   unsigned char graph_version, hash_version, reach_index_version;
+   uint32_t hash_id;
 
if (fd < 0)
return NULL;
@@ -115,9 +116,9 @@ struct commit_graph *load_commit_graph_one(const char 
*graph_file)
}
 
graph_version = *(unsigned char*)(data + 4);
-   if (graph_version != 1) {
-   error(_("graph version %X does not match version %X"),
- graph_version, 1);
+   if (!graph_version || graph_version > 2) {
+   error(_("unsupported graph version %X"),
+ g

[PATCH 2/6] commit-graph: collapse parameters into flags

2019-01-23 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The write_commit_graph() and write_commit_graph_reachable() methods
currently take two boolean parameters: 'append' and 'report_progress'.
We will soon expand the possible options to send to these methods, so
instead of complicating the parameter list, first simplify it.

Collapse these parameters into a 'flags' parameter, and adjust the
callers to provide flags as necessary.

Signed-off-by: Derrick Stolee 
---
 builtin/commit-graph.c | 8 +---
 builtin/commit.c   | 2 +-
 builtin/gc.c   | 4 ++--
 commit-graph.c | 9 +
 commit-graph.h | 8 +---
 5 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index b12d46fdc8..0c92421f75 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -127,6 +127,7 @@ static int graph_write(int argc, const char **argv)
struct string_list *commit_hex = NULL;
struct string_list lines;
int result;
+   int flags = COMMIT_GRAPH_PROGRESS;
 
static struct option builtin_commit_graph_write_options[] = {
OPT_STRING(0, "object-dir", &opts.obj_dir,
@@ -151,11 +152,13 @@ static int graph_write(int argc, const char **argv)
die(_("use at most one of --reachable, --stdin-commits, or 
--stdin-packs"));
if (!opts.obj_dir)
opts.obj_dir = get_object_directory();
+   if (opts.append)
+   flags |= COMMIT_GRAPH_APPEND;
 
read_replace_refs = 0;
 
if (opts.reachable)
-   return write_commit_graph_reachable(opts.obj_dir, opts.append, 
1);
+   return write_commit_graph_reachable(opts.obj_dir, flags);
 
string_list_init(&lines, 0);
if (opts.stdin_packs || opts.stdin_commits) {
@@ -175,8 +178,7 @@ static int graph_write(int argc, const char **argv)
result = write_commit_graph(opts.obj_dir,
pack_indexes,
commit_hex,
-   opts.append,
-   1);
+   flags);
 
UNLEAK(lines);
return result;
diff --git a/builtin/commit.c b/builtin/commit.c
index 04b0717b35..3228de4e3c 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -1668,7 +1668,7 @@ int cmd_commit(int argc, const char **argv, const char 
*prefix)
  "not exceeded, and then \"git reset HEAD\" to recover."));
 
if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
-   write_commit_graph_reachable(get_object_directory(), 0, 0))
+   write_commit_graph_reachable(get_object_directory(), 0))
return 1;
 
repo_rerere(the_repository, 0);
diff --git a/builtin/gc.c b/builtin/gc.c
index 9c6c9c9007..198872206b 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -663,8 +663,8 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
clean_pack_garbage();
 
if (gc_write_commit_graph &&
-   write_commit_graph_reachable(get_object_directory(), 0,
-!quiet && !daemonized))
+   write_commit_graph_reachable(get_object_directory(),
+!quiet && !daemonized ? 
COMMIT_GRAPH_PROGRESS : 0))
return 1;
 
if (auto_gc && too_many_loose_objects())
diff --git a/commit-graph.c b/commit-graph.c
index 162b9f2a85..28fe2378be 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -755,15 +755,14 @@ static int add_ref_to_list(const char *refname,
return 0;
 }
 
-int write_commit_graph_reachable(const char *obj_dir, int append,
-int report_progress)
+int write_commit_graph_reachable(const char *obj_dir, int flags)
 {
struct string_list list = STRING_LIST_INIT_DUP;
int result;
 
for_each_ref(add_ref_to_list, &list);
result = write_commit_graph(obj_dir, NULL, &list,
-   append, report_progress);
+   flags);
 
string_list_clear(&list, 0);
return result;
@@ -772,7 +771,7 @@ int write_commit_graph_reachable(const char *obj_dir, int 
append,
 int write_commit_graph(const char *obj_dir,
   struct string_list *pack_indexes,
   struct string_list *commit_hex,
-  int append, int report_progress)
+  int flags)
 {
struct packed_oid_list oids;
struct packed_commit_list commits;
@@ -791,6 +790,8 @@ int write_commit_graph(const char *obj_dir,
struct strbuf progress_title = STRBUF_INIT;
unsigned long approx_nr_objects;
int res = 0;
+   int append = flags & COMMIT_GRAPH_APPEND;
+   int report_progress = flags & COMMIT_GRAP

[PATCH 4/6] commit-graph: add --version= option

2019-01-23 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

Allow the commit-graph builtin to specify the file format version
using the '--version=' option. Specify the version exactly in
the verification tests, since using a different version would change
the offsets used in those tests.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-commit-graph.txt |  3 +++
 builtin/commit-graph.c | 13 +++--
 t/t5318-commit-graph.sh|  2 +-
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 624470e198..1d1cc70de4 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -51,6 +51,9 @@ or `--stdin-packs`.)
 +
 With the `--append` option, include all commits that are present in the
 existing commit-graph file.
++
+With the `--version=` option, specify the file format version. Used
+only for testing.
 
 'read'::
 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 0c92421f75..b1bed84260 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -10,7 +10,7 @@ static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
N_("git commit-graph verify [--object-dir ]"),
-   N_("git commit-graph write [--object-dir ] [--append] 
[--reachable|--stdin-packs|--stdin-commits]"),
+   N_("git commit-graph write [--object-dir ] [--append] 
[--reachable|--stdin-packs|--stdin-commits] [--version=]"),
NULL
 };
 
@@ -25,7 +25,7 @@ static const char * const builtin_commit_graph_read_usage[] = 
{
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-   N_("git commit-graph write [--object-dir ] [--append] 
[--reachable|--stdin-packs|--stdin-commits]"),
+   N_("git commit-graph write [--object-dir ] [--append] 
[--reachable|--stdin-packs|--stdin-commits] [--version=]"),
NULL
 };
 
@@ -35,6 +35,7 @@ static struct opts_commit_graph {
int stdin_packs;
int stdin_commits;
int append;
+   int version;
 } opts;
 
 
@@ -141,6 +142,8 @@ static int graph_write(int argc, const char **argv)
N_("start walk at commits listed by stdin")),
OPT_BOOL(0, "append", &opts.append,
N_("include all commits already in the commit-graph 
file")),
+   OPT_INTEGER(0, "version", &opts.version,
+   N_("specify the file format version")),
OPT_END(),
};
 
@@ -155,6 +158,12 @@ static int graph_write(int argc, const char **argv)
if (opts.append)
flags |= COMMIT_GRAPH_APPEND;
 
+   switch (opts.version) {
+   case 1:
+   flags |= COMMIT_GRAPH_VERSION_1;
+   break;
+   }
+
read_replace_refs = 0;
 
if (opts.reachable)
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index f4deb13b1d..b79d6263e9 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -328,7 +328,7 @@ test_expect_success 'replace-objects invalidates 
commit-graph' '
 
 test_expect_success 'git commit-graph verify' '
cd "$TRASH_DIRECTORY/full" &&
-   git rev-parse commits/8 | git commit-graph write --stdin-commits &&
+   git rev-parse commits/8 | git commit-graph write --stdin-commits 
--version=1 &&
git commit-graph verify >output
 '
 
-- 
gitgitgadget



[PATCH 1/6] commit-graph: return with errors during write

2019-01-23 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The write_commit_graph() method uses die() to report failure and
exit when confronted with an unexpected condition. This use of
die() in a library function is incorrect and is now replaced by
error() statements and an int return type.

Now that we use 'goto cleanup' to jump to the terminal condition
on an error, we have new paths that could lead to uninitialized
values. New initializers are added to correct for this.

The builtins 'commit-graph', 'gc', and 'commit' call these methods,
so update them to check the return value.

Signed-off-by: Derrick Stolee 
---
 builtin/commit-graph.c | 19 +++--
 builtin/commit.c   |  5 ++--
 builtin/gc.c   |  7 ++---
 commit-graph.c | 60 +-
 commit-graph.h | 10 +++
 5 files changed, 62 insertions(+), 39 deletions(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 4ae502754c..b12d46fdc8 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -126,6 +126,7 @@ static int graph_write(int argc, const char **argv)
struct string_list *pack_indexes = NULL;
struct string_list *commit_hex = NULL;
struct string_list lines;
+   int result;
 
static struct option builtin_commit_graph_write_options[] = {
OPT_STRING(0, "object-dir", &opts.obj_dir,
@@ -153,10 +154,8 @@ static int graph_write(int argc, const char **argv)
 
read_replace_refs = 0;
 
-   if (opts.reachable) {
-   write_commit_graph_reachable(opts.obj_dir, opts.append, 1);
-   return 0;
-   }
+   if (opts.reachable)
+   return write_commit_graph_reachable(opts.obj_dir, opts.append, 
1);
 
string_list_init(&lines, 0);
if (opts.stdin_packs || opts.stdin_commits) {
@@ -173,14 +172,14 @@ static int graph_write(int argc, const char **argv)
UNLEAK(buf);
}
 
-   write_commit_graph(opts.obj_dir,
-  pack_indexes,
-  commit_hex,
-  opts.append,
-  1);
+   result = write_commit_graph(opts.obj_dir,
+   pack_indexes,
+   commit_hex,
+   opts.append,
+   1);
 
UNLEAK(lines);
-   return 0;
+   return result;
 }
 
 int cmd_commit_graph(int argc, const char **argv, const char *prefix)
diff --git a/builtin/commit.c b/builtin/commit.c
index 004b816635..04b0717b35 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -1667,8 +1667,9 @@ int cmd_commit(int argc, const char **argv, const char 
*prefix)
  "new_index file. Check that disk is not full and quota 
is\n"
  "not exceeded, and then \"git reset HEAD\" to recover."));
 
-   if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0))
-   write_commit_graph_reachable(get_object_directory(), 0, 0);
+   if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
+   write_commit_graph_reachable(get_object_directory(), 0, 0))
+   return 1;
 
repo_rerere(the_repository, 0);
run_command_v_opt(argv_gc_auto, RUN_GIT_CMD);
diff --git a/builtin/gc.c b/builtin/gc.c
index 7696017cd4..9c6c9c9007 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -662,9 +662,10 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
if (pack_garbage.nr > 0)
clean_pack_garbage();
 
-   if (gc_write_commit_graph)
-   write_commit_graph_reachable(get_object_directory(), 0,
-!quiet && !daemonized);
+   if (gc_write_commit_graph &&
+   write_commit_graph_reachable(get_object_directory(), 0,
+!quiet && !daemonized))
+   return 1;
 
if (auto_gc && too_many_loose_objects())
warning(_("There are too many unreachable loose objects; "
diff --git a/commit-graph.c b/commit-graph.c
index 0f8274d15d..162b9f2a85 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -755,27 +755,30 @@ static int add_ref_to_list(const char *refname,
return 0;
 }
 
-void write_commit_graph_reachable(const char *obj_dir, int append,
- int report_progress)
+int write_commit_graph_reachable(const char *obj_dir, int append,
+int report_progress)
 {
struct string_list list = STRING_LIST_INIT_DUP;
+   int result;
 
for_each_ref(add_ref_to_list, &list);
-   write_commit_graph(obj_dir, NULL, &list, append, report_progress);
+   result = write_commit_graph(obj_dir, NULL, &list,
+   append, report_progress);
 
string

[PATCH 0/6] Create commit-graph file format v2

2019-01-23 Thread Derrick Stolee via GitGitGadget
The commit-graph file format has some shortcomings that were discussed
on-list:

 1. It doesn't use the 4-byte format ID from the_hash_algo.


 2. There is no way to change the reachability index from generation numbers
to corrected commit date [1].


 3. The unused byte in the format could be used to signal the file is
incremental, but current clients ignore the value even if it is
non-zero.



This series adds a new version (2) to the commit-graph file. The fifth byte
already specified the file format, so existing clients will gracefully
respond to files with a different version number. The only real change now
is that the header takes 12 bytes instead of 8, due to using the 4-byte
format ID for the hash algorithm.

The new bytes reserved for the reachability index version and incremental
file formats are now expected to be equal to the defaults. When we update
these values to be flexible in the future, if a client understands
commit-graph v2 but not those new values, then it will fail gracefully.
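The 8-byte v1 header versus the 12-byte v2 header described above can be sketched as a small parser. This is illustrative C only, not Git's implementation: the struct and function names are invented, and the `0x73686131` ("sha1") value in the usage below assumes the 4-byte format identifier convention from the hash-algorithm transition work.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of the two layouts from the format document: both versions
 * share a 4-byte "CGPH" signature and a 1-byte version number. v1 then
 * has a 1-byte hash version, 1-byte chunk count, and an ignored
 * reserved byte (8 bytes total); v2 has a 1-byte chunk count, 1-byte
 * reachability index version, a reserved byte that must be zero, and a
 * 4-byte hash format identifier (12 bytes total).
 */
struct graph_header {
	unsigned char version;
	unsigned char hash_version;        /* v1 only */
	unsigned char num_chunks;
	unsigned char reach_index_version; /* v2 only */
	uint32_t hash_id;                  /* v2 only */
	size_t header_size;                /* 8 for v1, 12 for v2 */
};

static uint32_t get_be32_sketch(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns 0 on success, -1 for a corrupt or unsupported header. */
static int parse_graph_header(const unsigned char *data, size_t len,
			      struct graph_header *out)
{
	if (len < 8 || data[0] != 'C' || data[1] != 'G' ||
	    data[2] != 'P' || data[3] != 'H')
		return -1;
	out->version = data[4];
	switch (out->version) {
	case 1:
		out->hash_version = data[5];
		out->num_chunks = data[6];
		/* data[7] is reserved; v1 clients ignore it */
		out->header_size = 8;
		return 0;
	case 2:
		if (len < 12)
			return -1;
		out->num_chunks = data[5];
		out->reach_index_version = data[6];
		if (data[7])	/* reserved byte must be zero in v2 */
			return -1;
		out->hash_id = get_be32_sketch(data + 8);
		out->header_size = 12;
		return 0;
	default:
		return -1;	/* unknown file format version */
	}
}
```

The chunk lookup table then begins at `header_size`, which is why a reader must know the version before computing chunk offsets.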

This series is based on ab/commit-graph-write-progress and bc/sha-256.

Thanks, -Stolee

[1] 
https://public-inbox.org/git/6367e30a-1b3a-4fe9-611b-d931f51ef...@gmail.com/

Derrick Stolee (6):
  commit-graph: return with errors during write
  commit-graph: collapse parameters into flags
  commit-graph: create new version flags
  commit-graph: add --version= option
  commit-graph: implement file format version 2
  commit-graph: test verifying a corrupt v2 header

 Documentation/git-commit-graph.txt|   3 +
 .../technical/commit-graph-format.txt |  26 ++-
 builtin/commit-graph.c|  43 +++--
 builtin/commit.c  |   5 +-
 builtin/gc.c  |   7 +-
 commit-graph.c| 158 +-
 commit-graph.h|  16 +-
 t/t5318-commit-graph.sh   |  42 -
 8 files changed, 233 insertions(+), 67 deletions(-)


base-commit: 91b3ce35eeb93be1f4406e25ccdc4ab983a8e5af
Published-As: 
https://github.com/gitgitgadget/git/releases/tag/pr-112%2Fderrickstolee%2Fgraph%2Fv2-head-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-112/derrickstolee/graph/v2-head-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/112
-- 
gitgitgadget


Re: [PATCH 10/14] pack-objects: add trace2 regions

2019-01-22 Thread Derrick Stolee

On 1/22/2019 4:22 PM, Derrick Stolee via GitGitGadget wrote:

From: Derrick Stolee 

When studying the performance of 'git push' we would like to know
how much time is spent at various parts of the command. One area
that could cause performance trouble is 'git pack-objects'.

Add trace2 regions around the three main actions taken in this
command:

1. Enumerate objects.
2. Prepare pack.
3. Write pack-file.

Signed-off-by: Derrick Stolee 


We've had this patch in our private branch for a while; it is how we 
discovered the need for the sparse push algorithm [1]. We will use it to 
measure its effectiveness as new Microsoft users onboard to the new 
algorithm.


Thanks,

-Stolee

[1] https://public-inbox.org/git/pull.89.v5.git.gitgitgad...@gmail.com/
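The enter/leave pairing from the patch can be mimicked with a stub logger standing in for the trace2 API. This is illustrative only: the real trace2_region_enter()/trace2_region_leave() also take a repository pointer and emit structured trace events, while this sketch just records the pairs.

```c
#include <stdio.h>
#include <string.h>

/* Toy stand-in for trace2 region logging; records enter/leave pairs. */
static char region_log[256];

static void region_event(const char *kind, const char *category,
			 const char *label)
{
	size_t used = strlen(region_log);
	snprintf(region_log + used, sizeof(region_log) - used,
		 "%s:%s/%s;", kind, category, label);
}

/* Mirrors the three regions added to cmd_pack_objects() in the patch. */
static void pack_objects_sketch(void)
{
	region_event("enter", "pack-objects", "enumerate-objects");
	/* ... enumerate objects ... */
	region_event("leave", "pack-objects", "enumerate-objects");

	region_event("enter", "pack-objects", "prepare-pack");
	/* ... prepare the pack (deltas, window, depth) ... */
	region_event("leave", "pack-objects", "prepare-pack");

	region_event("enter", "pack-objects", "write-pack-file");
	/* ... write the pack-file ... */
	region_event("leave", "pack-objects", "write-pack-file");
}
```

Keeping each enter/leave pair around one phase is what lets a trace consumer attribute wall-clock time to enumeration, delta preparation, and writing separately.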



[PATCH 10/14] pack-objects: add trace2 regions

2019-01-22 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When studying the performance of 'git push' we would like to know
how much time is spent at various parts of the command. One area
that could cause performance trouble is 'git pack-objects'.

Add trace2 regions around the three main actions taken in this
command:

1. Enumerate objects.
2. Prepare pack.
3. Write pack-file.

Signed-off-by: Derrick Stolee 
---
 builtin/pack-objects.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 889df2c755..6708529e3c 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -33,6 +33,7 @@
 #include "object-store.h"
 #include "dir.h"
 #include "midx.h"
+#include "trace2.h"
 
 #define IN_PACK(obj) oe_in_pack(&to_pack, obj)
 #define SIZE(obj) oe_size(&to_pack, obj)
@@ -3472,6 +3473,7 @@ int cmd_pack_objects(int argc, const char **argv, const 
char *prefix)
}
}
 
+   trace2_region_enter("pack-objects", "enumerate-objects", 
the_repository);
prepare_packing_data(the_repository, &to_pack);
 
if (progress)
@@ -3486,12 +3488,20 @@ int cmd_pack_objects(int argc, const char **argv, const 
char *prefix)
if (include_tag && nr_result)
for_each_ref(add_ref_tag, NULL);
stop_progress(&progress_state);
+   trace2_region_leave("pack-objects", "enumerate-objects", 
the_repository);
 
if (non_empty && !nr_result)
return 0;
-   if (nr_result)
+   if (nr_result) {
+   trace2_region_enter("pack-objects", "prepare-pack", 
the_repository);
prepare_pack(window, depth);
+   trace2_region_leave("pack-objects", "prepare-pack", 
the_repository);
+   }
+
+   trace2_region_enter("pack-objects", "write-pack-file", the_repository);
write_pack_file();
+   trace2_region_leave("pack-objects", "write-pack-file", the_repository);
+
if (progress)
fprintf_ln(stderr,
   _("Total %"PRIu32" (delta %"PRIu32"),"
-- 
gitgitgadget



Re: [PATCH v6 00/10] commit-graph write: progress output improvements

2019-01-22 Thread Derrick Stolee

On 1/19/2019 3:21 PM, Ævar Arnfjörð Bjarmason wrote:

Improvements since v5:

  * Integrate my "commit-graph write: use pack order when finding
commits" patch, and per Junio's suggestion put it at the start so
it's easier to split the two apart.

  * A signed-off-by of mine was missing on that patch. Fixed.

  * Replace SZEDER's two patches with his re-roll at
https://public-inbox.org/git/20190118170549.30403-1-szeder@gmail.com/
for convenience, since mine needed rebasing on top of his.

  * SZEDER rightly pointed out that I was doing some work for nothing
in https://public-inbox.org/git/20190118171605.gm...@szeder.dev/;
fixed.

SZEDER Gábor (2):
   commit-graph: rename "large edges" to "extra edges"
   commit-graph: don't call write_graph_chunk_large_edges() unnecessarily

Ævar Arnfjörð Bjarmason (8):
   commit-graph write: use pack order when finding commits
   commit-graph write: add "Writing out" progress output
   commit-graph write: more descriptive "writing out" output
   commit-graph write: show progress for object search
   commit-graph write: add more descriptive progress output
   commit-graph write: remove empty line for readability
    commit-graph write: add intermediate progress
   commit-graph write: emit a percentage for all progress


I'm a few days late, but I took another look through the patches and 
they look good to me. Thanks!


-Stolee



Re: Contributor Summit Topics and Logistics

2019-01-22 Thread Derrick Stolee

On 1/22/2019 2:50 AM, Jeff King wrote:

For people who want to try to join remotely, I don't think we're going
to have a particularly fancy AV setup. But there should at least be a
big screen (which we typically do not really use for presenting), and I
hope we can provide some connectivity. I'll be visiting the venue the
day before (Jan 30th) in the late afternoon (Brussels time) and I'll try
to do a test run. If anybody wants to volunteer to be the guinea pig on
the other end of the line, I'd welcome it.


I would like to join remotely, so I volunteer to do a test run. I'll 
need to wake up early, so let's set an exact time privately.



Topics I would like to hear about:

- commit-graph status report (I can lead, if I'm able to join)

- multi-pack-index status report (same)

- reftable

- partial clone

- test coverage report, usefulness or improvements


Thanks,

-Stolee



Re: [PATCH] commit-graph write: use pack order when finding commits

2019-01-17 Thread Derrick Stolee

On 1/17/2019 10:09 AM, Derrick Stolee wrote:


The code itself looks good, I just need to double-check the 
performance numbers.


I tested this patch on our VFS-specific version using five "prefetch 
packs".


Before: 22.61 s (avg)
After: 22.18 s (avg)

Improvement: 2%

The error rate among the samples is very low (~0.15 s), so this result is 
significant.


Considering that we are writing 4.4 million commits -- many more than 
the objects in the packs that we are reading -- this is an excellent 
improvement!


Thanks!
-Stolee


Re: [PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index

2019-01-17 Thread Derrick Stolee
I've seen comments about some leftover uses of the word 'verb' instead 
of 'subcommand'. I can definitely clean those up, but I'm hoping that 
more comments could be made on the technical aspects of this series.


Thanks,
-Stolee


Re: [PATCH] commit-graph write: use pack order when finding commits

2019-01-17 Thread Derrick Stolee

On 1/17/2019 8:23 AM, Ævar Arnfjörð Bjarmason wrote:

Slightly optimize the "commit-graph write" step by using
FOR_EACH_OBJECT_PACK_ORDER with for_each_object_in_pack(). See commit
[1] and [2] for the facility and a similar optimization for "cat-file".

On Linux it is around 5% slower to run:

 echo 1 >/proc/sys/vm/drop_caches &&
 cat .git/objects/pack/* >/dev/null &&
 git cat-file --batch-all-objects --batch-check --unordered

Than the same thing with the "cat" omitted. This is as expected, since
we're iterating in pack order and the "cat" is extra work.

Before this change the opposite was true of "commit-graph write". We
were 6% faster if we first ran "cat" to efficiently populate the FS
cache for our sole big pack on linux.git than if we had it populated
via for_each_object_in_pack(). Now we're 3% faster without the "cat"
instead.

The tests were done on an unloaded Linux 3.10 system with 10 runs for
each.

1. 736eb88fdc ("for_each_packed_object: support iterating in
pack-order", 2018-08-10)

2. 0750bb5b51 ("cat-file: support "unordered" output for
--batch-all-objects", 2018-08-10)
Thanks, Aevar! I plan to give this a test on the Windows repository, as 
this is the way we generate the commit-graph in VFS for Git.


I created a PR on microsoft/git [1] and am generating the installers now [2]

[1] https://github.com/Microsoft/git/pull/110

[2] https://dev.azure.com/gvfs/ci/_build/results?buildId=6114

The code itself looks good, I just need to double-check the performance 
numbers.


Thanks,
-Stolee


[PATCH v5 3/5] revision: implement sparse algorithm

2019-01-16 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commits we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.

This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.

The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.

We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.

To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the interesting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINTERESTING trees.

There are a few ways that this is not a universally better option.

First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.

Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.

Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.

For git.git, I used the following input:

v2.19.0
^v2.19.0~10

 Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130

For the Linux repo, I used the following input:

v4.18
^v4.18~10

 Objects to pack:   518
Walked (old alg): 4,836
Walked (new alg):   188

The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.

 Objects to pack:    220
Walked (old alg): 22,804
Walked (new alg):    129

I used the number of walked trees as the main metric above because
it is consistent across multiple runs. When I ran my tests, the
end-to-end time of the pack-objects command with the same options
could vary by 10x depending on whether the file system was warm.
However, by running the same test back-to-back I could get more
consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.

These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.

Signed-off-by: Derrick Stolee 
---
 revision.c | 138 ++---
 t/t5322-pack-objects-sparse.sh |  14 +++-
 2 files changed, 139 insertions(+), 13 deletions(-)

diff --git a/revision.c b/revision.c
index 60421f3a10..5de739384a 100644
--- a/revision.c
+++ b/revision.c
@@ -27,6 +27,7 @@
 #include "commit-reac

[PATCH v5 1/5] revision: add mark_tree_uninteresting_sparse

2019-01-16 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

In preparation for a new algorithm that walks fewer trees when
creating a pack from a set of revisions, create a method that
takes an oidset of tree oids and marks reachable objects as
UNINTERESTING.

The current implementation uses the existing
mark_tree_uninteresting to recursively walk the trees and blobs.
This will walk the same number of trees as the old mechanism. To
ensure that mark_tree_uninteresting walks the tree, we need to
remove the UNINTERESTING flag before calling the method. This
implementation will be replaced entirely in a later commit.

There is one new assumption in this approach: we are also given
the oids of the interesting trees. This implementation does not
use those trees at the moment, but we will use them in a later
rewrite of this method.

Signed-off-by: Derrick Stolee 
---
 revision.c | 25 +
 revision.h |  2 ++
 2 files changed, 27 insertions(+)

diff --git a/revision.c b/revision.c
index 13e0519c02..60421f3a10 100644
--- a/revision.c
+++ b/revision.c
@@ -99,6 +99,31 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree)
mark_tree_contents_uninteresting(r, tree);
 }
 
+void mark_trees_uninteresting_sparse(struct repository *r,
+struct oidset *trees)
+{
+   struct object_id *oid;
+   struct oidset_iter iter;
+
+   oidset_iter_init(trees, &iter);
+   while ((oid = oidset_iter_next(&iter))) {
+   struct tree *tree = lookup_tree(r, oid);
+
+   if (!tree)
+   continue;
+
+   if (tree->object.flags & UNINTERESTING) {
+   /*
+* Remove the flag so the next call
+* is not a no-op. The flag is added
+* in mark_tree_uninteresting().
+*/
+   tree->object.flags ^= UNINTERESTING;
+   mark_tree_uninteresting(r, tree);
+   }
+   }
+}
+
 struct commit_stack {
struct commit **items;
size_t nr, alloc;
diff --git a/revision.h b/revision.h
index 7987bfcd2e..df684701b9 100644
--- a/revision.h
+++ b/revision.h
@@ -67,6 +67,7 @@ struct rev_cmdline_info {
 #define REVISION_WALK_NO_WALK_SORTED 1
 #define REVISION_WALK_NO_WALK_UNSORTED 2
 
+struct oidset;
 struct topo_walk_info;
 
 struct rev_info {
@@ -327,6 +328,7 @@ void put_revision_mark(const struct rev_info *revs,
 
 void mark_parents_uninteresting(struct commit *commit);
 void mark_tree_uninteresting(struct repository *r, struct tree *tree);
+void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *trees);
 
 void show_object_with_name(FILE *, struct object *, const char *);
 
-- 
gitgitgadget



[PATCH v5 4/5] pack-objects: create pack.useSparse setting

2019-01-16 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The '--sparse' flag in 'git pack-objects' changes the algorithm
used to enumerate objects to one that is faster for individual
users pushing new objects that change only a small cone of the
working directory. The sparse algorithm is not recommended for a
server, which likely sends new objects that appear across the
entire working directory.

Create a 'pack.useSparse' setting that enables this new algorithm.
This allows 'git push' to use this algorithm without passing a
'--sparse' flag all the way through four levels of run_command()
calls.

If the '--no-sparse' flag is set, then this config setting is
overridden.
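As a sketch, the opt-in would then be a single config entry (illustrative fragment):

```ini
[pack]
	useSparse = true
```

With this set, 'git push' benefits without any command-line flag, and passing '--no-sparse' still wins over the config value.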

Signed-off-by: Derrick Stolee 
---
 Documentation/config/pack.txt  |  9 +
 builtin/pack-objects.c |  4 
 t/t5322-pack-objects-sparse.sh | 15 +++
 3 files changed, 28 insertions(+)

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index edac75c83f..425c73aa52 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -105,6 +105,15 @@ pack.useBitmaps::
true. You should not generally need to turn this off unless
you are debugging pack bitmaps.
 
+pack.useSparse::
+   When true, git will default to using the '--sparse' option in
+   'git pack-objects' when the '--revs' option is present. This
+   algorithm only walks trees that appear in paths that introduce new
+   objects. This can have significant performance benefits when
+   computing a pack to send a small change. However, it is possible
+   that extra objects are added to the pack-file if the included
+   commits contain certain types of direct renames.
+
 pack.writeBitmaps (deprecated)::
This is a deprecated synonym for `repack.writeBitmaps`.
 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 7d5b0735e3..124b1bafc4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, void *cb)
use_bitmap_index_default = git_config_bool(k, v);
return 0;
}
+   if (!strcmp(k, "pack.usesparse")) {
+   sparse = git_config_bool(k, v);
+   return 0;
+   }
if (!strcmp(k, "pack.threads")) {
delta_search_threads = git_config_int(k, v);
if (delta_search_threads < 0)
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 9f2a6e5d31..3233fafc90 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -118,4 +118,19 @@ test_expect_success 'sparse pack-objects' '
test_cmp expect_sparse_objects.txt sparse_objects.txt
 '
 
+test_expect_success 'pack.useSparse enables algorithm' '
+   git config pack.useSparse true &&
+   git pack-objects --stdout --revs sparse.pack &&
+   git index-pack -o sparse.idx sparse.pack &&
+   git show-index sparse_objects.txt &&
+   test_cmp expect_sparse_objects.txt sparse_objects.txt
+'
+
+test_expect_success 'pack.useSparse overridden' '
+   git pack-objects --stdout --revs --no-sparse sparse.pack &&
+   git index-pack -o sparse.idx sparse.pack &&
+   git show-index sparse_objects.txt &&
+   test_cmp required_objects.txt sparse_objects.txt
+'
+
 test_done
-- 
gitgitgadget



[PATCH v5 5/5] pack-objects: create GIT_TEST_PACK_SPARSE

2019-01-16 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

Create a test variable GIT_TEST_PACK_SPARSE to enable the sparse
object walk algorithm by default during the test suite. Enabling
this variable ensures coverage in many interesting cases, such as
shallow clones, partial clones, and missing objects.

Signed-off-by: Derrick Stolee 
---
 builtin/pack-objects.c | 1 +
 t/README   | 4 
 t/t5322-pack-objects-sparse.sh | 4 ++--
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 124b1bafc4..507d381153 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3331,6 +3331,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
read_replace_refs = 0;
 
+   sparse = git_env_bool("GIT_TEST_PACK_SPARSE", 0);
reset_pack_idx_option(&pack_idx_opts);
git_config(git_pack_config, NULL);
 
diff --git a/t/README b/t/README
index 28711cc508..8b6dfe1864 100644
--- a/t/README
+++ b/t/README
@@ -342,6 +342,10 @@ GIT_TEST_INDEX_VERSION= exercises the index read/write code path
 for the index version specified.  Can be set to any valid version
 (currently 2, 3, or 4).
 
+GIT_TEST_PACK_SPARSE= if enabled will default the pack-objects
+builtin to use the sparse object walk. This can still be overridden by
+the --no-sparse command-line argument.
+
 GIT_TEST_PRELOAD_INDEX= exercises the preload-index code path
 by overriding the minimum number of cache entries required per thread.
 
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 3233fafc90..7124b5581a 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -36,7 +36,7 @@ test_expect_success 'setup repo' '
 '
 
 test_expect_success 'non-sparse pack-objects' '
-   git pack-objects --stdout --revs nonsparse.pack &&
+   git pack-objects --stdout --revs --no-sparse nonsparse.pack &&
git index-pack -o nonsparse.idx nonsparse.pack &&
git show-index nonsparse_objects.txt &&
test_cmp expect_objects.txt nonsparse_objects.txt
@@ -64,7 +64,7 @@ test_expect_success 'duplicate a folder from f3 and commit to topic1' '
 '
 
 test_expect_success 'non-sparse pack-objects' '
-   git pack-objects --stdout --revs nonsparse.pack &&
+   git pack-objects --stdout --revs --no-sparse nonsparse.pack &&
git index-pack -o nonsparse.idx nonsparse.pack &&
git show-index nonsparse_objects.txt &&
comm -1 -2 required_objects.txt nonsparse_objects.txt >nonsparse_required_objects.txt &&
-- 
gitgitgadget


[PATCH v5 2/5] list-objects: consume sparse tree walk

2019-01-16 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When creating a pack-file using 'git pack-objects --revs' we provide
a list of interesting and uninteresting commits. For example, a push
operation would make the local topic branch be interesting and the
known remote refs as uninteresting. We want to discover the set of
new objects to send to the server as a thin pack.

We walk these commits until we discover a frontier of commits such
that every commit walk starting at interesting commits ends in a root
commit or uninteresting commit. We then need to discover which
non-commit objects are reachable from uninteresting commits. This
commit walk is not changing during this series.

The mark_edges_uninteresting() method in list-objects.c iterates on
the commit list and does the following:

* If the commit is UNINTERESTING, then mark its root tree and every
  object it can reach as UNINTERESTING.

* If the commit is interesting, then mark the root tree of every
  UNINTERESTING parent (and all objects that tree can reach) as
  UNINTERESTING.

At the very end, we repeat the process on every commit directly
given to the revision walk from stdin. This helps ensure we properly
cover shallow commits that otherwise were not included in the
frontier.

The logic to recursively follow trees is in the
mark_tree_uninteresting() method in revision.c. The algorithm avoids
duplicate work by not recursing into trees that are already marked
UNINTERESTING.

Add a new 'sparse' option to the mark_edges_uninteresting() method
that performs this logic in a slightly different way. As we iterate
over the commits, we add all of the root trees to an oidset. Then,
call mark_trees_uninteresting_sparse() on that oidset. Note that we
include interesting trees in this process. The current implementation
of mark_trees_uninteresting_sparse() will walk the same trees as
the old logic, but this will be replaced in a later change.

Add a '--sparse' flag in 'git pack-objects' to call this new logic.
Add a new test script t/t5322-pack-objects-sparse.sh that tests this
option. The tests currently demonstrate that the resulting object
list is the same as the old algorithm. This includes a case where
both algorithms pack an object that is not needed by a remote due to
limits on the explored set of trees. When the sparse algorithm is
changed in a later commit, we will add a test that demonstrates a
change of behavior in some cases.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-pack-objects.txt |  11 ++-
 bisect.c   |   2 +-
 builtin/pack-objects.c |   5 +-
 builtin/rev-list.c |   2 +-
 http-push.c|   2 +-
 list-objects.c |  70 +++---
 list-objects.h |   4 +-
 t/t5322-pack-objects-sparse.sh | 113 +
 8 files changed, 192 insertions(+), 17 deletions(-)
 create mode 100755 t/t5322-pack-objects-sparse.sh

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index 40c825c381..e45f3e680d 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -14,7 +14,7 @@ SYNOPSIS
[--local] [--incremental] [--window=] [--depth=]
[--revs [--unpacked | --all]] [--keep-pack=]
[--stdout [--filter=] | base-name]
-   [--shallow] [--keep-true-parents] < object-list
+   [--shallow] [--keep-true-parents] [--sparse] < object-list
 
 
 DESCRIPTION
@@ -196,6 +196,15 @@ depth is 4095.
Add --no-reuse-object if you want to force a uniform compression
level on all data no matter the source.
 
+--sparse::
+   Use the "sparse" algorithm to determine which objects to include in
+   the pack, when combined with the "--revs" option. This algorithm
+   only walks trees that appear in paths that introduce new objects.
+   This can have significant performance benefits when computing
+   a pack to send a small change. However, it is possible that extra
+   objects are added to the pack-file if the included commits contain
+   certain types of direct renames.
+
 --thin::
Create a "thin" pack by omitting the common objects between a
sender and a receiver in order to reduce network transfer. This
diff --git a/bisect.c b/bisect.c
index 487675c672..842f8b4b8f 100644
--- a/bisect.c
+++ b/bisect.c
@@ -656,7 +656,7 @@ static void bisect_common(struct rev_info *revs)
if (prepare_revision_walk(revs))
die("revision walk setup failed");
if (revs->tree_objects)
-   mark_edges_uninteresting(revs, NULL);
+   mark_edges_uninteresting(revs, NULL, 0);
 }
 
 static void exit_if_skipped_commits(struct commit_list *tried,
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 411aefd687..7d5b0735e3 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@

[PATCH v5 0/5] Add a new "sparse" tree walk algorithm

2019-01-16 Thread Derrick Stolee via GitGitGadget
One of the biggest remaining pain points for users of very large
repositories is the time it takes to run 'git push'. We inspected some slow
pushes by our developers and found that the "Enumerating Objects" phase of a
push was very slow. This is unsurprising, because this is why reachability
bitmaps exist. However, reachability bitmaps are not available to us because
of the single pack-file requirement. The bitmap approach is intended for
servers anyway, and clients have a much different behavior pattern.

Specifically, clients are normally pushing a very small number of objects
compared to the entire working directory. A typical user changes only a
small cone of the working directory, so let's use that to our benefit.

Create a new "sparse" mode for 'git pack-objects' that uses the paths that
introduce new objects to direct our search into the reachable trees. By
collecting trees at each path, we can then recurse into a path only when
there are uninteresting and interesting trees at that path. This gains a
significant performance boost for small topics while presenting a
possibility of packing extra objects.

The main algorithm change is in patch 4, but is set up a little bit in
patches 1 and 2.

As demonstrated in the included test script, we see that the existing
algorithm can send extra objects due to the way we specify the "frontier".
But we can send even more objects if a user copies objects from one folder
to another. I say "copy" because a rename would (usually) change the
original folder and trigger a walk into that path, discovering the objects.

In order to benefit from this approach, the user can opt-in using the
pack.useSparse config setting. This setting can be overridden using the
'--no-sparse' option.

Update in V2: 

 * Added GIT_TEST_PACK_SPARSE test option.
 * Fixed test breakages when GIT_TEST_PACK_SPARSE is enabled by adding null
   checks.

Update in V3:

 * Change documentation around 'pack.useSparse' config setting to better
   advertise to users.
 * Mostly a ping now that v2.20.0 is out.

Updates in V4:

 * Switched to using hashmap instead of string_list for the path/oidset
   dictionary. (This is due to some fear that the string_list performance
   would degrade for a very wide tree, but I am unable to measure a
   performance difference.)
 * Some cleanup of code snippets across commits.
 * Some grammar cleanup in the commit messages.

Updates in V5:

 * Renamed the generic "set" to "trees"
 * Used a less-restrictive test condition when testing the old algorithm,
   except for the case where we want to verify that the new algorithm is
   being run.
 * Removed a global variable that was not used. Thanks, Ramsay and Junio!

Derrick Stolee (5):
  revision: add mark_tree_uninteresting_sparse
  list-objects: consume sparse tree walk
  revision: implement sparse algorithm
  pack-objects: create pack.useSparse setting
  pack-objects: create GIT_TEST_PACK_SPARSE

 Documentation/config/pack.txt  |   9 ++
 Documentation/git-pack-objects.txt |  11 ++-
 bisect.c   |   2 +-
 builtin/pack-objects.c |  10 +-
 builtin/rev-list.c |   2 +-
 http-push.c|   2 +-
 list-objects.c |  70 +++---
 list-objects.h |   4 +-
 revision.c | 143 +
 revision.h |   2 +
 t/README   |   4 +
 t/t5322-pack-objects-sparse.sh | 136 +++
 12 files changed, 378 insertions(+), 17 deletions(-)
 create mode 100755 t/t5322-pack-objects-sparse.sh


base-commit: a1598010f775d82b5adf12c29d0f5bc9b41434c6
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-89%2Fderrickstolee%2Fpush%2Fsparse-v5
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-89/derrickstolee/push/sparse-v5
Pull-Request: https://github.com/gitgitgadget/git/pull/89

Range-diff vs v4:

 1:  817e30a287 ! 1:  02ef702884 revision: add mark_tree_uninteresting_sparse
 @@ -9,7 +9,10 @@
  
  The current implementation uses the existing
  mark_tree_uninteresting to recursively walk the trees and blobs.
 -This will walk the same number of trees as the old mechanism.
 +This will walk the same number of trees as the old mechanism. To
 +ensure that mark_tree_uninteresting walks the tree, we need to
 +remove the UNINTERESTING flag before calling the method. This
 +implementation will be replaced entirely in a later commit.
  
  There is one new assumption in this approach: we are also given
  the oids of the interesting trees. This implementation does not
 @@ -26,12 +29,12 @@
   }
   
  +void mark_trees_uninteresting_sparse(struct repository *r,
 -+  struct oidse

Git Test Coverage Report (Wed Jan 16)

2019-01-16 Thread Derrick Stolee
rebase--interactive.c
Alban Gruin  8414c890a: sequencer: refactor sequencer_add_exec_commands() to work on a todo_list
Alban Gruin  98b29e060: sequencer: refactor skip_unnecessary_picks() to work on a todo_list
Alban Gruin  c1c074e0c: sequencer: change complete_action() to use the refactored functions
Alban Gruin  c27b32f0e: sequencer: refactor check_todo_list() to work on a todo_list
Alban Gruin  cf18b3f6c: sequencer: introduce todo_list_write_to_file()
Alban Gruin  e5b1c9d92: rebase-interactive: rewrite edit_todo_list() to handle the initial edit
Alban Gruin  febebd99b: sequencer: refactor rearrange_squash() to work on a todo_list
Anders Waldenborg  18f8e8109: strbuf: separate callback for strbuf_expand:ing literals
Anders Waldenborg  4681fe38e: pretty: allow showing specific trailers
Anders Waldenborg  b755bf6f8: pretty: allow %(trailers) options with explicit value

Derrick Stolee  8446350be: revision: add mark_tree_uninteresting_sparse
Derrick Stolee  9192bb22f: list-objects: consume sparse tree walk
Derrick Stolee  9949aaeef: revision: implement sparse algorithm
Jeff King  34a9469d6: remote-curl: refactor smart-http discovery
Jiang Xin  a338d1039: pack-redundant: consistent sort method
Jiang Xin  cb7e0336f: pack-redundant: rename pack_list.all_objects
Joel Teichroeb  cdca49bc4: stash: convert drop and clear to builtin
Joel Teichroeb  e1d01876a: stash: convert pop to builtin
Joel Teichroeb  f6bbd7812: stash: convert apply to builtin
Johannes Schindelin  26799a208: stash: optionally use the scripted version again
Johannes Schindelin  bec65d5b7: tests: add a special setup where stash.useBuiltin is off
Josh Steadmon  6da1f1a92: protocol: advertise multiple supported versions
Liam Beguin  0cce4a275: rebase -i -x: add exec commands via the rebase--helper
Linus Torvalds  080448fbe: Add 'human' date format
Martin Koegler  5efde212f: zlib.c: use size_t for size
Nguyễn Thái Ngọc Duy  0d6caa2d0: merge-recursive.c: remove implicit dependency on the_index
Nguyễn Thái Ngọc Duy  6f11fd5ed: config: add --move-to
Nguyễn Thái Ngọc Duy  8f7c7f555: config.c: add repo_config_set_worktree_gently()
Nguyễn Thái Ngọc Duy  a12c1ff3a: config: factor out set_config_source_file()
Nguyễn Thái Ngọc Duy  e1ff0a32e: read-cache.c: kill read_index()
Paul-Sebastian Ungureanu  1f5a011d9: stash: convert create to builtin
Paul-Sebastian Ungureanu  51809c70c: stash: convert `stash--helper.c` into `stash.c`
Paul-Sebastian Ungureanu  847eb0b0a: stash: convert store to builtin
Paul-Sebastian Ungureanu  9a95010a1: stash: make push -q quiet
Paul-Sebastian Ungureanu  b4493f269: stash: convert show to builtin
Paul-Sebastian Ungureanu  bfc3fe33f: strbuf.c: add `strbuf_insertf()` and `strbuf_vinsertf()`
Paul-Sebastian Ungureanu  fa38428f7: stash: convert push to builtin
René Scharfe  878056005: sequencer: factor out strbuf_read_file_or_whine()




Uncovered code in 'jch' not in 'next'


builtin/archive.c
01f9ec64c8 builtin/archive.c  63) if (starts_with(reader.line, "NACK "))
01f9ec64c8 builtin/archive.c  64) die(_("git archive: NACK %s"), reader.line + 5);

builtin/bisect--helper.c
5e82c3dd22 builtin/bisect--helper.c 162) if (get_oid_commit(commit, &oid))
5e82c3dd22 builtin/bisect--helper.c 163) return error(_("'%s' is not a valid commit"), commit);
5e82c3dd22 builtin/bisect--helper.c 164) strbuf_addstr(&branch, commit);
5e82c3dd22 builtin/bisect--helper.c 172) strbuf_release(&branch);
5e82c3dd22 builtin/bisect--helper.c 173) argv_array_clear(&argv);
5e82c3dd22 builtin/bisect--helper.c 174) return error(_("could not check out original"
0f30233a11 builtin/bisect--helper.c 215) retval = error(_("Bad bisect_write argument: %s"), state);
0f30233a11 builtin/bisect--helper.c 216) goto finish;
0f30233a11 builtin/bisect--helper.c 220) retval = error(_("couldn't get the oid of the rev '%s'"), rev);
0f30233a11 builtin/bisect--helper.c 221) goto finish;
0f30233a11 builtin/bisect--helper.c 226) retval = -1;
0f30233a11 builtin/bisect--helper.c 227) goto finish;
0f30233a11 builtin/bisect--helper.c 232) retval = error_errno(_("couldn't open the file '%s'"), git_path_bisect_log());
0f30233a11 builtin/bisect--helper.c 233) goto finish;
129a6cf344 builtin/bisect--helper.c 329) yesno = git_prompt(_("Are you sure [Y/n]? "), PROMPT_ECHO);
129a6cf344 builtin/bisect--helper.c 330) if (starts_with(yesno, "N") || starts_with(yesno, "n"))
129a6cf344 builtin/bisect--helper.c 331) retval = -1;
129a6cf344 builtin/bisect--helper.c 332) goto finish;
129a6cf344 builtin/bisect--helper.c 338) retval = error(_(need_bisect_start_warning),
450ebb7359 builtin/bisect--helper.c 3

Re: [PATCH v4 3/6] pack-objects: add --sparse option

2019-01-15 Thread Derrick Stolee

On 1/11/2019 5:30 PM, Junio C Hamano wrote:

"Derrick Stolee via GitGitGadget"  writes:


From: Derrick Stolee 

Add a '--sparse' option flag to the pack-objects builtin. This
allows the user to specify that they want to use the new logic
for walking trees. This logic currently does not differ from the
existing output, but will in a later change.

Create a new test script, t5322-pack-objects-sparse.sh, to ensure
the object list that is selected matches what we expect. When we
update the logic to walk in a sparse fashion, the final test will
be updated to show the extra objects that are added.

Somehow checking the "these are exactly what we expect" feels
brittle.  In a history with three relevant commits A---B---C,
packing B..C could omit trees and blobs in C that appear in A but
not in B, but traditionally, because we stop traversal at B and do
not even look at A, we do not notice that such objects that need to
complete C's tree are already available in a repository that has B.
The approach of test in this patch feels similar to saying "we must
send these duplicates because we must stay dumb", even though with a
perfect knowledge about the history, e.g. bitmap, we would be able
to omit these objects in A that appear in C but not in B.
My intention with this test was to show that the existing algorithm also 
sends "extra" objects in certain situations, which is later contrasted 
with a test where the new algorithm sends objects the old algorithm did not.


Instead of "we must stay dumb" I wanted to say "we are not dumber (in 
this case)".


I understand your brittle feeling. If the logic in either algorithm changed 
to be smarter, then we would need to change the test. In some sense, that does 
mean we have extra coverage of "this is how it works now" so we would 
understand the behavior change of that hypothetical future change.



I think we want to test both "are we sending enough, even
though we might be wasting some bandwidth by not noticing the other
side already has some?" and "are we still avoiding becoming
overly stupid, even though we may be cheating to save traversal cost
and risking to redundantly send some objects?"

IOW, a set of tests to make sure that truly new objects are all sent
by seeing that what is sent is a strict superset of what we expect.  And
another set of tests to make sure that objects that are so obviously
(this criterion may be highly subjective) present in the
receiving repository (e.g. the tree object of commit B and what it
contains when sending B..C) are never sent, even when using an
algorithm that is tuned for traversal cost over bandwidth
consumption.


To properly test "these objects are included," do we have an established 
pattern for saying "this file is a subset of that file"? Or, should I 
use something like `comm -12 expected actual >common && test_cmp expected 
common`?

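For what it's worth, a "subset" check can be assembled from `comm` alone; here is a minimal sketch outside the test framework (the file contents are purely illustrative):

```shell
# comm -23 prints lines that appear only in the first file, so an empty
# result means every line of "expected" also appears in "actual".
# Both inputs must be sorted for comm to work correctly.
printf '%s\n' b a | sort >expected
printf '%s\n' c b a | sort >actual
comm -23 expected actual >missing
if test ! -s missing
then
	echo "expected is a subset of actual"
fi
```

Wrapped in a `test_file_subset` helper, this would avoid hard-coding the exact object list while still asserting that everything required is sent.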


The code change in this step looks all trivially good, and it may
want to be squashed into the previous step to become a single patch.
Otherwise, [2/6] would be adding a dead code that nobody exercises
until this step.


Will squash.

Thanks,
-Stolee


Re: [PATCH] revision.c: fix sparse warnings (sparse algorithm)

2019-01-15 Thread Derrick Stolee

On 1/13/2019 3:55 PM, Ramsay Jones wrote:

Signed-off-by: Ramsay Jones 
---

Hi Derrick,

If you need to re-roll your 'ds/push-sparse-tree-walk' branch, could
you please squash this into the relevant patch [commit 9949aaeef4
("revision: implement sparse algorithm", 2018-12-14)].

This commit caused both 'sparse' and my 'static-check.pl' script to
complain about the visibility of the 'map_flags' variable (it is a
file local variable), so one solution would be to mark it 'static'.
However, it is simply not being used, so ...

Thanks!

ATB,
Ramsay Jones


Thanks, Ramsay! I'm rerolling the series today, so I will make this change.

-Stolee



Re: [PATCH v4 1/6] revision: add mark_tree_uninteresting_sparse

2019-01-11 Thread Derrick Stolee

On 1/11/2019 3:25 PM, Junio C Hamano wrote:

Junio C Hamano  writes:

So, I assumed that the implementation was wrong, but it is the other
way around.  You do mean to pick only already uninteresting trees
out of "set" and mark its reachables.

One thing that would make me worried is what help the callers of
this function will get (or they may have to devise the way
themselves) to avoid having to traverse the same tree a number of
times.  A tree can be made uninteresting after a traversal of
another tree that contains it, but the logic in this function


+   if (tree->object.flags & UNINTERESTING) {
+   /*
+* Remove the flag so the next call
+* is not a no-op. The flag is added
+* in mark_tree_uninteresting().
+*/
+   tree->object.flags ^= UNINTERESTING;
+   mark_tree_uninteresting(r, tree);
+   }

ignores the fact that it is already UNINTERESTING (in fact, in a
sense it is even worse---it cannot be used to make a not-yet
UNINTERESTING one into UNINTERESTING), drops the UNINTERESTING bit
and forces the traversal of that tree.  The only thing I see that
would work as a saving grace is that mark_tree_uninteresting()
itself would honor the existing UNINTERESTING bit and refuse to recurse
into its subtrees, but that means blobs at the top level of such a
tree would be marked UNINTERESTING while those in its subtrees can be
left not-UNINTERESTING, which sounds like inviting a mess.

It does *not* immediately mean this function is misdesigned.  It
just means that the caller needs to carefully follow whatever
calling convention this series will establish in the later patches
(which we haven't seen yet at this point).

I'm sorry that this implementation is particularly confusing. It is 
created only as a placeholder so we can wire up the --sparse option to 
use this method without being "wrong", but it is then removed entirely in 
PATCH 4/6:


...

void mark_trees_uninteresting_sparse(struct repository *r,
 struct oidset *set)
 {
+   unsigned has_interesting = 0, has_uninteresting = 0;
+   struct hashmap map;
+   struct hashmap_iter map_iter;
+   struct path_and_oids_entry *entry;
struct object_id *oid;
struct oidset_iter iter;
 
 	oidset_iter_init(set, &iter);

-   while ((oid = oidset_iter_next(&iter))) {
+   while ((!has_interesting || !has_uninteresting) &&
+  (oid = oidset_iter_next(&iter))) {
struct tree *tree = lookup_tree(r, oid);
 
 		if (!tree)

continue;
 
-		if (tree->object.flags & UNINTERESTING) {

-   /*
-* Remove the flag so the next call
-* is not a no-op. The flag is added
-* in mark_tree_uninteresting().
-*/
-   tree->object.flags ^= UNINTERESTING;
-   mark_tree_uninteresting(r, tree);
-   }
+   if (tree->object.flags & UNINTERESTING)
+   has_uninteresting = 1;
+   else
+   has_interesting = 1;
+   }
...

You are definitely correct that "set" is not a descriptive variable name. It could instead 
be "tree_oids" to be slightly more informative.

Thanks,
-Stolee



[PATCH v3 1/9] repack: refactor pack deletion for future use

2019-01-09 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The repack builtin deletes redundant pack-files and their
associated .idx, .promisor, .bitmap, and .keep files. We will want
to re-use this logic in the future for other types of repack, so
pull the logic into 'unlink_pack_path()' in packfile.c.

The 'ignore_keep' parameter is enabled for the use in repack, but
will be important for a future caller.

Signed-off-by: Derrick Stolee 
---
 builtin/repack.c | 14 ++
 packfile.c   | 28 
 packfile.h   |  7 +++
 3 files changed, 37 insertions(+), 12 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 45583683ee..3d445b34b4 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -129,19 +129,9 @@ static void get_non_kept_pack_filenames(struct string_list 
*fname_list,
 
 static void remove_redundant_pack(const char *dir_name, const char *base_name)
 {
-   const char *exts[] = {".pack", ".idx", ".keep", ".bitmap", ".promisor"};
-   int i;
struct strbuf buf = STRBUF_INIT;
-   size_t plen;
-
-   strbuf_addf(&buf, "%s/%s", dir_name, base_name);
-   plen = buf.len;
-
-   for (i = 0; i < ARRAY_SIZE(exts); i++) {
-   strbuf_setlen(&buf, plen);
-   strbuf_addstr(&buf, exts[i]);
-   unlink(buf.buf);
-   }
+   strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
+   unlink_pack_path(buf.buf, 1);
strbuf_release(&buf);
 }
 
diff --git a/packfile.c b/packfile.c
index d1e6683ffe..bacecb4d0d 100644
--- a/packfile.c
+++ b/packfile.c
@@ -352,6 +352,34 @@ void close_all_packs(struct raw_object_store *o)
}
 }
 
+void unlink_pack_path(const char *pack_name, int force_delete)
+{
+   static const char *exts[] = {".pack", ".idx", ".keep", ".bitmap", 
".promisor"};
+   int i;
+   struct strbuf buf = STRBUF_INIT;
+   size_t plen;
+
+   strbuf_addstr(&buf, pack_name);
+   strip_suffix_mem(buf.buf, &buf.len, ".pack");
+   plen = buf.len;
+
+   if (!force_delete) {
+   strbuf_addstr(&buf, ".keep");
+   if (!access(buf.buf, F_OK)) {
+   strbuf_release(&buf);
+   return;
+   }
+   }
+
+   for (i = 0; i < ARRAY_SIZE(exts); i++) {
+   strbuf_setlen(&buf, plen);
+   strbuf_addstr(&buf, exts[i]);
+   unlink(buf.buf);
+   }
+
+   strbuf_release(&buf);
+}
+
 /*
  * The LRU pack is the one with the oldest MRU window, preferring packs
  * with no used windows, or the oldest mtime if it has no windows allocated.
diff --git a/packfile.h b/packfile.h
index 6c4037605d..5b7bcdb1dd 100644
--- a/packfile.h
+++ b/packfile.h
@@ -86,6 +86,13 @@ extern void unuse_pack(struct pack_window **);
 extern void clear_delta_base_cache(void);
 extern struct packed_git *add_packed_git(const char *path, size_t path_len, 
int local);
 
+/*
+ * Unlink the .pack and associated extension files.
+ * Does not unlink if 'force_delete' is false and the pack-file is
+ * marked as ".keep".
+ */
+extern void unlink_pack_path(const char *pack_name, int force_delete);
+
 /*
  * Make sure that a pointer access into an mmap'd index file is within bounds,
  * and can provide at least 8 bytes of data.
-- 
gitgitgadget

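For illustration, the deletion behavior of unlink_pack_path() in the patch above can be sketched as standalone shell (the pack name and the `force_delete` flag are made up for the example; this is not git code):

```shell
# Mimic unlink_pack_path(): derive the base name from the .pack path,
# then delete each associated extension file. A non-forced deletion is
# blocked entirely when a .keep file is present.
base=pack-1234
touch "$base.pack" "$base.idx" "$base.bitmap" "$base.promisor" "$base.keep"

force_delete=1
if test "$force_delete" = 0 && test -e "$base.keep"
then
	echo "kept $base"
else
	for ext in .pack .idx .keep .bitmap .promisor
	do
		rm -f "$base$ext"
	done
	echo "removed $base"
fi
```

With `force_delete=0` the `.keep` file created above would short-circuit the loop, which matches the repack caller passing 1 (it has already filtered kept packs) while future callers can pass 0.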


[PATCH v3 7/9] multi-pack-index: prepare 'repack' subcommand

2019-01-09 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

In an environment where the multi-pack-index is useful, it is due
to many pack-files and an inability to repack the object store
into a single pack-file. However, it is likely that many of these
pack-files are rather small, and could be repacked into a slightly
larger pack-file without too much effort. It may also be important
to ensure the object store is highly available and the repack
operation does not interrupt concurrent git commands.

Introduce a 'repack' subcommand to 'git multi-pack-index' that
takes a '--batch-size' option. The verb will inspect the
multi-pack-index for referenced pack-files whose size is smaller
than the batch size, collecting pack-files until their sizes sum
to larger than the batch size. Then, a new pack-file
will be created containing the objects from those pack-files that
are referenced by the multi-pack-index. The resulting pack is
likely to actually be smaller than the batch size due to
compression and the fact that there may be objects in the pack-
files that have duplicate copies in other pack-files.

The current change introduces the command-line arguments, and we
add a test that ensures we parse these options properly. Since
we specify a small batch size, we will guarantee that future
implementations do not change the list of pack-files.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-multi-pack-index.txt | 11 +++
 builtin/multi-pack-index.c | 10 +-
 midx.c |  5 +
 midx.h |  1 +
 t/t5319-multi-pack-index.sh| 11 +++
 5 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-multi-pack-index.txt 
b/Documentation/git-multi-pack-index.txt
index 6186c4c936..cc63531cc0 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -36,6 +36,17 @@ expire::
have no objects referenced by the MIDX. Rewrite the MIDX file
afterward to remove all references to these pack-files.
 
+repack::
+   Collect a batch of pack-files whose sizes are all at most the
+   size given by --batch-size, but whose sizes sum to larger
+   than --batch-size. The batch is selected by greedily adding
+   small pack-files starting with the oldest pack-files that fit
+   the size. Create a new pack-file containing the objects the
+   multi-pack-index indexes into those pack-files, and rewrite
+   the multi-pack-index to contain that pack-file. A later run
+   of 'git multi-pack-index expire' will delete the pack-files
+   that were part of this batch.
+
 
 EXAMPLES
 
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 145de3a46c..d87a2235e3 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -5,12 +5,13 @@
 #include "midx.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
-   N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire)"),
+   N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire|repack --batch-size=<size>)"),
NULL
 };
 
 static struct opts_multi_pack_index {
const char *object_dir;
+   unsigned long batch_size;
 } opts;
 
 int cmd_multi_pack_index(int argc, const char **argv,
@@ -19,6 +20,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
static struct option builtin_multi_pack_index_options[] = {
OPT_FILENAME(0, "object-dir", &opts.object_dir,
  N_("object directory containing set of packfile and 
pack-index pairs")),
+   OPT_MAGNITUDE(0, "batch-size", &opts.batch_size,
+ N_("during repack, collect pack-files of smaller size into a 
batch that is larger than this size")),
OPT_END(),
};
 
@@ -40,6 +43,11 @@ int cmd_multi_pack_index(int argc, const char **argv,
return 1;
}
 
+   if (!strcmp(argv[0], "repack"))
+   return midx_repack(opts.object_dir, (size_t)opts.batch_size);
+   if (opts.batch_size)
+   die(_("--batch-size option is only for 'repack' verb"));
+
if (!strcmp(argv[0], "write"))
return write_midx_file(opts.object_dir);
if (!strcmp(argv[0], "verify"))
diff --git a/midx.c b/midx.c
index 6ccbec3f19..30ff4430ab 100644
--- a/midx.c
+++ b/midx.c
@@ -1106,3 +1106,8 @@ int expire_midx_packs(const char *object_dir)
string_list_clear(&packs_to_drop, 0);
return result;
 }
+
+int midx_repack(const char *object_dir, size_t batch_size)
+{
+   return 0;
+}
diff --git a/midx.h b/midx.h
index e3a2b740b5..394a21ee96 100644
--- a/midx.h
+++ b/midx.h
@@ -50,6 +50,7 @@ int write_midx_file(const char *object_dir);
 void clear_midx_file(struct reposit

[PATCH v3 5/9] midx: refactor permutation logic and pack sorting

2019-01-09 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

In anticipation of the expire subcommand, refactor the way we sort
the packfiles by name. This will greatly simplify our approach to
dropping expired packs from the list.

First, create 'struct pack_info' to replace 'struct pack_pair'.
This struct contains the necessary information about a pack,
including its name, a pointer to its packfile struct (if not
already in the multi-pack-index), and the original pack-int-id.

Second, track the pack information using an array of pack_info
structs in the pack_list struct. This simplifies the logic around
the multiple arrays we were tracking in that struct.

Finally, update get_sorted_entries() to not permute the pack-int-id
and instead supply the permutation to write_midx_object_offsets().
This requires sorting the packs after get_sorted_entries().

Signed-off-by: Derrick Stolee 
---
 midx.c | 150 -
 1 file changed, 63 insertions(+), 87 deletions(-)

diff --git a/midx.c b/midx.c
index f087bbbe82..7ae5275c25 100644
--- a/midx.c
+++ b/midx.c
@@ -377,12 +377,23 @@ static size_t write_midx_header(struct hashfile *f,
return MIDX_HEADER_SIZE;
 }
 
+struct pack_info {
+   uint32_t orig_pack_int_id;
+   char *pack_name;
+   struct packed_git *p;
+};
+
+static int pack_info_compare(const void *_a, const void *_b)
+{
+   struct pack_info *a = (struct pack_info *)_a;
+   struct pack_info *b = (struct pack_info *)_b;
+   return strcmp(a->pack_name, b->pack_name);
+}
+
 struct pack_list {
-   struct packed_git **list;
-   char **names;
+   struct pack_info *info;
uint32_t nr;
-   uint32_t alloc_list;
-   uint32_t alloc_names;
+   uint32_t alloc;
struct multi_pack_index *m;
 };
 
@@ -395,66 +406,32 @@ static void add_pack_to_midx(const char *full_path, 
size_t full_path_len,
if (packs->m && midx_contains_pack(packs->m, file_name))
return;
 
-   ALLOC_GROW(packs->list, packs->nr + 1, packs->alloc_list);
-   ALLOC_GROW(packs->names, packs->nr + 1, packs->alloc_names);
+   ALLOC_GROW(packs->info, packs->nr + 1, packs->alloc);
 
-   packs->list[packs->nr] = add_packed_git(full_path,
-   full_path_len,
-   0);
+   packs->info[packs->nr].p = add_packed_git(full_path,
+ full_path_len,
+ 0);
 
-   if (!packs->list[packs->nr]) {
+   if (!packs->info[packs->nr].p) {
warning(_("failed to add packfile '%s'"),
full_path);
return;
}
 
-   if (open_pack_index(packs->list[packs->nr])) {
+   if (open_pack_index(packs->info[packs->nr].p)) {
warning(_("failed to open pack-index '%s'"),
full_path);
-   close_pack(packs->list[packs->nr]);
-   FREE_AND_NULL(packs->list[packs->nr]);
+   close_pack(packs->info[packs->nr].p);
+   FREE_AND_NULL(packs->info[packs->nr].p);
return;
}
 
-   packs->names[packs->nr] = xstrdup(file_name);
+   packs->info[packs->nr].pack_name = xstrdup(file_name);
+   packs->info[packs->nr].orig_pack_int_id = packs->nr;
packs->nr++;
}
 }
 
-struct pack_pair {
-   uint32_t pack_int_id;
-   char *pack_name;
-};
-
-static int pack_pair_compare(const void *_a, const void *_b)
-{
-   struct pack_pair *a = (struct pack_pair *)_a;
-   struct pack_pair *b = (struct pack_pair *)_b;
-   return strcmp(a->pack_name, b->pack_name);
-}
-
-static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t 
*perm)
-{
-   uint32_t i;
-   struct pack_pair *pairs;
-
-   ALLOC_ARRAY(pairs, nr_packs);
-
-   for (i = 0; i < nr_packs; i++) {
-   pairs[i].pack_int_id = i;
-   pairs[i].pack_name = pack_names[i];
-   }
-
-   QSORT(pairs, nr_packs, pack_pair_compare);
-
-   for (i = 0; i < nr_packs; i++) {
-   pack_names[i] = pairs[i].pack_name;
-   perm[pairs[i].pack_int_id] = i;
-   }
-
-   free(pairs);
-}
-
 struct pack_midx_entry {
struct object_id oid;
uint32_t pack_int_id;
@@ -480,7 +457,6 @@ static int midx_oid_compare(const void *_a, const void *_b)
 }
 
 static int nth_midxed_pack_midx_entry(struct multi_pack_index *m,
- 

[PATCH v3 8/9] midx: implement midx_repack()

2019-01-09 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

To repack using a multi-pack-index, first sort all pack-files by
their modified time. Second, walk those pack-files from oldest
to newest, adding the packs to a list if they are smaller than the
given pack-size. Finally, collect the objects from the multi-pack-
index that are in those packs and send them to 'git pack-objects'.

While first designing a 'git multi-pack-index repack' operation, I
started by collecting the batches based on the size of the objects
instead of the size of the pack-files. This allows repacking a
large pack-file that has very few referenced objects. However, this
came at a significant cost of parsing pack-files instead of simply
reading the multi-pack-index and getting the file information for
the pack-files. This object-size idea could be a direction for
future expansion in this area.

Signed-off-by: Derrick Stolee 
---
 midx.c  | 109 +++-
 t/t5319-multi-pack-index.sh |  25 +
 2 files changed, 133 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 30ff4430ab..f0f16690e6 100644
--- a/midx.c
+++ b/midx.c
@@ -8,6 +8,7 @@
 #include "sha1-lookup.h"
 #include "midx.h"
 #include "progress.h"
+#include "run-command.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -1107,7 +1108,113 @@ int expire_midx_packs(const char *object_dir)
return result;
 }
 
-int midx_repack(const char *object_dir, size_t batch_size)
+struct time_and_id {
+   timestamp_t mtime;
+   uint32_t pack_int_id;
+};
+
+static int compare_by_mtime(const void *a_, const void *b_)
 {
+   const struct time_and_id *a, *b;
+
+   a = (const struct time_and_id *)a_;
+   b = (const struct time_and_id *)b_;
+
+   if (a->mtime < b->mtime)
+   return -1;
+   if (a->mtime > b->mtime)
+   return 1;
return 0;
 }
+
+int midx_repack(const char *object_dir, size_t batch_size)
+{
+   int result = 0;
+   uint32_t i, packs_to_repack;
+   size_t total_size;
+   struct time_and_id *pack_ti;
+   unsigned char *include_pack;
+   struct child_process cmd = CHILD_PROCESS_INIT;
+   struct strbuf base_name = STRBUF_INIT;
+   struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
+
+   if (!m)
+   return 0;
+
+   include_pack = xcalloc(m->num_packs, sizeof(unsigned char));
+   pack_ti = xcalloc(m->num_packs, sizeof(struct time_and_id));
+
+   for (i = 0; i < m->num_packs; i++) {
+   pack_ti[i].pack_int_id = i;
+
+   if (prepare_midx_pack(m, i))
+   continue;
+
+   pack_ti[i].mtime = m->packs[i]->mtime;
+   }
+   QSORT(pack_ti, m->num_packs, compare_by_mtime);
+
+   total_size = 0;
+   packs_to_repack = 0;
+   for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
+   int pack_int_id = pack_ti[i].pack_int_id;
+   struct packed_git *p = m->packs[pack_int_id];
+
+   if (!p)
+   continue;
+   if (p->pack_size >= batch_size)
+   continue;
+
+   packs_to_repack++;
+   total_size += p->pack_size;
+   include_pack[pack_int_id] = 1;
+   }
+
+   if (total_size < batch_size || packs_to_repack < 2)
+   goto cleanup;
+
+   argv_array_push(&cmd.args, "pack-objects");
+
+   strbuf_addstr(&base_name, object_dir);
+   strbuf_addstr(&base_name, "/pack/pack");
+   argv_array_push(&cmd.args, base_name.buf);
+   strbuf_release(&base_name);
+
+   cmd.git_cmd = 1;
+   cmd.in = cmd.out = -1;
+
+   if (start_command(&cmd)) {
+   error(_("could not start pack-objects"));
+   result = 1;
+   goto cleanup;
+   }
+
+   for (i = 0; i < m->num_objects; i++) {
+   struct object_id oid;
+   uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
+
+   if (!include_pack[pack_int_id])
+   continue;
+
+   nth_midxed_object_oid(&oid, m, i);
+   xwrite(cmd.in, oid_to_hex(&oid), the_hash_algo->hexsz);
+   xwrite(cmd.in, "\n", 1);
+   }
+   close(cmd.in);
+
+   if (finish_command(&cmd)) {
+   error(_("could not finish pack-objects"));
+   result = 1;
+   goto cleanup;
+   }
+
+   result = write_midx_internal(object_dir, m, NULL);
+   m = NULL;
+
+cleanup:
+   if (m)
+   close_midx(m);
+   free(include_pack);
+   free(pack_ti);
+   return result;
+}
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
i

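The batch-selection loop in midx_repack() above can be sketched as a standalone shell pipeline (the pack names, sizes, and mtimes are invented for illustration):

```shell
# Packs listed as "<mtime> <size> <name>", walked oldest-first:
# skip any single pack at or above the batch size, and stop adding
# once the accumulated size exceeds the batch size.
batch_size=100
printf '%s\n' \
	'100 40 pack-a' \
	'300 500 pack-b' \
	'200 30 pack-c' \
	'400 50 pack-d' |
sort -n |
{
	total=0
	while read mtime size name
	do
		test "$total" -ge "$batch_size" && break
		test "$size" -ge "$batch_size" && continue
		total=$((total + size))
		echo "include $name"
	done
	echo "total $total"
} >batch.txt
cat batch.txt
```

Here pack-b is never considered (it alone exceeds the batch size), and the batch closes after pack-d pushes the total past 100, mirroring why the resulting pack can end up larger than --batch-size before compression shrinks it.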
[PATCH v3 4/9] midx: simplify computation of pack name lengths

2019-01-09 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

Before writing the multi-pack-index, we compute the length of the
pack-index names concatenated together. This forms the data in the
pack name chunk, and we precompute it to compute chunk offsets.
The value is also modified to fit alignment needs.

Previously, this computation was coupled with adding packs from
the existing multi-pack-index and the remaining packs in the object
dir not already covered by the multi-pack-index.

In anticipation of this becoming more complicated with the 'expire'
command, simplify the computation by centralizing it to a single
loop before writing the file.

Signed-off-by: Derrick Stolee 
---
 midx.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/midx.c b/midx.c
index bb825ef816..f087bbbe82 100644
--- a/midx.c
+++ b/midx.c
@@ -383,7 +383,6 @@ struct pack_list {
uint32_t nr;
uint32_t alloc_list;
uint32_t alloc_names;
-   size_t pack_name_concat_len;
struct multi_pack_index *m;
 };
 
@@ -418,7 +417,6 @@ static void add_pack_to_midx(const char *full_path, size_t 
full_path_len,
}
 
packs->names[packs->nr] = xstrdup(file_name);
-   packs->pack_name_concat_len += strlen(file_name) + 1;
packs->nr++;
}
 }
@@ -762,6 +760,7 @@ int write_midx_file(const char *object_dir)
uint32_t nr_entries, num_large_offsets = 0;
struct pack_midx_entry *entries = NULL;
int large_offsets_needed = 0;
+   int pack_name_concat_len = 0;
 
midx_name = get_midx_filename(object_dir);
if (safe_create_leading_directories(midx_name)) {
@@ -777,7 +776,6 @@ int write_midx_file(const char *object_dir)
packs.alloc_names = packs.alloc_list;
packs.list = NULL;
packs.names = NULL;
-   packs.pack_name_concat_len = 0;
ALLOC_ARRAY(packs.list, packs.alloc_list);
ALLOC_ARRAY(packs.names, packs.alloc_names);
 
@@ -788,7 +786,6 @@ int write_midx_file(const char *object_dir)
 
packs.list[packs.nr] = NULL;
packs.names[packs.nr] = xstrdup(packs.m->pack_names[i]);
-   packs.pack_name_concat_len += 
strlen(packs.names[packs.nr]) + 1;
packs.nr++;
}
}
@@ -798,10 +795,6 @@ int write_midx_file(const char *object_dir)
if (packs.m && packs.nr == packs.m->num_packs)
goto cleanup;
 
-   if (packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
-   packs.pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
- (packs.pack_name_concat_len % 
MIDX_CHUNK_ALIGNMENT);
-
ALLOC_ARRAY(pack_perm, packs.nr);
sort_packs_by_name(packs.names, packs.nr, pack_perm);
 
@@ -814,6 +807,13 @@ int write_midx_file(const char *object_dir)
large_offsets_needed = 1;
}
 
+   for (i = 0; i < packs.nr; i++)
+   pack_name_concat_len += strlen(packs.names[i]) + 1;
+
+   if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
+   pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
+   (pack_name_concat_len % 
MIDX_CHUNK_ALIGNMENT);
+
hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
FREE_AND_NULL(midx_name);
@@ -831,7 +831,7 @@ int write_midx_file(const char *object_dir)
 
cur_chunk++;
chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDFANOUT;
-   chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + 
packs.pack_name_concat_len;
+   chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + 
pack_name_concat_len;
 
cur_chunk++;
chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
-- 
gitgitgadget



[PATCH v3 3/9] multi-pack-index: prepare for 'expire' subcommand

2019-01-09 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The multi-pack-index tracks objects in a collection of pack-files.
Only one copy of each object is indexed, using the modified time
of the pack-files to determine tie-breakers. It is possible to
have a pack-file with no referenced objects because all objects
have a duplicate in a newer pack-file.

Introduce a new 'expire' subcommand to the multi-pack-index builtin.
This subcommand will delete these unused pack-files and rewrite the
multi-pack-index to no longer refer to those files. More details
about the specifics will follow as the method is implemented.

Add a test that verifies the 'expire' subcommand is correctly wired,
but will still be valid when the verb is implemented. Specifically,
create a set of packs that should all have referenced objects and
should not be removed during an 'expire' operation.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-multi-pack-index.txt |  5 +++
 builtin/multi-pack-index.c |  4 ++-
 midx.c |  5 +++
 midx.h |  1 +
 t/t5319-multi-pack-index.sh| 47 ++
 5 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-multi-pack-index.txt 
b/Documentation/git-multi-pack-index.txt
index 1af406aca2..6186c4c936 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -31,6 +31,11 @@ write::
 verify::
Verify the contents of the MIDX file.
 
+expire::
+   Delete the pack-files that are tracked by the MIDX file, but
+   have no objects referenced by the MIDX. Rewrite the MIDX file
+   afterward to remove all references to these pack-files.
+
 
 EXAMPLES
 
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index fca70f8e4f..145de3a46c 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -5,7 +5,7 @@
 #include "midx.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
-   N_("git multi-pack-index [--object-dir=<dir>] (write|verify)"),
+   N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire)"),
NULL
 };
 
@@ -44,6 +44,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
return write_midx_file(opts.object_dir);
if (!strcmp(argv[0], "verify"))
return verify_midx_file(opts.object_dir);
+   if (!strcmp(argv[0], "expire"))
+   return expire_midx_packs(opts.object_dir);
 
die(_("unrecognized verb: %s"), argv[0]);
 }
diff --git a/midx.c b/midx.c
index 730ff84dff..bb825ef816 100644
--- a/midx.c
+++ b/midx.c
@@ -1025,3 +1025,8 @@ int verify_midx_file(const char *object_dir)
 
return verify_midx_error;
 }
+
+int expire_midx_packs(const char *object_dir)
+{
+   return 0;
+}
diff --git a/midx.h b/midx.h
index 774f652530..e3a2b740b5 100644
--- a/midx.h
+++ b/midx.h
@@ -49,6 +49,7 @@ int prepare_multi_pack_index_one(struct repository *r, const 
char *object_dir, i
 int write_midx_file(const char *object_dir);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(const char *object_dir);
+int expire_midx_packs(const char *object_dir);
 
 void close_midx(struct multi_pack_index *m);
 
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 70926b5bc0..948effc1ee 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -348,4 +348,51 @@ test_expect_success 'verify incorrect 64-bit offset' '
"incorrect object offset"
 '
 
+test_expect_success 'setup expire tests' '
+   mkdir dup &&
+   (
+   cd dup &&
+   git init &&
+   for i in $(test_seq 1 20)
+   do
+   test_commit $i
+   done &&
+   git branch A HEAD &&
+   git branch B HEAD~8 &&
+   git branch C HEAD~13 &&
+   git branch D HEAD~16 &&
+   git branch E HEAD~18 &&
+   git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
+   refs/heads/E
+   EOF
+   git pack-objects --revs .git/objects/pack/pack-D <<-EOF &&
+   refs/heads/D
+   ^refs/heads/E
+   EOF
+   git pack-objects --revs .git/objects/pack/pack-C <<-EOF &&
+   refs/heads/C
+   ^refs/heads/D
+   EOF
+   git pack-objects --revs .git/objects/pack/pack-B <<-EOF &&
+   refs/heads/B
+   ^refs/heads/C
+   EOF
+   git pack-objects --revs .git/objects/pack/pack-A <<-EOF &&
+   refs/heads/A
+   

[PATCH v3 9/9] multi-pack-index: test expire while adding packs

2019-01-09 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

During development of the multi-pack-index expire subcommand, a
version went out that improperly computed the pack order if a new
pack was introduced while other packs were being removed. Part of
the subtlety of the bug involved the new pack being placed before
other packs that already existed in the multi-pack-index.

Add a test to t5319-multi-pack-index.sh that catches this issue.
The test adds new packs that cause another pack to be expired, and
creates new packs that are lexicographically sorted before and
after the existing packs.

Signed-off-by: Derrick Stolee 
---
 t/t5319-multi-pack-index.sh | 32 
 1 file changed, 32 insertions(+)

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 99afb5ec51..1213549d25 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -449,4 +449,36 @@ test_expect_success 'expire removes repacked packs' '
)
 '
 
+test_expect_success 'expire works when adding new packs' '
+   (
+   cd dup &&
+   git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
+   refs/heads/A
+   ^refs/heads/B
+   EOF
+   git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
+   refs/heads/B
+   ^refs/heads/C
+   EOF
+   git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
+   refs/heads/C
+   ^refs/heads/D
+   EOF
+   git multi-pack-index write &&
+   git pack-objects --revs .git/objects/pack/a-pack <<-EOF &&
+   refs/heads/D
+   ^refs/heads/E
+   EOF
+   git multi-pack-index write &&
+   git pack-objects --revs .git/objects/pack/z-pack <<-EOF &&
+   refs/heads/E
+   EOF
+   git multi-pack-index expire &&
+   ls .git/objects/pack/ | grep idx >expect &&
+   test-tool read-midx .git/objects | grep idx >actual &&
+   test_cmp expect actual &&
+   git multi-pack-index verify
+   )
+'
+
 test_done
-- 
gitgitgadget


[PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index

2019-01-09 Thread Derrick Stolee via GitGitGadget
The multi-pack-index provides a fast way to find an object among a large
list of pack-files. It stores a single pack-reference for each object id, so
duplicate objects are ignored. Among a list of pack-files storing the same
object, the most-recently modified one is used.

Create new subcommands for the multi-pack-index builtin.

 * 'git multi-pack-index expire': If we have a pack-file indexed by the
   multi-pack-index, but all objects in that pack are duplicated in
   more-recently modified packs, then delete that pack (and any others like
   it). Delete the reference to that pack in the multi-pack-index.
   
   
 * 'git multi-pack-index repack --batch-size=<size>': Starting from the oldest
   pack-files covered by the multi-pack-index, find those whose on-disk size
   is below the batch size until we have a collection of packs whose sizes
   add up to the batch size. Create a new pack containing all objects that
   the multi-pack-index references to those packs.
   
   

This allows us to create a new pattern for repacking objects: run 'repack'.
After enough time has passed that all Git commands that started before the
last 'repack' are finished, run 'expire' again. This approach has some
advantages over the existing "repack everything" model:

 1. Incremental. We can repack a small batch of objects at a time, instead
of repacking all reachable objects. We can also limit ourselves to the
objects that do not appear in newer pack-files.


 2. Highly Available. By adding a new pack-file (and not deleting the old
pack-files) we do not interrupt concurrent Git commands, and do not
suffer performance degradation. By expiring only pack-files that have no
referenced objects, we know that Git commands that are doing normal
object lookups* will not be interrupted.


 3. Note: if someone concurrently runs a Git command that uses
get_all_packs(), then that command could try to read the pack-files and
pack-indexes that we are deleting during an expire command. Such
commands are usually related to object maintenance (i.e. fsck, gc,
pack-objects) or are related to less-often-used features (i.e.
fast-import, http-backend, server-info).



We plan to use this approach in VFS for Git to do background maintenance of
the "shared object cache" which is a Git alternate directory filled with
packfiles containing commits and trees. We currently download pack-files on
an hourly basis to keep up-to-date with the central server. The cache
servers supply packs on an hourly and daily basis, so most of the hourly
packs become useless after a new daily pack is downloaded. The 'expire'
command would clear out most of those packs, but many will still remain with
fewer than 100 objects remaining. The 'repack' command (with a batch size of
1-3gb, probably) can condense the remaining packs in commands that run for
1-3 min at a time. Since the daily packs range from 100-250mb, we will also
combine and condense those packs.

Updates in V2:

 * Added a method, unlink_pack_path() to remove packfiles, but with the
   additional check for a .keep file. This borrows logic from 
   builtin/repack.c.
   
   
 * Modified documentation and commit messages to replace 'verb' with
   'subcommand'. Simplified the documentation. (I left 'verbs' in the title
   of the cover letter for consistency.)
   
   

Updates in V3:

 * There was a bug in the expire logic when simultaneously removing packs
   and adding uncovered packs, specifically around the pack permutation.
   This was hard to see during review because I was using the 'pack_perm'
   array for multiple purposes. First, I was reducing its length, and then I
   was adding to it and resorting. In V3, I significantly overhauled the
   logic here, which required some extra commits before implementing
   'expire'. The final commit includes a test that would cover this case.

Thanks, -Stolee

Derrick Stolee (9):
  repack: refactor pack deletion for future use
  Docs: rearrange subcommands for multi-pack-index
  multi-pack-index: prepare for 'expire' subcommand
  midx: simplify computation of pack name lengths
  midx: refactor permutation logic and pack sorting
  multi-pack-index: implement 'expire' verb
  multi-pack-index: prepare 'repack' subcommand
  midx: implement midx_repack()
  multi-pack-index: test expire while adding packs

 Documentation/git-multi-pack-index.txt |  26 +-
 builtin/multi-pack-index.c |  12 +-
 builtin/repack.c   |  14 +-
 midx.c | 393 ++---
 midx.h |   2 +
 packfile.c |  28 ++
 packfile.h |   7 +
 t/t5319-multi-pack-index.sh| 133 +
 8 files changed, 497 ins

[PATCH v3 2/9] Docs: rearrange subcommands for multi-pack-index

2019-01-09 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

We will add new subcommands to the multi-pack-index, and that will
make the documentation a bit messier. Clean up the 'verb'
descriptions by renaming the concept to 'subcommand' and removing
the reference to the object directory.

Helped-by: Stefan Beller 
Helped-by: Szeder Gábor 
Signed-off-by: Derrick Stolee 
---
 Documentation/git-multi-pack-index.txt | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-multi-pack-index.txt 
b/Documentation/git-multi-pack-index.txt
index f7778a2c85..1af406aca2 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -9,7 +9,7 @@ git-multi-pack-index - Write and verify multi-pack-indexes
 SYNOPSIS
 
 [verse]
-'git multi-pack-index' [--object-dir=<dir>] <verb>
+'git multi-pack-index' [--object-dir=<dir>] <subcommand>
 
 DESCRIPTION
 ---
@@ -23,13 +23,13 @@ OPTIONS
	`<dir>/packs/multi-pack-index` for the current MIDX file, and
	`<dir>/packs` for the pack-files to index.
 
+The following subcommands are available:
+
 write::
-   When given as the verb, write a new MIDX file to
-   `<dir>/packs/multi-pack-index`.
+   Write a new MIDX file.
 
 verify::
-   When given as the verb, verify the contents of the MIDX file
-   at `<dir>/packs/multi-pack-index`.
+   Verify the contents of the MIDX file.
 
 
 EXAMPLES
-- 
gitgitgadget



[PATCH v3 6/9] multi-pack-index: implement 'expire' verb

2019-01-09 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The 'git multi-pack-index expire' command looks at the existing
multi-pack-index, counts the number of objects referenced in each
pack-file, deletes the pack-files with no referenced objects, and
rewrites the multi-pack-index to no longer reference those packs.

Refactor the write_midx_file() method to call write_midx_internal()
which now takes an existing 'struct multi_pack_index' and a list
of pack-files to drop (as specified by the names of their pack-
indexes). As we write the new multi-pack-index, we drop those
file names from the list of known pack-files.

The expire_midx_packs() method removes the unreferenced pack-files
after carefully closing the packs to avoid open handles.

Test that a new pack-file that covers the contents of two other
pack-files leads to those pack-files being deleted during the
expire command. Be sure to read the multi-pack-index to ensure
it no longer references those packs.

Signed-off-by: Derrick Stolee 
---
 midx.c  | 120 +---
 t/t5319-multi-pack-index.sh |  18 ++
 2 files changed, 128 insertions(+), 10 deletions(-)

diff --git a/midx.c b/midx.c
index 7ae5275c25..6ccbec3f19 100644
--- a/midx.c
+++ b/midx.c
@@ -33,6 +33,8 @@
 #define MIDX_CHUNK_LARGE_OFFSET_WIDTH (sizeof(uint64_t))
 #define MIDX_LARGE_OFFSET_NEEDED 0x80000000
 
+#define PACK_EXPIRED UINT_MAX
+
 static char *get_midx_filename(const char *object_dir)
 {
return xstrfmt("%s/pack/multi-pack-index", object_dir);
@@ -381,6 +383,7 @@ struct pack_info {
uint32_t orig_pack_int_id;
char *pack_name;
struct packed_git *p;
+   unsigned expired : 1;
 };
 
 static int pack_info_compare(const void *_a, const void *_b)
@@ -428,6 +431,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 
packs->info[packs->nr].pack_name = xstrdup(file_name);
packs->info[packs->nr].orig_pack_int_id = packs->nr;
+   packs->info[packs->nr].expired = 0;
packs->nr++;
}
 }
@@ -587,13 +591,17 @@ static size_t write_midx_pack_names(struct hashfile *f,
size_t written = 0;
 
for (i = 0; i < num_packs; i++) {
-   size_t writelen = strlen(info[i].pack_name) + 1;
+   size_t writelen;
+
+   if (info[i].expired)
+   continue;
 
if (i && strcmp(info[i].pack_name, info[i - 1].pack_name) <= 0)
BUG("incorrect pack-file order: %s before %s",
info[i - 1].pack_name,
info[i].pack_name);
 
+   writelen = strlen(info[i].pack_name) + 1;
hashwrite(f, info[i].pack_name, writelen);
written += writelen;
}
@@ -675,6 +683,11 @@ static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_nee
for (i = 0; i < nr_objects; i++) {
struct pack_midx_entry *obj = list++;
 
+   if (perm[obj->pack_int_id] == PACK_EXPIRED)
+   BUG("object %s is in an expired pack with int-id %d",
+   oid_to_hex(&obj->oid),
+   obj->pack_int_id);
+
hashwrite_be32(f, perm[obj->pack_int_id]);
 
if (large_offset_needed && obj->offset >> 31)
@@ -721,7 +734,8 @@ static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_off
return written;
 }
 
-int write_midx_file(const char *object_dir)
+static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
+  struct string_list *packs_to_drop)
 {
unsigned char cur_chunk, num_chunks = 0;
char *midx_name;
@@ -737,6 +751,8 @@ int write_midx_file(const char *object_dir)
struct pack_midx_entry *entries = NULL;
int large_offsets_needed = 0;
int pack_name_concat_len = 0;
+   int dropped_packs = 0;
+   int result = 0;
 
midx_name = get_midx_filename(object_dir);
if (safe_create_leading_directories(midx_name)) {
@@ -745,7 +761,10 @@ int write_midx_file(const char *object_dir)
  midx_name);
}
 
-   packs.m = load_multi_pack_index(object_dir, 1);
+   if (m)
+   packs.m = m;
+   else
+   packs.m = load_multi_pack_index(object_dir, 1);
 
packs.nr = 0;
packs.alloc = packs.m ? packs.m->num_packs : 16;
@@ -759,13 +778,14 @@ int write_midx_file(const char *object_dir)
packs.info[packs.nr].orig_pack_int_id = i;
 		packs.info[packs.nr].pack_name = xstrdup(packs.m->pack_names[i]);
packs.info[packs.nr].p = NULL;
+   packs.info[packs.nr].expired = 0;
 

Re: [PATCH 0/11] jk/loose-object-cache sha1/object_id fixups

2019-01-08 Thread Derrick Stolee

On 1/8/2019 1:07 PM, Junio C Hamano wrote:

Jeff King  writes:


Yeah, they should. I think one of them will need René's patch, which
changes the body of quick_has_loose(). I can roll it as a separate topic
if that's easier (or just wait a week or so until René's cleanups
graduate).

Nah, what I got is already good to work with.  Both series are
straight-forward and I do not expect them needing long fermentation.


I'm just chiming in to say that this series was a very satisfying read, 
and the changes were clear-cut and mechanical.


Thanks!
-Stolee


Git Test Coverage Report (Tues, Jan 8)

2019-01-08 Thread Derrick Stolee
g(p, "=on", end) ||
b755bf6f83 1108) match_placeholder_arg(p, "=true", end)) {

protocol.c
6da1f1a920  37) die(_("Unrecognized protocol version"));
6da1f1a920  39) die(_("Unrecognized protocol_version"));
63bb981502  49) enum protocol_version version = parse_protocol_version(git_test_v);

63bb981502  51) if (version == protocol_unknown_version)
63bb981502  52) die("unknown value for %s: %s", git_test_k,
63bb981502  55) return version;

read-cache.c
43bf1db73e 1302) return 0;
43bf1db73e 1323) oidcpy(&backup_prev, &istate->cache[pos]->oid);
43bf1db73e 1343) update_backup_log(istate, &backup_prev, ce);
43bf1db73e 3215) strbuf_release(&sb);
43bf1db73e 3216) return -1;

refs/files-backend.c
c67027c9a9 1892) return;
c67027c9a9 1895) return;

remote-curl.c
6da1f1a920  344) return 0;
34a9469d6a  373) die("invalid server response; expected service, got flush packet");

34a9469d6a  397) d->proto_git = 1;

revision.c
497f2693ab  149) return;
497f2693ab  152) return;
497f2693ab  175) break;
04dac00473  197) continue;

send-pack.c
01f9ec64c8 143) return error(_("unable to parse remote unpack status: %s"), reader->line);

01f9ec64c8 162) error("invalid ref status from remote: %s", reader->line);
01f9ec64c8 579) receive_unpack_status(&reader);

strbuf.c
bfc3fe33f6  259) die("`pos' is too far after the end of the buffer");
bfc3fe33f6  266) return; /* nothing to do */
bfc3fe33f6  268) die("you want to use way too much memory");
18f8e81091  448) return 0;

submodule.c
26f80ccfc1 1398) strbuf_release(&gitdir);
be76c21282 1521) struct fetch_task *task = task_cb;
be76c21282 1525) fetch_task_release(task);
898c2e65b7 1806) warning(_("Could not unset core.worktree setting in submodule '%s'"),


unpack-trees.c
cc14089d7c  206) oidclr(&null_hash);
cc14089d7c  207) new_hash = &null_hash;
6f41cc899b 1716) index_path(NULL, old_hash, ce->name, &st,
6f41cc899b 1972) old_hash && !lstat(ce->name, &st))
6f41cc899b 1973) index_path(NULL, old_hash, ce->name, &st,
cc14089d7c 2291) if (verify_absent(ce, ERROR_WOULD_LOSE_UNTRACKED_REMOVED,
cc14089d7c 2294) make_backup(ce, &old_hash, NULL, o);

upload-pack.c
01f9ec64c8  428) die("git upload-pack: expected SHA1 list, got '%s'", reader->line);


wrapper.c
5efde212fc  70) die("Out of memory, malloc failed (tried to allocate %" PRIuMAX " bytes)",
5efde212fc  73) error("Out of memory, malloc failed (tried to allocate %" PRIuMAX " bytes)",


Commits introducing uncovered code:
Ævar Arnfjörð Bjarmason  63bb98150: tests: add a special setup where for protocol.version
Anders Waldenborg  18f8e8109: strbuf: separate callback for strbuf_expand:ing literals
Anders Waldenborg  4681fe38e: pretty: allow showing specific trailers
Anders Waldenborg  b755bf6f8: pretty: allow %(trailers) options with explicit value
Derrick Stolee  04dac0047: list-objects: consume sparse tree walk
Derrick Stolee  0d8e91f58: multi-pack-index: implement 'expire' verb
Derrick Stolee  497f2693a: revision: implement sparse algorithm
Derrick Stolee  5532d59aa: multi-pack-index: prepare 'repack' subcommand
Derrick Stolee  913368875: repack: refactor pack deletion for future use
Derrick Stolee  c9b358598: midx: implement midx_repack()
Jeff King  34a9469d6: remote-curl: refactor smart-http discovery
Joel Teichroeb  cdca49bc4: stash: convert drop and clear to builtin
Joel Teichroeb  e1d01876a: stash: convert pop to builtin
Joel Teichroeb  f6bbd7812: stash: convert apply to builtin
Johannes Schindelin  26799a208: stash: optionally use the scripted version again
Johannes Schindelin  2ead83aef: rebase: move `reset_head()` into a better spot
Johannes Schindelin  3bd5f0710: built-in rebase: call `git am` directly
Johannes Schindelin  81ef8ee75: rebase: introduce a shortcut for --reschedule-failed-exec
Johannes Schindelin  bec65d5b7: tests: add a special setup where stash.useBuiltin is off
Johannes Schindelin  d421afa0c: rebase: introduce --reschedule-failed-exec
Jonathan Tan  4d0feb763: builtin/fetch-pack: support protocol version 2
Josh Steadmon  6da1f1a92: protocol: advertise multiple supported versions
Junio C Hamano  d6af6af1f: Merge branch 'sb/submodule-recursive-fetch-gets-the-tip' into pu
Linus Torvalds  080448fbe: Add 'human' date format
Martin Koegler  5efde212f: zlib.c: use size_t for size
Masaya Suzuki  01f9ec64c: Use packet_reader instead of packet_read_line
Matthew DeVore  adbdcc076: list-objects-filter: teach tree:# how to handle >0
Matthew DeVore  e34ec45cc: tree:: skip some trees even when collecting omits
Nguyễn Thái Ngọc Duy  102b7856e: backup-log.c: add API for walking backup log

[PATCH 0/1] git-gc.txt: fix typo about gc.writeCommitGraph

2019-01-08 Thread Derrick Stolee via GitGitGadget
Thanks to Stefan Haller for sending me a private message about this typo.

Derrick Stolee (1):
  git-gc.txt: fix typo about gc.writeCommitGraph

 Documentation/git-gc.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


base-commit: c7e8ce6d1dd02f6569ea785eebc8692e8e2edf72
Published-As: 
https://github.com/gitgitgadget/git/releases/tag/pr-105%2Fderrickstolee%2Fgc-doc%2Fupstream-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-105/derrickstolee/gc-doc/upstream-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/105
-- 
gitgitgadget


[PATCH 1/1] git-gc.txt: fix typo about gc.writeCommitGraph

2019-01-08 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

Reported-by: Stefan Haller 
Signed-off-by: Derrick Stolee 
---
 Documentation/git-gc.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index c20ee6c789..a7442499f6 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -137,7 +137,7 @@ The optional configuration variable `gc.packRefs` 
determines if
 it within all non-bare repos or it can be set to a boolean value.
 This defaults to true.
 
-The optional configuration variable `gc.commitGraph` determines if
+The optional configuration variable `gc.writeCommitGraph` determines if
 'git gc' should run 'git commit-graph write'. This can be set to a
 boolean value. This defaults to false.
 
-- 
gitgitgadget


Git Test Coverage Report (Saturday, Dec 29)

2018-12-29 Thread Derrick Stolee
3  197) continue;

strbuf.c
18f8e81091  397) return 0;

submodule.c
26f80ccfc1 1396) strbuf_release(&gitdir);
be76c21282 1519) struct fetch_task *task = task_cb;
be76c21282 1523) fetch_task_release(task);

unpack-trees.c
cc14089d7c  206) oidclr(&null_hash);
cc14089d7c  207) new_hash = &null_hash;
6f41cc899b 1716) index_path(NULL, old_hash, ce->name, &st,
6f41cc899b 1972) old_hash && !lstat(ce->name, &st))
6f41cc899b 1973) index_path(NULL, old_hash, ce->name, &st,
cc14089d7c 2291) if (verify_absent(ce, ERROR_WOULD_LOSE_UNTRACKED_REMOVED,
cc14089d7c 2294) make_backup(ce, &old_hash, NULL, o);

wrapper.c
5efde212fc  70) die("Out of memory, malloc failed (tried to allocate %" PRIuMAX " bytes)",
5efde212fc  73) error("Out of memory, malloc failed (tried to allocate %" PRIuMAX " bytes)",


Commits introducing uncovered code:
Anders Waldenborg  18f8e8109: strbuf: separate callback for strbuf_expand:ing literals
Anders Waldenborg  4681fe38e: pretty: allow showing specific trailers
Anders Waldenborg  b755bf6f8: pretty: allow %(trailers) options with explicit value
Derrick Stolee  04dac0047: list-objects: consume sparse tree walk
Derrick Stolee  497f2693a: revision: implement sparse algorithm
Derrick Stolee  5518ac8ee: multi-pack-index: implement 'expire' verb
Derrick Stolee  b59c49226: multi-pack-index: implement midx_repack()
Derrick Stolee  e86a2be88: multi-pack-index: prepare 'repack' verb
Johannes Schindelin  81ef8ee75: rebase: introduce a shortcut for --reschedule-failed-exec
Johannes Schindelin  d421afa0c: rebase: introduce --reschedule-failed-exec
Josh Steadmon  24c10f747: protocol: advertise multiple supported versions
Linus Torvalds  74e8221b5: Add 'human' date format
Martin Koegler  5efde212f: zlib.c: use size_t for size
Nguyễn Thái Ngọc Duy  102b7856e: backup-log.c: add API for walking backup log
Nguyễn Thái Ngọc Duy  1fcfdf84c: apply: support backup log with --keep-backup
Nguyễn Thái Ngọc Duy  43bf1db73: read-cache.c: new flag for add_index_entry() to write to backup log
Nguyễn Thái Ngọc Duy  45f3e0cd9: backup-log: add diff command
Nguyễn Thái Ngọc Duy  6a05b9ab7: backup-log: add cat command
Nguyễn Thái Ngọc Duy  6f41cc899: reset --hard: keep backup of overwritten files
Nguyễn Thái Ngọc Duy  7f1d166ee: backup-log: add log command
Nguyễn Thái Ngọc Duy  937d6bee9: config --edit: support backup log
Nguyễn Thái Ngọc Duy  b2069b6eb: backup-log: keep all blob references around
Nguyễn Thái Ngọc Duy  b86e9ac72: gc: prune backup logs
Nguyễn Thái Ngọc Duy  bde028c66: backup-log: add prune command
Nguyễn Thái Ngọc Duy  c67027c9a: refs: keep backup of deleted reflog
Nguyễn Thái Ngọc Duy  cc14089d7: unpack-trees.c: keep backup of ignored files being overwritten
Nguyễn Thái Ngọc Duy  fdbbdf809: backup-log: add "update" subcommand
Stefan Beller  26f80ccfc: submodule: migrate get_next_submodule to use repository structs
Stefan Beller  be76c2128: fetch: ensure submodule objects fetched
Thomas Gummerer  20e835df5: checkout: add --cached option
Thomas Gummerer  4ed0c8eb4: checkout: introduce --{,no-}overlay option



Uncovered code in 'jch' not in 'next'


apply.c
0f086e6dca 3355) if (checkout_entry(ce, &costate, NULL, NULL) ||
0f086e6dca 3356) lstat(ce->name, st))

builtin/branch.c
0ecb1fc726 builtin/branch.c 456) die(_("could not resolve HEAD"));
0ecb1fc726 builtin/branch.c 462) die(_("HEAD (%s) points outside of refs/heads/"), refname);


builtin/pull.c
b19eee9066 647) argv_array_push(&args, opt_cleanup);

builtin/remote.c
f39a9c6547 builtin/remote.c 1551) die(_("--save-to-push cannot be used with other options"));
f39a9c6547 builtin/remote.c 1575) die(_("--save-to-push can only be used when only one url is defined"));


commit-graph.c
721351787e  128) return NULL;
721351787e  131) return NULL;
721351787e  187) free(graph);
721351787e  188) return NULL;
721351787e  223) free(graph);
721351787e  224) return NULL;

hex.c
47edb64997  93) char *sha1_to_hex_r(char *buffer, const unsigned char *sha1)
47edb64997  95) return hash_to_hex_algop_r(buffer, sha1, &hash_algos[GIT_HASH_SHA1]);

47edb64997 116) char *hash_to_hex(const unsigned char *hash)
47edb64997 118) return hash_to_hex_algop(hash, the_hash_algo);

pathspec.c
22af33bece 671) name = to_free = xmemdupz(name, namelen);

read-cache.c
ee70c12820 1730) if (advice_unknown_index_extension) {
ee70c12820 1731) warning(_("ignoring optional %.4s index extension"), ext);
ee70c12820 1732) advise(_("This is likely due to the file having been written by a newer\n"

ec36c42a63 3508) const char *index = NULL;
ec36c42a63 3514) if (!offset)
ec36c42a63 3515) return NULL;
ec

Re: [PATCH 00/23] sb/more-repo-in-api

2018-12-27 Thread Derrick Stolee

On 12/14/2018 7:09 PM, Stefan Beller wrote:

I realized next has not been rewound, so I can resend sb/more-repo-in-api,
which I hereby do. The changes are minimal and address the only comment
by Jonathan so far.


Sorry I'm very late to look at this, but your series looks good to me. 
I've got some work on the way that will use these arbitrary 
repositories. (Specifically, moving 'generation' out of 'struct commit' 
and into a commit slab to store reachability index values.)


Thanks,

-Stolee



Re: [PATCH] log: add %S option (like --source) to log --format

2018-12-27 Thread Derrick Stolee

On 12/19/2018 3:33 AM, issac.tro...@gmail.com wrote:

From: Issac Trotts 

Make it possible to write for example

 git log --format="%H,%S"

where the %S at the end is a new placeholder that prints out the ref
(tag/branch) for each commit.

Using %d might seem like an alternative but it only shows the ref for the last
commit in the branch.

Signed-off-by: Issac Trotts 

---

This change is based on a question from Stack Overflow:
https://stackoverflow.com/questions/12712775/git-get-source-information-in-format
---
  Documentation/pretty-formats.txt |  2 ++
  builtin/log.c|  2 +-
  log-tree.c   |  1 +
  pretty.c | 15 ++
  pretty.h |  1 +
  t/t4205-log-pretty-formats.sh| 50 
  6 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/Documentation/pretty-formats.txt b/Documentation/pretty-formats.txt
index 417b638cd..de6953108 100644
--- a/Documentation/pretty-formats.txt
+++ b/Documentation/pretty-formats.txt
@@ -134,6 +134,8 @@ The placeholders are:
  - '%cI': committer date, strict ISO 8601 format
  - '%d': ref names, like the --decorate option of linkgit:git-log[1]
  - '%D': ref names without the " (", ")" wrapping.
+- '%S': ref name given on the command line by which the commit was reached
+  (like `git log --source`), only works with `git log`


This "only works with `git log`" made me think about what would happen 
with `git rev-list --pretty=format:"%h %S"` and the answer (on my 
machine) was a segfault.



  - '%e': encoding
  - '%s': subject
  - '%f': sanitized subject line, suitable for a filename
diff --git a/builtin/log.c b/builtin/log.c
index e8e51068b..be3025657 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -203,7 +203,7 @@ static void cmd_log_init_finish(int argc, const char **argv, const char *prefix,
rev->diffopt.filter || rev->diffopt.flags.follow_renames)
rev->always_show_header = 0;
  
-	if (source) {
+	if (source || (rev->pretty_given && (rev->commit_format == CMIT_FMT_USERFORMAT) && w.source)) {
init_revision_sources(&revision_sources);
rev->sources = &revision_sources;
}


Likely, you'll want to duplicate this initialization in the revision 
machinery. Keep this one in builtin/log.c as it was before, but add 
something like this initialization to setup_revisions(). Add a test for 
rev-list.


[snip]


@@ -1194,6 +1195,17 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
load_ref_decorations(NULL, DECORATE_SHORT_REFS);
 		format_decorations_extended(sb, commit, c->auto_color, "", ", ", "");
return 1;
+   case 'S':   /* tag/branch like --source */
+   if (c->pretty_ctx->rev->sources == NULL) {

Use "if (!c->pretty_ctx->rev->sources".

+   return 0;
+   }
+   slot = revision_sources_at(c->pretty_ctx->rev->sources, commit);
+   if (slot && *slot) {
I'm not sure this check for 'slot' being non-null is necessary, as we 
would already get a failure in the commit-slab code (for 
revision_sources_at()) if the slab is not initialized.

+   strbuf_addstr(sb, *slot);
+   return 1;
+   } else {
+   die(_("failed to get info for %%S"));


Here, you die() when you fail to get a slot but above you return 0 when 
the sources are not initialized.


I don't see another use of die() in this method. Is that the right way 
to handle failure here? (I'm legitimately asking because I have 
over-used 'die()' in the past and am still unclear on when it is 
appropriate.)



  '
  
+test_expect_success 'set up %S tests' '

+   git checkout --orphan source-a &&
+   test_commit one &&
+   test_commit two &&
+   git checkout -b source-b HEAD^ &&
+   test_commit three
+'
+
+test_expect_success 'log --format=%S paints branch names' '
+   cat >expect <<-\EOF &&
+   source-b
+   source-a
+   source-b
+   EOF
+   git log --format=%S source-a source-b >actual &&
+   test_cmp expect actual
+'
+
+test_expect_success 'log --format=%S paints tag names' '
+   git tag -m tagged source-tag &&
+   cat >expect <<-\EOF &&
+   source-tag
+   source-a
+   source-tag
+   EOF
+   git log --format=%S source-tag source-a >actual &&
+   test_cmp expect actual
+'
+
+test_expect_success 'log --format=%S paints symmetric ranges' '
+   cat >expect <<-\EOF &&
+   source-b
+   source-a
+   EOF
+   git log --format=%S source-a...source-b >actual &&
+   test_cmp expect actual
+'
+
+test_expect_success '%S in git log --format works with other placeholders (part 1)' '
+   git log --format="source-b %h" source-b >expect &&
+   git log --format="%S %h" source-b >actual &&
+   test_cmp expect act

Re: [PATCH v1 0/4] HPE NonStop Port Commits

2018-12-27 Thread Derrick Stolee

On 12/26/2018 6:05 PM, randall.s.bec...@rogers.com wrote:

From: "Randall S. Becker" 

This set of patches is a distilled version of the minimal
set of changes to git that will allow it to run as client
and server on HPE NonStop NSE and NSX systems. NSR systems
are no longer under support so references to them have
been removed. Each patch in this set is independent but
required for correctness.

Randall S. Becker (4):
   transport-helper: use xread instead of read
   config.mak.uname: support for modern HPE NonStop config.
   git-compat-util.h: add FLOSS headers for HPE NonStop
   compat/regex/regcomp.c: define intptr_t and uintptr_t on NonStop


These patches look correct to me. Just one question on patch 3 
(git-compat-util.h: add FLOSS headers for HPE NonStop).


I'm not able to comment on patch 2 (config.mak.uname: support for modern 
HPE NonStop config.), but it looks to be wrapped in a platform-specific 
'if', so should be fine if you are happy with it.


Thanks,

-Stolee



Re: [PATCH v1 3/4] git-compat-util.h: add FLOSS headers for HPE NonStop

2018-12-27 Thread Derrick Stolee

On 12/26/2018 6:05 PM, randall.s.bec...@rogers.com wrote:

The NSIG define is also not defined on __TANDEM, so we define it
here as 100 if it is not defined only for __TANDEM builds.

[snip]

+#if ! defined NSIG


Why didn't you use "#ifndef" here?

Taking a look at the file, I see both "#ifdef" and "#if defined" but no 
"#if ! defined".


Thanks,

-Stolee



[PATCH v2 2/7] Docs: rearrange subcommands for multi-pack-index

2018-12-21 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

We will add new subcommands to the multi-pack-index, and that will
make the documentation a bit messier. Clean up the 'verb'
descriptions by renaming the concept to 'subcommand' and removing
the reference to the object directory.

Helped-by: Stefan Beller 
Helped-by: Szeder Gábor 
Signed-off-by: Derrick Stolee 
---
 Documentation/git-multi-pack-index.txt | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-multi-pack-index.txt 
b/Documentation/git-multi-pack-index.txt
index f7778a2c85..1af406aca2 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -9,7 +9,7 @@ git-multi-pack-index - Write and verify multi-pack-indexes
 SYNOPSIS
 
 [verse]
-'git multi-pack-index' [--object-dir=<dir>] <verb>
+'git multi-pack-index' [--object-dir=<dir>] <subcommand>
 
 DESCRIPTION
 ---
@@ -23,13 +23,13 @@ OPTIONS
	`<dir>/packs/multi-pack-index` for the current MIDX file, and
	`<dir>/packs` for the pack-files to index.
 
+The following subcommands are available:
+
 write::
-   When given as the verb, write a new MIDX file to
-   `<dir>/packs/multi-pack-index`.
+   Write a new MIDX file.
 
 verify::
-   When given as the verb, verify the contents of the MIDX file
-   at `<dir>/packs/multi-pack-index`.
+   Verify the contents of the MIDX file.
 
 
 EXAMPLES
-- 
gitgitgadget



[PATCH v2 0/7] Create 'expire' and 'repack' verbs for git-multi-pack-index

2018-12-21 Thread Derrick Stolee via GitGitGadget
The multi-pack-index provides a fast way to find an object among a large
list of pack-files. It stores a single pack-reference for each object id, so
duplicate objects are ignored. Among a list of pack-files storing the same
object, the most-recently modified one is used.

Create new subcommands for the multi-pack-index builtin.

 * 'git multi-pack-index expire': If we have a pack-file indexed by the
   multi-pack-index, but all objects in that pack are duplicated in
   more-recently modified packs, then delete that pack (and any others like
   it). Delete the reference to that pack in the multi-pack-index.
   
   
 * 'git multi-pack-index repack --batch-size=<size>': Starting from the oldest
   pack-files covered by the multi-pack-index, find those whose on-disk size
   is below the batch size until we have a collection of packs whose sizes
   add up to the batch size. Create a new pack containing all objects that
   the multi-pack-index references to those packs.
   
   

This allows us to create a new pattern for repacking objects: run 'repack'.
After enough time has passed that all Git commands that started before the
last 'repack' are finished, run 'expire' again. This approach has some
advantages over the existing "repack everything" model:

 1. Incremental. We can repack a small batch of objects at a time, instead
of repacking all reachable objects. We can also limit ourselves to the
objects that do not appear in newer pack-files.


 2. Highly Available. By adding a new pack-file (and not deleting the old
pack-files) we do not interrupt concurrent Git commands, and do not
suffer performance degradation. By expiring only pack-files that have no
referenced objects, we know that Git commands that are doing normal
object lookups* will not be interrupted.


 3. Note: if someone concurrently runs a Git command that uses
get_all_packs(), then that command could try to read the pack-files and
pack-indexes that we are deleting during an expire command. Such
commands are usually related to object maintenance (i.e. fsck, gc,
pack-objects) or are related to less-often-used features (i.e.
fast-import, http-backend, server-info).



We plan to use this approach in VFS for Git to do background maintenance of
the "shared object cache" which is a Git alternate directory filled with
packfiles containing commits and trees. We currently download pack-files on
an hourly basis to keep up-to-date with the central server. The cache
servers supply packs on an hourly and daily basis, so most of the hourly
packs become useless after a new daily pack is downloaded. The 'expire'
command would clear out most of those packs, but many will still remain with
fewer than 100 objects remaining. The 'repack' command (with a batch size of
1-3gb, probably) can condense the remaining packs in commands that run for
1-3 min at a time. Since the daily packs range from 100-250mb, we will also
combine and condense those packs.

Updates in V2:

 * Added a method, unlink_pack_path() to remove packfiles, but with the
   additional check for a .keep file. This borrows logic from 
   builtin/repack.c.
   
   
 * Modified documentation and commit messages to replace 'verb' with
   'subcommand'. Simplified the documentation. (I left 'verbs' in the title
   of the cover letter for consistency.)
   
   

Thanks, -Stolee

Derrick Stolee (7):
  repack: refactor pack deletion for future use
  Docs: rearrange subcommands for multi-pack-index
  multi-pack-index: prepare for 'expire' subcommand
  midx: refactor permutation logic
  multi-pack-index: implement 'expire' verb
  multi-pack-index: prepare 'repack' subcommand
  midx: implement midx_repack()

 Documentation/git-multi-pack-index.txt |  26 ++-
 builtin/multi-pack-index.c |  12 +-
 builtin/repack.c   |  14 +-
 midx.c | 217 +++--
 midx.h |   2 +
 packfile.c |  28 
 packfile.h |   7 +
 t/t5319-multi-pack-index.sh|  98 +++
 8 files changed, 376 insertions(+), 28 deletions(-)


base-commit: 26aa9fc81d4c7f6c3b456a29da0b7ec72e5c6595
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-92%2Fderrickstolee%2Fmidx-expire%2Fupstream-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-92/derrickstolee/midx-expire/upstream-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/92

Range-diff vs v1:

 -:  -- > 1:  a697df120c repack: refactor pack deletion for future use
 -:  -- > 2:  55df6b20ff Docs: rearrange subcommands for 
multi-pack-index
 1:  1e34b48a20 ! 3:  2529afe89e multi-pack-index: prepare for 'expire' verb
 @@ -1,6 +1,6 @@
  Autho

[PATCH v2 5/7] multi-pack-index: implement 'expire' verb

2018-12-21 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The 'git multi-pack-index expire' command looks at the existing
multi-pack-index, counts the number of objects referenced in each
pack-file, deletes the pack-files with no referenced objects, and
rewrites the multi-pack-index to no longer reference those packs.

Refactor the write_midx_file() method to call write_midx_internal()
which now takes an existing 'struct multi_pack_index' and a list
of pack-files to drop (as specified by the names of their pack-
indexes). As we write the new multi-pack-index, we drop those
file names from the list of known pack-files.

The expire_midx_packs() method removes the unreferenced pack-files
after carefully closing the packs to avoid open handles.

Test that a new pack-file that covers the contents of two other
pack-files leads to those pack-files being deleted during the
expire command. Be sure to read the multi-pack-index to ensure
it no longer references those packs.
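The selection step described above can be sketched outside of C. The
following is a hedged Python illustration of the idea (not the midx.c
implementation): count, for each pack, how many objects the
multi-pack-index resolves to that pack, and mark packs whose count is
zero for deletion.

```python
def packs_to_expire(object_to_pack, pack_names):
    """Return the pack names with no objects referenced by the MIDX.

    object_to_pack: dict mapping object id -> the single pack name the
    multi-pack-index resolves that object to.
    pack_names: all packs currently tracked by the multi-pack-index.
    """
    referenced = {pack: 0 for pack in pack_names}
    for pack in object_to_pack.values():
        referenced[pack] += 1
    return sorted(p for p, n in referenced.items() if n == 0)

# Two packs overlap; the MIDX resolves every object to the newer pack,
# leaving "old.pack" with zero referenced objects.
objects = {"oid1": "new.pack", "oid2": "new.pack", "oid3": "new.pack"}
print(packs_to_expire(objects, ["new.pack", "old.pack"]))  # ['old.pack']
```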

Signed-off-by: Derrick Stolee 
---
 midx.c  | 82 +++--
 t/t5319-multi-pack-index.sh | 15 +++
 2 files changed, 93 insertions(+), 4 deletions(-)

diff --git a/midx.c b/midx.c
index 3bd7183a53..043cd1fd97 100644
--- a/midx.c
+++ b/midx.c
@@ -751,7 +751,8 @@ static size_t write_midx_large_offsets(struct hashfile *f, 
uint32_t nr_large_off
return written;
 }
 
-int write_midx_file(const char *object_dir)
+static int write_midx_internal(const char *object_dir, struct multi_pack_index 
*m,
+  struct string_list *packs_to_drop)
 {
unsigned char cur_chunk, num_chunks = 0;
char *midx_name;
@@ -765,6 +766,7 @@ int write_midx_file(const char *object_dir)
uint32_t nr_entries, num_large_offsets = 0;
struct pack_midx_entry *entries = NULL;
int large_offsets_needed = 0;
+   int result = 0;
 
midx_name = get_midx_filename(object_dir);
if (safe_create_leading_directories(midx_name)) {
@@ -773,7 +775,10 @@ int write_midx_file(const char *object_dir)
  midx_name);
}
 
-   packs.m = load_multi_pack_index(object_dir, 1);
+   if (m)
+   packs.m = m;
+   else
+   packs.m = load_multi_pack_index(object_dir, 1);
 
packs.nr = 0;
packs.alloc_list = packs.m ? packs.m->num_packs : 16;
@@ -787,7 +792,25 @@ int write_midx_file(const char *object_dir)
ALLOC_ARRAY(packs.perm, packs.alloc_perm);
 
if (packs.m) {
+   int drop_index = 0, missing_drops = 0;
for (i = 0; i < packs.m->num_packs; i++) {
+   if (packs_to_drop && drop_index < packs_to_drop->nr) {
+   int cmp = strcmp(packs.m->pack_names[i],
+
packs_to_drop->items[drop_index].string);
+
+   if (!cmp) {
+   drop_index++;
+   continue;
+   } else if (cmp > 0) {
+   error(_("did not see pack-file %s to 
drop"),
+ 
packs_to_drop->items[drop_index].string);
+   drop_index++;
+   i--;
+   missing_drops++;
+   continue;
+   }
+   }
+
ALLOC_GROW(packs.list, packs.nr + 1, packs.alloc_list);
ALLOC_GROW(packs.names, packs.nr + 1, 
packs.alloc_names);
ALLOC_GROW(packs.perm, packs.nr + 1, packs.alloc_perm);
@@ -798,6 +821,12 @@ int write_midx_file(const char *object_dir)
packs.pack_name_concat_len += 
strlen(packs.names[packs.nr]) + 1;
packs.nr++;
}
+
+   if (packs_to_drop && (drop_index < packs_to_drop->nr || 
missing_drops)) {
+   error(_("did not see all pack-files to drop"));
+   result = 1;
+   goto cleanup;
+   }
}
 
for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &packs);
@@ -932,7 +961,12 @@ int write_midx_file(const char *object_dir)
free(packs.perm);
free(entries);
free(midx_name);
-   return 0;
+   return result;
+}
+
+int write_midx_file(const char *object_dir)
+{
+   return write_midx_internal(object_dir, NULL, NULL);
 }
 
 void clear_midx_file(struct repository *r)
@@ -1034,5 +1068,45 @@ int verify_midx_file(const char *object_dir)
 
 int expire_midx_packs(const char *object_dir)
 {
-   return 0;
+   uint32_t i, *count, result = 0;
+   struct string_list packs_to_drop = STRING_LIST_INIT_DUP;
+   struct multi_pac

[PATCH v2 3/7] multi-pack-index: prepare for 'expire' subcommand

2018-12-21 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The multi-pack-index tracks objects in a collection of pack-files.
Only one copy of each object is indexed, using the modified time
of the pack-files to determine tie-breakers. It is possible to
have a pack-file with no referenced objects because all objects
have a duplicate in a newer pack-file.
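That tie-breaking rule can be illustrated with a small Python sketch
(an assumption-laden model, not the midx.c code): when an object
appears in several packs, the multi-pack-index keeps only the copy
from the pack with the newest modified time.

```python
def index_objects(packs):
    """packs: list of (pack_name, mtime, [object ids]).

    Returns a dict mapping object id -> winning pack name, where the
    pack with the newest mtime wins ties (sketch of the MIDX rule)."""
    chosen = {}
    best_mtime = {}
    for name, mtime, oids in packs:
        for oid in oids:
            if oid not in chosen or mtime > best_mtime[oid]:
                chosen[oid] = name
                best_mtime[oid] = mtime
    return chosen

packs = [
    ("old.pack", 100, ["a", "b"]),
    ("new.pack", 200, ["b", "c"]),  # "b" is duplicated; newer pack wins
]
print(index_objects(packs))
# {'a': 'old.pack', 'b': 'new.pack', 'c': 'new.pack'}
```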

Introduce a new 'expire' subcommand to the multi-pack-index builtin.
This subcommand will delete these unused pack-files and rewrite the
multi-pack-index to no longer refer to those files. More details
about the specifics will follow as the method is implemented.

Add a test that verifies the 'expire' subcommand is correctly wired,
but will still be valid when the verb is implemented. Specifically,
create a set of packs that should all have referenced objects and
should not be removed during an 'expire' operation.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-multi-pack-index.txt |  5 +++
 builtin/multi-pack-index.c |  4 ++-
 midx.c |  5 +++
 midx.h |  1 +
 t/t5319-multi-pack-index.sh| 47 ++
 5 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-multi-pack-index.txt 
b/Documentation/git-multi-pack-index.txt
index 1af406aca2..6186c4c936 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -31,6 +31,11 @@ write::
 verify::
Verify the contents of the MIDX file.
 
+expire::
+   Delete the pack-files that are tracked  by the MIDX file, but
+   have no objects referenced by the MIDX. Rewrite the MIDX file
+   afterward to remove all references to these pack-files.
+
 
 EXAMPLES
 
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index fca70f8e4f..145de3a46c 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -5,7 +5,7 @@
 #include "midx.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
-   N_("git multi-pack-index [--object-dir=<dir>] (write|verify)"),
+   N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire)"),
NULL
 };
 
@@ -44,6 +44,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
return write_midx_file(opts.object_dir);
if (!strcmp(argv[0], "verify"))
return verify_midx_file(opts.object_dir);
+   if (!strcmp(argv[0], "expire"))
+   return expire_midx_packs(opts.object_dir);
 
die(_("unrecognized verb: %s"), argv[0]);
 }
diff --git a/midx.c b/midx.c
index 730ff84dff..bb825ef816 100644
--- a/midx.c
+++ b/midx.c
@@ -1025,3 +1025,8 @@ int verify_midx_file(const char *object_dir)
 
return verify_midx_error;
 }
+
+int expire_midx_packs(const char *object_dir)
+{
+   return 0;
+}
diff --git a/midx.h b/midx.h
index 774f652530..e3a2b740b5 100644
--- a/midx.h
+++ b/midx.h
@@ -49,6 +49,7 @@ int prepare_multi_pack_index_one(struct repository *r, const 
char *object_dir, i
 int write_midx_file(const char *object_dir);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(const char *object_dir);
+int expire_midx_packs(const char *object_dir);
 
 void close_midx(struct multi_pack_index *m);
 
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 70926b5bc0..948effc1ee 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -348,4 +348,51 @@ test_expect_success 'verify incorrect 64-bit offset' '
"incorrect object offset"
 '
 
+test_expect_success 'setup expire tests' '
+   mkdir dup &&
+   (
+   cd dup &&
+   git init &&
+   for i in $(test_seq 1 20)
+   do
+   test_commit $i
+   done &&
+   git branch A HEAD &&
+   git branch B HEAD~8 &&
+   git branch C HEAD~13 &&
+   git branch D HEAD~16 &&
+   git branch E HEAD~18 &&
+   git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
+   refs/heads/E
+   EOF
+   git pack-objects --revs .git/objects/pack/pack-D <<-EOF &&
+   refs/heads/D
+   ^refs/heads/E
+   EOF
+   git pack-objects --revs .git/objects/pack/pack-C <<-EOF &&
+   refs/heads/C
+   ^refs/heads/D
+   EOF
+   git pack-objects --revs .git/objects/pack/pack-B <<-EOF &&
+   refs/heads/B
+   ^refs/heads/C
+   EOF
+   git pack-objects --revs .git/objects/pack/pack-A <<-EOF &&
+   refs/heads/A
+   

[PATCH v2 7/7] midx: implement midx_repack()

2018-12-21 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

To repack using a multi-pack-index, first sort all pack-files by
their modified time. Second, walk those pack-files from oldest
to newest, adding the packs to a list if they are smaller than the
given pack-size. Finally, collect the objects from the multi-pack-
index that are in those packs and send them to 'git pack-objects'.

While first designing a 'git multi-pack-index repack' operation, I
started by collecting the batches based on the size of the objects
instead of the size of the pack-files. This allows repacking a
large pack-file that has very few referenced objects. However, this
came at a significant cost of parsing pack-files instead of simply
reading the multi-pack-index and getting the file information for
the pack-files. This object-size idea could be a direction for
future expansion in this area.
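The batch-selection walk described above (sort by mtime, take packs
smaller than the batch size from oldest to newest until the running
total exceeds the batch size, and do nothing if the threshold is never
reached or fewer than two packs qualify) can be sketched in Python;
this is a hedged model of the logic, not the midx.c code itself.

```python
def select_batch(packs, batch_size):
    """packs: list of (name, mtime, size) tuples.

    Returns the names of packs to repack, oldest first, or [] when the
    batch never reaches batch_size or holds fewer than two packs."""
    batch, total = [], 0
    for name, mtime, size in sorted(packs, key=lambda p: p[1]):
        if total >= batch_size:
            break                 # batch is already large enough
        if size >= batch_size:
            continue              # pack too large to join a batch
        batch.append(name)
        total += size
    if total < batch_size or len(batch) < 2:
        return []
    return batch

packs = [("a", 1, 30), ("b", 2, 40), ("c", 3, 50), ("d", 4, 500)]
print(select_batch(packs, 60))  # ['a', 'b']
```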

Signed-off-by: Derrick Stolee 
---
 midx.c  | 109 +++-
 t/t5319-multi-pack-index.sh |  25 +
 2 files changed, 133 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 127c43f7b0..f6bc111438 100644
--- a/midx.c
+++ b/midx.c
@@ -8,6 +8,7 @@
 #include "sha1-lookup.h"
 #include "midx.h"
 #include "progress.h"
+#include "run-command.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -,7 +1112,113 @@ int expire_midx_packs(const char *object_dir)
return result;
 }
 
-int midx_repack(const char *object_dir, size_t batch_size)
+struct time_and_id {
+   timestamp_t mtime;
+   uint32_t pack_int_id;
+};
+
+static int compare_by_mtime(const void *a_, const void *b_)
 {
+   const struct time_and_id *a, *b;
+
+   a = (const struct time_and_id *)a_;
+   b = (const struct time_and_id *)b_;
+
+   if (a->mtime < b->mtime)
+   return -1;
+   if (a->mtime > b->mtime)
+   return 1;
return 0;
 }
+
+int midx_repack(const char *object_dir, size_t batch_size)
+{
+   int result = 0;
+   uint32_t i, packs_to_repack;
+   size_t total_size;
+   struct time_and_id *pack_ti;
+   unsigned char *include_pack;
+   struct child_process cmd = CHILD_PROCESS_INIT;
+   struct strbuf base_name = STRBUF_INIT;
+   struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
+
+   if (!m)
+   return 0;
+
+   include_pack = xcalloc(m->num_packs, sizeof(unsigned char));
+   pack_ti = xcalloc(m->num_packs, sizeof(struct time_and_id));
+
+   for (i = 0; i < m->num_packs; i++) {
+   pack_ti[i].pack_int_id = i;
+
+   if (prepare_midx_pack(m, i))
+   continue;
+
+   pack_ti[i].mtime = m->packs[i]->mtime;
+   }
+   QSORT(pack_ti, m->num_packs, compare_by_mtime);
+
+   total_size = 0;
+   packs_to_repack = 0;
+   for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
+   int pack_int_id = pack_ti[i].pack_int_id;
+   struct packed_git *p = m->packs[pack_int_id];
+
+   if (!p)
+   continue;
+   if (p->pack_size >= batch_size)
+   continue;
+
+   packs_to_repack++;
+   total_size += p->pack_size;
+   include_pack[pack_int_id] = 1;
+   }
+
+   if (total_size < batch_size || packs_to_repack < 2)
+   goto cleanup;
+
+   argv_array_push(&cmd.args, "pack-objects");
+
+   strbuf_addstr(&base_name, object_dir);
+   strbuf_addstr(&base_name, "/pack/pack");
+   argv_array_push(&cmd.args, base_name.buf);
+   strbuf_release(&base_name);
+
+   cmd.git_cmd = 1;
+   cmd.in = cmd.out = -1;
+
+   if (start_command(&cmd)) {
+   error(_("could not start pack-objects"));
+   result = 1;
+   goto cleanup;
+   }
+
+   for (i = 0; i < m->num_objects; i++) {
+   struct object_id oid;
+   uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
+
+   if (!include_pack[pack_int_id])
+   continue;
+
+   nth_midxed_object_oid(&oid, m, i);
+   xwrite(cmd.in, oid_to_hex(&oid), the_hash_algo->hexsz);
+   xwrite(cmd.in, "\n", 1);
+   }
+   close(cmd.in);
+
+   if (finish_command(&cmd)) {
+   error(_("could not finish pack-objects"));
+   result = 1;
+   goto cleanup;
+   }
+
+   result = write_midx_internal(object_dir, m, NULL);
+   m = NULL;
+
+cleanup:
+   if (m)
+   close_midx(m);
+   free(include_pack);
+   free(pack_ti);
+   return result;
+}
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
i

[PATCH v2 6/7] multi-pack-index: prepare 'repack' subcommand

2018-12-21 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

In an environment where the multi-pack-index is useful, it is due
to many pack-files and an inability to repack the object store
into a single pack-file. However, it is likely that many of these
pack-files are rather small, and could be repacked into a slightly
larger pack-file without too much effort. It may also be important
to ensure the object store is highly available and the repack
operation does not interrupt concurrent git commands.

Introduce a 'repack' subcommand to 'git multi-pack-index' that
takes a '--batch-size' option. The verb will inspect the
multi-pack-index for referenced pack-files whose size is smaller
than the batch size, until collecting a list of pack-files whose
sizes sum to larger than the batch size. Then, a new pack-file
will be created containing the objects from those pack-files that
are referenced by the multi-pack-index. The resulting pack is
likely to actually be smaller than the batch size due to
compression and the fact that there may be objects in the pack-
files that have duplicate copies in other pack-files.

The current change introduces the command-line arguments, and we
add a test that ensures we parse these options properly. Since
we specify a small batch size, we will guarantee that future
implementations do not change the list of pack-files.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-multi-pack-index.txt | 11 +++
 builtin/multi-pack-index.c | 10 +-
 midx.c |  5 +
 midx.h |  1 +
 t/t5319-multi-pack-index.sh| 11 +++
 5 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-multi-pack-index.txt 
b/Documentation/git-multi-pack-index.txt
index 6186c4c936..cc63531cc0 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -36,6 +36,17 @@ expire::
have no objects referenced by the MIDX. Rewrite the MIDX file
afterward to remove all references to these pack-files.
 
+repack::
+   Collect a batch of pack-files whose size are all at most the
+   size given by --batch-size, but whose sizes sum to larger
+   than --batch-size. The batch is selected by greedily adding
+   small pack-files starting with the oldest pack-files that fit
+   the size. Create a new pack-file containing the objects the
+   multi-pack-index indexes into those pack-files, and rewrite
+   the multi-pack-index to contain that pack-file. A later run
+   of 'git multi-pack-index expire' will delete the pack-files
+   that were part of this batch.
+
 
 EXAMPLES
 
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 145de3a46c..d87a2235e3 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -5,12 +5,13 @@
 #include "midx.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
-   N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire)"),
+   N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire|repack --batch-size=<size>)"),
NULL
 };
 
 static struct opts_multi_pack_index {
const char *object_dir;
+   unsigned long batch_size;
 } opts;
 
 int cmd_multi_pack_index(int argc, const char **argv,
@@ -19,6 +20,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
static struct option builtin_multi_pack_index_options[] = {
OPT_FILENAME(0, "object-dir", &opts.object_dir,
  N_("object directory containing set of packfile and 
pack-index pairs")),
+   OPT_MAGNITUDE(0, "batch-size", &opts.batch_size,
+ N_("during repack, collect pack-files of smaller size into a 
batch that is larger than this size")),
OPT_END(),
};
 
@@ -40,6 +43,11 @@ int cmd_multi_pack_index(int argc, const char **argv,
return 1;
}
 
+   if (!strcmp(argv[0], "repack"))
+   return midx_repack(opts.object_dir, (size_t)opts.batch_size);
+   if (opts.batch_size)
+   die(_("--batch-size option is only for 'repack' verb"));
+
if (!strcmp(argv[0], "write"))
return write_midx_file(opts.object_dir);
if (!strcmp(argv[0], "verify"))
diff --git a/midx.c b/midx.c
index 043cd1fd97..127c43f7b0 100644
--- a/midx.c
+++ b/midx.c
@@ -1110,3 +1110,8 @@ int expire_midx_packs(const char *object_dir)
string_list_clear(&packs_to_drop, 0);
return result;
 }
+
+int midx_repack(const char *object_dir, size_t batch_size)
+{
+   return 0;
+}
diff --git a/midx.h b/midx.h
index e3a2b740b5..394a21ee96 100644
--- a/midx.h
+++ b/midx.h
@@ -50,6 +50,7 @@ int write_midx_file(const char *object_dir);
 void clear_midx_file(struct reposit

[PATCH v2 1/7] repack: refactor pack deletion for future use

2018-12-21 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The repack builtin deletes redundant pack-files and their
associated .idx, .promisor, .bitmap, and .keep files. We will want
to re-use this logic in the future for other types of repack, so
pull the logic into 'unlink_pack_path()' in packfile.c.

The 'ignore_keep' parameter is enabled for the use in repack, but
will be important for a future caller.

Signed-off-by: Derrick Stolee 
---
 builtin/repack.c | 14 ++
 packfile.c   | 28 
 packfile.h   |  7 +++
 3 files changed, 37 insertions(+), 12 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 45583683ee..3d445b34b4 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -129,19 +129,9 @@ static void get_non_kept_pack_filenames(struct string_list 
*fname_list,
 
 static void remove_redundant_pack(const char *dir_name, const char *base_name)
 {
-   const char *exts[] = {".pack", ".idx", ".keep", ".bitmap", ".promisor"};
-   int i;
struct strbuf buf = STRBUF_INIT;
-   size_t plen;
-
-   strbuf_addf(&buf, "%s/%s", dir_name, base_name);
-   plen = buf.len;
-
-   for (i = 0; i < ARRAY_SIZE(exts); i++) {
-   strbuf_setlen(&buf, plen);
-   strbuf_addstr(&buf, exts[i]);
-   unlink(buf.buf);
-   }
+   strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
+   unlink_pack_path(buf.buf, 1);
strbuf_release(&buf);
 }
 
diff --git a/packfile.c b/packfile.c
index d1e6683ffe..bacecb4d0d 100644
--- a/packfile.c
+++ b/packfile.c
@@ -352,6 +352,34 @@ void close_all_packs(struct raw_object_store *o)
}
 }
 
+void unlink_pack_path(const char *pack_name, int force_delete)
+{
+   static const char *exts[] = {".pack", ".idx", ".keep", ".bitmap", 
".promisor"};
+   int i;
+   struct strbuf buf = STRBUF_INIT;
+   size_t plen;
+
+   strbuf_addstr(&buf, pack_name);
+   strip_suffix_mem(buf.buf, &buf.len, ".pack");
+   plen = buf.len;
+
+   if (!force_delete) {
+   strbuf_addstr(&buf, ".keep");
+   if (!access(buf.buf, F_OK)) {
+   strbuf_release(&buf);
+   return;
+   }
+   }
+
+   for (i = 0; i < ARRAY_SIZE(exts); i++) {
+   strbuf_setlen(&buf, plen);
+   strbuf_addstr(&buf, exts[i]);
+   unlink(buf.buf);
+   }
+
+   strbuf_release(&buf);
+}
+
 /*
  * The LRU pack is the one with the oldest MRU window, preferring packs
  * with no used windows, or the oldest mtime if it has no windows allocated.
diff --git a/packfile.h b/packfile.h
index 6c4037605d..5b7bcdb1dd 100644
--- a/packfile.h
+++ b/packfile.h
@@ -86,6 +86,13 @@ extern void unuse_pack(struct pack_window **);
 extern void clear_delta_base_cache(void);
 extern struct packed_git *add_packed_git(const char *path, size_t path_len, 
int local);
 
+/*
+ * Unlink the .pack and associated extension files.
+ * Does not unlink if 'force_delete' is false and the pack-file is
+ * marked as ".keep".
+ */
+extern void unlink_pack_path(const char *pack_name, int force_delete);
+
 /*
  * Make sure that a pointer access into an mmap'd index file is within bounds,
  * and can provide at least 8 bytes of data.
-- 
gitgitgadget



[PATCH v2 4/7] midx: refactor permutation logic

2018-12-21 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When writing a multi-pack-index, we keep track of an integer
permutation, tracking the list of pack-files that we know about
(both from the existing multi-pack-index and the new pack-files
being introduced) and converting them into a sorted order for
the new multi-pack-index.

In anticipation of dropping pack-files from the existing multi-
pack-index, refactor the logic around how we track this permutation.

First, insert the permutation into the pack_list structure. This
allows us to grow the permutation dynamically as we add packs.

Second, fill the permutation with values corresponding to their
position in the list of pack-files, sorted as follows:

  1. The pack-files in the existing multi-pack-index,
 sorted lexicographically.

  2. The pack-files not in the existing multi-pack-index,
 sorted as discovered from the filesystem.

There is a subtle thing in how we initialize this permutation,
specifically how we use 'i' for the initial value. This will
matter more when we implement the logic for dropping existing
packs, as we will create holes in the ordering.
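The permutation fill can be sketched as follows; this is a hedged
Python illustration of the idea only (the real code lives in midx.c's
sort_packs_by_name()): packs get integer ids in discovery order
(existing MIDX packs first, then new packs), and the permutation maps
each original id to its position in the final lexicographic order.

```python
def build_permutation(midx_packs, new_packs):
    """midx_packs: pack names already in the MIDX (lexicographically
    sorted); new_packs: names found on disk but not in the MIDX.

    Returns (sorted_names, perm) where perm[original_id] gives the
    position of that pack in the new lexicographic ordering."""
    names = list(midx_packs) + [n for n in new_packs if n not in midx_packs]
    ids = list(range(len(names)))        # initial ids, discovery order
    order = sorted(zip(names, ids))      # final ordering is by name
    perm = [0] * len(names)
    for new_pos, (_, old_id) in enumerate(order):
        perm[old_id] = new_pos
    return [n for n, _ in order], perm

names, perm = build_permutation(["pack-a", "pack-c"], ["pack-b"])
print(names)  # ['pack-a', 'pack-b', 'pack-c']
print(perm)   # [0, 2, 1]
```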

Signed-off-by: Derrick Stolee 
---
 midx.c | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/midx.c b/midx.c
index bb825ef816..3bd7183a53 100644
--- a/midx.c
+++ b/midx.c
@@ -380,9 +380,11 @@ static size_t write_midx_header(struct hashfile *f,
 struct pack_list {
struct packed_git **list;
char **names;
+   uint32_t *perm;
uint32_t nr;
uint32_t alloc_list;
uint32_t alloc_names;
+   uint32_t alloc_perm;
size_t pack_name_concat_len;
struct multi_pack_index *m;
 };
@@ -398,6 +400,7 @@ static void add_pack_to_midx(const char *full_path, size_t 
full_path_len,
 
ALLOC_GROW(packs->list, packs->nr + 1, packs->alloc_list);
ALLOC_GROW(packs->names, packs->nr + 1, packs->alloc_names);
+   ALLOC_GROW(packs->perm, packs->nr + 1, packs->alloc_perm);
 
packs->list[packs->nr] = add_packed_git(full_path,
full_path_len,
@@ -417,6 +420,7 @@ static void add_pack_to_midx(const char *full_path, size_t 
full_path_len,
return;
}
 
+   packs->perm[packs->nr] = packs->nr;
packs->names[packs->nr] = xstrdup(file_name);
packs->pack_name_concat_len += strlen(file_name) + 1;
packs->nr++;
@@ -443,7 +447,7 @@ static void sort_packs_by_name(char **pack_names, uint32_t 
nr_packs, uint32_t *p
ALLOC_ARRAY(pairs, nr_packs);
 
for (i = 0; i < nr_packs; i++) {
-   pairs[i].pack_int_id = i;
+   pairs[i].pack_int_id = perm[i];
pairs[i].pack_name = pack_names[i];
}
 
@@ -755,7 +759,6 @@ int write_midx_file(const char *object_dir)
struct hashfile *f = NULL;
struct lock_file lk;
struct pack_list packs;
-   uint32_t *pack_perm = NULL;
uint64_t written = 0;
uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
@@ -774,18 +777,22 @@ int write_midx_file(const char *object_dir)
 
packs.nr = 0;
packs.alloc_list = packs.m ? packs.m->num_packs : 16;
-   packs.alloc_names = packs.alloc_list;
+   packs.alloc_perm = packs.alloc_names = packs.alloc_list;
packs.list = NULL;
packs.names = NULL;
+   packs.perm = NULL;
packs.pack_name_concat_len = 0;
ALLOC_ARRAY(packs.list, packs.alloc_list);
ALLOC_ARRAY(packs.names, packs.alloc_names);
+   ALLOC_ARRAY(packs.perm, packs.alloc_perm);
 
if (packs.m) {
for (i = 0; i < packs.m->num_packs; i++) {
ALLOC_GROW(packs.list, packs.nr + 1, packs.alloc_list);
ALLOC_GROW(packs.names, packs.nr + 1, 
packs.alloc_names);
+   ALLOC_GROW(packs.perm, packs.nr + 1, packs.alloc_perm);
 
+   packs.perm[packs.nr] = i;
packs.list[packs.nr] = NULL;
packs.names[packs.nr] = xstrdup(packs.m->pack_names[i]);
packs.pack_name_concat_len += 
strlen(packs.names[packs.nr]) + 1;
@@ -802,10 +809,9 @@ int write_midx_file(const char *object_dir)
packs.pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
  (packs.pack_name_concat_len % 
MIDX_CHUNK_ALIGNMENT);
 
-   ALLOC_ARRAY(pack_perm, packs.nr);
-   sort_packs_by_name(packs.names, packs.nr, pack_perm);
+   sort_packs_by_name(packs.names, packs.nr, packs.perm);
 
-   entries = get_sorted_entries(packs.m, packs.list, pack_perm, packs.nr, 
&nr_entries);
+   entries = get_sorted_entries(packs.m, packs.list, packs.perm, packs.nr, 
&nr_entri

[PATCH 0/1] commit-graph: writing missing parents is a BUG

2018-12-19 Thread Derrick Stolee via GitGitGadget
A user complained that they had the following message in a git command:

fatal: invalid parent position 2147483647

In hex, this value is 0x7fffffff, corresponding to the GRAPH_PARENT_MISSING
constant. This constant was intended as a way to have the commit-graph store
commits with parents that are not also in the commit-graph. During
development, however, we chose to require the commit-graph to be closed
under reachability. Thus, this value should never be written, and we don't
fall back to parsing usual commits when we see the constant.
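The arithmetic above is easy to double-check: 2147483647 is the
largest signed 32-bit integer, exactly the bit pattern of the
missing-parent constant.

```python
# 2147483647 decimal == 0x7fffffff hex == 2**31 - 1
assert 2147483647 == 0x7fffffff
assert 2147483647 == 2**31 - 1
print(hex(2147483647))  # 0x7fffffff
```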

This actually happened, and the root cause is unknown. This either means the
reachable closure logic is broken somewhere, or something caused the binary
search to find the parent in our list of commits. This second problem is
more likely, as we have seen RAM issues cause corrupted memory before. I'm
still investigating the root cause, but for now we can hit a BUG() statement
instead of writing bad data.

Thanks, -Stolee

Derrick Stolee (1):
  commit-graph: writing missing parents is a BUG

 commit-graph.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)


base-commit: 0d0ac3826a3bbb9247e39e12623bbcfdd722f24c
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-102%2Fderrickstolee%2Fcommit-graph-bug-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-102/derrickstolee/commit-graph-bug-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/102
-- 
gitgitgadget


[PATCH 1/1] commit-graph: writing missing parents is a BUG

2018-12-19 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When writing a commit-graph, we write GRAPH_PARENT_MISSING if the
parent's object id does not appear in the list of commits to be
written into the commit-graph. This was done as the initial design
allowed commits to have missing parents, but the final version
requires the commit-graph to be closed under reachability. Thus,
this GRAPH_PARENT_MISSING value should never be written.

However, there are reasons why it could be written! These range
from a bug in the reachable-closure code to a memory error causing
the binary search into the list of object ids to fail. In either
case, we should fail fast and avoid writing the commit-graph file
with bad data.

Remove the GRAPH_PARENT_MISSING constant in favor of the constant
GRAPH_EDGE_LAST_MASK, which has the same value.

Signed-off-by: Derrick Stolee 
---
 commit-graph.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 40c855f185..c14ada6918 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -34,7 +34,6 @@
 #define GRAPH_OID_LEN GRAPH_OID_LEN_SHA1
 
 #define GRAPH_OCTOPUS_EDGES_NEEDED 0x80000000
-#define GRAPH_PARENT_MISSING 0x7fffffff
 #define GRAPH_EDGE_LAST_MASK 0x7fffffff
 #define GRAPH_PARENT_NONE 0x70000000
 
@@ -496,7 +495,9 @@ static void write_graph_chunk_data(struct hashfile *f, int 
hash_len,
  commit_to_sha1);
 
if (edge_value < 0)
-   edge_value = GRAPH_PARENT_MISSING;
+   BUG("missing parent %s for commit %s",
+   oid_to_hex(&parent->item->object.oid),
+   oid_to_hex(&(*list)->object.oid));
}
 
hashwrite_be32(f, edge_value);
@@ -514,7 +515,9 @@ static void write_graph_chunk_data(struct hashfile *f, int 
hash_len,
  nr_commits,
  commit_to_sha1);
if (edge_value < 0)
-   edge_value = GRAPH_PARENT_MISSING;
+   BUG("missing parent %s for commit %s",
+   oid_to_hex(&parent->item->object.oid),
+   oid_to_hex(&(*list)->object.oid));
}
 
hashwrite_be32(f, edge_value);
@@ -567,7 +570,9 @@ static void write_graph_chunk_large_edges(struct hashfile 
*f,
  commit_to_sha1);
 
if (edge_value < 0)
-   edge_value = GRAPH_PARENT_MISSING;
+   BUG("missing parent %s for commit %s",
+   oid_to_hex(&parent->item->object.oid),
+   oid_to_hex(&(*list)->object.oid));
else if (!parent->next)
edge_value |= GRAPH_LAST_EDGE;
 
@@ -864,7 +869,7 @@ void write_commit_graph(const char *obj_dir,
count_distinct++;
}
 
-   if (count_distinct >= GRAPH_PARENT_MISSING)
+   if (count_distinct >= GRAPH_EDGE_LAST_MASK)
die(_("the commit graph format cannot write %d commits"), 
count_distinct);
 
commits.nr = 0;
@@ -891,7 +896,7 @@ void write_commit_graph(const char *obj_dir,
}
num_chunks = num_extra_edges ? 4 : 3;
 
-   if (commits.nr >= GRAPH_PARENT_MISSING)
+   if (commits.nr >= GRAPH_EDGE_LAST_MASK)
die(_("too many commits to write graph"));
 
compute_generation_numbers(&commits, report_progress);
-- 
gitgitgadget


Re: commit-graph idea: warn when disabled for incompatible repositories

2018-12-18 Thread Derrick Stolee

On 12/18/2018 1:21 PM, Ævar Arnfjörð Bjarmason wrote:

diff --git a/builtin/gc.c b/builtin/gc.c
index 871a56f1c5..702568b70d 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -662,9 +662,14 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
if (pack_garbage.nr > 0)
clean_pack_garbage();

-   if (gc_write_commit_graph)
+   if (gc_write_commit_graph) {
+   int verbose = !quiet && !daemonized;
+   if (verbose && !commit_graph_compatible(the_repository))
+   warning(_("The `gc.writeCommitGraph' setting is on, "
+ "but commit_graph_compatible() = false"));
write_commit_graph_reachable(get_object_directory(), 0,
-!quiet && !daemonized);
+verbose);
+   }


I actually think this is the wrong place to put it. This will cause a 
warning for someone with 'gc.writeCommitGraph' enabled and running GC on 
a shallow clone.


I think the issue was someone running 'git commit-graph write' inside a 
shallow clone that succeeds without doing anything.


Also, I bet you would hit this block if you run the test suite with 
GIT_TEST_COMMIT_GRAPH=1.


Thanks,

-Stolee



if (auto_gc && too_many_loose_objects())
warning(_("There are too many unreachable loose objects; "
diff --git a/commit-graph.c b/commit-graph.c
index 40c855f185..60915bf9aa 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -61,7 +61,7 @@ static struct commit_graph *alloc_commit_graph(void)

  extern int read_replace_refs;

-static int commit_graph_compatible(struct repository *r)
+int commit_graph_compatible(struct repository *r)
  {
if (!r->gitdir)
return 0;
diff --git a/commit-graph.h b/commit-graph.h
index 9db40b4d3a..7c92d41a28 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -12,6 +12,8 @@ struct commit;

  char *get_commit_graph_filename(const char *obj_dir);

+int commit_graph_compatible(struct repository *r);
+
  /*
   * Given a commit struct, try to fill the commit struct info, including:
   *  1. tree object


This part looks correct, and necessary for the warning in 
builtin/commit-graph.c


Thanks,

-Stolee



Re: commit-graph idea: warn when disabled for incompatible repositories

2018-12-18 Thread Derrick Stolee

On 12/18/2018 9:22 AM, Thomas Ferris Nicolaisen wrote:

Hey there,


Hi, Thomas!


Accidentally, my first use-case was a local copy of a big repository
(chromium source) that another developer had cloned shallow (I wasn't
even aware of their clone being shallow at that point).

After spending a little time trying to figure out why no commit-graph
file was being created, I noticed that it worked just fine testing in
a fresh git repo.

Then I discovered the .git/shallow file in the big repo. So I did
fetch --unshallow, and commit-graph started working. Taking a 20
second log --graph operation down to less than a second (wooo!).

I saw some recent release notes that mentions that commit-graph is
disabled in incompatible repositories (graft, replace). I assume this
is also the case for shallow clones.


The commit-graph feature is not designed to work well with these 
features, and the "easy" fix we have so far is to completely avoid the 
interaction. The tricky bits come in when the commit parents can 
"change" according to these other features. The commit-graph would need 
to change at the same time, and this is actually very difficult to get 
correct.



Here's the idea that may help others on the same path: Some warning
output when attempting to use commit-graph when it is disabled (either
by configuration or automatically).

I think others that come across commit-graph may have tried such
tricks (like shallow clones) to better work with their repositories,
and it could be frustrating that commit-graph has no apparent effect.


This is a good idea, and I would happily review a patch that added such 
a warning.


The warning would want to be in builtin/commit-graph.c, and use the 
commit_graph_compatible() method from commit-graph.c. (You'll need to 
expose the method in the .h file.)


Thanks!

-Stolee



Re: [PATCH v4 4/6] revision: implement sparse algorithm

2018-12-17 Thread Derrick Stolee

On 12/17/2018 9:26 AM, Ævar Arnfjörð Bjarmason wrote:

On Mon, Dec 17 2018, Derrick Stolee wrote:



As for adding progress to this step, I'm open to it. It can be done as
a sequel series.

Okey. To clarify I wasn't complaining about the lack of progress output,
we didn't have it before, just clarifying (as I've found out now) that
when you're talking about "enumerating objects" in your commit message
it's *not* what we're doing when we show the "Enumerating objects"
progress bar, but an unrelated step prior to that.
Part of the problem is that in builtin/pack-objects.c, we have the 
following:


    if (progress)
            progress_state = start_progress(_("Enumerating objects"), 0);

    if (!use_internal_rev_list)
    read_object_list_from_stdin();
    else {
    get_object_list(rp.argc, rp.argv);
    argv_array_clear(&rp);
    }
    cleanup_preferred_base();
    if (include_tag && nr_result)
    for_each_ref(add_ref_tag, NULL);
    stop_progress(&progress_state);

and the logic for walking uninteresting objects is the 
mark_edges_uninteresting() inside get_object_list() (both entirely 
contained in this progress state). Perhaps the best thing to do is to 
untangle the progress for the two modes based on 
'use_internal_rev_list'. Could we then have the progress for 
get_object_list() be "Walking objects" instead?


Thanks,
-Stolee


Re: [PATCH v4 4/6] revision: implement sparse algorithm

2018-12-17 Thread Derrick Stolee

On 12/14/2018 6:32 PM, Ævar Arnfjörð Bjarmason wrote:

On Fri, Dec 14 2018, Derrick Stolee via GitGitGadget wrote:


Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.

We spend a long time printing those out before we ever get to
"Enumerating objects".

Which was where I was trying to test this, i.e. is this a lot of work we
perform before we print out the progress bar, and regardless of this
optimization should have other progress output there, so we can see this
time we're spending on this?


It is true that part of the problem is that a 'git push' will sit for a 
while without presenting any feedback until this part of the algorithm 
is complete. The current series intends to significantly reduce this time.


As for adding progress to this step, I'm open to it. It can be done as a 
sequel series.


What would we use to describe this section? "Enumerating remote objects"?

Thanks,
-Stolee


Re: Git blame performance on files with a lot of history

2018-12-14 Thread Derrick Stolee

On 12/14/2018 1:29 PM, Clement Moyroud wrote:

My group at work is migrating a CVS repo to Git. The biggest issue we
face so far is the performance of git blame, especially compared to
CVS on the same file. One file especially causes us trouble: it's a
30k lines file with 25 years of history in 3k+ commits. The complete
repo has 200k+ commits over that same period of time.


I think the 30k lines is the bigger issue than the 200k+ commits. I'm 
not terribly familiar with the blame code, though.



Currently, 'cvs annotate' takes 2.7 seconds, while 'git blame'
(without -M nor -C) takes 145s.

I tried using the commit-graph with the Bloom filter, per
https://public-inbox.org/git/61559c5b-546e-d61b-d2e1-68de692f5...@gmail.com/.


Thanks for the interest in this prototype feature. Sorry that it doesn't 
appear to help you in this case. It should definitely be a follow-up 
when that feature gets polished to production-quality.

Looking at the blame code, it does not seem to be able to use the
commit graph, so I tried the same rev-list command from the e-mail,
using my own file:
 > GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y
/path/to/git rev-list --count --full-history HEAD -- important/file.C
 3576

Please double-check that you have the 'core.commitGraph' config setting 
enabled, or you will not read the commit-graph at run-time:


    git config core.commitGraph true

I see that the commit introducing GIT_TRACE_BLOOM_FILTER [1] does 
nothing if the commit-graph is not loaded.


Thanks,
-Stolee

[1] 
https://github.com/derrickstolee/git/commit/adc469894b755512c9d02f099700ead2a7a78377


[PATCH v4 1/6] revision: add mark_tree_uninteresting_sparse

2018-12-14 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

In preparation for a new algorithm that walks fewer trees when
creating a pack from a set of revisions, create a method that
takes an oidset of tree oids and marks reachable objects as
UNINTERESTING.

The current implementation uses the existing
mark_tree_uninteresting to recursively walk the trees and blobs.
This will walk the same number of trees as the old mechanism.

There is one new assumption in this approach: we are also given
the oids of the interesting trees. This implementation does not
use those trees at the moment, but we will use them in a later
rewrite of this method.

Signed-off-by: Derrick Stolee 
---
 revision.c | 25 +
 revision.h |  2 ++
 2 files changed, 27 insertions(+)

diff --git a/revision.c b/revision.c
index 13e0519c02..f9eb6400f1 100644
--- a/revision.c
+++ b/revision.c
@@ -99,6 +99,31 @@ void mark_tree_uninteresting(struct repository *r, struct 
tree *tree)
mark_tree_contents_uninteresting(r, tree);
 }
 
+void mark_trees_uninteresting_sparse(struct repository *r,
+struct oidset *set)
+{
+   struct object_id *oid;
+   struct oidset_iter iter;
+
+   oidset_iter_init(set, &iter);
+   while ((oid = oidset_iter_next(&iter))) {
+   struct tree *tree = lookup_tree(r, oid);
+
+   if (!tree)
+   continue;
+
+   if (tree->object.flags & UNINTERESTING) {
+   /*
+* Remove the flag so the next call
+* is not a no-op. The flag is added
+* in mark_tree_uninteresting().
+*/
+   tree->object.flags ^= UNINTERESTING;
+   mark_tree_uninteresting(r, tree);
+   }
+   }
+}
+
 struct commit_stack {
struct commit **items;
size_t nr, alloc;
diff --git a/revision.h b/revision.h
index 7987bfcd2e..f828e91ae9 100644
--- a/revision.h
+++ b/revision.h
@@ -67,6 +67,7 @@ struct rev_cmdline_info {
 #define REVISION_WALK_NO_WALK_SORTED 1
 #define REVISION_WALK_NO_WALK_UNSORTED 2
 
+struct oidset;
 struct topo_walk_info;
 
 struct rev_info {
@@ -327,6 +328,7 @@ void put_revision_mark(const struct rev_info *revs,
 
 void mark_parents_uninteresting(struct commit *commit);
 void mark_tree_uninteresting(struct repository *r, struct tree *tree);
+void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set);
 
 void show_object_with_name(FILE *, struct object *, const char *);
 
-- 
gitgitgadget



[PATCH v4 5/6] pack-objects: create pack.useSparse setting

2018-12-14 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The '--sparse' flag in 'git pack-objects' changes the algorithm
used to enumerate objects to one that is faster for individual
users pushing new objects that change only a small cone of the
working directory. The sparse algorithm is not recommended for a
server, which likely sends new objects that appear across the
entire working directory.

Create a 'pack.useSparse' setting that enables this new algorithm.
This allows 'git push' to use this algorithm without passing a
'--sparse' flag all the way through four levels of run_command()
calls.

If the '--no-sparse' flag is set, then this config setting is
overridden.

Signed-off-by: Derrick Stolee 
---
 Documentation/config/pack.txt  |  9 +
 builtin/pack-objects.c |  4 
 t/t5322-pack-objects-sparse.sh | 15 +++
 3 files changed, 28 insertions(+)

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index edac75c83f..425c73aa52 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -105,6 +105,15 @@ pack.useBitmaps::
true. You should not generally need to turn this off unless
you are debugging pack bitmaps.
 
+pack.useSparse::
+   When true, git will default to using the '--sparse' option in
+   'git pack-objects' when the '--revs' option is present. This
+   algorithm only walks trees that appear in paths that introduce new
+   objects. This can have significant performance benefits when
+   computing a pack to send a small change. However, it is possible
+   that extra objects are added to the pack-file if the included
+   commits contain certain types of direct renames.
+
 pack.writeBitmaps (deprecated)::
This is a deprecated synonym for `repack.writeBitmaps`.
 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 7d5b0735e3..124b1bafc4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, 
void *cb)
use_bitmap_index_default = git_config_bool(k, v);
return 0;
}
+   if (!strcmp(k, "pack.usesparse")) {
+   sparse = git_config_bool(k, v);
+   return 0;
+   }
if (!strcmp(k, "pack.threads")) {
delta_search_threads = git_config_int(k, v);
if (delta_search_threads < 0)
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 45dba6e014..8f5699bd91 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -121,4 +121,19 @@ test_expect_success 'sparse pack-objects' '
test_cmp expect_sparse_objects.txt sparse_objects.txt
 '
 
+test_expect_success 'pack.useSparse enables algorithm' '
+   git config pack.useSparse true &&
+   git pack-objects --stdout --revs <packinput.txt >sparse.pack &&
+   git index-pack -o sparse.idx sparse.pack &&
+   git show-index <sparse.idx | cut -d" " -f2 >sparse_objects.txt &&
+   test_cmp expect_sparse_objects.txt sparse_objects.txt
+'
+
+test_expect_success 'pack.useSparse overridden' '
+   git pack-objects --stdout --revs --no-sparse <packinput.txt >sparse.pack &&
+   git index-pack -o sparse.idx sparse.pack &&
+   git show-index <sparse.idx | cut -d" " -f2 >sparse_objects.txt &&
+   test_cmp expect_objects.txt sparse_objects.txt
+'
+
 test_done
-- 
gitgitgadget



[PATCH v4 4/6] revision: implement sparse algorithm

2018-12-14 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commits we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.

This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.

The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.

We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.

To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the interesting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINTERESTING trees.
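The grouping-and-pruning rule above can be sketched with a small Python
model (illustrative only; git's implementation is in C and keys a hashmap
by path with oidsets as values, while here trees are plain dicts and
Python object identity stands in for object ids):

```python
def sparse_walk(tree_sets):
    """tree_sets: list of (tree, uninteresting) pairs found at one path.

    Trees map entry names to subtrees (dicts) or blobs (strings).
    Returns the number of trees parsed."""
    walked = len(tree_sets)
    children = {}  # entry name -> {id(subtree): (subtree, uninteresting)}
    for tree, uninteresting in tree_sets:
        for name, entry in tree.items():
            if not isinstance(entry, dict):
                continue  # blobs are not recursed into here
            group = children.setdefault(name, {})
            prev = group.get(id(entry))
            # a tree reached from any uninteresting side stays UNINTERESTING
            flag = uninteresting or (prev[1] if prev else False)
            group[id(entry)] = (entry, flag)
    for group in children.values():
        flags = {flag for _, flag in group.values()}
        if flags == {True, False}:
            # mixed interesting/uninteresting trees: keep recursing
            walked += sparse_walk(list(group.values()))
        # all-UNINTERESTING: prune; all-interesting: left to the final
        # object-collection walk
    return walked

shared = {"data.txt": "blob-a"}                         # unchanged subtree
old_root = {"a": shared, "b": {"data.txt": "blob-b"}}   # UNINTERESTING side
new_root = {"a": shared, "b": {"data.txt": "blob-c"}}   # interesting side

walked = sparse_walk([(old_root, True), (new_root, False)])
# walks the two roots plus the two versions of "b" (4 trees total);
# the shared "a" subtree is grouped as all-UNINTERESTING and pruned
```

The demo mirrors a typical small push: only the path that actually changed
("b") is recursed into.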

There are a few ways that this is not a universally better option.

First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. We add a test case that
demonstrates this as a way to prove that the --sparse option is
actually working.

Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.

Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.

For git.git, I used the following input:

v2.19.0
^v2.19.0~10

 Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130

For the Linux repo, I used the following input:

v4.18
^v4.18~10

 Objects to pack:   518
Walked (old alg): 4,836
Walked (new alg):   188

The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.

 Objects to pack:220
Walked (old alg): 22,804
Walked (new alg):129

I used the number of walked trees as the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.

These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.

Signed-off-by: Derrick Stolee 
---
 revision.c | 139 ++---
 t/t5322-pack-objects-sparse.sh |  21 +++--
 2 files changed, 144 insertions(+), 16 deletions(-)

diff --git a/revision.c b/revision.c
index f9eb6400f1..63bf6230dc 100644
--- a/revision.c
+++ b/revision.c
@@ -27,6 +27,7 @@
 #include "commit-reac

[PATCH v4 6/6] pack-objects: create GIT_TEST_PACK_SPARSE

2018-12-14 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

Create a test variable GIT_TEST_PACK_SPARSE to enable the sparse
object walk algorithm by default during the test suite. Enabling
this variable ensures coverage in many interesting cases, such as
shallow clones, partial clones, and missing objects.

Signed-off-by: Derrick Stolee 
---
 builtin/pack-objects.c | 1 +
 t/README   | 4 
 t/t5322-pack-objects-sparse.sh | 6 +++---
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 124b1bafc4..507d381153 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3331,6 +3331,7 @@ int cmd_pack_objects(int argc, const char **argv, const 
char *prefix)
 
read_replace_refs = 0;
 
+   sparse = git_env_bool("GIT_TEST_PACK_SPARSE", 0);
reset_pack_idx_option(&pack_idx_opts);
git_config(git_pack_config, NULL);
 
diff --git a/t/README b/t/README
index 28711cc508..8b6dfe1864 100644
--- a/t/README
+++ b/t/README
@@ -342,6 +342,10 @@ GIT_TEST_INDEX_VERSION= exercises the index read/write 
code path
 for the index version specified.  Can be set to any valid version
 (currently 2, 3, or 4).
 
+GIT_TEST_PACK_SPARSE= if enabled will default the pack-objects
+builtin to use the sparse object walk. This can still be overridden by
+the --no-sparse command-line argument.
+
 GIT_TEST_PRELOAD_INDEX= exercises the preload-index code path
 by overriding the minimum number of cache entries required per thread.
 
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 8f5699bd91..e8cf41d1c6 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -36,7 +36,7 @@ test_expect_success 'setup repo' '
 '
 
 test_expect_success 'non-sparse pack-objects' '
-   git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+   git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack &&
git index-pack -o nonsparse.idx nonsparse.pack &&
git show-index <nonsparse.idx | cut -d" " -f2 >nonsparse_objects.txt &&
test_cmp expect_objects.txt nonsparse_objects.txt
@@ -70,7 +70,7 @@ test_expect_success 'duplicate a folder from f3 and commit to 
topic1' '
 '
 
 test_expect_success 'non-sparse pack-objects' '
-   git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+   git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack &&
git index-pack -o nonsparse.idx nonsparse.pack &&
git show-index <nonsparse.idx | cut -d" " -f2 >nonsparse_objects.txt &&
test_cmp expect_objects.txt nonsparse_objects.txt
@@ -102,7 +102,7 @@ test_expect_success 'non-sparse pack-objects' '
topic1  \
topic1^{tree}   \
topic1:f3 | sort >expect_objects.txt &&
-   git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+   git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack &&
git index-pack -o nonsparse.idx nonsparse.pack &&
git show-index <nonsparse.idx | cut -d" " -f2 >nonsparse_objects.txt &&
test_cmp expect_objects.txt nonsparse_objects.txt
-- 
gitgitgadget


[PATCH v4 3/6] pack-objects: add --sparse option

2018-12-14 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

Add a '--sparse' option flag to the pack-objects builtin. This
allows the user to specify that they want to use the new logic
for walking trees. This logic currently does not differ from the
existing output, but will in a later change.

Create a new test script, t5322-pack-objects-sparse.sh, to ensure
the object list that is selected matches what we expect. When we
update the logic to walk in a sparse fashion, the final test will
be updated to show the extra objects that are added.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-pack-objects.txt |  11 ++-
 builtin/pack-objects.c |   5 +-
 t/t5322-pack-objects-sparse.sh | 115 +
 3 files changed, 129 insertions(+), 2 deletions(-)
 create mode 100755 t/t5322-pack-objects-sparse.sh

diff --git a/Documentation/git-pack-objects.txt 
b/Documentation/git-pack-objects.txt
index 40c825c381..e45f3e680d 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -14,7 +14,7 @@ SYNOPSIS
[--local] [--incremental] [--window=] [--depth=]
[--revs [--unpacked | --all]] [--keep-pack=]
[--stdout [--filter=] | base-name]
-   [--shallow] [--keep-true-parents] < object-list
+   [--shallow] [--keep-true-parents] [--sparse] < object-list
 
 
 DESCRIPTION
@@ -196,6 +196,15 @@ depth is 4095.
Add --no-reuse-object if you want to force a uniform compression
level on all data no matter the source.
 
+--sparse::
+   Use the "sparse" algorithm to determine which objects to include in
+   the pack, when combined with the "--revs" option. This algorithm
+   only walks trees that appear in paths that introduce new objects.
+   This can have significant performance benefits when computing
+   a pack to send a small change. However, it is possible that extra
+   objects are added to the pack-file if the included commits contain
+   certain types of direct renames.
+
 --thin::
Create a "thin" pack by omitting the common objects between a
sender and a receiver in order to reduce network transfer. This
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5f70d840a7..7d5b0735e3 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -84,6 +84,7 @@ static unsigned long pack_size_limit;
 static int depth = 50;
 static int delta_search_threads;
 static int pack_to_stdout;
+static int sparse;
 static int thin;
 static int num_preferred_base;
 static struct progress *progress_state;
@@ -3135,7 +3136,7 @@ static void get_object_list(int ac, const char **av)
 
if (prepare_revision_walk(&revs))
die(_("revision walk setup failed"));
-   mark_edges_uninteresting(&revs, show_edge, 0);
+   mark_edges_uninteresting(&revs, show_edge, sparse);
 
if (!fn_show_object)
fn_show_object = show_object;
@@ -3292,6 +3293,8 @@ int cmd_pack_objects(int argc, const char **argv, const 
char *prefix)
{ OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"),
  N_("unpack unreachable objects newer than "),
  PARSE_OPT_OPTARG, option_parse_unpack_unreachable },
+   OPT_BOOL(0, "sparse", &sparse,
+N_("use the sparse reachability algorithm")),
OPT_BOOL(0, "thin", &thin,
 N_("create thin packs")),
OPT_BOOL(0, "shallow", &shallow,
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
new file mode 100755
index 00..81f6805bc3
--- /dev/null
+++ b/t/t5322-pack-objects-sparse.sh
@@ -0,0 +1,115 @@
+#!/bin/sh
+
+test_description='pack-objects object selection using sparse algorithm'
+. ./test-lib.sh
+
+test_expect_success 'setup repo' '
+   test_commit initial &&
+   for i in $(test_seq 1 3)
+   do
+   mkdir f$i &&
+   for j in $(test_seq 1 3)
+   do
+   mkdir f$i/f$j &&
+   echo $j >f$i/f$j/data.txt
+   done
+   done &&
+   git add . &&
+   git commit -m "Initialized trees" &&
+   for i in $(test_seq 1 3)
+   do
+   git checkout -b topic$i master &&
+   echo change-$i >f$i/f$i/data.txt &&
+   git commit -a -m "Changed f$i/f$i/data.txt"
+   done &&
+   cat >packinput.txt <<-EOF &&
+   topic1
+   ^topic2
+   ^topic3
+   EOF
+   git rev-parse   \
+   topic1  \
+   topic1^{tree}   \
+   topic1:f1   \
+   topic1

[PATCH v4 2/6] list-objects: consume sparse tree walk

2018-12-14 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When creating a pack-file using 'git pack-objects --revs' we provide
a list of interesting and uninteresting commits. For example, a push
operation would make the local topic branch be interesting and the
known remote refs as uninteresting. We want to discover the set of
new objects to send to the server as a thin pack.

We walk these commits until we discover a frontier of commits such
that every commit walk starting at interesting commits ends in a root
commit or uninteresting commit. We then need to discover which
non-commit objects are reachable from uninteresting commits. This
commit walk is not changing during this series.

The mark_edges_uninteresting() method in list-objects.c iterates on
the commit list and does the following:

* If the commit is UNINTERESTING, then mark its root tree and every
  object it can reach as UNINTERESTING.

* If the commit is interesting, then mark the root tree of every
  UNINTERESTING parent (and all objects that tree can reach) as
  UNINTERESTING.

At the very end, we repeat the process on every commit directly
given to the revision walk from stdin. This helps ensure we properly
cover shallow commits that otherwise were not included in the
frontier.

The logic to recursively follow trees is in the
mark_tree_uninteresting() method in revision.c. The algorithm avoids
duplicate work by not recursing into trees that are already marked
UNINTERESTING.

Add a new 'sparse' option to the mark_edges_uninteresting() method
that performs this logic in a slightly new way. As we iterate over
the commits, we add all of the root trees to an oidset. Then, call
mark_trees_uninteresting_sparse() on that oidset. Note that we
include interesting trees in this process. The current implementation
of mark_trees_uninteresting_sparse() will walk the same trees as
the old logic, but this will be replaced in a later change.
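An illustrative Python model (not git's C code) of the restructuring
described above: rather than recursively marking trees per commit, collect
every root tree into one set, record which of them were reached from the
uninteresting side, and hand the whole set to a single marking pass
afterwards. The Commit class and string tree ids are stand-ins.

```python
class Commit:
    def __init__(self, root_tree, parents=(), uninteresting=False):
        self.root_tree = root_tree
        self.parents = list(parents)
        self.uninteresting = uninteresting

def collect_edge_trees(commits):
    trees = set()          # stands in for the oidset
    uninteresting = set()  # root trees that start out UNINTERESTING
    for c in commits:
        trees.add(c.root_tree)  # interesting root trees included too
        if c.uninteresting:
            uninteresting.add(c.root_tree)
        for p in c.parents:
            if p.uninteresting:
                trees.add(p.root_tree)
                uninteresting.add(p.root_tree)
    return trees, uninteresting

base = Commit("tree-base", uninteresting=True)
tip = Commit("tree-tip", parents=[base])
trees, flagged = collect_edge_trees([tip, base])
# trees == {"tree-tip", "tree-base"}; only "tree-base" is flagged, and
# one mark_trees_uninteresting_sparse()-style pass would start there
```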

The sparse option is not used by any callers at the moment, but
will be wired to 'git pack-objects' in the next change.

Signed-off-by: Derrick Stolee 
---
 bisect.c   |  2 +-
 builtin/pack-objects.c |  2 +-
 builtin/rev-list.c |  2 +-
 http-push.c|  2 +-
 list-objects.c | 70 +++---
 list-objects.h |  4 ++-
 6 files changed, 66 insertions(+), 16 deletions(-)

diff --git a/bisect.c b/bisect.c
index 487675c672..842f8b4b8f 100644
--- a/bisect.c
+++ b/bisect.c
@@ -656,7 +656,7 @@ static void bisect_common(struct rev_info *revs)
if (prepare_revision_walk(revs))
die("revision walk setup failed");
if (revs->tree_objects)
-   mark_edges_uninteresting(revs, NULL);
+   mark_edges_uninteresting(revs, NULL, 0);
 }
 
 static void exit_if_skipped_commits(struct commit_list *tried,
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 411aefd687..5f70d840a7 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3135,7 +3135,7 @@ static void get_object_list(int ac, const char **av)
 
if (prepare_revision_walk(&revs))
die(_("revision walk setup failed"));
-   mark_edges_uninteresting(&revs, show_edge);
+   mark_edges_uninteresting(&revs, show_edge, 0);
 
if (!fn_show_object)
fn_show_object = show_object;
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 2880ed37e3..9663cbfae0 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -543,7 +543,7 @@ int cmd_rev_list(int argc, const char **argv, const char 
*prefix)
if (prepare_revision_walk(&revs))
die("revision walk setup failed");
if (revs.tree_objects)
-   mark_edges_uninteresting(&revs, show_edge);
+   mark_edges_uninteresting(&revs, show_edge, 0);
 
if (bisect_list) {
int reaches, all;
diff --git a/http-push.c b/http-push.c
index cd48590912..ea52d6f9f6 100644
--- a/http-push.c
+++ b/http-push.c
@@ -1933,7 +1933,7 @@ int cmd_main(int argc, const char **argv)
pushing = 0;
if (prepare_revision_walk(&revs))
die("revision walk setup failed");
-   mark_edges_uninteresting(&revs, NULL);
+   mark_edges_uninteresting(&revs, NULL, 0);
objects_to_send = get_delta(&revs, ref_lock);
finish_all_active_slots();
 
diff --git a/list-objects.c b/list-objects.c
index c41cc80db5..fb728f7842 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -222,25 +222,73 @@ static void mark_edge_parents_uninteresting(struct commit 
*commit,
}
 }
 
-void mark_edges_uninteresting(struct rev_info *revs, show_edge_fn show_edge)
+static void add_edge_parents(struct commit *commit,
+struct rev_info *revs,
+show_edge_fn show_edge,
+struc

[PATCH v4 0/6] Add a new "sparse" tree walk algorithm

2018-12-14 Thread Derrick Stolee via GitGitGadget
One of the biggest remaining pain points for users of very large
repositories is the time it takes to run 'git push'. We inspected some slow
pushes by our developers and found that the "Enumerating Objects" phase of a
push was very slow. This is unsurprising, because this is why reachability
bitmaps exist. However, reachability bitmaps are not available to us because
of the single pack-file requirement. The bitmap approach is intended for
servers anyway, and clients have a much different behavior pattern.

Specifically, clients are normally pushing a very small number of objects
compared to the entire working directory. A typical user changes only a
small cone of the working directory, so let's use that to our benefit.

Create a new "sparse" mode for 'git pack-objects' that uses the paths that
introduce new objects to direct our search into the reachable trees. By
collecting trees at each path, we can then recurse into a path only when
there are uninteresting and interesting trees at that path. This gains a
significant performance boost for small topics while presenting a
possibility of packing extra objects.

The main algorithm change is in patch 4, but is set up a little bit in
patches 1 and 2.

As demonstrated in the included test script, we see that the existing
algorithm can send extra objects due to the way we specify the "frontier".
But we can send even more objects if a user copies objects from one folder
to another. I say "copy" because a rename would (usually) change the
original folder and trigger a walk into that path, discovering the objects.

In order to benefit from this approach, the user can opt-in using the
pack.useSparse config setting. This setting can be overridden using the
'--no-sparse' option.
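A sketch of the precedence this describes (a Python model, not git's
parse-options code): pack.useSparse supplies the default, and an explicit
--sparse or --no-sparse on the command line overrides it.

```python
def resolve_sparse(config, argv):
    # config default; git normalizes config keys to lowercase
    sparse = config.get("pack.usesparse", False)
    for arg in argv:  # last command-line flag wins
        if arg == "--sparse":
            sparse = True
        elif arg == "--no-sparse":
            sparse = False
    return sparse
```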

Update in V2: 

 * Added GIT_TEST_PACK_SPARSE test option.
 * Fixed test breakages when GIT_TEST_PACK_SPARSE is enabled by adding null
   checks.

Update in V3:

 * Change documentation around 'pack.useSparse' config setting to better
   advertise to users.
 * Mostly a ping now that v2.20.0 is out.

Updates in V4:

 * Switched to using hashmap instead of string_list for the path/oidset
   dictionary. (This is due to some fear that the string_list performance
   would degrade for a very wide tree, but I am unable to measure a
   performance difference.)
 * Some cleanup of code snippets across commits.
 * Some grammar cleanup in the commit messages.

Derrick Stolee (6):
  revision: add mark_tree_uninteresting_sparse
  list-objects: consume sparse tree walk
  pack-objects: add --sparse option
  revision: implement sparse algorithm
  pack-objects: create pack.useSparse setting
  pack-objects: create GIT_TEST_PACK_SPARSE

 Documentation/config/pack.txt  |   9 ++
 Documentation/git-pack-objects.txt |  11 ++-
 bisect.c   |   2 +-
 builtin/pack-objects.c |  10 +-
 builtin/rev-list.c |   2 +-
 http-push.c|   2 +-
 list-objects.c |  70 +++---
 list-objects.h |   4 +-
 revision.c | 144 +
 revision.h |   2 +
 t/README   |   4 +
 t/t5322-pack-objects-sparse.sh | 139 
 12 files changed, 382 insertions(+), 17 deletions(-)
 create mode 100755 t/t5322-pack-objects-sparse.sh


base-commit: a1598010f775d82b5adf12c29d0f5bc9b41434c6
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-89%2Fderrickstolee%2Fpush%2Fsparse-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-89/derrickstolee/push/sparse-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/89

Range-diff vs v3:

 1:  60617681f7 ! 1:  817e30a287 revision: add mark_tree_uninteresting_sparse
 @@ -35,6 +35,9 @@
  + while ((oid = oidset_iter_next(&iter))) {
  + struct tree *tree = lookup_tree(r, oid);
  +
 ++ if (!tree)
 ++ continue;
 ++
  + if (tree->object.flags & UNINTERESTING) {
  + /*
  +  * Remove the flag so the next call
 2:  4527addacb ! 2:  39dc89beb9 list-objects: consume sparse tree walk
 @@ -11,9 +11,10 @@
  We walk these commits until we discover a frontier of commits such
  that every commit walk starting at interesting commits ends in a root
  commit or unintersting commit. We then need to discover which
 -non-commit objects are reachable from  uninteresting commits.
 +non-commit objects are reachable from  uninteresting commits. This
 +commit walk is not changing during this series.
  
 -The mark_edges_unintersting() method in list-objects.c iterates on
 +The mark_edges_uninteresting() method in list-objects.c iterates on
  the commit list and does the following:
   

Re: [PATCH v2 1/2] .gitattributes: ensure t/oid-info/* has eol=lf

2018-12-13 Thread Derrick Stolee

On 12/12/2018 9:38 PM, Junio C Hamano wrote:

"Derrick Stolee via GitGitGadget"  writes:


From: Derrick Stolee 

The new test_oid machinery in the test library requires reading
some information from t/oid-info/hash-info and t/oid-info/oid.
The shell logic that reads from these files is sensitive to CRLF
line endings, causing a problem when the test suite is run on a
Windows machine that converts LF to CRLF.

Exclude the files in this folder from this conversion.

Signed-off-by: Derrick Stolee 
---

It seems that this step is identical to v1, with which SZEDER Gábor
had trouble (cf. <20181212133945.gv30...@szeder.dev>).  I am
guessing that the review and v2 e-mails have crossed.

FWIW, I personally do not think "is sensitive to CRLF" is too bad,
as my attempt to clarify it does not make it much better, e.g.

The logic to read from these files in shell uses built-in
"read" command, which leaves CR at the end of these text
files when they are checked out with CRLF line endings, at
least when run with bash shipped with Git for Windows.  This
results in an unexpected value in the variable these lines
are read into, leading the tests to fail.
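
The CR retention described above can be reproduced in a few lines of
plain shell (a hypothetical minimal reproduction, not part of the patch,
using a made-up input file name):

```shell
# Feed a CRLF-terminated key-value line to the 'read' builtin, the same
# way test_oid_cache reads the t/oid-info/* files.
printf 'rawsz sha1:20\r\n' >crlf-input
read tag rest <crlf-input
rm -f crlf-input

k=${rest%:*}     # "sha1" -- the key is unaffected
v=${rest#*:}     # "20" followed by a stray carriage return

# 'read' strips the trailing LF but not the CR, so the value is dirty.
cr=$(printf '\r')
if [ "$v" = "20$cr" ]; then
    echo "value still carries the CR"
fi
```

Marking the files eol=lf avoids the conversion entirely, so the stray CR
never reaches the shell in the first place.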

So, I'll keep what I queued when I received v1 for now.


Sorry for missing the edit to the message. You are correct that v2 just 
added one commit.


I like your rewording, if you feel like editing it.

Thanks,

-Stolee



Re: [PATCH on master v2] revision: use commit graph in get_reference()

2018-12-13 Thread Derrick Stolee

On 12/12/2018 8:27 PM, Jeff King wrote:

On Wed, Dec 12, 2018 at 11:58:12AM -0800, Jonathan Tan wrote:


Yeah, this was the part that took me a bit to figure out, as well. The
optimization here is really just avoiding a call to lookup_commit(),
which will do a single hash-table lookup. I wonder if that's actually
worth this more complex interface (as opposed to just always taking an
oid and then always returning a "struct commit", which could be old or
new).

Avoidance of lookup_commit() is more important than an optimization, I
think. Here, we call lookup_commit() only when we know that that object
is a commit (by its presence in a commit graph). If we just called it
blindly, we might mistakenly create a commit for that hash when it is
actually an object of another type. (We could inline lookup_commit() in
parse_commit_in_graph_one(), removing the object creation part, but that
adds complexity as well.)

I was thinking we would only do so in the happy path when we find a
commit. I.e., something like:

   obj = lookup_object(oid); /* does not auto-vivify */
   if (obj && obj->parsed)
return obj;

   if (we_have_it_in_commit_graph) {
commit = obj || lookup_commit(oid);
fill_in_details_from_commit_graph(commit);
return &commit->obj;
   } else {
return parse_object(oid);
   }

which is more along the lines of that parse_probably_commit() that
Stolee mentioned.


This approach is what I had in mind. Thanks for making it more concrete!

-Stolee



[PATCH v2 0/2] Add more eol=lf to .gitattributes

2018-12-12 Thread Derrick Stolee via GitGitGadget
I noticed that our CI builds (see [1] for an example) were returning success
much faster than they did before Git v2.20.0. Turns out that there was a
test script failure involving the new test hash logic.

error: bug in the test script: bad hash algorithm
make[1]: *** [Makefile:56: t-basic.sh] Error 1
make[1]: *** Waiting for unfinished jobs

This failure was due to an LF -> CRLF conversion in some data files in
t/oid-info/. Don't munge these files, and the tests can continue.

UPDATED IN V2: Unfortunately, I didn't check the full test suite after
getting beyond this point, and found another LF -> CRLF conversion problem
in t4256-am-format-flowed.sh due to example patches in t/t4256/1/. Add these
in a second commit. Thanks, dscho, for helping with the correct placement.

Thanks, -Stolee

[1] https://gvfs.visualstudio.com/ci/_build/results?buildId=4815

Derrick Stolee (1):
  .gitattributes: ensure t/oid-info/* has eol=lf

Johannes Schindelin (1):
  t4256: mark support files as LF-only

 .gitattributes   | 1 +
 t/.gitattributes | 1 +
 2 files changed, 2 insertions(+)


base-commit: 5d826e972970a784bd7a7bdf587512510097b8c7
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-98%2Fderrickstolee%2Ftest-oid-fix-windows-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-98/derrickstolee/test-oid-fix-windows-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/98

Range-diff vs v1:

 1:  4a22502a31 = 1:  4a22502a31 .gitattributes: ensure t/oid-info/* has eol=lf
 -:  -- > 2:  4275b8a581 t4256: mark support files as LF-only

-- 
gitgitgadget


[PATCH v2 1/2] .gitattributes: ensure t/oid-info/* has eol=lf

2018-12-12 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The new test_oid machinery in the test library requires reading
some information from t/oid-info/hash-info and t/oid-info/oid.
The shell logic that reads from these files is sensitive to CRLF
line endings, causing a problem when the test suite is run on a
Windows machine that converts LF to CRLF.

Exclude the files in this folder from this conversion.

Signed-off-by: Derrick Stolee 
---
 .gitattributes | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.gitattributes b/.gitattributes
index acf853e029..3738cea7eb 100644
--- a/.gitattributes
+++ b/.gitattributes
@@ -13,3 +13,4 @@
 /Documentation/gitk.txt conflict-marker-size=32
 /Documentation/user-manual.txt conflict-marker-size=32
 /t/t-*.sh conflict-marker-size=32
+/t/oid-info/* eol=lf
-- 
gitgitgadget



[PATCH 0/1] .gitattributes: ensure t/oid-info/* has eol=lf

2018-12-11 Thread Derrick Stolee via GitGitGadget
I noticed that our CI builds (see [1] for an example) were returning success
much faster than they did before Git v2.20.0. Turns out that there was a
test script failure involving the new test hash logic.

error: bug in the test script: bad hash algorithm
make[1]: *** [Makefile:56: t-basic.sh] Error 1
make[1]: *** Waiting for unfinished jobs

The reason this passed was because we were running this command in our
build:

make GIT_TEST_OPTS=--quiet -j 10 -C t ||
make GIT_TEST_OPTS=\"-i -v -x\" -k -C t failed 

The first make failed, but the test script did not record any failed tests
and hence the second command succeeded.

The test bug relates to this function added by 2c02b110d "t: add test
functions to translate hash-related values":

+# Load key-value pairs from stdin suitable for use with test_oid.  Blank lines
+# and lines starting with "#" are ignored.  Keys must be shell identifier
+# characters.
+#
+# Examples:
+# rawsz sha1:20
+# rawsz sha256:32
+test_oid_cache () {
+   local tag rest k v &&
+
+   { test -n "$test_hash_algo" || test_detect_hash; } &&
+   while read tag rest
+   do
+   case $tag in
+   \#*)
+   continue;;
+   ?*)
+   # non-empty
+   ;;
+   *)
+   # blank line
+   continue;;
+   esac &&
+
+   k="${rest%:*}" &&
+   v="${rest#*:}" &&
+
+   if ! expr "$k" : '[a-z0-9][a-z0-9]*$' >/dev/null
+   then
+   error 'bug in the test script: bad hash algorithm'
+   fi &&
+   eval "test_oid_${k}_$tag=\"\$v\""
+   done
+}

The verbose output included these values. Note how '\r' appears in the
variable values.

++ test_oid_init
++ test -n ''
++ test_detect_hash
++ test_hash_algo=sha1
++ test_oid_cache
++ local tag rest k v
++ test -n sha1
++ read tag rest
++ case $tag in
++ k=sha1
++ v=$'20\r'
++ expr sha1 : '[a-z0-9][a-z0-9]*$'
++ eval 'test_oid_sha1_rawsz="$v"'
+++ test_oid_sha1_rawsz=$'20\r'
++ read tag rest
++ case $tag in
++ k=sha256
++ v=$'32\r'
++ expr sha256 : '[a-z0-9][a-z0-9]*$'
++ eval 'test_oid_sha256_rawsz="$v"'
+++ test_oid_sha256_rawsz=$'32\r'
++ read tag rest
++ case $tag in
++ k=
++ v=
++ expr '' : '[a-z0-9][a-z0-9]*$'

[1] https://gvfs.visualstudio.com/ci/_build/results?buildId=4815

Derrick Stolee (1):
  .gitattributes: ensure t/oid-info/* has eol=lf

 .gitattributes | 1 +
 1 file changed, 1 insertion(+)


base-commit: 5d826e972970a784bd7a7bdf587512510097b8c7
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-98%2Fderrickstolee%2Ftest-oid-fix-windows-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-98/derrickstolee/test-oid-fix-windows-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/98
-- 
gitgitgadget


[PATCH 1/1] .gitattributes: ensure t/oid-info/* has eol=lf

2018-12-11 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The new test_oid machinery in the test library requires reading
some information from t/oid-info/hash-info and t/oid-info/oid.
The shell logic that reads from these files is sensitive to CRLF
line endings, causing a problem when the test suite is run on a
Windows machine that converts LF to CRLF.

Exclude the files in this folder from this conversion.

Signed-off-by: Derrick Stolee 
---
 .gitattributes | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.gitattributes b/.gitattributes
index acf853e029..3738cea7eb 100644
--- a/.gitattributes
+++ b/.gitattributes
@@ -13,3 +13,4 @@
 /Documentation/gitk.txt conflict-marker-size=32
 /Documentation/user-manual.txt conflict-marker-size=32
 /t/t-*.sh conflict-marker-size=32
+/t/oid-info/* eol=lf
-- 
gitgitgadget


Re: [PATCH 5/5] midx: implement midx_repack()

2018-12-11 Thread Derrick Stolee

On 12/10/2018 9:32 PM, Stefan Beller wrote:

On Mon, Dec 10, 2018 at 10:06 AM Derrick Stolee via GitGitGadget
 wrote:

From: Derrick Stolee 

To repack using a multi-pack-index, first sort all pack-files by
their modified time. Second, walk those pack-files from oldest
to newest, adding the packs to a list if they are smaller than the
given pack-size. Finally, collect the objects from the multi-pack-
index that are in those packs and send them to 'git pack-objects'.

Makes sense.

With this operation we only coalesce some packfiles into a new
pack file. So to perform the "complete" repack this command
has to be run repeatedly until there is at most one packfile
left that is smaller than batch size.


Well, the batch size essentially means "If a pack-file is larger than 
, then leave it be. I'm happy with packs that large." This assumes 
that the reason the pack is that large is because it was already 
combined with other packs or contains a lot of objects.




Imagine the following scenario:

   There are 5 packfiles A, B, C, D, E,
   created last Monday thru Friday (A is oldest, E youngest).
   The sizes are [A=4, B=6, C=5, D=5, E=4]

   You'd issue a repack with batch size=10, such that
   A and B would be repacked into F, which is
   created today, size is less or equal than 10.

   You issue another repack tomorrow, which then would
   coalesce C and D to G, which is
   dated tomorrow, size is less or equal to 10 as well.

   You issue a third repack, which then takes E
   (as it is the oldest) and would probably find F as the
   next oldest (assuming it is less than 10), to repack
   into H.

   H is then comprised of A, B and E, and G is C+D.

In a way these repacks, always picking up the oldest,
sound like you "roll forward" objects into new packs.
As the new packs are newest (we have no packs from
the future), we'd cycle through different packs to look at
for packing on each repacking.

It is however more likely that content is more similar
on a temporal basis. (e.g. I am boldly claiming that
[ABC, DE] would take less space than [ABE, CD]
as produced above).

(The obvious solution to this hypothetical would be
to backdate the resulting pack to the youngest pack
that is input to the new pack, but I dislike fudging with
the time a file is created/touched, so let's not go there)


This raises a good point about what happens when we "roll over" into the 
"repacked" packs.


I'm not claiming that this is an optimal way to save space, but is a way 
to incrementally collect small packs into slightly larger packs, all 
while not interrupting concurrent Git commands. Reducing pack count 
improves data locality, which is my goal here. In our environment, we do 
see reduced space as a benefit, even if it is not optimal.




Would the object count make sense as input instead of
the pack date?



While first designing a 'git multi-pack-index repack' operation, I
started by collecting the batches based on the size of the objects
instead of the size of the pack-files. This allows repacking a
large pack-file that has very few referencd objects. However, this

referenced


came at a significant cost of parsing pack-files instead of simply
reading the multi-pack-index and getting the file information for
the pack-files. This object-size idea could be a direction for
future expansion in this area.

Ah, that also explains why the above idea is toast.

Would it make sense to extend or annotate the midx file
to give hints at which packs are easy to combine?

I guess such an "annotation worker" could run in a separate
thread / pool with the lowest priority as this seems like a
decent fallback for the lack of any better information how
to pick the packfiles.


One idea I had earlier (and is in 
Documentation/technical/multi-pack-index.txt) is to have the midx track 
metadata about pack-files. We could avoid this "rollover" problem by 
tracking which packs were repacked using this mechanism. This could 
create a "pack generation" value, and we could collect a batch of packs 
that have the same generation. This does seem a bit overcomplicated for 
the potential benefit, and could waste better use of that metadata 
concept. For instance, we could use the metadata to track the 
information given by ".keep" and ".promisor" files.


Thanks,

-Stolee



Re: [PATCH 4/5] multi-pack-index: prepare 'repack' verb

2018-12-11 Thread Derrick Stolee

On 12/10/2018 8:54 PM, Stefan Beller wrote:

On Mon, Dec 10, 2018 at 10:06 AM Derrick Stolee via GitGitGadget
 wrote:

From: Derrick Stolee 

In an environment where the multi-pack-index is useful, it is due
to many pack-files and an inability to repack the object store
into a single pack-file. However, it is likely that many of these
pack-files are rather small, and could be repacked into a slightly
larger pack-file without too much effort. It may also be important
to ensure the object store is highly available and the repack
operation does not interrupt concurrent git commands.

Introduce a 'repack' verb to 'git multi-pack-index' that takes a
'--batch-size' option. The verb will inspect the multi-pack-index
for referenced pack-files whose size is smaller than the batch
size, until collecting a list of pack-files whose sizes sum to
larger than the batch size. Then, a new pack-file will be created
containing the objects from those pack-files that are referenced
by the multi-pack-index. The resulting pack is likely to actually
be smaller than the batch size due to compression and the fact
that there may be objects in the pack-files that have duplicate
copies in other pack-files.
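
As a rough illustration, the oldest-first greedy selection can be
sketched in shell with made-up pack names and sizes (a hypothetical
walk-through of the rule, not the C implementation):

```shell
# Packs listed oldest-first as name:size, with a batch size of 10.
batch_size=10
total=0
batch=""
for entry in A:4 B:6 C:5 D:5 E:4
do
    # Stop once the accumulated batch has crossed the batch size.
    [ "$total" -lt "$batch_size" ] || break
    size=${entry#*:}
    name=${entry%:*}
    # Packs at least as large as the batch size are left alone.
    [ "$size" -lt "$batch_size" ] || continue
    batch="$batch$name "
    total=$((total + size))
done
if [ "$total" -lt "$batch_size" ]
then
    echo "not enough small packs; do nothing"
else
    echo "repack: $batch(total $total)"
fi
```

With these sizes the batch comes out as A and B (total 10), matching the
A+B scenario discussed later in this thread; if the loop ends with the
total still below the batch size, no repack happens at all.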

This highlights an optimization problem: How do we pick the
batches optimally?
Ideally we'd pick packs that have an overlap of many
same objects to dedup them completely, next best would
be to have objects that are very similar, such that they delta
very well.
(Assuming that the sum of the resulting pack sizes is a metric
we'd want to optimize for eventually)

For now it seems we just take a random cut of "small" packs.


The current change introduces the command-line arguments, and we
add a test that ensures we parse these options properly. Since
we specify a small batch size, we will guarantee that future
implementations do not change the list of pack-files.

This is another clever trick that makes the test correct
despite no implementation yet. :-)


+repack::
+   When given as the verb, collect a batch of pack-files whose
+   size are all at most the size given by --batch-size,

okay.


  but
+   whose sizes sum to larger than --batch-size.

or less if there are not enough packs.


If there are not enough packs to reach the batch size, then we do nothing.


Now it would be interesting if we can specify an upper bound
(e.g. my laptop has 8GB of ram, can I use this incremental
repacking optimally by telling git to make batches of at most
7.5G), the answer seems to follow...


Well, this gets us to the "unsigned long" problem and how it pervades 
the OPT_MAGNITUDE() code and things that use that kind of parameter. 
This means that Windows users can specify a maximum of (1 << 32) - 1. 
This size is intended to be large enough to make a reasonable change in 
the pack organization without rewriting the entire repo (e.g. 1GB). If 
there is another word than "repack" that means "collect objects from a 
subset of pack-files and place them into a new pack-file, doing a small 
amount of redeltification in the process" then I am open to suggestions.


I tried (briefly) to fix this dependence, but it got too large for me to 
handle at the same time as this change. I'll consider revisiting it 
again later.



   The batch is
+   selected by greedily adding small pack-files starting with
+   the oldest pack-files that fit the size. Create a new pack-
+   file containing the objects the multi-pack-index indexes
+   into thos pack-files, and rewrite the multi-pack-index to

those


+   contain that pack-file. A later run of 'git multi-pack-index
+   expire' will delete the pack-files that were part of this
+   batch.

... but the optimization seems to be rather about getting rid
of the oldest packs first instead of getting as close to the batch
size. (e.g. another way to look at this is to "find the permutation
of all packs that (each are smaller than batch size), but in sum
are the smallest threshold above the batch size).


You are describing the subset-sum problem, with an additional 
optimization component. While there are dynamic programming approaches 
that are usually effective (if the sum is small), this problem is 
NP-complete, and hence could lead to complications.



I guess that the strategy of picking the oldest is just easiest
to implement and should be sufficient for now, but memory
bounds might be interesting to keep in mind, just as the
optimal packing from above.


It is easy to implement, and fast. Further, we do have a heuristic that 
the pack modified time correlates with the time the objects were 
introduced to the repository and hence may compress well when placed 
together.


Thanks,

-Stolee



Re: [PATCH 1/5] multi-pack-index: prepare for 'expire' verb

2018-12-11 Thread Derrick Stolee

On 12/10/2018 8:59 PM, SZEDER Gábor wrote:

On Mon, Dec 10, 2018 at 05:35:28PM -0800, Stefan Beller wrote:

On Mon, Dec 10, 2018 at 10:06 AM Derrick Stolee via GitGitGadget
 wrote:

+expire::
+   When given as the verb,

Can it be given in another way? Or rather "if the verb is expire",
then ...
(I just checked the current man page, and both write and verify use
this pattern as well. I find it strange as this first part of the sentence
conveys little information, but is repeated 3 times now (once for
each verb)).

Maybe we can restructure the man page to have it more like

 The following verbs are available:
 +write::
 +create a new MIDX file, writing to /packs/multi-pack-index.
 +
 +verify::
 +verify the contents ...

I think a s/verb/subcommand/ would help a lot, too, because that's
what we call it everywhere else.


Thanks, both. V2 will include a new patch that reformats the doc to use 
these suggestions, then extend it for the new subcommand.


-Stolee



[PATCH 3/5] multi-pack-index: implement 'expire' verb

2018-12-10 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The 'git multi-pack-index expire' command looks at the existing
multi-pack-index, counts the number of objects referenced in each
pack-file, deletes the pack-files with no referenced objects, and
rewrites the multi-pack-index to no longer reference those packs.

Refactor the write_midx_file() method to call write_midx_internal()
which now takes an existing 'struct multi_pack_index' and a list
of pack-files to drop (as specified by the names of their pack-
indexes). As we write the new multi-pack-index, we drop those
file names from the list of known pack-files.

The expire_midx_packs() method removes the unreferenced pack-files
after carefully closing the packs to avoid open handles.

Test that a new pack-file that covers the contents of two other
pack-files leads to those pack-files being deleted during the
expire command. Be sure to read the multi-pack-index to ensure
it no longer references those packs.

Signed-off-by: Derrick Stolee 
---
 midx.c  | 87 +++--
 t/t5319-multi-pack-index.sh | 15 +++
 2 files changed, 98 insertions(+), 4 deletions(-)

diff --git a/midx.c b/midx.c
index 3bd7183a53..50e4cd7270 100644
--- a/midx.c
+++ b/midx.c
@@ -751,7 +751,8 @@ static size_t write_midx_large_offsets(struct hashfile *f, 
uint32_t nr_large_off
return written;
 }
 
-int write_midx_file(const char *object_dir)
+static int write_midx_internal(const char *object_dir, struct multi_pack_index 
*m,
+  struct string_list *packs_to_drop)
 {
unsigned char cur_chunk, num_chunks = 0;
char *midx_name;
@@ -765,6 +766,7 @@ int write_midx_file(const char *object_dir)
uint32_t nr_entries, num_large_offsets = 0;
struct pack_midx_entry *entries = NULL;
int large_offsets_needed = 0;
+   int result = 0;
 
midx_name = get_midx_filename(object_dir);
if (safe_create_leading_directories(midx_name)) {
@@ -773,7 +775,10 @@ int write_midx_file(const char *object_dir)
  midx_name);
}
 
-   packs.m = load_multi_pack_index(object_dir, 1);
+   if (m)
+   packs.m = m;
+   else
+   packs.m = load_multi_pack_index(object_dir, 1);
 
packs.nr = 0;
packs.alloc_list = packs.m ? packs.m->num_packs : 16;
@@ -787,7 +792,24 @@ int write_midx_file(const char *object_dir)
ALLOC_ARRAY(packs.perm, packs.alloc_perm);
 
if (packs.m) {
+   int drop_index = 0, missing_drops = 0;
for (i = 0; i < packs.m->num_packs; i++) {
+   if (packs_to_drop && drop_index < packs_to_drop->nr) {
+   int cmp = strcmp(packs.m->pack_names[i],
+
packs_to_drop->items[drop_index].string);
+
+   if (!cmp) {
+   drop_index++;
+   continue;
+   } else if (cmp > 0) {
+   error(_("did not see pack-file %s to 
drop"),
+ 
packs_to_drop->items[drop_index].string);
+   drop_index++;
+   i--;
+   missing_drops++;
+   }
+   }
+
ALLOC_GROW(packs.list, packs.nr + 1, packs.alloc_list);
ALLOC_GROW(packs.names, packs.nr + 1, 
packs.alloc_names);
ALLOC_GROW(packs.perm, packs.nr + 1, packs.alloc_perm);
@@ -798,6 +820,12 @@ int write_midx_file(const char *object_dir)
packs.pack_name_concat_len += 
strlen(packs.names[packs.nr]) + 1;
packs.nr++;
}
+
+   if (packs_to_drop && (drop_index < packs_to_drop->nr || 
missing_drops)) {
+   error(_("did not see all pack-files to drop"));
+   result = 1;
+   goto cleanup;
+   }
}
 
for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &packs);
@@ -932,7 +960,12 @@ cleanup:
free(packs.perm);
free(entries);
free(midx_name);
-   return 0;
+   return result;
+}
+
+int write_midx_file(const char *object_dir)
+{
+   return write_midx_internal(object_dir, NULL, NULL);
 }
 
 void clear_midx_file(struct repository *r)
@@ -1034,5 +1067,51 @@ int verify_midx_file(const char *object_dir)
 
 int expire_midx_packs(const char *object_dir)
 {
-   return 0;
+   uint32_t i, *count, result = 0;
+   size_t dirlen;
+   struct strbuf buf = STRBUF_INIT;
+   struct string_list packs_to_drop = STRING_LIST_INIT_DUP;
+   struct multi_pack_index *m = load

[PATCH 5/5] midx: implement midx_repack()

2018-12-10 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

To repack using a multi-pack-index, first sort all pack-files by
their modified time. Second, walk those pack-files from oldest
to newest, adding the packs to a list if they are smaller than the
given pack-size. Finally, collect the objects from the multi-pack-
index that are in those packs and send them to 'git pack-objects'.

While first designing a 'git multi-pack-index repack' operation, I
started by collecting the batches based on the size of the objects
instead of the size of the pack-files. This allows repacking a
large pack-file that has very few referenced objects. However, this
came at a significant cost of parsing pack-files instead of simply
reading the multi-pack-index and getting the file information for
the pack-files. This object-size idea could be a direction for
future expansion in this area.

Signed-off-by: Derrick Stolee 
---
 midx.c  | 109 +++-
 t/t5319-multi-pack-index.sh |  25 +
 2 files changed, 133 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 4caf148464..3718e78132 100644
--- a/midx.c
+++ b/midx.c
@@ -8,6 +8,7 @@
 #include "sha1-lookup.h"
 #include "midx.h"
 #include "progress.h"
+#include "run-command.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -1116,7 +1117,113 @@ int expire_midx_packs(const char *object_dir)
return result;
 }
 
-int midx_repack(const char *object_dir, size_t batch_size)
+struct time_and_id {
+   timestamp_t mtime;
+   uint32_t pack_int_id;
+};
+
+static int compare_by_mtime(const void *a_, const void *b_)
 {
+   const struct time_and_id *a, *b;
+
+   a = (const struct time_and_id *)a_;
+   b = (const struct time_and_id *)b_;
+
+   if (a->mtime < b->mtime)
+   return -1;
+   if (a->mtime > b->mtime)
+   return 1;
return 0;
 }
+
+int midx_repack(const char *object_dir, size_t batch_size)
+{
+   int result = 0;
+   uint32_t i, packs_to_repack;
+   size_t total_size;
+   struct time_and_id *pack_ti;
+   unsigned char *include_pack;
+   struct child_process cmd = CHILD_PROCESS_INIT;
+   struct strbuf base_name = STRBUF_INIT;
+   struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
+
+   if (!m)
+   return 0;
+
+   include_pack = xcalloc(m->num_packs, sizeof(unsigned char));
+   pack_ti = xcalloc(m->num_packs, sizeof(struct time_and_id));
+
+   for (i = 0; i < m->num_packs; i++) {
+   pack_ti[i].pack_int_id = i;
+
+   if (prepare_midx_pack(m, i))
+   continue;
+
+   pack_ti[i].mtime = m->packs[i]->mtime;
+   }
+   QSORT(pack_ti, m->num_packs, compare_by_mtime);
+
+   total_size = 0;
+   packs_to_repack = 0;
+   for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
+   int pack_int_id = pack_ti[i].pack_int_id;
+   struct packed_git *p = m->packs[pack_int_id];
+
+   if (!p)
+   continue;
+   if (p->pack_size >= batch_size)
+   continue;
+
+   packs_to_repack++;
+   total_size += p->pack_size;
+   include_pack[pack_int_id] = 1;
+   }
+
+   if (total_size < batch_size || packs_to_repack < 2)
+   goto cleanup;
+
+   argv_array_push(&cmd.args, "pack-objects");
+
+   strbuf_addstr(&base_name, object_dir);
+   strbuf_addstr(&base_name, "/pack/pack");
+   argv_array_push(&cmd.args, base_name.buf);
+   strbuf_release(&base_name);
+
+   cmd.git_cmd = 1;
+   cmd.in = cmd.out = -1;
+
+   if (start_command(&cmd)) {
+   error(_("could not start pack-objects"));
+   result = 1;
+   goto cleanup;
+   }
+
+   for (i = 0; i < m->num_objects; i++) {
+   struct object_id oid;
+   uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
+
+   if (!include_pack[pack_int_id])
+   continue;
+
+   nth_midxed_object_oid(&oid, m, i);
+   xwrite(cmd.in, oid_to_hex(&oid), the_hash_algo->hexsz);
+   xwrite(cmd.in, "\n", 1);
+   }
+   close(cmd.in);
+
+   if (finish_command(&cmd)) {
+   error(_("could not finish pack-objects"));
+   result = 1;
+   goto cleanup;
+   }
+
+   result = write_midx_internal(object_dir, m, NULL);
+   m = NULL;
+
+cleanup:
+   if (m)
+   close_midx(m);
+   free(include_pack);
+   free(pack_ti);
+   return result;
+}
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh

[PATCH 2/5] midx: refactor permutation logic

2018-12-10 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When writing a multi-pack-index, we keep track of an integer
permutation, tracking the list of pack-files that we know about
(both from the existing multi-pack-index and the new pack-files
being introduced) and converting them into a sorted order for
the new multi-pack-index.

In anticipation of dropping pack-files from the existing multi-
pack-index, refactor the logic around how we track this permutation.

First, insert the permutation into the pack_list structure. This
allows us to grow the permutation dynamically as we add packs.

Second, fill the permutation with values corresponding to their
position in the list of pack-files, sorted as follows:

  1. The pack-files in the existing multi-pack-index,
 sorted lexicographically.

  2. The pack-files not in the existing multi-pack-index,
 sorted as discovered from the filesystem.

There is a subtle thing in how we initialize this permutation,
specifically how we use 'i' for the initial value. This will
matter more when we implement the logic for dropping existing
packs, as we will create holes in the ordering.

Signed-off-by: Derrick Stolee 
---
 midx.c | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/midx.c b/midx.c
index bb825ef816..3bd7183a53 100644
--- a/midx.c
+++ b/midx.c
@@ -380,9 +380,11 @@ static size_t write_midx_header(struct hashfile *f,
 struct pack_list {
struct packed_git **list;
char **names;
+   uint32_t *perm;
uint32_t nr;
uint32_t alloc_list;
uint32_t alloc_names;
+   uint32_t alloc_perm;
size_t pack_name_concat_len;
struct multi_pack_index *m;
 };
@@ -398,6 +400,7 @@ static void add_pack_to_midx(const char *full_path, size_t 
full_path_len,
 
ALLOC_GROW(packs->list, packs->nr + 1, packs->alloc_list);
ALLOC_GROW(packs->names, packs->nr + 1, packs->alloc_names);
+   ALLOC_GROW(packs->perm, packs->nr + 1, packs->alloc_perm);
 
packs->list[packs->nr] = add_packed_git(full_path,
full_path_len,
@@ -417,6 +420,7 @@ static void add_pack_to_midx(const char *full_path, size_t 
full_path_len,
return;
}
 
+   packs->perm[packs->nr] = packs->nr;
packs->names[packs->nr] = xstrdup(file_name);
packs->pack_name_concat_len += strlen(file_name) + 1;
packs->nr++;
@@ -443,7 +447,7 @@ static void sort_packs_by_name(char **pack_names, uint32_t 
nr_packs, uint32_t *p
ALLOC_ARRAY(pairs, nr_packs);
 
for (i = 0; i < nr_packs; i++) {
-   pairs[i].pack_int_id = i;
+   pairs[i].pack_int_id = perm[i];
pairs[i].pack_name = pack_names[i];
}
 
@@ -755,7 +759,6 @@ int write_midx_file(const char *object_dir)
struct hashfile *f = NULL;
struct lock_file lk;
struct pack_list packs;
-   uint32_t *pack_perm = NULL;
uint64_t written = 0;
uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
@@ -774,18 +777,22 @@ int write_midx_file(const char *object_dir)
 
packs.nr = 0;
packs.alloc_list = packs.m ? packs.m->num_packs : 16;
-   packs.alloc_names = packs.alloc_list;
+   packs.alloc_perm = packs.alloc_names = packs.alloc_list;
packs.list = NULL;
packs.names = NULL;
+   packs.perm = NULL;
packs.pack_name_concat_len = 0;
ALLOC_ARRAY(packs.list, packs.alloc_list);
ALLOC_ARRAY(packs.names, packs.alloc_names);
+   ALLOC_ARRAY(packs.perm, packs.alloc_perm);
 
if (packs.m) {
for (i = 0; i < packs.m->num_packs; i++) {
ALLOC_GROW(packs.list, packs.nr + 1, packs.alloc_list);
ALLOC_GROW(packs.names, packs.nr + 1, 
packs.alloc_names);
+   ALLOC_GROW(packs.perm, packs.nr + 1, packs.alloc_perm);
 
+   packs.perm[packs.nr] = i;
packs.list[packs.nr] = NULL;
packs.names[packs.nr] = xstrdup(packs.m->pack_names[i]);
packs.pack_name_concat_len += 
strlen(packs.names[packs.nr]) + 1;
@@ -802,10 +809,9 @@ int write_midx_file(const char *object_dir)
packs.pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
  (packs.pack_name_concat_len % 
MIDX_CHUNK_ALIGNMENT);
 
-   ALLOC_ARRAY(pack_perm, packs.nr);
-   sort_packs_by_name(packs.names, packs.nr, pack_perm);
+   sort_packs_by_name(packs.names, packs.nr, packs.perm);
 
-   entries = get_sorted_entries(packs.m, packs.list, pack_perm, packs.nr, 
&nr_entries);
+   entries = get_sorted_entries(packs.m, packs.list, packs.perm, packs.nr, 
&nr_entri

[PATCH 0/5] Create 'expire' and 'repack' verbs for git-multi-pack-index

2018-12-10 Thread Derrick Stolee via GitGitGadget
The multi-pack-index provides a fast way to find an object among a large
list of pack-files. It stores a single pack-reference for each object id, so
duplicate objects are ignored. Among a list of pack-files storing the same
object, the most-recently modified one is used.
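The tie-break rule can be sketched as follows. This is an illustration with hypothetical types (prefer_newest_pack is not a git function): among the packs containing a duplicated object, the entry from the pack with the newest modification time wins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Given per-pack modification times and the indices of the packs
 * that contain a particular object, return the index of the
 * most-recently modified candidate pack.
 */
static size_t prefer_newest_pack(const int64_t *pack_mtime,
				 const size_t *candidate_packs, size_t nr)
{
	size_t best = candidate_packs[0];
	for (size_t i = 1; i < nr; i++)
		if (pack_mtime[candidate_packs[i]] > pack_mtime[best])
			best = candidate_packs[i];
	return best;
}
```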

Create new verbs for the multi-pack-index builtin.

 * 'git multi-pack-index expire': If we have a pack-file indexed by the
   multi-pack-index, but all objects in that pack are duplicated in
   more-recently modified packs, then delete that pack (and any others like
   it). Delete the reference to that pack in the multi-pack-index.
   
   
 * 'git multi-pack-index repack --batch-size=<size>': Starting from the oldest
   pack-files covered by the multi-pack-index, find those whose on-disk size
   is below the batch size until we have a collection of packs whose sizes
   add up to the batch size. Create a new pack containing all objects that
   the multi-pack-index references in those packs.
   
   

This allows us to create a new pattern for repacking objects: run 'repack'.
After enough time has passed that all Git commands that started before the
last 'repack' are finished, run 'expire' again. This approach has some
advantages over the existing "repack everything" model:

 1. Incremental. We can repack a small batch of objects at a time, instead
of repacking all reachable objects. We can also limit ourselves to the
objects that do not appear in newer pack-files.


 2. Highly Available. By adding a new pack-file (and not deleting the old
pack-files) we do not interrupt concurrent Git commands, and do not
suffer performance degradation. By expiring only pack-files that have no
referenced objects, we know that Git commands that are doing normal
object lookups* will not be interrupted.


 3. Note: if someone concurrently runs a Git command that uses
get_all_packs(), then that command could try to read the pack-files and
pack-indexes that we are deleting during an expire command. Such
commands are usually related to object maintenance (i.e. fsck, gc,
pack-objects) or are related to less-often-used features (i.e.
fast-import, http-backend, server-info).



We plan to use this approach in VFS for Git to do background maintenance of
the "shared object cache" which is a Git alternate directory filled with
packfiles containing commits and trees. We currently download pack-files on
an hourly basis to keep up-to-date with the central server. The cache
servers supply packs on an hourly and daily basis, so most of the hourly
packs become useless after a new daily pack is downloaded. The 'expire'
command would clear out most of those packs, but many will still remain with
fewer than 100 objects remaining. The 'repack' command (with a batch size of
1-3gb, probably) can condense the remaining packs in commands that run for
1-3 min at a time. Since the daily packs range from 100-250mb, we will also
combine and condense those packs.

Thanks, -Stolee

Derrick Stolee (5):
  multi-pack-index: prepare for 'expire' verb
  midx: refactor permutation logic
  multi-pack-index: implement 'expire' verb
  multi-pack-index: prepare 'repack' verb
  midx: implement midx_repack()

 Documentation/git-multi-pack-index.txt |  20 +++
 builtin/multi-pack-index.c |  12 +-
 midx.c | 222 +++--
 midx.h |   2 +
 t/t5319-multi-pack-index.sh|  98 +++
 5 files changed, 343 insertions(+), 11 deletions(-)


base-commit: 26aa9fc81d4c7f6c3b456a29da0b7ec72e5c6595
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-92%2Fderrickstolee%2Fmidx-expire%2Fupstream-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-92/derrickstolee/midx-expire/upstream-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/92
-- 
gitgitgadget


[PATCH 4/5] multi-pack-index: prepare 'repack' verb

2018-12-10 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

In an environment where the multi-pack-index is useful, it is due
to many pack-files and an inability to repack the object store
into a single pack-file. However, it is likely that many of these
pack-files are rather small, and could be repacked into a slightly
larger pack-file without too much effort. It may also be important
to ensure the object store is highly available and the repack
operation does not interrupt concurrent git commands.

Introduce a 'repack' verb to 'git multi-pack-index' that takes a
'--batch-size' option. The verb will inspect the multi-pack-index
for referenced pack-files whose size is smaller than the batch
size, until collecting a list of pack-files whose sizes sum to
larger than the batch size. Then, a new pack-file will be created
containing the objects from those pack-files that are referenced
by the multi-pack-index. The resulting pack is likely to actually
be smaller than the batch size due to compression and the fact
that there may be objects in the pack-files that have duplicate
copies in other pack-files.
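The greedy batch selection described above can be sketched in a few lines. This is an illustration under stated assumptions, not the eventual implementation (the function name select_batch is hypothetical): packs at or above the batch size are skipped, and collection stops once the accumulated sizes exceed the batch size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Walk packs from oldest to newest. Skip any pack whose size is
 * at least batch_size; otherwise add it to the batch. Stop once
 * the accumulated size exceeds batch_size. 'chosen' receives the
 * indices of the selected packs; returns how many were selected,
 * or 0 if the sum never exceeded the batch size (no repack).
 */
static size_t select_batch(const uint64_t *sizes, size_t nr,
			   uint64_t batch_size, size_t *chosen)
{
	uint64_t total = 0;
	size_t n = 0;

	for (size_t i = 0; i < nr && total <= batch_size; i++) {
		if (sizes[i] >= batch_size)
			continue; /* too large to join a batch */
		chosen[n++] = i;
		total += sizes[i];
	}
	return total > batch_size ? n : 0;
}
```

With sizes {50, 200, 30, 40, 30} and a batch size of 100, the 200-unit pack is skipped and packs 0, 2, and 3 are chosen (sum 120 > 100).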

The current change introduces the command-line arguments, and we
add a test that ensures we parse these options properly. Since
we specify a small batch size, we will guarantee that future
implementations do not change the list of pack-files.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-multi-pack-index.txt | 12 
 builtin/multi-pack-index.c | 10 +-
 midx.c |  5 +
 midx.h |  1 +
 t/t5319-multi-pack-index.sh| 11 +++
 5 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-multi-pack-index.txt 
b/Documentation/git-multi-pack-index.txt
index 822d83c845..e43e7da71e 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -39,6 +39,18 @@ expire::
Rewrite the MIDX file afterward to remove all references to
these pack-files.
 
+repack::
+   When given as the verb, collect a batch of pack-files whose
+   sizes are all at most the size given by --batch-size, but
+   whose sizes sum to larger than --batch-size. The batch is
+   selected by greedily adding small pack-files starting with
+   the oldest pack-files that fit the size. Create a new pack-
+   file containing the objects the multi-pack-index indexes
+   into those pack-files, and rewrite the multi-pack-index to
+   contain that pack-file. A later run of 'git multi-pack-index
+   expire' will delete the pack-files that were part of this
+   batch.
+
 
 EXAMPLES
 
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 145de3a46c..d87a2235e3 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -5,12 +5,13 @@
 #include "midx.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
-   N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire)"),
+   N_("git multi-pack-index [--object-dir=<dir>] 
(write|verify|expire|repack --batch-size=<size>)"),
NULL
 };
 
 static struct opts_multi_pack_index {
const char *object_dir;
+   unsigned long batch_size;
 } opts;
 
 int cmd_multi_pack_index(int argc, const char **argv,
@@ -19,6 +20,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
static struct option builtin_multi_pack_index_options[] = {
OPT_FILENAME(0, "object-dir", &opts.object_dir,
  N_("object directory containing set of packfile and 
pack-index pairs")),
+   OPT_MAGNITUDE(0, "batch-size", &opts.batch_size,
+ N_("during repack, collect pack-files of smaller size into a 
batch that is larger than this size")),
OPT_END(),
};
 
@@ -40,6 +43,11 @@ int cmd_multi_pack_index(int argc, const char **argv,
return 1;
}
 
+   if (!strcmp(argv[0], "repack"))
+   return midx_repack(opts.object_dir, (size_t)opts.batch_size);
+   if (opts.batch_size)
+   die(_("--batch-size option is only for 'repack' verb"));
+
if (!strcmp(argv[0], "write"))
return write_midx_file(opts.object_dir);
if (!strcmp(argv[0], "verify"))
diff --git a/midx.c b/midx.c
index 50e4cd7270..4caf148464 100644
--- a/midx.c
+++ b/midx.c
@@ -1115,3 +1115,8 @@ int expire_midx_packs(const char *object_dir)
string_list_clear(&packs_to_drop, 0);
return result;
 }
+
+int midx_repack(const char *object_dir, size_t batch_size)
+{
+   return 0;
+}
diff --git a/midx.h b/midx.h
index e3a2b740b5..394a21ee96 100644
--- a/midx.h
+++ b/midx.h
@@ -50,6 +50,7 @@ int write_midx_file(const char *object_dir);
 void clear_midx_file(struct repository *r);
 int veri

[PATCH 1/5] multi-pack-index: prepare for 'expire' verb

2018-12-10 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The multi-pack-index tracks objects in a collection of pack-files.
Only one copy of each object is indexed, using the modified time
of the pack-files to determine tie-breakers. It is possible to
have a pack-file with no referenced objects because all objects
have a duplicate in a newer pack-file.

Introduce a new 'expire' verb to the multi-pack-index builtin.
This verb will delete these unused pack-files and rewrite the
multi-pack-index to no longer refer to those files. More details
about the specifics will follow as the method is implemented.

Add a test that verifies the 'expire' verb is correctly wired,
but will still be valid when the verb is implemented. Specifically,
create a set of packs that should all have referenced objects and
should not be removed during an 'expire' operation.
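The core of the eventual 'expire' logic can be sketched by counting, per pack, how many objects the multi-pack-index reads from it. This is an illustrative sketch with hypothetical types (the real implementation follows in a later patch): a pack with a zero count only holds objects whose preferred copies live in newer packs, so it is safe to delete.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * obj_pack_id[i] is the pack the multi-pack-index reads object i
 * from. Count references per pack, then report packs with zero
 * referenced objects in 'unreferenced'; returns how many.
 * (Sketch assumption: at most 64 packs.)
 */
static size_t find_unreferenced(const uint32_t *obj_pack_id, size_t nr_objects,
				size_t nr_packs, size_t *unreferenced)
{
	size_t counts[64] = { 0 };
	size_t n = 0;

	for (size_t i = 0; i < nr_objects; i++)
		counts[obj_pack_id[i]]++;
	for (size_t p = 0; p < nr_packs; p++)
		if (!counts[p])
			unreferenced[n++] = p;
	return n;
}
```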

Signed-off-by: Derrick Stolee 
---
 Documentation/git-multi-pack-index.txt |  8 +
 builtin/multi-pack-index.c |  4 ++-
 midx.c |  5 +++
 midx.h |  1 +
 t/t5319-multi-pack-index.sh| 47 ++
 5 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-multi-pack-index.txt 
b/Documentation/git-multi-pack-index.txt
index f7778a2c85..822d83c845 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -31,6 +31,14 @@ verify::
When given as the verb, verify the contents of the MIDX file
at `/packs/multi-pack-index`.
 
+expire::
+   When given as the verb, delete the pack-files that are tracked
+   by the MIDX file at `/packs/multi-pack-index` but have
+   no objects referenced by the MIDX. All objects in these pack-
+   files have another copy in a more-recently modified pack-file.
+   Rewrite the MIDX file afterward to remove all references to
+   these pack-files.
+
 
 EXAMPLES
 
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index fca70f8e4f..145de3a46c 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -5,7 +5,7 @@
 #include "midx.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
-   N_("git multi-pack-index [--object-dir=<dir>] (write|verify)"),
+   N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire)"),
NULL
 };
 
@@ -44,6 +44,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
return write_midx_file(opts.object_dir);
if (!strcmp(argv[0], "verify"))
return verify_midx_file(opts.object_dir);
+   if (!strcmp(argv[0], "expire"))
+   return expire_midx_packs(opts.object_dir);
 
die(_("unrecognized verb: %s"), argv[0]);
 }
diff --git a/midx.c b/midx.c
index 730ff84dff..bb825ef816 100644
--- a/midx.c
+++ b/midx.c
@@ -1025,3 +1025,8 @@ int verify_midx_file(const char *object_dir)
 
return verify_midx_error;
 }
+
+int expire_midx_packs(const char *object_dir)
+{
+   return 0;
+}
diff --git a/midx.h b/midx.h
index 774f652530..e3a2b740b5 100644
--- a/midx.h
+++ b/midx.h
@@ -49,6 +49,7 @@ int prepare_multi_pack_index_one(struct repository *r, const 
char *object_dir, i
 int write_midx_file(const char *object_dir);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(const char *object_dir);
+int expire_midx_packs(const char *object_dir);
 
 void close_midx(struct multi_pack_index *m);
 
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 70926b5bc0..948effc1ee 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -348,4 +348,51 @@ test_expect_success 'verify incorrect 64-bit offset' '
"incorrect object offset"
 '
 
+test_expect_success 'setup expire tests' '
+   mkdir dup &&
+   (
+   cd dup &&
+   git init &&
+   for i in $(test_seq 1 20)
+   do
+   test_commit $i
+   done &&
+   git branch A HEAD &&
+   git branch B HEAD~8 &&
+   git branch C HEAD~13 &&
+   git branch D HEAD~16 &&
+   git branch E HEAD~18 &&
+   git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
+   refs/heads/E
+   EOF
+   git pack-objects --revs .git/objects/pack/pack-D <<-EOF &&
+   refs/heads/D
+   ^refs/heads/E
+   EOF
+   git pack-objects --revs .git/objects/pack/pack-C <<-EOF &&
+   refs/heads/C
+   ^refs/heads/D
+   EOF
+   git pack-objects --revs .git/objects/pack/pack-B <<-EOF &&
+   

[PATCH v3 5/6] pack-objects: create pack.useSparse setting

2018-12-10 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The '--sparse' flag in 'git pack-objects' changes the algorithm
used to enumerate objects to one that is faster for individual
users pushing new objects that change only a small cone of the
working directory. The sparse algorithm is not recommended for a
server, which likely sends new objects that appear across the
entire working directory.

Create a 'pack.useSparse' setting that enables this new algorithm.
This allows 'git push' to use this algorithm without passing a
'--sparse' flag all the way through four levels of run_command()
calls.

If the '--no-sparse' flag is set, then this config setting is
overridden.

Signed-off-by: Derrick Stolee 
---
 Documentation/config/pack.txt  |  9 +
 builtin/pack-objects.c |  4 
 t/t5322-pack-objects-sparse.sh | 15 +++
 3 files changed, 28 insertions(+)

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index edac75c83f..425c73aa52 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -105,6 +105,15 @@ pack.useBitmaps::
true. You should not generally need to turn this off unless
you are debugging pack bitmaps.
 
+pack.useSparse::
+   When true, git will default to using the '--sparse' option in
+   'git pack-objects' when the '--revs' option is present. This
+   algorithm only walks trees that appear in paths that introduce new
+   objects. This can have significant performance benefits when
+   computing a pack to send a small change. However, it is possible
+   that extra objects are added to the pack-file if the included
+   commits contain certain types of direct renames.
+
 pack.writeBitmaps (deprecated)::
This is a deprecated synonym for `repack.writeBitmaps`.
 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 7d5b0735e3..124b1bafc4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, 
void *cb)
use_bitmap_index_default = git_config_bool(k, v);
return 0;
}
+   if (!strcmp(k, "pack.usesparse")) {
+   sparse = git_config_bool(k, v);
+   return 0;
+   }
if (!strcmp(k, "pack.threads")) {
delta_search_threads = git_config_int(k, v);
if (delta_search_threads < 0)
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 45dba6e014..8f5699bd91 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -121,4 +121,19 @@ test_expect_success 'sparse pack-objects' '
test_cmp expect_sparse_objects.txt sparse_objects.txt
 '
 
+test_expect_success 'pack.useSparse enables algorithm' '
+   git config pack.useSparse true &&
+   git pack-objects --stdout --revs <packinput.txt >sparse.pack &&
+   git index-pack -o sparse.idx sparse.pack &&
+   git show-index sparse_objects.txt &&
+   test_cmp expect_sparse_objects.txt sparse_objects.txt
+'
+
+test_expect_success 'pack.useSparse overridden' '
+   git pack-objects --stdout --revs --no-sparse <packinput.txt >sparse.pack &&
+   git index-pack -o sparse.idx sparse.pack &&
+   git show-index sparse_objects.txt &&
+   test_cmp expect_objects.txt sparse_objects.txt
+'
+
 test_done
-- 
gitgitgadget



[PATCH v3 2/6] list-objects: consume sparse tree walk

2018-12-10 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When creating a pack-file using 'git pack-objects --revs' we provide
a list of interesting and uninteresting commits. For example, a push
operation would make the local topic branch be interesting and the
known remote refs as uninteresting. We want to discover the set of
new objects to send to the server as a thin pack.

We walk these commits until we discover a frontier of commits such
that every commit walk starting at interesting commits ends in a root
commit or uninteresting commit. We then need to discover which
non-commit objects are reachable from uninteresting commits.

The mark_edges_uninteresting() method in list-objects.c iterates on
the commit list and does the following:

* If the commit is UNINTERESTING, then mark its root tree and every
  object it can reach as UNINTERESTING.

* If the commit is interesting, then mark the root tree of every
  UNINTERESTING parent (and all objects that tree can reach) as
  UNINTERESTING.
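The two rules above can be sketched with simplified types. This is an illustration only (struct edge_commit is hypothetical; the real code in list-objects.c works on struct commit and object flags): tree_flags[t] collects the UNINTERESTING bit for root tree t.

```c
#include <assert.h>
#include <stddef.h>

#define UNINTERESTING 1u

struct edge_commit {
	unsigned flags;
	int root_tree;
	int nr_parents;
	const struct edge_commit *parents[2];
};

static void mark_edges(const struct edge_commit *commits, size_t nr,
		       unsigned *tree_flags)
{
	for (size_t i = 0; i < nr; i++) {
		const struct edge_commit *c = &commits[i];
		if (c->flags & UNINTERESTING) {
			/* uninteresting commit: its root tree is uninteresting */
			tree_flags[c->root_tree] |= UNINTERESTING;
		} else {
			/* interesting commit: mark root trees of its
			 * UNINTERESTING parents */
			for (int p = 0; p < c->nr_parents; p++)
				if (c->parents[p]->flags & UNINTERESTING)
					tree_flags[c->parents[p]->root_tree] |= UNINTERESTING;
		}
	}
}
```

An interesting commit whose parent is uninteresting ends up with only the parent's root tree marked; its own tree stays unmarked and will be packed.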

At the very end, we repeat the process on every commit directly
given to the revision walk from stdin. This helps ensure we properly
cover shallow commits that otherwise were not included in the
frontier.

The logic to recursively follow trees is in the
mark_tree_uninteresting() method in revision.c. The algorithm avoids
duplicate work by not recursing into trees that are already marked
UNINTERESTING.

Add a new 'sparse' option to the mark_edges_uninteresting() method
that performs this logic in a slightly new way. As we iterate over
the commits, we add all of the root trees to an oidset. Then, call
mark_trees_uninteresting_sparse() on that oidset. Note that we
include interesting trees in this process. The current implementation
of mark_trees_uninteresting_sparse() will walk the same trees as
the old logic, but this will be replaced in a later change.

The sparse option is not used by any callers at the moment, but
will be wired to 'git pack-objects' in the next change.

Signed-off-by: Derrick Stolee 
---
 bisect.c   |  2 +-
 builtin/pack-objects.c |  2 +-
 builtin/rev-list.c |  2 +-
 http-push.c|  2 +-
 list-objects.c | 55 +++---
 list-objects.h |  4 ++-
 revision.c |  3 +++
 7 files changed, 61 insertions(+), 9 deletions(-)

diff --git a/bisect.c b/bisect.c
index 487675c672..842f8b4b8f 100644
--- a/bisect.c
+++ b/bisect.c
@@ -656,7 +656,7 @@ static void bisect_common(struct rev_info *revs)
if (prepare_revision_walk(revs))
die("revision walk setup failed");
if (revs->tree_objects)
-   mark_edges_uninteresting(revs, NULL);
+   mark_edges_uninteresting(revs, NULL, 0);
 }
 
 static void exit_if_skipped_commits(struct commit_list *tried,
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 411aefd687..5f70d840a7 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3135,7 +3135,7 @@ static void get_object_list(int ac, const char **av)
 
if (prepare_revision_walk(&revs))
die(_("revision walk setup failed"));
-   mark_edges_uninteresting(&revs, show_edge);
+   mark_edges_uninteresting(&revs, show_edge, 0);
 
if (!fn_show_object)
fn_show_object = show_object;
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 2880ed37e3..9663cbfae0 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -543,7 +543,7 @@ int cmd_rev_list(int argc, const char **argv, const char 
*prefix)
if (prepare_revision_walk(&revs))
die("revision walk setup failed");
if (revs.tree_objects)
-   mark_edges_uninteresting(&revs, show_edge);
+   mark_edges_uninteresting(&revs, show_edge, 0);
 
if (bisect_list) {
int reaches, all;
diff --git a/http-push.c b/http-push.c
index cd48590912..ea52d6f9f6 100644
--- a/http-push.c
+++ b/http-push.c
@@ -1933,7 +1933,7 @@ int cmd_main(int argc, const char **argv)
pushing = 0;
if (prepare_revision_walk(&revs))
die("revision walk setup failed");
-   mark_edges_uninteresting(&revs, NULL);
+   mark_edges_uninteresting(&revs, NULL, 0);
objects_to_send = get_delta(&revs, ref_lock);
finish_all_active_slots();
 
diff --git a/list-objects.c b/list-objects.c
index c41cc80db5..4fbdeca0a4 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -222,25 +222,72 @@ static void mark_edge_parents_uninteresting(struct commit 
*commit,
}
 }
 
-void mark_edges_uninteresting(struct rev_info *revs, show_edge_fn show_edge)
+static void add_edge_parents(struct commit *commit,
+struct rev_info *revs,
+show_edge_fn show_edge,
+struct oidset *set)
+

[PATCH v3 4/6] revision: implement sparse algorithm

2018-12-10 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commits we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees to find
the interesting trees.

This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.

The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.

We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.

To actually improve the performance, we need to terminate our
recursion unless the oidset contains some interesting trees and
some uninteresting trees. Technically, we only need one interesting
tree for this to speed up in most cases, but we also will not mark
anything UNINTERESTING if there are no uninteresting trees, so
that would be wasted effort.
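The pruning condition described above is small enough to state directly. The types here are hypothetical stand-ins (git's version inspects real oidsets keyed by path): a path's tree set is only worth recursing into when it mixes interesting and uninteresting trees.

```c
#include <assert.h>
#include <stddef.h>

/* Counts of interesting/uninteresting trees seen at one path. */
struct path_trees {
	size_t nr_interesting;
	size_t nr_uninteresting;
};

/*
 * Recurse only when both kinds are present: an all-interesting set
 * has nothing to mark, and an all-uninteresting set would mark
 * nothing new.
 */
static int should_recurse(const struct path_trees *pt)
{
	return pt->nr_interesting > 0 && pt->nr_uninteresting > 0;
}
```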

There are a few ways that this is not a universally better option.

First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. We add a test case that
demonstrates this as a way to prove that the --sparse option is
actually working.

Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.

Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.

For git.git, I used the following input:

v2.19.0
^v2.19.0~10

 Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130

For the Linux repo, I used the following input:

v4.18
^v4.18~10

 Objects to pack:   518
Walked (old alg): 4,836
Walked (new alg):   188

The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.

 Objects to pack:    220
Walked (old alg): 22,804
Walked (new alg):    129

I used the number of walked trees as the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.

These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.

Signed-off-by: Derrick Stolee 
---
 revision.c | 116 ++---
 t/t5322-pack-objects-sparse.sh |  21 --
 2 files changed, 121 insertions(+), 16 deletions(-)

diff --git a/revision.c b/revision.c
index f9eb6400f1..971f1bb095 100644
--- a/revision.c
+++ b/revision.c
@@ -99,29 +99,125 @@ void mark_tree_uninteresting(struct repository *r, struct 
tree *tree)
mark_tree_contents_uninteresting(r, tree);
 }
 
+struct paths_and_oids {
+   struct string_list list;
+};
+
+static void paths_and_oids_init(st

[PATCH v3 6/6] pack-objects: create GIT_TEST_PACK_SPARSE

2018-12-10 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

Create a test variable GIT_TEST_PACK_SPARSE to enable the sparse
object walk algorithm by default during the test suite. Enabling
this variable ensures coverage in many interesting cases, such as
shallow clones, partial clones, and missing objects.

Signed-off-by: Derrick Stolee 
---
 builtin/pack-objects.c | 1 +
 t/README   | 4 
 t/t5322-pack-objects-sparse.sh | 6 +++---
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 124b1bafc4..507d381153 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3331,6 +3331,7 @@ int cmd_pack_objects(int argc, const char **argv, const 
char *prefix)
 
read_replace_refs = 0;
 
+   sparse = git_env_bool("GIT_TEST_PACK_SPARSE", 0);
reset_pack_idx_option(&pack_idx_opts);
git_config(git_pack_config, NULL);
 
diff --git a/t/README b/t/README
index 28711cc508..8b6dfe1864 100644
--- a/t/README
+++ b/t/README
 GIT_TEST_INDEX_VERSION=<n> exercises the index read/write 
code path
 for the index version specified.  Can be set to any valid version
 (currently 2, 3, or 4).
 
+GIT_TEST_PACK_SPARSE=<boolean> if enabled will default the pack-objects
+builtin to use the sparse object walk. This can still be overridden by
+the --no-sparse command-line argument.
+
 GIT_TEST_PRELOAD_INDEX=<boolean> exercises the preload-index code path
 by overriding the minimum number of cache entries required per thread.
 
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 8f5699bd91..e8cf41d1c6 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -36,7 +36,7 @@ test_expect_success 'setup repo' '
 '
 
 test_expect_success 'non-sparse pack-objects' '
-   git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+   git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack &&
git index-pack -o nonsparse.idx nonsparse.pack &&
git show-index nonsparse_objects.txt &&
test_cmp expect_objects.txt nonsparse_objects.txt
@@ -70,7 +70,7 @@ test_expect_success 'duplicate a folder from f3 and commit to 
topic1' '
 '
 
 test_expect_success 'non-sparse pack-objects' '
-   git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+   git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack &&
git index-pack -o nonsparse.idx nonsparse.pack &&
git show-index nonsparse_objects.txt &&
test_cmp expect_objects.txt nonsparse_objects.txt
@@ -102,7 +102,7 @@ test_expect_success 'non-sparse pack-objects' '
topic1  \
topic1^{tree}   \
topic1:f3 | sort >expect_objects.txt &&
-   git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+   git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack &&
git index-pack -o nonsparse.idx nonsparse.pack &&
git show-index nonsparse_objects.txt &&
test_cmp expect_objects.txt nonsparse_objects.txt
-- 
gitgitgadget


[PATCH v3 3/6] pack-objects: add --sparse option

2018-12-10 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

Add a '--sparse' option flag to the pack-objects builtin. This
allows the user to specify that they want to use the new logic
for walking trees. This logic currently does not differ from the
existing output, but will in a later change.

Create a new test script, t5322-pack-objects-sparse.sh, to ensure
the object list that is selected matches what we expect. When we
update the logic to walk in a sparse fashion, the final test will
be updated to show the extra objects that are added.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-pack-objects.txt |  11 ++-
 builtin/pack-objects.c |   5 +-
 t/t5322-pack-objects-sparse.sh | 115 +
 3 files changed, 129 insertions(+), 2 deletions(-)
 create mode 100755 t/t5322-pack-objects-sparse.sh

diff --git a/Documentation/git-pack-objects.txt 
b/Documentation/git-pack-objects.txt
index 40c825c381..e45f3e680d 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -14,7 +14,7 @@ SYNOPSIS
	[--local] [--incremental] [--window=<n>] [--depth=<n>]
	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
	[--stdout [--filter=<filter-spec>] | base-name]
-   [--shallow] [--keep-true-parents] < object-list
+   [--shallow] [--keep-true-parents] [--sparse] < object-list
 
 
 DESCRIPTION
@@ -196,6 +196,15 @@ depth is 4095.
Add --no-reuse-object if you want to force a uniform compression
level on all data no matter the source.
 
+--sparse::
+   Use the "sparse" algorithm to determine which objects to include in
+   the pack, when combined with the "--revs" option. This algorithm
+   only walks trees that appear in paths that introduce new objects.
+   This can have significant performance benefits when computing
+   a pack to send a small change. However, it is possible that extra
+   objects are added to the pack-file if the included commits contain
+   certain types of direct renames.
+
 --thin::
Create a "thin" pack by omitting the common objects between a
sender and a receiver in order to reduce network transfer. This
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5f70d840a7..7d5b0735e3 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -84,6 +84,7 @@ static unsigned long pack_size_limit;
 static int depth = 50;
 static int delta_search_threads;
 static int pack_to_stdout;
+static int sparse;
 static int thin;
 static int num_preferred_base;
 static struct progress *progress_state;
@@ -3135,7 +3136,7 @@ static void get_object_list(int ac, const char **av)
 
if (prepare_revision_walk(&revs))
die(_("revision walk setup failed"));
-   mark_edges_uninteresting(&revs, show_edge, 0);
+   mark_edges_uninteresting(&revs, show_edge, sparse);
 
if (!fn_show_object)
fn_show_object = show_object;
@@ -3292,6 +3293,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
	{ OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"),
	  N_("unpack unreachable objects newer than <time>"),
  PARSE_OPT_OPTARG, option_parse_unpack_unreachable },
+   OPT_BOOL(0, "sparse", &sparse,
+N_("use the sparse reachability algorithm")),
OPT_BOOL(0, "thin", &thin,
 N_("create thin packs")),
OPT_BOOL(0, "shallow", &shallow,
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
new file mode 100755
index 00..81f6805bc3
--- /dev/null
+++ b/t/t5322-pack-objects-sparse.sh
@@ -0,0 +1,115 @@
+#!/bin/sh
+
+test_description='pack-objects object selection using sparse algorithm'
+. ./test-lib.sh
+
+test_expect_success 'setup repo' '
+   test_commit initial &&
+   for i in $(test_seq 1 3)
+   do
+   mkdir f$i &&
+   for j in $(test_seq 1 3)
+   do
+   mkdir f$i/f$j &&
+   echo $j >f$i/f$j/data.txt
+   done
+   done &&
+   git add . &&
+   git commit -m "Initialized trees" &&
+   for i in $(test_seq 1 3)
+   do
+   git checkout -b topic$i master &&
+   echo change-$i >f$i/f$i/data.txt &&
+   git commit -a -m "Changed f$i/f$i/data.txt"
+   done &&
+   cat >packinput.txt <<-EOF &&
+   topic1
+   ^topic2
+   ^topic3
+   EOF
+   git rev-parse   \
+   topic1  \
+   topic1^{tree}   \
+   topic1:f1   \
+   topic1

[PATCH v3 1/6] revision: add mark_tree_uninteresting_sparse

2018-12-10 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

In preparation for a new algorithm that walks fewer trees when
creating a pack from a set of revisions, create a method that
takes an oidset of tree oids and marks reachable objects as
UNINTERESTING.

The current implementation uses the existing
mark_tree_uninteresting to recursively walk the trees and blobs.
This will walk the same number of trees as the old mechanism.

There is one new assumption in this approach: we are also given
the oids of the interesting trees. This implementation does not
use those trees at the moment, but we will use them in a later
rewrite of this method.

Signed-off-by: Derrick Stolee 
---
 revision.c | 22 ++
 revision.h |  2 ++
 2 files changed, 24 insertions(+)

diff --git a/revision.c b/revision.c
index 13e0519c02..3a62c7c187 100644
--- a/revision.c
+++ b/revision.c
@@ -99,6 +99,28 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree)
mark_tree_contents_uninteresting(r, tree);
 }
 
+void mark_trees_uninteresting_sparse(struct repository *r,
+struct oidset *set)
+{
+   struct object_id *oid;
+   struct oidset_iter iter;
+
+   oidset_iter_init(set, &iter);
+   while ((oid = oidset_iter_next(&iter))) {
+   struct tree *tree = lookup_tree(r, oid);
+
+   if (tree->object.flags & UNINTERESTING) {
+   /*
+* Remove the flag so the next call
+* is not a no-op. The flag is added
+* in mark_tree_uninteresting().
+*/
+   tree->object.flags ^= UNINTERESTING;
+   mark_tree_uninteresting(r, tree);
+   }
+   }
+}
+
 struct commit_stack {
struct commit **items;
size_t nr, alloc;
diff --git a/revision.h b/revision.h
index 7987bfcd2e..f828e91ae9 100644
--- a/revision.h
+++ b/revision.h
@@ -67,6 +67,7 @@ struct rev_cmdline_info {
 #define REVISION_WALK_NO_WALK_SORTED 1
 #define REVISION_WALK_NO_WALK_UNSORTED 2
 
+struct oidset;
 struct topo_walk_info;
 
 struct rev_info {
@@ -327,6 +328,7 @@ void put_revision_mark(const struct rev_info *revs,
 
 void mark_parents_uninteresting(struct commit *commit);
 void mark_tree_uninteresting(struct repository *r, struct tree *tree);
+void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set);
 
 void show_object_with_name(FILE *, struct object *, const char *);
 
-- 
gitgitgadget



[PATCH v3 0/6] Add a new "sparse" tree walk algorithm

2018-12-10 Thread Derrick Stolee via GitGitGadget
One of the biggest remaining pain points for users of very large
repositories is the time it takes to run 'git push'. We inspected some slow
pushes by our developers and found that the "Enumerating Objects" phase of a
push was very slow. This is unsurprising, because this is why reachability
bitmaps exist. However, reachability bitmaps are not available to us because
of the single pack-file requirement. The bitmap approach is intended for
servers anyway, and clients have a much different behavior pattern.

Specifically, clients are normally pushing a very small number of objects
compared to the entire working directory. A typical user changes only a
small cone of the working directory, so let's use that to our benefit.

Create a new "sparse" mode for 'git pack-objects' that uses the paths that
introduce new objects to direct our search into the reachable trees. By
collecting trees at each path, we can then recurse into a path only when
there are uninteresting and interesting trees at that path. This gains a
significant performance boost for small topics while presenting a
possibility of packing extra objects.

The main algorithm change is in patch 4, but is set up a little bit in
patches 1 and 2.

As demonstrated in the included test script, we see that the existing
algorithm can send extra objects due to the way we specify the "frontier".
But we can send even more objects if a user copies objects from one folder
to another. I say "copy" because a rename would (usually) change the
original folder and trigger a walk into that path, discovering the objects.

In order to benefit from this approach, the user can opt-in using the
pack.useSparse config setting. This setting can be overridden using the
'--no-sparse' option.

Update in V2: 

 * Added GIT_TEST_PACK_SPARSE test option.
 * Fixed test breakages when GIT_TEST_PACK_SPARSE is enabled by adding null
   checks.

Update in V3:

 * Change documentation around 'pack.useSparse' config setting to better
   advertise to users.
 * Mostly a ping now that v2.20.0 is out.

Derrick Stolee (6):
  revision: add mark_tree_uninteresting_sparse
  list-objects: consume sparse tree walk
  pack-objects: add --sparse option
  revision: implement sparse algorithm
  pack-objects: create pack.useSparse setting
  pack-objects: create GIT_TEST_PACK_SPARSE

 Documentation/config/pack.txt  |   9 ++
 Documentation/git-pack-objects.txt |  11 ++-
 bisect.c   |   2 +-
 builtin/pack-objects.c |  10 ++-
 builtin/rev-list.c |   2 +-
 http-push.c|   2 +-
 list-objects.c |  55 +++-
 list-objects.h |   4 +-
 revision.c | 121 +
 revision.h |   2 +
 t/README   |   4 +
 t/t5322-pack-objects-sparse.sh | 139 +
 12 files changed, 351 insertions(+), 10 deletions(-)
 create mode 100755 t/t5322-pack-objects-sparse.sh


base-commit: a1598010f775d82b5adf12c29d0f5bc9b41434c6
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-89%2Fderrickstolee%2Fpush%2Fsparse-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-89/derrickstolee/push/sparse-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/89

Range-diff vs v2:

 1:  60617681f7 = 1:  60617681f7 revision: add mark_tree_uninteresting_sparse
 2:  4527addacb = 2:  4527addacb list-objects: consume sparse tree walk
 3:  9644f6ff04 ! 3:  4ef318bdb2 pack-objects: add --sparse option
 @@ -32,7 +32,9 @@
   
  +--sparse::
  + Use the "sparse" algorithm to determine which objects to include in
 -+ the pack. This can have significant performance benefits when computing
 ++ the pack, when combined with the "--revs" option. This algorithm
 ++ only walks trees that appear in paths that introduce new objects.
 ++ This can have significant performance benefits when computing
  + a pack to send a small change. However, it is possible that extra
  + objects are added to the pack-file if the included commits contain
  + certain types of direct renames.
 4:  c99957d06f = 4:  571b2e2784 revision: implement sparse algorithm
 5:  d6912188be ! 5:  33d2c04dd6 pack-objects: create pack.useSparse setting
 @@ -19,6 +19,26 @@
  
  Signed-off-by: Derrick Stolee 
  
 +diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
 +--- a/Documentation/config/pack.txt
 ++++ b/Documentation/config/pack.txt
 +@@
 +  true. You should not generally need to turn this off unless
 +  you are debugging pack bitmaps.
 + 
 ++pack.useSparse::
 ++ When true, git will default to using the '--sparse' option in
 ++ 'git pack-objects' when the '--revs' option

Git Test Coverage Report (Sunday, Dec 9)

2018-12-09 Thread Derrick Stolee

Here is today's coverage report.

Thanks,

-Stolee

[1] https://dev.azure.com/git/git/_build/results?buildId=287

---

pu: dfcf84ebfa17eb0bb3b57806fa530e87d8c8f1b8
jch: dd824ca506dbdf198282714b0bd21665c5825b4d
next: bc1bbc6f855c3b5ef7fcbd0f688f647c4e5b208b
master: 5d826e972970a784bd7a7bdf587512510097b8c7
master@{1}: 7068cbc4abac53d9c3675dfba81c1e97d25e8eeb

Uncovered code in 'pu' not in 'jch'
--

builtin/blame.c
74e8221b52 builtin/blame.c    928) blame_date_width = sizeof("Thu Oct 19 16:00");
74e8221b52 builtin/blame.c    929) break;

builtin/remote.c
b7f4e371e7 builtin/remote.c 1551) die(_("--save-to-push cannot be used with other options"));
b7f4e371e7 builtin/remote.c 1575) die(_("--save-to-push can only be used when only one url is defined"));


commit-graph.c
721351787e  128) return NULL;
721351787e  131) return NULL;
721351787e  187) free(graph);
721351787e  188) return NULL;
721351787e  223) free(graph);
721351787e  224) return NULL;

date.c
74e8221b52  113) die("Timestamp too large for this system: %"PRItime, time);
74e8221b52  216) if (tm->tm_mon == human_tm->tm_mon) {
74e8221b52  217) if (tm->tm_mday > human_tm->tm_mday) {
74e8221b52  219) } else if (tm->tm_mday == human_tm->tm_mday) {
74e8221b52  220) hide.date = hide.wday = 1;
74e8221b52  221) } else if (tm->tm_mday + 5 > human_tm->tm_mday) {
74e8221b52  223) hide.date = 1;
74e8221b52  231) gettimeofday(&now, NULL);
74e8221b52  232) show_date_relative(time, tz, &now, buf);
74e8221b52  233) return;
74e8221b52  246) hide.seconds = 1;
74e8221b52  247) hide.tz |= !hide.date;
74e8221b52  248) hide.wday = hide.time = !hide.year;
74e8221b52  262) strbuf_rtrim(buf);
74e8221b52  287) gettimeofday(&now, NULL);
74e8221b52  290) human_tz = local_time_tzoffset(now.tv_sec, &human_tm);
74e8221b52  886) static int auto_date_style(void)
74e8221b52  888) return (isatty(1) || pager_in_use()) ? DATE_HUMAN : DATE_NORMAL;

74e8221b52  909) return DATE_HUMAN;
74e8221b52  911) return auto_date_style();

pretty.c
4681fe38e1 1069) return 0;
b755bf6f83 1107) match_placeholder_arg(p, "=on", end) ||
b755bf6f83 1108) match_placeholder_arg(p, "=true", end)) {

protocol.c
24c10f7473  37) die(_("Unrecognized protocol version"));
24c10f7473  39) die(_("Unrecognized protocol_version"));

remote-curl.c
24c10f7473  403) return 0;

strbuf.c
18f8e81091  397) return 0;

submodule.c
26f80ccfc1 1396) strbuf_release(&gitdir);
be76c21282 1519) struct fetch_task *task = task_cb;
be76c21282 1523) fetch_task_release(task);

wrapper.c
5efde212fc  70) die("Out of memory, malloc failed (tried to allocate %" PRIuMAX " bytes)",
5efde212fc  73) error("Out of memory, malloc failed (tried to allocate %" PRIuMAX " bytes)",


Commits introducing uncovered code:
Anders Waldenborg  18f8e8109: strbuf: separate callback for strbuf_expand:ing literals
Anders Waldenborg  4681fe38e: pretty: allow showing specific trailers
Anders Waldenborg  b755bf6f8: pretty: allow %(trailers) options with explicit value
Denton Liu  b7f4e371e: remote: add --save-to-push option to git remote set-url
Josh Steadmon  24c10f747: protocol: advertise multiple supported versions
Josh Steadmon  721351787: commit-graph, fuzz: add fuzzer for commit-graph
Linus Torvalds  74e8221b5: Add 'human' date format
Martin Koegler  5efde212f: zlib.c: use size_t for size
Stefan Beller  26f80ccfc: submodule: migrate get_next_submodule to use repository structs
Stefan Beller  be76c2128: fetch: ensure submodule objects fetched



Uncovered code in 'jch' not in 'next'


apply.c
0f086e6dca 3355) if (checkout_entry(ce, &costate, NULL, NULL) ||
0f086e6dca 3356) lstat(ce->name, st))

builtin/branch.c
0ecb1fc726 builtin/branch.c 456) die(_("could not resolve HEAD"));
0ecb1fc726 builtin/branch.c 462) die(_("HEAD (%s) points outside of refs/heads/"), refname);


hex.c
47edb64997  93) char *sha1_to_hex_r(char *buffer, const unsigned char *sha1)
47edb64997  95) return hash_to_hex_algop_r(buffer, sha1, &hash_algos[GIT_HASH_SHA1]);

47edb64997 116) char *hash_to_hex(const unsigned char *hash)
47edb64997 118) return hash_to_hex_algop(hash, the_hash_algo);

pathspec.c
22af33bece 671) name = to_free = xmemdupz(name, namelen);

read-cache.c
ee70c12820 1730) if (advice_unknown_index_extension) {
ee70c12820 1731) warning(_("ignoring optional %.4s index extension"), ext);
ee70c12820 1732) advise(_("This is likely due to the file having been written by a newer\n"

ec36c42a63 3508) const char *index = NULL;
ec36c42a63 3514) if (!offset)
ec36c42a63 3515) return NULL;
ec36c42a63 3516) while (offset <= mmap_size - the_hash_algo->rawsz - 8) {
ec36c42a63 3517) extsize = get_be32(mmap + offset + 4);
ec36c42a63 3518) if (CACHE_EXT((mmap + offset)) == CACHE_EXT_INDEXENTRYOFFSETTABLE) {

ec36c42a63 3519) index = mmap + offset + 4 + 4;
ec36c42a63 3520) break;
ec36c42a63 3522) offset += 8;
ec36c42a63 3523) offset += extsize;
ec36c42a63

Re: [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference()

2018-12-07 Thread Derrick Stolee

On 12/6/2018 6:36 PM, Jonathan Tan wrote:

AFAICT oid_object_info doesn't take advantage of the commit graph,
but just looks up the object header, which is still less than completely
parsing it. Then lookup_commit is overly strict, as it may return
NULL as when there still is a type mismatch (I don't think a mismatch
could happen here, as both rely on just the object store, and not the
commit graph.), so this would be just defensive programming for
the sake of it. I dunno.

 struct commit *c;

 if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT &&
 (c = lookup_commit(revs->repo, oid)) &&
 !repo_parse_commit(revs->repo, c))
 object = (struct object *) c;
 else
 object = parse_object(revs->repo, oid);

I like this way better - I'll do it in the next version.


If we do _not_ have a commit-graph or if the commit-graph does not have
that commit, this will have the same performance problem, right?

Should we instead create a direct dependence on the commit-graph, and try
to parse the oid from the graph directly? If it succeeds, then we learn
that the object is a commit, in addition to all of the parsing work. This
means we could avoid oid_object_info() loading data if we succeed. We
would fall back to parse_object() if it fails.

I was thinking this should be a simple API call to parse_commit_in_graph(),
but that requires a struct commit filled with an oid, which is not the
best idea if we don't actually know it is a commit yet.

The approach I recommend would then be more detailed:

1. Modify find_commit_in_graph() to take a struct object_id instead of a
   struct commit. This helps find the integer position in the graph. That
   position can be used in fill_commit_in_graph() to load the commit
   contents. Keep find_commit_in_graph() static as it should not be a
   public function.

2. Create a public function with prototype

struct commit *try_parse_commit_from_graph(struct repository *r, struct object_id *oid)


   that returns a commit struct fully parsed if and only if the repository
   has that oid. It can call find_commit_in_graph(), then lookup_commit()
   and fill_commit_in_graph() to create the commit and parse the data.

3. In replace of the snippet above, do:

    struct commit *c;

    if ((c = try_parse_commit_from_graph(revs->repo, oid)))
        object = (struct object *)c;
    else
        object = parse_object(revs->repo, oid);

A similar pattern _could_ be used in parse_object(), but I don't recommend
doing this pattern unless we have a reasonable suspicion that we are going
to parse commits more often than other objects. (It adds an O(log(# 
commits))

binary search to each object.)

A final thought: consider making this "try the commit graph first, but fall
back to parse_object()" a library function with a name like

    struct object *parse_probably_commit(struct repository *r, struct object_id *oid)


so other paths that are parsing a lot of commits (but also maybe tags) could
use the logic.

Thanks!
-Stolee


Re: [PATCH v2 2/3] commit-graph: fix buffer read-overflow

2018-12-07 Thread Derrick Stolee

On 12/6/2018 3:20 PM, Josh Steadmon wrote:

+
+# usage: corrupt_and_zero_graph_then_verify   
+# Manipulates the commit-graph file at  by inserting the data,
+# then zeros the file starting at . Finally, runs
+# 'git commit-graph verify' and places the output in the file 'err'. Tests 'err'
+# for the given string.
+corrupt_and_zero_graph_then_verify() {


This method is very similar to 'corrupt_graph_and_verify()', the only
difference being the zero_pos, which zeroes the graph.


Could it instead be a modification of corrupt_graph_and_verify() where 
$4 is interpreted as zero_pos, and if it is blank we don't do the 
truncation?



+test_expect_success 'detect truncated graph' '
+   corrupt_and_zero_graph_then_verify $GRAPH_BYTE_CHUNK_COUNT "\xff" \
+   $GRAPH_CHUNK_LOOKUP_OFFSET "chunk lookup table entry missing"
+'
+


Thanks for this! I think it's valuable to keep explicit tests around 
that were discovered from your fuzz tests. Specifically, I can repeat 
the test when I get around to the next file format.


Thanks,
-Stolee


Re: [PATCH 2/2] commit-graph: fix buffer read-overflow

2018-12-06 Thread Derrick Stolee

On 12/5/2018 5:32 PM, Josh Steadmon wrote:
  
+		if (chunk_lookup + GRAPH_CHUNKLOOKUP_WIDTH > data + graph_size) {

+   error(_("chunk lookup table entry missing; graph file may be incomplete"));
+   free(graph);
+   return NULL;
+   }


Something I forgot earlier: there are several tests in 
t5318-commit-graph.sh that use 'git commit-graph verify' to ensure we 
hit these error conditions on a corrupted commit-graph file. Could you 
try adding a test there that looks for this error message?


Thanks,
-Stolee


Re: git, monorepos, and access control

2018-12-05 Thread Derrick Stolee

On 12/5/2018 3:34 PM, Ævar Arnfjörð Bjarmason wrote:

On Wed, Dec 05 2018, Coiner, John wrote:


I'm an engineer with AMD. I'm looking at whether we could switch our
internal version control to a monorepo, possibly one based on git and
VFSForGit.

Has anyone looked at adding access control to git, at a per-directory
granularity? Is this a feature that the git community would possibly
welcome?

All of what you've described is possible to implement in git, but as far
as I know there's no existing implementation of it.

Microsoft's GVFS probably comes closest, and they're actively
upstreaming bits of that, but as far as I know that doesn't in any way
try to solve this "users XYZ can't even list such-and-such a tree"
problem.

(Avar has a lot of good ideas in his message, so I'm just going to add
on a few here.)

This directory-level security is not a goal for VFS for Git, and I don't
see it becoming a priority, as it breaks a number of design decisions we
made in our object storage and communication models.

The best I can think about when considering Git as an approach would be
to use submodules for your security-related content, and then have server-
side security for access to those repos. Of course, submodules are not
supported in VFS for Git, either.

The Gerrit service has _branch_ level security, which is related to the
reachability questions that a directory security would need. However,
the problem is quite different. Gerrit does have a lot of experience in
dealing with submodules, though, so that's probably a good place to
start.

Thanks,
-Stolee


Re: How de-duplicate similar repositories with alternates

2018-12-04 Thread Derrick Stolee

On 12/4/2018 8:35 AM, Ævar Arnfjörð Bjarmason wrote:

  The leftover space
spent is the commit-graph (not de-duped like objects are), and...


The commit-graph could be shared, as the commits in each enlistment can 
be parsed from local with GENERATION_NUMBER_INFINITY, giving us similar 
speedups.


The issue is: who updates the file? As the commit-graph gets behind, 
performance will degrade. But it seems like you'd need similar 
maintenance on the alternate object store, anyway.


Thanks,
-Stolee


Re: How de-duplicate similar repositories with alternates

2018-12-04 Thread Derrick Stolee

On 12/4/2018 2:06 AM, Jeff King wrote:

On Thu, Nov 29, 2018 at 10:55:49AM -0800, Stefan Beller wrote:


I view alternates as a historic artefact as the deduping
of objects client side can be done using worktrees, and on the
serverside - I think - most of the git hosters use namespaces
and put a fork network into the same repository and use pack islands.

By contrast, object storage is pretty easy to share. It scales
reasonably well, and the security model is much simpler due to the
immutable nature of object names.


And for the client side, we use alternates as an important way to scale 
VFS for Git to multiple enlistments on the same machine. VFS for Git 
manages a "shared object cache" (the alternate) that is updated in the 
background (including multi-pack-index and commit-graph).


Using worktrees for the same effect would add complications to the user 
interactions, not only when creating an enlistment but the fact that two 
enlistments cannot check out the same ref will confuse users.


Thanks,
-Stolee


Git Test Coverage Report (Friday Nov 30)

2018-11-30 Thread Derrick Stolee

Here is today's test coverage report.

Thanks,
-Stolee

[1] https://dev.azure.com/git/git/_build/results?buildId=277

---

pu: 5a1a9a96d55fbb80426189a921d7b6cc66564c78
jch: 71c29cabb7379fe9abaacbbbd1350268d0c18b4f
next: a9faaff8c120bf4783cb892c157871fe524b3608
master: 7068cbc4abac53d9c3675dfba81c1e97d25e8eeb
master@{1}: 7f4e64169352e03476b0ea64e7e2973669e491a2

Uncovered code in 'pu' not in 'jch'
--

builtin/blame.c
74e8221b52 builtin/blame.c    928) blame_date_width = sizeof("Thu Oct 19 16:00");
74e8221b52 builtin/blame.c    929) break;

builtin/remote.c
b7f4e371e7 builtin/remote.c 1551) die(_("--save-to-push cannot be used with other options"));
b7f4e371e7 builtin/remote.c 1575) die(_("--save-to-push can only be used when only one url is defined"));


date.c
74e8221b52  113) die("Timestamp too large for this system: %"PRItime, time);
74e8221b52  216) if (tm->tm_mon == human_tm->tm_mon) {
74e8221b52  217) if (tm->tm_mday > human_tm->tm_mday) {
74e8221b52  219) } else if (tm->tm_mday == human_tm->tm_mday) {
74e8221b52  220) hide.date = hide.wday = 1;
74e8221b52  221) } else if (tm->tm_mday + 5 > human_tm->tm_mday) {
74e8221b52  223) hide.date = 1;
74e8221b52  231) gettimeofday(&now, NULL);
74e8221b52  232) show_date_relative(time, tz, &now, buf);
74e8221b52  233) return;
74e8221b52  246) hide.seconds = 1;
74e8221b52  247) hide.tz |= !hide.date;
74e8221b52  248) hide.wday = hide.time = !hide.year;
74e8221b52  262) strbuf_rtrim(buf);
74e8221b52  287) gettimeofday(&now, NULL);
74e8221b52  290) human_tz = local_time_tzoffset(now.tv_sec, &human_tm);
74e8221b52  886) static int auto_date_style(void)
74e8221b52  888) return (isatty(1) || pager_in_use()) ? DATE_HUMAN : DATE_NORMAL;

74e8221b52  909) return DATE_HUMAN;
74e8221b52  911) return auto_date_style();

protocol.c
24c10f7473  37) die(_("Unrecognized protocol version"));
24c10f7473  39) die(_("Unrecognized protocol_version"));

remote-curl.c
24c10f7473  403) return 0;

sha1-array.c
bba406749a 91) oidcpy(&oids[dst], &oids[src]);

strbuf.c
10a40f5700  397) return 0;

submodule.c
e2419f7e30 1376) strbuf_release(&gitdir);
7454fe5cb6 1499) struct get_next_submodule_task *task = task_cb;
7454fe5cb6 1503) get_next_submodule_task_release(task);
7454fe5cb6 1530) return 0;
7454fe5cb6 1534) goto out;
7454fe5cb6 1549) return 0;

wrapper.c
5efde212fc  70) die("Out of memory, malloc failed (tried to allocate %" PRIuMAX " bytes)",
5efde212fc  73) error("Out of memory, malloc failed (tried to allocate %" PRIuMAX " bytes)",


Commits introducing uncovered code:
Anders Waldenborg  10a40f570: strbuf: separate callback for strbuf_expand:ing literals
Denton Liu  b7f4e371e: remote: add --save-to-push option to git remote set-url
Josh Steadmon  24c10f747: protocol: advertise multiple supported versions
Linus Torvalds  74e8221b5: Add 'human' date format
Martin Koegler  5efde212f: zlib.c: use size_t for size
Stefan Beller  7454fe5cb: fetch: try fetching submodules if needed objects were not fetched
Stefan Beller  bba406749: sha1-array: provide oid_array_filter
Stefan Beller  e2419f7e3: submodule: migrate get_next_submodule to use repository structs




Uncovered code in 'jch' not in 'next'


apply.c
0f086e6dca 3355) if (checkout_entry(ce, &costate, NULL, NULL) ||
0f086e6dca 3356) lstat(ce->name, st))

builtin/branch.c
0ecb1fc726 builtin/branch.c 456) die(_("could not resolve HEAD"));
0ecb1fc726 builtin/branch.c 462) die(_("HEAD (%s) points outside of refs/heads/"), refname);


hex.c
47edb64997  93) char *sha1_to_hex_r(char *buffer, const unsigned char *sha1)
47edb64997  95) return hash_to_hex_algop_r(buffer, sha1, &hash_algos[GIT_HASH_SHA1]);

47edb64997 116) char *hash_to_hex(const unsigned char *hash)
47edb64997 118) return hash_to_hex_algop(hash, the_hash_algo);

pathspec.c
22af33bece 671) name = to_free = xmemdupz(name, namelen);

read-cache.c
ee70c12820 1730) if (advice_unknown_index_extension) {
ee70c12820 1731) warning(_("ignoring optional %.4s index extension"), ext);
ee70c12820 1732) advise(_("This is likely due to the file having been written by a newer\n"


sequencer.c
18e711a162 2387) opts->quiet = 1;

sha1-file.c
2f90b9d9b4 sha1-file.c  172) int hash_algo_by_name(const char *name)
2f90b9d9b4 sha1-file.c  175) if (!name)
2f90b9d9b4 sha1-file.c  176) return GIT_HASH_UNKNOWN;
2f90b9d9b4 sha1-file.c  177) for (i = 1; i < GIT_HASH_NALGOS; i++)
2f90b9d9b4 sha1-file.c  178) if (!strcmp(name, hash_algos[i].name))
2f90b9d9b4 sha1-file.c  179) return i;
2f90b9d9b4 sha1-file.c  180) return GIT_HASH_UNKNOWN;
2f90b9d9b4 sha1-file.c  183) int hash_algo_by_id(uint32_t format_id)
2f90b9d9b4 sha1-file.c  186) for (i = 1; i < GIT_HASH_NALGOS; i++)
2f90b9d9b4 sha1-file.c  187) if (format_id == hash_algos[i].format_id)
2f90b9d9b4 sha1-file.c  188) return i;
2f90b9d9b4 sha1-file.c  189) return GIT_HASH_UNKNOWN;

tree.c
e092073d64 104) commit = lookup_commit(r, e

Re: [PATCH 3/5] pack-objects: add --sparse option

2018-11-30 Thread Derrick Stolee

On 11/29/2018 9:39 PM, Junio C Hamano wrote:

Derrick Stolee  writes:


While _eventually_ we should make this opt-out, we shouldn't do that
until it has cooked a while.

I actually do not agree.  If the knob gives enough benefit, the
users will learn about it viva voce, and in a few more releases
we'll hear "enough users complain that they have to turn it on,
let's make it the default".  If that does not happen, the knob
does not deserve to be turned on in the first place.


That's a good philosophy. I'll keep it in mind.

Having said that, I won't be commenting on this shiny new toy before
the final.  I want to see more people help tying the loose ends and
give it final varnish to the upcoming release, as it seems to have
become rockier and larger release than we originally anticipated.


Yeah, no rush on this one. I just wanted to get some initial feedback 
about the idea.


Thanks,
-Stolee


[PATCH v2 5/6] pack-objects: create pack.useSparse setting

2018-11-29 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The '--sparse' flag in 'git pack-objects' changes the algorithm
used to enumerate objects to one that is faster for individual
users pushing new objects that change only a small cone of the
working directory. The sparse algorithm is not recommended for a
server, which likely sends new objects that appear across the
entire working directory.

Create a 'pack.useSparse' setting that enables this new algorithm.
This allows 'git push' to use this algorithm without passing a
'--sparse' flag all the way through four levels of run_command()
calls.

If the '--no-sparse' flag is set, then this config setting is
overridden.

Signed-off-by: Derrick Stolee 
---
 builtin/pack-objects.c |  4 
 t/t5322-pack-objects-sparse.sh | 15 +++
 2 files changed, 19 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 7d5b0735e3..124b1bafc4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, 
void *cb)
use_bitmap_index_default = git_config_bool(k, v);
return 0;
}
+   if (!strcmp(k, "pack.usesparse")) {
+   sparse = git_config_bool(k, v);
+   return 0;
+   }
if (!strcmp(k, "pack.threads")) {
delta_search_threads = git_config_int(k, v);
if (delta_search_threads < 0)
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 45dba6e014..8f5699bd91 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -121,4 +121,19 @@ test_expect_success 'sparse pack-objects' '
test_cmp expect_sparse_objects.txt sparse_objects.txt
 '
 
+test_expect_success 'pack.useSparse enables algorithm' '
+   git config pack.useSparse true &&
+   git pack-objects --stdout --revs <packinput.txt >sparse.pack &&
+   git index-pack -o sparse.idx sparse.pack &&
+   git show-index sparse_objects.txt &&
+   test_cmp expect_sparse_objects.txt sparse_objects.txt
+'
+
+test_expect_success 'pack.useSparse overridden' '
+   git pack-objects --stdout --revs --no-sparse <packinput.txt >sparse.pack &&
+   git index-pack -o sparse.idx sparse.pack &&
+   git show-index sparse_objects.txt &&
+   test_cmp expect_objects.txt sparse_objects.txt
+'
+
 test_done
-- 
gitgitgadget


