Re: [PATCH 5/6] commit-graph: implement file format version 2
On 1/24/2019 4:40 AM, Ævar Arnfjörð Bjarmason wrote:
> On Wed, Jan 23 2019, Derrick Stolee via GitGitGadget wrote:
>> The new format modifies the header of the commit-graph to solve these
>> problems. We use the 4-byte hash format id, freeing up a byte in our
>> 32-bit alignment to introduce a reachability index version. We can
>> also fail to read the commit-graph if the eighth byte is non-zero.
>
> I haven't tested, but it looks from the patch like we can transparently
> read existing v1 data and then will write v2 the next time. Would be
> helpful for reviewers if this was noted explicitly in the commit
> message.

Can do.

> Should there be a GIT_TEST_COMMIT_GRAPH_VERSION=[12] going forward to
> test the non-default version, or do you feel confident the tests added
> here test the upgrade path & old code well enough?

You're right that we should have an explicit "upgrade" test:

1. Write a v1 commit-graph
2. Add a commit
3. Write a v2 commit-graph

As for a new GIT_TEST_ variable, we should only need to continue relying
on GIT_TEST_COMMIT_GRAPH to test v2. I can add a 'graph_git_behavior'
call on an explicitly v1 commit-graph file to get most of the coverage
we need.

>> The 'git commit-graph read' subcommand needs updating to show the new
>> data.
>
> Let's say "The ... subcommand has been updated to show the new data".
> This sounds like a later patch is going to do that, but in fact it's
> done here.

Will clean up.

>> Set the default file format version to 2, and adjust the tests to
>> expect the new 'git commit-graph read' output.
>> Signed-off-by: Derrick Stolee
>> ---
>>  .../technical/commit-graph-format.txt | 26 +-
>>  builtin/commit-graph.c                |  9 
>>  commit-graph.c                        | 47 ---
>>  commit-graph.h                        |  1 +
>>  t/t5318-commit-graph.sh               |  9 +++-
>>  5 files changed, 83 insertions(+), 9 deletions(-)
>>
>> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
>> index 16452a0504..e367aa94b1 100644
>> --- a/Documentation/technical/commit-graph-format.txt
>> +++ b/Documentation/technical/commit-graph-format.txt
>> @@ -31,13 +31,22 @@ and hash type.
>>  All 4-byte numbers are in network order.
>>
>> +There are two versions available, 1 and 2. These currently differ only in
>> +the header.
>
> Shouldn't this be s/currently/ / ? Won't we add a version 3 if we make
> new changes?

When we add a new reachability index version, then the content of the
data chunk will change. Since we have a separate byte for versioning
that data, we do not need a v3 for the file format as a whole.

A similar statement applies to the unused byte reserved for the
incremental file format: we won't need to increase the file format
version as we will make that number non-zero and add a chunk with extra
data.

Thanks,
-Stolee
Re: [PATCH 0/9] drop some unused parameters
On 1/24/2019 8:11 AM, Jeff King wrote:
> 2. Patches that drop unused parameters (i.e., code cleanup).

I've read the patches and they do exactly this, along with explanations
of why they don't fit in the first category.

Thanks,
-Stolee
Re: [PATCH 5/8] show_date_relative(): drop unused "tz" parameter
On 1/24/2019 8:12 AM, Jeff King wrote:
> The timestamp we receive is in epoch time, so there's no need for a
> timezone parameter to interpret it. The matching show_date() uses "tz"
> to show dates in author local time, but relative dates show only the
> absolute time difference. The author's location is irrelevant, barring
> relativistic effects from using Git close to the speed of light.

I fully support more humor in our commit messages.

-Stolee
[PATCH 6/6] commit-graph: test verifying a corrupt v2 header
From: Derrick Stolee

The commit-graph file format v2 changes the v1 data only in the header
information. Add tests that check the 'verify' subcommand catches
corruption in the v2 header.

Signed-off-by: Derrick Stolee
---
 t/t5318-commit-graph.sh | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 3ff5e3b48d..be7bbf911a 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -497,6 +497,37 @@ test_expect_success 'git fsck (checks commit-graph)' '
 	test_must_fail git fsck
 '
 
+test_expect_success 'rewrite commit-graph with version 2' '
+	rm -f .git/objects/info/commit-graph &&
+	git commit-graph write --reachable --version=2 &&
+	git commit-graph verify
+'
+
+GRAPH_BYTE_CHUNK_COUNT=5
+GRAPH_BYTE_REACH_INDEX=6
+GRAPH_BYTE_UNUSED=7
+GRAPH_BYTE_HASH=8
+
+test_expect_success 'detect low chunk count (v2)' '
+	corrupt_graph_and_verify $GRAPH_BYTE_CHUNK_COUNT "\02" \
+		"missing the .* chunk"
+'
+
+test_expect_success 'detect incorrect reachability index' '
+	corrupt_graph_and_verify $GRAPH_BYTE_REACH_INDEX "\03" \
+		"reachability index version"
+'
+
+test_expect_success 'detect non-zero unused byte' '
+	corrupt_graph_and_verify $GRAPH_BYTE_UNUSED "\01" \
+		"unsupported value"
+'
+
+test_expect_success 'detect bad hash version (v2)' '
+	corrupt_graph_and_verify $GRAPH_BYTE_HASH "\00" \
+		"hash algorithm"
+'
+
 test_expect_success 'setup non-the_repository tests' '
 	rm -rf repo &&
 	git init repo &&
--
gitgitgadget
[PATCH 3/6] commit-graph: create new version flags
From: Derrick Stolee In anticipation of a new commit-graph file format version, create a flag for the write_commit_graph() and write_commit_graph_reachable() methods to take a version number. When there is no specified version, the implementation selects a default value. Currently, the only valid value is 1. The file format will change the header information, so place the existing header logic inside a switch statement with only one case. Signed-off-by: Derrick Stolee --- commit-graph.c | 58 +- commit-graph.h | 1 + 2 files changed, 40 insertions(+), 19 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 28fe2378be..f7f45893fd 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -25,9 +25,6 @@ #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16) -#define GRAPH_VERSION_1 0x1 -#define GRAPH_VERSION GRAPH_VERSION_1 - #define GRAPH_EXTRA_EDGES_NEEDED 0x8000 #define GRAPH_PARENT_MISSING 0x7fff #define GRAPH_EDGE_LAST_MASK 0x7fff @@ -118,30 +115,35 @@ struct commit_graph *load_commit_graph_one(const char *graph_file) } graph_version = *(unsigned char*)(data + 4); - if (graph_version != GRAPH_VERSION) { + if (graph_version != 1) { error(_("graph version %X does not match version %X"), - graph_version, GRAPH_VERSION); - goto cleanup_fail; - } - - hash_version = *(unsigned char*)(data + 5); - if (hash_version != oid_version()) { - error(_("hash version %X does not match version %X"), - hash_version, oid_version()); + graph_version, 1); goto cleanup_fail; } graph = alloc_commit_graph(); + switch (graph_version) { + case 1: + hash_version = *(unsigned char*)(data + 5); + if (hash_version != oid_version()) { + error(_("hash version %X does not match version %X"), + hash_version, oid_version()); + goto cleanup_fail; + } + + graph->num_chunks = *(unsigned char*)(data + 6); + chunk_lookup = data + 8; + break; + } + graph->hash_len = the_hash_algo->rawsz; - graph->num_chunks = *(unsigned char*)(data + 6); graph->graph_fd = fd; graph->data = graph_map; graph->data_len = 
graph_size; last_chunk_id = 0; last_chunk_offset = 8; - chunk_lookup = data + 8; for (i = 0; i < graph->num_chunks; i++) { uint32_t chunk_id = get_be32(chunk_lookup + 0); uint64_t chunk_offset = get_be64(chunk_lookup + 4); @@ -792,10 +794,22 @@ int write_commit_graph(const char *obj_dir, int res = 0; int append = flags & COMMIT_GRAPH_APPEND; int report_progress = flags & COMMIT_GRAPH_PROGRESS; + int version = 0; + int header_size = 0; if (!commit_graph_compatible(the_repository)) return 0; + if (flags & COMMIT_GRAPH_VERSION_1) + version = 1; + if (!version) + version = 1; + if (version != 1) { + error(_("unsupported commit-graph version %d"), + version); + return 1; + } + oids.nr = 0; approx_nr_objects = approximate_object_count(); oids.alloc = approx_nr_objects / 32; @@ -980,10 +994,16 @@ int write_commit_graph(const char *obj_dir, hashwrite_be32(f, GRAPH_SIGNATURE); - hashwrite_u8(f, GRAPH_VERSION); - hashwrite_u8(f, oid_version()); - hashwrite_u8(f, num_chunks); - hashwrite_u8(f, 0); /* unused padding byte */ + hashwrite_u8(f, version); + + switch (version) { + case 1: + hashwrite_u8(f, oid_version()); + hashwrite_u8(f, num_chunks); + hashwrite_u8(f, 0); /* unused padding byte */ + header_size = 8; + break; + } chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT; chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP; @@ -994,7 +1014,7 @@ int write_commit_graph(const char *obj_dir, chunk_ids[3] = 0; chunk_ids[4] = 0; - chunk_offsets[0] = 8 + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH; + chunk_offsets[0] = header_size + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH; chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE; chunk_offsets[2] = chunk_offsets[1] + hashsz * commits.nr; chunk_offsets[3] = chunk_offsets[2] + (hashsz + 16) * commits.nr; diff --git a/commit-graph.h b/commit-graph.h index 83fa548138..e03df54e33 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -62,6 +62,7 @@ int generation_numbers_enabled(struct repository *r); #define COMMIT_GRAPH_APPEND
[PATCH 5/6] commit-graph: implement file format version 2
From: Derrick Stolee The commit-graph file format had some shortcomings which we now correct: 1. The hash algorithm was determined by a single byte, instead of the 4-byte format identifier. 2. There was no way to update the reachability index we used. We currently only support generation numbers, but that will change in the future. 3. Git did not fail with error if the unused eighth byte was non-zero, so we could not use that to indicate an incremental file format without breaking compatibility across versions. The new format modifies the header of the commit-graph to solve these problems. We use the 4-byte hash format id, freeing up a byte in our 32-bit alignment to introduce a reachability index version. We can also fail to read the commit-graph if the eighth byte is non-zero. The 'git commit-graph read' subcommand needs updating to show the new data. Set the default file format version to 2, and adjust the tests to expect the new 'git commit-graph read' output. Signed-off-by: Derrick Stolee --- .../technical/commit-graph-format.txt | 26 +- builtin/commit-graph.c| 9 commit-graph.c| 47 --- commit-graph.h| 1 + t/t5318-commit-graph.sh | 9 +++- 5 files changed, 83 insertions(+), 9 deletions(-) diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt index 16452a0504..e367aa94b1 100644 --- a/Documentation/technical/commit-graph-format.txt +++ b/Documentation/technical/commit-graph-format.txt @@ -31,13 +31,22 @@ and hash type. All 4-byte numbers are in network order. +There are two versions available, 1 and 2. These currently differ only in +the header. + HEADER: +All commit-graph files use the first five bytes for the same purpose. + 4-byte signature: The signature is: {'C', 'G', 'P', 'H'} 1-byte version number: - Currently, the only valid version is 1. + Currently, the valid version numbers are 1 and 2. + +The remainder of the header changes depending on the version. 
+ +Version 1: 1-byte Hash Version (1 = SHA-1) We infer the hash length (H) from this value. @@ -47,6 +56,21 @@ HEADER: 1-byte (reserved for later use) Current clients should ignore this value. +Version 2: + + 1-byte number (C) of "chunks" + + 1-byte reachability index version number: + Currently, the only valid number is 1. + + 1-byte (reserved for later use) + Current clients expect this value to be zero, and will not + try to read the commit-graph file if it is non-zero. + + 4-byte format identifier for the hash algorithm: + If this identifier does not agree with the repository's current + hash algorithm, then the client will not read the commit graph. + CHUNK LOOKUP: (C + 1) * 12 bytes listing the table of contents for the chunks: diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index b1bed84260..28787d0c9c 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -102,6 +102,11 @@ static int graph_read(int argc, const char **argv) *(unsigned char*)(graph->data + 5), *(unsigned char*)(graph->data + 6), *(unsigned char*)(graph->data + 7)); + + if (*(unsigned char *)(graph->data + 4) == 2) + printf("hash algorithm: %X\n", + get_be32(graph->data + 8)); + printf("num_commits: %u\n", graph->num_commits); printf("chunks:"); @@ -162,6 +167,10 @@ static int graph_write(int argc, const char **argv) case 1: flags |= COMMIT_GRAPH_VERSION_1; break; + + case 2: + flags |= COMMIT_GRAPH_VERSION_2; + break; } read_replace_refs = 0; diff --git a/commit-graph.c b/commit-graph.c index f7f45893fd..aeb6cae656 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -90,7 +90,8 @@ struct commit_graph *load_commit_graph_one(const char *graph_file) uint64_t last_chunk_offset; uint32_t last_chunk_id; uint32_t graph_signature; - unsigned char graph_version, hash_version; + unsigned char graph_version, hash_version, reach_index_version; + uint32_t hash_id; if (fd < 0) return NULL; @@ -115,9 +116,9 @@ struct commit_graph *load_commit_graph_one(const char *graph_file) } 
graph_version = *(unsigned char*)(data + 4); - if (graph_version != 1) { - error(_("graph version %X does not match version %X"), - graph_version, 1); + if (!graph_version || graph_version > 2) { + error(_("unsupported graph version %X"), + g
[PATCH 2/6] commit-graph: collapse parameters into flags
From: Derrick Stolee The write_commit_graph() and write_commit_graph_reachable() methods currently take two boolean parameters: 'append' and 'report_progress'. We will soon expand the possible options to send to these methods, so instead of complicating the parameter list, first simplify it. Collapse these parameters into a 'flags' parameter, and adjust the callers to provide flags as necessary. Signed-off-by: Derrick Stolee --- builtin/commit-graph.c | 8 +--- builtin/commit.c | 2 +- builtin/gc.c | 4 ++-- commit-graph.c | 9 + commit-graph.h | 8 +--- 5 files changed, 18 insertions(+), 13 deletions(-) diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index b12d46fdc8..0c92421f75 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -127,6 +127,7 @@ static int graph_write(int argc, const char **argv) struct string_list *commit_hex = NULL; struct string_list lines; int result; + int flags = COMMIT_GRAPH_PROGRESS; static struct option builtin_commit_graph_write_options[] = { OPT_STRING(0, "object-dir", &opts.obj_dir, @@ -151,11 +152,13 @@ static int graph_write(int argc, const char **argv) die(_("use at most one of --reachable, --stdin-commits, or --stdin-packs")); if (!opts.obj_dir) opts.obj_dir = get_object_directory(); + if (opts.append) + flags |= COMMIT_GRAPH_APPEND; read_replace_refs = 0; if (opts.reachable) - return write_commit_graph_reachable(opts.obj_dir, opts.append, 1); + return write_commit_graph_reachable(opts.obj_dir, flags); string_list_init(&lines, 0); if (opts.stdin_packs || opts.stdin_commits) { @@ -175,8 +178,7 @@ static int graph_write(int argc, const char **argv) result = write_commit_graph(opts.obj_dir, pack_indexes, commit_hex, - opts.append, - 1); + flags); UNLEAK(lines); return result; diff --git a/builtin/commit.c b/builtin/commit.c index 04b0717b35..3228de4e3c 100644 --- a/builtin/commit.c +++ b/builtin/commit.c @@ -1668,7 +1668,7 @@ int cmd_commit(int argc, const char **argv, const char *prefix) "not exceeded, and 
then \"git reset HEAD\" to recover.")); if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) && - write_commit_graph_reachable(get_object_directory(), 0, 0)) + write_commit_graph_reachable(get_object_directory(), 0)) return 1; repo_rerere(the_repository, 0); diff --git a/builtin/gc.c b/builtin/gc.c index 9c6c9c9007..198872206b 100644 --- a/builtin/gc.c +++ b/builtin/gc.c @@ -663,8 +663,8 @@ int cmd_gc(int argc, const char **argv, const char *prefix) clean_pack_garbage(); if (gc_write_commit_graph && - write_commit_graph_reachable(get_object_directory(), 0, -!quiet && !daemonized)) + write_commit_graph_reachable(get_object_directory(), +!quiet && !daemonized ? COMMIT_GRAPH_PROGRESS : 0)) return 1; if (auto_gc && too_many_loose_objects()) diff --git a/commit-graph.c b/commit-graph.c index 162b9f2a85..28fe2378be 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -755,15 +755,14 @@ static int add_ref_to_list(const char *refname, return 0; } -int write_commit_graph_reachable(const char *obj_dir, int append, -int report_progress) +int write_commit_graph_reachable(const char *obj_dir, int flags) { struct string_list list = STRING_LIST_INIT_DUP; int result; for_each_ref(add_ref_to_list, &list); result = write_commit_graph(obj_dir, NULL, &list, - append, report_progress); + flags); string_list_clear(&list, 0); return result; @@ -772,7 +771,7 @@ int write_commit_graph_reachable(const char *obj_dir, int append, int write_commit_graph(const char *obj_dir, struct string_list *pack_indexes, struct string_list *commit_hex, - int append, int report_progress) + int flags) { struct packed_oid_list oids; struct packed_commit_list commits; @@ -791,6 +790,8 @@ int write_commit_graph(const char *obj_dir, struct strbuf progress_title = STRBUF_INIT; unsigned long approx_nr_objects; int res = 0; + int append = flags & COMMIT_GRAPH_APPEND; + int report_progress = flags & COMMIT_GRAP
[PATCH 4/6] commit-graph: add --version= option
From: Derrick Stolee

Allow the commit-graph builtin to specify the file format version using
the '--version=' option. Specify the version exactly in the
verification tests, as using a different version would change the
offsets used in those tests.

Signed-off-by: Derrick Stolee
---
 Documentation/git-commit-graph.txt |  3 +++
 builtin/commit-graph.c             | 13 +++-
 t/t5318-commit-graph.sh            |  2 +-
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 624470e198..1d1cc70de4 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -51,6 +51,9 @@ or `--stdin-packs`.)
 +
 With the `--append` option, include all commits that are present in the
 existing commit-graph file.
++
+With the `--version=` option, specify the file format version. Used
+only for testing.
 
 'read'::
 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 0c92421f75..b1bed84260 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -10,7 +10,7 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--object-dir ]"),
 	N_("git commit-graph read [--object-dir ]"),
 	N_("git commit-graph verify [--object-dir ]"),
-	N_("git commit-graph write [--object-dir ] [--append] [--reachable|--stdin-packs|--stdin-commits]"),
+	N_("git commit-graph write [--object-dir ] [--append] [--reachable|--stdin-packs|--stdin-commits] [--version=]"),
 	NULL
 };
 
@@ -25,7 +25,7 @@ static const char * const builtin_commit_graph_read_usage[] = {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-	N_("git commit-graph write [--object-dir ] [--append] [--reachable|--stdin-packs|--stdin-commits]"),
+	N_("git commit-graph write [--object-dir ] [--append] [--reachable|--stdin-packs|--stdin-commits] [--version=]"),
 	NULL
 };
 
@@ -35,6 +35,7 @@ static struct opts_commit_graph {
 	int stdin_packs;
 	int stdin_commits;
 	int append;
+	int version;
 } opts;
 
@@ -141,6 +142,8 @@ static int graph_write(int argc, const char **argv)
 			N_("start walk at commits listed by stdin")),
 		OPT_BOOL(0, "append", &opts.append,
 			N_("include all commits already in the commit-graph file")),
+		OPT_INTEGER(0, "version", &opts.version,
+			N_("specify the file format version")),
 		OPT_END(),
 	};
 
@@ -155,6 +158,12 @@ static int graph_write(int argc, const char **argv)
 	if (opts.append)
 		flags |= COMMIT_GRAPH_APPEND;
 
+	switch (opts.version) {
+	case 1:
+		flags |= COMMIT_GRAPH_VERSION_1;
+		break;
+	}
+
 	read_replace_refs = 0;
 
 	if (opts.reachable)
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index f4deb13b1d..b79d6263e9 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -328,7 +328,7 @@ test_expect_success 'replace-objects invalidates commit-graph' '
 
 test_expect_success 'git commit-graph verify' '
 	cd "$TRASH_DIRECTORY/full" &&
-	git rev-parse commits/8 | git commit-graph write --stdin-commits &&
+	git rev-parse commits/8 | git commit-graph write --stdin-commits --version=1 &&
 	git commit-graph verify >output
 '
--
gitgitgadget
[PATCH 1/6] commit-graph: return with errors during write
From: Derrick Stolee The write_commit_graph() method uses die() to report failure and exit when confronted with an unexpected condition. This use of die() in a library function is incorrect and is now replaced by error() statements and an int return type. Now that we use 'goto cleanup' to jump to the terminal condition on an error, we have new paths that could lead to uninitialized values. New initializers are added to correct for this. The builtins 'commit-graph', 'gc', and 'commit' call these methods, so update them to check the return value. Signed-off-by: Derrick Stolee --- builtin/commit-graph.c | 19 +++-- builtin/commit.c | 5 ++-- builtin/gc.c | 7 ++--- commit-graph.c | 60 +- commit-graph.h | 10 +++ 5 files changed, 62 insertions(+), 39 deletions(-) diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 4ae502754c..b12d46fdc8 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -126,6 +126,7 @@ static int graph_write(int argc, const char **argv) struct string_list *pack_indexes = NULL; struct string_list *commit_hex = NULL; struct string_list lines; + int result; static struct option builtin_commit_graph_write_options[] = { OPT_STRING(0, "object-dir", &opts.obj_dir, @@ -153,10 +154,8 @@ static int graph_write(int argc, const char **argv) read_replace_refs = 0; - if (opts.reachable) { - write_commit_graph_reachable(opts.obj_dir, opts.append, 1); - return 0; - } + if (opts.reachable) + return write_commit_graph_reachable(opts.obj_dir, opts.append, 1); string_list_init(&lines, 0); if (opts.stdin_packs || opts.stdin_commits) { @@ -173,14 +172,14 @@ static int graph_write(int argc, const char **argv) UNLEAK(buf); } - write_commit_graph(opts.obj_dir, - pack_indexes, - commit_hex, - opts.append, - 1); + result = write_commit_graph(opts.obj_dir, + pack_indexes, + commit_hex, + opts.append, + 1); UNLEAK(lines); - return 0; + return result; } int cmd_commit_graph(int argc, const char **argv, const char *prefix) diff --git a/builtin/commit.c 
b/builtin/commit.c index 004b816635..04b0717b35 100644 --- a/builtin/commit.c +++ b/builtin/commit.c @@ -1667,8 +1667,9 @@ int cmd_commit(int argc, const char **argv, const char *prefix) "new_index file. Check that disk is not full and quota is\n" "not exceeded, and then \"git reset HEAD\" to recover.")); - if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0)) - write_commit_graph_reachable(get_object_directory(), 0, 0); + if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) && + write_commit_graph_reachable(get_object_directory(), 0, 0)) + return 1; repo_rerere(the_repository, 0); run_command_v_opt(argv_gc_auto, RUN_GIT_CMD); diff --git a/builtin/gc.c b/builtin/gc.c index 7696017cd4..9c6c9c9007 100644 --- a/builtin/gc.c +++ b/builtin/gc.c @@ -662,9 +662,10 @@ int cmd_gc(int argc, const char **argv, const char *prefix) if (pack_garbage.nr > 0) clean_pack_garbage(); - if (gc_write_commit_graph) - write_commit_graph_reachable(get_object_directory(), 0, -!quiet && !daemonized); + if (gc_write_commit_graph && + write_commit_graph_reachable(get_object_directory(), 0, +!quiet && !daemonized)) + return 1; if (auto_gc && too_many_loose_objects()) warning(_("There are too many unreachable loose objects; " diff --git a/commit-graph.c b/commit-graph.c index 0f8274d15d..162b9f2a85 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -755,27 +755,30 @@ static int add_ref_to_list(const char *refname, return 0; } -void write_commit_graph_reachable(const char *obj_dir, int append, - int report_progress) +int write_commit_graph_reachable(const char *obj_dir, int append, +int report_progress) { struct string_list list = STRING_LIST_INIT_DUP; + int result; for_each_ref(add_ref_to_list, &list); - write_commit_graph(obj_dir, NULL, &list, append, report_progress); + result = write_commit_graph(obj_dir, NULL, &list, + append, report_progress); string
[PATCH 0/6] Create commit-graph file format v2
The commit-graph file format has some shortcomings that were discussed
on-list:

1. It doesn't use the 4-byte format ID from the_hash_algo.

2. There is no way to change the reachability index from generation
   numbers to corrected commit date [1].

3. The unused byte in the format could be used to signal the file is
   incremental, but current clients ignore the value even if it is
   non-zero.

This series adds a new version (2) to the commit-graph file. The fifth
byte already specified the file format, so existing clients will
gracefully respond to files with a different version number.

The only real change now is that the header takes 12 bytes instead of
8, due to using the 4-byte format ID for the hash algorithm. The new
bytes reserved for the reachability index version and incremental file
formats are now expected to be equal to the defaults. When we update
these values to be flexible in the future, if a client understands
commit-graph v2 but not those new values, then it will fail gracefully.

This series is based on ab/commit-graph-write-progress and bc/sha-256.

Thanks,
-Stolee

[1] https://public-inbox.org/git/6367e30a-1b3a-4fe9-611b-d931f51ef...@gmail.com/

Derrick Stolee (6):
  commit-graph: return with errors during write
  commit-graph: collapse parameters into flags
  commit-graph: create new version flags
  commit-graph: add --version= option
  commit-graph: implement file format version 2
  commit-graph: test verifying a corrupt v2 header

 Documentation/git-commit-graph.txt    |   3 +
 .../technical/commit-graph-format.txt |  26 ++-
 builtin/commit-graph.c                |  43 +++--
 builtin/commit.c                      |   5 +-
 builtin/gc.c                          |   7 +-
 commit-graph.c                        | 158 +-
 commit-graph.h                        |  16 +-
 t/t5318-commit-graph.sh               |  42 -
 8 files changed, 233 insertions(+), 67 deletions(-)

base-commit: 91b3ce35eeb93be1f4406e25ccdc4ab983a8e5af
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-112%2Fderrickstolee%2Fgraph%2Fv2-head-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-112/derrickstolee/graph/v2-head-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/112
--
gitgitgadget
Re: [PATCH 10/14] pack-objects: add trace2 regions
On 1/22/2019 4:22 PM, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee
>
> When studying the performance of 'git push' we would like to know how
> much time is spent at various parts of the command. One area that
> could cause performance trouble is 'git pack-objects'.
>
> Add trace2 regions around the three main actions taken in this
> command:
>
> 1. Enumerate objects.
> 2. Prepare pack.
> 3. Write pack-file.
>
> Signed-off-by: Derrick Stolee

We've had this patch in our private branch for a while, and it is how
we discovered the need for the sparse push algorithm [1]. We will use
it to measure its effectiveness as new Microsoft users onboard to the
new algorithm.

Thanks,
-Stolee

[1] https://public-inbox.org/git/pull.89.v5.git.gitgitgad...@gmail.com/
[PATCH 10/14] pack-objects: add trace2 regions
From: Derrick Stolee

When studying the performance of 'git push' we would like to know how
much time is spent at various parts of the command. One area that could
cause performance trouble is 'git pack-objects'.

Add trace2 regions around the three main actions taken in this command:

1. Enumerate objects.
2. Prepare pack.
3. Write pack-file.

Signed-off-by: Derrick Stolee
---
 builtin/pack-objects.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 889df2c755..6708529e3c 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -33,6 +33,7 @@
 #include "object-store.h"
 #include "dir.h"
 #include "midx.h"
+#include "trace2.h"
 
 #define IN_PACK(obj) oe_in_pack(&to_pack, obj)
 #define SIZE(obj) oe_size(&to_pack, obj)
@@ -3472,6 +3473,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		}
 	}
 
+	trace2_region_enter("pack-objects", "enumerate-objects", the_repository);
 	prepare_packing_data(the_repository, &to_pack);
 
 	if (progress)
@@ -3486,12 +3488,20 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (include_tag && nr_result)
 		for_each_ref(add_ref_tag, NULL);
 	stop_progress(&progress_state);
+	trace2_region_leave("pack-objects", "enumerate-objects", the_repository);
 
 	if (non_empty && !nr_result)
 		return 0;
-	if (nr_result)
+	if (nr_result) {
+		trace2_region_enter("pack-objects", "prepare-pack", the_repository);
 		prepare_pack(window, depth);
+		trace2_region_leave("pack-objects", "prepare-pack", the_repository);
+	}
+
+	trace2_region_enter("pack-objects", "write-pack-file", the_repository);
 	write_pack_file();
+	trace2_region_leave("pack-objects", "write-pack-file", the_repository);
+
 	if (progress)
 		fprintf_ln(stderr, _("Total %"PRIu32" (delta %"PRIu32"),"
--
gitgitgadget
Re: [PATCH v6 00/10] commit-graph write: progress output improvements
On 1/19/2019 3:21 PM, Ævar Arnfjörð Bjarmason wrote:
> Improvements since v6:
>
> * Integrate my "commit-graph write: use pack order when finding
>   commits" patch, and per Junio's suggestion put it at the start so
>   it's easier to split the two apart.
>
> * A signed-off-by of mine was missing on that patch. Fixed.
>
> * Replace SZEDER's two patches with his re-roll at
>   https://public-inbox.org/git/20190118170549.30403-1-szeder@gmail.com/
>   for convenience, since mine needed rebasing on top of his.
>
> * SZEDER rightly pointed out that I was doing some work for nothing in
>   https://public-inbox.org/git/20190118171605.gm...@szeder.dev/; fixed.
>
> SZEDER Gábor (2):
>   commit-graph: rename "large edges" to "extra edges"
>   commit-graph: don't call write_graph_chunk_large_edges() unnecessarily
>
> Ævar Arnfjörð Bjarmason (8):
>   commit-graph write: use pack order when finding commits
>   commit-graph write: add "Writing out" progress output
>   commit-graph write: more descriptive "writing out" output
>   commit-graph write: show progress for object search
>   commit-graph write: add more descriptive progress output
>   commit-graph write: remove empty line for readability
>   commit-graph write: add itermediate progress
>   commit-graph write: emit a percentage for all progress

I'm a few days late, but I took another look through the patches and
they look good to me. Thanks!

-Stolee
Re: Contributor Summit Topics and Logistics
On 1/22/2019 2:50 AM, Jeff King wrote:
> For people who want to try to join remotely, I don't think we're going
> to have a particularly fancy AV setup. But there should at least be a
> big screen (which we typically do not really use for presenting), and
> I hope we can provide some connectivity. I'll be visiting the venue
> the day before (Jan 30th) in the late afternoon (Brussels time) and
> I'll try to do a test run. If anybody wants to volunteer to be the
> guinea pig on the other end of the line, I'd welcome it.

I would like to join remotely, so I volunteer to do a test run. I'll
need to wake up early, so let's set an exact time privately.

Topics I would like to hear about:

- commit-graph status report (I can lead, if I'm able to join)
- multi-pack-index status report (same)
- reftable
- partial clone
- test coverage report, usefulness or improvements

Thanks,
-Stolee
Re: [PATCH] commit-graph write: use pack order when finding commits
On 1/17/2019 10:09 AM, Derrick Stolee wrote:
> The code itself looks good, I just need to double-check the
> performance numbers.

I tested this patch on our VFS-specific version using five "prefetch
packs".

Before: 22.61 s (avg)
After: 22.18 s (avg)
Improvement: 2%

The error rate among the samples is very low (~.15 s), so this result
is significant. Considering that we are writing 4.4 million commits --
many more than the objects in the packs that we are reading -- this is
an excellent improvement!

Thanks!
-Stolee
Re: [PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index
I've seen comments about some leftover uses of the word 'verb' instead of 'subcommand'. I can definitely clean those up, but I'm hoping that more comments could be made on the technical aspects of this series.

Thanks,
-Stolee
Re: [PATCH] commit-graph write: use pack order when finding commits
On 1/17/2019 8:23 AM, Ævar Arnfjörð Bjarmason wrote:

Slightly optimize the "commit-graph write" step by using FOR_EACH_OBJECT_PACK_ORDER with for_each_object_in_pack(). See commits [1] and [2] for the facility and a similar optimization for "cat-file".

On Linux it is around 5% slower to run:

	echo 1 >/proc/sys/vm/drop_caches &&
	cat .git/objects/pack/* >/dev/null &&
	git cat-file --batch-all-objects --batch-check --unordered

than the same thing with the "cat" omitted. This is as expected, since we're iterating in pack order and the "cat" is extra work.

Before this change the opposite was true of "commit-graph write". We were 6% faster if we first ran "cat" to efficiently populate the FS cache for our sole big pack on linux.git than if we had it populated via for_each_object_in_pack(). Now we're 3% faster without the "cat" instead. The tests were done on an unloaded Linux 3.10 system with 10 runs for each.

1. 736eb88fdc ("for_each_packed_object: support iterating in pack-order", 2018-08-10)
2. 0750bb5b51 ("cat-file: support "unordered" output for --batch-all-objects", 2018-08-10)

Thanks, Ævar! I plan to give this a test on the Windows repository, as this is the way we generate the commit-graph in VFS for Git. I created a PR on microsoft/git [1] and am generating the installers now [2].

[1] https://github.com/Microsoft/git/pull/110
[2] https://dev.azure.com/gvfs/ci/_build/results?buildId=6114

The code itself looks good, I just need to double-check the performance numbers.

Thanks,
-Stolee
[PATCH v5 3/5] revision: implement sparse algorithm
From: Derrick Stolee

When enumerating objects to place in a pack-file during 'git pack-objects --revs', we discover the "frontier" of commits that we care about and the boundary with commits we find uninteresting. From that point, we walk trees to discover which trees and blobs are uninteresting. Finally, we walk trees from the interesting commits to find the interesting objects that are placed in the pack.

This commit introduces a new, "sparse" way to discover the uninteresting trees. We use the perspective of a single user trying to push their topic to a large repository. That user likely changed a very small fraction of the paths in their working directory, but we spend a lot of time walking all reachable trees.

The way to switch the logic to work in this sparse way is to start caring about which paths introduce new trees. While it is not possible to generate a diff between the frontier boundary and all of the interesting commits, we can simulate that behavior by inspecting all of the root trees as a whole, then recursing down to the set of trees at each path.

We already had taken the first step by passing an oidset to mark_trees_uninteresting_sparse(). We now create a dictionary whose keys are paths and whose values are oidsets. We consider the set of trees that appear at each path. While we inspect a tree, we add its subtrees to the oidsets corresponding to the tree entry's path. We also mark trees as UNINTERESTING if the tree we are parsing is UNINTERESTING.

To actually improve the performance, we need to terminate our recursion. If the oidset contains only UNINTERESTING trees, then we do not continue the recursion. This avoids walking trees that are likely to not be reachable from interesting trees. If the oidset contains only interesting trees, then we will walk these trees in the final stage that collects the interesting objects to place in the pack. Thus, we only recurse if the oidset contains both interesting and UNINTERESTING trees.
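The recursion-termination rule described above can be sketched with a small model. This is illustrative Python, not Git's C implementation: the Tree class, its fields, and the sparse_walk name are invented for the example, and blobs and oid-based sharing are omitted.

```python
from collections import defaultdict

class Tree:
    """Toy stand-in for a Git tree: named subtrees plus an UNINTERESTING bit."""
    def __init__(self, entries=None, uninteresting=False):
        self.entries = entries or {}      # path -> child Tree
        self.uninteresting = uninteresting

def sparse_walk(trees):
    """Propagate UNINTERESTING downward, but recurse into a path only when
    the set of trees seen there is mixed (interesting + uninteresting)."""
    if all(t.uninteresting for t in trees) or not any(t.uninteresting for t in trees):
        return  # terminate: an unmixed set tells us nothing new

    # Dictionary of path -> list of subtrees appearing at that path,
    # marking subtrees of UNINTERESTING trees as UNINTERESTING on the way.
    children = defaultdict(list)
    for t in trees:
        for path, child in t.entries.items():
            if t.uninteresting:
                child.uninteresting = True
            children[path].append(child)

    for group in children.values():
        sparse_walk(group)

# An uninteresting root and an interesting root share the path "dir":
old_sub, new_sub = Tree(), Tree()
old_root = Tree({"dir": old_sub}, uninteresting=True)
new_root = Tree({"dir": new_sub})
sparse_walk([old_root, new_root])
assert old_sub.uninteresting and not new_sub.uninteresting
```

When every tree at a path is interesting, the early return leaves the whole subtree for the final collection stage, which is exactly where the sparse walk saves work.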
There are a few ways that this is not a universally better option.

First, we can pack extra objects. If someone copies a subtree from one tree to another, the first tree will appear UNINTERESTING and we will not recurse to see that the subtree should also be UNINTERESTING. We will walk the new tree and see the subtree as a "new" object and add it to the pack. A test is modified to demonstrate this behavior and to verify that the new logic is being exercised.

Second, we can have extra memory pressure. If instead of being a single user pushing a small topic we are a server sending new objects from across the entire working directory, then we will gain very little (the recursion will rarely terminate early) but will spend extra time maintaining the path-oidset dictionaries.

Despite these potential drawbacks, the benefits of the algorithm are clear. By adding a counter to 'add_children_by_path' and 'mark_tree_contents_uninteresting', I measured the number of parsed trees for the two algorithms in a variety of repos.

For git.git, I used the following input:

	v2.19.0
	^v2.19.0~10

	Objects to pack:  550
	Walked (old alg): 282
	Walked (new alg): 130

For the Linux repo, I used the following input:

	v4.18
	^v4.18~10

	Objects to pack:    518
	Walked (old alg): 4,836
	Walked (new alg):   188

The two repos above are rather "wide and flat" compared to other repos that I have used in the past. As a comparison, I tested an old topic branch in the Azure DevOps repo, which has a much deeper folder structure than the Linux repo.

	Objects to pack:     220
	Walked (old alg): 22,804
	Walked (new alg):    129

I used the number of walked trees as the main metric above because it is consistent across multiple runs. When I ran my tests, the performance of the pack-objects command with the same options could change the end-to-end time by 10x depending on whether the file system was warm. However, by repeating the same test I could get more consistent timing results.
The git.git and Linux tests were too fast overall (less than 0.5s) to measure an end-to-end difference. The Azure DevOps case was slow enough to see the time improve from 15s to 1s in the warm case. The cold case improved from 90s to 9s in my testing.

These improvements will have even larger benefits in the super-large Windows repository. In our experiments, we see the "Enumerate objects" phase of pack-objects taking 60-80% of the end-to-end time of non-trivial pushes, taking longer than the network time to send the pack and the server time to verify the pack.

Signed-off-by: Derrick Stolee
---
 revision.c                     | 138 ++---
 t/t5322-pack-objects-sparse.sh |  14 +++-
 2 files changed, 139 insertions(+), 13 deletions(-)

diff --git a/revision.c b/revision.c
index 60421f3a10..5de739384a 100644
--- a/revision.c
+++ b/revision.c
@@ -27,6 +27,7 @@
 #include "commit-reac
[PATCH v5 1/5] revision: add mark_tree_uninteresting_sparse
From: Derrick Stolee

In preparation for a new algorithm that walks fewer trees when creating a pack from a set of revisions, create a method that takes an oidset of tree oids and marks reachable objects as UNINTERESTING.

The current implementation uses the existing mark_tree_uninteresting to recursively walk the trees and blobs. This will walk the same number of trees as the old mechanism. To ensure that mark_tree_uninteresting walks the tree, we need to remove the UNINTERESTING flag before calling the method. This implementation will be replaced entirely in a later commit.

There is one new assumption in this approach: we are also given the oids of the interesting trees. This implementation does not use those trees at the moment, but we will use them in a later rewrite of this method.

Signed-off-by: Derrick Stolee
---
 revision.c | 25 +
 revision.h |  2 ++
 2 files changed, 27 insertions(+)

diff --git a/revision.c b/revision.c
index 13e0519c02..60421f3a10 100644
--- a/revision.c
+++ b/revision.c
@@ -99,6 +99,31 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree)
 	mark_tree_contents_uninteresting(r, tree);
 }
 
+void mark_trees_uninteresting_sparse(struct repository *r,
+				     struct oidset *trees)
+{
+	struct object_id *oid;
+	struct oidset_iter iter;
+
+	oidset_iter_init(trees, &iter);
+	while ((oid = oidset_iter_next(&iter))) {
+		struct tree *tree = lookup_tree(r, oid);
+
+		if (!tree)
+			continue;
+
+		if (tree->object.flags & UNINTERESTING) {
+			/*
+			 * Remove the flag so the next call
+			 * is not a no-op. The flag is added
+			 * in mark_tree_unintersting().
+			 */
+			tree->object.flags ^= UNINTERESTING;
+			mark_tree_uninteresting(r, tree);
+		}
+	}
+}
+
 struct commit_stack {
 	struct commit **items;
 	size_t nr, alloc;

diff --git a/revision.h b/revision.h
index 7987bfcd2e..df684701b9 100644
--- a/revision.h
+++ b/revision.h
@@ -67,6 +67,7 @@ struct rev_cmdline_info {
 #define REVISION_WALK_NO_WALK_SORTED 1
 #define REVISION_WALK_NO_WALK_UNSORTED 2
 
+struct oidset;
 struct topo_walk_info;
 
 struct rev_info {
@@ -327,6 +328,7 @@ void put_revision_mark(const struct rev_info *revs,
 void mark_parents_uninteresting(struct commit *commit);
 void mark_tree_uninteresting(struct repository *r, struct tree *tree);
+void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *trees);
 void show_object_with_name(FILE *, struct object *, const char *);
--
gitgitgadget
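The reason for the flag flip in the patch above can be seen in a tiny model of the two functions. This is a hypothetical Python sketch, not the C code itself; it omits blobs and the repository parameter. The point is that the recursive marker short-circuits on an already-set flag, so the sparse helper clears the flag first to force a full traversal.

```python
UNINTERESTING = 1 << 0

class Tree:
    def __init__(self, subtrees=()):
        self.flags = 0
        self.subtrees = list(subtrees)

def mark_tree_uninteresting(tree):
    if tree.flags & UNINTERESTING:
        return                          # already marked: refuses to recurse
    tree.flags |= UNINTERESTING
    for sub in tree.subtrees:
        mark_tree_uninteresting(sub)

def mark_trees_uninteresting_sparse(trees):
    for tree in trees:
        if tree.flags & UNINTERESTING:
            tree.flags ^= UNINTERESTING   # clear so the call below is not a no-op
            mark_tree_uninteresting(tree)

# A root marked UNINTERESTING whose subtree was never visited:
leaf = Tree()
root = Tree([leaf])
root.flags |= UNINTERESTING
mark_trees_uninteresting_sparse([root])
assert leaf.flags & UNINTERESTING       # the subtree is now marked too
```

Without the flag flip, the call would return immediately and the subtree would stay unmarked, which is exactly the no-op the comment in the patch warns about.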
[PATCH v5 4/5] pack-objects: create pack.useSparse setting
From: Derrick Stolee

The '--sparse' flag in 'git pack-objects' changes the algorithm used to enumerate objects to one that is faster for individual users pushing new objects that change only a small cone of the working directory. The sparse algorithm is not recommended for a server, which likely sends new objects that appear across the entire working directory.

Create a 'pack.useSparse' setting that enables this new algorithm. This allows 'git push' to use this algorithm without passing a '--sparse' flag all the way through four levels of run_command() calls.

If the '--no-sparse' flag is set, then this config setting is overridden.

Signed-off-by: Derrick Stolee
---
 Documentation/config/pack.txt  |  9 +
 builtin/pack-objects.c         |  4 
 t/t5322-pack-objects-sparse.sh | 15 +++
 3 files changed, 28 insertions(+)

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index edac75c83f..425c73aa52 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -105,6 +105,15 @@ pack.useBitmaps::
 	true. You should not generally need to turn this off unless
 	you are debugging pack bitmaps.
 
+pack.useSparse::
+	When true, git will default to using the '--sparse' option in
+	'git pack-objects' when the '--revs' option is present. This
+	algorithm only walks trees that appear in paths that introduce new
+	objects. This can have significant performance benefits when
+	computing a pack to send a small change. However, it is possible
+	that extra objects are added to the pack-file if the included
+	commits contain certain types of direct renames.
+
 pack.writeBitmaps (deprecated)::
 	This is a deprecated synonym for `repack.writeBitmaps`.
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 7d5b0735e3..124b1bafc4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 		use_bitmap_index_default = git_config_bool(k, v);
 		return 0;
 	}
+	if (!strcmp(k, "pack.usesparse")) {
+		sparse = git_config_bool(k, v);
+		return 0;
+	}
 	if (!strcmp(k, "pack.threads")) {
 		delta_search_threads = git_config_int(k, v);
 		if (delta_search_threads < 0)

diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 9f2a6e5d31..3233fafc90 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -118,4 +118,19 @@ test_expect_success 'sparse pack-objects' '
 	test_cmp expect_sparse_objects.txt sparse_objects.txt
 '
 
+test_expect_success 'pack.useSparse enables algorithm' '
+	git config pack.useSparse true &&
+	git pack-objects --stdout --revs sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index sparse_objects.txt &&
+	test_cmp expect_sparse_objects.txt sparse_objects.txt
+'
+
+test_expect_success 'pack.useSparse overridden' '
+	git pack-objects --stdout --revs --no-sparse sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index sparse_objects.txt &&
+	test_cmp required_objects.txt sparse_objects.txt
+'
+
 test_done
--
gitgitgadget
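With this patch applied, a user can opt in through their git configuration; for example, a repository's .git/config could carry:

```ini
[pack]
	useSparse = true
```

As the commit message states, 'git pack-objects' picks this up in git_pack_config(), and an explicit '--no-sparse' on the command line still overrides the setting.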
[PATCH v5 5/5] pack-objects: create GIT_TEST_PACK_SPARSE
From: Derrick Stolee

Create a test variable GIT_TEST_PACK_SPARSE to enable the sparse object walk algorithm by default during the test suite. Enabling this variable ensures coverage in many interesting cases, such as shallow clones, partial clones, and missing objects.

Signed-off-by: Derrick Stolee
---
 builtin/pack-objects.c         | 1 +
 t/README                       | 4 
 t/t5322-pack-objects-sparse.sh | 4 ++--
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 124b1bafc4..507d381153 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3331,6 +3331,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	read_replace_refs = 0;
 
+	sparse = git_env_bool("GIT_TEST_PACK_SPARSE", 0);
 	reset_pack_idx_option(&pack_idx_opts);
 	git_config(git_pack_config, NULL);

diff --git a/t/README b/t/README
index 28711cc508..8b6dfe1864 100644
--- a/t/README
+++ b/t/README
@@ -342,6 +342,10 @@ GIT_TEST_INDEX_VERSION= exercises the index read/write code path
 for the index version specified. Can be set to any valid version
 (currently 2, 3, or 4).
 
+GIT_TEST_PACK_SPARSE= if enabled will default the pack-objects
+builtin to use the sparse object walk. This can still be overridden by
+the --no-sparse command-line argument.
+
 GIT_TEST_PRELOAD_INDEX= exercises the preload-index code path by
 overriding the minimum number of cache entries required per thread.
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 3233fafc90..7124b5581a 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -36,7 +36,7 @@ test_expect_success 'setup repo' '
 '
 
 test_expect_success 'non-sparse pack-objects' '
-	git pack-objects --stdout --revs nonsparse.pack &&
+	git pack-objects --stdout --revs --no-sparse nonsparse.pack &&
 	git index-pack -o nonsparse.idx nonsparse.pack &&
 	git show-index nonsparse_objects.txt &&
 	test_cmp expect_objects.txt nonsparse_objects.txt
@@ -64,7 +64,7 @@ test_expect_success 'duplicate a folder from f3 and commit to topic1' '
 '
 
 test_expect_success 'non-sparse pack-objects' '
-	git pack-objects --stdout --revs nonsparse.pack &&
+	git pack-objects --stdout --revs --no-sparse nonsparse.pack &&
 	git index-pack -o nonsparse.idx nonsparse.pack &&
 	git show-index nonsparse_objects.txt &&
 	comm -1 -2 required_objects.txt nonsparse_objects.txt >nonsparse_required_objects.txt &&
--
gitgitgadget
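The behavior of a GIT_TEST_* switch like the one added above can be modeled as follows. This is an illustrative Python approximation of what git_env_bool() does (the real parsing lives in Git's config.c); treating unrecognized non-empty values as true is a simplification of this sketch, not a claim about Git's exact parser.

```python
import os

def git_env_bool(name, default=0):
    """Approximate Git's git_env_bool(): an unset variable yields the
    default; otherwise parse the value as a boolean, where "", "0",
    "false", "no", and "off" count as false."""
    val = os.environ.get(name)
    if val is None:
        return default
    return 0 if val.lower() in ("", "0", "false", "no", "off") else 1

# The test suite exports the variable, and pack-objects defaults to sparse:
os.environ["GIT_TEST_PACK_SPARSE"] = "1"
sparse = git_env_bool("GIT_TEST_PACK_SPARSE", 0)
assert sparse == 1
```

Note the ordering in the patch: the environment default is read before git_config(), so pack.useSparse and the --sparse/--no-sparse options can still override it.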
[PATCH v5 2/5] list-objects: consume sparse tree walk
From: Derrick Stolee

When creating a pack-file using 'git pack-objects --revs' we provide a list of interesting and uninteresting commits. For example, a push operation would make the local topic branch be interesting and the known remote refs as uninteresting. We want to discover the set of new objects to send to the server as a thin pack.

We walk these commits until we discover a frontier of commits such that every commit walk starting at interesting commits ends in a root commit or uninteresting commit. We then need to discover which non-commit objects are reachable from uninteresting commits. This commit walk is not changing during this series.

The mark_edges_uninteresting() method in list-objects.c iterates on the commit list and does the following:

* If the commit is UNINTERESTING, then mark its root tree and every object it can reach as UNINTERESTING.
* If the commit is interesting, then mark the root tree of every UNINTERESTING parent (and all objects that tree can reach) as UNINTERESTING.

At the very end, we repeat the process on every commit directly given to the revision walk from stdin. This helps ensure we properly cover shallow commits that otherwise were not included in the frontier.

The logic to recursively follow trees is in the mark_tree_uninteresting() method in revision.c. The algorithm avoids duplicate work by not recursing into trees that are already marked UNINTERESTING.

Add a new 'sparse' option to the mark_edges_uninteresting() method that performs this logic in a slightly different way. As we iterate over the commits, we add all of the root trees to an oidset. Then, call mark_trees_uninteresting_sparse() on that oidset. Note that we include interesting trees in this process. The current implementation of mark_trees_uninteresting_sparse() will walk the same trees as the old logic, but this will be replaced in a later change.

Add a '--sparse' flag in 'git pack-objects' to call this new logic.
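The two modes described above can be summarized in a simplified Python sketch. The names mirror the C functions, but the commit objects and callbacks are invented stand-ins, and the extra stdin-commit pass is omitted.

```python
class Node:
    """Toy commit: a root tree handle, an UNINTERESTING bit, and parents."""
    def __init__(self, root_tree, uninteresting=False, parents=()):
        self.root_tree = root_tree
        self.uninteresting = uninteresting
        self.parents = list(parents)

def mark_edges_uninteresting(commits, mark_tree, mark_trees_sparse, sparse):
    if sparse:
        # Collect every root tree -- interesting ones included -- and let
        # the sparse walker decide what to mark.
        trees = {c.root_tree for c in commits}
        mark_trees_sparse(trees)
    else:
        for c in commits:
            if c.uninteresting:
                mark_tree(c.root_tree)          # mark root tree and below
            else:
                for p in c.parents:
                    if p.uninteresting:
                        mark_tree(p.root_tree)  # mark uninteresting edges

# Non-sparse mode touches A's tree both as a root and as B's parent:
calls = []
A = Node("tree-A", uninteresting=True)
B = Node("tree-B", parents=[A])
mark_edges_uninteresting([A, B], calls.append,
                         lambda trees: calls.extend(sorted(trees)), sparse=False)
assert calls == ["tree-A", "tree-A"]
```

The contrast is the key design point: the old mode decides per commit edge which trees to mark, while the sparse mode defers every decision to a single pass over the collected set of root trees.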
Add a new test script t/t5322-pack-objects-sparse.sh that tests this option. The tests currently demonstrate that the resulting object list is the same as the old algorithm. This includes a case where both algorithms pack an object that is not needed by a remote due to limits on the explored set of trees. When the sparse algorithm is changed in a later commit, we will add a test that demonstrates a change of behavior in some cases.

Signed-off-by: Derrick Stolee
---
 Documentation/git-pack-objects.txt |  11 ++-
 bisect.c                           |   2 +-
 builtin/pack-objects.c             |   5 +-
 builtin/rev-list.c                 |   2 +-
 http-push.c                        |   2 +-
 list-objects.c                     |  70 +++---
 list-objects.h                     |   4 +-
 t/t5322-pack-objects-sparse.sh     | 113 +
 8 files changed, 192 insertions(+), 17 deletions(-)
 create mode 100755 t/t5322-pack-objects-sparse.sh

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index 40c825c381..e45f3e680d 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -14,7 +14,7 @@ SYNOPSIS
 	[--local] [--incremental] [--window=] [--depth=]
 	[--revs [--unpacked | --all]] [--keep-pack=]
 	[--stdout [--filter=] | base-name]
-	[--shallow] [--keep-true-parents] < object-list
+	[--shallow] [--keep-true-parents] [--sparse] < object-list
 
 DESCRIPTION
@@ -196,6 +196,15 @@ depth is 4095.
 	Add --no-reuse-object if you want to force a uniform compression
 	level on all data no matter the source.
 
+--sparse::
+	Use the "sparse" algorithm to determine which objects to include in
+	the pack, when combined with the "--revs" option. This algorithm
+	only walks trees that appear in paths that introduce new objects.
+	This can have significant performance benefits when computing
+	a pack to send a small change. However, it is possible that extra
+	objects are added to the pack-file if the included commits contain
+	certain types of direct renames.
+
 --thin::
 	Create a "thin" pack by omitting the common objects between a
 	sender and a receiver in order to reduce network transfer. This

diff --git a/bisect.c b/bisect.c
index 487675c672..842f8b4b8f 100644
--- a/bisect.c
+++ b/bisect.c
@@ -656,7 +656,7 @@ static void bisect_common(struct rev_info *revs)
 	if (prepare_revision_walk(revs))
 		die("revision walk setup failed");
 	if (revs->tree_objects)
-		mark_edges_uninteresting(revs, NULL);
+		mark_edges_uninteresting(revs, NULL, 0);
 }
 
 static void exit_if_skipped_commits(struct commit_list *tried,

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 411aefd687..7d5b0735e3 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@
[PATCH v5 0/5] Add a new "sparse" tree walk algorithm
One of the biggest remaining pain points for users of very large repositories is the time it takes to run 'git push'. We inspected some slow pushes by our developers and found that the "Enumerating objects" phase of a push was very slow. This is unsurprising, because this is why reachability bitmaps exist. However, reachability bitmaps are not available to us because of the single pack-file requirement. The bitmap approach is intended for servers anyway, and clients have a much different behavior pattern.

Specifically, clients are normally pushing a very small number of objects compared to the entire working directory. A typical user changes only a small cone of the working directory, so let's use that to our benefit.

Create a new "sparse" mode for 'git pack-objects' that uses the paths that introduce new objects to direct our search into the reachable trees. By collecting trees at each path, we can then recurse into a path only when there are uninteresting and interesting trees at that path. This gains a significant performance boost for small topics while presenting a possibility of packing extra objects.

The main algorithm change is in patch 4, but is set up a little bit in patches 1 and 2.

As demonstrated in the included test script, we see that the existing algorithm can send extra objects due to the way we specify the "frontier". But we can send even more objects if a user copies objects from one folder to another. I say "copy" because a rename would (usually) change the original folder and trigger a walk into that path, discovering the objects.

In order to benefit from this approach, the user can opt in using the pack.useSparse config setting. This setting can be overridden using the '--no-sparse' option.

Update in V2:

* Added GIT_TEST_PACK_SPARSE test option.
* Fixed test breakages when GIT_TEST_PACK_SPARSE is enabled by adding null checks.

Update in V3:

* Change documentation around the 'pack.useSparse' config setting to better advertise to users.
* Mostly a ping now that v2.20.0 is out.

Updates in V4:

* Switched to using hashmap instead of string_list for the path/oidset dictionary. (This is due to some fear that the string_list performance would degrade for a very wide tree, but I am unable to measure a performance difference.)
* Some cleanup of code snippets across commits.
* Some grammar cleanup in the commit messages.

Updates in V5:

* Renamed the generic "set" to "trees".
* Used a less restrictive test condition when testing the old algorithm, except for the case where we want to verify that the new algorithm is being run.
* Removed a global variable that was not used.

Thanks, Ramsay and Junio!

Derrick Stolee (5):
  revision: add mark_tree_uninteresting_sparse
  list-objects: consume sparse tree walk
  revision: implement sparse algorithm
  pack-objects: create pack.useSparse setting
  pack-objects: create GIT_TEST_PACK_SPARSE

 Documentation/config/pack.txt      |   9 ++
 Documentation/git-pack-objects.txt |  11 ++-
 bisect.c                           |   2 +-
 builtin/pack-objects.c             |  10 +-
 builtin/rev-list.c                 |   2 +-
 http-push.c                        |   2 +-
 list-objects.c                     |  70 +++---
 list-objects.h                     |   4 +-
 revision.c                         | 143 +
 revision.h                         |   2 +
 t/README                           |   4 +
 t/t5322-pack-objects-sparse.sh     | 136 +++
 12 files changed, 378 insertions(+), 17 deletions(-)
 create mode 100755 t/t5322-pack-objects-sparse.sh

base-commit: a1598010f775d82b5adf12c29d0f5bc9b41434c6
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-89%2Fderrickstolee%2Fpush%2Fsparse-v5
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-89/derrickstolee/push/sparse-v5
Pull-Request: https://github.com/gitgitgadget/git/pull/89

Range-diff vs v4:

1:  817e30a287 ! 1:  02ef702884 revision: add mark_tree_uninteresting_sparse
    @@ -9,7 +9,10 @@
         The current implementation uses the existing
         mark_tree_uninteresting to recursively walk the trees and blobs.
    -    This will walk the same number of trees as the old mechanism.
    +    This will walk the same number of trees as the old mechanism. To
    +    ensure that mark_tree_uninteresting walks the tree, we need to
    +    remove the UNINTERESTING flag before calling the method. This
    +    implementation will be replaced entirely in a later commit.
    
         There is one new assumption in this approach: we are also given
         the oids of the interesting trees. This implementation does not
    @@ -26,12 +29,12 @@
         }
    
         +void mark_trees_uninteresting_sparse(struct repository *r,
    -    +					struct oidse
Git Test Coverage Report (Wed Jan 16)
rebase--interactive.c

Alban Gruin 8414c890a: sequencer: refactor sequencer_add_exec_commands() to work on a todo_list
Alban Gruin 98b29e060: sequencer: refactor skip_unnecessary_picks() to work on a todo_list
Alban Gruin c1c074e0c: sequencer: change complete_action() to use the refactored functions
Alban Gruin c27b32f0e: sequencer: refactor check_todo_list() to work on a todo_list
Alban Gruin cf18b3f6c: sequencer: introduce todo_list_write_to_file()
Alban Gruin e5b1c9d92: rebase-interactive: rewrite edit_todo_list() to handle the initial edit
Alban Gruin febebd99b: sequencer: refactor rearrange_squash() to work on a todo_list
Anders Waldenborg 18f8e8109: strbuf: separate callback for strbuf_expand:ing literals
Anders Waldenborg 4681fe38e: pretty: allow showing specific trailers
Anders Waldenborg b755bf6f8: pretty: allow %(trailers) options with explicit value
Derrick Stolee 8446350be: revision: add mark_tree_uninteresting_sparse
Derrick Stolee 9192bb22f: list-objects: consume sparse tree walk
Derrick Stolee 9949aaeef: revision: implement sparse algorithm
Jeff King 34a9469d6: remote-curl: refactor smart-http discovery
Jiang Xin a338d1039: pack-redundant: consistent sort method
Jiang Xin cb7e0336f: pack-redundant: rename pack_list.all_objects
Joel Teichroeb cdca49bc4: stash: convert drop and clear to builtin
Joel Teichroeb e1d01876a: stash: convert pop to builtin
Joel Teichroeb f6bbd7812: stash: convert apply to builtin
Johannes Schindelin 26799a208: stash: optionally use the scripted version again
Johannes Schindelin bec65d5b7: tests: add a special setup where stash.useBuiltin is off
Josh Steadmon 6da1f1a92: protocol: advertise multiple supported versions
Liam Beguin 0cce4a275: rebase -i -x: add exec commands via the rebase--helper
Linus Torvalds 080448fbe: Add 'human' date format
Martin Koegler 5efde212f: zlib.c: use size_t for size
Nguyễn Thái Ngọc Duy 0d6caa2d0: merge-recursive.c: remove implicit dependency on the_index
Nguyễn Thái Ngọc Duy 6f11fd5ed: config: add --move-to
Nguyễn Thái Ngọc Duy 8f7c7f555: config.c: add repo_config_set_worktree_gently()
Nguyễn Thái Ngọc Duy a12c1ff3a: config: factor out set_config_source_file()
Nguyễn Thái Ngọc Duy e1ff0a32e: read-cache.c: kill read_index()
Paul-Sebastian Ungureanu 1f5a011d9: stash: convert create to builtin
Paul-Sebastian Ungureanu 51809c70c: stash: convert `stash--helper.c` into `stash.c`
Paul-Sebastian Ungureanu 847eb0b0a: stash: convert store to builtin
Paul-Sebastian Ungureanu 9a95010a1: stash: make push -q quiet
Paul-Sebastian Ungureanu b4493f269: stash: convert show to builtin
Paul-Sebastian Ungureanu bfc3fe33f: strbuf.c: add `strbuf_insertf()` and `strbuf_vinsertf()`
Paul-Sebastian Ungureanu fa38428f7: stash: convert push to builtin
René Scharfe 878056005: sequencer: factor out strbuf_read_file_or_whine()

Uncovered code in 'jch' not in 'next'

builtin/archive.c
01f9ec64c8 builtin/archive.c 63) if (starts_with(reader.line, "NACK "))
01f9ec64c8 builtin/archive.c 64) die(_("git archive: NACK %s"), reader.line + 5);

builtin/bisect--helper.c
5e82c3dd22 builtin/bisect--helper.c 162) if (get_oid_commit(commit, &oid))
5e82c3dd22 builtin/bisect--helper.c 163) return error(_("'%s' is not a valid commit"), commit);
5e82c3dd22 builtin/bisect--helper.c 164) strbuf_addstr(&branch, commit);
5e82c3dd22 builtin/bisect--helper.c 172) strbuf_release(&branch);
5e82c3dd22 builtin/bisect--helper.c 173) argv_array_clear(&argv);
5e82c3dd22 builtin/bisect--helper.c 174) return error(_("could not check out original"
0f30233a11 builtin/bisect--helper.c 215) retval = error(_("Bad bisect_write argument: %s"), state);
0f30233a11 builtin/bisect--helper.c 216) goto finish;
0f30233a11 builtin/bisect--helper.c 220) retval = error(_("couldn't get the oid of the rev '%s'"), rev);
0f30233a11 builtin/bisect--helper.c 221) goto finish;
0f30233a11 builtin/bisect--helper.c 226) retval = -1;
0f30233a11 builtin/bisect--helper.c 227) goto finish;
0f30233a11 builtin/bisect--helper.c 232) retval = error_errno(_("couldn't open the file '%s'"), git_path_bisect_log());
0f30233a11 builtin/bisect--helper.c 233) goto finish;
129a6cf344 builtin/bisect--helper.c 329) yesno = git_prompt(_("Are you sure [Y/n]? "), PROMPT_ECHO);
129a6cf344 builtin/bisect--helper.c 330) if (starts_with(yesno, "N") || starts_with(yesno, "n"))
129a6cf344 builtin/bisect--helper.c 331) retval = -1;
129a6cf344 builtin/bisect--helper.c 332) goto finish;
129a6cf344 builtin/bisect--helper.c 338) retval = error(_(need_bisect_start_warning),
450ebb7359 builtin/bisect--helper.c 3
Re: [PATCH v4 3/6] pack-objects: add --sparse option
On 1/11/2019 5:30 PM, Junio C Hamano wrote:

"Derrick Stolee via GitGitGadget" writes:

From: Derrick Stolee

Add a '--sparse' option flag to the pack-objects builtin. This allows the user to specify that they want to use the new logic for walking trees. This logic currently does not differ from the existing output, but will in a later change.

Create a new test script, t5322-pack-objects-sparse.sh, to ensure the object list that is selected matches what we expect. When we update the logic to walk in a sparse fashion, the final test will be updated to show the extra objects that are added.

Somehow checking that "these are exactly what we expect" feels brittle. In a history with three relevant commits A---B---C, packing B..C could omit trees and blobs in C that appear in A but not in B, but traditionally, because we stop traversal at B and do not even look at A, we do not notice that such objects that are needed to complete C's tree are already available in a repository that has B.

The approach of the test in this patch feels similar to saying "we must send these duplicates because we must stay dumb", even though with perfect knowledge about the history, e.g. a bitmap, we would be able to omit these objects in A that appear in C but not in B.

My intention with this test was to show that the existing algorithm also sends "extra" objects in certain situations, which is later contrasted by a test where the new algorithm sends objects the old algorithm did not. Instead of "we must stay dumb" I wanted to say "we are not dumber (in this case)".

I understand your brittle feeling. If the logic in either algorithm changed to be smarter, then we would need to change the test. In some sense, that does mean we have extra coverage of "this is how it works now", so we would understand the behavior change of that hypothetical future change.

I think we want to test both "are we sending enough, even though we might be wasting some bandwidth by not noticing that the other side already has some?"
and "are we still avoiding becoming overly stupid, even though we may be cheating to save traversal cost and risking redundantly sending some objects?"

IOW, a set of tests to make sure that truly new objects are all sent, by seeing that what is sent is a strict superset of what we expect. And another set of tests to make sure that objects that are so obviously (this criterion may be highly subjective) present in the receiving repository (e.g. the tree object of commit B and what it contains when sending B..C) are never sent, even when using an algorithm that is tuned for traversal cost over bandwidth consumption.

To properly test "these objects are included," do we have an established pattern for saying "this file is a subset of that file"? Or, should I use something like `comm expected actual >common && test_cmp expected common`?

The code change in this step looks all trivially good, and it may want to be squashed into the previous step to become a single patch. Otherwise, [2/6] would be adding dead code that nobody exercises until this step.

Will squash.

Thanks,
-Stolee
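One possible answer to the "subset" question above, as a hypothetical sketch rather than an established test-lib pattern: the `comm`-style check amounts to asserting that no expected line is missing from the actual list, with both inputs treated as sorted, de-duplicated line sets (the same precondition `comm` itself imposes).

```python
def assert_subset(expected_lines, actual_lines):
    """Fail if any expected line is absent from actual -- equivalent to
    `comm -23 expected actual` producing no output on sorted input."""
    missing = sorted(set(expected_lines) - set(actual_lines))
    assert not missing, f"objects missing from pack: {missing}"

# Every expected object id appears in the packed list, plus extras:
assert_subset(["obj-a", "obj-c"], ["obj-a", "obj-b", "obj-c"])
```

This keeps the "are we sending enough?" test loose about extra objects while still failing loudly when a required object is dropped.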
Re: [PATCH] revision.c: fix sparse warnings (sparse algorithm)
On 1/13/2019 3:55 PM, Ramsay Jones wrote:

Signed-off-by: Ramsay Jones
---
Hi Derrick,

If you need to re-roll your 'ds/push-sparse-tree-walk' branch, could you please squash this into the relevant patch [commit 9949aaeef4 ("revision: implement sparse algorithm", 2018-12-14)].

This commit caused both 'sparse' and my 'static-check.pl' script to complain about the visibility of the 'map_flags' variable (it is a file-local variable), so one solution would be to mark it 'static'. However, it is simply not being used, so ...

Thanks!

ATB,
Ramsay Jones

Thanks, Ramsay! I'm rerolling the series today, so I will make this change.

-Stolee
Re: [PATCH v4 1/6] revision: add mark_tree_uninteresting_sparse
On 1/11/2019 3:25 PM, Junio C Hamano wrote: Junio C Hamano writes: So, I assumed that the implementation was wrong, but it is the other way around. You do mean to pick only already uninteresting trees out of "set" and mark its reachables. One thing that would make me worried is what help the callers of this function will get (or they may have to devise the way themselves) to avoid having to traverse the same tree a number of times. A tree can be made uninteresting after a traversal of another tree that contains it, but the logic in this function + if (tree->object.flags & UNINTERESTING) { + /* +* Remove the flag so the next call +* is not a no-op. The flag is added +* in mark_tree_unintersting(). +*/ + tree->object.flags ^= UNINTERESTING; + mark_tree_uninteresting(r, tree); + } ignores the fact that it is already UNINTERESTING (in fact, in a sense it is even worse---it cannot be used to make a not-yet UNINTERESTING one into UNINTERESTING), drops the UNINTERESTING bit and forces the traversal of that tree. The only thing I see that would work as a saving grace is that mark_tree_uninteresting() itself would honor an existing UNINTERESTING bit and refuse to recurse into its subtrees, but that means blobs at the top-level of such a tree would be marked UNINTERESTING while those in its subtrees can be left not-UNINTERESTING, which sounds like inviting a mess. It does *not* immediately mean this function is misdesigned. It just means that the caller needs to carefully follow whatever calling convention this series will establish in the later patches (which we haven't seen yet at this point). I'm sorry that this implementation is particularly confusing. It is created only as a placeholder so we can wire up the --sparse option to use this method without being "wrong" but is then removed entirely in PATCH 4/6: ... 
void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set) { + unsigned has_interesting = 0, has_uninteresting = 0; + struct hashmap map; + struct hashmap_iter map_iter; + struct path_and_oids_entry *entry; struct object_id *oid; struct oidset_iter iter; oidset_iter_init(set, &iter); - while ((oid = oidset_iter_next(&iter))) { + while ((!has_interesting || !has_uninteresting) && + (oid = oidset_iter_next(&iter))) { struct tree *tree = lookup_tree(r, oid); if (!tree) continue; - if (tree->object.flags & UNINTERESTING) { - /* -* Remove the flag so the next call -* is not a no-op. The flag is added -* in mark_tree_unintersting(). -*/ - tree->object.flags ^= UNINTERESTING; - mark_tree_uninteresting(r, tree); - } + if (tree->object.flags & UNINTERESTING) + has_uninteresting = 1; + else + has_interesting = 1; + } ... You are definitely correct that "set" is not a valuable variable name. It could instead be "tree_oids" to be slightly more informative. Thanks, -Stolee
[PATCH v3 1/9] repack: refactor pack deletion for future use
From: Derrick Stolee The repack builtin deletes redundant pack-files and their associated .idx, .promisor, .bitmap, and .keep files. We will want to re-use this logic in the future for other types of repack, so pull the logic into 'unlink_pack_path()' in packfile.c. The 'force_delete' parameter is enabled for the use in repack, but will be important for a future caller. Signed-off-by: Derrick Stolee --- builtin/repack.c | 14 ++ packfile.c | 28 packfile.h | 7 +++ 3 files changed, 37 insertions(+), 12 deletions(-) diff --git a/builtin/repack.c b/builtin/repack.c index 45583683ee..3d445b34b4 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -129,19 +129,9 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list, static void remove_redundant_pack(const char *dir_name, const char *base_name) { - const char *exts[] = {".pack", ".idx", ".keep", ".bitmap", ".promisor"}; - int i; struct strbuf buf = STRBUF_INIT; - size_t plen; - - strbuf_addf(&buf, "%s/%s", dir_name, base_name); - plen = buf.len; - - for (i = 0; i < ARRAY_SIZE(exts); i++) { - strbuf_setlen(&buf, plen); - strbuf_addstr(&buf, exts[i]); - unlink(buf.buf); - } + strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name); + unlink_pack_path(buf.buf, 1); strbuf_release(&buf); } diff --git a/packfile.c b/packfile.c index d1e6683ffe..bacecb4d0d 100644 --- a/packfile.c +++ b/packfile.c @@ -352,6 +352,34 @@ void close_all_packs(struct raw_object_store *o) } } +void unlink_pack_path(const char *pack_name, int force_delete) +{ + static const char *exts[] = {".pack", ".idx", ".keep", ".bitmap", ".promisor"}; + int i; + struct strbuf buf = STRBUF_INIT; + size_t plen; + + strbuf_addstr(&buf, pack_name); + strip_suffix_mem(buf.buf, &buf.len, ".pack"); + plen = buf.len; + + if (!force_delete) { + strbuf_addstr(&buf, ".keep"); + if (!access(buf.buf, F_OK)) { + strbuf_release(&buf); + return; + } + } + + for (i = 0; i < ARRAY_SIZE(exts); i++) { + strbuf_setlen(&buf, plen); + strbuf_addstr(&buf, exts[i]); + 
unlink(buf.buf); + } + + strbuf_release(&buf); +} + /* * The LRU pack is the one with the oldest MRU window, preferring packs * with no used windows, or the oldest mtime if it has no windows allocated. diff --git a/packfile.h b/packfile.h index 6c4037605d..5b7bcdb1dd 100644 --- a/packfile.h +++ b/packfile.h @@ -86,6 +86,13 @@ extern void unuse_pack(struct pack_window **); extern void clear_delta_base_cache(void); extern struct packed_git *add_packed_git(const char *path, size_t path_len, int local); +/* + * Unlink the .pack and associated extension files. + * Does not unlink if 'force_delete' is false and the pack-file is + * marked as ".keep". + */ +extern void unlink_pack_path(const char *pack_name, int force_delete); + /* * Make sure that a pointer access into an mmap'd index file is within bounds, * and can provide at least 8 bytes of data. -- gitgitgadget
[PATCH v3 7/9] multi-pack-index: prepare 'repack' subcommand
From: Derrick Stolee In an environment where the multi-pack-index is useful, it is due to many pack-files and an inability to repack the object store into a single pack-file. However, it is likely that many of these pack-files are rather small, and could be repacked into a slightly larger pack-file without too much effort. It may also be important to ensure the object store is highly available and the repack operation does not interrupt concurrent git commands. Introduce a 'repack' subcommand to 'git multi-pack-index' that takes a '--batch-size' option. The verb will inspect the multi-pack-index for referenced pack-files whose size is smaller than the batch size, until collecting a list of pack-files whose sizes sum to larger than the batch size. Then, a new pack-file will be created containing the objects from those pack-files that are referenced by the multi-pack-index. The resulting pack is likely to actually be smaller than the batch size due to compression and the fact that there may be objects in the pack-files that have duplicate copies in other pack-files. The current change introduces the command-line arguments, and we add a test that ensures we parse these options properly. Since we specify a small batch size, we will guarantee that future implementations do not change the list of pack-files. Signed-off-by: Derrick Stolee --- Documentation/git-multi-pack-index.txt | 11 +++ builtin/multi-pack-index.c | 10 +- midx.c | 5 + midx.h | 1 + t/t5319-multi-pack-index.sh| 11 +++ 5 files changed, 37 insertions(+), 1 deletion(-) diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt index 6186c4c936..cc63531cc0 100644 --- a/Documentation/git-multi-pack-index.txt +++ b/Documentation/git-multi-pack-index.txt @@ -36,6 +36,17 @@ expire:: have no objects referenced by the MIDX. Rewrite the MIDX file afterward to remove all references to these pack-files. 
+repack:: + Collect a batch of pack-files whose size are all at most the + size given by --batch-size, but whose sizes sum to larger + than --batch-size. The batch is selected by greedily adding + small pack-files starting with the oldest pack-files that fit + the size. Create a new pack-file containing the objects the + multi-pack-index indexes into those pack-files, and rewrite + the multi-pack-index to contain that pack-file. A later run + of 'git multi-pack-index expire' will delete the pack-files + that were part of this batch. + EXAMPLES diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c index 145de3a46c..d87a2235e3 100644 --- a/builtin/multi-pack-index.c +++ b/builtin/multi-pack-index.c @@ -5,12 +5,13 @@ #include "midx.h" static char const * const builtin_multi_pack_index_usage[] = { - N_("git multi-pack-index [--object-dir=] (write|verify|expire)"), + N_("git multi-pack-index [--object-dir=] (write|verify|expire|repack --batch-size=)"), NULL }; static struct opts_multi_pack_index { const char *object_dir; + unsigned long batch_size; } opts; int cmd_multi_pack_index(int argc, const char **argv, @@ -19,6 +20,8 @@ int cmd_multi_pack_index(int argc, const char **argv, static struct option builtin_multi_pack_index_options[] = { OPT_FILENAME(0, "object-dir", &opts.object_dir, N_("object directory containing set of packfile and pack-index pairs")), + OPT_MAGNITUDE(0, "batch-size", &opts.batch_size, + N_("during repack, collect pack-files of smaller size into a batch that is larger than this size")), OPT_END(), }; @@ -40,6 +43,11 @@ int cmd_multi_pack_index(int argc, const char **argv, return 1; } + if (!strcmp(argv[0], "repack")) + return midx_repack(opts.object_dir, (size_t)opts.batch_size); + if (opts.batch_size) + die(_("--batch-size option is only for 'repack' verb")); + if (!strcmp(argv[0], "write")) return write_midx_file(opts.object_dir); if (!strcmp(argv[0], "verify")) diff --git a/midx.c b/midx.c index 6ccbec3f19..30ff4430ab 100644 --- 
a/midx.c +++ b/midx.c @@ -1106,3 +1106,8 @@ int expire_midx_packs(const char *object_dir) string_list_clear(&packs_to_drop, 0); return result; } + +int midx_repack(const char *object_dir, size_t batch_size) +{ + return 0; +} diff --git a/midx.h b/midx.h index e3a2b740b5..394a21ee96 100644 --- a/midx.h +++ b/midx.h @@ -50,6 +50,7 @@ int write_midx_file(const char *object_dir); void clear_midx_file(struct reposit
[PATCH v3 5/9] midx: refactor permutation logic and pack sorting
From: Derrick Stolee In anticipation of the expire subcommand, refactor the way we sort the packfiles by name. This will greatly simplify our approach to dropping expired packs from the list. First, create 'struct pack_info' to replace 'struct pack_pair'. This struct contains the necessary information about a pack, including its name, a pointer to its packfile struct (if not already in the multi-pack-index), and the original pack-int-id. Second, track the pack information using an array of pack_info structs in the pack_list struct. This simplifies the logic around the multiple arrays we were tracking in that struct. Finally, update get_sorted_entries() to not permute the pack-int-id and instead supply the permutation to write_midx_object_offsets(). This requires sorting the packs after get_sorted_entries(). Signed-off-by: Derrick Stolee --- midx.c | 150 - 1 file changed, 63 insertions(+), 87 deletions(-) diff --git a/midx.c b/midx.c index f087bbbe82..7ae5275c25 100644 --- a/midx.c +++ b/midx.c @@ -377,12 +377,23 @@ static size_t write_midx_header(struct hashfile *f, return MIDX_HEADER_SIZE; } +struct pack_info { + uint32_t orig_pack_int_id; + char *pack_name; + struct packed_git *p; +}; + +static int pack_info_compare(const void *_a, const void *_b) +{ + struct pack_info *a = (struct pack_info *)_a; + struct pack_info *b = (struct pack_info *)_b; + return strcmp(a->pack_name, b->pack_name); +} + struct pack_list { - struct packed_git **list; - char **names; + struct pack_info *info; uint32_t nr; - uint32_t alloc_list; - uint32_t alloc_names; + uint32_t alloc; struct multi_pack_index *m; }; @@ -395,66 +406,32 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len, if (packs->m && midx_contains_pack(packs->m, file_name)) return; - ALLOC_GROW(packs->list, packs->nr + 1, packs->alloc_list); - ALLOC_GROW(packs->names, packs->nr + 1, packs->alloc_names); + ALLOC_GROW(packs->info, packs->nr + 1, packs->alloc); - packs->list[packs->nr] = 
add_packed_git(full_path, - full_path_len, - 0); + packs->info[packs->nr].p = add_packed_git(full_path, + full_path_len, + 0); - if (!packs->list[packs->nr]) { + if (!packs->info[packs->nr].p) { warning(_("failed to add packfile '%s'"), full_path); return; } - if (open_pack_index(packs->list[packs->nr])) { + if (open_pack_index(packs->info[packs->nr].p)) { warning(_("failed to open pack-index '%s'"), full_path); - close_pack(packs->list[packs->nr]); - FREE_AND_NULL(packs->list[packs->nr]); + close_pack(packs->info[packs->nr].p); + FREE_AND_NULL(packs->info[packs->nr].p); return; } - packs->names[packs->nr] = xstrdup(file_name); + packs->info[packs->nr].pack_name = xstrdup(file_name); + packs->info[packs->nr].orig_pack_int_id = packs->nr; packs->nr++; } } -struct pack_pair { - uint32_t pack_int_id; - char *pack_name; -}; - -static int pack_pair_compare(const void *_a, const void *_b) -{ - struct pack_pair *a = (struct pack_pair *)_a; - struct pack_pair *b = (struct pack_pair *)_b; - return strcmp(a->pack_name, b->pack_name); -} - -static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *perm) -{ - uint32_t i; - struct pack_pair *pairs; - - ALLOC_ARRAY(pairs, nr_packs); - - for (i = 0; i < nr_packs; i++) { - pairs[i].pack_int_id = i; - pairs[i].pack_name = pack_names[i]; - } - - QSORT(pairs, nr_packs, pack_pair_compare); - - for (i = 0; i < nr_packs; i++) { - pack_names[i] = pairs[i].pack_name; - perm[pairs[i].pack_int_id] = i; - } - - free(pairs); -} - struct pack_midx_entry { struct object_id oid; uint32_t pack_int_id; @@ -480,7 +457,6 @@ static int midx_oid_compare(const void *_a, const void *_b) } static int nth_midxed_pack_midx_entry(struct multi_pack_index *m, -
[PATCH v3 8/9] midx: implement midx_repack()
From: Derrick Stolee To repack using a multi-pack-index, first sort all pack-files by their modified time. Second, walk those pack-files from oldest to newest, adding the packs to a list if they are smaller than the given pack-size. Finally, collect the objects from the multi-pack-index that are in those packs and send them to 'git pack-objects'. While first designing a 'git multi-pack-index repack' operation, I started by collecting the batches based on the size of the objects instead of the size of the pack-files. This allows repacking a large pack-file that has very few referenced objects. However, this came at a significant cost of parsing pack-files instead of simply reading the multi-pack-index and getting the file information for the pack-files. This object-size idea could be a direction for future expansion in this area. Signed-off-by: Derrick Stolee --- midx.c | 109 +++- t/t5319-multi-pack-index.sh | 25 + 2 files changed, 133 insertions(+), 1 deletion(-) diff --git a/midx.c b/midx.c index 30ff4430ab..f0f16690e6 100644 --- a/midx.c +++ b/midx.c @@ -8,6 +8,7 @@ #include "sha1-lookup.h" #include "midx.h" #include "progress.h" +#include "run-command.h" #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */ #define MIDX_VERSION 1 @@ -1107,7 +1108,113 @@ int expire_midx_packs(const char *object_dir) return result; } -int midx_repack(const char *object_dir, size_t batch_size) +struct time_and_id { + timestamp_t mtime; + uint32_t pack_int_id; +}; + +static int compare_by_mtime(const void *a_, const void *b_) { + const struct time_and_id *a, *b; + + a = (const struct time_and_id *)a_; + b = (const struct time_and_id *)b_; + + if (a->mtime < b->mtime) + return -1; + if (a->mtime > b->mtime) + return 1; return 0; } + +int midx_repack(const char *object_dir, size_t batch_size) +{ + int result = 0; + uint32_t i, packs_to_repack; + size_t total_size; + struct time_and_id *pack_ti; + unsigned char *include_pack; + struct child_process cmd = CHILD_PROCESS_INIT; + struct strbuf 
base_name = STRBUF_INIT; + struct multi_pack_index *m = load_multi_pack_index(object_dir, 1); + + if (!m) + return 0; + + include_pack = xcalloc(m->num_packs, sizeof(unsigned char)); + pack_ti = xcalloc(m->num_packs, sizeof(struct time_and_id)); + + for (i = 0; i < m->num_packs; i++) { + pack_ti[i].pack_int_id = i; + + if (prepare_midx_pack(m, i)) + continue; + + pack_ti[i].mtime = m->packs[i]->mtime; + } + QSORT(pack_ti, m->num_packs, compare_by_mtime); + + total_size = 0; + packs_to_repack = 0; + for (i = 0; total_size < batch_size && i < m->num_packs; i++) { + int pack_int_id = pack_ti[i].pack_int_id; + struct packed_git *p = m->packs[pack_int_id]; + + if (!p) + continue; + if (p->pack_size >= batch_size) + continue; + + packs_to_repack++; + total_size += p->pack_size; + include_pack[pack_int_id] = 1; + } + + if (total_size < batch_size || packs_to_repack < 2) + goto cleanup; + + argv_array_push(&cmd.args, "pack-objects"); + + strbuf_addstr(&base_name, object_dir); + strbuf_addstr(&base_name, "/pack/pack"); + argv_array_push(&cmd.args, base_name.buf); + strbuf_release(&base_name); + + cmd.git_cmd = 1; + cmd.in = cmd.out = -1; + + if (start_command(&cmd)) { + error(_("could not start pack-objects")); + result = 1; + goto cleanup; + } + + for (i = 0; i < m->num_objects; i++) { + struct object_id oid; + uint32_t pack_int_id = nth_midxed_pack_int_id(m, i); + + if (!include_pack[pack_int_id]) + continue; + + nth_midxed_object_oid(&oid, m, i); + xwrite(cmd.in, oid_to_hex(&oid), the_hash_algo->hexsz); + xwrite(cmd.in, "\n", 1); + } + close(cmd.in); + + if (finish_command(&cmd)) { + error(_("could not finish pack-objects")); + result = 1; + goto cleanup; + } + + result = write_midx_internal(object_dir, m, NULL); + m = NULL; + +cleanup: + if (m) + close_midx(m); + free(include_pack); + free(pack_ti); + return result; +} diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh i
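The selection loop in midx_repack() can be illustrated with a standalone sketch. This is not the real code path — the pack names, mtimes, and sizes below are invented — but it follows the same greedy policy: order packs by mtime, never batch a pack that is already at or above the batch size, and stop accumulating once the running total reaches the batch size:

```shell
batch_size=1000

# Mock pack list, one per line: "<mtime> <size> <name>" (all invented).
cat >packs.txt <<EOF
300 700 pack-old
100 400 pack-oldest
500 2000 pack-big
700 500 pack-new
EOF

total=0
sort -n packs.txt |
while read mtime size name
do
	# Stop once the batch has reached the target size.
	if test "$total" -ge "$batch_size"
	then
		break
	fi
	# Packs at or above the batch size are left alone.
	if test "$size" -ge "$batch_size"
	then
		continue
	fi
	total=$(( total + size ))
	echo "include $name"
done >batch.txt

cat batch.txt
```

Here pack-oldest (400) and pack-old (700) are batched, pack-big is skipped for being at least the batch size, and pack-new is never reached. As in the patch, the chosen batch may end up larger than --batch-size.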
[PATCH v3 4/9] midx: simplify computation of pack name lengths
From: Derrick Stolee Before writing the multi-pack-index, we compute the length of the pack-index names concatenated together. This forms the data in the pack name chunk, and we precompute it to compute chunk offsets. The value is also modified to fit alignment needs. Previously, this computation was coupled with adding packs from the existing multi-pack-index and the remaining packs in the object dir not already covered by the multi-pack-index. In anticipation of this becoming more complicated with the 'expire' command, simplify the computation by centralizing it to a single loop before writing the file. Signed-off-by: Derrick Stolee --- midx.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/midx.c b/midx.c index bb825ef816..f087bbbe82 100644 --- a/midx.c +++ b/midx.c @@ -383,7 +383,6 @@ struct pack_list { uint32_t nr; uint32_t alloc_list; uint32_t alloc_names; - size_t pack_name_concat_len; struct multi_pack_index *m; }; @@ -418,7 +417,6 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len, } packs->names[packs->nr] = xstrdup(file_name); - packs->pack_name_concat_len += strlen(file_name) + 1; packs->nr++; } } @@ -762,6 +760,7 @@ int write_midx_file(const char *object_dir) uint32_t nr_entries, num_large_offsets = 0; struct pack_midx_entry *entries = NULL; int large_offsets_needed = 0; + int pack_name_concat_len = 0; midx_name = get_midx_filename(object_dir); if (safe_create_leading_directories(midx_name)) { @@ -777,7 +776,6 @@ int write_midx_file(const char *object_dir) packs.alloc_names = packs.alloc_list; packs.list = NULL; packs.names = NULL; - packs.pack_name_concat_len = 0; ALLOC_ARRAY(packs.list, packs.alloc_list); ALLOC_ARRAY(packs.names, packs.alloc_names); @@ -788,7 +786,6 @@ int write_midx_file(const char *object_dir) packs.list[packs.nr] = NULL; packs.names[packs.nr] = xstrdup(packs.m->pack_names[i]); - packs.pack_name_concat_len += strlen(packs.names[packs.nr]) + 1; packs.nr++; } } @@ -798,10 +795,6 @@ int 
write_midx_file(const char *object_dir) if (packs.m && packs.nr == packs.m->num_packs) goto cleanup; - if (packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT) - packs.pack_name_concat_len += MIDX_CHUNK_ALIGNMENT - - (packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT); - ALLOC_ARRAY(pack_perm, packs.nr); sort_packs_by_name(packs.names, packs.nr, pack_perm); @@ -814,6 +807,13 @@ int write_midx_file(const char *object_dir) large_offsets_needed = 1; } + for (i = 0; i < packs.nr; i++) + pack_name_concat_len += strlen(packs.names[i]) + 1; + + if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT) + pack_name_concat_len += MIDX_CHUNK_ALIGNMENT - + (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT); + hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR); f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf); FREE_AND_NULL(midx_name); @@ -831,7 +831,7 @@ int write_midx_file(const char *object_dir) cur_chunk++; chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDFANOUT; - chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + packs.pack_name_concat_len; + chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len; cur_chunk++; chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP; -- gitgitgadget
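The padding being consolidated in this patch is the usual round-up-to-alignment arithmetic: if the concatenated name length is not already a multiple of MIDX_CHUNK_ALIGNMENT (4 bytes in the midx format), add just enough to reach the next multiple. A quick sketch of the same computation with arbitrary example lengths:

```shell
align=4

for len in 1 21 24
do
	padded=$len
	# Round up to the next multiple of $align, as the midx code does
	# for pack_name_concat_len.
	if test $(( padded % align )) -ne 0
	then
		padded=$(( padded + align - padded % align ))
	fi
	echo "$len -> $padded"
done >padded.txt

cat padded.txt
```

So 1 pads to 4, 21 pads to 24, and an already-aligned 24 is left alone.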
[PATCH v3 3/9] multi-pack-index: prepare for 'expire' subcommand
From: Derrick Stolee The multi-pack-index tracks objects in a collection of pack-files. Only one copy of each object is indexed, using the modified time of the pack-files to determine tie-breakers. It is possible to have a pack-file with no referenced objects because all objects have a duplicate in a newer pack-file. Introduce a new 'expire' subcommand to the multi-pack-index builtin. This subcommand will delete these unused pack-files and rewrite the multi-pack-index to no longer refer to those files. More details about the specifics will follow as the method is implemented. Add a test that verifies the 'expire' subcommand is correctly wired, but will still be valid when the verb is implemented. Specifically, create a set of packs that should all have referenced objects and should not be removed during an 'expire' operation. Signed-off-by: Derrick Stolee --- Documentation/git-multi-pack-index.txt | 5 +++ builtin/multi-pack-index.c | 4 ++- midx.c | 5 +++ midx.h | 1 + t/t5319-multi-pack-index.sh| 47 ++ 5 files changed, 61 insertions(+), 1 deletion(-) diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt index 1af406aca2..6186c4c936 100644 --- a/Documentation/git-multi-pack-index.txt +++ b/Documentation/git-multi-pack-index.txt @@ -31,6 +31,11 @@ write:: verify:: Verify the contents of the MIDX file. +expire:: + Delete the pack-files that are tracked by the MIDX file, but + have no objects referenced by the MIDX. Rewrite the MIDX file + afterward to remove all references to these pack-files. 
+ EXAMPLES diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c index fca70f8e4f..145de3a46c 100644 --- a/builtin/multi-pack-index.c +++ b/builtin/multi-pack-index.c @@ -5,7 +5,7 @@ #include "midx.h" static char const * const builtin_multi_pack_index_usage[] = { - N_("git multi-pack-index [--object-dir=] (write|verify)"), + N_("git multi-pack-index [--object-dir=] (write|verify|expire)"), NULL }; @@ -44,6 +44,8 @@ int cmd_multi_pack_index(int argc, const char **argv, return write_midx_file(opts.object_dir); if (!strcmp(argv[0], "verify")) return verify_midx_file(opts.object_dir); + if (!strcmp(argv[0], "expire")) + return expire_midx_packs(opts.object_dir); die(_("unrecognized verb: %s"), argv[0]); } diff --git a/midx.c b/midx.c index 730ff84dff..bb825ef816 100644 --- a/midx.c +++ b/midx.c @@ -1025,3 +1025,8 @@ int verify_midx_file(const char *object_dir) return verify_midx_error; } + +int expire_midx_packs(const char *object_dir) +{ + return 0; +} diff --git a/midx.h b/midx.h index 774f652530..e3a2b740b5 100644 --- a/midx.h +++ b/midx.h @@ -49,6 +49,7 @@ int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, i int write_midx_file(const char *object_dir); void clear_midx_file(struct repository *r); int verify_midx_file(const char *object_dir); +int expire_midx_packs(const char *object_dir); void close_midx(struct multi_pack_index *m); diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index 70926b5bc0..948effc1ee 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -348,4 +348,51 @@ test_expect_success 'verify incorrect 64-bit offset' ' "incorrect object offset" ' +test_expect_success 'setup expire tests' ' + mkdir dup && + ( + cd dup && + git init && + for i in $(test_seq 1 20) + do + test_commit $i + done && + git branch A HEAD && + git branch B HEAD~8 && + git branch C HEAD~13 && + git branch D HEAD~16 && + git branch E HEAD~18 && + git pack-objects --revs 
.git/objects/pack/pack-E <<-EOF && + refs/heads/E + EOF + git pack-objects --revs .git/objects/pack/pack-D <<-EOF && + refs/heads/D + ^refs/heads/E + EOF + git pack-objects --revs .git/objects/pack/pack-C <<-EOF && + refs/heads/C + ^refs/heads/D + EOF + git pack-objects --revs .git/objects/pack/pack-B <<-EOF && + refs/heads/B + ^refs/heads/C + EOF + git pack-objects --revs .git/objects/pack/pack-A <<-EOF && + refs/heads/A +
[PATCH v3 9/9] multi-pack-index: test expire while adding packs
From: Derrick Stolee During development of the multi-pack-index expire subcommand, a version went out that improperly computed the pack order if a new pack was introduced while other packs were being removed. Part of the subtlety of the bug involved the new pack being placed before other packs that already existed in the multi-pack-index. Add a test to t5319-multi-pack-index.sh that catches this issue. The test adds new packs that cause another pack to be expired, and creates new packs that are lexicographically sorted before and after the existing packs. Signed-off-by: Derrick Stolee --- t/t5319-multi-pack-index.sh | 32 1 file changed, 32 insertions(+) diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index 99afb5ec51..1213549d25 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -449,4 +449,36 @@ test_expect_success 'expire removes repacked packs' ' ) ' +test_expect_success 'expire works when adding new packs' ' + ( + cd dup && + git pack-objects --revs .git/objects/pack/pack-combined <<-EOF && + refs/heads/A + ^refs/heads/B + EOF + git pack-objects --revs .git/objects/pack/pack-combined <<-EOF && + refs/heads/B + ^refs/heads/C + EOF + git pack-objects --revs .git/objects/pack/pack-combined <<-EOF && + refs/heads/C + ^refs/heads/D + EOF + git multi-pack-index write && + git pack-objects --revs .git/objects/pack/a-pack <<-EOF && + refs/heads/D + ^refs/heads/E + EOF + git multi-pack-index write && + git pack-objects --revs .git/objects/pack/z-pack <<-EOF && + refs/heads/E + EOF + git multi-pack-index expire && + ls .git/objects/pack/ | grep idx >expect && + test-tool read-midx .git/objects | grep idx >actual && + test_cmp expect actual && + git multi-pack-index verify + ) +' + test_done -- gitgitgadget
[PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index
The multi-pack-index provides a fast way to find an object among a large list of pack-files. It stores a single pack-reference for each object id, so duplicate objects are ignored. Among a list of pack-files storing the same object, the most-recently modified one is used. Create new subcommands for the multi-pack-index builtin. * 'git multi-pack-index expire': If we have a pack-file indexed by the multi-pack-index, but all objects in that pack are duplicated in more-recently modified packs, then delete that pack (and any others like it). Delete the reference to that pack in the multi-pack-index. * 'git multi-pack-index repack --batch-size=': Starting from the oldest pack-files covered by the multi-pack-index, find those whose on-disk size is below the batch size until we have a collection of packs whose sizes add up to the batch size. Create a new pack containing all objects that the multi-pack-index references to those packs. This allows us to create a new pattern for repacking objects: run 'repack'. After enough time has passed that all Git commands that started before the last 'repack' are finished, run 'expire' again. This approach has some advantages over the existing "repack everything" model: 1. Incremental. We can repack a small batch of objects at a time, instead of repacking all reachable objects. We can also limit ourselves to the objects that do not appear in newer pack-files. 2. Highly Available. By adding a new pack-file (and not deleting the old pack-files) we do not interrupt concurrent Git commands, and do not suffer performance degradation. By expiring only pack-files that have no referenced objects, we know that Git commands that are doing normal object lookups* will not be interrupted. 3. Note: if someone concurrently runs a Git command that uses get_all_packs(), then that command could try to read the pack-files and pack-indexes that we are deleting during an expire command. Such commands are usually related to object maintenance (i.e. 
fsck, gc, pack-objects) or are related to less-often-used features (i.e. fast-import, http-backend, server-info). We plan to use this approach in VFS for Git to do background maintenance of the "shared object cache" which is a Git alternate directory filled with packfiles containing commits and trees. We currently download pack-files on an hourly basis to keep up-to-date with the central server. The cache servers supply packs on an hourly and daily basis, so most of the hourly packs become useless after a new daily pack is downloaded. The 'expire' command would clear out most of those packs, but many will still remain with fewer than 100 objects remaining. The 'repack' command (with a batch size of 1-3gb, probably) can condense the remaining packs in commands that run for 1-3 min at a time. Since the daily packs range from 100-250mb, we will also combine and condense those packs. Updates in V2: * Added a method, unlink_pack_path() to remove packfiles, but with the additional check for a .keep file. This borrows logic from builtin/repack.c. * Modified documentation and commit messages to replace 'verb' with 'subcommand'. Simplified the documentation. (I left 'verbs' in the title of the cover letter for consistency.) Updates in V3: * There was a bug in the expire logic when simultaneously removing packs and adding uncovered packs, specifically around the pack permutation. This was hard to see during review because I was using the 'pack_perm' array for multiple purposes. First, I was reducing its length, and then I was adding to it and resorting. In V3, I significantly overhauled the logic here, which required some extra commits before implementing 'expire'. The final commit includes a test that would cover this case. 
Thanks, -Stolee Derrick Stolee (9): repack: refactor pack deletion for future use Docs: rearrange subcommands for multi-pack-index multi-pack-index: prepare for 'expire' subcommand midx: simplify computation of pack name lengths midx: refactor permutation logic and pack sorting multi-pack-index: implement 'expire' verb multi-pack-index: prepare 'repack' subcommand midx: implement midx_repack() multi-pack-index: test expire while adding packs Documentation/git-multi-pack-index.txt | 26 +- builtin/multi-pack-index.c | 12 +- builtin/repack.c | 14 +- midx.c | 393 ++--- midx.h | 2 + packfile.c | 28 ++ packfile.h | 7 + t/t5319-multi-pack-index.sh| 133 + 8 files changed, 497 ins
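Concretely, the maintenance pattern described in this cover letter is just the two subcommands run on a schedule; the batch size below is an invented example, not a recommendation:

```
# Sketch of the background-maintenance cycle (illustrative values):
git multi-pack-index repack --batch-size=2g

# ... wait until every Git process that started before the repack
# has exited, so nothing still holds the now-duplicated packs open ...

git multi-pack-index expire
```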
[PATCH v3 2/9] Docs: rearrange subcommands for multi-pack-index
From: Derrick Stolee We will add new subcommands to the multi-pack-index, and that will make the documentation a bit messier. Clean up the 'verb' descriptions by renaming the concept to 'subcommand' and removing the reference to the object directory. Helped-by: Stefan Beller Helped-by: Szeder Gábor Signed-off-by: Derrick Stolee --- Documentation/git-multi-pack-index.txt | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt index f7778a2c85..1af406aca2 100644 --- a/Documentation/git-multi-pack-index.txt +++ b/Documentation/git-multi-pack-index.txt @@ -9,7 +9,7 @@ git-multi-pack-index - Write and verify multi-pack-indexes SYNOPSIS [verse] -'git multi-pack-index' [--object-dir=] +'git multi-pack-index' [--object-dir=] DESCRIPTION --- @@ -23,13 +23,13 @@ OPTIONS `/packs/multi-pack-index` for the current MIDX file, and `/packs` for the pack-files to index. +The following subcommands are available: + write:: - When given as the verb, write a new MIDX file to - `/packs/multi-pack-index`. + Write a new MIDX file. verify:: - When given as the verb, verify the contents of the MIDX file - at `/packs/multi-pack-index`. + Verify the contents of the MIDX file. EXAMPLES -- gitgitgadget
[PATCH v3 6/9] multi-pack-index: implement 'expire' verb
From: Derrick Stolee The 'git multi-pack-index expire' command looks at the existing multi-pack-index, counts the number of objects referenced in each pack-file, deletes the pack-files with no referenced objects, and rewrites the multi-pack-index to no longer reference those packs. Refactor the write_midx_file() method to call write_midx_internal() which now takes an existing 'struct multi_pack_index' and a list of pack-files to drop (as specified by the names of their pack-indexes). As we write the new multi-pack-index, we drop those file names from the list of known pack-files. The expire_midx_packs() method removes the unreferenced pack-files after carefully closing the packs to avoid open handles. Test that a new pack-file that covers the contents of two other pack-files leads to those pack-files being deleted during the expire command. Be sure to read the multi-pack-index to ensure it no longer references those packs. Signed-off-by: Derrick Stolee --- midx.c | 120 +--- t/t5319-multi-pack-index.sh | 18 ++ 2 files changed, 128 insertions(+), 10 deletions(-) diff --git a/midx.c b/midx.c index 7ae5275c25..6ccbec3f19 100644 --- a/midx.c +++ b/midx.c @@ -33,6 +33,8 @@ #define MIDX_CHUNK_LARGE_OFFSET_WIDTH (sizeof(uint64_t)) #define MIDX_LARGE_OFFSET_NEEDED 0x8000 +#define PACK_EXPIRED UINT_MAX + static char *get_midx_filename(const char *object_dir) { return xstrfmt("%s/pack/multi-pack-index", object_dir); @@ -381,6 +383,7 @@ struct pack_info { uint32_t orig_pack_int_id; char *pack_name; struct packed_git *p; + unsigned expired : 1; }; static int pack_info_compare(const void *_a, const void *_b) @@ -428,6 +431,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len, packs->info[packs->nr].pack_name = xstrdup(file_name); packs->info[packs->nr].orig_pack_int_id = packs->nr; + packs->info[packs->nr].expired = 0; packs->nr++; } } @@ -587,13 +591,17 @@ static size_t write_midx_pack_names(struct hashfile *f, size_t written = 0; for (i = 0; i < 
num_packs; i++) { - size_t writelen = strlen(info[i].pack_name) + 1; + size_t writelen; + + if (info[i].expired) + continue; if (i && strcmp(info[i].pack_name, info[i - 1].pack_name) <= 0) BUG("incorrect pack-file order: %s before %s", info[i - 1].pack_name, info[i].pack_name); + writelen = strlen(info[i].pack_name) + 1; hashwrite(f, info[i].pack_name, writelen); written += writelen; } @@ -675,6 +683,11 @@ static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_nee for (i = 0; i < nr_objects; i++) { struct pack_midx_entry *obj = list++; + if (perm[obj->pack_int_id] == PACK_EXPIRED) + BUG("object %s is in an expired pack with int-id %d", + oid_to_hex(&obj->oid), + obj->pack_int_id); + hashwrite_be32(f, perm[obj->pack_int_id]); if (large_offset_needed && obj->offset >> 31) @@ -721,7 +734,8 @@ static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_off return written; } -int write_midx_file(const char *object_dir) +static int write_midx_internal(const char *object_dir, struct multi_pack_index *m, + struct string_list *packs_to_drop) { unsigned char cur_chunk, num_chunks = 0; char *midx_name; @@ -737,6 +751,8 @@ int write_midx_file(const char *object_dir) struct pack_midx_entry *entries = NULL; int large_offsets_needed = 0; int pack_name_concat_len = 0; + int dropped_packs = 0; + int result = 0; midx_name = get_midx_filename(object_dir); if (safe_create_leading_directories(midx_name)) { @@ -745,7 +761,10 @@ int write_midx_file(const char *object_dir) midx_name); } - packs.m = load_multi_pack_index(object_dir, 1); + if (m) + packs.m = m; + else + packs.m = load_multi_pack_index(object_dir, 1); packs.nr = 0; packs.alloc = packs.m ? packs.m->num_packs : 16; @@ -759,13 +778,14 @@ int write_midx_file(const char *object_dir) packs.info[packs.nr].orig_pack_int_id = i; packs.info[packs.nr].pack_name = xstrdup(packs.m->pack_names[i]); packs.info[packs.nr].p = NULL; + packs.info[packs.nr].expired = 0;
Re: [PATCH 0/11] jk/loose-object-cache sha1/object_id fixups
On 1/8/2019 1:07 PM, Junio C Hamano wrote: Jeff King writes: Yeah, they should. I think one of them will need René's patch, which changes the body of quick_has_loose(). I can roll it as a separate topic if that's easier (or just wait a week or so until René's cleanups graduate). Nah, what I got is already good to work with. Both series are straight-forward and I do not expect them needing long fermentation. I'm just chiming in to say that this series was a very satisfying read, and the changes were clear-cut and mechanical. Thanks! -Stolee
Git Test Coverage Report (Tues, Jan 8)
g(p, "=on", end) || b755bf6f83 1108) match_placeholder_arg(p, "=true", end)) { protocol.c 6da1f1a920 37) die(_("Unrecognized protocol version")); 6da1f1a920 39) die(_("Unrecognized protocol_version")); 63bb981502 49) enum protocol_version version = parse_protocol_version(git_test_v); 63bb981502 51) if (version == protocol_unknown_version) 63bb981502 52) die("unknown value for %s: %s", git_test_k, 63bb981502 55) return version; read-cache.c 43bf1db73e 1302) return 0; 43bf1db73e 1323) oidcpy(&backup_prev, &istate->cache[pos]->oid); 43bf1db73e 1343) update_backup_log(istate, &backup_prev, ce); 43bf1db73e 3215) strbuf_release(&sb); 43bf1db73e 3216) return -1; refs/files-backend.c c67027c9a9 1892) return; c67027c9a9 1895) return; remote-curl.c 6da1f1a920 344) return 0; 34a9469d6a 373) die("invalid server response; expected service, got flush packet"); 34a9469d6a 397) d->proto_git = 1; revision.c 497f2693ab 149) return; 497f2693ab 152) return; 497f2693ab 175) break; 04dac00473 197) continue; send-pack.c 01f9ec64c8 143) return error(_("unable to parse remote unpack status: %s"), reader->line); 01f9ec64c8 162) error("invalid ref status from remote: %s", reader->line); 01f9ec64c8 579) receive_unpack_status(&reader); strbuf.c bfc3fe33f6 259) die("`pos' is too far after the end of the buffer"); bfc3fe33f6 266) return; /* nothing to do */ bfc3fe33f6 268) die("you want to use way too much memory"); 18f8e81091 448) return 0; submodule.c 26f80ccfc1 1398) strbuf_release(&gitdir); be76c21282 1521) struct fetch_task *task = task_cb; be76c21282 1525) fetch_task_release(task); 898c2e65b7 1806) warning(_("Could not unset core.worktree setting in submodule '%s'"), unpack-trees.c cc14089d7c 206) oidclr(&null_hash); cc14089d7c 207) new_hash = &null_hash; 6f41cc899b 1716) index_path(NULL, old_hash, ce->name, &st, 6f41cc899b 1972) old_hash && !lstat(ce->name, &st)) 6f41cc899b 1973) index_path(NULL, old_hash, ce->name, &st, cc14089d7c 2291) if (verify_absent(ce, 
ERROR_WOULD_LOSE_UNTRACKED_REMOVED, cc14089d7c 2294) make_backup(ce, &old_hash, NULL, o); upload-pack.c 01f9ec64c8 428) die("git upload-pack: expected SHA1 list, got '%s'", reader->line); wrapper.c 5efde212fc 70) die("Out of memory, malloc failed (tried to allocate %" PRIuMAX " bytes)", 5efde212fc 73) error("Out of memory, malloc failed (tried to allocate %" PRIuMAX " bytes)", Commits introducing uncovered code: Ævar Arnfjörð Bjarmason 63bb98150: tests: add a special setup where for protocol.version Anders Waldenborg 18f8e8109: strbuf: separate callback for strbuf_expand:ing literals Anders Waldenborg 4681fe38e: pretty: allow showing specific trailers Anders Waldenborg b755bf6f8: pretty: allow %(trailers) options with explicit value Derrick Stolee 04dac0047: list-objects: consume sparse tree walk Derrick Stolee 0d8e91f58: multi-pack-index: implement 'expire' verb Derrick Stolee 497f2693a: revision: implement sparse algorithm Derrick Stolee 5532d59aa: multi-pack-index: prepare 'repack' subcommand Derrick Stolee 913368875: repack: refactor pack deletion for future use Derrick Stolee c9b358598: midx: implement midx_repack() Jeff King 34a9469d6: remote-curl: refactor smart-http discovery Joel Teichroeb cdca49bc4: stash: convert drop and clear to builtin Joel Teichroeb e1d01876a: stash: convert pop to builtin Joel Teichroeb f6bbd7812: stash: convert apply to builtin Johannes Schindelin 26799a208: stash: optionally use the scripted version again Johannes Schindelin 2ead83aef: rebase: move `reset_head()` into a better spot Johannes Schindelin 3bd5f0710: built-in rebase: call `git am` directly Johannes Schindelin 81ef8ee75: rebase: introduce a shortcut for --reschedule-failed-exec Johannes Schindelin bec65d5b7: tests: add a special setup where stash.useBuiltin is off Johannes Schindelin d421afa0c: rebase: introduce --reschedule-failed-exec Jonathan Tan 4d0feb763: builtin/fetch-pack: support protocol version 2 Josh Steadmon 6da1f1a92: protocol: advertise multiple supported 
versions Junio C Hamano d6af6af1f: Merge branch 'sb/submodule-recursive-fetch-gets-the-tip' into pu Linus Torvalds 080448fbe: Add 'human' date format Martin Koegler 5efde212f: zlib.c: use size_t for size Masaya Suzuki 01f9ec64c: Use packet_reader instead of packet_read_line Matthew DeVore adbdcc076: list-objects-filter: teach tree:# how to handle >0 Matthew DeVore e34ec45cc: tree:: skip some trees even when collecting omits Nguyễn Thái Ngọc Duy 102b7856e: backup-log.c: add API for walking backup log
[PATCH 0/1] git-gc.txt: fix typo about gc.writeCommitGraph
Thanks to Stefan Haller for sending me a private message about this typo. Derrick Stolee (1): git-gc.txt: fix typo about gc.writeCommitGraph Documentation/git-gc.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) base-commit: c7e8ce6d1dd02f6569ea785eebc8692e8e2edf72 Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-105%2Fderrickstolee%2Fgc-doc%2Fupstream-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-105/derrickstolee/gc-doc/upstream-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/105 -- gitgitgadget
[PATCH 1/1] git-gc.txt: fix typo about gc.writeCommitGraph
From: Derrick Stolee Reported-by: Stefan Haller Signed-off-by: Derrick Stolee --- Documentation/git-gc.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt index c20ee6c789..a7442499f6 100644 --- a/Documentation/git-gc.txt +++ b/Documentation/git-gc.txt @@ -137,7 +137,7 @@ The optional configuration variable `gc.packRefs` determines if it within all non-bare repos or it can be set to a boolean value. This defaults to true. -The optional configuration variable `gc.commitGraph` determines if +The optional configuration variable `gc.writeCommitGraph` determines if 'git gc' should run 'git commit-graph write'. This can be set to a boolean value. This defaults to false. -- gitgitgadget
Git Test Coverage Report (Saturday, Dec 29)
3 197) continue; strbuf.c 18f8e81091 397) return 0; submodule.c 26f80ccfc1 1396) strbuf_release(&gitdir); be76c21282 1519) struct fetch_task *task = task_cb; be76c21282 1523) fetch_task_release(task); unpack-trees.c cc14089d7c 206) oidclr(&null_hash); cc14089d7c 207) new_hash = &null_hash; 6f41cc899b 1716) index_path(NULL, old_hash, ce->name, &st, 6f41cc899b 1972) old_hash && !lstat(ce->name, &st)) 6f41cc899b 1973) index_path(NULL, old_hash, ce->name, &st, cc14089d7c 2291) if (verify_absent(ce, ERROR_WOULD_LOSE_UNTRACKED_REMOVED, cc14089d7c 2294) make_backup(ce, &old_hash, NULL, o); wrapper.c 5efde212fc 70) die("Out of memory, malloc failed (tried to allocate %" PRIuMAX " bytes)", 5efde212fc 73) error("Out of memory, malloc failed (tried to allocate %" PRIuMAX " bytes)", Commits introducing uncovered code: Anders Waldenborg 18f8e8109: strbuf: separate callback for strbuf_expand:ing literals Anders Waldenborg 4681fe38e: pretty: allow showing specific trailers Anders Waldenborg b755bf6f8: pretty: allow %(trailers) options with explicit value Derrick Stolee 04dac0047: list-objects: consume sparse tree walk Derrick Stolee 497f2693a: revision: implement sparse algorithm Derrick Stolee 5518ac8ee: multi-pack-index: implement 'expire' verb Derrick Stolee b59c49226: multi-pack-index: implement midx_repack() Derrick Stolee e86a2be88: multi-pack-index: prepare 'repack' verb Johannes Schindelin 81ef8ee75: rebase: introduce a shortcut for --reschedule-failed-exec Johannes Schindelin d421afa0c: rebase: introduce --reschedule-failed-exec Josh Steadmon 24c10f747: protocol: advertise multiple supported versions Linus Torvalds 74e8221b5: Add 'human' date format Martin Koegler 5efde212f: zlib.c: use size_t for size Nguyễn Thái Ngọc Duy 102b7856e: backup-log.c: add API for walking backup log Nguyễn Thái Ngọc Duy 1fcfdf84c: apply: support backup log with --keep-backup Nguyễn Thái Ngọc Duy 43bf1db73: read-cache.c: new flag for add_index_entry() to write to backup log Nguyễn Thái Ngọc 
Duy 45f3e0cd9: backup-log: add diff command Nguyễn Thái Ngọc Duy 6a05b9ab7: backup-log: add cat command Nguyễn Thái Ngọc Duy 6f41cc899: reset --hard: keep backup of overwritten files Nguyễn Thái Ngọc Duy 7f1d166ee: backup-log: add log command Nguyễn Thái Ngọc Duy 937d6bee9: config --edit: support backup log Nguyễn Thái Ngọc Duy b2069b6eb: backup-log: keep all blob references around Nguyễn Thái Ngọc Duy b86e9ac72: gc: prune backup logs Nguyễn Thái Ngọc Duy bde028c66: backup-log: add prune command Nguyễn Thái Ngọc Duy c67027c9a: refs: keep backup of deleted reflog Nguyễn Thái Ngọc Duy cc14089d7: unpack-trees.c: keep backup of ignored files being overwritten Nguyễn Thái Ngọc Duy fdbbdf809: backup-log: add "update" subcommand Stefan Beller 26f80ccfc: submodule: migrate get_next_submodule to use repository structs Stefan Beller be76c2128: fetch: ensure submodule objects fetched Thomas Gummerer 20e835df5: checkout: add --cached option Thomas Gummerer 4ed0c8eb4: checkout: introduce --{,no-}overlay option Uncovered code in 'jch' not in 'next' apply.c 0f086e6dca 3355) if (checkout_entry(ce, &costate, NULL, NULL) || 0f086e6dca 3356) lstat(ce->name, st)) builtin/branch.c 0ecb1fc726 builtin/branch.c 456) die(_("could not resolve HEAD")); 0ecb1fc726 builtin/branch.c 462) die(_("HEAD (%s) points outside of refs/heads/"), refname); builtin/pull.c b19eee9066 647) argv_array_push(&args, opt_cleanup); builtin/remote.c f39a9c6547 builtin/remote.c 1551) die(_("--save-to-push cannot be used with other options")); f39a9c6547 builtin/remote.c 1575) die(_("--save-to-push can only be used when only one url is defined")); commit-graph.c 721351787e 128) return NULL; 721351787e 131) return NULL; 721351787e 187) free(graph); 721351787e 188) return NULL; 721351787e 223) free(graph); 721351787e 224) return NULL; hex.c 47edb64997 93) char *sha1_to_hex_r(char *buffer, const unsigned char *sha1) 47edb64997 95) return hash_to_hex_algop_r(buffer, sha1, &hash_algos[GIT_HASH_SHA1]); 47edb64997 116) 
char *hash_to_hex(const unsigned char *hash) 47edb64997 118) return hash_to_hex_algop(hash, the_hash_algo); pathspec.c 22af33bece 671) name = to_free = xmemdupz(name, namelen); read-cache.c ee70c12820 1730) if (advice_unknown_index_extension) { ee70c12820 1731) warning(_("ignoring optional %.4s index extension"), ext); ee70c12820 1732) advise(_("This is likely due to the file having been written by a newer\n" ec36c42a63 3508) const char *index = NULL; ec36c42a63 3514) if (!offset) ec36c42a63 3515) return NULL; ec
Re: [PATCH 00/23] sb/more-repo-in-api
On 12/14/2018 7:09 PM, Stefan Beller wrote: I realized next has not been rewound, so I can resend sb/more-repo-in-api, which I hereby do. The changes are minimal and address the only comment by Jonathan so far. Sorry I'm very late to look at this, but your series looks good to me. I've got some work on the way that will use these arbitrary repositories. (Specifically, moving 'generation' out of 'struct commit' and into a commit slab to store reachability index values.) Thanks, -Stolee
Re: [PATCH] log: add %S option (like --source) to log --format
On 12/19/2018 3:33 AM, issac.tro...@gmail.com wrote: From: Issac Trotts Make it possible to write for example git log --format="%H,%S" where the %S at the end is a new placeholder that prints out the ref (tag/branch) for each commit. Using %d might seem like an alternative but it only shows the ref for the last commit in the branch. Signed-off-by: Issac Trotts --- This change is based on a question from Stack Overflow: https://stackoverflow.com/questions/12712775/git-get-source-information-in-format --- Documentation/pretty-formats.txt | 2 ++ builtin/log.c| 2 +- log-tree.c | 1 + pretty.c | 15 ++ pretty.h | 1 + t/t4205-log-pretty-formats.sh| 50 6 files changed, 70 insertions(+), 1 deletion(-) diff --git a/Documentation/pretty-formats.txt b/Documentation/pretty-formats.txt index 417b638cd..de6953108 100644 --- a/Documentation/pretty-formats.txt +++ b/Documentation/pretty-formats.txt @@ -134,6 +134,8 @@ The placeholders are: - '%cI': committer date, strict ISO 8601 format - '%d': ref names, like the --decorate option of linkgit:git-log[1] - '%D': ref names without the " (", ")" wrapping. +- '%S': ref name given on the command line by which the commit was reached + (like `git log --source`), only works with `git log` This "only works with `git log`" made me think about what would happen with `git rev-list --pretty=format:"%h %S"` and the answer (on my machine) was a segfault. 
- '%e': encoding - '%s': subject - '%f': sanitized subject line, suitable for a filename diff --git a/builtin/log.c b/builtin/log.c index e8e51068b..be3025657 100644 --- a/builtin/log.c +++ b/builtin/log.c @@ -203,7 +203,7 @@ static void cmd_log_init_finish(int argc, const char **argv, const char *prefix, rev->diffopt.filter || rev->diffopt.flags.follow_renames) rev->always_show_header = 0; - if (source) { + if (source || (rev->pretty_given && (rev->commit_format == CMIT_FMT_USERFORMAT) && w.source)) { init_revision_sources(&revision_sources); rev->sources = &revision_sources; } Likely, you'll want to duplicate this initialization in the revision machinery. Keep this one in builtin/log.c as it was before, but add something like this initialization to setup_revisions(). Add a test for rev-list. [snip] @@ -1194,6 +1195,17 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */ load_ref_decorations(NULL, DECORATE_SHORT_REFS); format_decorations_extended(sb, commit, c->auto_color, "", ", ", ""); return 1; + case 'S': /* tag/branch like --source */ + if (c->pretty_ctx->rev->sources == NULL) { Use "if (!c->pretty_ctx->rev->sources". + return 0; + } + slot = revision_sources_at(c->pretty_ctx->rev->sources, commit); + if (slot && *slot) { I'm not sure this check for 'slot' being non-null is necessary, as we would already get a failure in the commit-slab code (for revision_sources_at()) if the slab is not initialized. + strbuf_addstr(sb, *slot); + return 1; + } else { + die(_("failed to get info for %%S")); Here, you die() when you fail to get a slot but above you return 0 when the sources are not initialized. I don't see another use of die() in this method. Is that the right way to handle failure here? (I'm legitimately asking because I have over-used 'die()' in the past and am still unclear on when it is appropriate.) 
' +test_expect_success 'set up %S tests' ' + git checkout --orphan source-a && + test_commit one && + test_commit two && + git checkout -b source-b HEAD^ && + test_commit three +' + +test_expect_success 'log --format=%S paints branch names' ' + cat >expect <<-\EOF && + source-b + source-a + source-b + EOF + git log --format=%S source-a source-b >actual && + test_cmp expect actual +' + +test_expect_success 'log --format=%S paints tag names' ' + git tag -m tagged source-tag && + cat >expect <<-\EOF && + source-tag + source-a + source-tag + EOF + git log --format=%S source-tag source-a >actual && + test_cmp expect actual +' + +test_expect_success 'log --format=%S paints symmetric ranges' ' + cat >expect <<-\EOF && + source-b + source-a + EOF + git log --format=%S source-a...source-b >actual && + test_cmp expect actual +' + +test_expect_success '%S in git log --format works with other placeholders (part 1)' ' + git log --format="source-b %h" source-b >expect && + git log --format="%S %h" source-b >actual && + test_cmp expect act
Re: [PATCH v1 0/4] HPE NonStop Port Commits
On 12/26/2018 6:05 PM, randall.s.bec...@rogers.com wrote: From: "Randall S. Becker" This set of patches is a distilled version of the minimal set of changes to git that will allow it to run as client and server on HPE NonStop NSE and NSX systems. NSR systems are no longer under support so references to them have been removed. Each patch in this set is independent but required for correctness. Randall S. Becker (4): transport-helper: use xread instead of read config.mak.uname: support for modern HPE NonStop config. git-compat-util.h: add FLOSS headers for HPE NonStop compat/regex/regcomp.c: define intptr_t and uintptr_t on NonStop These patches look correct to me. Just one question on patch 3 (git-compat-util.h: add FLOSS headers for HPE NonStop). I'm not able to comment on patch 2 (config.mak.uname: support for modern HPE NonStop config.), but it looks to be wrapped in a platform-specific 'if', so should be fine if you are happy with it. Thanks, -Stolee
Re: [PATCH v1 3/4] git-compat-util.h: add FLOSS headers for HPE NonStop
On 12/26/2018 6:05 PM, randall.s.bec...@rogers.com wrote: The NSIG define is also not defined on __TANDEM, so we define it here as 100 if it is not defined only for __TANDEM builds. [snip] +#if ! defined NSIG Why didn't you use "#ifndef" here? Taking a look at the file, I see both "#ifdef" and "#if defined" but no "#if ! defined". Thanks, -Stolee
[PATCH v2 2/7] Docs: rearrange subcommands for multi-pack-index
From: Derrick Stolee We will add new subcommands to the multi-pack-index, and that will make the documentation a bit messier. Clean up the 'verb' descriptions by renaming the concept to 'subcommand' and removing the reference to the object directory. Helped-by: Stefan Beller Helped-by: Szeder Gábor Signed-off-by: Derrick Stolee --- Documentation/git-multi-pack-index.txt | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt index f7778a2c85..1af406aca2 100644 --- a/Documentation/git-multi-pack-index.txt +++ b/Documentation/git-multi-pack-index.txt @@ -9,7 +9,7 @@ git-multi-pack-index - Write and verify multi-pack-indexes SYNOPSIS [verse] -'git multi-pack-index' [--object-dir=] +'git multi-pack-index' [--object-dir=] DESCRIPTION --- @@ -23,13 +23,13 @@ OPTIONS `/packs/multi-pack-index` for the current MIDX file, and `/packs` for the pack-files to index. +The following subcommands are available: + write:: - When given as the verb, write a new MIDX file to - `/packs/multi-pack-index`. + Write a new MIDX file. verify:: - When given as the verb, verify the contents of the MIDX file - at `/packs/multi-pack-index`. + Verify the contents of the MIDX file. EXAMPLES -- gitgitgadget
[PATCH v2 0/7] Create 'expire' and 'repack' verbs for git-multi-pack-index
The multi-pack-index provides a fast way to find an object among a large list of pack-files. It stores a single pack-reference for each object id, so duplicate objects are ignored. Among a list of pack-files storing the same object, the most-recently modified one is used. Create new subcommands for the multi-pack-index builtin. * 'git multi-pack-index expire': If we have a pack-file indexed by the multi-pack-index, but all objects in that pack are duplicated in more-recently modified packs, then delete that pack (and any others like it). Delete the reference to that pack in the multi-pack-index. * 'git multi-pack-index repack --batch-size=<size>': Starting from the oldest pack-files covered by the multi-pack-index, find those whose on-disk size is below the batch size until we have a collection of packs whose sizes add up to the batch size. Create a new pack containing all objects that the multi-pack-index references to those packs. This allows us to create a new pattern for repacking objects: run 'repack'. After enough time has passed that all Git commands that started before the last 'repack' are finished, run 'expire' again. This approach has some advantages over the existing "repack everything" model: 1. Incremental. We can repack a small batch of objects at a time, instead of repacking all reachable objects. We can also limit ourselves to the objects that do not appear in newer pack-files. 2. Highly Available. By adding a new pack-file (and not deleting the old pack-files) we do not interrupt concurrent Git commands, and do not suffer performance degradation. By expiring only pack-files that have no referenced objects, we know that Git commands that are doing normal object lookups* will not be interrupted. 3. Note: if someone concurrently runs a Git command that uses get_all_packs(), then that command could try to read the pack-files and pack-indexes that we are deleting during an expire command. Such commands are usually related to object maintenance (i.e. 
fsck, gc, pack-objects) or are related to less-often-used features (i.e. fast-import, http-backend, server-info). We plan to use this approach in VFS for Git to do background maintenance of the "shared object cache" which is a Git alternate directory filled with packfiles containing commits and trees. We currently download pack-files on an hourly basis to keep up-to-date with the central server. The cache servers supply packs on an hourly and daily basis, so most of the hourly packs become useless after a new daily pack is downloaded. The 'expire' command would clear out most of those packs, but many will still remain with fewer than 100 objects remaining. The 'repack' command (with a batch size of 1-3gb, probably) can condense the remaining packs in commands that run for 1-3 min at a time. Since the daily packs range from 100-250mb, we will also combine and condense those packs. Updates in V2: * Added a method, unlink_pack_path() to remove packfiles, but with the additional check for a .keep file. This borrows logic from builtin/repack.c. * Modified documentation and commit messages to replace 'verb' with 'subcommand'. Simplified the documentation. (I left 'verbs' in the title of the cover letter for consistency.) 
Thanks, -Stolee Derrick Stolee (7): repack: refactor pack deletion for future use Docs: rearrange subcommands for multi-pack-index multi-pack-index: prepare for 'expire' subcommand midx: refactor permutation logic multi-pack-index: implement 'expire' verb multi-pack-index: prepare 'repack' subcommand midx: implement midx_repack() Documentation/git-multi-pack-index.txt | 26 ++- builtin/multi-pack-index.c | 12 +- builtin/repack.c | 14 +- midx.c | 217 +++-- midx.h | 2 + packfile.c | 28 packfile.h | 7 + t/t5319-multi-pack-index.sh| 98 +++ 8 files changed, 376 insertions(+), 28 deletions(-) base-commit: 26aa9fc81d4c7f6c3b456a29da0b7ec72e5c6595 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-92%2Fderrickstolee%2Fmidx-expire%2Fupstream-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-92/derrickstolee/midx-expire/upstream-v2 Pull-Request: https://github.com/gitgitgadget/git/pull/92 Range-diff vs v1: -: -- > 1: a697df120c repack: refactor pack deletion for future use -: -- > 2: 55df6b20ff Docs: rearrange subcommands for multi-pack-index 1: 1e34b48a20 ! 3: 2529afe89e multi-pack-index: prepare for 'expire' verb @@ -1,6 +1,6 @@ Autho
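The batch selection the cover letter describes for 'repack --batch-size=<size>' can be sketched in plain shell: walk packs oldest-first, skip any pack that alone meets the batch size, and stop accumulating once the running total reaches it. The sizes below are invented for illustration; the real command reads on-disk pack sizes via the multi-pack-index.

```shell
batch_size=3000
pack_sizes="1000 800 5000 900 2000"   # hypothetical on-disk sizes, oldest pack first

total=0
selected=""
for size in $pack_sizes
do
	# skip packs at or above the batch size; stop adding once the
	# running total has reached the batch size
	if [ "$size" -lt "$batch_size" ] && [ "$total" -lt "$batch_size" ]
	then
		total=$((total + size))
		selected="$selected $size"
	fi
done

echo "would repack packs of sizes:$selected (total $total)"
```

With these numbers the 5000-byte pack is left alone, and the four smaller packs (total 4700) are combined into one new pack; a later 'expire' can then delete the now-unreferenced originals.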
[PATCH v2 5/7] multi-pack-index: implement 'expire' verb
From: Derrick Stolee The 'git multi-pack-index expire' command looks at the existing multi-pack-index, counts the number of objects referenced in each pack-file, deletes the pack-files with no referenced objects, and rewrites the multi-pack-index to no longer reference those packs. Refactor the write_midx_file() method to call write_midx_internal() which now takes an existing 'struct multi_pack_index' and a list of pack-files to drop (as specified by the names of their pack-indexes). As we write the new multi-pack-index, we drop those file names from the list of known pack-files. The expire_midx_packs() method removes the unreferenced pack-files after carefully closing the packs to avoid open handles. Test that a new pack-file that covers the contents of two other pack-files leads to those pack-files being deleted during the expire command. Be sure to read the multi-pack-index to ensure it no longer references those packs. Signed-off-by: Derrick Stolee --- midx.c | 82 +++-- t/t5319-multi-pack-index.sh | 15 +++ 2 files changed, 93 insertions(+), 4 deletions(-) diff --git a/midx.c b/midx.c index 3bd7183a53..043cd1fd97 100644 --- a/midx.c +++ b/midx.c @@ -751,7 +751,8 @@ static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_off return written; } -int write_midx_file(const char *object_dir) +static int write_midx_internal(const char *object_dir, struct multi_pack_index *m, + struct string_list *packs_to_drop) { unsigned char cur_chunk, num_chunks = 0; char *midx_name; @@ -765,6 +766,7 @@ int write_midx_file(const char *object_dir) uint32_t nr_entries, num_large_offsets = 0; struct pack_midx_entry *entries = NULL; int large_offsets_needed = 0; + int result = 0; midx_name = get_midx_filename(object_dir); if (safe_create_leading_directories(midx_name)) { @@ -773,7 +775,10 @@ int write_midx_file(const char *object_dir) midx_name); } - packs.m = load_multi_pack_index(object_dir, 1); + if (m) + packs.m = m; + else + packs.m = 
load_multi_pack_index(object_dir, 1); packs.nr = 0; packs.alloc_list = packs.m ? packs.m->num_packs : 16; @@ -787,7 +792,25 @@ int write_midx_file(const char *object_dir) ALLOC_ARRAY(packs.perm, packs.alloc_perm); if (packs.m) { + int drop_index = 0, missing_drops = 0; for (i = 0; i < packs.m->num_packs; i++) { + if (packs_to_drop && drop_index < packs_to_drop->nr) { + int cmp = strcmp(packs.m->pack_names[i], + packs_to_drop->items[drop_index].string); + + if (!cmp) { + drop_index++; + continue; + } else if (cmp > 0) { + error(_("did not see pack-file %s to drop"), + packs_to_drop->items[drop_index].string); + drop_index++; + i--; + missing_drops++; + continue; + } + } + ALLOC_GROW(packs.list, packs.nr + 1, packs.alloc_list); ALLOC_GROW(packs.names, packs.nr + 1, packs.alloc_names); ALLOC_GROW(packs.perm, packs.nr + 1, packs.alloc_perm); @@ -798,6 +821,12 @@ int write_midx_file(const char *object_dir) packs.pack_name_concat_len += strlen(packs.names[packs.nr]) + 1; packs.nr++; } + + if (packs_to_drop && (drop_index < packs_to_drop->nr || missing_drops)) { + error(_("did not see all pack-files to drop")); + result = 1; + goto cleanup; + } } for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &packs); @@ -932,7 +961,12 @@ int write_midx_file(const char *object_dir) free(packs.perm); free(entries); free(midx_name); - return 0; + return result; +} + +int write_midx_file(const char *object_dir) +{ + return write_midx_internal(object_dir, NULL, NULL); } void clear_midx_file(struct repository *r) @@ -1034,5 +1068,45 @@ int verify_midx_file(const char *object_dir) int expire_midx_packs(const char *object_dir) { - return 0; + uint32_t i, *count, result = 0; + struct string_list packs_to_drop = STRING_LIST_INIT_DUP; + struct multi_pac
[PATCH v2 3/7] multi-pack-index: prepare for 'expire' subcommand
From: Derrick Stolee The multi-pack-index tracks objects in a collection of pack-files. Only one copy of each object is indexed, using the modified time of the pack-files to determine tie-breakers. It is possible to have a pack-file with no referenced objects because all objects have a duplicate in a newer pack-file. Introduce a new 'expire' subcommand to the multi-pack-index builtin. This subcommand will delete these unused pack-files and rewrite the multi-pack-index to no longer refer to those files. More details about the specifics will follow as the method is implemented. Add a test that verifies the 'expire' subcommand is correctly wired, but will still be valid when the verb is implemented. Specifically, create a set of packs that should all have referenced objects and should not be removed during an 'expire' operation. Signed-off-by: Derrick Stolee --- Documentation/git-multi-pack-index.txt | 5 +++ builtin/multi-pack-index.c | 4 ++- midx.c | 5 +++ midx.h | 1 + t/t5319-multi-pack-index.sh| 47 ++ 5 files changed, 61 insertions(+), 1 deletion(-) diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt index 1af406aca2..6186c4c936 100644 --- a/Documentation/git-multi-pack-index.txt +++ b/Documentation/git-multi-pack-index.txt @@ -31,6 +31,11 @@ write:: verify:: Verify the contents of the MIDX file. +expire:: + Delete the pack-files that are tracked by the MIDX file, but + have no objects referenced by the MIDX. Rewrite the MIDX file + afterward to remove all references to these pack-files. 
+ EXAMPLES diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c index fca70f8e4f..145de3a46c 100644 --- a/builtin/multi-pack-index.c +++ b/builtin/multi-pack-index.c @@ -5,7 +5,7 @@ #include "midx.h" static char const * const builtin_multi_pack_index_usage[] = { - N_("git multi-pack-index [--object-dir=<dir>] (write|verify)"), + N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire)"), NULL }; @@ -44,6 +44,8 @@ int cmd_multi_pack_index(int argc, const char **argv, return write_midx_file(opts.object_dir); if (!strcmp(argv[0], "verify")) return verify_midx_file(opts.object_dir); + if (!strcmp(argv[0], "expire")) + return expire_midx_packs(opts.object_dir); die(_("unrecognized verb: %s"), argv[0]); } diff --git a/midx.c b/midx.c index 730ff84dff..bb825ef816 100644 --- a/midx.c +++ b/midx.c @@ -1025,3 +1025,8 @@ int verify_midx_file(const char *object_dir) return verify_midx_error; } + +int expire_midx_packs(const char *object_dir) +{ + return 0; +} diff --git a/midx.h b/midx.h index 774f652530..e3a2b740b5 100644 --- a/midx.h +++ b/midx.h @@ -49,6 +49,7 @@ int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, i int write_midx_file(const char *object_dir); void clear_midx_file(struct repository *r); int verify_midx_file(const char *object_dir); +int expire_midx_packs(const char *object_dir); void close_midx(struct multi_pack_index *m); diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index 70926b5bc0..948effc1ee 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -348,4 +348,51 @@ test_expect_success 'verify incorrect 64-bit offset' ' "incorrect object offset" ' +test_expect_success 'setup expire tests' ' + mkdir dup && + ( + cd dup && + git init && + for i in $(test_seq 1 20) + do + test_commit $i + done && + git branch A HEAD && + git branch B HEAD~8 && + git branch C HEAD~13 && + git branch D HEAD~16 && + git branch E HEAD~18 && + git pack-objects --revs
.git/objects/pack/pack-E <<-EOF && + refs/heads/E + EOF + git pack-objects --revs .git/objects/pack/pack-D <<-EOF && + refs/heads/D + ^refs/heads/E + EOF + git pack-objects --revs .git/objects/pack/pack-C <<-EOF && + refs/heads/C + ^refs/heads/D + EOF + git pack-objects --revs .git/objects/pack/pack-B <<-EOF && + refs/heads/B + ^refs/heads/C + EOF + git pack-objects --revs .git/objects/pack/pack-A <<-EOF && + refs/heads/A +
[PATCH v2 7/7] midx: implement midx_repack()
From: Derrick Stolee To repack using a multi-pack-index, first sort all pack-files by their modified time. Second, walk those pack-files from oldest to newest, adding the packs to a list if they are smaller than the given pack-size. Finally, collect the objects from the multi-pack-index that are in those packs and send them to 'git pack-objects'. While first designing a 'git multi-pack-index repack' operation, I started by collecting the batches based on the size of the objects instead of the size of the pack-files. This allows repacking a large pack-file that has very few referenced objects. However, this came at a significant cost of parsing pack-files instead of simply reading the multi-pack-index and getting the file information for the pack-files. This object-size idea could be a direction for future expansion in this area. Signed-off-by: Derrick Stolee --- midx.c | 109 +++- t/t5319-multi-pack-index.sh | 25 + 2 files changed, 133 insertions(+), 1 deletion(-) diff --git a/midx.c b/midx.c index 127c43f7b0..f6bc111438 100644 --- a/midx.c +++ b/midx.c @@ -8,6 +8,7 @@ #include "sha1-lookup.h" #include "midx.h" #include "progress.h" +#include "run-command.h" #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */ #define MIDX_VERSION 1 @@ -1111,7 +1112,113 @@ int expire_midx_packs(const char *object_dir) return result; } -int midx_repack(const char *object_dir, size_t batch_size) +struct time_and_id { + timestamp_t mtime; + uint32_t pack_int_id; +}; + +static int compare_by_mtime(const void *a_, const void *b_) { + const struct time_and_id *a, *b; + + a = (const struct time_and_id *)a_; + b = (const struct time_and_id *)b_; + + if (a->mtime < b->mtime) + return -1; + if (a->mtime > b->mtime) + return 1; return 0; } + +int midx_repack(const char *object_dir, size_t batch_size) +{ + int result = 0; + uint32_t i, packs_to_repack; + size_t total_size; + struct time_and_id *pack_ti; + unsigned char *include_pack; + struct child_process cmd = CHILD_PROCESS_INIT; + struct strbuf
base_name = STRBUF_INIT; + struct multi_pack_index *m = load_multi_pack_index(object_dir, 1); + + if (!m) + return 0; + + include_pack = xcalloc(m->num_packs, sizeof(unsigned char)); + pack_ti = xcalloc(m->num_packs, sizeof(struct time_and_id)); + + for (i = 0; i < m->num_packs; i++) { + pack_ti[i].pack_int_id = i; + + if (prepare_midx_pack(m, i)) + continue; + + pack_ti[i].mtime = m->packs[i]->mtime; + } + QSORT(pack_ti, m->num_packs, compare_by_mtime); + + total_size = 0; + packs_to_repack = 0; + for (i = 0; total_size < batch_size && i < m->num_packs; i++) { + int pack_int_id = pack_ti[i].pack_int_id; + struct packed_git *p = m->packs[pack_int_id]; + + if (!p) + continue; + if (p->pack_size >= batch_size) + continue; + + packs_to_repack++; + total_size += p->pack_size; + include_pack[pack_int_id] = 1; + } + + if (total_size < batch_size || packs_to_repack < 2) + goto cleanup; + + argv_array_push(&cmd.args, "pack-objects"); + + strbuf_addstr(&base_name, object_dir); + strbuf_addstr(&base_name, "/pack/pack"); + argv_array_push(&cmd.args, base_name.buf); + strbuf_release(&base_name); + + cmd.git_cmd = 1; + cmd.in = cmd.out = -1; + + if (start_command(&cmd)) { + error(_("could not start pack-objects")); + result = 1; + goto cleanup; + } + + for (i = 0; i < m->num_objects; i++) { + struct object_id oid; + uint32_t pack_int_id = nth_midxed_pack_int_id(m, i); + + if (!include_pack[pack_int_id]) + continue; + + nth_midxed_object_oid(&oid, m, i); + xwrite(cmd.in, oid_to_hex(&oid), the_hash_algo->hexsz); + xwrite(cmd.in, "\n", 1); + } + close(cmd.in); + + if (finish_command(&cmd)) { + error(_("could not finish pack-objects")); + result = 1; + goto cleanup; + } + + result = write_midx_internal(object_dir, m, NULL); + m = NULL; + +cleanup: + if (m) + close_midx(m); + free(include_pack); + free(pack_ti); + return result; +} diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh i
[PATCH v2 6/7] multi-pack-index: prepare 'repack' subcommand
From: Derrick Stolee In an environment where the multi-pack-index is useful, it is due to many pack-files and an inability to repack the object store into a single pack-file. However, it is likely that many of these pack-files are rather small, and could be repacked into a slightly larger pack-file without too much effort. It may also be important to ensure the object store is highly available and the repack operation does not interrupt concurrent git commands. Introduce a 'repack' subcommand to 'git multi-pack-index' that takes a '--batch-size' option. The verb will inspect the multi-pack-index for referenced pack-files whose size is smaller than the batch size, until collecting a list of pack-files whose sizes sum to larger than the batch size. Then, a new pack-file will be created containing the objects from those pack-files that are referenced by the multi-pack-index. The resulting pack is likely to actually be smaller than the batch size due to compression and the fact that there may be objects in the pack-files that have duplicate copies in other pack-files. The current change introduces the command-line arguments, and we add a test that ensures we parse these options properly. Since we specify a small batch size, we will guarantee that future implementations do not change the list of pack-files. Signed-off-by: Derrick Stolee --- Documentation/git-multi-pack-index.txt | 11 +++ builtin/multi-pack-index.c | 10 +- midx.c | 5 + midx.h | 1 + t/t5319-multi-pack-index.sh| 11 +++ 5 files changed, 37 insertions(+), 1 deletion(-) diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt index 6186c4c936..cc63531cc0 100644 --- a/Documentation/git-multi-pack-index.txt +++ b/Documentation/git-multi-pack-index.txt @@ -36,6 +36,17 @@ expire:: have no objects referenced by the MIDX. Rewrite the MIDX file afterward to remove all references to these pack-files.
+repack:: + Collect a batch of pack-files whose sizes are each at most the + size given by --batch-size, but whose sizes sum to larger + than --batch-size. The batch is selected by greedily adding + small pack-files starting with the oldest pack-files that fit + the size. Create a new pack-file containing the objects the + multi-pack-index indexes into those pack-files, and rewrite + the multi-pack-index to contain that pack-file. A later run + of 'git multi-pack-index expire' will delete the pack-files + that were part of this batch. + EXAMPLES diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c index 145de3a46c..d87a2235e3 100644 --- a/builtin/multi-pack-index.c +++ b/builtin/multi-pack-index.c @@ -5,12 +5,13 @@ #include "midx.h" static char const * const builtin_multi_pack_index_usage[] = { - N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire)"), + N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire|repack --batch-size=<size>)"), NULL }; static struct opts_multi_pack_index { const char *object_dir; + unsigned long batch_size; } opts; int cmd_multi_pack_index(int argc, const char **argv, @@ -19,6 +20,8 @@ int cmd_multi_pack_index(int argc, const char **argv, static struct option builtin_multi_pack_index_options[] = { OPT_FILENAME(0, "object-dir", &opts.object_dir, N_("object directory containing set of packfile and pack-index pairs")), + OPT_MAGNITUDE(0, "batch-size", &opts.batch_size, + N_("during repack, collect pack-files of smaller size into a batch that is larger than this size")), OPT_END(), }; @@ -40,6 +43,11 @@ int cmd_multi_pack_index(int argc, const char **argv, return 1; } + if (!strcmp(argv[0], "repack")) + return midx_repack(opts.object_dir, (size_t)opts.batch_size); + if (opts.batch_size) + die(_("--batch-size option is only for 'repack' verb")); + if (!strcmp(argv[0], "write")) return write_midx_file(opts.object_dir); if (!strcmp(argv[0], "verify")) diff --git a/midx.c b/midx.c index 043cd1fd97..127c43f7b0 100644 ---
a/midx.c +++ b/midx.c @@ -1110,3 +1110,8 @@ int expire_midx_packs(const char *object_dir) string_list_clear(&packs_to_drop, 0); return result; } + +int midx_repack(const char *object_dir, size_t batch_size) +{ + return 0; +} diff --git a/midx.h b/midx.h index e3a2b740b5..394a21ee96 100644 --- a/midx.h +++ b/midx.h @@ -50,6 +50,7 @@ int write_midx_file(const char *object_dir); void clear_midx_file(struct reposit
[PATCH v2 1/7] repack: refactor pack deletion for future use
From: Derrick Stolee The repack builtin deletes redundant pack-files and their associated .idx, .promisor, .bitmap, and .keep files. We will want to re-use this logic in the future for other types of repack, so pull the logic into 'unlink_pack_path()' in packfile.c. The 'force_delete' parameter is enabled for the repack caller here, but the ability to leave '.keep'-marked packs alone will be important for a future caller. Signed-off-by: Derrick Stolee --- builtin/repack.c | 14 ++ packfile.c | 28 packfile.h | 7 +++ 3 files changed, 37 insertions(+), 12 deletions(-) diff --git a/builtin/repack.c b/builtin/repack.c index 45583683ee..3d445b34b4 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -129,19 +129,9 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list, static void remove_redundant_pack(const char *dir_name, const char *base_name) { - const char *exts[] = {".pack", ".idx", ".keep", ".bitmap", ".promisor"}; - int i; struct strbuf buf = STRBUF_INIT; - size_t plen; - - strbuf_addf(&buf, "%s/%s", dir_name, base_name); - plen = buf.len; - - for (i = 0; i < ARRAY_SIZE(exts); i++) { - strbuf_setlen(&buf, plen); - strbuf_addstr(&buf, exts[i]); - unlink(buf.buf); - } + strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name); + unlink_pack_path(buf.buf, 1); strbuf_release(&buf); } diff --git a/packfile.c b/packfile.c index d1e6683ffe..bacecb4d0d 100644 --- a/packfile.c +++ b/packfile.c @@ -352,6 +352,34 @@ void close_all_packs(struct raw_object_store *o) } } +void unlink_pack_path(const char *pack_name, int force_delete) +{ + static const char *exts[] = {".pack", ".idx", ".keep", ".bitmap", ".promisor"}; + int i; + struct strbuf buf = STRBUF_INIT; + size_t plen; + + strbuf_addstr(&buf, pack_name); + strip_suffix_mem(buf.buf, &buf.len, ".pack"); + plen = buf.len; + + if (!force_delete) { + strbuf_addstr(&buf, ".keep"); + if (!access(buf.buf, F_OK)) { + strbuf_release(&buf); + return; + } + } + + for (i = 0; i < ARRAY_SIZE(exts); i++) { + strbuf_setlen(&buf, plen); + strbuf_addstr(&buf, exts[i]); +
unlink(buf.buf); + } + + strbuf_release(&buf); +} + /* * The LRU pack is the one with the oldest MRU window, preferring packs * with no used windows, or the oldest mtime if it has no windows allocated. diff --git a/packfile.h b/packfile.h index 6c4037605d..5b7bcdb1dd 100644 --- a/packfile.h +++ b/packfile.h @@ -86,6 +86,13 @@ extern void unuse_pack(struct pack_window **); extern void clear_delta_base_cache(void); extern struct packed_git *add_packed_git(const char *path, size_t path_len, int local); +/* + * Unlink the .pack and associated extension files. + * Does not unlink if 'force_delete' is false and the pack-file is + * marked as ".keep". + */ +extern void unlink_pack_path(const char *pack_name, int force_delete); + /* * Make sure that a pointer access into an mmap'd index file is within bounds, * and can provide at least 8 bytes of data. -- gitgitgadget
[PATCH v2 4/7] midx: refactor permutation logic
From: Derrick Stolee When writing a multi-pack-index, we keep track of an integer permutation, tracking the list of pack-files that we know about (both from the existing multi-pack-index and the new pack-files being introduced) and converting them into a sorted order for the new multi-pack-index. In anticipation of dropping pack-files from the existing multi-pack-index, refactor the logic around how we track this permutation. First, insert the permutation into the pack_list structure. This allows us to grow the permutation dynamically as we add packs. Second, fill the permutation with values corresponding to their position in the list of pack-files, sorted as follows: 1. The pack-files in the existing multi-pack-index, sorted lexicographically. 2. The pack-files not in the existing multi-pack-index, sorted as discovered from the filesystem. There is a subtle thing in how we initialize this permutation, specifically how we use 'i' for the initial value. This will matter more when we implement the logic for dropping existing packs, as we will create holes in the ordering.
Signed-off-by: Derrick Stolee --- midx.c | 20 +--- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/midx.c b/midx.c index bb825ef816..3bd7183a53 100644 --- a/midx.c +++ b/midx.c @@ -380,9 +380,11 @@ static size_t write_midx_header(struct hashfile *f, struct pack_list { struct packed_git **list; char **names; + uint32_t *perm; uint32_t nr; uint32_t alloc_list; uint32_t alloc_names; + uint32_t alloc_perm; size_t pack_name_concat_len; struct multi_pack_index *m; }; @@ -398,6 +400,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len, ALLOC_GROW(packs->list, packs->nr + 1, packs->alloc_list); ALLOC_GROW(packs->names, packs->nr + 1, packs->alloc_names); + ALLOC_GROW(packs->perm, packs->nr + 1, packs->alloc_perm); packs->list[packs->nr] = add_packed_git(full_path, full_path_len, @@ -417,6 +420,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len, return; } + packs->perm[packs->nr] = packs->nr; packs->names[packs->nr] = xstrdup(file_name); packs->pack_name_concat_len += strlen(file_name) + 1; packs->nr++; @@ -443,7 +447,7 @@ static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *p ALLOC_ARRAY(pairs, nr_packs); for (i = 0; i < nr_packs; i++) { - pairs[i].pack_int_id = i; + pairs[i].pack_int_id = perm[i]; pairs[i].pack_name = pack_names[i]; } @@ -755,7 +759,6 @@ int write_midx_file(const char *object_dir) struct hashfile *f = NULL; struct lock_file lk; struct pack_list packs; - uint32_t *pack_perm = NULL; uint64_t written = 0; uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1]; uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1]; @@ -774,18 +777,22 @@ int write_midx_file(const char *object_dir) packs.nr = 0; packs.alloc_list = packs.m ? 
packs.m->num_packs : 16; - packs.alloc_names = packs.alloc_list; + packs.alloc_perm = packs.alloc_names = packs.alloc_list; packs.list = NULL; packs.names = NULL; + packs.perm = NULL; packs.pack_name_concat_len = 0; ALLOC_ARRAY(packs.list, packs.alloc_list); ALLOC_ARRAY(packs.names, packs.alloc_names); + ALLOC_ARRAY(packs.perm, packs.alloc_perm); if (packs.m) { for (i = 0; i < packs.m->num_packs; i++) { ALLOC_GROW(packs.list, packs.nr + 1, packs.alloc_list); ALLOC_GROW(packs.names, packs.nr + 1, packs.alloc_names); + ALLOC_GROW(packs.perm, packs.nr + 1, packs.alloc_perm); + packs.perm[packs.nr] = i; packs.list[packs.nr] = NULL; packs.names[packs.nr] = xstrdup(packs.m->pack_names[i]); packs.pack_name_concat_len += strlen(packs.names[packs.nr]) + 1; @@ -802,10 +809,9 @@ int write_midx_file(const char *object_dir) packs.pack_name_concat_len += MIDX_CHUNK_ALIGNMENT - (packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT); - ALLOC_ARRAY(pack_perm, packs.nr); - sort_packs_by_name(packs.names, packs.nr, pack_perm); + sort_packs_by_name(packs.names, packs.nr, packs.perm); - entries = get_sorted_entries(packs.m, packs.list, pack_perm, packs.nr, &nr_entries); + entries = get_sorted_entries(packs.m, packs.list, packs.perm, packs.nr, &nr_entri
[PATCH 0/1] commit-graph: writing missing parents is a BUG
A user complained that they had the following message in a git command: fatal: invalid parent position 2147483647 In hex, this value is 0x7fffffff, corresponding to the GRAPH_PARENT_MISSING constant. This constant was intended as a way to have the commit-graph store commits with parents that are not also in the commit-graph. During development, however, we chose to require the commit-graph to be closed under reachability. Thus, this value should never be written, and we don't fall back to parsing usual commits when we see the constant. This actually happened, and the root cause is unknown. This either means the reachable closure logic is broken somewhere, or something caused the binary search to fail to find the parent in our list of commits. This second problem is more likely, as we have seen RAM issues cause corrupted memory before. I'm still investigating the root cause, but for now we can hit a BUG() statement instead of writing bad data. Thanks, -Stolee Derrick Stolee (1): commit-graph: writing missing parents is a BUG commit-graph.c | 17 +++-- 1 file changed, 11 insertions(+), 6 deletions(-) base-commit: 0d0ac3826a3bbb9247e39e12623bbcfdd722f24c Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-102%2Fderrickstolee%2Fcommit-graph-bug-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-102/derrickstolee/commit-graph-bug-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/102 -- gitgitgadget
[PATCH 1/1] commit-graph: writing missing parents is a BUG
From: Derrick Stolee When writing a commit-graph, we write GRAPH_PARENT_MISSING if the parent's object id does not appear in the list of commits to be written into the commit-graph. This was done as the initial design allowed commits to have missing parents, but the final version requires the commit-graph to be closed under reachability. Thus, this GRAPH_PARENT_MISSING value should never be written. However, there are reasons why it could be written! These range from a bug in the reachable-closure code to a memory error causing the binary search into the list of object ids to fail. In either case, we should fail fast and avoid writing the commit-graph file with bad data. Remove the GRAPH_PARENT_MISSING constant in favor of the constant GRAPH_EDGE_LAST_MASK, which has the same value. Signed-off-by: Derrick Stolee --- commit-graph.c | 17 +++-- 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 40c855f185..c14ada6918 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -34,7 +34,6 @@ #define GRAPH_OID_LEN GRAPH_OID_LEN_SHA1 #define GRAPH_OCTOPUS_EDGES_NEEDED 0x80000000 -#define GRAPH_PARENT_MISSING 0x7fffffff #define GRAPH_EDGE_LAST_MASK 0x7fffffff #define GRAPH_PARENT_NONE 0x70000000 @@ -496,7 +495,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, commit_to_sha1); if (edge_value < 0) - edge_value = GRAPH_PARENT_MISSING; + BUG("missing parent %s for commit %s", + oid_to_hex(&parent->item->object.oid), + oid_to_hex(&(*list)->object.oid)); } hashwrite_be32(f, edge_value); @@ -514,7 +515,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, nr_commits, commit_to_sha1); if (edge_value < 0) - edge_value = GRAPH_PARENT_MISSING; + BUG("missing parent %s for commit %s", + oid_to_hex(&parent->item->object.oid), + oid_to_hex(&(*list)->object.oid)); } hashwrite_be32(f, edge_value); @@ -567,7 +570,9 @@ static void write_graph_chunk_large_edges(struct hashfile *f, commit_to_sha1); if (edge_value < 0) -
edge_value = GRAPH_PARENT_MISSING; + BUG("missing parent %s for commit %s", + oid_to_hex(&parent->item->object.oid), + oid_to_hex(&(*list)->object.oid)); else if (!parent->next) edge_value |= GRAPH_LAST_EDGE; @@ -864,7 +869,7 @@ void write_commit_graph(const char *obj_dir, count_distinct++; } - if (count_distinct >= GRAPH_PARENT_MISSING) + if (count_distinct >= GRAPH_EDGE_LAST_MASK) die(_("the commit graph format cannot write %d commits"), count_distinct); commits.nr = 0; @@ -891,7 +896,7 @@ void write_commit_graph(const char *obj_dir, } num_chunks = num_extra_edges ? 4 : 3; - if (commits.nr >= GRAPH_PARENT_MISSING) + if (commits.nr >= GRAPH_EDGE_LAST_MASK) die(_("too many commits to write graph")); compute_generation_numbers(&commits, report_progress); -- gitgitgadget
Re: commit-graph idea: warn when disabled for incompatible repositories
On 12/18/2018 1:21 PM, Ævar Arnfjörð Bjarmason wrote: diff --git a/builtin/gc.c b/builtin/gc.c index 871a56f1c5..702568b70d 100644 --- a/builtin/gc.c +++ b/builtin/gc.c @@ -662,9 +662,14 @@ int cmd_gc(int argc, const char **argv, const char *prefix) if (pack_garbage.nr > 0) clean_pack_garbage(); - if (gc_write_commit_graph) + if (gc_write_commit_graph) { + int verbose = !quiet && !daemonized; + if (verbose && !commit_graph_compatible(the_repository)) + warning(_("The `gc.writeCommitGraph' setting is on, " + "but commit_graph_compatible() = false")); write_commit_graph_reachable(get_object_directory(), 0, -!quiet && !daemonized); +verbose); + } I actually think this is the wrong place to put it. This will cause a warning for someone with 'gc.writeCommitGraph' enabled and running GC on a shallow clone. I think the issue was someone running 'git commit-graph write' inside a shallow clone that succeeds without doing anything. Also, I bet you would hit this block if you run the test suite with GIT_TEST_COMMIT_GRAPH=1. Thanks, -Stolee if (auto_gc && too_many_loose_objects()) warning(_("There are too many unreachable loose objects; " diff --git a/commit-graph.c b/commit-graph.c index 40c855f185..60915bf9aa 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -61,7 +61,7 @@ static struct commit_graph *alloc_commit_graph(void) extern int read_replace_refs; -static int commit_graph_compatible(struct repository *r) +int commit_graph_compatible(struct repository *r) { if (!r->gitdir) return 0; diff --git a/commit-graph.h b/commit-graph.h index 9db40b4d3a..7c92d41a28 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -12,6 +12,8 @@ struct commit; char *get_commit_graph_filename(const char *obj_dir); +int commit_graph_compatible(struct repository *r); + /* * Given a commit struct, try to fill the commit struct info, including: * 1. tree object This part looks correct, and necessary for the warning in builtin/commit-graph.c Thanks, -Stolee
Re: commit-graph idea: warn when disabled for incompatible repositories
On 12/18/2018 9:22 AM, Thomas Ferris Nicolaisen wrote: Hey there, Hi, Thomas! Accidentally, my first use-case was a local copy of a big repository (chromium source) that another developer had cloned shallow (I wasn't even aware of their clone being shallow at that point). After spending a little time trying to figure out why no commit-graph file was being created, I noticed that it worked just fine testing in a fresh git repo. Then I discovered the .git/shallow file in the big repo. So I did fetch --unshallow, and commit-graph started working. Taking a 20 second log --graph operation down to less than a second (wooo!). I saw some recent release notes that mentions that commit-graph is disabled in incompatible repositories (graft, replace). I assume this also be the case for shallow clones. The commit-graph feature is not designed to work well with these features, and the "easy" fix we have so far is to completely avoid the interaction. The tricky bits come in when the commit parents can "change" according to these other features. The commit-graph would need to change at the same time, and this is actually very difficult to get correct. Here's the idea that may help others on the same path: Some warning output when attempting to use commit-graph when it is disabled (either by configuration or automatically). I think others that come across commit-graph may have tried such tricks (like shallow clones) to better work with their repositories, and it could be frustrating that commit-graph has no apparent effect. This is a good idea, and I would happily review a patch that added such a warning. The warning would want to be in builtin/commit-graph.c, and use the commit_graph_compatible() method from commit-graph.c. (You'll need to expose the method in the .h file.) Thanks! -Stolee
Re: [PATCH v4 4/6] revision: implement sparse algorithm
On 12/17/2018 9:26 AM, Ævar Arnfjörð Bjarmason wrote: On Mon, Dec 17 2018, Derrick Stolee wrote: As for adding progress to this step, I'm open to it. It can be done as a sequel series. Okey. To clarify I wasn't complaining about the lack of progress output, we didn't have it before, just clarifying (as I've found out now) that when you're talking about "enumerating objects" in your commit message it's *not* what we're doing when we show the "Enumerating objects" progress bar, but an unrelated step prior to that. Part of the problem is that in builtin/pack-objects.c, we have the following: if (progress) progress_state = start_progress(_("Enumerating objects"), 0); if (!use_internal_rev_list) read_object_list_from_stdin(); else { get_object_list(rp.argc, rp.argv); argv_array_clear(&rp); } cleanup_preferred_base(); if (include_tag && nr_result) for_each_ref(add_ref_tag, NULL); stop_progress(&progress_state); and the logic for walking uninteresting objects is the mark_edges_uninteresting() inside get_object_list() (both entirely contained in this progress state). Perhaps the best thing to do is to untangle the progress for the two modes based on 'use_internal_rev_list'. Could we then have the progress for get_object_list() be "Walking objects" instead? Thanks, -Stolee
Re: [PATCH v4 4/6] revision: implement sparse algorithm
On 12/14/2018 6:32 PM, Ævar Arnfjörð Bjarmason wrote: On Fri, Dec 14 2018, Derrick Stolee via GitGitGadget wrote: Despite these potential drawbacks, the benefits of the algorithm are clear. By adding a counter to 'add_children_by_path' and 'mark_tree_contents_uninteresting', I measured the number of parsed trees for the two algorithms in a variety of repos. We spend a long time printing those out before we ever get to "Enumerating objects". Which was where I was trying to test this, i.e. is this a lot of work we perform before we print out the progress bar, and regardless of this optimization should have other progress output there, so we can see this time we're spending on this? It is true that part of the problem is that a 'git push' will sit for a while without presenting any feedback until this part of the algorithm is complete. The current series intends to significantly reduce this time. As for adding progress to this step, I'm open to it. It can be done as a sequel series. What would we use to describe this section? "Enumerating remote objects"? Thanks, -Stolee
Re: Git blame performance on files with a lot of history
On 12/14/2018 1:29 PM, Clement Moyroud wrote: My group at work is migrating a CVS repo to Git. The biggest issue we face so far is the performance of git blame, especially compared to CVS on the same file. One file especially causes us trouble: it's a 30k lines file with 25 years of history in 3k+ commits. The complete repo has 200k+ commits over that same period of time. I think the 30k lines is the bigger issue than the 200k+ commits. I'm not terribly familiar with the blame code, though. Currently, 'cvs annotate' takes 2.7 seconds, while 'git blame' (without -M nor -C) takes 145s. I tried using the commit-graph with the Bloom filter, per https://public-inbox.org/git/61559c5b-546e-d61b-d2e1-68de692f5...@gmail.com/. Thanks for the interest in this prototype feature. Sorry that it doesn't appear to help you in this case. It should definitely be a follow-up when that feature gets polished to production-quality. Looking at the blame code, it does not seem to be able to use the commit graph, so I tried the same rev-list command from the e-mail, using my own file: > GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y /path/to/git rev-list --count --full-history HEAD -- important/file.C 3576 Please double-check that you have the 'core.commitGraph' config setting enabled, or you will not read the commit-graph at run-time: git config core.commitGraph true I see that the commit introducing GIT_TRACE_BLOOM_FILTER [1] does nothing if the commit-graph is not loaded. Thanks, -Stolee [1] https://github.com/derrickstolee/git/commit/adc469894b755512c9d02f099700ead2a7a78377
[PATCH v4 1/6] revision: add mark_tree_uninteresting_sparse
From: Derrick Stolee In preparation for a new algorithm that walks fewer trees when creating a pack from a set of revisions, create a method that takes an oidset of tree oids and marks reachable objects as UNINTERESTING. The current implementation uses the existing mark_tree_uninteresting to recursively walk the trees and blobs. This will walk the same number of trees as the old mechanism. There is one new assumption in this approach: we are also given the oids of the interesting trees. This implementation does not use those trees at the moment, but we will use them in a later rewrite of this method. Signed-off-by: Derrick Stolee --- revision.c | 25 + revision.h | 2 ++ 2 files changed, 27 insertions(+) diff --git a/revision.c b/revision.c index 13e0519c02..f9eb6400f1 100644 --- a/revision.c +++ b/revision.c @@ -99,6 +99,31 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree) mark_tree_contents_uninteresting(r, tree); } +void mark_trees_uninteresting_sparse(struct repository *r, +struct oidset *set) +{ + struct object_id *oid; + struct oidset_iter iter; + + oidset_iter_init(set, &iter); + while ((oid = oidset_iter_next(&iter))) { + struct tree *tree = lookup_tree(r, oid); + + if (!tree) + continue; + + if (tree->object.flags & UNINTERESTING) { + /* +* Remove the flag so the next call +* is not a no-op. The flag is added +* in mark_tree_uninteresting().
+*/ + tree->object.flags ^= UNINTERESTING; + mark_tree_uninteresting(r, tree); + } + } +} + struct commit_stack { struct commit **items; size_t nr, alloc; diff --git a/revision.h b/revision.h index 7987bfcd2e..f828e91ae9 100644 --- a/revision.h +++ b/revision.h @@ -67,6 +67,7 @@ struct rev_cmdline_info { #define REVISION_WALK_NO_WALK_SORTED 1 #define REVISION_WALK_NO_WALK_UNSORTED 2 +struct oidset; struct topo_walk_info; struct rev_info { @@ -327,6 +328,7 @@ void put_revision_mark(const struct rev_info *revs, void mark_parents_uninteresting(struct commit *commit); void mark_tree_uninteresting(struct repository *r, struct tree *tree); +void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set); void show_object_with_name(FILE *, struct object *, const char *); -- gitgitgadget
[PATCH v4 5/6] pack-objects: create pack.useSparse setting
From: Derrick Stolee The '--sparse' flag in 'git pack-objects' changes the algorithm used to enumerate objects to one that is faster for individual users pushing new objects that change only a small cone of the working directory. The sparse algorithm is not recommended for a server, which likely sends new objects that appear across the entire working directory. Create a 'pack.useSparse' setting that enables this new algorithm. This allows 'git push' to use this algorithm without passing a '--sparse' flag all the way through four levels of run_command() calls. If the '--no-sparse' flag is set, then this config setting is overridden. Signed-off-by: Derrick Stolee --- Documentation/config/pack.txt | 9 + builtin/pack-objects.c | 4 t/t5322-pack-objects-sparse.sh | 15 +++ 3 files changed, 28 insertions(+) diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt index edac75c83f..425c73aa52 100644 --- a/Documentation/config/pack.txt +++ b/Documentation/config/pack.txt @@ -105,6 +105,15 @@ pack.useBitmaps:: true. You should not generally need to turn this off unless you are debugging pack bitmaps. +pack.useSparse:: + When true, git will default to using the '--sparse' option in + 'git pack-objects' when the '--revs' option is present. This + algorithm only walks trees that appear in paths that introduce new + objects. This can have significant performance benefits when + computing a pack to send a small change. However, it is possible + that extra objects are added to the pack-file if the included + commits contain certain types of direct renames. + pack.writeBitmaps (deprecated):: This is a deprecated synonym for `repack.writeBitmaps`. 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 7d5b0735e3..124b1bafc4 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, void *cb) use_bitmap_index_default = git_config_bool(k, v); return 0; } + if (!strcmp(k, "pack.usesparse")) { + sparse = git_config_bool(k, v); + return 0; + } if (!strcmp(k, "pack.threads")) { delta_search_threads = git_config_int(k, v); if (delta_search_threads < 0) diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh index 45dba6e014..8f5699bd91 100755 --- a/t/t5322-pack-objects-sparse.sh +++ b/t/t5322-pack-objects-sparse.sh @@ -121,4 +121,19 @@ test_expect_success 'sparse pack-objects' ' test_cmp expect_sparse_objects.txt sparse_objects.txt ' +test_expect_success 'pack.useSparse enables algorithm' ' + git config pack.useSparse true && + git pack-objects --stdout --revs sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index sparse_objects.txt && + test_cmp expect_sparse_objects.txt sparse_objects.txt +' + +test_expect_success 'pack.useSparse overridden' ' + git pack-objects --stdout --revs --no-sparse sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index sparse_objects.txt && + test_cmp expect_objects.txt sparse_objects.txt +' + test_done -- gitgitgadget
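The precedence the patch sets up — `pack.useSparse` seeds the default and an explicit `--sparse`/`--no-sparse` wins, because `parse_options()` runs after `git_config()` — can be modeled in a toy function (this is a sketch, not Git's actual option parser):

```python
def resolve_sparse(config_value=None, cli_value=None):
    """cli_value is True for --sparse, False for --no-sparse, None if absent."""
    sparse = False                 # built-in default
    if config_value is not None:
        sparse = config_value      # pack.useSparse, read by git_pack_config()
    if cli_value is not None:
        sparse = cli_value         # the command line runs last, so it wins
    return sparse

print(resolve_sparse(config_value=True))                   # True
print(resolve_sparse(config_value=True, cli_value=False))  # False
```

This mirrors the two new tests: the first relies on the config default, the second shows `--no-sparse` overriding it.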
[PATCH v4 4/6] revision: implement sparse algorithm
From: Derrick Stolee When enumerating objects to place in a pack-file during 'git pack-objects --revs', we discover the "frontier" of commits that we care about and the boundary with commits we find uninteresting. From that point, we walk trees to discover which trees and blobs are uninteresting. Finally, we walk trees from the interesting commits to find the interesting objects that are placed in the pack. This commit introduces a new, "sparse" way to discover the uninteresting trees. We use the perspective of a single user trying to push their topic to a large repository. That user likely changed a very small fraction of the paths in their working directory, but we spend a lot of time walking all reachable trees. The way to switch the logic to work in this sparse way is to start caring about which paths introduce new trees. While it is not possible to generate a diff between the frontier boundary and all of the interesting commits, we can simulate that behavior by inspecting all of the root trees as a whole, then recursing down to the set of trees at each path. We had already taken the first step by passing an oidset to mark_trees_uninteresting_sparse(). We now create a dictionary whose keys are paths and values are oidsets. We consider the set of trees that appear at each path. While we inspect a tree, we add its subtrees to the oidsets corresponding to the tree entry's path. We also mark trees as UNINTERESTING if the tree we are parsing is UNINTERESTING. To actually improve the performance, we need to terminate our recursion. If the oidset contains only UNINTERESTING trees, then we do not continue the recursion. This avoids walking trees that are likely not reachable from interesting trees. If the oidset contains only interesting trees, then we will walk these trees in the final stage that collects the interesting objects to place in the pack. Thus, we only recurse if the oidset contains both interesting and UNINTERESTING trees.
There are a few ways that this is not a universally better option. First, we can pack extra objects. If someone copies a subtree from one tree to another, the first tree will appear UNINTERESTING and we will not recurse to see that the subtree should also be UNINTERESTING. We will walk the new tree and see the subtree as a "new" object and add it to the pack. We add a test case that demonstrates this as a way to prove that the --sparse option is actually working. Second, we can have extra memory pressure. If instead of being a single user pushing a small topic we are a server sending new objects from across the entire working directory, then we will gain very little (the recursion will rarely terminate early) but will spend extra time maintaining the path-oidset dictionaries. Despite these potential drawbacks, the benefits of the algorithm are clear. By adding a counter to 'add_children_by_path' and 'mark_tree_contents_uninteresting', I measured the number of parsed trees for the two algorithms in a variety of repos. For git.git, I used the following input: v2.19.0 ^v2.19.0~10 Objects to pack: 550 Walked (old alg): 282 Walked (new alg): 130 For the Linux repo, I used the following input: v4.18 ^v4.18~10 Objects to pack: 518 Walked (old alg): 4,836 Walked (new alg): 188 The two repos above are rather "wide and flat" compared to other repos that I have used in the past. As a comparison, I tested an old topic branch in the Azure DevOps repo, which has a much deeper folder structure than the Linux repo. Objects to pack: 220 Walked (old alg): 22,804 Walked (new alg): 129 I used the number of walked trees as the main metric above because it is consistent across multiple runs. When I ran my tests, the end-to-end time of the pack-objects command with the same options could vary by a factor of ten depending on whether the file system cache was warm. However, by running the same test repeatedly I could get more consistent timing results.
The git.git and Linux tests were too fast overall (less than 0.5s) to measure an end-to-end difference. The Azure DevOps case was slow enough to see the time improve from 15s to 1s in the warm case. The cold case was 90s to 9s in my testing. These improvements will have even larger benefits in the super- large Windows repository. In our experiments, we see the "Enumerate objects" phase of pack-objects taking 60-80% of the end-to-end time of non-trivial pushes, taking longer than the network time to send the pack and the server time to verify the pack. Signed-off-by: Derrick Stolee --- revision.c | 139 ++--- t/t5322-pack-objects-sparse.sh | 21 +++-- 2 files changed, 144 insertions(+), 16 deletions(-) diff --git a/revision.c b/revision.c index f9eb6400f1..63bf6230dc 100644 --- a/revision.c +++ b/revision.c @@ -27,6 +27,7 @@ #include "commit-reac
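The path-based recursion described in the commit message can be sketched compactly. The model below is a hypothetical stand-in (a `Tree` class instead of Git's `struct tree` and oidsets): it groups the trees appearing at each path, propagates the UNINTERESTING flag while parsing, and recurses only where interesting and UNINTERESTING trees meet — so a subtree shared by both sides is never opened.

```python
class Tree:
    def __init__(self, entries=None):
        self.entries = entries or {}   # entry name -> subtree (blobs omitted)
        self.uninteresting = False

def sparse_walk(tree_set, stats):
    """Walk the set of trees that appear at one path."""
    flags = {t.uninteresting for t in tree_set}
    if len(flags) < 2:
        # All interesting (left for the final object-collection walk)
        # or all UNINTERESTING: terminate the recursion here.
        return
    by_path = {}
    for tree in tree_set:
        stats["parsed"] += 1
        for name, sub in tree.entries.items():
            if tree.uninteresting:
                sub.uninteresting = True   # propagate the flag while parsing
            by_path.setdefault(name, set()).add(sub)
    for subset in by_path.values():
        sparse_walk(subset, stats)

# Two root trees: only f1 changed between them; f2 is the same object.
f2 = Tree({"data": Tree()})
old_root = Tree({"f1": Tree({"a": Tree()}), "f2": f2})
new_root = Tree({"f1": Tree({"b": Tree()}), "f2": f2})
old_root.uninteresting = True    # the boundary commit's root tree

stats = {"parsed": 0}
sparse_walk({old_root, new_root}, stats)
print(stats["parsed"])   # 4: the two roots and the two f1 trees
```

Only four trees are parsed: the shared `f2` subtree sits in an all-UNINTERESTING set at its path, so the recursion stops there — the saving the walked-trees numbers above are measuring.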
[PATCH v4 6/6] pack-objects: create GIT_TEST_PACK_SPARSE
From: Derrick Stolee Create a test variable GIT_TEST_PACK_SPARSE to enable the sparse object walk algorithm by default during the test suite. Enabling this variable ensures coverage in many interesting cases, such as shallow clones, partial clones, and missing objects. Signed-off-by: Derrick Stolee --- builtin/pack-objects.c | 1 + t/README | 4 t/t5322-pack-objects-sparse.sh | 6 +++--- 3 files changed, 8 insertions(+), 3 deletions(-) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 124b1bafc4..507d381153 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -3331,6 +3331,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) read_replace_refs = 0; + sparse = git_env_bool("GIT_TEST_PACK_SPARSE", 0); reset_pack_idx_option(&pack_idx_opts); git_config(git_pack_config, NULL); diff --git a/t/README b/t/README index 28711cc508..8b6dfe1864 100644 --- a/t/README +++ b/t/README @@ -342,6 +342,10 @@ GIT_TEST_INDEX_VERSION= exercises the index read/write code path for the index version specified. Can be set to any valid version (currently 2, 3, or 4). +GIT_TEST_PACK_SPARSE= if enabled will default the pack-objects +builtin to use the sparse object walk. This can still be overridden by +the --no-sparse command-line argument. + GIT_TEST_PRELOAD_INDEX= exercises the preload-index code path by overriding the minimum number of cache entries required per thread. 
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh index 8f5699bd91..e8cf41d1c6 100755 --- a/t/t5322-pack-objects-sparse.sh +++ b/t/t5322-pack-objects-sparse.sh @@ -36,7 +36,7 @@ test_expect_success 'setup repo' ' ' test_expect_success 'non-sparse pack-objects' ' - git pack-objects --stdout --revs nonsparse.pack && + git pack-objects --stdout --revs --no-sparse nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index nonsparse_objects.txt && test_cmp expect_objects.txt nonsparse_objects.txt @@ -70,7 +70,7 @@ test_expect_success 'duplicate a folder from f3 and commit to topic1' ' ' test_expect_success 'non-sparse pack-objects' ' - git pack-objects --stdout --revs nonsparse.pack && + git pack-objects --stdout --revs --no-sparse nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index nonsparse_objects.txt && test_cmp expect_objects.txt nonsparse_objects.txt @@ -102,7 +102,7 @@ test_expect_success 'non-sparse pack-objects' ' topic1 \ topic1^{tree} \ topic1:f3 | sort >expect_objects.txt && - git pack-objects --stdout --revs nonsparse.pack && + git pack-objects --stdout --revs --no-sparse nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index nonsparse_objects.txt && test_cmp expect_objects.txt nonsparse_objects.txt -- gitgitgadget
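`git_env_bool()` simply falls back to a default when the variable is unset; since the patch reads it before `git_config()` and `parse_options()`, the test variable only seeds the default and `--no-sparse` can still override it. A rough Python stand-in (the parsing here is approximate, not Git's exact config-boolean rules):

```python
import os

def git_env_bool(name, default):
    # Rough model of Git's git_env_bool(): unset -> default;
    # otherwise parse the value loosely as a config-style boolean.
    val = os.environ.get(name)
    if val is None:
        return default
    return val.lower() not in ("", "0", "false", "no", "off")

os.environ["GIT_TEST_PACK_SPARSE"] = "1"
print(git_env_bool("GIT_TEST_PACK_SPARSE", False))   # True

del os.environ["GIT_TEST_PACK_SPARSE"]
print(git_env_bool("GIT_TEST_PACK_SPARSE", False))   # False
```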
[PATCH v4 3/6] pack-objects: add --sparse option
From: Derrick Stolee Add a '--sparse' option flag to the pack-objects builtin. This allows the user to specify that they want to use the new logic for walking trees. This logic currently does not differ from the existing output, but will in a later change. Create a new test script, t5322-pack-objects-sparse.sh, to ensure the object list that is selected matches what we expect. When we update the logic to walk in a sparse fashion, the final test will be updated to show the extra objects that are added. Signed-off-by: Derrick Stolee --- Documentation/git-pack-objects.txt | 11 ++- builtin/pack-objects.c | 5 +- t/t5322-pack-objects-sparse.sh | 115 + 3 files changed, 129 insertions(+), 2 deletions(-) create mode 100755 t/t5322-pack-objects-sparse.sh diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt index 40c825c381..e45f3e680d 100644 --- a/Documentation/git-pack-objects.txt +++ b/Documentation/git-pack-objects.txt @@ -14,7 +14,7 @@ SYNOPSIS [--local] [--incremental] [--window=] [--depth=] [--revs [--unpacked | --all]] [--keep-pack=] [--stdout [--filter=] | base-name] - [--shallow] [--keep-true-parents] < object-list + [--shallow] [--keep-true-parents] [--sparse] < object-list DESCRIPTION @@ -196,6 +196,15 @@ depth is 4095. Add --no-reuse-object if you want to force a uniform compression level on all data no matter the source. +--sparse:: + Use the "sparse" algorithm to determine which objects to include in + the pack, when combined with the "--revs" option. This algorithm + only walks trees that appear in paths that introduce new objects. + This can have significant performance benefits when computing + a pack to send a small change. However, it is possible that extra + objects are added to the pack-file if the included commits contain + certain types of direct renames. + --thin:: Create a "thin" pack by omitting the common objects between a sender and a receiver in order to reduce network transfer. 
This diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 5f70d840a7..7d5b0735e3 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -84,6 +84,7 @@ static unsigned long pack_size_limit; static int depth = 50; static int delta_search_threads; static int pack_to_stdout; +static int sparse; static int thin; static int num_preferred_base; static struct progress *progress_state; @@ -3135,7 +3136,7 @@ static void get_object_list(int ac, const char **av) if (prepare_revision_walk(&revs)) die(_("revision walk setup failed")); - mark_edges_uninteresting(&revs, show_edge, 0); + mark_edges_uninteresting(&revs, show_edge, sparse); if (!fn_show_object) fn_show_object = show_object; @@ -3292,6 +3293,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) { OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"), N_("unpack unreachable objects newer than "), PARSE_OPT_OPTARG, option_parse_unpack_unreachable }, + OPT_BOOL(0, "sparse", &sparse, +N_("use the sparse reachability algorithm")), OPT_BOOL(0, "thin", &thin, N_("create thin packs")), OPT_BOOL(0, "shallow", &shallow, diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh new file mode 100755 index 00..81f6805bc3 --- /dev/null +++ b/t/t5322-pack-objects-sparse.sh @@ -0,0 +1,115 @@ +#!/bin/sh + +test_description='pack-objects object selection using sparse algorithm' +. ./test-lib.sh + +test_expect_success 'setup repo' ' + test_commit initial && + for i in $(test_seq 1 3) + do + mkdir f$i && + for j in $(test_seq 1 3) + do + mkdir f$i/f$j && + echo $j >f$i/f$j/data.txt + done + done && + git add . && + git commit -m "Initialized trees" && + for i in $(test_seq 1 3) + do + git checkout -b topic$i master && + echo change-$i >f$i/f$i/data.txt && + git commit -a -m "Changed f$i/f$i/data.txt" + done && + cat >packinput.txt <<-EOF && + topic1 + ^topic2 + ^topic3 + EOF + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f1 \ + topic1
[PATCH v4 2/6] list-objects: consume sparse tree walk
From: Derrick Stolee When creating a pack-file using 'git pack-objects --revs' we provide a list of interesting and uninteresting commits. For example, a push operation would make the local topic branch be interesting and the known remote refs as uninteresting. We want to discover the set of new objects to send to the server as a thin pack. We walk these commits until we discover a frontier of commits such that every commit walk starting at interesting commits ends in a root commit or uninteresting commit. We then need to discover which non-commit objects are reachable from uninteresting commits. This commit walk is not changing during this series. The mark_edges_uninteresting() method in list-objects.c iterates on the commit list and does the following: * If the commit is UNINTERESTING, then mark its root tree and every object it can reach as UNINTERESTING. * If the commit is interesting, then mark the root tree of every UNINTERESTING parent (and all objects that tree can reach) as UNINTERESTING. At the very end, we repeat the process on every commit directly given to the revision walk from stdin. This helps ensure we properly cover shallow commits that otherwise were not included in the frontier. The logic to recursively follow trees is in the mark_tree_uninteresting() method in revision.c. The algorithm avoids duplicate work by not recursing into trees that are already marked UNINTERESTING. Add a new 'sparse' option to the mark_edges_uninteresting() method that performs this logic in a slightly new way. As we iterate over the commits, we add all of the root trees to an oidset. Then, call mark_trees_uninteresting_sparse() on that oidset. Note that we include interesting trees in this process. The current implementation of mark_trees_uninteresting_sparse() will walk the same trees as the old logic, but this will be replaced in a later change. The sparse option is not used by any callers at the moment, but will be wired to 'git pack-objects' in the next change.
Signed-off-by: Derrick Stolee --- bisect.c | 2 +- builtin/pack-objects.c | 2 +- builtin/rev-list.c | 2 +- http-push.c| 2 +- list-objects.c | 70 +++--- list-objects.h | 4 ++- 6 files changed, 66 insertions(+), 16 deletions(-) diff --git a/bisect.c b/bisect.c index 487675c672..842f8b4b8f 100644 --- a/bisect.c +++ b/bisect.c @@ -656,7 +656,7 @@ static void bisect_common(struct rev_info *revs) if (prepare_revision_walk(revs)) die("revision walk setup failed"); if (revs->tree_objects) - mark_edges_uninteresting(revs, NULL); + mark_edges_uninteresting(revs, NULL, 0); } static void exit_if_skipped_commits(struct commit_list *tried, diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 411aefd687..5f70d840a7 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -3135,7 +3135,7 @@ static void get_object_list(int ac, const char **av) if (prepare_revision_walk(&revs)) die(_("revision walk setup failed")); - mark_edges_uninteresting(&revs, show_edge); + mark_edges_uninteresting(&revs, show_edge, 0); if (!fn_show_object) fn_show_object = show_object; diff --git a/builtin/rev-list.c b/builtin/rev-list.c index 2880ed37e3..9663cbfae0 100644 --- a/builtin/rev-list.c +++ b/builtin/rev-list.c @@ -543,7 +543,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix) if (prepare_revision_walk(&revs)) die("revision walk setup failed"); if (revs.tree_objects) - mark_edges_uninteresting(&revs, show_edge); + mark_edges_uninteresting(&revs, show_edge, 0); if (bisect_list) { int reaches, all; diff --git a/http-push.c b/http-push.c index cd48590912..ea52d6f9f6 100644 --- a/http-push.c +++ b/http-push.c @@ -1933,7 +1933,7 @@ int cmd_main(int argc, const char **argv) pushing = 0; if (prepare_revision_walk(&revs)) die("revision walk setup failed"); - mark_edges_uninteresting(&revs, NULL); + mark_edges_uninteresting(&revs, NULL, 0); objects_to_send = get_delta(&revs, ref_lock); finish_all_active_slots(); diff --git a/list-objects.c b/list-objects.c index 
c41cc80db5..fb728f7842 100644 --- a/list-objects.c +++ b/list-objects.c @@ -222,25 +222,73 @@ static void mark_edge_parents_uninteresting(struct commit *commit, } } -void mark_edges_uninteresting(struct rev_info *revs, show_edge_fn show_edge) +static void add_edge_parents(struct commit *commit, +struct rev_info *revs, +show_edge_fn show_edge, +struc
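The difference between the two code paths in mark_edges_uninteresting() can be sketched in miniature (hypothetical names; the real code works on `struct commit_list` and oidsets, and the handling of parent edges and stdin-given commits is omitted here):

```python
def mark_edges_uninteresting(commits, sparse, walk_full, walk_sparse):
    if sparse:
        # New path: one set holding every root tree, interesting or not,
        # handed to mark_trees_uninteresting_sparse() in a single call.
        walk_sparse({c["tree"] for c in commits})
    else:
        # Old path: recurse separately from each UNINTERESTING root tree.
        for c in commits:
            if c["uninteresting"]:
                walk_full(c["tree"])

calls = []
commits = [{"tree": "T1", "uninteresting": True},
           {"tree": "T2", "uninteresting": False}]

mark_edges_uninteresting(commits, sparse=False,
                         walk_full=lambda t: calls.append(("full", t)),
                         walk_sparse=None)
mark_edges_uninteresting(commits, sparse=True,
                         walk_full=None,
                         walk_sparse=lambda ts: calls.append(("sparse", sorted(ts))))
print(calls)   # [('full', 'T1'), ('sparse', ['T1', 'T2'])]
```

Note how the sparse path deliberately includes the interesting tree `T2`: the sparse walk needs both kinds of trees at each path to decide where to stop recursing.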
[PATCH v4 0/6] Add a new "sparse" tree walk algorithm
One of the biggest remaining pain points for users of very large repositories is the time it takes to run 'git push'. We inspected some slow pushes by our developers and found that the "Enumerating Objects" phase of a push was very slow. This is unsurprising, because this is why reachability bitmaps exist. However, reachability bitmaps are not available to us because of the single pack-file requirement. The bitmap approach is intended for servers anyway, and clients have a much different behavior pattern. Specifically, clients are normally pushing a very small number of objects compared to the entire working directory. A typical user changes only a small cone of the working directory, so let's use that to our benefit. Create a new "sparse" mode for 'git pack-objects' that uses the paths that introduce new objects to direct our search into the reachable trees. By collecting trees at each path, we can then recurse into a path only when there are uninteresting and interesting trees at that path. This gains a significant performance boost for small topics while presenting a possibility of packing extra objects. The main algorithm change is in patch 4, but is set up a little bit in patches 1 and 2. As demonstrated in the included test script, we see that the existing algorithm can send extra objects due to the way we specify the "frontier". But we can send even more objects if a user copies objects from one folder to another. I say "copy" because a rename would (usually) change the original folder and trigger a walk into that path, discovering the objects. In order to benefit from this approach, the user can opt-in using the pack.useSparse config setting. This setting can be overridden using the '--no-sparse' option. Update in V2: * Added GIT_TEST_PACK_SPARSE test option. * Fixed test breakages when GIT_TEST_PACK_SPARSE is enabled by adding null checks. Update in V3: * Change documentation around 'pack.useSparse' config setting to better advertise to users. 
* Mostly a ping now that v2.20.0 is out. Updates in V4: * Switched to using hashmap instead of string_list for the path/oidset dictionary. (This is due to some fear that the string_list performance would degrade for a very wide tree, but I am unable to measure a performance difference.) * Some cleanup of code snippets across commits. * Some grammar cleanup in the commit messages. Derrick Stolee (6): revision: add mark_tree_uninteresting_sparse list-objects: consume sparse tree walk pack-objects: add --sparse option revision: implement sparse algorithm pack-objects: create pack.useSparse setting pack-objects: create GIT_TEST_PACK_SPARSE Documentation/config/pack.txt | 9 ++ Documentation/git-pack-objects.txt | 11 ++- bisect.c | 2 +- builtin/pack-objects.c | 10 +- builtin/rev-list.c | 2 +- http-push.c| 2 +- list-objects.c | 70 +++--- list-objects.h | 4 +- revision.c | 144 + revision.h | 2 + t/README | 4 + t/t5322-pack-objects-sparse.sh | 139 12 files changed, 382 insertions(+), 17 deletions(-) create mode 100755 t/t5322-pack-objects-sparse.sh base-commit: a1598010f775d82b5adf12c29d0f5bc9b41434c6 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-89%2Fderrickstolee%2Fpush%2Fsparse-v4 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-89/derrickstolee/push/sparse-v4 Pull-Request: https://github.com/gitgitgadget/git/pull/89 Range-diff vs v3: 1: 60617681f7 ! 1: 817e30a287 revision: add mark_tree_uninteresting_sparse @@ -35,6 +35,9 @@ + while ((oid = oidset_iter_next(&iter))) { + struct tree *tree = lookup_tree(r, oid); + ++ if (!tree) ++ continue; ++ + if (tree->object.flags & UNINTERESTING) { + /* + * Remove the flag so the next call 2: 4527addacb ! 2: 39dc89beb9 list-objects: consume sparse tree walk @@ -11,9 +11,10 @@ We walk these commits until we discover a frontier of commits such that every commit walk starting at interesting commits ends in a root commit or unintersting commit. 
We then need to discover which -non-commit objects are reachable from uninteresting commits. +non-commit objects are reachable from uninteresting commits. This +commit walk is not changing during this series. -The mark_edges_unintersting() method in list-objects.c iterates on +The mark_edges_uninteresting() method in list-objects.c iterates on the commit list and does the following:
Re: [PATCH v2 1/2] .gitattributes: ensure t/oid-info/* has eol=lf
On 12/12/2018 9:38 PM, Junio C Hamano wrote: "Derrick Stolee via GitGitGadget" writes: From: Derrick Stolee The new test_oid machinery in the test library requires reading some information from t/oid-info/hash-info and t/oid-info/oid. The shell logic that reads from these files is sensitive to CRLF line endings, causing a problem when the test suite is run on a Windows machine that converts LF to CRLF. Exclude the files in this folder from this conversion. Signed-off-by: Derrick Stolee --- It seems that this step is identical to v1, to which SZEDER Gábor had trouble with (cf. <20181212133945.gv30...@szeder.dev>). I am guessing that the review and v2 e-mails have crossed. FWIW, I personally do not think "is sensitive to CRLF" is too bad, as my attempt to clarify it does not make it much better, e.g. The logic to read from these files in shell uses built-in "read" command, which leaves CR at the end of these text files when they are checked out with CRLF line endings, at least when run with bash shipped with Git for Windows. This results in an unexpected value in the variable these lines are read into, leading the tests to fail. So, I'll keep what I queued when I received v1 for now. Sorry for missing the edit to the message. You are correct that v2 just added one commit. I like your rewording, if you feel like editing it. Thanks, -Stolee
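The failure mode Junio describes — the shell `read` builtin splitting on LF and leaving the CR attached to the value — is easy to reproduce. A small Python demonstration of the same splitting behavior (Python rather than shell so it is self-contained):

```python
# The data file as checked out with CRLF endings:
line = "rawsz sha256:32\r\n"

# Shell's `read tag rest` splits on LF only, so the CR stays in `rest`.
tag, rest = line.rstrip("\n").split(" ", 1)
v = rest.partition(":")[2]
print(repr(v))       # '32\r'  -- the stray CR that broke test_oid_cache

# With LF-only endings (what the .gitattributes entry enforces):
clean = "rawsz sha256:32\n".rstrip("\n").split(" ", 1)[1].partition(":")[2]
print(repr(clean))   # '32'
```

The `'32\r'` value is exactly what shows up as `$'20\r'`/`$'32\r'` in the verbose trace quoted elsewhere in this thread.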
Re: [PATCH on master v2] revision: use commit graph in get_reference()
On 12/12/2018 8:27 PM, Jeff King wrote: On Wed, Dec 12, 2018 at 11:58:12AM -0800, Jonathan Tan wrote: Yeah, this was the part that took me a bit to figure out, as well. The optimization here is really just avoiding a call to lookup_commit(), which will do a single hash-table lookup. I wonder if that's actually worth this more complex interface (as opposed to just always taking an oid and then always returning a "struct commit", which could be old or new). Avoidance of lookup_commit() is more important than an optimization, I think. Here, we call lookup_commit() only when we know that that object is a commit (by its presence in a commit graph). If we just called it blindly, we might mistakenly create a commit for that hash when it is actually an object of another type. (We could inline lookup_commit() in parse_commit_in_graph_one(), removing the object creation part, but that adds complexity as well.) I was thinking we would only do so in the happy path when we find a commit. I.e., something like: obj = lookup_object(oid); /* does not auto-vivify */ if (obj && obj->parsed) return obj; if (we_have_it_in_commit_graph) { commit = obj || lookup_commit(oid); fill_in_details_from_commit_graph(commit); return &commit->obj; } else { return parse_object(oid); } which is more along the lines of that parse_probably_commit() that Stolee mentioned. This approach is what I had in mind. Thanks for making it more concrete! -Stolee
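The shape of the pseudocode Peff sketches — look up without auto-vivifying, fast-path oids the commit-graph proves are commits, and fall back to a full parse otherwise — can be modeled in a toy form (all names and the dict-based "objects" are hypothetical stand-ins for Git's object hash table):

```python
objects = {}            # oid -> object: the in-memory object hash table
commit_graph = {"c1"}   # oids the commit-graph file proves are commits

def lookup_object(oid):
    # Does NOT auto-vivify: returns None if the oid was never seen.
    return objects.get(oid)

def parse_object(oid):
    obj = {"oid": oid, "parsed": True, "via": "full parse"}
    objects[oid] = obj
    return obj

def get_reference(oid):
    obj = lookup_object(oid)
    if obj and obj["parsed"]:
        return obj
    if oid in commit_graph:
        # Safe to create a commit here: the graph proves the type, so we
        # cannot mistakenly vivify a non-commit object as a commit.
        obj = obj or {"oid": oid, "parsed": False}
        obj.update(parsed=True, via="commit-graph")
        objects[oid] = obj
        return obj
    return parse_object(oid)

print(get_reference("c1")["via"])   # commit-graph
print(get_reference("b1")["via"])   # full parse
```

The key property Jonathan raised is preserved: `lookup_commit()`-style vivification only happens on the happy path, where membership in the commit-graph already guarantees the object is a commit.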
[PATCH v2 0/2] Add more eol=lf to .gitattributes
I noticed that our CI builds (see [1] for an example) were returning success much faster than they did before Git v2.20.0. Turns out that there was a test script failure involving the new test hash logic. error: bug in the test script: bad hash algorithm make[1]: *** [Makefile:56: t-basic.sh] Error 1 make[1]: *** Waiting for unfinished jobs This failure was due to an LF -> CRLF conversion in some data files in t/oid-info/. Don't munge these files, and the tests can continue. UPDATED IN V2: Unfortunately, I didn't check the full test suite after getting beyond this point, and found another LF -> CRLF conversion problem in t4256-am-format-flowed.sh due to example patches in t/t4256/1/. Add these in a second commit. Thanks, dscho, for helping with the correct placement. Thanks, -Stolee [1] https://gvfs.visualstudio.com/ci/_build/results?buildId=4815 Derrick Stolee (1): .gitattributes: ensure t/oid-info/* has eol=lf Johannes Schindelin (1): t4256: mark support files as LF-only .gitattributes | 1 + t/.gitattributes | 1 + 2 files changed, 2 insertions(+) base-commit: 5d826e972970a784bd7a7bdf587512510097b8c7 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-98%2Fderrickstolee%2Ftest-oid-fix-windows-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-98/derrickstolee/test-oid-fix-windows-v2 Pull-Request: https://github.com/gitgitgadget/git/pull/98 Range-diff vs v1: 1: 4a22502a31 = 1: 4a22502a31 .gitattributes: ensure t/oid-info/* has eol=lf -: -- > 2: 4275b8a581 t4256: mark support files as LF-only -- gitgitgadget
[PATCH v2 1/2] .gitattributes: ensure t/oid-info/* has eol=lf
From: Derrick Stolee The new test_oid machinery in the test library requires reading some information from t/oid-info/hash-info and t/oid-info/oid. The shell logic that reads from these files is sensitive to CRLF line endings, causing a problem when the test suite is run on a Windows machine that converts LF to CRLF. Exclude the files in this folder from this conversion. Signed-off-by: Derrick Stolee --- .gitattributes | 1 + 1 file changed, 1 insertion(+) diff --git a/.gitattributes b/.gitattributes index acf853e029..3738cea7eb 100644 --- a/.gitattributes +++ b/.gitattributes @@ -13,3 +13,4 @@ /Documentation/gitk.txt conflict-marker-size=32 /Documentation/user-manual.txt conflict-marker-size=32 /t/t-*.sh conflict-marker-size=32 +/t/oid-info/* eol=lf -- gitgitgadget
[PATCH 0/1] .gitattributes: ensure t/oid-info/* has eol=lf
I noticed that our CI builds (see [1] for an example) were returning success much faster than they did before Git v2.20.0. Turns out that there was a test script failure involving the new test hash logic. error: bug in the test script: bad hash algorithm make[1]: *** [Makefile:56: t-basic.sh] Error 1 make[1]: *** Waiting for unfinished jobs The reason this passed was because we were running this command in our build: make GIT_TEST_OPTS=--quiet -j 10 -C t || make GIT_TEST_OPTS=\"-i -v -x\" -k -C t failed The first make failed, but the test script did not record any failed tests and hence the second command succeeded. The test bug relates to this function added by 2c02b110d "t: add test functions to translate hash-related values": +# Load key-value pairs from stdin suitable for use with test_oid. Blank lines +# and lines starting with "#" are ignored. Keys must be shell identifier +# characters. +# +# Examples: +# rawsz sha1:20 +# rawsz sha256:32 +test_oid_cache () { + local tag rest k v && + + { test -n "$test_hash_algo" || test_detect_hash; } && + while read tag rest + do + case $tag in + \#*) + continue;; + ?*) + # non-empty + ;; + *) + # blank line + continue;; + esac && + + k="${rest%:*}" && + v="${rest#*:}" && + + if ! expr "$k" : '[a-z0-9][a-z0-9]*$' >/dev/null + then + error 'bug in the test script: bad hash algorithm' + fi && + eval "test_oid_${k}_$tag=\"\$v\"" + done +} The verbose output included these values. Note how '\r' appears in the variable values. 
++ test_oid_init ++ test -n '' ++ test_detect_hash ++ test_hash_algo=sha1 ++ test_oid_cache ++ local tag rest k v ++ test -n sha1 ++ read tag rest ++ case $tag in ++ k=sha1 ++ v=$'20\r' ++ expr sha1 : '[a-z0-9][a-z0-9]*$' ++ eval 'test_oid_sha1_rawsz="$v"' +++ test_oid_sha1_rawsz=$'20\r' ++ read tag rest ++ case $tag in ++ k=sha256 ++ v=$'32\r' ++ expr sha256 : '[a-z0-9][a-z0-9]*$' ++ eval 'test_oid_sha256_rawsz="$v"' +++ test_oid_sha256_rawsz=$'32\r' ++ read tag rest ++ case $tag in ++ k= ++ v= ++ expr '' : '[a-z0-9][a-z0-9]*$' [1] https://gvfs.visualstudio.com/ci/_build/results?buildId=4815 Derrick Stolee (1): .gitattributes: ensure t/oid-info/* has eol=lf .gitattributes | 1 + 1 file changed, 1 insertion(+) base-commit: 5d826e972970a784bd7a7bdf587512510097b8c7 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-98%2Fderrickstolee%2Ftest-oid-fix-windows-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-98/derrickstolee/test-oid-fix-windows-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/98 -- gitgitgadget
[PATCH 1/1] .gitattributes: ensure t/oid-info/* has eol=lf
From: Derrick Stolee The new test_oid machinery in the test library requires reading some information from t/oid-info/hash-info and t/oid-info/oid. The shell logic that reads from these files is sensitive to CRLF line endings, causing a problem when the test suite is run on a Windows machine that converts LF to CRLF. Exclude the files in this folder from this conversion. Signed-off-by: Derrick Stolee --- .gitattributes | 1 + 1 file changed, 1 insertion(+) diff --git a/.gitattributes b/.gitattributes index acf853e029..3738cea7eb 100644 --- a/.gitattributes +++ b/.gitattributes @@ -13,3 +13,4 @@ /Documentation/gitk.txt conflict-marker-size=32 /Documentation/user-manual.txt conflict-marker-size=32 /t/t-*.sh conflict-marker-size=32 +/t/oid-info/* eol=lf -- gitgitgadget
Re: [PATCH 5/5] midx: implement midx_repack()
On 12/10/2018 9:32 PM, Stefan Beller wrote: On Mon, Dec 10, 2018 at 10:06 AM Derrick Stolee via GitGitGadget wrote: From: Derrick Stolee To repack using a multi-pack-index, first sort all pack-files by their modified time. Second, walk those pack-files from oldest to newest, adding the packs to a list if they are smaller than the given pack-size. Finally, collect the objects from the multi-pack-index that are in those packs and send them to 'git pack-objects'. Makes sense. With this operation we only coalesce some packfiles into a new pack file. So to perform the "complete" repack this command has to be run repeatedly until there is at most one packfile left that is smaller than batch size. Well, the batch size essentially means "If a pack-file is larger than , then leave it be. I'm happy with packs that large." This assumes that the reason the pack is that large is because it was already combined with other packs or contains a lot of objects. Imagine the following scenario: There are 5 packfiles A, B, C, D, E, created last Monday through Friday (A is oldest, E youngest). The sizes are [A=4, B=6, C=5, D=5, E=4] You'd issue a repack with batch size=10, such that A and B would be repacked into F, which is created today, size is less than or equal to 10. You issue another repack tomorrow, which then would coalesce C and D to G, which is dated tomorrow, size is less than or equal to 10 as well. You issue a third repack, which then takes E (as it is the oldest) and would probably find F as the next oldest (assuming it is less than 10), to repack into H. H is then composed of A, B and E, and G is C+D. In a way these repacks, always picking up the oldest, sound like you "roll forward" objects into new packs. As the new packs are newest (we have no packs from the future), we'd cycle through different packs to look at for packing on each repacking. It is however more likely that content is more similar on a temporal basis. (e.g.
I am boldly claiming that [ABC, DE] would take less space than [ABE, CD] as produced above). (The obvious solution to this hypothetical would be to backdate the resulting pack to the youngest pack that is input to the new pack, but I dislike fudging with the time a file is created/touched, so let's not go there)

This raises a good point about what happens when we "roll over" into the "repacked" packs. I'm not claiming that this is an optimal way to save space, but it is a way to incrementally collect small packs into slightly larger packs, all while not interrupting concurrent Git commands. Reducing pack count improves data locality, which is my goal here. In our environment, we do see reduced space as a benefit, even if it is not optimal.

Would the object count make sense as input instead of the pack date?

While first designing a 'git multi-pack-index repack' operation, I started by collecting the batches based on the size of the objects instead of the size of the pack-files. This allows repacking a large pack-file that has very few referenced objects. However, this came at a significant cost of parsing pack-files instead of simply reading the multi-pack-index and getting the file information for the pack-files. This object-size idea could be a direction for future expansion in this area.

Ah, that also explains why the above idea is toast. Would it make sense to extend or annotate the midx file to give hints at which packs are easy to combine? I guess such an "annotation worker" could run in a separate thread / pool with the lowest priority as this seems like a decent fallback for the lack of any better information how to pick the packfiles.

One idea I had earlier (and which is in Documentation/technical/multi-pack-index.txt) is to have the midx track metadata about pack-files. We could avoid this "rollover" problem by tracking which packs were repacked using this mechanism.
This could create a "pack generation" value, and we could collect a batch of packs that have the same generation. This does seem a bit overcomplicated for the potential benefit, and could crowd out a better use of that metadata concept. For instance, we could use the metadata to track the information given by ".keep" and ".promisor" files. Thanks, -Stolee
Re: [PATCH 4/5] multi-pack-index: prepare 'repack' verb
On 12/10/2018 8:54 PM, Stefan Beller wrote: On Mon, Dec 10, 2018 at 10:06 AM Derrick Stolee via GitGitGadget wrote: From: Derrick Stolee In an environment where the multi-pack-index is useful, it is due to many pack-files and an inability to repack the object store into a single pack-file. However, it is likely that many of these pack-files are rather small, and could be repacked into a slightly larger pack-file without too much effort. It may also be important to ensure the object store is highly available and the repack operation does not interrupt concurrent git commands. Introduce a 'repack' verb to 'git multi-pack-index' that takes a '--batch-size' option. The verb will inspect the multi-pack-index for referenced pack-files whose size is smaller than the batch size, until collecting a list of pack-files whose sizes sum to larger than the batch size. Then, a new pack-file will be created containing the objects from those pack-files that are referenced by the multi-pack-index. The resulting pack is likely to actually be smaller than the batch size due to compression and the fact that there may be objects in the pack-files that have duplicate copies in other pack-files. This highlights an optimization problem: How do we pick the batches optimally? Ideally we'd pick packs that have an overlap of many same objects to dedup them completely, next best would be to have objects that are very similar, such that they delta very well. (Assuming that the sum of the resulting pack sizes is a metric we'd want to optimize for eventually) For now it seems we just take a random cut of "small" packs. The current change introduces the command-line arguments, and we add a test that ensures we parse these options properly. Since we specify a small batch size, we will guarantee that future implementations do not change the list of pack-files. This is another clever trick that makes the test correct despite no implementation yet. 
:-) +repack:: + When given as the verb, collect a batch of pack-files whose + sizes are all at most the size given by --batch-size, okay. but + whose sizes sum to larger than --batch-size. or less if there are not enough packs. If there are not enough packs to reach the batch size, then we do nothing. Now it would be interesting if we can specify an upper bound (e.g. my laptop has 8GB of RAM, can I use this incremental repacking optimally by telling git to make batches of at most 7.5G), the answer seems to follow... Well, this gets us to the "unsigned long" problem and how it pervades the OPT_MAGNITUDE() code and things that use that kind of parameter. This means that Windows users can specify a maximum of (1 << 32) - 1. This size is intended to be large enough to make a reasonable change in the pack organization without rewriting the entire repo (e.g. 1GB). If there is a word other than "repack" that means "collect objects from a subset of pack-files and place them into a new pack-file, doing a small amount of redeltification in the process" then I am open to suggestions. I tried (briefly) to fix this dependence, but it got too large for me to handle at the same time as this change. I'll consider revisiting it again later. The batch is + selected by greedily adding small pack-files starting with + the oldest pack-files that fit the size. Create a new pack- + file containing the objects the multi-pack-index indexes + into those pack-files, and rewrite the multi-pack-index to + contain that pack-file. A later run of 'git multi-pack-index + expire' will delete the pack-files that were part of this + batch. ... but the optimization seems to be rather about getting rid of the oldest packs first instead of getting as close to the batch size. (e.g. another way to look at this is to "find the permutation of all packs that (each are smaller than batch size), but in sum are the smallest threshold above the batch size).
You are describing the subset-sum problem, with an additional optimization component. While there are dynamic programming approaches that are usually effective (if the sum is small), this problem is NP-complete, and hence could lead to complications. I guess that the strategy of picking the oldest is just easiest to implement and should be sufficient for now, but memory bounds might be interesting to keep in mind, just as the optimal packing from above. It is easy to implement, and fast. Further, we do have a heuristic that the pack modified time correlates with the time the objects were introduced to the repository and hence may compress well when placed together. Thanks, -Stolee
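For intuition, the oldest-first greedy heuristic can be modeled in a few lines of shell (a toy sketch with made-up pack sizes and mtimes, not git's implementation):

```shell
# Walk packs from oldest to newest; skip any pack already at least
# batch_size on its own; stop once the accumulated size reaches
# batch_size. Input columns: mtime size name.
batch_size=10
total=0
batch=
while read mtime size name
do
	if [ "$size" -ge "$batch_size" ]; then continue; fi
	if [ "$total" -ge "$batch_size" ]; then break; fi
	total=$((total + size))
	batch="$batch$name "
done <<EOF
1 4 A
2 6 B
3 5 C
4 12 big
EOF
echo "selected: $batch(total $total)"
# prints: selected: A B (total 10)
```

The loop in the actual patch uses the same two guards: a pack with pack_size >= batch_size is skipped, and the walk ends once total_size reaches batch_size.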
Re: [PATCH 1/5] multi-pack-index: prepare for 'expire' verb
On 12/10/2018 8:59 PM, SZEDER Gábor wrote: On Mon, Dec 10, 2018 at 05:35:28PM -0800, Stefan Beller wrote: On Mon, Dec 10, 2018 at 10:06 AM Derrick Stolee via GitGitGadget wrote: +expire:: + When given as the verb, Can it be given in another way? Or rather "if the verb is expire", then ... (I just checked the current man page, and both write and verify use this pattern as well. I find it strange as this first part of the sentence conveys little information, but is repeated 3 times now (once for each verb)). Maybe we can restructure the man page to have it more like The following verbs are available: +write:: +create a new MIDX file, writing to /packs/multi-pack-index. + +verify:: +verify the contents ... I think a s/verb/subcommand/ would help a lot, too, because that's what we call it everywhere else. Thanks, both. V2 will include a new patch that reformats the doc to use these suggestions, and then extends it for the new subcommand. -Stolee
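For illustration, the restructured section suggested above might look something like the following asciidoc fragment (wording is illustrative only, not the actual v2 patch):

```asciidoc
The following subcommands are available:

write::
	Write a new MIDX file to `<dir>/packs/multi-pack-index`.

verify::
	Verify the contents of the MIDX file found at
	`<dir>/packs/multi-pack-index`.

expire::
	Delete the pack-files tracked by the MIDX file that have no
	objects referenced by the MIDX, then rewrite the MIDX file to
	remove all references to these pack-files.
```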
[PATCH 3/5] multi-pack-index: implement 'expire' verb
From: Derrick Stolee The 'git multi-pack-index expire' command looks at the existing multi-pack-index, counts the number of objects referenced in each pack-file, deletes the pack-files with no referenced objects, and rewrites the multi-pack-index to no longer reference those packs. Refactor the write_midx_file() method to call write_midx_internal() which now takes an existing 'struct multi_pack_index' and a list of pack-files to drop (as specified by the names of their pack-indexes). As we write the new multi-pack-index, we drop those file names from the list of known pack-files. The expire_midx_packs() method removes the unreferenced pack-files after carefully closing the packs to avoid open handles. Test that a new pack-file that covers the contents of two other pack-files leads to those pack-files being deleted during the expire command. Be sure to read the multi-pack-index to ensure it no longer references those packs. Signed-off-by: Derrick Stolee --- midx.c | 87 +++-- t/t5319-multi-pack-index.sh | 15 +++ 2 files changed, 98 insertions(+), 4 deletions(-) diff --git a/midx.c b/midx.c index 3bd7183a53..50e4cd7270 100644 --- a/midx.c +++ b/midx.c @@ -751,7 +751,8 @@ static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_off return written; } -int write_midx_file(const char *object_dir) +static int write_midx_internal(const char *object_dir, struct multi_pack_index *m, + struct string_list *packs_to_drop) { unsigned char cur_chunk, num_chunks = 0; char *midx_name; @@ -765,6 +766,7 @@ int write_midx_file(const char *object_dir) uint32_t nr_entries, num_large_offsets = 0; struct pack_midx_entry *entries = NULL; int large_offsets_needed = 0; + int result = 0; midx_name = get_midx_filename(object_dir); if (safe_create_leading_directories(midx_name)) { @@ -773,7 +775,10 @@ int write_midx_file(const char *object_dir) midx_name); } - packs.m = load_multi_pack_index(object_dir, 1); + if (m) + packs.m = m; + else + packs.m =
load_multi_pack_index(object_dir, 1); packs.nr = 0; packs.alloc_list = packs.m ? packs.m->num_packs : 16; @@ -787,7 +792,24 @@ int write_midx_file(const char *object_dir) ALLOC_ARRAY(packs.perm, packs.alloc_perm); if (packs.m) { + int drop_index = 0, missing_drops = 0; for (i = 0; i < packs.m->num_packs; i++) { + if (packs_to_drop && drop_index < packs_to_drop->nr) { + int cmp = strcmp(packs.m->pack_names[i], + packs_to_drop->items[drop_index].string); + + if (!cmp) { + drop_index++; + continue; + } else if (cmp > 0) { + error(_("did not see pack-file %s to drop"), + packs_to_drop->items[drop_index].string); + drop_index++; + i--; + missing_drops++; + } + } + ALLOC_GROW(packs.list, packs.nr + 1, packs.alloc_list); ALLOC_GROW(packs.names, packs.nr + 1, packs.alloc_names); ALLOC_GROW(packs.perm, packs.nr + 1, packs.alloc_perm); @@ -798,6 +820,12 @@ int write_midx_file(const char *object_dir) packs.pack_name_concat_len += strlen(packs.names[packs.nr]) + 1; packs.nr++; } + + if (packs_to_drop && (drop_index < packs_to_drop->nr || missing_drops)) { + error(_("did not see all pack-files to drop")); + result = 1; + goto cleanup; + } } for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &packs); @@ -932,7 +960,12 @@ cleanup: free(packs.perm); free(entries); free(midx_name); - return 0; + return result; +} + +int write_midx_file(const char *object_dir) +{ + return write_midx_internal(object_dir, NULL, NULL); } void clear_midx_file(struct repository *r) @@ -1034,5 +1067,51 @@ int verify_midx_file(const char *object_dir) int expire_midx_packs(const char *object_dir) { - return 0; + uint32_t i, *count, result = 0; + size_t dirlen; + struct strbuf buf = STRBUF_INIT; + struct string_list packs_to_drop = STRING_LIST_INIT_DUP; + struct multi_pack_index *m = load
[PATCH 5/5] midx: implement midx_repack()
From: Derrick Stolee To repack using a multi-pack-index, first sort all pack-files by their modified time. Second, walk those pack-files from oldest to newest, adding the packs to a list if they are smaller than the given pack-size. Finally, collect the objects from the multi-pack-index that are in those packs and send them to 'git pack-objects'. While first designing a 'git multi-pack-index repack' operation, I started by collecting the batches based on the size of the objects instead of the size of the pack-files. This allows repacking a large pack-file that has very few referenced objects. However, this came at a significant cost of parsing pack-files instead of simply reading the multi-pack-index and getting the file information for the pack-files. This object-size idea could be a direction for future expansion in this area. Signed-off-by: Derrick Stolee --- midx.c | 109 +++- t/t5319-multi-pack-index.sh | 25 + 2 files changed, 133 insertions(+), 1 deletion(-) diff --git a/midx.c b/midx.c index 4caf148464..3718e78132 100644 --- a/midx.c +++ b/midx.c @@ -8,6 +8,7 @@ #include "sha1-lookup.h" #include "midx.h" #include "progress.h" +#include "run-command.h" #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */ #define MIDX_VERSION 1 @@ -1116,7 +1117,113 @@ int expire_midx_packs(const char *object_dir) return result; } -int midx_repack(const char *object_dir, size_t batch_size) +struct time_and_id { + timestamp_t mtime; + uint32_t pack_int_id; +}; + +static int compare_by_mtime(const void *a_, const void *b_) { + const struct time_and_id *a, *b; + + a = (const struct time_and_id *)a_; + b = (const struct time_and_id *)b_; + + if (a->mtime < b->mtime) + return -1; + if (a->mtime > b->mtime) + return 1; return 0; } + +int midx_repack(const char *object_dir, size_t batch_size) +{ + int result = 0; + uint32_t i, packs_to_repack; + size_t total_size; + struct time_and_id *pack_ti; + unsigned char *include_pack; + struct child_process cmd = CHILD_PROCESS_INIT; + struct strbuf
base_name = STRBUF_INIT; + struct multi_pack_index *m = load_multi_pack_index(object_dir, 1); + + if (!m) + return 0; + + include_pack = xcalloc(m->num_packs, sizeof(unsigned char)); + pack_ti = xcalloc(m->num_packs, sizeof(struct time_and_id)); + + for (i = 0; i < m->num_packs; i++) { + pack_ti[i].pack_int_id = i; + + if (prepare_midx_pack(m, i)) + continue; + + pack_ti[i].mtime = m->packs[i]->mtime; + } + QSORT(pack_ti, m->num_packs, compare_by_mtime); + + total_size = 0; + packs_to_repack = 0; + for (i = 0; total_size < batch_size && i < m->num_packs; i++) { + int pack_int_id = pack_ti[i].pack_int_id; + struct packed_git *p = m->packs[pack_int_id]; + + if (!p) + continue; + if (p->pack_size >= batch_size) + continue; + + packs_to_repack++; + total_size += p->pack_size; + include_pack[pack_int_id] = 1; + } + + if (total_size < batch_size || packs_to_repack < 2) + goto cleanup; + + argv_array_push(&cmd.args, "pack-objects"); + + strbuf_addstr(&base_name, object_dir); + strbuf_addstr(&base_name, "/pack/pack"); + argv_array_push(&cmd.args, base_name.buf); + strbuf_release(&base_name); + + cmd.git_cmd = 1; + cmd.in = cmd.out = -1; + + if (start_command(&cmd)) { + error(_("could not start pack-objects")); + result = 1; + goto cleanup; + } + + for (i = 0; i < m->num_objects; i++) { + struct object_id oid; + uint32_t pack_int_id = nth_midxed_pack_int_id(m, i); + + if (!include_pack[pack_int_id]) + continue; + + nth_midxed_object_oid(&oid, m, i); + xwrite(cmd.in, oid_to_hex(&oid), the_hash_algo->hexsz); + xwrite(cmd.in, "\n", 1); + } + close(cmd.in); + + if (finish_command(&cmd)) { + error(_("could not finish pack-objects")); + result = 1; + goto cleanup; + } + + result = write_midx_internal(object_dir, m, NULL); + m = NULL; + +cleanup: + if (m) + close_midx(m); + free(include_pack); + free(pack_ti); + return result; +} diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
[PATCH 2/5] midx: refactor permutation logic
From: Derrick Stolee When writing a multi-pack-index, we keep track of an integer permutation, tracking the list of pack-files that we know about (both from the existing multi-pack-index and the new pack-files being introduced) and converting them into a sorted order for the new multi-pack-index. In anticipation of dropping pack-files from the existing multi-pack-index, refactor the logic around how we track this permutation. First, insert the permutation into the pack_list structure. This allows us to grow the permutation dynamically as we add packs. Second, fill the permutation with values corresponding to their position in the list of pack-files, sorted as follows: 1. The pack-files in the existing multi-pack-index, sorted lexicographically. 2. The pack-files not in the existing multi-pack-index, sorted as discovered from the filesystem. There is a subtlety in how we initialize this permutation, specifically how we use 'i' for the initial value. This will matter more when we implement the logic for dropping existing packs, as we will create holes in the ordering.
Signed-off-by: Derrick Stolee --- midx.c | 20 +--- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/midx.c b/midx.c index bb825ef816..3bd7183a53 100644 --- a/midx.c +++ b/midx.c @@ -380,9 +380,11 @@ static size_t write_midx_header(struct hashfile *f, struct pack_list { struct packed_git **list; char **names; + uint32_t *perm; uint32_t nr; uint32_t alloc_list; uint32_t alloc_names; + uint32_t alloc_perm; size_t pack_name_concat_len; struct multi_pack_index *m; }; @@ -398,6 +400,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len, ALLOC_GROW(packs->list, packs->nr + 1, packs->alloc_list); ALLOC_GROW(packs->names, packs->nr + 1, packs->alloc_names); + ALLOC_GROW(packs->perm, packs->nr + 1, packs->alloc_perm); packs->list[packs->nr] = add_packed_git(full_path, full_path_len, @@ -417,6 +420,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len, return; } + packs->perm[packs->nr] = packs->nr; packs->names[packs->nr] = xstrdup(file_name); packs->pack_name_concat_len += strlen(file_name) + 1; packs->nr++; @@ -443,7 +447,7 @@ static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *p ALLOC_ARRAY(pairs, nr_packs); for (i = 0; i < nr_packs; i++) { - pairs[i].pack_int_id = i; + pairs[i].pack_int_id = perm[i]; pairs[i].pack_name = pack_names[i]; } @@ -755,7 +759,6 @@ int write_midx_file(const char *object_dir) struct hashfile *f = NULL; struct lock_file lk; struct pack_list packs; - uint32_t *pack_perm = NULL; uint64_t written = 0; uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1]; uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1]; @@ -774,18 +777,22 @@ int write_midx_file(const char *object_dir) packs.nr = 0; packs.alloc_list = packs.m ? 
packs.m->num_packs : 16; - packs.alloc_names = packs.alloc_list; + packs.alloc_perm = packs.alloc_names = packs.alloc_list; packs.list = NULL; packs.names = NULL; + packs.perm = NULL; packs.pack_name_concat_len = 0; ALLOC_ARRAY(packs.list, packs.alloc_list); ALLOC_ARRAY(packs.names, packs.alloc_names); + ALLOC_ARRAY(packs.perm, packs.alloc_perm); if (packs.m) { for (i = 0; i < packs.m->num_packs; i++) { ALLOC_GROW(packs.list, packs.nr + 1, packs.alloc_list); ALLOC_GROW(packs.names, packs.nr + 1, packs.alloc_names); + ALLOC_GROW(packs.perm, packs.nr + 1, packs.alloc_perm); + packs.perm[packs.nr] = i; packs.list[packs.nr] = NULL; packs.names[packs.nr] = xstrdup(packs.m->pack_names[i]); packs.pack_name_concat_len += strlen(packs.names[packs.nr]) + 1; @@ -802,10 +809,9 @@ int write_midx_file(const char *object_dir) packs.pack_name_concat_len += MIDX_CHUNK_ALIGNMENT - (packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT); - ALLOC_ARRAY(pack_perm, packs.nr); - sort_packs_by_name(packs.names, packs.nr, pack_perm); + sort_packs_by_name(packs.names, packs.nr, packs.perm); - entries = get_sorted_entries(packs.m, packs.list, pack_perm, packs.nr, &nr_entries); + entries = get_sorted_entries(packs.m, packs.list, packs.perm, packs.nr, &nr_entri
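The permutation idea — sorting pack names while remembering each name's original position — can be illustrated outside git (pack names and positions below are made up):

```shell
# Each line pairs a pack name with its original position in the list
# (its pack_int_id). Sorting the lines by name gives the order the
# names take in the new multi-pack-index, while the second column is
# the permutation mapping sorted position back to pack_int_id.
printf '%s\n' 'pack-b.idx 0' 'pack-a.idx 1' 'pack-c.idx 2' | sort
# prints:
# pack-a.idx 1
# pack-b.idx 0
# pack-c.idx 2
```

This mirrors what sort_packs_by_name() does with the (pack_name, pack_int_id) pairs, except that after this patch the initial positions come from packs.perm rather than the loop index.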
[PATCH 0/5] Create 'expire' and 'repack' verbs for git-multi-pack-index
The multi-pack-index provides a fast way to find an object among a large list of pack-files. It stores a single pack-reference for each object id, so duplicate objects are ignored. Among a list of pack-files storing the same object, the most-recently modified one is used.

Create new verbs for the multi-pack-index builtin.

* 'git multi-pack-index expire': If we have a pack-file indexed by the multi-pack-index, but all objects in that pack are duplicated in more-recently modified packs, then delete that pack (and any others like it). Delete the reference to that pack in the multi-pack-index.

* 'git multi-pack-index repack --batch-size=': Starting from the oldest pack-files covered by the multi-pack-index, find those whose on-disk size is below the batch size until we have a collection of packs whose sizes add up to the batch size. Create a new pack containing all objects that the multi-pack-index references in those packs.

This allows us to create a new pattern for repacking objects: run 'repack'. After enough time has passed that all Git commands that started before the last 'repack' are finished, run 'expire'. This approach has some advantages over the existing "repack everything" model:

1. Incremental. We can repack a small batch of objects at a time, instead of repacking all reachable objects. We can also limit ourselves to the objects that do not appear in newer pack-files.

2. Highly Available. By adding a new pack-file (and not deleting the old pack-files) we do not interrupt concurrent Git commands, and do not suffer performance degradation. By expiring only pack-files that have no referenced objects, we know that Git commands that are doing normal object lookups* will not be interrupted.

* Note: if someone concurrently runs a Git command that uses get_all_packs(), then that command could try to read the pack-files and pack-indexes that we are deleting during an expire command. Such commands are usually related to object maintenance (i.e.
fsck, gc, pack-objects) or are related to less-often-used features (i.e. fast-import, http-backend, server-info). We plan to use this approach in VFS for Git to do background maintenance of the "shared object cache" which is a Git alternate directory filled with packfiles containing commits and trees. We currently download pack-files on an hourly basis to keep up-to-date with the central server. The cache servers supply packs on an hourly and daily basis, so most of the hourly packs become useless after a new daily pack is downloaded. The 'expire' command would clear out most of those packs, but many will still remain with fewer than 100 objects remaining. The 'repack' command (with a batch size of 1-3gb, probably) can condense the remaining packs in commands that run for 1-3 min at a time. Since the daily packs range from 100-250mb, we will also combine and condense those packs. Thanks, -Stolee Derrick Stolee (5): multi-pack-index: prepare for 'expire' verb midx: refactor permutation logic multi-pack-index: implement 'expire' verb multi-pack-index: prepare 'repack' verb midx: implement midx_repack() Documentation/git-multi-pack-index.txt | 20 +++ builtin/multi-pack-index.c | 12 +- midx.c | 222 +++-- midx.h | 2 + t/t5319-multi-pack-index.sh| 98 +++ 5 files changed, 343 insertions(+), 11 deletions(-) base-commit: 26aa9fc81d4c7f6c3b456a29da0b7ec72e5c6595 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-92%2Fderrickstolee%2Fmidx-expire%2Fupstream-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-92/derrickstolee/midx-expire/upstream-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/92 -- gitgitgadget
[PATCH 4/5] multi-pack-index: prepare 'repack' verb
From: Derrick Stolee In an environment where the multi-pack-index is useful, it is due to many pack-files and an inability to repack the object store into a single pack-file. However, it is likely that many of these pack-files are rather small, and could be repacked into a slightly larger pack-file without too much effort. It may also be important to ensure the object store is highly available and the repack operation does not interrupt concurrent git commands. Introduce a 'repack' verb to 'git multi-pack-index' that takes a '--batch-size' option. The verb will inspect the multi-pack-index for referenced pack-files whose size is smaller than the batch size, until collecting a list of pack-files whose sizes sum to larger than the batch size. Then, a new pack-file will be created containing the objects from those pack-files that are referenced by the multi-pack-index. The resulting pack is likely to actually be smaller than the batch size due to compression and the fact that there may be objects in the pack-files that have duplicate copies in other pack-files. The current change introduces the command-line arguments, and we add a test that ensures we parse these options properly. Since we specify a small batch size, we will guarantee that future implementations do not change the list of pack-files. Signed-off-by: Derrick Stolee --- Documentation/git-multi-pack-index.txt | 12 builtin/multi-pack-index.c | 10 +- midx.c | 5 + midx.h | 1 + t/t5319-multi-pack-index.sh| 11 +++ 5 files changed, 38 insertions(+), 1 deletion(-) diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt index 822d83c845..e43e7da71e 100644 --- a/Documentation/git-multi-pack-index.txt +++ b/Documentation/git-multi-pack-index.txt @@ -39,6 +39,18 @@ expire:: Rewrite the MIDX file afterward to remove all references to these pack-files. 
+repack:: + When given as the verb, collect a batch of pack-files whose + sizes are all at most the size given by --batch-size, but + whose sizes sum to larger than --batch-size. The batch is + selected by greedily adding small pack-files starting with + the oldest pack-files that fit the size. Create a new pack- + file containing the objects the multi-pack-index indexes + into those pack-files, and rewrite the multi-pack-index to + contain that pack-file. A later run of 'git multi-pack-index + expire' will delete the pack-files that were part of this + batch. + EXAMPLES diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c index 145de3a46c..d87a2235e3 100644 --- a/builtin/multi-pack-index.c +++ b/builtin/multi-pack-index.c @@ -5,12 +5,13 @@ #include "midx.h" static char const * const builtin_multi_pack_index_usage[] = { - N_("git multi-pack-index [--object-dir=] (write|verify|expire)"), + N_("git multi-pack-index [--object-dir=] (write|verify|expire|repack --batch-size=)"), NULL }; static struct opts_multi_pack_index { const char *object_dir; + unsigned long batch_size; } opts; int cmd_multi_pack_index(int argc, const char **argv, @@ -19,6 +20,8 @@ int cmd_multi_pack_index(int argc, const char **argv, static struct option builtin_multi_pack_index_options[] = { OPT_FILENAME(0, "object-dir", &opts.object_dir, N_("object directory containing set of packfile and pack-index pairs")), + OPT_MAGNITUDE(0, "batch-size", &opts.batch_size, + N_("during repack, collect pack-files of smaller size into a batch that is larger than this size")), OPT_END(), }; @@ -40,6 +43,11 @@ int cmd_multi_pack_index(int argc, const char **argv, return 1; } + if (!strcmp(argv[0], "repack")) + return midx_repack(opts.object_dir, (size_t)opts.batch_size); + if (opts.batch_size) + die(_("--batch-size option is only for 'repack' verb")); + if (!strcmp(argv[0], "write")) return write_midx_file(opts.object_dir); if (!strcmp(argv[0], "verify")) diff --git a/midx.c b/midx.c index
50e4cd7270..4caf148464 100644 --- a/midx.c +++ b/midx.c @@ -1115,3 +1115,8 @@ int expire_midx_packs(const char *object_dir) string_list_clear(&packs_to_drop, 0); return result; } + +int midx_repack(const char *object_dir, size_t batch_size) +{ + return 0; +} diff --git a/midx.h b/midx.h index e3a2b740b5..394a21ee96 100644 --- a/midx.h +++ b/midx.h @@ -50,6 +50,7 @@ int write_midx_file(const char *object_dir); void clear_midx_file(struct repository *r); int veri
[PATCH 1/5] multi-pack-index: prepare for 'expire' verb
From: Derrick Stolee The multi-pack-index tracks objects in a collection of pack-files. Only one copy of each object is indexed, using the modified time of the pack-files to determine tie-breakers. It is possible to have a pack-file with no referenced objects because all objects have a duplicate in a newer pack-file. Introduce a new 'expire' verb to the multi-pack-index builtin. This verb will delete these unused pack-files and rewrite the multi-pack-index to no longer refer to those files. More details about the specifics will follow as the method is implemented. Add a test that verifies the 'expire' verb is correctly wired, but will still be valid when the verb is implemented. Specifically, create a set of packs that should all have referenced objects and should not be removed during an 'expire' operation. Signed-off-by: Derrick Stolee --- Documentation/git-multi-pack-index.txt | 8 + builtin/multi-pack-index.c | 4 ++- midx.c | 5 +++ midx.h | 1 + t/t5319-multi-pack-index.sh| 47 ++ 5 files changed, 64 insertions(+), 1 deletion(-) diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt index f7778a2c85..822d83c845 100644 --- a/Documentation/git-multi-pack-index.txt +++ b/Documentation/git-multi-pack-index.txt @@ -31,6 +31,14 @@ verify:: When given as the verb, verify the contents of the MIDX file at `/packs/multi-pack-index`. +expire:: + When given as the verb, delete the pack-files that are tracked + by the MIDX file at `/packs/multi-pack-index` but have + no objects referenced by the MIDX. All objects in these pack- + files have another copy in a more-recently modified pack-file. + Rewrite the MIDX file afterward to remove all references to + these pack-files. 
+ EXAMPLES diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c index fca70f8e4f..145de3a46c 100644 --- a/builtin/multi-pack-index.c +++ b/builtin/multi-pack-index.c @@ -5,7 +5,7 @@ #include "midx.h" static char const * const builtin_multi_pack_index_usage[] = { - N_("git multi-pack-index [--object-dir=] (write|verify)"), + N_("git multi-pack-index [--object-dir=] (write|verify|expire)"), NULL }; @@ -44,6 +44,8 @@ int cmd_multi_pack_index(int argc, const char **argv, return write_midx_file(opts.object_dir); if (!strcmp(argv[0], "verify")) return verify_midx_file(opts.object_dir); + if (!strcmp(argv[0], "expire")) + return expire_midx_packs(opts.object_dir); die(_("unrecognized verb: %s"), argv[0]); } diff --git a/midx.c b/midx.c index 730ff84dff..bb825ef816 100644 --- a/midx.c +++ b/midx.c @@ -1025,3 +1025,8 @@ int verify_midx_file(const char *object_dir) return verify_midx_error; } + +int expire_midx_packs(const char *object_dir) +{ + return 0; +} diff --git a/midx.h b/midx.h index 774f652530..e3a2b740b5 100644 --- a/midx.h +++ b/midx.h @@ -49,6 +49,7 @@ int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, i int write_midx_file(const char *object_dir); void clear_midx_file(struct repository *r); int verify_midx_file(const char *object_dir); +int expire_midx_packs(const char *object_dir); void close_midx(struct multi_pack_index *m); diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index 70926b5bc0..948effc1ee 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -348,4 +348,51 @@ test_expect_success 'verify incorrect 64-bit offset' ' "incorrect object offset" ' +test_expect_success 'setup expire tests' ' + mkdir dup && + ( + cd dup && + git init && + for i in $(test_seq 1 20) + do + test_commit $i + done && + git branch A HEAD && + git branch B HEAD~8 && + git branch C HEAD~13 && + git branch D HEAD~16 && + git branch E HEAD~18 && + git pack-objects --revs 
.git/objects/pack/pack-E <<-EOF && + refs/heads/E + EOF + git pack-objects --revs .git/objects/pack/pack-D <<-EOF && + refs/heads/D + ^refs/heads/E + EOF + git pack-objects --revs .git/objects/pack/pack-C <<-EOF && + refs/heads/C + ^refs/heads/D + EOF + git pack-objects --revs .git/objects/pack/pack-B <<-EOF && +
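The expire decision described in the commit message above (a pack-file can be deleted exactly when the MIDX references none of its objects, because every object has a copy in a more-recently modified pack) can be sketched as a toy model. This is plain Python, not Git code; the function names and pack tuples are invented for illustration:

```python
# Toy model (not Git code) of the 'expire' decision. Each pack is a
# (name, mtime, set_of_oids) tuple. The MIDX indexes one copy of each
# object, breaking ties toward the newer pack-file.

def midx_object_map(packs):
    """Map each oid to the pack the MIDX would reference it from."""
    chosen = {}
    for name, mtime, oids in sorted(packs, key=lambda p: p[1]):
        for oid in oids:
            chosen[oid] = name  # newer packs overwrite older entries
    return chosen

def expirable_packs(packs):
    """Names of packs with zero objects referenced by the MIDX."""
    chosen = midx_object_map(packs)
    referenced = set(chosen.values())
    return [name for name, _, _ in packs if name not in referenced]
```

For example, a pack whose every object was duplicated into a newer pack is reported as expirable, while a pack holding even one uniquely-referenced object is kept.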
[PATCH v3 5/6] pack-objects: create pack.useSparse setting
From: Derrick Stolee The '--sparse' flag in 'git pack-objects' changes the algorithm used to enumerate objects to one that is faster for individual users pushing new objects that change only a small cone of the working directory. The sparse algorithm is not recommended for a server, which likely sends new objects that appear across the entire working directory. Create a 'pack.useSparse' setting that enables this new algorithm. This allows 'git push' to use this algorithm without passing a '--sparse' flag all the way through four levels of run_command() calls. If the '--no-sparse' flag is set, then this config setting is overridden. Signed-off-by: Derrick Stolee --- Documentation/config/pack.txt | 9 + builtin/pack-objects.c | 4 t/t5322-pack-objects-sparse.sh | 15 +++ 3 files changed, 28 insertions(+) diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt index edac75c83f..425c73aa52 100644 --- a/Documentation/config/pack.txt +++ b/Documentation/config/pack.txt @@ -105,6 +105,15 @@ pack.useBitmaps:: true. You should not generally need to turn this off unless you are debugging pack bitmaps. +pack.useSparse:: + When true, git will default to using the '--sparse' option in + 'git pack-objects' when the '--revs' option is present. This + algorithm only walks trees that appear in paths that introduce new + objects. This can have significant performance benefits when + computing a pack to send a small change. However, it is possible + that extra objects are added to the pack-file if the included + commits contain certain types of direct renames. + pack.writeBitmaps (deprecated):: This is a deprecated synonym for `repack.writeBitmaps`. 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 7d5b0735e3..124b1bafc4 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, void *cb) use_bitmap_index_default = git_config_bool(k, v); return 0; } + if (!strcmp(k, "pack.usesparse")) { + sparse = git_config_bool(k, v); + return 0; + } if (!strcmp(k, "pack.threads")) { delta_search_threads = git_config_int(k, v); if (delta_search_threads < 0) diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh index 45dba6e014..8f5699bd91 100755 --- a/t/t5322-pack-objects-sparse.sh +++ b/t/t5322-pack-objects-sparse.sh @@ -121,4 +121,19 @@ test_expect_success 'sparse pack-objects' ' test_cmp expect_sparse_objects.txt sparse_objects.txt ' +test_expect_success 'pack.useSparse enables algorithm' ' + git config pack.useSparse true && + git pack-objects --stdout --revs sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index sparse_objects.txt && + test_cmp expect_sparse_objects.txt sparse_objects.txt +' + +test_expect_success 'pack.useSparse overridden' ' + git pack-objects --stdout --revs --no-sparse sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index sparse_objects.txt && + test_cmp expect_objects.txt sparse_objects.txt +' + test_done -- gitgitgadget
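The interaction between the config setting and the command line described above can be modelled in a couple of lines — a toy sketch, not the actual parse_options/git_config machinery; the function name is invented:

```python
# Toy model (not Git code) of how pack.useSparse interacts with the
# command line: the config only supplies the default, and an explicit
# --sparse / --no-sparse flag always wins.

def effective_sparse(config_use_sparse, cli_sparse=None):
    """cli_sparse is None when neither --sparse nor --no-sparse was given."""
    if cli_sparse is not None:
        return cli_sparse
    return bool(config_use_sparse)
```

In the real builtin this falls out of ordering: git_config() runs before parse_options(), so a command-line flag simply overwrites the value the config callback stored.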
[PATCH v3 2/6] list-objects: consume sparse tree walk
From: Derrick Stolee When creating a pack-file using 'git pack-objects --revs' we provide a list of interesting and uninteresting commits. For example, a push operation would make the local topic branch interesting and the known remote refs uninteresting. We want to discover the set of new objects to send to the server as a thin pack. We walk these commits until we discover a frontier of commits such that every commit walk starting at interesting commits ends in a root commit or uninteresting commit. We then need to discover which non-commit objects are reachable from uninteresting commits. The mark_edges_uninteresting() method in list-objects.c iterates on the commit list and does the following: * If the commit is UNINTERESTING, then mark its root tree and every object it can reach as UNINTERESTING. * If the commit is interesting, then mark the root tree of every UNINTERESTING parent (and all objects that tree can reach) as UNINTERESTING. At the very end, we repeat the process on every commit directly given to the revision walk from stdin. This helps ensure we properly cover shallow commits that otherwise were not included in the frontier. The logic to recursively follow trees is in the mark_tree_uninteresting() method in revision.c. The algorithm avoids duplicate work by not recursing into trees that are already marked UNINTERESTING. Add a new 'sparse' option to the mark_edges_uninteresting() method that performs this logic in a slightly new way. As we iterate over the commits, we add all of the root trees to an oidset. Then we call mark_trees_uninteresting_sparse() on that oidset. Note that we include interesting trees in this process. The current implementation of mark_trees_uninteresting_sparse() will walk the same trees as the old logic, but this will be replaced in a later change. The sparse option is not used by any callers at the moment, but will be wired to 'git pack-objects' in the next change. 
Signed-off-by: Derrick Stolee --- bisect.c | 2 +- builtin/pack-objects.c | 2 +- builtin/rev-list.c | 2 +- http-push.c| 2 +- list-objects.c | 55 +++--- list-objects.h | 4 ++- revision.c | 3 +++ 7 files changed, 61 insertions(+), 9 deletions(-) diff --git a/bisect.c b/bisect.c index 487675c672..842f8b4b8f 100644 --- a/bisect.c +++ b/bisect.c @@ -656,7 +656,7 @@ static void bisect_common(struct rev_info *revs) if (prepare_revision_walk(revs)) die("revision walk setup failed"); if (revs->tree_objects) - mark_edges_uninteresting(revs, NULL); + mark_edges_uninteresting(revs, NULL, 0); } static void exit_if_skipped_commits(struct commit_list *tried, diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 411aefd687..5f70d840a7 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -3135,7 +3135,7 @@ static void get_object_list(int ac, const char **av) if (prepare_revision_walk(&revs)) die(_("revision walk setup failed")); - mark_edges_uninteresting(&revs, show_edge); + mark_edges_uninteresting(&revs, show_edge, 0); if (!fn_show_object) fn_show_object = show_object; diff --git a/builtin/rev-list.c b/builtin/rev-list.c index 2880ed37e3..9663cbfae0 100644 --- a/builtin/rev-list.c +++ b/builtin/rev-list.c @@ -543,7 +543,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix) if (prepare_revision_walk(&revs)) die("revision walk setup failed"); if (revs.tree_objects) - mark_edges_uninteresting(&revs, show_edge); + mark_edges_uninteresting(&revs, show_edge, 0); if (bisect_list) { int reaches, all; diff --git a/http-push.c b/http-push.c index cd48590912..ea52d6f9f6 100644 --- a/http-push.c +++ b/http-push.c @@ -1933,7 +1933,7 @@ int cmd_main(int argc, const char **argv) pushing = 0; if (prepare_revision_walk(&revs)) die("revision walk setup failed"); - mark_edges_uninteresting(&revs, NULL); + mark_edges_uninteresting(&revs, NULL, 0); objects_to_send = get_delta(&revs, ref_lock); finish_all_active_slots(); diff --git a/list-objects.c 
b/list-objects.c index c41cc80db5..4fbdeca0a4 100644 --- a/list-objects.c +++ b/list-objects.c @@ -222,25 +222,72 @@ static void mark_edge_parents_uninteresting(struct commit *commit, } } -void mark_edges_uninteresting(struct rev_info *revs, show_edge_fn show_edge) +static void add_edge_parents(struct commit *commit, +struct rev_info *revs, +show_edge_fn show_edge, +struct oidset *set) +
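The restructured edge walk described above can be sketched as a toy model — plain Python, not the list-objects.c code; the function name and tuples are invented. The key change is that the sparse path first funnels every root tree into one oidset (deliberately including the interesting ones) before any marking happens:

```python
# Toy model (not Git code) of the sparse path in mark_edges_uninteresting():
# gather every commit's root tree into one set, then hand the whole set to
# a single mark_trees_uninteresting_sparse()-style pass.

def gather_edge_trees(commits):
    """commits: iterable of (root_tree_oid, is_uninteresting) pairs.
    Returns (all_root_trees, uninteresting_root_trees); note that the
    interesting root trees are deliberately part of the first set."""
    trees, uninteresting = set(), set()
    for tree_oid, is_unint in commits:
        trees.add(tree_oid)  # duplicates collapse: each tree handled once
        if is_unint:
            uninteresting.add(tree_oid)
    return trees, uninteresting
```

Collecting into a set also deduplicates root trees shared by several frontier commits, so the later marking pass never visits the same root tree twice.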
[PATCH v3 4/6] revision: implement sparse algorithm
From: Derrick Stolee When enumerating objects to place in a pack-file during 'git pack-objects --revs', we discover the "frontier" of commits that we care about and the boundary with commits we find uninteresting. From that point, we walk trees to discover which trees and blobs are uninteresting. Finally, we walk trees to find the interesting trees. This commit introduces a new, "sparse" way to discover the uninteresting trees. We use the perspective of a single user trying to push their topic to a large repository. That user likely changed a very small fraction of the paths in their working directory, but we spend a lot of time walking all reachable trees. The way to switch the logic to work in this sparse way is to start caring about which paths introduce new trees. While it is not possible to generate a diff between the frontier boundary and all of the interesting commits, we can simulate that behavior by inspecting all of the root trees as a whole, then recursing down to the set of trees at each path. We had already taken the first step by passing an oidset to mark_trees_uninteresting_sparse(). We now create a dictionary whose keys are paths and values are oidsets. We consider the set of trees that appear at each path. While we inspect a tree, we add its subtrees to the oidsets corresponding to the tree entry's path. We also mark trees as UNINTERESTING if the tree we are parsing is UNINTERESTING. To actually improve the performance, we need to terminate our recursion unless the oidset contains some interesting trees and some uninteresting trees. Technically, we only need one interesting tree for this to speed up in most cases, but we also will not mark anything UNINTERESTING if there are no uninteresting trees, so that would be wasted effort. There are a few ways that this is not a universally better option. First, we can pack extra objects. 
If someone copies a subtree from one tree to another, the first tree will appear UNINTERESTING and we will not recurse to see that the subtree should also be UNINTERESTING. We will walk the new tree and see the subtree as a "new" object and add it to the pack. We add a test case that demonstrates this as a way to prove that the --sparse option is actually working. Second, we can have extra memory pressure. If instead of being a single user pushing a small topic we are a server sending new objects from across the entire working directory, then we will gain very little (the recursion will rarely terminate early) but will spend extra time maintaining the path-oidset dictionaries. Despite these potential drawbacks, the benefits of the algorithm are clear. By adding a counter to 'add_children_by_path' and 'mark_tree_contents_uninteresting', I measured the number of parsed trees for the two algorithms in a variety of repos. For git.git, I used the following input: v2.19.0 ^v2.19.0~10 Objects to pack: 550 Walked (old alg): 282 Walked (new alg): 130 For the Linux repo, I used the following input: v4.18 ^v4.18~10 Objects to pack: 518 Walked (old alg): 4,836 Walked (new alg): 188 The two repos above are rather "wide and flat" compared to other repos that I have used in the past. As a comparison, I tested an old topic branch in the Azure DevOps repo, which has a much deeper folder structure than the Linux repo. Objects to pack: 220 Walked (old alg): 22,804 Walked (new alg): 129 I used the number of walked trees as the main metric above because it is consistent across multiple runs. When I ran my tests, the performance of the pack-objects command with the same options could change the end-to-end time by 10x depending on whether the file system was warm. However, by running the same test repeatedly I could get more consistent timing results. The git.git and Linux tests were too fast overall (less than 0.5s) to measure an end-to-end difference. 
The Azure DevOps case was slow enough to see the time improve from 15s to 1s in the warm case. The cold case was 90s to 9s in my testing. These improvements will have even larger benefits in the super- large Windows repository. In our experiments, we see the "Enumerate objects" phase of pack-objects taking 60-80% of the end-to-end time of non-trivial pushes, taking longer than the network time to send the pack and the server time to verify the pack. Signed-off-by: Derrick Stolee --- revision.c | 116 ++--- t/t5322-pack-objects-sparse.sh | 21 -- 2 files changed, 121 insertions(+), 16 deletions(-) diff --git a/revision.c b/revision.c index f9eb6400f1..971f1bb095 100644 --- a/revision.c +++ b/revision.c @@ -99,29 +99,125 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree) mark_tree_contents_uninteresting(r, tree); } +struct paths_and_oids { + struct string_list list; +}; + +static void paths_and_oids_init(st
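The recursion rule described in the commit message (descend at a path only when the trees gathered there mix UNINTERESTING and interesting entries, and propagate UNINTERESTING to subtrees) can be modelled in a few lines. This is a toy — not the revision.c implementation; the tree ids and the `trees` table are invented for illustration:

```python
# Toy model (not Git code) of the sparse recursion. 'trees' maps a tree id
# to {entry_path: child_tree_id}; 'uninteresting' is the set of marked
# tree ids and is mutated in place.

def sparse_mark(trees, roots, uninteresting, walked):
    """roots: the set of tree ids gathered at one path."""
    if not (roots - uninteresting) or not (roots & uninteresting):
        return  # no mix of interesting/uninteresting: stop early
    by_path = {}  # the paths-to-oidsets dictionary from the patch
    for tid in roots:
        walked.append(tid)  # counts "parsed" trees, as in the measurements
        for path, child in trees[tid].items():
            by_path.setdefault(path, set()).add(child)
            if tid in uninteresting:
                uninteresting.add(child)  # UNINTERESTING propagates down
    for children in by_path.values():
        sparse_mark(trees, children, uninteresting, walked)
```

With two root trees that share an identical subtree at some path, that path holds only one (uninteresting) tree, so the walk terminates there without parsing it — which is exactly where the "Walked (new alg)" numbers come from.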
[PATCH v3 6/6] pack-objects: create GIT_TEST_PACK_SPARSE
From: Derrick Stolee Create a test variable GIT_TEST_PACK_SPARSE to enable the sparse object walk algorithm by default during the test suite. Enabling this variable ensures coverage in many interesting cases, such as shallow clones, partial clones, and missing objects. Signed-off-by: Derrick Stolee --- builtin/pack-objects.c | 1 + t/README | 4 t/t5322-pack-objects-sparse.sh | 6 +++--- 3 files changed, 8 insertions(+), 3 deletions(-) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 124b1bafc4..507d381153 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -3331,6 +3331,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) read_replace_refs = 0; + sparse = git_env_bool("GIT_TEST_PACK_SPARSE", 0); reset_pack_idx_option(&pack_idx_opts); git_config(git_pack_config, NULL); diff --git a/t/README b/t/README index 28711cc508..8b6dfe1864 100644 --- a/t/README +++ b/t/README @@ -342,6 +342,10 @@ GIT_TEST_INDEX_VERSION= exercises the index read/write code path for the index version specified. Can be set to any valid version (currently 2, 3, or 4). +GIT_TEST_PACK_SPARSE= if enabled will default the pack-objects +builtin to use the sparse object walk. This can still be overridden by +the --no-sparse command-line argument. + GIT_TEST_PRELOAD_INDEX= exercises the preload-index code path by overriding the minimum number of cache entries required per thread. 
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh index 8f5699bd91..e8cf41d1c6 100755 --- a/t/t5322-pack-objects-sparse.sh +++ b/t/t5322-pack-objects-sparse.sh @@ -36,7 +36,7 @@ test_expect_success 'setup repo' ' ' test_expect_success 'non-sparse pack-objects' ' - git pack-objects --stdout --revs nonsparse.pack && + git pack-objects --stdout --revs --no-sparse nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index nonsparse_objects.txt && test_cmp expect_objects.txt nonsparse_objects.txt @@ -70,7 +70,7 @@ test_expect_success 'duplicate a folder from f3 and commit to topic1' ' ' test_expect_success 'non-sparse pack-objects' ' - git pack-objects --stdout --revs nonsparse.pack && + git pack-objects --stdout --revs --no-sparse nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index nonsparse_objects.txt && test_cmp expect_objects.txt nonsparse_objects.txt @@ -102,7 +102,7 @@ test_expect_success 'non-sparse pack-objects' ' topic1 \ topic1^{tree} \ topic1:f3 | sort >expect_objects.txt && - git pack-objects --stdout --revs nonsparse.pack && + git pack-objects --stdout --revs --no-sparse nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index nonsparse_objects.txt && test_cmp expect_objects.txt nonsparse_objects.txt -- gitgitgadget
[PATCH v3 3/6] pack-objects: add --sparse option
From: Derrick Stolee Add a '--sparse' option flag to the pack-objects builtin. This allows the user to specify that they want to use the new logic for walking trees. This logic currently does not differ from the existing output, but will in a later change. Create a new test script, t5322-pack-objects-sparse.sh, to ensure the object list that is selected matches what we expect. When we update the logic to walk in a sparse fashion, the final test will be updated to show the extra objects that are added. Signed-off-by: Derrick Stolee --- Documentation/git-pack-objects.txt | 11 ++- builtin/pack-objects.c | 5 +- t/t5322-pack-objects-sparse.sh | 115 + 3 files changed, 129 insertions(+), 2 deletions(-) create mode 100755 t/t5322-pack-objects-sparse.sh diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt index 40c825c381..e45f3e680d 100644 --- a/Documentation/git-pack-objects.txt +++ b/Documentation/git-pack-objects.txt @@ -14,7 +14,7 @@ SYNOPSIS [--local] [--incremental] [--window=] [--depth=] [--revs [--unpacked | --all]] [--keep-pack=] [--stdout [--filter=] | base-name] - [--shallow] [--keep-true-parents] < object-list + [--shallow] [--keep-true-parents] [--sparse] < object-list DESCRIPTION @@ -196,6 +196,15 @@ depth is 4095. Add --no-reuse-object if you want to force a uniform compression level on all data no matter the source. +--sparse:: + Use the "sparse" algorithm to determine which objects to include in + the pack, when combined with the "--revs" option. This algorithm + only walks trees that appear in paths that introduce new objects. + This can have significant performance benefits when computing + a pack to send a small change. However, it is possible that extra + objects are added to the pack-file if the included commits contain + certain types of direct renames. + --thin:: Create a "thin" pack by omitting the common objects between a sender and a receiver in order to reduce network transfer. 
This diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 5f70d840a7..7d5b0735e3 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -84,6 +84,7 @@ static unsigned long pack_size_limit; static int depth = 50; static int delta_search_threads; static int pack_to_stdout; +static int sparse; static int thin; static int num_preferred_base; static struct progress *progress_state; @@ -3135,7 +3136,7 @@ static void get_object_list(int ac, const char **av) if (prepare_revision_walk(&revs)) die(_("revision walk setup failed")); - mark_edges_uninteresting(&revs, show_edge, 0); + mark_edges_uninteresting(&revs, show_edge, sparse); if (!fn_show_object) fn_show_object = show_object; @@ -3292,6 +3293,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) { OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"), N_("unpack unreachable objects newer than "), PARSE_OPT_OPTARG, option_parse_unpack_unreachable }, + OPT_BOOL(0, "sparse", &sparse, +N_("use the sparse reachability algorithm")), OPT_BOOL(0, "thin", &thin, N_("create thin packs")), OPT_BOOL(0, "shallow", &shallow, diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh new file mode 100755 index 00..81f6805bc3 --- /dev/null +++ b/t/t5322-pack-objects-sparse.sh @@ -0,0 +1,115 @@ +#!/bin/sh + +test_description='pack-objects object selection using sparse algorithm' +. ./test-lib.sh + +test_expect_success 'setup repo' ' + test_commit initial && + for i in $(test_seq 1 3) + do + mkdir f$i && + for j in $(test_seq 1 3) + do + mkdir f$i/f$j && + echo $j >f$i/f$j/data.txt + done + done && + git add . && + git commit -m "Initialized trees" && + for i in $(test_seq 1 3) + do + git checkout -b topic$i master && + echo change-$i >f$i/f$i/data.txt && + git commit -a -m "Changed f$i/f$i/data.txt" + done && + cat >packinput.txt <<-EOF && + topic1 + ^topic2 + ^topic3 + EOF + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f1 \ + topic1
[PATCH v3 1/6] revision: add mark_tree_uninteresting_sparse
From: Derrick Stolee In preparation for a new algorithm that walks fewer trees when creating a pack from a set of revisions, create a method that takes an oidset of tree oids and marks reachable objects as UNINTERESTING. The current implementation uses the existing mark_tree_uninteresting to recursively walk the trees and blobs. This will walk the same number of trees as the old mechanism. There is one new assumption in this approach: we are also given the oids of the interesting trees. This implementation does not use those trees at the moment, but we will use them in a later rewrite of this method. Signed-off-by: Derrick Stolee --- revision.c | 22 ++ revision.h | 2 ++ 2 files changed, 24 insertions(+) diff --git a/revision.c b/revision.c index 13e0519c02..3a62c7c187 100644 --- a/revision.c +++ b/revision.c @@ -99,6 +99,28 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree) mark_tree_contents_uninteresting(r, tree); } +void mark_trees_uninteresting_sparse(struct repository *r, +struct oidset *set) +{ + struct object_id *oid; + struct oidset_iter iter; + + oidset_iter_init(set, &iter); + while ((oid = oidset_iter_next(&iter))) { + struct tree *tree = lookup_tree(r, oid); + + if (tree->object.flags & UNINTERESTING) { + /* +* Remove the flag so the next call +* is not a no-op. The flag is added +* in mark_tree_uninteresting(). 
+*/ + tree->object.flags ^= UNINTERESTING; + mark_tree_uninteresting(r, tree); + } + } +} + struct commit_stack { struct commit **items; size_t nr, alloc; diff --git a/revision.h b/revision.h index 7987bfcd2e..f828e91ae9 100644 --- a/revision.h +++ b/revision.h @@ -67,6 +67,7 @@ struct rev_cmdline_info { #define REVISION_WALK_NO_WALK_SORTED 1 #define REVISION_WALK_NO_WALK_UNSORTED 2 +struct oidset; struct topo_walk_info; struct rev_info { @@ -327,6 +328,7 @@ void put_revision_mark(const struct rev_info *revs, void mark_parents_uninteresting(struct commit *commit); void mark_tree_uninteresting(struct repository *r, struct tree *tree); +void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set); void show_object_with_name(FILE *, struct object *, const char *); -- gitgitgadget
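The flag dance in the patch above — clear UNINTERESTING before re-marking, because mark_tree_uninteresting() short-circuits on already-flagged trees — can be sketched as a toy model. Plain Python, not Git code; the flag table and tree ids are invented:

```python
# Toy model (not Git code) of the flag handling in
# mark_trees_uninteresting_sparse(): mark_tree_uninteresting() is a no-op
# when the UNINTERESTING flag is already set, so the sparse entry point
# clears the flag first to force one full recursive marking pass.

UNINTERESTING = 1 << 0

def mark_tree_uninteresting(flags, trees, tid):
    if flags.get(tid, 0) & UNINTERESTING:
        return  # already marked: assume the whole subtree was handled
    flags[tid] = flags.get(tid, 0) | UNINTERESTING
    for child in trees[tid]:
        mark_tree_uninteresting(flags, trees, child)

def mark_trees_uninteresting_sparse(flags, trees, oidset):
    for tid in oidset:
        if flags.get(tid, 0) & UNINTERESTING:
            flags[tid] ^= UNINTERESTING  # clear so the next call is not a no-op
            mark_tree_uninteresting(flags, trees, tid)
```

Note that, as in the patch, trees in the oidset that were never flagged (the interesting ones) are simply skipped.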
[PATCH v3 0/6] Add a new "sparse" tree walk algorithm
One of the biggest remaining pain points for users of very large repositories is the time it takes to run 'git push'. We inspected some slow pushes by our developers and found that the "Enumerating Objects" phase of a push was very slow. This is unsurprising, because this is why reachability bitmaps exist. However, reachability bitmaps are not available to us because of the single pack-file requirement. The bitmap approach is intended for servers anyway, and clients have a much different behavior pattern. Specifically, clients are normally pushing a very small number of objects compared to the entire working directory. A typical user changes only a small cone of the working directory, so let's use that to our benefit. Create a new "sparse" mode for 'git pack-objects' that uses the paths that introduce new objects to direct our search into the reachable trees. By collecting trees at each path, we can then recurse into a path only when there are uninteresting and interesting trees at that path. This gains a significant performance boost for small topics while presenting a possibility of packing extra objects. The main algorithm change is in patch 4, but is set up a little bit in patches 1 and 2. As demonstrated in the included test script, we see that the existing algorithm can send extra objects due to the way we specify the "frontier". But we can send even more objects if a user copies objects from one folder to another. I say "copy" because a rename would (usually) change the original folder and trigger a walk into that path, discovering the objects. In order to benefit from this approach, the user can opt-in using the pack.useSparse config setting. This setting can be overridden using the '--no-sparse' option. Update in V2: * Added GIT_TEST_PACK_SPARSE test option. * Fixed test breakages when GIT_TEST_PACK_SPARSE is enabled by adding null checks. Update in V3: * Change documentation around 'pack.useSparse' config setting to better advertise to users. 
* Mostly a ping now that v2.20.0 is out. Derrick Stolee (6): revision: add mark_tree_uninteresting_sparse list-objects: consume sparse tree walk pack-objects: add --sparse option revision: implement sparse algorithm pack-objects: create pack.useSparse setting pack-objects: create GIT_TEST_PACK_SPARSE Documentation/config/pack.txt | 9 ++ Documentation/git-pack-objects.txt | 11 ++- bisect.c | 2 +- builtin/pack-objects.c | 10 ++- builtin/rev-list.c | 2 +- http-push.c| 2 +- list-objects.c | 55 +++- list-objects.h | 4 +- revision.c | 121 + revision.h | 2 + t/README | 4 + t/t5322-pack-objects-sparse.sh | 139 + 12 files changed, 351 insertions(+), 10 deletions(-) create mode 100755 t/t5322-pack-objects-sparse.sh base-commit: a1598010f775d82b5adf12c29d0f5bc9b41434c6 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-89%2Fderrickstolee%2Fpush%2Fsparse-v3 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-89/derrickstolee/push/sparse-v3 Pull-Request: https://github.com/gitgitgadget/git/pull/89 Range-diff vs v2: 1: 60617681f7 = 1: 60617681f7 revision: add mark_tree_uninteresting_sparse 2: 4527addacb = 2: 4527addacb list-objects: consume sparse tree walk 3: 9644f6ff04 ! 3: 4ef318bdb2 pack-objects: add --sparse option @@ -32,7 +32,9 @@ +--sparse:: + Use the "sparse" algorithm to determine which objects to include in -+ the pack. This can have significant performance benefits when computing ++ the pack, when combined with the "--revs" option. This algorithm ++ only walks trees that appear in paths that introduce new objects. ++ This can have significant performance benefits when computing + a pack to send a small change. However, it is possible that extra + objects are added to the pack-file if the included commits contain + certain types of direct renames. 4: c99957d06f = 4: 571b2e2784 revision: implement sparse algorithm 5: d6912188be ! 
5: 33d2c04dd6 pack-objects: create pack.useSparse setting @@ -19,6 +19,26 @@ Signed-off-by: Derrick Stolee +diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt +--- a/Documentation/config/pack.txt b/Documentation/config/pack.txt +@@ + true. You should not generally need to turn this off unless + you are debugging pack bitmaps. + ++pack.useSparse:: ++ When true, git will default to using the '--sparse' option in ++ 'git pack-objects' when the '--revs' option
Git Test Coverage Report (Sunday, Dec 9)
Here is today's coverage report. Thanks, -Stolee [1] https://dev.azure.com/git/git/_build/results?buildId=287 --- pu: dfcf84ebfa17eb0bb3b57806fa530e87d8c8f1b8 jch: dd824ca506dbdf198282714b0bd21665c5825b4d next: bc1bbc6f855c3b5ef7fcbd0f688f647c4e5b208b master: 5d826e972970a784bd7a7bdf587512510097b8c7 master@{1}: 7068cbc4abac53d9c3675dfba81c1e97d25e8eeb Uncovered code in 'pu' not in 'jch' -- builtin/blame.c 74e8221b52 builtin/blame.c 928) blame_date_width = sizeof("Thu Oct 19 16:00"); 74e8221b52 builtin/blame.c 929) break; builtin/remote.c b7f4e371e7 builtin/remote.c 1551) die(_("--save-to-push cannot be used with other options")); b7f4e371e7 builtin/remote.c 1575) die(_("--save-to-push can only be used when only one url is defined")); commit-graph.c 721351787e 128) return NULL; 721351787e 131) return NULL; 721351787e 187) free(graph); 721351787e 188) return NULL; 721351787e 223) free(graph); 721351787e 224) return NULL; date.c 74e8221b52 113) die("Timestamp too large for this system: %"PRItime, time); 74e8221b52 216) if (tm->tm_mon == human_tm->tm_mon) { 74e8221b52 217) if (tm->tm_mday > human_tm->tm_mday) { 74e8221b52 219) } else if (tm->tm_mday == human_tm->tm_mday) { 74e8221b52 220) hide.date = hide.wday = 1; 74e8221b52 221) } else if (tm->tm_mday + 5 > human_tm->tm_mday) { 74e8221b52 223) hide.date = 1; 74e8221b52 231) gettimeofday(&now, NULL); 74e8221b52 232) show_date_relative(time, tz, &now, buf); 74e8221b52 233) return; 74e8221b52 246) hide.seconds = 1; 74e8221b52 247) hide.tz |= !hide.date; 74e8221b52 248) hide.wday = hide.time = !hide.year; 74e8221b52 262) strbuf_rtrim(buf); 74e8221b52 287) gettimeofday(&now, NULL); 74e8221b52 290) human_tz = local_time_tzoffset(now.tv_sec, &human_tm); 74e8221b52 886) static int auto_date_style(void) 74e8221b52 888) return (isatty(1) || pager_in_use()) ? 
DATE_HUMAN : DATE_NORMAL; 74e8221b52 909) return DATE_HUMAN; 74e8221b52 911) return auto_date_style(); pretty.c 4681fe38e1 1069) return 0; b755bf6f83 1107) match_placeholder_arg(p, "=on", end) || b755bf6f83 1108) match_placeholder_arg(p, "=true", end)) { protocol.c 24c10f7473 37) die(_("Unrecognized protocol version")); 24c10f7473 39) die(_("Unrecognized protocol_version")); remote-curl.c 24c10f7473 403) return 0; strbuf.c 18f8e81091 397) return 0; submodule.c 26f80ccfc1 1396) strbuf_release(&gitdir); be76c21282 1519) struct fetch_task *task = task_cb; be76c21282 1523) fetch_task_release(task); wrapper.c 5efde212fc 70) die("Out of memory, malloc failed (tried to allocate %" PRIuMAX " bytes)", 5efde212fc 73) error("Out of memory, malloc failed (tried to allocate %" PRIuMAX " bytes)", Commits introducing uncovered code: Anders Waldenborg 18f8e8109: strbuf: separate callback for strbuf_expand:ing literals Anders Waldenborg 4681fe38e: pretty: allow showing specific trailers Anders Waldenborg b755bf6f8: pretty: allow %(trailers) options with explicit value Denton Liu b7f4e371e: remote: add --save-to-push option to git remote set-url Josh Steadmon 24c10f747: protocol: advertise multiple supported versions Josh Steadmon 721351787: commit-graph, fuzz: add fuzzer for commit-graph Linus Torvalds 74e8221b5: Add 'human' date format Martin Koegler 5efde212f: zlib.c: use size_t for size Stefan Beller 26f80ccfc: submodule: migrate get_next_submodule to use repository structs Stefan Beller be76c2128: fetch: ensure submodule objects fetched Uncovered code in 'jch' not in 'next' apply.c 0f086e6dca 3355) if (checkout_entry(ce, &costate, NULL, NULL) || 0f086e6dca 3356) lstat(ce->name, st)) builtin/branch.c 0ecb1fc726 builtin/branch.c 456) die(_("could not resolve HEAD")); 0ecb1fc726 builtin/branch.c 462) die(_("HEAD (%s) points outside of refs/heads/"), refname); hex.c 47edb64997 93) char *sha1_to_hex_r(char *buffer, const unsigned char *sha1) 47edb64997 95) return 
hash_to_hex_algop_r(buffer, sha1, &hash_algos[GIT_HASH_SHA1]); 47edb64997 116) char *hash_to_hex(const unsigned char *hash) 47edb64997 118) return hash_to_hex_algop(hash, the_hash_algo); pathspec.c 22af33bece 671) name = to_free = xmemdupz(name, namelen); read-cache.c ee70c12820 1730) if (advice_unknown_index_extension) { ee70c12820 1731) warning(_("ignoring optional %.4s index extension"), ext); ee70c12820 1732) advise(_("This is likely due to the file having been written by a newer\n" ec36c42a63 3508) const char *index = NULL; ec36c42a63 3514) if (!offset) ec36c42a63 3515) return NULL; ec36c42a63 3516) while (offset <= mmap_size - the_hash_algo->rawsz - 8) { ec36c42a63 3517) extsize = get_be32(mmap + offset + 4); ec36c42a63 3518) if (CACHE_EXT((mmap + offset)) == CACHE_EXT_INDEXENTRYOFFSETTABLE) { ec36c42a63 3519) index = mmap + offset + 4 + 4; ec36c42a63 3520) break; ec36c42a63 3522) offset += 8; ec36c42a63 3523) offset += extsize; ec36c42a63
Re: [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference()
On 12/6/2018 6:36 PM, Jonathan Tan wrote:

> AFAICT oid_object_info doesn't take advantage of the commit graph, but
> just looks up the object header, which is still less than completely
> parsing it. Then lookup_commit is overly strict, as it may return NULL
> when there is a type mismatch (I don't think a mismatch could happen
> here, as both rely on just the object store, and not the commit
> graph.), so this would be just defensive programming for the sake of
> it. I dunno.
>
>     struct commit *c;
>
>     if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT &&
>         (c = lookup_commit(revs->repo, oid)) &&
>         !repo_parse_commit(revs->repo, c))
>             object = (struct object *) c;
>     else
>             object = parse_object(revs->repo, oid);
>
> I like this way better - I'll do it in the next version.

If we do _not_ have a commit-graph, or if the commit-graph does not have
that commit, this will have the same performance problem, right?

Should we instead create a direct dependence on the commit-graph, and try
to parse the oid from the graph directly? If it succeeds, then we learn
that the object is a commit, in addition to doing all of the parsing
work. This means we could avoid oid_object_info() loading data if we
succeed. We would fall back to parse_object() if it fails.

I was thinking this should be a simple API call to
parse_commit_in_graph(), but that requires a struct commit filled with an
oid, which is not the best idea if we don't actually know it is a commit
yet. The approach I recommend would then be more detailed:

1. Modify find_commit_in_graph() to take a struct object_id instead of a
   struct commit. This helps find the integer position in the graph. That
   position can be used in fill_commit_in_graph() to load the commit
   contents. Keep find_commit_in_graph() static, as it should not be a
   public function.

2. Create a public function with prototype

       struct commit *try_parse_commit_from_graph(struct repository *r,
                                                  struct object_id *oid)

   that returns a fully parsed commit struct if and only if the graph for
   that repository has that oid. It can call find_commit_in_graph(), then
   lookup_commit() and fill_commit_in_graph() to create the commit and
   parse the data.

3. In place of the snippet above, do:

       struct commit *c;

       if ((c = try_parse_commit_from_graph(revs->repo, oid)))
               object = (struct object *)c;
       else
               object = parse_object(revs->repo, oid);

A similar pattern _could_ be used in parse_object(), but I don't
recommend it unless we have a reasonable suspicion that we will parse
commits more often than other objects. (It adds an O(log(# commits))
binary search to each object lookup.)

A final thought: consider making this "try the commit graph first, but
fall back to parse_object()" logic a library function with a name like

    struct object *parse_probably_commit(struct repository *r,
                                         struct object_id *oid)

so other code paths that parse a lot of commits (but also maybe tags)
could use it.

Thanks!
-Stolee
Re: [PATCH v2 2/3] commit-graph: fix buffer read-overflow
On 12/6/2018 3:20 PM, Josh Steadmon wrote:

> +# usage: corrupt_and_zero_graph_then_verify <pos> <data> <zero_pos> <string>
> +# Manipulates the commit-graph file at <pos> by inserting <data>,
> +# then zeros the file starting at <zero_pos>. Finally, runs
> +# 'git commit-graph verify' and places the output in the file 'err'.
> +# Tests 'err' for the given string.
> +corrupt_and_zero_graph_then_verify() {

This method is very similar to 'corrupt_graph_and_verify()', the only
difference being the zero_pos, which zeroes the graph. Could it instead be
a modification of corrupt_graph_and_verify() where $4 is interpreted as
zero_pos, and if it is blank we don't do the truncation?

> +test_expect_success 'detect truncated graph' '
> +	corrupt_and_zero_graph_then_verify $GRAPH_BYTE_CHUNK_COUNT "\xff" \
> +		$GRAPH_CHUNK_LOOKUP_OFFSET "chunk lookup table entry missing"
> +'

Thanks for this! I think it's valuable to keep explicit tests around for
issues discovered by your fuzz tests. Specifically, I can repeat the test
when I get around to the next file format.

Thanks,
-Stolee
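The suggested merge of the two helpers could look something like the sketch below, shown here operating on a scratch file rather than a real commit-graph. The helper name and the dd-based truncation trick are assumptions modeled on git's test-suite conventions, not the actual patch:

```shell
# Sketch: one helper where an optional 4th argument, when non-empty,
# gives a zero_pos at which the file is truncated after the corruption.
corrupt_file() {
	file=$1 pos=$2 data=$3 zero_pos=$4
	printf "%b" "$data" |
		dd of="$file" bs=1 seek="$pos" conv=notrunc 2>/dev/null &&
	if test -n "$zero_pos"
	then
		# dd with count=0 and without conv=notrunc truncates
		# the output file at the seek offset
		dd of="$file" bs=1 count=0 seek="$zero_pos" 2>/dev/null
	fi
}

# demo on a 100-byte scratch file: corrupt byte 10, truncate at 50
dd if=/dev/zero of=scratch.bin bs=1 count=100 2>/dev/null
corrupt_file scratch.bin 10 '\377' 50
wc -c <scratch.bin
```

When $4 is blank, the `test -n` guard skips the truncation, so existing corrupt_graph_and_verify() callers would be unaffected.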
Re: [PATCH 2/2] commit-graph: fix buffer read-overflow
On 12/5/2018 5:32 PM, Josh Steadmon wrote:

> +	if (chunk_lookup + GRAPH_CHUNKLOOKUP_WIDTH > data + graph_size) {
> +		error(_("chunk lookup table entry missing; graph file may be incomplete"));
> +		free(graph);
> +		return NULL;
> +	}

Something I forgot earlier: there are several tests in
t5318-commit-graph.sh that use 'git commit-graph verify' to ensure we hit
these error conditions on a corrupted commit-graph file. Could you try
adding a test there that looks for this error message?

Thanks,
-Stolee
Re: git, monorepos, and access control
On 12/5/2018 3:34 PM, Ævar Arnfjörð Bjarmason wrote:

> On Wed, Dec 05 2018, Coiner, John wrote:
>
>> I'm an engineer with AMD. I'm looking at whether we could switch our
>> internal version control to a monorepo, possibly one based on git and
>> VFSForGit. Has anyone looked at adding access control to git, at a
>> per-directory granularity? Is this a feature that the git community
>> would possibly welcome?
>
> All of what you've described is possible to implement in git, but as
> far as I know there's no existing implementation of it. Microsoft's
> GVFS probably comes closest, and they're actively upstreaming bits of
> that, but as far as I know that doesn't in any way try to solve this
> "users XYZ can't even list such-and-such a tree" problem.

(Ævar has a lot of good ideas in his message, so I'm just going to add on
a few here.)

This directory-level security is not a goal for VFS for Git, and I don't
see it becoming a priority, as it breaks a number of design decisions we
made in our object storage and communication models.

The best I can think of when considering Git as an approach would be to
use submodules for your security-related content, and then have
server-side security for access to those repos. Of course, submodules are
not supported in VFS for Git, either.

The Gerrit service has _branch_-level security, which is related to the
reachability questions that directory security would need. However, the
problem is quite different. Gerrit does have a lot of experience dealing
with submodules, though, so that's probably a good place to start.

Thanks,
-Stolee
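For concreteness, the submodule route would use an ordinary .gitmodules entry, with access control enforced entirely by the server hosting the submodule's repository. The paths and URL here are purely hypothetical:

```ini
; hypothetical .gitmodules entry isolating restricted content;
; users without read access to the URL simply cannot fetch it
[submodule "restricted"]
	path = secure/restricted
	url = https://git.example.com/secure/restricted.git
```

Users without permission would still see that the path secure/restricted exists (the gitlink entry is in the superproject tree), but could not fetch or list its contents.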
Re: How de-duplicate similar repositories with alternates
On 12/4/2018 8:35 AM, Ævar Arnfjörð Bjarmason wrote:

> The leftover space spent is the commit-graph (not de-duped like objects
> are), and...

The commit-graph could be shared, as the commits in each enlistment can
be parsed from the local object store with GENERATION_NUMBER_INFINITY,
giving us similar speedups. The issue is: who updates the file? As the
commit-graph gets behind, performance will degrade. But it seems like
you'd need similar maintenance on the alternate object store, anyway.

Thanks,
-Stolee
Re: How de-duplicate similar repositories with alternates
On 12/4/2018 2:06 AM, Jeff King wrote:

> On Thu, Nov 29, 2018 at 10:55:49AM -0800, Stefan Beller wrote:
>
>> I view alternates as a historic artefact, as the deduping of objects
>> client-side can be done using worktrees, and on the server side - I
>> think - most of the git hosters use namespaces, put a fork network
>> into the same repository, and use pack islands.
>
> By contrast, object storage is pretty easy to share. It scales
> reasonably well, and the security model is much simpler due to the
> immutable nature of object names.

And for the client side, we use alternates as an important way to scale
VFS for Git to multiple enlistments on the same machine. VFS for Git
manages a "shared object cache" (the alternate) that is updated in the
background (including the multi-pack-index and commit-graph). Using
worktrees for the same effect would add complications to the user
interactions: not only when creating an enlistment, but also the fact
that two enlistments cannot check out the same ref would confuse users.

Thanks,
-Stolee
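The alternates mechanism itself is just a path listed in objects/info/alternates. A minimal local illustration (repository names here are made up; VFS for Git manages this wiring automatically):

```shell
# set up a "shared object cache" and an enlistment that borrows from it
git init --quiet shared
git -C shared -c user.name=demo -c user.email=demo@example.com \
	commit --quiet --allow-empty -m seed

git init --quiet enlistment
# one absolute path per line; git will search this object store too
echo "$(pwd)/shared/.git/objects" >enlistment/.git/objects/info/alternates

# the enlistment can now resolve objects it never stored itself
git -C enlistment cat-file -t "$(git -C shared rev-parse HEAD)"
```

Each enlistment keeps its own refs and worktree; only object storage is shared, which is why the "two enlistments cannot check out the same ref" problem never arises with alternates.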
Git Test Coverage Report (Friday Nov 30)
Here is today's test coverage report.

Thanks,
-Stolee

[1] https://dev.azure.com/git/git/_build/results?buildId=277

---

pu: 5a1a9a96d55fbb80426189a921d7b6cc66564c78
jch: 71c29cabb7379fe9abaacbbbd1350268d0c18b4f
next: a9faaff8c120bf4783cb892c157871fe524b3608
master: 7068cbc4abac53d9c3675dfba81c1e97d25e8eeb
master@{1}: 7f4e64169352e03476b0ea64e7e2973669e491a2

Uncovered code in 'pu' not in 'jch'
--

builtin/blame.c
74e8221b52 builtin/blame.c 928) blame_date_width = sizeof("Thu Oct 19 16:00");
74e8221b52 builtin/blame.c 929) break;

builtin/remote.c
b7f4e371e7 builtin/remote.c 1551) die(_("--save-to-push cannot be used with other options"));
b7f4e371e7 builtin/remote.c 1575) die(_("--save-to-push can only be used when only one url is defined"));

date.c
74e8221b52 113) die("Timestamp too large for this system: %"PRItime, time);
74e8221b52 216) if (tm->tm_mon == human_tm->tm_mon) {
74e8221b52 217) if (tm->tm_mday > human_tm->tm_mday) {
74e8221b52 219) } else if (tm->tm_mday == human_tm->tm_mday) {
74e8221b52 220) hide.date = hide.wday = 1;
74e8221b52 221) } else if (tm->tm_mday + 5 > human_tm->tm_mday) {
74e8221b52 223) hide.date = 1;
74e8221b52 231) gettimeofday(&now, NULL);
74e8221b52 232) show_date_relative(time, tz, &now, buf);
74e8221b52 233) return;
74e8221b52 246) hide.seconds = 1;
74e8221b52 247) hide.tz |= !hide.date;
74e8221b52 248) hide.wday = hide.time = !hide.year;
74e8221b52 262) strbuf_rtrim(buf);
74e8221b52 287) gettimeofday(&now, NULL);
74e8221b52 290) human_tz = local_time_tzoffset(now.tv_sec, &human_tm);
74e8221b52 886) static int auto_date_style(void)
74e8221b52 888) return (isatty(1) || pager_in_use()) ? DATE_HUMAN : DATE_NORMAL;
74e8221b52 909) return DATE_HUMAN;
74e8221b52 911) return auto_date_style();

protocol.c
24c10f7473 37) die(_("Unrecognized protocol version"));
24c10f7473 39) die(_("Unrecognized protocol_version"));

remote-curl.c
24c10f7473 403) return 0;

sha1-array.c
bba406749a 91) oidcpy(&oids[dst], &oids[src]);

strbuf.c
10a40f5700 397) return 0;

submodule.c
e2419f7e30 1376) strbuf_release(&gitdir);
7454fe5cb6 1499) struct get_next_submodule_task *task = task_cb;
7454fe5cb6 1503) get_next_submodule_task_release(task);
7454fe5cb6 1530) return 0;
7454fe5cb6 1534) goto out;
7454fe5cb6 1549) return 0;

wrapper.c
5efde212fc 70) die("Out of memory, malloc failed (tried to allocate %" PRIuMAX " bytes)",
5efde212fc 73) error("Out of memory, malloc failed (tried to allocate %" PRIuMAX " bytes)",

Commits introducing uncovered code:

Anders Waldenborg 10a40f570: strbuf: separate callback for strbuf_expand:ing literals
Denton Liu b7f4e371e: remote: add --save-to-push option to git remote set-url
Josh Steadmon 24c10f747: protocol: advertise multiple supported versions
Linus Torvalds 74e8221b5: Add 'human' date format
Martin Koegler 5efde212f: zlib.c: use size_t for size
Stefan Beller 7454fe5cb: fetch: try fetching submodules if needed objects were not fetched
Stefan Beller bba406749: sha1-array: provide oid_array_filter
Stefan Beller e2419f7e3: submodule: migrate get_next_submodule to use repository structs

Uncovered code in 'jch' not in 'next'
--

apply.c
0f086e6dca 3355) if (checkout_entry(ce, &costate, NULL, NULL) ||
0f086e6dca 3356) lstat(ce->name, st))

builtin/branch.c
0ecb1fc726 builtin/branch.c 456) die(_("could not resolve HEAD"));
0ecb1fc726 builtin/branch.c 462) die(_("HEAD (%s) points outside of refs/heads/"), refname);

hex.c
47edb64997 93) char *sha1_to_hex_r(char *buffer, const unsigned char *sha1)
47edb64997 95) return hash_to_hex_algop_r(buffer, sha1, &hash_algos[GIT_HASH_SHA1]);
47edb64997 116) char *hash_to_hex(const unsigned char *hash)
47edb64997 118) return hash_to_hex_algop(hash, the_hash_algo);

pathspec.c
22af33bece 671) name = to_free = xmemdupz(name, namelen);

read-cache.c
ee70c12820 1730) if (advice_unknown_index_extension) {
ee70c12820 1731) warning(_("ignoring optional %.4s index extension"), ext);
ee70c12820 1732) advise(_("This is likely due to the file having been written by a newer\n"

sequencer.c
18e711a162 2387) opts->quiet = 1;

sha1-file.c
2f90b9d9b4 sha1-file.c 172) int hash_algo_by_name(const char *name)
2f90b9d9b4 sha1-file.c 175) if (!name)
2f90b9d9b4 sha1-file.c 176) return GIT_HASH_UNKNOWN;
2f90b9d9b4 sha1-file.c 177) for (i = 1; i < GIT_HASH_NALGOS; i++)
2f90b9d9b4 sha1-file.c 178) if (!strcmp(name, hash_algos[i].name))
2f90b9d9b4 sha1-file.c 179) return i;
2f90b9d9b4 sha1-file.c 180) return GIT_HASH_UNKNOWN;
2f90b9d9b4 sha1-file.c 183) int hash_algo_by_id(uint32_t format_id)
2f90b9d9b4 sha1-file.c 186) for (i = 1; i < GIT_HASH_NALGOS; i++)
2f90b9d9b4 sha1-file.c 187) if (format_id == hash_algos[i].format_id)
2f90b9d9b4 sha1-file.c 188) return i;
2f90b9d9b4 sha1-file.c 189) return GIT_HASH_UNKNOWN;

tree.c
e092073d64 104) commit = lookup_commit(r, e
Re: [PATCH 3/5] pack-objects: add --sparse option
On 11/29/2018 9:39 PM, Junio C Hamano wrote:

> Derrick Stolee writes:
>
>> While _eventually_ we should make this opt-out, we shouldn't do that
>> until it has cooked a while.
>
> I actually do not agree. If the knob gives enough benefit, the users
> will learn about it viva voce, and in a few more releases we'll hear
> "enough users complain that they have to turn it on, let's make it the
> default". If that does not happen, the knob does not deserve to be
> turned on in the first place.

That's a good philosophy. I'll keep it in mind.

> Having said that, I won't be commenting on this shiny new toy before
> the final. I want to see more people help tying the loose ends and
> give it final varnish to the upcoming release, as it seems to have
> become rockier and larger release than we originally anticipated.

Yeah, no rush on this one. I just wanted to get some initial feedback
about the idea.

Thanks,
-Stolee
[PATCH v2 5/6] pack-objects: create pack.useSparse setting
From: Derrick Stolee

The '--sparse' flag in 'git pack-objects' changes the algorithm used to
enumerate objects to one that is faster for individual users pushing new
objects that change only a small cone of the working directory. The
sparse algorithm is not recommended for a server, which likely sends new
objects that appear across the entire working directory.

Create a 'pack.useSparse' setting that enables this new algorithm. This
allows 'git push' to use this algorithm without passing a '--sparse' flag
all the way through four levels of run_command() calls.

If the '--no-sparse' flag is set, then this config setting is overridden.

Signed-off-by: Derrick Stolee
---
 builtin/pack-objects.c         |  4 ++++
 t/t5322-pack-objects-sparse.sh | 15 +++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 7d5b0735e3..124b1bafc4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 		use_bitmap_index_default = git_config_bool(k, v);
 		return 0;
 	}
+	if (!strcmp(k, "pack.usesparse")) {
+		sparse = git_config_bool(k, v);
+		return 0;
+	}
 	if (!strcmp(k, "pack.threads")) {
 		delta_search_threads = git_config_int(k, v);
 		if (delta_search_threads < 0)
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 45dba6e014..8f5699bd91 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -121,4 +121,19 @@ test_expect_success 'sparse pack-objects' '
 	test_cmp expect_sparse_objects.txt sparse_objects.txt
 '

+test_expect_success 'pack.useSparse enables algorithm' '
+	git config pack.useSparse true &&
+	git pack-objects --stdout --revs sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index sparse_objects.txt &&
+	test_cmp expect_sparse_objects.txt sparse_objects.txt
+'
+
+test_expect_success 'pack.useSparse overridden' '
+	git pack-objects --stdout --revs --no-sparse sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index sparse_objects.txt &&
+	test_cmp expect_objects.txt sparse_objects.txt
+'
+
 test_done
--
gitgitgadget
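For a user, the effect of the patch above is that the sparse walk becomes a one-time config change instead of a per-command flag. A small illustration (the repository name is made up; --no-sparse on any individual invocation still wins over the config):

```shell
# enable the sparse enumeration algorithm for all future pushes
git init --quiet demo
git -C demo config pack.useSparse true

# read the setting back, normalized as a boolean
git -C demo config --type=bool pack.useSparse
```

Because git_pack_config() reads the value with git_config_bool(), any of true/yes/on/1 enables the algorithm, matching git's usual boolean-config semantics.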