On Wed, Jul 19, 2017 at 5:21 PM, Jonathan Tan <[email protected]> wrote:
> Currently, Git does not support repos with very large numbers of objects
> or repos that wish to minimize manipulation of certain blobs (for
> example, because they are very large) very well, even if the user
> operates mostly on part of the repo, because Git is designed on the
> assumption that every referenced object is available somewhere in the
> repo storage.
>
> As a first step to reducing this problem, introduce the concept of
> promised objects. Each Git repo can contain a list of promised objects
> and their sizes (if blobs) at $GIT_DIR/objects/promised. This patch
> contains functions to query them; functions for creating and modifying
> that file will be introduced in later patches.
>
> A repository that is missing an object but has that object promised is not
> considered to be in error, so also teach fsck this. As part of doing
> this, object.{h,c} has been modified to generate "struct object" based
> on only the information available to promised objects, without requiring
> the object itself.
>
> Signed-off-by: Jonathan Tan <[email protected]>
> ---
> Documentation/technical/repository-version.txt | 6 ++
> Makefile | 1 +
> builtin/fsck.c | 18 +++-
> cache.h | 2 +
> environment.c | 1 +
> fsck.c | 6 +-
> object.c | 19 ++++
> object.h | 19 ++++
> promised-object.c | 130
> +++++++++++++++++++++++++
> promised-object.h | 22 +++++
> setup.c | 7 +-
> t/t3907-promised-object.sh | 41 ++++++++
> t/test-lib-functions.sh | 6 ++
> 13 files changed, 273 insertions(+), 5 deletions(-)
> create mode 100644 promised-object.c
> create mode 100644 promised-object.h
> create mode 100755 t/t3907-promised-object.sh
>
> diff --git a/Documentation/technical/repository-version.txt
> b/Documentation/technical/repository-version.txt
> index 00ad37986..f8b82c1c7 100644
> --- a/Documentation/technical/repository-version.txt
> +++ b/Documentation/technical/repository-version.txt
> @@ -86,3 +86,9 @@ for testing format-1 compatibility.
> When the config key `extensions.preciousObjects` is set to `true`,
> objects in the repository MUST NOT be deleted (e.g., by `git-prune` or
> `git repack -d`).
> +
> +`promisedObjects`
> +~~~~~~~~~~~~~~~~~
> +
> +(Explain this - basically a string containing a command to be run
> +whenever a missing object needs to be fetched.)
> diff --git a/Makefile b/Makefile
> index 9c9c42f8f..c1446d5ef 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -828,6 +828,7 @@ LIB_OBJS += preload-index.o
> LIB_OBJS += pretty.o
> LIB_OBJS += prio-queue.o
> LIB_OBJS += progress.o
> +LIB_OBJS += promised-object.o
> LIB_OBJS += prompt.o
> LIB_OBJS += quote.o
> LIB_OBJS += reachable.o
> diff --git a/builtin/fsck.c b/builtin/fsck.c
> index 462b8643b..49e21f361 100644
> --- a/builtin/fsck.c
> +++ b/builtin/fsck.c
> @@ -15,6 +15,7 @@
> #include "progress.h"
> #include "streaming.h"
> #include "decorate.h"
> +#include "promised-object.h"
>
> #define REACHABLE 0x0001
> #define SEEN 0x0002
> @@ -44,6 +45,7 @@ static int name_objects;
> #define ERROR_REACHABLE 02
> #define ERROR_PACK 04
> #define ERROR_REFS 010
> +#define ERROR_PROMISED_OBJECT 011
>
> static const char *describe_object(struct object *obj)
> {
> @@ -436,7 +438,7 @@ static int fsck_handle_ref(const char *refname, const
> struct object_id *oid,
> {
> struct object *obj;
>
> - obj = parse_object(oid);
> + obj = parse_or_promise_object(oid);
> if (!obj) {
> error("%s: invalid sha1 pointer %s", refname,
> oid_to_hex(oid));
> errors_found |= ERROR_REACHABLE;
> @@ -592,7 +594,7 @@ static int fsck_cache_tree(struct cache_tree *it)
> fprintf(stderr, "Checking cache tree\n");
>
> if (0 <= it->entry_count) {
> - struct object *obj = parse_object(&it->oid);
> + struct object *obj = parse_or_promise_object(&it->oid);
> if (!obj) {
> error("%s: invalid sha1 pointer in cache-tree",
> oid_to_hex(&it->oid));
> @@ -635,6 +637,12 @@ static int mark_packed_for_connectivity(const struct
> object_id *oid,
> return 0;
> }
>
> +static int mark_have_promised_object(const struct object_id *oid, void *data)
> +{
> + mark_object_for_connectivity(oid);
> + return 0;
> +}
> +
> static char const * const fsck_usage[] = {
> N_("git fsck [<options>] [<object>...]"),
> NULL
> @@ -690,6 +698,11 @@ int cmd_fsck(int argc, const char **argv, const char
> *prefix)
>
> git_config(fsck_config, NULL);
>
> + if (fsck_promised_objects()) {
> + error("Errors found in promised object list");
> + errors_found |= ERROR_PROMISED_OBJECT;
> + }
This got me thinking: It is an error if we do not have an object
and also do not promise it, but what about the other case:
having and object and promising it, too?
That is not harmful to the operation, except that the promise
machinery may be slower due to its size. (Should that be a soft
warning then? Do we have warnings in fsck?)
> * The object type is stored in 3 bits.
> */
We may want to remove this comment while we're here as it
sounds stale despite being technically correct.
1974632c66 (Remove TYPE_* constant macros and use
object_type enums consistently., 2006-07-11)
> struct object {
> + /*
> + * Set if this object is parsed. If set, "type" is populated and this
> + * object can be casted to "struct commit" or an equivalent.
> + */
> unsigned parsed : 1;
> + /*
> + * Set if this object is not in the repo but is promised. If set,
> + * "type" is populated, but this object cannot be casted to "struct
> + * commit" or an equivalent.
> + */
> + unsigned promised : 1;
Would it make sense to have a bit field instead:
#define STATE_BITS 2
#define STATE_PARSED (1<<0)
#define STATE_PROMISED (1<<1)
unsigned state : STATE_BITS
This would be similar to the types and flags?
> +test_expect_success 'fsck fails on missing objects' '
> + test_create_repo repo &&
> +
> + test_commit -C repo 1 &&
> + test_commit -C repo 2 &&
> + test_commit -C repo 3 &&
> + git -C repo tag -a annotated_tag -m "annotated tag" &&
> + C=$(git -C repo rev-parse 1) &&
> + T=$(git -C repo rev-parse 2^{tree}) &&
> + B=$(git hash-object repo/3.t) &&
> + AT=$(git -C repo rev-parse annotated_tag) &&
> +
> + # missing commit, tree, blob, and tag
> + rm repo/.git/objects/$(echo $C | cut -c1-2)/$(echo $C | cut -c3-40) &&
> + rm repo/.git/objects/$(echo $T | cut -c1-2)/$(echo $T | cut -c3-40) &&
> + rm repo/.git/objects/$(echo $B | cut -c1-2)/$(echo $B | cut -c3-40) &&
> + rm repo/.git/objects/$(echo $AT | cut -c1-2)/$(echo $AT | cut -c3-40)
> &&
This is a pretty cool test as it promises all sorts of objects
from different parts of the graph.