On Wed, Jul 19, 2017 at 5:21 PM, Jonathan Tan <jonathanta...@google.com> wrote:
> Currently, Git does not support repos with very large numbers of objects
> or repos that wish to minimize manipulation of certain blobs (for
> example, because they are very large) very well, even if the user
> operates mostly on part of the repo, because Git is designed on the
> assumption that every referenced object is available somewhere in the
> repo storage.
>
> As a first step to reducing this problem, introduce the concept of
> promised objects. Each Git repo can contain a list of promised objects
> and their sizes (if blobs) at $GIT_DIR/objects/promised. This patch
> contains functions to query them; functions for creating and modifying
> that file will be introduced in later patches.
>
> A repository that is missing an object but has that object promised is not
> considered to be in error, so also teach fsck this. As part of doing
> this, object.{h,c} has been modified to generate "struct object" based
> on only the information available to promised objects, without requiring
> the object itself.
>
> Signed-off-by: Jonathan Tan <jonathanta...@google.com>
> ---
>  Documentation/technical/repository-version.txt |   6 ++
>  Makefile                                       |   1 +
>  builtin/fsck.c                                 |  18 +++-
>  cache.h                                        |   2 +
>  environment.c                                  |   1 +
>  fsck.c                                         |   6 +-
>  object.c                                       |  19 ++++
>  object.h                                       |  19 ++++
>  promised-object.c                              | 130 +++++++++++++++++++++++++
>  promised-object.h                              |  22 +++++
>  setup.c                                        |   7 +-
>  t/t3907-promised-object.sh                     |  41 ++++++++
>  t/test-lib-functions.sh                        |   6 ++
>  13 files changed, 273 insertions(+), 5 deletions(-)
>  create mode 100644 promised-object.c
>  create mode 100644 promised-object.h
>  create mode 100755 t/t3907-promised-object.sh
>
> diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt
> index 00ad37986..f8b82c1c7 100644
> --- a/Documentation/technical/repository-version.txt
> +++ b/Documentation/technical/repository-version.txt
> @@ -86,3 +86,9 @@ for testing format-1 compatibility.
>  When the config key `extensions.preciousObjects` is set to `true`,
>  objects in the repository MUST NOT be deleted (e.g., by `git-prune` or
>  `git repack -d`).
> +
> +`promisedObjects`
> +~~~~~~~~~~~~~~~~~
> +
> +(Explain this - basically a string containing a command to be run
> +whenever a missing object needs to be fetched.)
> diff --git a/Makefile b/Makefile
> index 9c9c42f8f..c1446d5ef 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -828,6 +828,7 @@ LIB_OBJS += preload-index.o
>  LIB_OBJS += pretty.o
>  LIB_OBJS += prio-queue.o
>  LIB_OBJS += progress.o
> +LIB_OBJS += promised-object.o
>  LIB_OBJS += prompt.o
>  LIB_OBJS += quote.o
>  LIB_OBJS += reachable.o
> diff --git a/builtin/fsck.c b/builtin/fsck.c
> index 462b8643b..49e21f361 100644
> --- a/builtin/fsck.c
> +++ b/builtin/fsck.c
> @@ -15,6 +15,7 @@
>  #include "progress.h"
>  #include "streaming.h"
>  #include "decorate.h"
> +#include "promised-object.h"
>
>  #define REACHABLE 0x0001
>  #define SEEN      0x0002
> @@ -44,6 +45,7 @@ static int name_objects;
>  #define ERROR_REACHABLE 02
>  #define ERROR_PACK 04
>  #define ERROR_REFS 010
> +#define ERROR_PROMISED_OBJECT 011
>
>  static const char *describe_object(struct object *obj)
>  {
> @@ -436,7 +438,7 @@ static int fsck_handle_ref(const char *refname, const struct object_id *oid,
>  {
>         struct object *obj;
>
> -       obj = parse_object(oid);
> +       obj = parse_or_promise_object(oid);
>         if (!obj) {
>                 error("%s: invalid sha1 pointer %s", refname, oid_to_hex(oid));
>                 errors_found |= ERROR_REACHABLE;
> @@ -592,7 +594,7 @@ static int fsck_cache_tree(struct cache_tree *it)
>                 fprintf(stderr, "Checking cache tree\n");
>
>         if (0 <= it->entry_count) {
> -               struct object *obj = parse_object(&it->oid);
> +               struct object *obj = parse_or_promise_object(&it->oid);
>                 if (!obj) {
>                         error("%s: invalid sha1 pointer in cache-tree",
>                               oid_to_hex(&it->oid));
> @@ -635,6 +637,12 @@ static int mark_packed_for_connectivity(const struct object_id *oid,
>         return 0;
>  }
>
> +static int mark_have_promised_object(const struct object_id *oid, void *data)
> +{
> +       mark_object_for_connectivity(oid);
> +       return 0;
> +}
> +
>  static char const * const fsck_usage[] = {
>         N_("git fsck [<options>] [<object>...]"),
>         NULL
> @@ -690,6 +698,11 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
>
>         git_config(fsck_config, NULL);
>
> +       if (fsck_promised_objects()) {
> +               error("Errors found in promised object list");
> +               errors_found |= ERROR_PROMISED_OBJECT;
> +       }

This got me thinking: it is an error if we neither have an object
nor promise it, but what about the opposite case: having an object
and promising it, too?
That is not harmful to the operation, except that the promise
machinery may be slower because the promised list is larger than it
needs to be. (Should that be a soft warning then? Do we have
warnings in fsck?)

>   * The object type is stored in 3 bits.
>   */

We may want to remove this comment while we're here, as it
sounds stale despite being technically correct. See
1974632c66 (Remove TYPE_* constant macros and use
object_type enums consistently., 2006-07-11).

>  struct object {
> +       /*
> +        * Set if this object is parsed. If set, "type" is populated and this
> +        * object can be casted to "struct commit" or an equivalent.
> +        */
>         unsigned parsed : 1;
> +       /*
> +        * Set if this object is not in the repo but is promised. If set,
> +        * "type" is populated, but this object cannot be casted to "struct
> +        * commit" or an equivalent.
> +        */
> +       unsigned promised : 1;

Would it make sense to have a single multi-bit state field instead?

#define STATE_BITS 2
#define STATE_PARSED (1<<0)
#define STATE_PROMISED (1<<1)

    unsigned state : STATE_BITS;

This would be similar to how the type and flags are handled.


> +test_expect_success 'fsck fails on missing objects' '
> +       test_create_repo repo &&
> +
> +       test_commit -C repo 1 &&
> +       test_commit -C repo 2 &&
> +       test_commit -C repo 3 &&
> +       git -C repo tag -a annotated_tag -m "annotated tag" &&
> +       C=$(git -C repo rev-parse 1) &&
> +       T=$(git -C repo rev-parse 2^{tree}) &&
> +       B=$(git hash-object repo/3.t) &&
> +       AT=$(git -C repo rev-parse annotated_tag) &&
> +
> +       # missing commit, tree, blob, and tag
> +       rm repo/.git/objects/$(echo $C | cut -c1-2)/$(echo $C | cut -c3-40) &&
> +       rm repo/.git/objects/$(echo $T | cut -c1-2)/$(echo $T | cut -c3-40) &&
> +       rm repo/.git/objects/$(echo $B | cut -c1-2)/$(echo $B | cut -c3-40) &&
>        rm repo/.git/objects/$(echo $AT | cut -c1-2)/$(echo $AT | cut -c3-40) &&

This is a pretty cool test as it promises all sorts of objects
from different parts of the graph.
