Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects

2017-07-29 Thread Philip Oakley

From: "Ben Peart" 
Sent: Tuesday, July 25, 2017 4:10 PM

On 7/21/2017 4:33 PM, Jonathan Tan wrote:

On Fri, 21 Jul 2017 12:24:52 -0400
Ben Peart  wrote:


Today we have 3.5 million objects * 30 bytes per entry = 105 MB of
promises. Given the average developer only hydrates 56K files (2 MB
promises) that is 103 MB to download that no one will ever need. We
would like to avoid that if possible as this would be a significant
regression in clone times from where we are today.



A question in the broader context of Narrow clones that I'd be interested 
in.


How narrow are the tree levels that contain the hydrated files? The question 
splits a couple of ways:


A. If one goes to the highest tree that contains all the 56K files, how many 
files and sub trees would be in that complete tree (i.e. what fraction of 
that 'top' tree is inflated).
A2. Is there some deep/wide metric that indicates how tightly together the 
inflated files tend to cluster?


B. If instead we look at just the trees in the paths of those inflated 
files, those trees will also reference other trees/blobs that are not 
inflated, how big is that list (it would indicate the size of a narrow 
repo's object store that holds the oid stubs)


I would quess / expect that the typical inflation only has a few clustered 
areas of interest, but it maybe that in such a big reality (*) the inflated 
files are actually spread very widely. (*) as per various blog posts saying 
there was no realistic way of partitioning the BigWin repo!


I'd be interested in any such sparsity metric (apologies if I've missed 
previous reports).

--
Philip


I'm also concerned about the performance of merging in promises given we
have 100M objects today and growing so the number of promises over time
could get pretty large.


After some thought, maybe a hybrid solution is best, in which it is
permissible but optional for some missing objects to have promises. In
that case, it is more of a "size cache" (which stores the type as well)
rather than a true promise. When fetching, the client can optionally
request for the sizes and types of missing objects.



In our GVFS solution today we do not download any size or object type 
information at clone as the number of objects and the resulting file would 
be too large.  Instead, we have a new sizes endpoint 
(https://github.com/Microsoft/GVFS/blob/master/Protocol.md) that enables 
us to retrieve object sizes "on demand" much like we are enabling for the 
actual object content.


This protocol could easily be extended to return both size and type so 
that it could be used to retrieve "promise" data for objects as they are 
needed. Having a way to "cache" that data locally so that both git and 
other code could share it would be great.


At a minimum, we should ensure the data stream passed back is the same 
whether at clone time or when hitting a "promises" end point. I think it 
would also be helpful to enable promises to be downloaded on demand much 
like we are doing for the object data itself.



This is good for the large-blob case, in which we can always have size
information of missing blobs, and we can subsequently add blob-size
filtering (as a parameter) to "git log -S" and friends to avoid needing
to resolve a missing object. And this is, as far as I can tell, also
good for the many-blob case - just have an empty size cache all the
time. (And in the future, use cases could come up that desire non-empty
but non-comprehensive caches - for example, a directory lister working
on a partial clone that only needs to cache the sizes of frequently
accessed directories.)

Another option is to have a repo-wide option that toggles between
mandatory entries in the "size cache" and prohibited entries. Switching
to mandatory provides stricter fsck and negative lookups, but I think
it's not worth it for both the developers and users of Git to have to
know about these two modes.


I think we should have a flag (off by default) that enables someone to
say that promised objects are optional. If the flag is set,
"is_promised_object" will return success and pass the OBJ_ANY type and 
a

size of -1.

Nothing today is using the size and in the two places where the object
type is being checked for consistency (fsck_cache_tree and
fsck_handle_ref) the test can add a test for OBJ_ANY as well.

This will enable very large numbers of objects to be omitted from the
clone without triggering a download of the corresponding number of
promised objects.


Eventually I plan to use the size when implementing parameters for
history-searching commands (e.g. "git log -S"), but it's true that
that's in the future.

Allowing promised objects to be optional would indeed solve the issue 
of

downloading too many promises. It would make the code more complicated,
but I'm not sure by how much.

For example, in this fsck patch, the easiest way I could think of to
have promised objects was to introduce a 3rd state, called 

Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects

2017-07-25 Thread Ben Peart



On 7/21/2017 4:33 PM, Jonathan Tan wrote:

On Fri, 21 Jul 2017 12:24:52 -0400
Ben Peart  wrote:


Today we have 3.5 million objects * 30 bytes per entry = 105 MB of
promises. Given the average developer only hydrates 56K files (2 MB
promises) that is 103 MB to download that no one will ever need. We
would like to avoid that if possible as this would be a significant
regression in clone times from where we are today.

I'm also concerned about the performance of merging in promises given we
have 100M objects today and growing so the number of promises over time
could get pretty large.


After some thought, maybe a hybrid solution is best, in which it is
permissible but optional for some missing objects to have promises. In
that case, it is more of a "size cache" (which stores the type as well)
rather than a true promise. When fetching, the client can optionally
request for the sizes and types of missing objects.



In our GVFS solution today we do not download any size or object type 
information at clone as the number of objects and the resulting file 
would be too large.  Instead, we have a new sizes endpoint 
(https://github.com/Microsoft/GVFS/blob/master/Protocol.md) that enables 
us to retrieve object sizes "on demand" much like we are enabling for 
the actual object content.


This protocol could easily be extended to return both size and type so 
that it could be used to retrieve "promise" data for objects as they are 
needed. Having a way to "cache" that data locally so that both git and 
other code could share it would be great.


At a minimum, we should ensure the data stream passed back is the same 
whether at clone time or when hitting a "promises" end point. I think it 
would also be helpful to enable promises to be downloaded on demand much 
like we are doing for the object data itself.



This is good for the large-blob case, in which we can always have size
information of missing blobs, and we can subsequently add blob-size
filtering (as a parameter) to "git log -S" and friends to avoid needing
to resolve a missing object. And this is, as far as I can tell, also
good for the many-blob case - just have an empty size cache all the
time. (And in the future, use cases could come up that desire non-empty
but non-comprehensive caches - for example, a directory lister working
on a partial clone that only needs to cache the sizes of frequently
accessed directories.)

Another option is to have a repo-wide option that toggles between
mandatory entries in the "size cache" and prohibited entries. Switching
to mandatory provides stricter fsck and negative lookups, but I think
it's not worth it for both the developers and users of Git to have to
know about these two modes.


I think we should have a flag (off by default) that enables someone to
say that promised objects are optional. If the flag is set,
"is_promised_object" will return success and pass the OBJ_ANY type and a
size of -1.

Nothing today is using the size and in the two places where the object
type is being checked for consistency (fsck_cache_tree and
fsck_handle_ref) the test can add a test for OBJ_ANY as well.

This will enable very large numbers of objects to be omitted from the
clone without triggering a download of the corresponding number of
promised objects.


Eventually I plan to use the size when implementing parameters for
history-searching commands (e.g. "git log -S"), but it's true that
that's in the future.

Allowing promised objects to be optional would indeed solve the issue of
downloading too many promises. It would make the code more complicated,
but I'm not sure by how much.

For example, in this fsck patch, the easiest way I could think of to
have promised objects was to introduce a 3rd state, called "promised",
of "struct object" - one in which the type is known, but we don't have
access to the full "struct commit" or equivalent. And thus fsck could
assume that if the "struct object" is "parsed" or "promised", the type
is known. Having optional promised objects would require that we let
this "promised" state have a type of OBJ_UNKNOWN (or something like
that) - maybe that would be fine, but I haven't looked into this in
detail.



Caveats apply as I only did a quick look but I only found the two
locations that were checking the object type for consistency.


I haven't looked into detail, but you are probably right.



Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects

2017-07-21 Thread Jonathan Tan
On Fri, 21 Jul 2017 12:24:52 -0400
Ben Peart  wrote:

> Today we have 3.5 million objects * 30 bytes per entry = 105 MB of 
> promises. Given the average developer only hydrates 56K files (2 MB 
> promises) that is 103 MB to download that no one will ever need. We 
> would like to avoid that if possible as this would be a significant 
> regression in clone times from where we are today.
> 
> I'm also concerned about the performance of merging in promises given we 
> have 100M objects today and growing so the number of promises over time 
> could get pretty large.

After some thought, maybe a hybrid solution is best, in which it is
permissible but optional for some missing objects to have promises. In
that case, it is more of a "size cache" (which stores the type as well)
rather than a true promise. When fetching, the client can optionally
request for the sizes and types of missing objects.

This is good for the large-blob case, in which we can always have size
information of missing blobs, and we can subsequently add blob-size
filtering (as a parameter) to "git log -S" and friends to avoid needing
to resolve a missing object. And this is, as far as I can tell, also
good for the many-blob case - just have an empty size cache all the
time. (And in the future, use cases could come up that desire non-empty
but non-comprehensive caches - for example, a directory lister working
on a partial clone that only needs to cache the sizes of frequently
accessed directories.)

Another option is to have a repo-wide option that toggles between
mandatory entries in the "size cache" and prohibited entries. Switching
to mandatory provides stricter fsck and negative lookups, but I think
it's not worth it for both the developers and users of Git to have to
know about these two modes.

> >> I think we should have a flag (off by default) that enables someone to
> >> say that promised objects are optional. If the flag is set,
> >> "is_promised_object" will return success and pass the OBJ_ANY type and a
> >> size of -1.
> >>
> >> Nothing today is using the size and in the two places where the object
> >> type is being checked for consistency (fsck_cache_tree and
> >> fsck_handle_ref) the test can add a test for OBJ_ANY as well.
> >>
> >> This will enable very large numbers of objects to be omitted from the
> >> clone without triggering a download of the corresponding number of
> >> promised objects.
> > 
> > Eventually I plan to use the size when implementing parameters for
> > history-searching commands (e.g. "git log -S"), but it's true that
> > that's in the future.
> > 
> > Allowing promised objects to be optional would indeed solve the issue of
> > downloading too many promises. It would make the code more complicated,
> > but I'm not sure by how much.
> > 
> > For example, in this fsck patch, the easiest way I could think of to
> > have promised objects was to introduce a 3rd state, called "promised",
> > of "struct object" - one in which the type is known, but we don't have
> > access to the full "struct commit" or equivalent. And thus fsck could
> > assume that if the "struct object" is "parsed" or "promised", the type
> > is known. Having optional promised objects would require that we let
> > this "promised" state have a type of OBJ_UNKNOWN (or something like
> > that) - maybe that would be fine, but I haven't looked into this in
> > detail.
> > 
> 
> Caveats apply as I only did a quick look but I only found the two 
> locations that were checking the object type for consistency.

I haven't looked into detail, but you are probably right.


Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects

2017-07-21 Thread Ben Peart



On 7/20/2017 5:13 PM, Jonathan Tan wrote:

On Thu, 20 Jul 2017 15:58:51 -0400
Ben Peart  wrote:


On 7/19/2017 8:21 PM, Jonathan Tan wrote:

Currently, Git does not support repos with very large numbers of objects
or repos that wish to minimize manipulation of certain blobs (for
example, because they are very large) very well, even if the user
operates mostly on part of the repo, because Git is designed on the
assumption that every referenced object is available somewhere in the
repo storage.



Great to see this idea making progress. Making git able to gracefully
handle partial clones (beyond the existing shallow clone support) is a
key piece of dealing with very large objects and repos.


Thanks.


As a first step to reducing this problem, introduce the concept of
promised objects. Each Git repo can contain a list of promised objects
and their sizes (if blobs) at $GIT_DIR/objects/promised. This patch
contains functions to query them; functions for creating and modifying
that file will be introduced in later patches.


If I'm reading this patch correctly, for a repo to successfully pass
"git fsck" either the object or a promise must exist for everything fsck
checks.  From the documentation for fsck it says "git fsck defaults to
using the index file, all SHA-1 references in refs namespace, and all
reflogs (unless --no-reflogs is given) as heads." Doesn't this then
imply objects or promises must exist for all objects referenced by any
of these locations?

We're currently in the hundreds of millions of objects on some of our
repos so even downloading the promises for all the objects in the index
is unreasonable as it is gigabytes of data and growing.


For the index to contain all the files, the repo must already have
downloaded all the trees for HEAD (at least). The trees collectively
contain entries for all the relevant blobs. We need one promise for each
blob, and the size of a promise is comparable to the size of a tree
entry, so the size (of download and storage) needed would be just double
of what we would need if we didn't need promises. This is still only
linear growth, unless you have found that the absolute numbers are too
large?



Today we have 3.5 million objects * 30 bytes per entry = 105 MB of 
promises. Given the average developer only hydrates 56K files (2 MB 
promises) that is 103 MB to download that no one will ever need. We 
would like to avoid that if possible as this would be a significant 
regression in clone times from where we are today.


I'm also concerned about the performance of merging in promises given we 
have 100M objects today and growing so the number of promises over time 
could get pretty large.



Also, if the index is ever changed to not have one entry for every file,
we also wouldn't need one promise for every file.


I think we should have a flag (off by default) that enables someone to
say that promised objects are optional. If the flag is set,
"is_promised_object" will return success and pass the OBJ_ANY type and a
size of -1.

Nothing today is using the size and in the two places where the object
type is being checked for consistency (fsck_cache_tree and
fsck_handle_ref) the test can add a test for OBJ_ANY as well.

This will enable very large numbers of objects to be omitted from the
clone without triggering a download of the corresponding number of
promised objects.


Eventually I plan to use the size when implementing parameters for
history-searching commands (e.g. "git log -S"), but it's true that
that's in the future.

Allowing promised objects to be optional would indeed solve the issue of
downloading too many promises. It would make the code more complicated,
but I'm not sure by how much.

For example, in this fsck patch, the easiest way I could think of to
have promised objects was to introduce a 3rd state, called "promised",
of "struct object" - one in which the type is known, but we don't have
access to the full "struct commit" or equivalent. And thus fsck could
assume that if the "struct object" is "parsed" or "promised", the type
is known. Having optional promised objects would require that we let
this "promised" state have a type of OBJ_UNKNOWN (or something like
that) - maybe that would be fine, but I haven't looked into this in
detail.



Caveats apply as I only did a quick look but I only found the two 
locations that were checking the object type for consistency.



A repository that is missing an object but has that object promised is not
considered to be in error, so also teach fsck this. As part of doing
this, object.{h,c} has been modified to generate "struct object" based
on only the information available to promised objects, without requiring
the object itself.


In your work on this, did you investigate if there are other commands
(ie repack/gc) that will need to learn about promised objects? Have you
had a chance (or have plans) to hack up the test suite so that it runs
all tests with promised objects and see what 

Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects

2017-07-20 Thread Jonathan Tan
On Thu, 20 Jul 2017 15:58:51 -0400
Ben Peart  wrote:

> On 7/19/2017 8:21 PM, Jonathan Tan wrote:
> > Currently, Git does not support repos with very large numbers of objects
> > or repos that wish to minimize manipulation of certain blobs (for
> > example, because they are very large) very well, even if the user
> > operates mostly on part of the repo, because Git is designed on the
> > assumption that every referenced object is available somewhere in the
> > repo storage.
> > 
> 
> Great to see this idea making progress. Making git able to gracefully 
> handle partial clones (beyond the existing shallow clone support) is a 
> key piece of dealing with very large objects and repos.

Thanks.

> > As a first step to reducing this problem, introduce the concept of
> > promised objects. Each Git repo can contain a list of promised objects
> > and their sizes (if blobs) at $GIT_DIR/objects/promised. This patch
> > contains functions to query them; functions for creating and modifying
> > that file will be introduced in later patches.
> 
> If I'm reading this patch correctly, for a repo to successfully pass 
> "git fsck" either the object or a promise must exist for everything fsck 
> checks.  From the documentation for fsck it says "git fsck defaults to 
> using the index file, all SHA-1 references in refs namespace, and all 
> reflogs (unless --no-reflogs is given) as heads." Doesn't this then 
> imply objects or promises must exist for all objects referenced by any 
> of these locations?
> 
> We're currently in the hundreds of millions of objects on some of our 
> repos so even downloading the promises for all the objects in the index 
> is unreasonable as it is gigabytes of data and growing.

For the index to contain all the files, the repo must already have
downloaded all the trees for HEAD (at least). The trees collectively
contain entries for all the relevant blobs. We need one promise for each
blob, and the size of a promise is comparable to the size of a tree
entry, so the size (of download and storage) needed would be just double
of what we would need if we didn't need promises. This is still only
linear growth, unless you have found that the absolute numbers are too
large?

Also, if the index is ever changed to not have one entry for every file,
we also wouldn't need one promise for every file.

> I think we should have a flag (off by default) that enables someone to 
> say that promised objects are optional. If the flag is set, 
> "is_promised_object" will return success and pass the OBJ_ANY type and a 
> size of -1.
> 
> Nothing today is using the size and in the two places where the object 
> type is being checked for consistency (fsck_cache_tree and 
> fsck_handle_ref) the test can add a test for OBJ_ANY as well.
> 
> This will enable very large numbers of objects to be omitted from the 
> clone without triggering a download of the corresponding number of 
> promised objects.

Eventually I plan to use the size when implementing parameters for
history-searching commands (e.g. "git log -S"), but it's true that
that's in the future.

Allowing promised objects to be optional would indeed solve the issue of
downloading too many promises. It would make the code more complicated,
but I'm not sure by how much.

For example, in this fsck patch, the easiest way I could think of to
have promised objects was to introduce a 3rd state, called "promised",
of "struct object" - one in which the type is known, but we don't have
access to the full "struct commit" or equivalent. And thus fsck could
assume that if the "struct object" is "parsed" or "promised", the type
is known. Having optional promised objects would require that we let
this "promised" state have a type of OBJ_UNKNOWN (or something like
that) - maybe that would be fine, but I haven't looked into this in
detail.

> > A repository that is missing an object but has that object promised is not
> > considered to be in error, so also teach fsck this. As part of doing
> > this, object.{h,c} has been modified to generate "struct object" based
> > on only the information available to promised objects, without requiring
> > the object itself.
> 
> In your work on this, did you investigate if there are other commands 
> (ie repack/gc) that will need to learn about promised objects? Have you 
> had a chance (or have plans) to hack up the test suite so that it runs 
> all tests with promised objects and see what (if anything) breaks?

In one of the subsequent patches, I tried to ensure that all
object-reading functions in sha1_file.c somewhat works (albeit slowly)
in the presence of promised objects - that would cover the functionality
of the other commands. As for hacking up the test suite to run with
promised objects, that would be ideal, but I haven't figured out how to
do that yet.


Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects

2017-07-20 Thread Ben Peart



On 7/19/2017 8:21 PM, Jonathan Tan wrote:

Currently, Git does not support repos with very large numbers of objects
or repos that wish to minimize manipulation of certain blobs (for
example, because they are very large) very well, even if the user
operates mostly on part of the repo, because Git is designed on the
assumption that every referenced object is available somewhere in the
repo storage.



Great to see this idea making progress. Making git able to gracefully 
handle partial clones (beyond the existing shallow clone support) is a 
key piece of dealing with very large objects and repos.



As a first step to reducing this problem, introduce the concept of
promised objects. Each Git repo can contain a list of promised objects
and their sizes (if blobs) at $GIT_DIR/objects/promised. This patch
contains functions to query them; functions for creating and modifying
that file will be introduced in later patches.


If I'm reading this patch correctly, for a repo to successfully pass 
"git fsck" either the object or a promise must exist for everything fsck 
checks.  From the documentation for fsck it says "git fsck defaults to 
using the index file, all SHA-1 references in refs namespace, and all 
reflogs (unless --no-reflogs is given) as heads." Doesn't this then 
imply objects or promises must exist for all objects referenced by any 
of these locations?


We're currently in the hundreds of millions of objects on some of our 
repos so even downloading the promises for all the objects in the index 
is unreasonable as it is gigabytes of data and growing.


I think we should have a flag (off by default) that enables someone to 
say that promised objects are optional. If the flag is set, 
"is_promised_object" will return success and pass the OBJ_ANY type and a 
size of -1.


Nothing today is using the size and in the two places where the object 
type is being checked for consistency (fsck_cache_tree and 
fsck_handle_ref) the test can add a test for OBJ_ANY as well.


This will enable very large numbers of objects to be omitted from the 
clone without triggering a download of the corresponding number of 
promised objects.




A repository that is missing an object but has that object promised is not
considered to be in error, so also teach fsck this. As part of doing
this, object.{h,c} has been modified to generate "struct object" based
on only the information available to promised objects, without requiring
the object itself.


In your work on this, did you investigate if there are other commands 
(ie repack/gc) that will need to learn about promised objects? Have you 
had a chance (or have plans) to hack up the test suite so that it runs 
all tests with promised objects and see what (if anything) breaks?




Signed-off-by: Jonathan Tan 
---
  Documentation/technical/repository-version.txt |   6 ++
  Makefile   |   1 +
  builtin/fsck.c |  18 +++-
  cache.h|   2 +
  environment.c  |   1 +
  fsck.c |   6 +-
  object.c   |  19 
  object.h   |  19 
  promised-object.c  | 130 +
  promised-object.h  |  22 +
  setup.c|   7 +-
  t/t3907-promised-object.sh |  41 
  t/test-lib-functions.sh|   6 ++
  13 files changed, 273 insertions(+), 5 deletions(-)
  create mode 100644 promised-object.c
  create mode 100644 promised-object.h
  create mode 100755 t/t3907-promised-object.sh

diff --git a/Documentation/technical/repository-version.txt 
b/Documentation/technical/repository-version.txt
index 00ad37986..f8b82c1c7 100644
--- a/Documentation/technical/repository-version.txt
+++ b/Documentation/technical/repository-version.txt
@@ -86,3 +86,9 @@ for testing format-1 compatibility.
  When the config key `extensions.preciousObjects` is set to `true`,
  objects in the repository MUST NOT be deleted (e.g., by `git-prune` or
  `git repack -d`).
+
+`promisedObjects`
+~
+
+(Explain this - basically a string containing a command to be run
+whenever a missing object needs to be fetched.)
diff --git a/Makefile b/Makefile
index 9c9c42f8f..c1446d5ef 100644
--- a/Makefile
+++ b/Makefile
@@ -828,6 +828,7 @@ LIB_OBJS += preload-index.o
  LIB_OBJS += pretty.o
  LIB_OBJS += prio-queue.o
  LIB_OBJS += progress.o
+LIB_OBJS += promised-object.o
  LIB_OBJS += prompt.o
  LIB_OBJS += quote.o
  LIB_OBJS += reachable.o
diff --git a/builtin/fsck.c b/builtin/fsck.c
index 462b8643b..49e21f361 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -15,6 +15,7 @@
  #include "progress.h"
  #include "streaming.h"
  #include "decorate.h"
+#include "promised-object.h"
  
  #define 

Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects

2017-07-20 Thread Jonathan Tan
On Thu, 20 Jul 2017 11:07:29 -0700
Stefan Beller  wrote:

> > +   if (fsck_promised_objects()) {
> > +   error("Errors found in promised object list");
> > +   errors_found |= ERROR_PROMISED_OBJECT;
> > +   }
> 
> This got me thinking: It is an error if we do not have an object
> and also do not promise it, but what about the other case:
> having and object and promising it, too?
> That is not harmful to the operation, except that the promise
> machinery may be slower due to its size. (Should that be a soft
> warning then? Do we have warnings in fsck?)

Good question - having an object and also having it promised is not an
error condition (and I don't think it's a good idea to make it so,
because objects can appear quite easily from various sources). In the
future, I expect "git gc" to be extended to remove such redundant lines
from the "promised" list.

> >   * The object type is stored in 3 bits.
> >   */
> 
> We may want to remove this comment while we're here as it
> sounds stale despite being technically correct.
> 1974632c66 (Remove TYPE_* constant macros and use
> object_type enums consistently., 2006-07-11)

I agree that the comment is unnecessary, but in this commit I didn't
modify anything to do with the type, so I left it there.

> >  struct object {
> > +   /*
> > +* Set if this object is parsed. If set, "type" is populated and 
> > this
> > +* object can be casted to "struct commit" or an equivalent.
> > +*/
> > unsigned parsed : 1;
> > +   /*
> > +* Set if this object is not in the repo but is promised. If set,
> > +* "type" is populated, but this object cannot be casted to "struct
> > +* commit" or an equivalent.
> > +*/
> > +   unsigned promised : 1;
> 
> Would it make sense to have a bit field instead:
> 
> #define STATE_BITS 2
> #define STATE_PARSED (1<<0)
> #define STATE_PROMISED (1<<1)
> 
> unsigned state : STATE_BITS
> 
> This would be similar to the types and flags?

Both type and flag have to be bit fields (for different reasons), but
since we don't need such a combined field for "parsed" and "promised", I
prefer separating them each into their own field.

> > +test_expect_success 'fsck fails on missing objects' '
> > +   test_create_repo repo &&
> > +
> > +   test_commit -C repo 1 &&
> > +   test_commit -C repo 2 &&
> > +   test_commit -C repo 3 &&
> > +   git -C repo tag -a annotated_tag -m "annotated tag" &&
> > +   C=$(git -C repo rev-parse 1) &&
> > +   T=$(git -C repo rev-parse 2^{tree}) &&
> > +   B=$(git hash-object repo/3.t) &&
> > +   AT=$(git -C repo rev-parse annotated_tag) &&
> > +
> > +   # missing commit, tree, blob, and tag
> > +   rm repo/.git/objects/$(echo $C | cut -c1-2)/$(echo $C | cut -c3-40) 
> > &&
> > +   rm repo/.git/objects/$(echo $T | cut -c1-2)/$(echo $T | cut -c3-40) 
> > &&
> > +   rm repo/.git/objects/$(echo $B | cut -c1-2)/$(echo $B | cut -c3-40) 
> > &&
> > +   rm repo/.git/objects/$(echo $AT | cut -c1-2)/$(echo $AT | cut 
> > -c3-40) &&
> 
> This is a pretty cool test as it promises all sorts of objects
> from different parts of the graph.

Thanks.


Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects

2017-07-20 Thread Stefan Beller
On Wed, Jul 19, 2017 at 5:21 PM, Jonathan Tan  wrote:
> Currently, Git does not support repos with very large numbers of objects
> or repos that wish to minimize manipulation of certain blobs (for
> example, because they are very large) very well, even if the user
> operates mostly on part of the repo, because Git is designed on the
> assumption that every referenced object is available somewhere in the
> repo storage.
>
> As a first step to reducing this problem, introduce the concept of
> promised objects. Each Git repo can contain a list of promised objects
> and their sizes (if blobs) at $GIT_DIR/objects/promised. This patch
> contains functions to query them; functions for creating and modifying
> that file will be introduced in later patches.
>
> A repository that is missing an object but has that object promised is not
> considered to be in error, so also teach fsck this. As part of doing
> this, object.{h,c} has been modified to generate "struct object" based
> on only the information available to promised objects, without requiring
> the object itself.
>
> Signed-off-by: Jonathan Tan 
> ---
>  Documentation/technical/repository-version.txt |   6 ++
>  Makefile   |   1 +
>  builtin/fsck.c |  18 +++-
>  cache.h|   2 +
>  environment.c  |   1 +
>  fsck.c |   6 +-
>  object.c   |  19 
>  object.h   |  19 
>  promised-object.c  | 130 
> +
>  promised-object.h  |  22 +
>  setup.c|   7 +-
>  t/t3907-promised-object.sh |  41 
>  t/test-lib-functions.sh|   6 ++
>  13 files changed, 273 insertions(+), 5 deletions(-)
>  create mode 100644 promised-object.c
>  create mode 100644 promised-object.h
>  create mode 100755 t/t3907-promised-object.sh
>
> diff --git a/Documentation/technical/repository-version.txt 
> b/Documentation/technical/repository-version.txt
> index 00ad37986..f8b82c1c7 100644
> --- a/Documentation/technical/repository-version.txt
> +++ b/Documentation/technical/repository-version.txt
> @@ -86,3 +86,9 @@ for testing format-1 compatibility.
>  When the config key `extensions.preciousObjects` is set to `true`,
>  objects in the repository MUST NOT be deleted (e.g., by `git-prune` or
>  `git repack -d`).
> +
> +`promisedObjects`
> +~
> +
> +(Explain this - basically a string containing a command to be run
> +whenever a missing object needs to be fetched.)
> diff --git a/Makefile b/Makefile
> index 9c9c42f8f..c1446d5ef 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -828,6 +828,7 @@ LIB_OBJS += preload-index.o
>  LIB_OBJS += pretty.o
>  LIB_OBJS += prio-queue.o
>  LIB_OBJS += progress.o
> +LIB_OBJS += promised-object.o
>  LIB_OBJS += prompt.o
>  LIB_OBJS += quote.o
>  LIB_OBJS += reachable.o
> diff --git a/builtin/fsck.c b/builtin/fsck.c
> index 462b8643b..49e21f361 100644
> --- a/builtin/fsck.c
> +++ b/builtin/fsck.c
> @@ -15,6 +15,7 @@
>  #include "progress.h"
>  #include "streaming.h"
>  #include "decorate.h"
> +#include "promised-object.h"
>
>  #define REACHABLE 0x0001
>  #define SEEN  0x0002
> @@ -44,6 +45,7 @@ static int name_objects;
>  #define ERROR_REACHABLE 02
>  #define ERROR_PACK 04
>  #define ERROR_REFS 010
> +#define ERROR_PROMISED_OBJECT 011
>
>  static const char *describe_object(struct object *obj)
>  {
> @@ -436,7 +438,7 @@ static int fsck_handle_ref(const char *refname, const 
> struct object_id *oid,
>  {
> struct object *obj;
>
> -   obj = parse_object(oid);
> +   obj = parse_or_promise_object(oid);
> if (!obj) {
> error("%s: invalid sha1 pointer %s", refname, 
> oid_to_hex(oid));
> errors_found |= ERROR_REACHABLE;
> @@ -592,7 +594,7 @@ static int fsck_cache_tree(struct cache_tree *it)
> fprintf(stderr, "Checking cache tree\n");
>
> if (0 <= it->entry_count) {
> -   struct object *obj = parse_object(>oid);
> +   struct object *obj = parse_or_promise_object(>oid);
> if (!obj) {
> error("%s: invalid sha1 pointer in cache-tree",
>   oid_to_hex(>oid));
> @@ -635,6 +637,12 @@ static int mark_packed_for_connectivity(const struct 
> object_id *oid,
> return 0;
>  }
>
> +static int mark_have_promised_object(const struct object_id *oid, void *data)
> +{
> +   mark_object_for_connectivity(oid);
> +   return 0;
> +}
> +
>  static char const * const fsck_usage[] = {
> N_("git fsck [] [...]"),
> NULL
> @@ -690,6 +698,11 @@ int cmd_fsck(int argc, const char **argv, const char 
> *prefix)
>
>  

[RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects

2017-07-19 Thread Jonathan Tan
Currently, Git does not support repos with very large numbers of objects
or repos that wish to minimize manipulation of certain blobs (for
example, because they are very large) very well, even if the user
operates mostly on part of the repo, because Git is designed on the
assumption that every referenced object is available somewhere in the
repo storage.

As a first step to reducing this problem, introduce the concept of
promised objects. Each Git repo can contain a list of promised objects
and their sizes (if blobs) at $GIT_DIR/objects/promised. This patch
contains functions to query them; functions for creating and modifying
that file will be introduced in later patches.

A repository that is missing an object but has that object promised is not
considered to be in error, so also teach fsck this. As part of doing
this, object.{h,c} has been modified to generate "struct object" based
on only the information available to promised objects, without requiring
the object itself.

Signed-off-by: Jonathan Tan 
---
 Documentation/technical/repository-version.txt |   6 ++
 Makefile   |   1 +
 builtin/fsck.c |  18 +++-
 cache.h|   2 +
 environment.c  |   1 +
 fsck.c |   6 +-
 object.c   |  19 
 object.h   |  19 
 promised-object.c  | 130 +
 promised-object.h  |  22 +
 setup.c|   7 +-
 t/t3907-promised-object.sh |  41 
 t/test-lib-functions.sh|   6 ++
 13 files changed, 273 insertions(+), 5 deletions(-)
 create mode 100644 promised-object.c
 create mode 100644 promised-object.h
 create mode 100755 t/t3907-promised-object.sh

diff --git a/Documentation/technical/repository-version.txt 
b/Documentation/technical/repository-version.txt
index 00ad37986..f8b82c1c7 100644
--- a/Documentation/technical/repository-version.txt
+++ b/Documentation/technical/repository-version.txt
@@ -86,3 +86,9 @@ for testing format-1 compatibility.
 When the config key `extensions.preciousObjects` is set to `true`,
 objects in the repository MUST NOT be deleted (e.g., by `git-prune` or
 `git repack -d`).
+
+`promisedObjects`
+~
+
+(Explain this - basically a string containing a command to be run
+whenever a missing object needs to be fetched.)
diff --git a/Makefile b/Makefile
index 9c9c42f8f..c1446d5ef 100644
--- a/Makefile
+++ b/Makefile
@@ -828,6 +828,7 @@ LIB_OBJS += preload-index.o
 LIB_OBJS += pretty.o
 LIB_OBJS += prio-queue.o
 LIB_OBJS += progress.o
+LIB_OBJS += promised-object.o
 LIB_OBJS += prompt.o
 LIB_OBJS += quote.o
 LIB_OBJS += reachable.o
diff --git a/builtin/fsck.c b/builtin/fsck.c
index 462b8643b..49e21f361 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -15,6 +15,7 @@
 #include "progress.h"
 #include "streaming.h"
 #include "decorate.h"
+#include "promised-object.h"
 
 #define REACHABLE 0x0001
 #define SEEN  0x0002
@@ -44,6 +45,7 @@ static int name_objects;
 #define ERROR_REACHABLE 02
 #define ERROR_PACK 04
 #define ERROR_REFS 010
+#define ERROR_PROMISED_OBJECT 011
 
 static const char *describe_object(struct object *obj)
 {
@@ -436,7 +438,7 @@ static int fsck_handle_ref(const char *refname, const 
struct object_id *oid,
 {
struct object *obj;
 
-   obj = parse_object(oid);
+   obj = parse_or_promise_object(oid);
if (!obj) {
error("%s: invalid sha1 pointer %s", refname, oid_to_hex(oid));
errors_found |= ERROR_REACHABLE;
@@ -592,7 +594,7 @@ static int fsck_cache_tree(struct cache_tree *it)
fprintf(stderr, "Checking cache tree\n");
 
if (0 <= it->entry_count) {
-   struct object *obj = parse_object(>oid);
+   struct object *obj = parse_or_promise_object(>oid);
if (!obj) {
error("%s: invalid sha1 pointer in cache-tree",
  oid_to_hex(>oid));
@@ -635,6 +637,12 @@ static int mark_packed_for_connectivity(const struct 
object_id *oid,
return 0;
 }
 
+static int mark_have_promised_object(const struct object_id *oid, void *data)
+{
+   mark_object_for_connectivity(oid);
+   return 0;
+}
+
 static char const * const fsck_usage[] = {
N_("git fsck [] [...]"),
NULL
@@ -690,6 +698,11 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
 
git_config(fsck_config, NULL);
 
+   if (fsck_promised_objects()) {
+   error("Errors found in promised object list");
+   errors_found |= ERROR_PROMISED_OBJECT;
+   }
+
fsck_head_link();
if (connectivity_only) {