Re: [PATCH v2 4/4] sha1_file, fsck: add missing blob support

2017-06-15 Thread Junio C Hamano
Jonathan Tan  writes:

> There is indeed no reason why we need to keep multiple ones separate for
> an extended period of time - my thinking was to let fetch/clone be fast
> by not needing to scan through the entire existing manifest (in order to
create the new one), letting GC take care of consolidating them ...

Given that fetch/clone already incur network cost and the users
expect to wait for them to finish, I wouldn't have made such a
trade-off.

>> > +int has_missing_blob(const unsigned char *sha1, unsigned long *size)
>> > +{
>> 
>> This function that answers "is it expected to be missing?" is
>> confusingly named.  Is it missing, or does it exist?
>
> Renamed to in_missing_blob_manifest().

Either that, or "is_known_to_be_missing()", would be OK.

Thanks.


Re: [PATCH v2 4/4] sha1_file, fsck: add missing blob support

2017-06-15 Thread Jonathan Tan
A reroll is coming soon, but there is an interesting discussion point
here so I'll reply to this e-mail first.

On Thu, 15 Jun 2017 11:34:45 -0700
Junio C Hamano  wrote:

> Jonathan Tan  writes:
> 
> > +struct missing_blob_manifest {
> > +   struct missing_blob_manifest *next;
> > +   const char *data;
> > +};
> > +struct missing_blob_manifest *missing_blobs;
> > +int missing_blobs_initialized;
> 
> I do not think you meant to make these non-static.  The type of the
> former is not even visible to the outside world, and the latter is
> something that could be made static to the prepare_missing_blobs()
> function (unless and until you start allowing the missing-blobs
> manifest to be re-initialized).  Your ensure_configured() below
> seems to do the "static" right, on the other hand ;-).

Good catch - done.

> Do we expect that we will have only a handful of these missing blob
> manifests?  Each manifest seems to be efficiently looked-up with a
> binary search, but it makes me wonder if it is a good idea to
> consolidate these manifests into a single list of object names to
> eliminate the outer loop in has_missing_blob().  Unlike pack .idx
> files that must stay one-to-one with .pack files, it appears to me
> that there is no reason why we need to keep multiple ones separate
> for an extended period of time (e.g. whenever we learn that we received
> an incomplete pack from the other side with a list of newly missing
> blobs, we could incorporate that into the existing missing blob list).

There is indeed no reason why we need to keep multiple ones separate for
an extended period of time - my thinking was to let fetch/clone be fast
by not needing to scan through the entire existing manifest (in order to
create the new one), letting GC take care of consolidating them (since
it would have to check individual entries to delete those corresponding
to objects that have entered the repo through other means). But this is
at the expense of making the individual object lookups a bit slower.

For now, I'll leave the possibility of multiple files open while I try
to create a set of patches that can implement missing blob support from
fetch to day-to-day usage. But I am not opposed to changing it to a
single-file manifest.
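
For concreteness, the lookup shape being discussed is roughly the
following (an illustrative sketch, not the patch itself: the "nr"
field, the entry layout of a 20-byte SHA-1 followed by an 8-byte
big-endian size, and the manifest_lookup() helper are assumptions made
for this sketch; only missing_blobs, prepare_missing_blobs() and
ntohll() are names taken from the patch and review):

struct missing_blob_manifest {
	struct missing_blob_manifest *next;
	const char *data;	/* mmap'd, sorted fixed-size entries (assumed) */
	size_t nr;		/* number of entries (assumed field) */
};

#define MISSING_BLOB_ENTRY_SIZE (20 + 8)

static int manifest_lookup(const struct missing_blob_manifest *m,
			   const unsigned char *sha1, unsigned long *size)
{
	size_t lo = 0, hi = m->nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const unsigned char *entry = (const unsigned char *)m->data +
					     mid * MISSING_BLOB_ENTRY_SIZE;
		int cmp = memcmp(sha1, entry, 20);

		if (!cmp) {
			if (size) {
				uint64_t sz;
				memcpy(&sz, entry + 20, sizeof(sz));
				*size = ntohll(sz);	/* compat/bswap.h */
			}
			return 1;
		}
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return 0;
}

int has_missing_blob(const unsigned char *sha1, unsigned long *size)
{
	const struct missing_blob_manifest *m;

	prepare_missing_blobs();	/* load the manifests once */
	for (m = missing_blobs; m; m = m->next)	/* the outer loop in question */
		if (manifest_lookup(m, sha1, size))
			return 1;
	return 0;
}

With a single consolidated manifest the outer loop disappears and the
whole lookup is one binary search; the cost moves to fetch/clone (which
must merge new entries into the existing sorted list) or to GC.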

> > +int has_missing_blob(const unsigned char *sha1, unsigned long *size)
> > +{
> 
> This function that answers "is it expected to be missing?" is
> confusingly named.  Is it missing, or does it exist?

Renamed to in_missing_blob_manifest().

> > @@ -2981,11 +3050,55 @@ static int sha1_loose_object_info(const unsigned char *sha1,
> > return (status < 0) ? status : 0;
> >  }
> >  
> > +static char *missing_blob_command;
> > +static int sha1_file_config(const char *conf_key, const char *value, void *cb)
> > +{
> > +   if (!strcmp(conf_key, "core.missingblobcommand")) {
> > +   missing_blob_command = xstrdup(value);
> > +   }
> > +   return 0;
> > +}
> > +
> > +static int configured;
> > +static void ensure_configured(void)
> > +{
> > +   if (configured)
> > +   return;
> 
> Do not be selfish and pretend that this is the _only_ kind of
> configuration that needs to be done inside sha1_file.c.  Call the
> function ensure__is_configured() and rename the run-once
> guard to match.

My thinking was that any additional configuration could be added to this
function, but individual configuration for each feature is fine too. I
have renamed things according to your suggestion.

> The run-once guard can be made static to the "ensure" function, and
> if you do so, then its name can stay to be "configured", as at that
> point it is clear what it is guarding.

Done.
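
For the record, the suggested shape looks something like this (a sketch
only, not the reroll itself; the name of the "ensure" function is a
placeholder, and the config callback is abbreviated):

static char *missing_blob_command;

static int sha1_file_config(const char *conf_key, const char *value, void *cb)
{
	if (!strcmp(conf_key, "core.missingblobcommand"))
		missing_blob_command = xstrdup(value);
	return 0;
}

static void ensure_missing_blob_command_is_configured(void)
{
	static int configured;	/* guard scoped to this function */

	if (configured)
		return;
	configured = 1;
	git_config(sha1_file_config, NULL);
}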

> > +pack() {
> 
> Style: "pack () {"

Done.

> 
> > +   perl -e '$/ = undef; $input = <>; print pack("H*", $input)'
> 
> high-nybble first to match ntohll() done in has_missing_blob()?  OK.

Actually it's to match the printf behavior below that prints the high
nybble first (like in English).
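
In other words, "%016x" prints the most significant nybble first and
pack("H*", ...) keeps that order, so the bytes land on disk big-endian,
and the reader decodes them most significant byte first (which is what
ntohll() yields on a little-endian host). A standalone illustration,
not part of the patch (read_be64() is a throwaway helper for this
example):

#include <stdint.h>
#include <stdio.h>

/* decode an 8-byte big-endian field, most significant byte first */
static uint64_t read_be64(const unsigned char *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

int main(void)
{
	/* printf "%016x" 258 | pack  produces these eight bytes */
	const unsigned char on_disk[8] = { 0, 0, 0, 0, 0, 0, 0x01, 0x02 };

	printf("%llu\n", (unsigned long long)read_be64(on_disk)); /* 258 */
	return 0;
}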


Re: [PATCH v2 4/4] sha1_file, fsck: add missing blob support

2017-06-15 Thread Junio C Hamano
Jonathan Tan  writes:

> diff --git a/sha1_file.c b/sha1_file.c
> index 98086e21e..75fe2174d 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -27,6 +27,9 @@
>  #include "list.h"
>  #include "mergesort.h"
>  #include "quote.h"
> +#include "iterator.h"
> +#include "dir-iterator.h"
> +#include "sha1-lookup.h"
>  
>  #define SZ_FMT PRIuMAX
>  static inline uintmax_t sz_fmt(size_t s) { return s; }
> @@ -1624,6 +1627,72 @@ static const struct packed_git *has_packed_and_bad(const unsigned char *sha1)
>   return NULL;
>  }
>  
> +struct missing_blob_manifest {
> + struct missing_blob_manifest *next;
> + const char *data;
> +};
> +struct missing_blob_manifest *missing_blobs;
> +int missing_blobs_initialized;

I do not think you meant to make these non-static.  The type of the
former is not even visible to the outside world, and the latter is
something that could be made static to the prepare_missing_blobs()
function (unless and until you start allowing the missing-blobs
manifest to be re-initialized).  Your ensure_configured() below
seems to do the "static" right, on the other hand ;-).

Do we expect that we will have only a handful of these missing blob
manifests?  Each manifest seems to be efficiently looked-up with a
binary search, but it makes me wonder if it is a good idea to
consolidate these manifests into a single list of object names to
eliminate the outer loop in has_missing_blob().  Unlike pack .idx
files that must stay one-to-one with .pack files, it appears to me
that there is no reason why we need to keep multiple ones separate
for an extended period of time (e.g. whenever we learn that we received
an incomplete pack from the other side with a list of newly missing
blobs, we could incorporate that into the existing missing blob list).

> +int has_missing_blob(const unsigned char *sha1, unsigned long *size)
> +{

This function that answers "is it expected to be missing?" is
confusingly named.  Is it missing, or does it exist?

> @@ -2981,11 +3050,55 @@ static int sha1_loose_object_info(const unsigned char *sha1,
>   return (status < 0) ? status : 0;
>  }
>  
> +static char *missing_blob_command;
> +static int sha1_file_config(const char *conf_key, const char *value, void *cb)
> +{
> + if (!strcmp(conf_key, "core.missingblobcommand")) {
> + missing_blob_command = xstrdup(value);
> + }
> + return 0;
> +}
> +
> +static int configured;
> +static void ensure_configured(void)
> +{
> + if (configured)
> + return;

Do not be selfish and pretend that this is the _only_ kind of
configuration that needs to be done inside sha1_file.c.  Call the
function ensure__is_configured() and rename the run-once
guard to match.

The run-once guard can be made static to the "ensure" function, and
if you do so, then its name can stay to be "configured", as at that
point it is clear what it is guarding.

> diff --git a/t/t3907-missing-blob.sh b/t/t3907-missing-blob.sh
> new file mode 100755
> index 0..e0ce0942d
> --- /dev/null
> +++ b/t/t3907-missing-blob.sh
> @@ -0,0 +1,69 @@
> +#!/bin/sh
> +
> +test_description='core.missingblobcommand option'
> +
> +. ./test-lib.sh
> +
> +pack() {

Style: "pack () {"

> + perl -e '$/ = undef; $input = <>; print pack("H*", $input)'

high-nybble first to match ntohll() done in has_missing_blob()?  OK.

> +}
> +
> +test_expect_success 'sha1_object_info_extended and read_sha1_file (through git cat-file -p)' '
> + rm -rf server client &&
> +
> + git init server &&
> + test_commit -C server 1 &&
> + test_config -C server uploadpack.allowanysha1inwant 1 &&
> + HASH=$(git hash-object server/1.t) &&
> +
> + git init client &&
> + test_config -C client core.missingblobcommand \
> + "git -C \"$(pwd)/server\" pack-objects --stdout | git 
> unpack-objects" &&
> +
> + # does not work if missing blob is not registered
> + test_must_fail git -C client cat-file -p "$HASH" &&
> +
> + mkdir -p client/.git/objects/missing &&
> + printf "%016x%s%016x" 1 "$HASH" "$(wc -c <server/1.t)" |
> + pack >client/.git/objects/missing/x &&
> +
> + # works when missing blob is registered
> + git -C client cat-file -p "$HASH"
> +'

OK, by passing the output through printf '%016x', implementations of
"$(wc -c)" that emit extra whitespace around their output can still
work correctly.  Good.


[PATCH v2 4/4] sha1_file, fsck: add missing blob support

2017-06-13 Thread Jonathan Tan
Currently, Git does not handle well repos with very large numbers of
blobs, or repos that wish to minimize manipulation of certain blobs
(for example, because they are very large), even if the user operates
mostly on part of the repo. This is because Git is designed on the
assumption that every blob referenced by a tree object is available
somewhere in the repo storage.

As a first step to reducing this problem, add rudimentary support for
missing blobs by teaching sha1_file to invoke a hook whenever a blob is
requested and unavailable but registered to be missing, and by updating
fsck to tolerate such blobs.  The hook is a shell command that can be
configured through "git config"; this hook takes in a list of hashes and
writes (if successful) the corresponding objects to the repo's local
storage.

This commit does not include support for generating such a repo; neither
has any command (other than fsck) been modified to either tolerate
missing blobs (without invoking the hook) or be more efficient in
invoking the missing blob hook. Only a fallback is provided in the form
of sha1_file invoking the missing blob hook when necessary.
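
To illustrate the fallback, a sketch of how sha1_file might invoke the
configured command (this is not the code in this patch: the helper name
and the "one hex hash per line on stdin" convention are assumptions
drawn from t/t3907-missing-blob.sh, where the configured command is a
pipeline of "git pack-objects --stdout | git unpack-objects" reading
hashes from its stdin and writing the objects into the local repo):

static int invoke_missing_blob_command(const unsigned char *sha1)
{
	struct child_process cp = CHILD_PROCESS_INIT;
	struct strbuf in = STRBUF_INIT;
	int err = 0;

	if (!missing_blob_command)
		return -1;

	argv_array_push(&cp.args, missing_blob_command);
	cp.use_shell = 1;
	cp.in = -1;	/* pipe; we write the wanted hashes to its stdin */

	if (start_command(&cp))
		return -1;

	strbuf_addf(&in, "%s\n", sha1_to_hex(sha1));
	if (write_in_full(cp.in, in.buf, in.len) < 0)
		err = -1;
	close(cp.in);
	strbuf_release(&in);

	/* on success the command has written the objects into the repo */
	if (finish_command(&cp) || err)
		return -1;
	return 0;
}

The caller (sha1_object_info_extended() or read_sha1_file()) would try
this once when an object is absent but registered as missing, then
retry the normal lookup.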

In order to determine the necessary code changes in sha1_file.c, I
investigated the following:
 (1) functions in sha1_file that take in a hash, without the user
 regarding how the object is stored (loose or packed)
 (2) functions in sha1_file that operate on packed objects (because I
 need to check callers that know about the loose/packed distinction
 and operate on both differently, and ensure that they can handle
 the concept of objects that are neither loose nor packed)

(1) is handled by the modification to sha1_object_info_extended().

For (2), I looked through the same functions as in (1) and also
for_each_packed_object. The ones that are relevant are:
 - parse_pack_index
   - http - indirectly from http_get_info_packs
 - find_pack_entry_one
   - this searches a single pack that is provided as an argument; the
 caller already knows (through other means) that the sought object
 is in a specific pack
 - find_sha1_pack
   - fast-import - appears to be an optimization to not store a
 file if it is already in a pack
   - http-walker - to search through a struct alt_base
   - http-push - to search through remote packs
 - has_sha1_pack
   - builtin/fsck - fixed in this commit
   - builtin/count-objects - informational purposes only (check if loose
 object is also packed)
   - builtin/prune-packed - check if object to be pruned is packed (if
 not, don't prune it)
   - revision - used to exclude packed objects if requested by user
   - diff - just for optimization
 - for_each_packed_object
   - reachable - only to find recent objects
   - builtin/fsck - fixed in this commit
   - builtin/cat-file - see below

As described in the list above, builtin/fsck has been updated. I have
left builtin/cat-file alone; this means that cat-file
--batch-all-objects will only operate on objects physically in the repo.

An alternative design that I considered but rejected:

 - Adding a hook whenever a packed blob is requested, not on any blob.
   That is, whenever we attempt to search the packfiles for a blob and
   it is missing (from the packfiles and from the loose object storage),
   invoke the hook (which must then store it as a packfile), open the
   packfile the hook generated, and report that the blob is found in
   that new packfile. This reduces the amount of analysis needed (in
   that we only need to look at how packed blobs are handled), but
   requires that the hook generate packfiles (or for sha1_file to pack
   whatever loose objects are generated), creating one packfile for each
   missing blob and potentially very many packfiles that must be
   linearly searched. This may be tolerable now for repos that only have
   a few missing blobs (for example, repos that only want to exclude
   large blobs), and might be tolerable in the future if we have
   batching support for the most commonly used commands, but is not
   tolerable now for repos that exclude a large number of blobs.

Signed-off-by: Jonathan Tan 
---
 Documentation/config.txt |  10 +++
 builtin/fsck.c   |   7 +++
 cache.h  |   6 ++
 sha1_file.c  | 156 ++-
 t/t3907-missing-blob.sh  |  69 +
 5 files changed, 233 insertions(+), 15 deletions(-)
 create mode 100755 t/t3907-missing-blob.sh

diff --git a/Documentation/config.txt b/Documentation/config.txt
index dd4beec39..10da5fde1 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -390,6 +390,16 @@ The default is false, except linkgit:git-clone[1] or linkgit:git-init[1]
 will probe and set core.ignoreCase true if appropriate when the repository
 is created.
 
+core.missingBlobCommand::
+   If set, whenever a blob in the local repo is attempted to be
+   read but is missing, invoke this shell