This patch series enables git to request missing objects when they are
not found in the object store. This is a fault-in model where the
"read-object" sub-process will fetch the missing object and store it in
the object store as a loose, alternate, or pack file. On success, git
will retry the operation and find the requested object.

It utilizes the recent sub-process refactoring to spawn a "read-object"
hook as a sub-process on the first request and then all subsequent
requests are made to that existing sub-process. This significantly
reduces the cost of making multiple request within a single git command.

Signed-off-by: Ben Peart <benpe...@microsoft.com>
---
 Documentation/technical/read-object-protocol.txt | 102 ++++++++++++
 cache.h                                          |   1 +
 config.c                                         |   5 +
 contrib/long-running-read-object/example.pl      | 114 +++++++++++++
 environment.c                                    |   1 +
 sha1_file.c                                      | 193 ++++++++++++++++++++++-
 t/t0410-read-object.sh                           |  27 ++++
 t/t0410/read-object                              | 114 +++++++++++++
 8 files changed, 550 insertions(+), 7 deletions(-)
 create mode 100644 Documentation/technical/read-object-protocol.txt
 create mode 100755 contrib/long-running-read-object/example.pl
 create mode 100755 t/t0410-read-object.sh
 create mode 100755 t/t0410/read-object

diff --git a/Documentation/technical/read-object-protocol.txt 
b/Documentation/technical/read-object-protocol.txt
new file mode 100644
index 0000000000..a893b46e7c
--- /dev/null
+++ b/Documentation/technical/read-object-protocol.txt
@@ -0,0 +1,102 @@
+Read Object Process
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The read-object process enables Git to read all missing blobs with a
+single process invocation for the entire life of a single Git command.
+This is achieved by using a packet format (pkt-line, see technical/
+protocol-common.txt) based protocol over standard input and standard
+output as follows. All packets, except for the "*CONTENT" packets and
+the "0000" flush packet, are considered text and therefore are
+terminated by a LF.
+
+Git starts the process when it encounters the first missing object that
+needs to be retrieved. After the process is started, Git sends a welcome
+message ("git-read-object-client"), a list of supported protocol version
+numbers, and a flush packet. Git expects to read a welcome response
+message ("git-read-object-server"), exactly one protocol version number
+from the previously sent list, and a flush packet. All further
+communication will be based on the selected version.
+
+The remaining protocol description below documents "version=1". Please
+note that "version=42" in the example below does not exist and is only
+there to illustrate how the protocol would look with more than one
+version.
+
+After the version negotiation Git sends a list of all capabilities that
+it supports and a flush packet. Git expects to read a list of desired
+capabilities, which must be a subset of the supported capabilities list,
+and a flush packet as response:
+------------------------
+packet: git> git-read-object-client
+packet: git> version=1
+packet: git> version=42
+packet: git> 0000
+packet: git< git-read-object-server
+packet: git< version=1
+packet: git< 0000
+packet: git> capability=get
+packet: git> capability=have
+packet: git> capability=put
+packet: git> capability=not-yet-invented
+packet: git> 0000
+packet: git< capability=get
+packet: git< 0000
+------------------------
+The only supported capability in version 1 is "get".
+
+Afterwards Git sends a list of "key=value" pairs terminated with a flush
+packet. The list will contain at least the command (based on the
+supported capabilities) and the sha1 of the object to retrieve. Please
+note, that the process must not send any response before it received the
+final flush packet.
+
+When the process receives the "get" command, it should make the requested
+object available in the git object store and then return success. Git will
+then check the object store again and this time find it and proceed.
+------------------------
+packet: git> command=get
+packet: git> sha1=0a214a649e1b3d5011e14a3dc227753f2bd2be05
+packet: git> 0000
+------------------------
+
+The process is expected to respond with a list of "key=value" pairs
+terminated with a flush packet. If the process does not experience
+problems then the list must contain a "success" status.
+------------------------
+packet: git< status=success
+packet: git< 0000
+------------------------
+
+In case the process cannot or does not want to process the content, it
+is expected to respond with an "error" status.
+------------------------
+packet: git< status=error
+packet: git< 0000
+------------------------
+
+In case the process cannot or does not want to process the content as
+well as any future content for the lifetime of the Git process, then it
+is expected to respond with an "abort" status at any point in the
+protocol.
+------------------------
+packet: git< status=abort
+packet: git< 0000
+------------------------
+
+Git neither stops nor restarts the process in case the "error"/"abort"
+status is set.
+
+If the process dies during the communication or does not adhere to the
+protocol then Git will stop the process and restart it with the next
+object that needs to be processed.
+
+After the read-object process has processed an object it is expected to
+wait for the next "key=value" list containing a command. Git will close
+the command pipe on exit. The process is expected to detect EOF and exit
+gracefully on its own. Git will wait until the process has stopped.
+
+A long running read-object process demo implementation can be found in
+`contrib/long-running-read-object/example.pl` located in the Git core
+repository. If you develop your own long running process then the
+`GIT_TRACE_PACKET` environment variables can be very helpful for
+debugging (see linkgit:git[1]).
diff --git a/cache.h b/cache.h
index 71fe092644..914379724f 100644
--- a/cache.h
+++ b/cache.h
@@ -804,6 +804,7 @@ enum log_refs_config {
        LOG_REFS_ALWAYS
 };
 extern enum log_refs_config log_all_ref_updates;
+extern int core_virtualize_objects;
 
 enum branch_track {
        BRANCH_TRACK_UNSPECIFIED = -1,
diff --git a/config.c b/config.c
index a9356c1383..cc6e8f3237 100644
--- a/config.c
+++ b/config.c
@@ -1241,6 +1241,11 @@ static int git_default_core_config(const char *var, 
const char *value)
                return 0;
        }
 
+       if (!strcmp(var, "core.virtualizeobjects")) {
+               core_virtualize_objects = git_config_bool(var, value);
+               return 0;
+       }
+
        /* Add other config variables here and to Documentation/config.txt. */
        return 0;
 }
diff --git a/contrib/long-running-read-object/example.pl 
b/contrib/long-running-read-object/example.pl
new file mode 100755
index 0000000000..b8f37f836a
--- /dev/null
+++ b/contrib/long-running-read-object/example.pl
@@ -0,0 +1,114 @@
+#!/usr/bin/perl
+#
+# Example implementation for the Git read-object protocol version 1
+# See Documentation/technical/read-object-protocol.txt
+#
+# Allows you to test the ability for blobs to be pulled from a host git repo
+# "on demand."  Called when git needs a blob it couldn't find locally due to
+# a lazy clone that only cloned the commits and trees.
+#
+# A lazy clone can be simulated via the following commands from the host repo
+# you wish to create a lazy clone of:
+#
+# cd /host_repo
+# git rev-parse HEAD
+# git init /guest_repo
+# git cat-file --batch-check --batch-all-objects | grep -v 'blob' |
+#      cut -d' ' -f1 | git pack-objects /guest_repo/.git/objects/pack/noblobs
+# cd /guest_repo
+# git config core.virtualizeobjects true
+# git reset --hard <sha from rev-parse call above>
+#
+# Please note, this sample is a minimal skeleton. No proper error handling
+# was implemented.
+#
+
+use strict;
+use warnings;
+
+#
+# Point $DIR to the folder where your host git repo is located so we can pull
+# missing objects from it
+#
+my $DIR = "/host_repo/.git/";
+
+sub packet_bin_read {
+       my $buffer;
+       my $bytes_read = read STDIN, $buffer, 4;
+       if ( $bytes_read == 0 ) {
+
+               # EOF - Git stopped talking to us!
+               exit();
+       }
+       elsif ( $bytes_read != 4 ) {
+               die "invalid packet: '$buffer'";
+       }
+       my $pkt_size = hex($buffer);
+       if ( $pkt_size == 0 ) {
+               return ( 1, "" );
+       }
+       elsif ( $pkt_size > 4 ) {
+               my $content_size = $pkt_size - 4;
+               $bytes_read = read STDIN, $buffer, $content_size;
+               if ( $bytes_read != $content_size ) {
+                       die "invalid packet ($content_size bytes expected; 
$bytes_read bytes read)";
+               }
+               return ( 0, $buffer );
+       }
+       else {
+               die "invalid packet size: $pkt_size";
+       }
+}
+
+sub packet_txt_read {
+       my ( $res, $buf ) = packet_bin_read();
+       unless ( $buf =~ s/\n$// ) {
+               die "A non-binary line MUST be terminated by an LF.";
+       }
+       return ( $res, $buf );
+}
+
+sub packet_bin_write {
+       my $buf = shift;
+       print STDOUT sprintf( "%04x", length($buf) + 4 );
+       print STDOUT $buf;
+       STDOUT->flush();
+}
+
+sub packet_txt_write {
+       packet_bin_write( $_[0] . "\n" );
+}
+
+sub packet_flush {
+       print STDOUT sprintf( "%04x", 0 );
+       STDOUT->flush();
+}
+
+( packet_txt_read() eq ( 0, "git-read-object-client" ) ) || die "bad 
initialize";
+( packet_txt_read() eq ( 0, "version=1" ) )                             || die 
"bad version";
+( packet_bin_read() eq ( 1, "" ) )                       || die "bad version 
end";
+
+packet_txt_write("git-read-object-server");
+packet_txt_write("version=1");
+packet_flush();
+
+( packet_txt_read() eq ( 0, "capability=get" ) )    || die "bad capability";
+( packet_bin_read() eq ( 1, "" ) )                  || die "bad capability 
end";
+
+packet_txt_write("capability=get");
+packet_flush();
+
+while (1) {
+       my ($command) = packet_txt_read() =~ /^command=([^=]+)$/;
+
+       if ( $command eq "get" ) {
+               my ($sha1) = packet_txt_read() =~ /^sha1=([0-9a-f]{40})$/;
+               packet_bin_read();
+
+               system ('git --git-dir="' . $DIR . '" cat-file blob ' . $sha1 . 
' | git -c core.virtualizeobjects=false hash-object -w --stdin >/dev/null 
2>&1');
+               packet_txt_write(($?) ? "status=error" : "status=success");
+               packet_flush();
+       } else {
+               die "bad command '$command'";
+       }
+}
diff --git a/environment.c b/environment.c
index 3fd4b10845..ad9403ed6c 100644
--- a/environment.c
+++ b/environment.c
@@ -66,6 +66,7 @@ int precomposed_unicode = -1; /* see 
probe_utf8_pathname_composition() */
 unsigned long pack_size_limit_cfg;
 enum hide_dotfiles_type hide_dotfiles = HIDE_DOTFILES_DOTGITONLY;
 enum log_refs_config log_all_ref_updates = LOG_REFS_UNSET;
+int core_virtualize_objects;
 
 #ifndef PROTECT_HFS_DEFAULT
 #define PROTECT_HFS_DEFAULT 0
diff --git a/sha1_file.c b/sha1_file.c
index 5862386cd0..06290f8647 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -28,6 +28,9 @@
 #include "list.h"
 #include "mergesort.h"
 #include "quote.h"
+#include "sigchain.h"
+#include "sub-process.h"
+#include "pkt-line.h"
 
 #define SZ_FMT PRIuMAX
 static inline uintmax_t sz_fmt(size_t s) { return s; }
@@ -647,6 +650,162 @@ void prepare_alt_odb(void)
        read_info_alternates(get_object_directory(), 0);
 }
 
+#define CAP_GET    (1u<<0)
+
+static int subprocess_map_initialized;
+static struct hashmap subprocess_map;
+
+struct read_object_process {
+       struct subprocess_entry subprocess;
+       unsigned int supported_capabilities;
+};
+
+int start_read_object_fn(struct subprocess_entry *subprocess)
+{
+       int err;
+       struct read_object_process *entry = (struct read_object_process 
*)subprocess;
+       struct child_process *process;
+       struct string_list cap_list = STRING_LIST_INIT_NODUP;
+       char *cap_buf;
+       const char *cap_name;
+
+       process = subprocess_get_child_process(&entry->subprocess);
+
+       sigchain_push(SIGPIPE, SIG_IGN);
+
+       err = packet_writel(process->in, "git-read-object-client", "version=1", 
NULL);
+       if (err)
+               goto done;
+
+       err = strcmp(packet_read_line(process->out, NULL), 
"git-read-object-server");
+       if (err) {
+               error("external process '%s' does not support read-object 
protocol version 1", subprocess->cmd);
+               goto done;
+       }
+       err = strcmp(packet_read_line(process->out, NULL), "version=1");
+       if (err)
+               goto done;
+       err = packet_read_line(process->out, NULL) != NULL;
+       if (err)
+               goto done;
+
+       err = packet_writel(process->in, "capability=get", NULL);
+       if (err)
+               goto done;
+
+       for (;;) {
+               cap_buf = packet_read_line(process->out, NULL);
+               if (!cap_buf)
+                       break;
+               string_list_split_in_place(&cap_list, cap_buf, '=', 1);
+
+               if (cap_list.nr != 2 || strcmp(cap_list.items[0].string, 
"capability"))
+                       continue;
+
+               cap_name = cap_list.items[1].string;
+               if (!strcmp(cap_name, "get")) {
+                       entry->supported_capabilities |= CAP_GET;
+               }
+               else {
+                       warning(
+                               "external process '%s' requested unsupported 
read-object capability '%s'",
+                               subprocess->cmd, cap_name
+                       );
+               }
+
+               string_list_clear(&cap_list, 0);
+       }
+
+done:
+       sigchain_pop(SIGPIPE);
+
+       if (err || errno == EPIPE)
+               return err ? err : errno;
+
+       return 0;
+}
+
+static int read_object_process(const unsigned char *sha1)
+{
+       int err;
+       struct read_object_process *entry;
+       struct child_process *process;
+       struct strbuf status = STRBUF_INIT;
+       const char *cmd = find_hook("read-object");
+       uint64_t start;
+
+       start = getnanotime();
+
+       if (!subprocess_map_initialized) {
+               subprocess_map_initialized = 1;
+               hashmap_init(&subprocess_map, (hashmap_cmp_fn)cmd2process_cmp, 
0);
+               entry = NULL;
+       } else {
+               entry = (struct cmd2process 
*)subprocess_find_entry(&subprocess_map, cmd);
+       }
+       if (!entry) {
+               entry = xmalloc(sizeof(*entry));
+               entry->supported_capabilities = 0;
+
+               if (subprocess_start(&subprocess_map, &entry->subprocess, cmd, 
start_read_object_fn)) {
+                       free(entry);
+                       return -1;
+               }
+       }
+       process = subprocess_get_child_process(&entry->subprocess);
+
+       if (!(CAP_GET & entry->supported_capabilities))
+               return -1;
+
+       sigchain_push(SIGPIPE, SIG_IGN);
+
+       err = packet_write_fmt_gently(process->in, "command=get\n");
+       if (err)
+               goto done;
+
+       err = packet_write_fmt_gently(process->in, "sha1=%s\n", 
sha1_to_hex(sha1));
+       if (err)
+               goto done;
+
+       err = packet_flush_gently(process->in);
+       if (err)
+               goto done;
+
+       err = subprocess_read_status(process->out, &status);
+       err = err ? err : strcmp(status.buf, "success");
+
+done:
+       sigchain_pop(SIGPIPE);
+
+       if (err || errno == EPIPE) {
+               err = err ? err : errno;
+               if (!strcmp(status.buf, "error")) {
+                       /* The process signaled a problem with the file. */
+               }
+               else if (!strcmp(status.buf, "abort")) {
+                       /*
+                        * The process signaled a permanent problem. Don't try 
to read
+                        * objects with the same command for the lifetime of 
the current
+                        * Git process.
+                        */
+                       entry->supported_capabilities &= ~CAP_GET;
+               }
+               else {
+                       /*
+                        * Something went wrong with the read-object process.
+                        * Force shutdown and restart if needed.
+                        */
+                       error("external process '%s' failed", cmd);
+                       subprocess_stop(&subprocess_map, (struct 
subprocess_entry *)entry);
+                       free(entry);
+               }
+       }
+
+       trace_performance_since(start, "read_object_process");
+
+       return err;
+}
+
 /* Returns 1 if we have successfully freshened the file, 0 otherwise. */
 static int freshen_file(const char *fn)
 {
@@ -690,8 +849,19 @@ static int check_and_freshen_nonlocal(const unsigned char 
*sha1, int freshen)
 
 static int check_and_freshen(const unsigned char *sha1, int freshen)
 {
-       return check_and_freshen_local(sha1, freshen) ||
-              check_and_freshen_nonlocal(sha1, freshen);
+       int ret;
+       int already_retried = 0;
+
+retry:
+       ret = check_and_freshen_local(sha1, freshen) ||
+               check_and_freshen_nonlocal(sha1, freshen);
+       if (!ret && core_virtualize_objects && !already_retried) {
+               already_retried = 1;
+               if (!read_object_process(sha1))
+                       goto retry;
+       }
+
+       return ret;
 }
 
 int has_loose_object_nonlocal(const unsigned char *sha1)
@@ -2983,6 +3153,7 @@ int sha1_object_info_extended(const unsigned char *sha1, 
struct object_info *oi,
        const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
                                    lookup_replace_object(sha1) :
                                    sha1;
+       int already_retried = 0;
 
        if (!oi)
                oi = &blank_oi;
@@ -3007,6 +3178,7 @@ int sha1_object_info_extended(const unsigned char *sha1, 
struct object_info *oi,
                }
        }
 
+retry:
        if (!find_pack_entry(real, &e)) {
                /* Most likely it's a loose object. */
                if (!sha1_loose_object_info(real, oi, flags)) {
@@ -3015,13 +3187,19 @@ int sha1_object_info_extended(const unsigned char 
*sha1, struct object_info *oi,
                }
 
                /* Not a loose object; someone else may have just packed it. */
-               if (flags & OBJECT_INFO_QUICK) {
-                       return -1;
-               } else {
+               if (!(flags & OBJECT_INFO_QUICK)) {
                        reprepare_packed_git();
-                       if (!find_pack_entry(real, &e))
-                               return -1;
+                       if (find_pack_entry(real, &e))
+                               goto found_packed;
+               }
+
+               /* Request the object be retrieved */
+               if (core_virtualize_objects && !already_retried) {
+                       already_retried = 1;
+                       if (!read_object_process(sha1))
+                               goto retry;
                }
+               return -1;
        }
 
        if (oi == &blank_oi)
@@ -3031,6 +3209,7 @@ int sha1_object_info_extended(const unsigned char *sha1, 
struct object_info *oi,
                 */
                return 0;
 
+found_packed:
        rtype = packed_object_info(e.p, e.offset, oi);
        if (rtype < 0) {
                mark_bad_packed_object(e.p, real);
diff --git a/t/t0410-read-object.sh b/t/t0410-read-object.sh
new file mode 100755
index 0000000000..b8d7521c2c
--- /dev/null
+++ b/t/t0410-read-object.sh
@@ -0,0 +1,27 @@
+#!/bin/sh
+
+test_description='tests for long running read-object process'
+
+. ./test-lib.sh
+
+test_expect_success 'setup host repo with a root commit' '
+       test_commit zero &&
+       hash1=$(git ls-tree HEAD | grep zero.t | cut -f1 | cut -d\  -f3)
+'
+
+test_expect_success 'blobs can be retrieved from the host repo' '
+       git init guest-repo &&
+       (cd guest-repo &&
+        mkdir -p .git/hooks &&
+        cp $TEST_DIRECTORY/t0410/read-object .git/hooks/ &&
+        git config core.virtualizeobjects true &&
+        git cat-file blob "$hash1")
+'
+
+test_expect_success 'invalid blobs generate errors' '
+       (cd guest-repo &&
+        test_must_fail git cat-file blob "invalid")
+'
+
+
+test_done
diff --git a/t/t0410/read-object b/t/t0410/read-object
new file mode 100755
index 0000000000..85e997c930
--- /dev/null
+++ b/t/t0410/read-object
@@ -0,0 +1,114 @@
+#!/usr/bin/perl
+#
+# Example implementation for the Git read-object protocol version 1
+# See Documentation/technical/read-object-protocol.txt
+#
+# Allows you to test the ability for blobs to be pulled from a host git repo
+# "on demand."  Called when git needs a blob it couldn't find locally due to
+# a lazy clone that only cloned the commits and trees.
+#
+# A lazy clone can be simulated via the following commands from the host repo
+# you wish to create a lazy clone of:
+#
+# cd /host_repo
+# git rev-parse HEAD
+# git init /guest_repo
+# git cat-file --batch-check --batch-all-objects | grep -v 'blob' |
+#      cut -d' ' -f1 | git pack-objects /guest_repo/.git/objects/pack/noblobs
+# cd /guest_repo
+# git config core.virtualizeobjects true
+# git reset --hard <sha from rev-parse call above>
+#
+# Please note, this sample is a minimal skeleton. No proper error handling
+# was implemented.
+#
+
+use strict;
+use warnings;
+
+#
+# Point $DIR to the folder where your host git repo is located so we can pull
+# missing objects from it
+#
+my $DIR = "../.git/";
+
+sub packet_bin_read {
+       my $buffer;
+       my $bytes_read = read STDIN, $buffer, 4;
+       if ( $bytes_read == 0 ) {
+
+               # EOF - Git stopped talking to us!
+               exit();
+       }
+       elsif ( $bytes_read != 4 ) {
+               die "invalid packet: '$buffer'";
+       }
+       my $pkt_size = hex($buffer);
+       if ( $pkt_size == 0 ) {
+               return ( 1, "" );
+       }
+       elsif ( $pkt_size > 4 ) {
+               my $content_size = $pkt_size - 4;
+               $bytes_read = read STDIN, $buffer, $content_size;
+               if ( $bytes_read != $content_size ) {
+                       die "invalid packet ($content_size bytes expected; 
$bytes_read bytes read)";
+               }
+               return ( 0, $buffer );
+       }
+       else {
+               die "invalid packet size: $pkt_size";
+       }
+}
+
+sub packet_txt_read {
+       my ( $res, $buf ) = packet_bin_read();
+       unless ( $buf =~ s/\n$// ) {
+               die "A non-binary line MUST be terminated by an LF.";
+       }
+       return ( $res, $buf );
+}
+
+sub packet_bin_write {
+       my $buf = shift;
+       print STDOUT sprintf( "%04x", length($buf) + 4 );
+       print STDOUT $buf;
+       STDOUT->flush();
+}
+
+sub packet_txt_write {
+       packet_bin_write( $_[0] . "\n" );
+}
+
+sub packet_flush {
+       print STDOUT sprintf( "%04x", 0 );
+       STDOUT->flush();
+}
+
+( packet_txt_read() eq ( 0, "git-read-object-client" ) ) || die "bad 
initialize";
+( packet_txt_read() eq ( 0, "version=1" ) )                             || die 
"bad version";
+( packet_bin_read() eq ( 1, "" ) )                       || die "bad version 
end";
+
+packet_txt_write("git-read-object-server");
+packet_txt_write("version=1");
+packet_flush();
+
+( packet_txt_read() eq ( 0, "capability=get" ) )    || die "bad capability";
+( packet_bin_read() eq ( 1, "" ) )                  || die "bad capability 
end";
+
+packet_txt_write("capability=get");
+packet_flush();
+
+while (1) {
+       my ($command) = packet_txt_read() =~ /^command=([^=]+)$/;
+
+       if ( $command eq "get" ) {
+               my ($sha1) = packet_txt_read() =~ /^sha1=([0-9a-f]{40})$/;
+               packet_bin_read();
+
+               system ('git --git-dir="' . $DIR . '" cat-file blob ' . $sha1 . 
' | git -c core.virtualizeobjects=false hash-object -w --stdin >/dev/null 
2>&1');
+               packet_txt_write(($?) ? "status=error" : "status=success");
+               packet_flush();
+       } else {
+               die "bad command '$command'";
+       }
+}
-- 
2.13.2.windows.1

Reply via email to