[PATCH v4 01/15] upload-pack: add object filtering for partial clone

2017-11-16 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Teach upload-pack to negotiate object filtering over the protocol and
to send filter parameters to pack-objects.  This is intended for partial
clone and fetch.

The idea to make upload-pack configurable using uploadpack.allowFilter
comes from Jonathan Tan's work in [1].

[1] 
https://public-inbox.org/git/f211093280b422c32cc1b7034130072f35c5ed51.1506714999.git.jonathanta...@google.com/
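
As an illustration of the forwarding described above, here is a minimal
sketch (not part of the patch) of how upload-pack can pass a parsed
filter on to the pack-objects child process using the helper added
below; the spawn boilerplate and variable names are illustrative only:

    /*
     * Hedged sketch: forward the negotiated filter to pack-objects.
     * "pack_objects_args" is an illustrative name, not from the patch.
     */
    struct argv_array pack_objects_args = ARGV_ARRAY_INIT;

    argv_array_push(&pack_objects_args, "pack-objects");
    argv_array_push(&pack_objects_args, "--revs");
    argv_array_push(&pack_objects_args, "--stdout");
    if (filter_options.choice)
        list_objects_filter_push_arg(&pack_objects_args, &filter_options);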

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 Documentation/config.txt  |  4 
 Documentation/technical/pack-protocol.txt |  8 +++
 Documentation/technical/protocol-capabilities.txt |  8 +++
 list-objects-filter-options.c | 26 +++
 list-objects-filter-options.h |  6 ++
 upload-pack.c | 22 ++-
 6 files changed, 73 insertions(+), 1 deletion(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 1ac0ae6..e528210 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -3268,6 +3268,10 @@ uploadpack.packObjectsHook::
was run. I.e., `upload-pack` will feed input intended for
`pack-objects` to the hook, and expects a completed packfile on
stdout.
+
+uploadpack.allowFilter::
+   If this option is set, `upload-pack` will advertise partial
+   clone and partial fetch object filtering.
 +
 Note that this configuration variable is ignored if it is seen in the
 repository-level config (this is a safety measure against fetching from
diff --git a/Documentation/technical/pack-protocol.txt 
b/Documentation/technical/pack-protocol.txt
index ed1eae8..a43a113 100644
--- a/Documentation/technical/pack-protocol.txt
+++ b/Documentation/technical/pack-protocol.txt
@@ -212,6 +212,7 @@ out of what the server said it could do with the first 
'want' line.
   upload-request=  want-list
   *shallow-line
   *1depth-request
+  [filter-request]
   flush-pkt
 
   want-list =  first-want
@@ -227,6 +228,8 @@ out of what the server said it could do with the first 
'want' line.
   additional-want   =  PKT-LINE("want" SP obj-id)
 
   depth =  1*DIGIT
+
+  filter-request=  PKT-LINE("filter" SP filter-spec)
 
 
 Clients MUST send all the obj-ids it wants from the reference
@@ -249,6 +252,11 @@ complete those commits. Commits whose parents are not 
received as a
 result are defined as shallow and marked as such in the server. This
 information is sent back to the client in the next step.
 
+The client can optionally request that pack-objects omit various
+objects from the packfile using one of several filtering techniques.
+These are intended for use with partial clone and partial fetch
+operations.  See `rev-list` for possible "filter-spec" values.
+
 Once all the 'want's and 'shallow's (and optional 'deepen') are
 transferred, clients MUST send a flush-pkt, to tell the server side
 that it is done sending the list.
diff --git a/Documentation/technical/protocol-capabilities.txt 
b/Documentation/technical/protocol-capabilities.txt
index 26dcc6f..332d209 100644
--- a/Documentation/technical/protocol-capabilities.txt
+++ b/Documentation/technical/protocol-capabilities.txt
@@ -309,3 +309,11 @@ to accept a signed push certificate, and asks the <nonce> 
to be
 included in the push certificate.  A send-pack client MUST NOT
 send a push-cert packet unless the receive-pack server advertises
 this capability.
+
+filter
+------
+
+If the upload-pack server advertises the 'filter' capability,
+fetch-pack may send "filter" commands to request a partial clone
+or partial fetch and request that the server omit various objects
+from the packfile.
diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
index a9298fd..f1fb57b 100644
--- a/list-objects-filter-options.c
+++ b/list-objects-filter-options.c
@@ -147,3 +147,29 @@ int opt_parse_list_objects_filter(const struct option *opt,
 
return parse_list_objects_filter(filter_options, arg);
 }
+
+/*
+ * The caller wants to pass the value of filter_options->raw_value
+ * to a subordinate program.  Encode the value if necessary to guard
+ * against injection attacks.
+ */
+void list_objects_filter_push_arg(
+   struct argv_array *args,
+   const struct list_objects_filter_options *filter_options)
+{
+   if (!filter_options->choice)
+   return;
+   if (!filter_options->raw_value || !*filter_options->raw_value)
+   return;
+
+   if (filter_options->requires_armor) {
+   struct strbuf buf = STRBUF_INIT;
+   armor_encode_arg(&buf, filter_options->raw_value);
+   argv_array_pushf(args, "--%s=%s", CL_ARG__FILTER, buf.buf);
+   strbuf_release(&buf);
+

[PATCH v4 01/10] extension.partialclone: introduce partial clone extension

2017-11-16 Thread Jeff Hostetler
From: Jonathan Tan 

Introduce new repository extension option:
`extensions.partialclone`

See the update to Documentation/technical/repository-version.txt
in this patch for more information.
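
For context (not part of the patch), a minimal sketch of how later code
can consult the new extension; the global matches the one added to
environment.c below, while handle_promisor_remote() is a hypothetical
placeholder:

    /*
     * Hedged sketch: the extension's value names the promisor remote.
     * handle_promisor_remote() is hypothetical, not a real function.
     */
    if (repository_format_partial_clone)
        handle_promisor_remote(repository_format_partial_clone);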

Signed-off-by: Jonathan Tan 
---
 Documentation/technical/repository-version.txt | 12 
 cache.h|  2 ++
 environment.c  |  1 +
 setup.c|  7 ++-
 4 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/Documentation/technical/repository-version.txt 
b/Documentation/technical/repository-version.txt
index 00ad379..e03eacc 100644
--- a/Documentation/technical/repository-version.txt
+++ b/Documentation/technical/repository-version.txt
@@ -86,3 +86,15 @@ for testing format-1 compatibility.
 When the config key `extensions.preciousObjects` is set to `true`,
 objects in the repository MUST NOT be deleted (e.g., by `git-prune` or
 `git repack -d`).
+
+`partialclone`
+~~
+
+When the config key `extensions.partialclone` is set, it indicates
+that the repo was created with a partial clone (or later performed
+a partial fetch) and that the remote may have omitted sending
+certain unwanted objects.  Such a remote is called a "promisor remote"
+and it promises that all such omitted objects can be fetched from it
+in the future.
+
+The value of this key is the name of the promisor remote.
diff --git a/cache.h b/cache.h
index 6440e2b..35e3f5e 100644
--- a/cache.h
+++ b/cache.h
@@ -860,10 +860,12 @@ extern int grafts_replace_parents;
 #define GIT_REPO_VERSION 0
 #define GIT_REPO_VERSION_READ 1
 extern int repository_format_precious_objects;
+extern char *repository_format_partial_clone;
 
 struct repository_format {
int version;
int precious_objects;
+   char *partial_clone; /* value of extensions.partialclone */
int is_bare;
char *work_tree;
struct string_list unknown_extensions;
diff --git a/environment.c b/environment.c
index 8289c25..e52aab3 100644
--- a/environment.c
+++ b/environment.c
@@ -27,6 +27,7 @@ int warn_ambiguous_refs = 1;
 int warn_on_object_refname_ambiguity = 1;
 int ref_paranoia = -1;
 int repository_format_precious_objects;
+char *repository_format_partial_clone;
 const char *git_commit_encoding;
 const char *git_log_output_encoding;
 const char *apply_default_whitespace;
diff --git a/setup.c b/setup.c
index 03f51e0..58536bd 100644
--- a/setup.c
+++ b/setup.c
@@ -420,7 +420,11 @@ static int check_repo_format(const char *var, const char 
*value, void *vdata)
;
else if (!strcmp(ext, "preciousobjects"))
data->precious_objects = git_config_bool(var, value);
-   else
+   else if (!strcmp(ext, "partialclone")) {
+   if (!value)
+   return config_error_nonbool(var);
+   data->partial_clone = xstrdup(value);
+   } else
string_list_append(&data->unknown_extensions, ext);
} else if (strcmp(var, "core.bare") == 0) {
data->is_bare = git_config_bool(var, value);
@@ -463,6 +467,7 @@ static int check_repository_format_gently(const char 
*gitdir, int *nongit_ok)
}
 
repository_format_precious_objects = candidate.precious_objects;
+   repository_format_partial_clone = candidate.partial_clone;
string_list_clear(&candidate.unknown_extensions, 0);
if (!has_common) {
if (candidate.is_bare != -1) {
-- 
2.9.3



[PATCH v4 00/10] Partial clone part 2: fsck and promisors

2017-11-16 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

This is part 2 of a 3 part sequence for partial clone.
Part 2 assumes part 1 is in place.

This patch series is labeled V4 to keep it in sync with the
V4 version of part 1.  (There was no V3 of this part.)

Part 2 is concerned with fsck, gc, initial support for dynamic
object fetching, and tracking promisor objects.  Jonathan Tan
originally developed this code.  I have moved it on top of
part 1 and updated it slightly.

Jonathan Tan (10):
  extension.partialclone: introduce partial clone extension
  fsck: introduce partialclone extension
  fsck: support refs pointing to promisor objects
  fsck: support referenced promisor objects
  fsck: support promisor objects as CLI argument
  index-pack: refactor writing of .keep files
  introduce fetch-object: fetch one promisor object
  sha1_file: support lazily fetching missing objects
  rev-list: support termination at promisor objects
  gc: do not repack promisor packfiles

 Documentation/git-pack-objects.txt |  12 +-
 Documentation/gitremote-helpers.txt|   6 +
 Documentation/rev-list-options.txt |  12 +-
 Documentation/technical/repository-version.txt |  12 +
 Makefile   |   1 +
 builtin/cat-file.c |   2 +
 builtin/fetch-pack.c   |  10 +
 builtin/fsck.c |  26 +-
 builtin/gc.c   |   3 +
 builtin/index-pack.c   | 113 
 builtin/pack-objects.c |  36 +++
 builtin/prune.c|   7 +
 builtin/repack.c   |   8 +-
 builtin/rev-list.c |  74 +-
 cache.h|  13 +-
 environment.c  |   1 +
 fetch-object.c |  26 ++
 fetch-object.h |   6 +
 fetch-pack.c   |   8 +-
 fetch-pack.h   |   2 +
 list-objects.c |  29 ++-
 object.c   |   2 +-
 packfile.c |  77 +-
 packfile.h |  13 +
 remote-curl.c  |  14 +-
 revision.c |  33 ++-
 revision.h |   5 +-
 setup.c|   7 +-
 sha1_file.c|  38 ++-
 t/t0410-partial-clone.sh   | 343 +
 transport.c|   8 +
 transport.h|   8 +
 32 files changed, 872 insertions(+), 83 deletions(-)
 create mode 100644 fetch-object.c
 create mode 100644 fetch-object.h
 create mode 100755 t/t0410-partial-clone.sh

-- 
2.9.3



[PATCH v4 09/10] rev-list: support termination at promisor objects

2017-11-16 Thread Jeff Hostetler
From: Jonathan Tan <jonathanta...@google.com>

Teach rev-list to support termination of an object traversal at any
object from a promisor remote (whether one that the local repo also has,
or one that the local repo knows about because it has another promisor
object that references it).

This will be used subsequently in gc and in the connectivity check used
by fetch.

For efficiency, if an object is referenced by a promisor object, and is
in the local repo only as a non-promisor object, object traversal will
not stop there. This is to avoid building the list of promisor object
references.

(In list-objects.c, the case where obj is NULL in process_blob() and
process_tree() do not need to be changed because those happen only when
there is a conflict between the expected type and the existing object.
If the object doesn't exist, an object will be synthesized, which is
fine.)
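
A minimal sketch of the stopping rule described above, assuming the
is_promisor_object() helper from earlier in this series (condensed,
not a literal excerpt from the diff):

    /*
     * Hedged sketch: a missing object is tolerated, and not recursed
     * into, only if it is promised by the promisor remote.
     */
    if (!has_object_file(&obj->oid)) {
        if (is_promisor_object(&obj->oid))
            return;    /* expected to be missing; stop traversal here */
        die("missing object '%s'", oid_to_hex(&obj->oid));
    }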

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 Documentation/rev-list-options.txt |  12 -
 builtin/rev-list.c |  74 ---
 list-objects.c |  29 ++-
 object.c   |   2 +-
 revision.c |  33 +++-
 revision.h |   5 +-
 t/t0410-partial-clone.sh   | 101 +
 7 files changed, 243 insertions(+), 13 deletions(-)

diff --git a/Documentation/rev-list-options.txt 
b/Documentation/rev-list-options.txt
index c84e465..2beffe3 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -730,7 +730,7 @@ specification contained in .
Only useful with `--filter=`; prints a list of the omitted objects.
Object IDs are prefixed with a ``~'' character.
 
---missing=(error|allow-any|print)::
+--missing=(error|allow-any|allow-promisor|print)::
Specifies how missing objects are handled.  The repository may
have missing objects after a partial clone, for example.
 +
@@ -741,10 +741,20 @@ The value 'allow-any' will allow object traversal to 
continue if a
 missing object is encountered.  Missing objects will silently be omitted
 from the results.
 +
+The value 'allow-promisor' is like 'allow-any' in that it will allow
+object traversal to continue, but only for EXPECTED missing objects.
++
 The value 'print' is like 'allow-any', but will also print a list of the
 missing objects.  Object IDs are prefixed with a ``?'' character.
 endif::git-rev-list[]
 
+--exclude-promisor-objects::
+   (For internal use only.)  Prefilter object traversal at
+   promisor boundary.  This is used with partial clone.  This is
+   stronger than `--missing=allow-promisor` because it limits the
+   traversal, rather than just silencing errors about missing
+   objects.
+
 --no-walk[=(sorted|unsorted)]::
Only show the given commits, but do not traverse their ancestors.
This has no effect if a range is specified. If the argument
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index da4a39b..d144d66 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -15,6 +15,7 @@
 #include "progress.h"
 #include "reflog-walk.h"
 #include "oidset.h"
+#include "packfile.h"
 
 static const char rev_list_usage[] =
 "git rev-list [OPTION] ... [ -- paths... ]\n"
@@ -67,6 +68,7 @@ enum missing_action {
MA_ERROR = 0,/* fail if any missing objects are encountered */
MA_ALLOW_ANY,/* silently allow ALL missing objects */
MA_PRINT,/* print ALL missing objects in special section */
+   MA_ALLOW_PROMISOR, /* silently allow all missing PROMISOR objects */
 };
 static enum missing_action arg_missing_action;
 
@@ -197,6 +199,12 @@ static void finish_commit(struct commit *commit, void 
*data)
 
 static inline void finish_object__ma(struct object *obj)
 {
+   /*
+* Whether or not we try to dynamically fetch missing objects
+* from the server, we currently DO NOT have the object.  We
+* can either print, allow (ignore), or conditionally allow
+* (ignore) them.
+*/
switch (arg_missing_action) {
case MA_ERROR:
die("missing blob object '%s'", oid_to_hex(>oid));
@@ -209,25 +217,36 @@ static inline void finish_object__ma(struct object *obj)
oidset_insert(_objects, >oid);
return;
 
+   case MA_ALLOW_PROMISOR:
+   if (is_promisor_object(>oid))
+   return;
+   die("unexpected missing blob object '%s'",
+   oid_to_hex(>oid));
+   return;
+
default:
BUG("unhandled missing_action");
return;
}
 }
 
-static void finish_object(struct object *obj, const char *name, void *cb_data)
+static int fini

[PATCH v4 08/10] sha1_file: support lazily fetching missing objects

2017-11-16 Thread Jeff Hostetler
From: Jonathan Tan 

Teach sha1_file to fetch objects from the remote configured in
extensions.partialclone whenever an object is requested but missing.

The fetching of objects can be suppressed through a global variable.
This is used by fsck and index-pack.

However, by default, such fetching is not suppressed. This is meant as a
temporary measure to ensure that all Git commands work in such a
situation. Future patches will update some commands to either tolerate
missing objects (without fetching them) or be more efficient in fetching
them.
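
A rough sketch of the fallback described above, under the assumption
that it lives in sha1_object_info_extended(); lookup_locally() is a
hypothetical stand-in for the existing loose/packed lookup and the
retry logic is simplified:

    /* Hedged sketch of the lazy fetch and the fetch_if_missing knob. */
    if (lookup_locally(sha1) < 0 &&
        fetch_if_missing && repository_format_partial_clone) {
        fetch_object(repository_format_partial_clone, sha1);
        return lookup_locally(sha1);    /* retry exactly once */
    }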

In order to determine the code changes in sha1_file.c necessary, I
investigated the following:
 (1) functions in sha1_file.c that take in a hash, without the user
 regarding how the object is stored (loose or packed)
 (2) functions in packfile.c (because I need to check callers that know
 about the loose/packed distinction and operate on both differently,
 and ensure that they can handle the concept of objects that are
 neither loose nor packed)

(1) is handled by the modification to sha1_object_info_extended().

For (2), I looked at for_each_packed_object and others.  For
for_each_packed_object, the callers either already work or are fixed in
this patch:
 - reachable - only to find recent objects
 - builtin/fsck - already knows about missing objects
 - builtin/cat-file - warning message added in this commit

Callers of the other functions do not need to be changed:
 - parse_pack_index
   - http - indirectly from http_get_info_packs
   - find_pack_entry_one
 - this searches a single pack that is provided as an argument; the
   caller already knows (through other means) that the sought object
   is in a specific pack
 - find_sha1_pack
   - fast-import - appears to be an optimization to not store a file if
 it is already in a pack
   - http-walker - to search through a struct alt_base
   - http-push - to search through remote packs
 - has_sha1_pack
   - builtin/fsck - already knows about promisor objects
   - builtin/count-objects - informational purposes only (check if loose
 object is also packed)
   - builtin/prune-packed - check if object to be pruned is packed (if
 not, don't prune it)
   - revision - used to exclude packed objects if requested by user
   - diff - just for optimization

Signed-off-by: Jonathan Tan 
---
 builtin/cat-file.c   |  2 ++
 builtin/fetch-pack.c |  2 ++
 builtin/fsck.c   |  3 +++
 builtin/index-pack.c |  6 ++
 cache.h  |  8 
 fetch-object.c   |  3 +++
 sha1_file.c  | 38 
 t/t0410-partial-clone.sh | 51 
 8 files changed, 100 insertions(+), 13 deletions(-)

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index f5fa4fd..cf9ea5c 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -475,6 +475,8 @@ static int batch_objects(struct batch_options *opt)
 
for_each_loose_object(batch_loose_object, , 0);
for_each_packed_object(batch_packed_object, , 0);
+   if (repository_format_partial_clone)
+   warning("This repository has extensions.partialClone 
set. Some objects may not be loaded.");
 
cb.opt = opt;
cb.expand = 
diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c
index 9f303cf..9a7ebf6 100644
--- a/builtin/fetch-pack.c
+++ b/builtin/fetch-pack.c
@@ -53,6 +53,8 @@ int cmd_fetch_pack(int argc, const char **argv, const char 
*prefix)
struct oid_array shallow = OID_ARRAY_INIT;
struct string_list deepen_not = STRING_LIST_INIT_DUP;
 
+   fetch_if_missing = 0;
+
packet_trace_identity("fetch-pack");
 
memset(&args, 0, sizeof(args));
diff --git a/builtin/fsck.c b/builtin/fsck.c
index 578a7c8..3b76c0e 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -678,6 +678,9 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
int i;
struct alternate_object_database *alt;
 
+   /* fsck knows how to handle missing promisor objects */
+   fetch_if_missing = 0;
+
errors_found = 0;
check_replace_refs = 0;
 
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 24c2f05..a0a35e6 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1657,6 +1657,12 @@ int cmd_index_pack(int argc, const char **argv, const 
char *prefix)
unsigned foreign_nr = 1;/* zero is a "good" value, assume bad */
int report_end_of_input = 0;
 
+   /*
+* index-pack never needs to fetch missing objects, since it only
+* accesses the repo to do hash collision checks
+*/
+   fetch_if_missing = 0;
+
if (argc == 2 && !strcmp(argv[1], "-h"))
usage(index_pack_usage);
 
diff --git a/cache.h b/cache.h
index c76f2e9..6980072 100644
--- a/cache.h
+++ b/cache.h
@@ -1727,6 +1727,14 

[PATCH v4 04/10] fsck: support referenced promisor objects

2017-11-16 Thread Jeff Hostetler
From: Jonathan Tan 

Teach fsck to not treat missing promisor objects indirectly pointed to
by refs as an error when extensions.partialclone is set.

Signed-off-by: Jonathan Tan 
---
 builtin/fsck.c   | 11 +++
 t/t0410-partial-clone.sh | 23 +++
 2 files changed, 34 insertions(+)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index ee937bb..4c2a56d 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -149,6 +149,15 @@ static int mark_object(struct object *obj, int type, void 
*data, struct fsck_opt
if (obj->flags & REACHABLE)
return 0;
obj->flags |= REACHABLE;
+
+   if (is_promisor_object(&obj->oid))
+   /*
+* Further recursion does not need to be performed on this
+* object since it is a promisor object (so it does not need to
+* be added to "pending").
+*/
+   return 0;
+
if (!(obj->flags & HAS_OBJ)) {
if (parent && !has_object_file(&obj->oid)) {
printf("broken link from %7s %s\n",
@@ -208,6 +217,8 @@ static void check_reachable_object(struct object *obj)
 * do a full fsck
 */
if (!(obj->flags & HAS_OBJ)) {
+   if (is_promisor_object(&obj->oid))
+   return;
if (has_sha1_pack(obj->oid.hash))
return; /* it is in pack - forget about it */
printf("missing %s %s\n", printable_type(obj),
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index bf75162..4f9931f 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -102,4 +102,27 @@ test_expect_success 'missing ref object, but promised, 
passes fsck' '
git -C repo fsck
 '
 
+test_expect_success 'missing object, but promised, passes fsck' '
+   rm -rf repo &&
+   test_create_repo repo &&
+   test_commit -C repo 1 &&
+   test_commit -C repo 2 &&
+   test_commit -C repo 3 &&
+   git -C repo tag -a annotated_tag -m "annotated tag" &&
+
+   C=$(git -C repo rev-parse 1) &&
+   T=$(git -C repo rev-parse 2^{tree}) &&
+   B=$(git hash-object repo/3.t) &&
+   AT=$(git -C repo rev-parse annotated_tag) &&
+
+   promise_and_delete "$C" &&
+   promise_and_delete "$T" &&
+   promise_and_delete "$B" &&
+   promise_and_delete "$AT" &&
+
+   git -C repo config core.repositoryformatversion 1 &&
+   git -C repo config extensions.partialclone "arbitrary string" &&
+   git -C repo fsck
+'
+
 test_done
-- 
2.9.3



[PATCH v4 10/10] gc: do not repack promisor packfiles

2017-11-16 Thread Jeff Hostetler
From: Jonathan Tan <jonathanta...@google.com>

Teach gc to stop traversal at promisor objects, and to leave promisor
packfiles alone. This has the effect of only repacking non-promisor
packfiles, and preserves the distinction between promisor packfiles and
non-promisor packfiles.
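
A minimal sketch of the repack-side rule described above, using the
pack_promisor bit introduced earlier in this series; the helper that
collects packs to consolidate is hypothetical:

    /* Hedged sketch: packs with a .promisor file are left alone. */
    for (p = packed_git; p; p = p->next) {
        if (p->pack_promisor)
            continue;    /* never repack promisor packfiles */
        add_pack_to_repack_list(p);    /* hypothetical helper */
    }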

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 Documentation/git-pack-objects.txt | 12 -
 builtin/gc.c   |  3 +++
 builtin/pack-objects.c | 36 ++
 builtin/prune.c|  7 +
 builtin/repack.c   |  8 --
 t/t0410-partial-clone.sh   | 52 +-
 6 files changed, 114 insertions(+), 4 deletions(-)

diff --git a/Documentation/git-pack-objects.txt 
b/Documentation/git-pack-objects.txt
index 5fad696..33a824e 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -242,9 +242,19 @@ So does `git bundle` (see linkgit:git-bundle[1]) when it 
creates a bundle.
the resulting packfile.  See linkgit:git-rev-list[1] for valid
`` forms.
 
---missing=(error|allow-any):
+--missing=(error|allow-any|allow-promisor):
Specifies how missing objects are handled.  This is useful, for
example, when there are missing objects from a prior partial clone.
+   This is stronger than `--missing=allow-promisor` because it limits
+   the traversal, rather than just silencing errors about missing
+   objects.
+
+--exclude-promisor-objects::
+   Omit objects that are known to be in the promisor remote.  (This
+   option has the purpose of operating only on locally created objects,
+   so that when we repack, we still maintain a distinction between
+   locally created objects [without .promisor] and objects from the
+   promisor remote [with .promisor].)  This is used with partial clone.
 
 SEE ALSO
 
diff --git a/builtin/gc.c b/builtin/gc.c
index 3c5eae0..77fa720 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -458,6 +458,9 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
argv_array_push(&prune, prune_expire);
if (quiet)
argv_array_push(&prune, "--no-progress");
+   if (repository_format_partial_clone)
+   argv_array_push(&prune,
+   "--exclude-promisor-objects");
if (run_command_v_opt(prune.argv, RUN_GIT_CMD))
return error(FAILED_RUN, prune.argv[0]);
}
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 45ad35d..4534209 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -75,6 +75,8 @@ static int use_bitmap_index = -1;
 static int write_bitmap_index;
 static uint16_t write_bitmap_options;
 
+static int exclude_promisor_objects;
+
 static unsigned long delta_cache_size = 0;
 static unsigned long max_delta_cache_size = 256 * 1024 * 1024;
 static unsigned long cache_max_small_delta_size = 1000;
@@ -86,6 +88,7 @@ static struct list_objects_filter_options filter_options;
 enum missing_action {
MA_ERROR = 0,/* fail if any missing objects are encountered */
MA_ALLOW_ANY,/* silently allow ALL missing objects */
+   MA_ALLOW_PROMISOR, /* silently allow all missing PROMISOR objects */
 };
 static enum missing_action arg_missing_action;
 static show_object_fn fn_show_object;
@@ -2577,6 +2580,20 @@ static void show_object__ma_allow_any(struct object 
*obj, const char *name, void
show_object(obj, name, data);
 }
 
+static void show_object__ma_allow_promisor(struct object *obj, const char 
*name, void *data)
+{
+   assert(arg_missing_action == MA_ALLOW_PROMISOR);
+
+   /*
+* Quietly ignore EXPECTED missing objects.  This avoids problems with
+* staging them now and getting an odd error later.
+*/
+   if (!has_object_file(&obj->oid) && is_promisor_object(&obj->oid))
+   return;
+
+   show_object(obj, name, data);
+}
+
 static int option_parse_missing_action(const struct option *opt,
   const char *arg, int unset)
 {
@@ -2591,10 +2608,18 @@ static int option_parse_missing_action(const struct 
option *opt,
 
if (!strcmp(arg, "allow-any")) {
arg_missing_action = MA_ALLOW_ANY;
+   fetch_if_missing = 0;
fn_show_object = show_object__ma_allow_any;
return 0;
}
 
+   if (!strcmp(arg, "allow-promisor")) {
+   arg_missing_action = MA_ALLOW_PROMISOR;
+   fetch_if_missing = 0;
+   fn_show_object = show_object__ma_allow_promisor;
+   return 0;
+   }
+
die(_("invalid value for --mi

[PATCH v4 07/10] introduce fetch-object: fetch one promisor object

2017-11-16 Thread Jeff Hostetler
From: Jonathan Tan 

Introduce fetch-object, providing the ability to fetch one object from a
promisor remote.

This uses fetch-pack. To do this, the transport mechanism has been
updated with 2 flags, "from-promisor" to indicate that the resulting
pack comes from a promisor remote (and thus should be annotated as such
by index-pack), and "no-haves" to suppress the sending of "have" lines.

This will be tested in a subsequent commit.

NEEDSWORK: update this when we have more information about protocol v2,
which should allow a way to suppress the ref advertisement and
officially allow any object type to be "want"-ed.
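
A minimal usage sketch of the new helper (declaration as added to
fetch-object.h in this patch; the calling function is illustrative):

    #include "fetch-object.h"

    /*
     * Hedged sketch: fetch one promised object on demand from the
     * promisor remote recorded in extensions.partialclone.
     */
    static void demand_fetch_one(const unsigned char *sha1)
    {
        fetch_object(repository_format_partial_clone, sha1);
    }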

Signed-off-by: Jonathan Tan 
---
 Documentation/gitremote-helpers.txt |  6 ++
 Makefile|  1 +
 builtin/fetch-pack.c|  8 
 builtin/index-pack.c| 16 +---
 fetch-object.c  | 23 +++
 fetch-object.h  |  6 ++
 fetch-pack.c|  8 ++--
 fetch-pack.h|  2 ++
 remote-curl.c   | 14 +-
 transport.c |  8 
 transport.h |  8 
 11 files changed, 94 insertions(+), 6 deletions(-)
 create mode 100644 fetch-object.c
 create mode 100644 fetch-object.h

diff --git a/Documentation/gitremote-helpers.txt 
b/Documentation/gitremote-helpers.txt
index 4a584f3..1ceab89 100644
--- a/Documentation/gitremote-helpers.txt
+++ b/Documentation/gitremote-helpers.txt
@@ -466,6 +466,12 @@ set by Git if the remote helper has the 'option' 
capability.
Transmit  as a push option. As the push option
must not contain LF or NUL characters, the string is not encoded.
 
+'option from-promisor' {'true'|'false'}::
+   Indicate that these objects are being fetched from a promisor.
+
+'option no-haves' {'true'|'false'}::
+   Do not send "have" lines.
+
 SEE ALSO
 
 linkgit:git-remote[1]
diff --git a/Makefile b/Makefile
index ca378a4..795e0c7 100644
--- a/Makefile
+++ b/Makefile
@@ -792,6 +792,7 @@ LIB_OBJS += ewah/ewah_bitmap.o
 LIB_OBJS += ewah/ewah_io.o
 LIB_OBJS += ewah/ewah_rlw.o
 LIB_OBJS += exec_cmd.o
+LIB_OBJS += fetch-object.o
 LIB_OBJS += fetch-pack.o
 LIB_OBJS += fsck.o
 LIB_OBJS += gettext.o
diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c
index 366b9d1..9f303cf 100644
--- a/builtin/fetch-pack.c
+++ b/builtin/fetch-pack.c
@@ -143,6 +143,14 @@ int cmd_fetch_pack(int argc, const char **argv, const char 
*prefix)
args.update_shallow = 1;
continue;
}
+   if (!strcmp("--from-promisor", arg)) {
+   args.from_promisor = 1;
+   continue;
+   }
+   if (!strcmp("--no-haves", arg)) {
+   args.no_haves = 1;
+   continue;
+   }
usage(fetch_pack_usage);
}
if (deepen_not.nr)
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 4f305a7..24c2f05 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1429,14 +1429,16 @@ static void write_special_file(const char *suffix, 
const char *msg,
if (close(fd) != 0)
die_errno(_("cannot close written %s file '%s'"),
  suffix, filename);
-   *report = suffix;
+   if (report)
+   *report = suffix;
}
strbuf_release(_buf);
 }
 
 static void final(const char *final_pack_name, const char *curr_pack_name,
  const char *final_index_name, const char *curr_index_name,
- const char *keep_msg, unsigned char *sha1)
+ const char *keep_msg, const char *promisor_msg,
+ unsigned char *sha1)
 {
const char *report = "pack";
struct strbuf pack_name = STRBUF_INIT;
@@ -1455,6 +1457,9 @@ static void final(const char *final_pack_name, const char 
*curr_pack_name,
if (keep_msg)
write_special_file("keep", keep_msg, final_pack_name, sha1,
   &report);
+   if (promisor_msg)
+   write_special_file("promisor", promisor_msg, final_pack_name,
+  sha1, NULL);
 
if (final_pack_name != curr_pack_name) {
if (!final_pack_name)
@@ -1644,6 +1649,7 @@ int cmd_index_pack(int argc, const char **argv, const 
char *prefix)
const char *curr_index;
const char *index_name = NULL, *pack_name = NULL;
const char *keep_msg = NULL;
+   const char *promisor_msg = NULL;
struct strbuf index_name_buf = STRBUF_INIT;
struct pack_idx_entry **idx_objects;
struct pack_idx_option opts;
@@ -1693,6 +1699,10 @@ int cmd_index_pack(int argc, const char **argv, const 
char *prefix)
 

[PATCH v4 02/10] fsck: introduce partialclone extension

2017-11-16 Thread Jeff Hostetler
From: Jonathan Tan 

Currently, Git does not handle very well repos with very large numbers
of objects, or repos that wish to minimize manipulation of certain
blobs (for example, because they are very large), even when the user
operates mostly on part of the repo.  This is because Git is designed
on the assumption that every referenced object is available somewhere
in the repo storage.  In a partial clone, the full set of objects is
instead usually available in remote storage, ready to be lazily
downloaded.

Teach fsck about the new state of affairs. In this commit, teach fsck
that missing promisor objects referenced from the reflog are not an
error case; in future commits, fsck will be taught about other cases.
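
A condensed sketch of the relaxed reflog check described above (not a
literal excerpt from the diff below):

    /*
     * Hedged sketch: an unparseable reflog entry is only an error if
     * the object is not promised by a promisor packfile.
     */
    if (!obj && !is_promisor_object(oid)) {
        error("%s: invalid reflog entry %s", refname, oid_to_hex(oid));
        errors_found |= ERROR_REACHABLE;
    }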

Signed-off-by: Jonathan Tan 
---
 builtin/fsck.c   |  2 +-
 cache.h  |  3 +-
 packfile.c   | 77 +++--
 packfile.h   | 13 
 t/t0410-partial-clone.sh | 81 
 5 files changed, 171 insertions(+), 5 deletions(-)
 create mode 100755 t/t0410-partial-clone.sh

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 56afe40..2934299 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -398,7 +398,7 @@ static void fsck_handle_reflog_oid(const char *refname, 
struct object_id *oid,
xstrfmt("%s@{%"PRItime"}", refname, 
timestamp));
obj->flags |= USED;
mark_object_reachable(obj);
-   } else {
+   } else if (!is_promisor_object(oid)) {
error("%s: invalid reflog entry %s", refname, 
oid_to_hex(oid));
errors_found |= ERROR_REACHABLE;
}
diff --git a/cache.h b/cache.h
index 35e3f5e..c76f2e9 100644
--- a/cache.h
+++ b/cache.h
@@ -1587,7 +1587,8 @@ extern struct packed_git {
unsigned pack_local:1,
 pack_keep:1,
 freshened:1,
-do_not_close:1;
+do_not_close:1,
+pack_promisor:1;
unsigned char sha1[20];
struct revindex_entry *revindex;
/* something like ".git/objects/pack/x.pack" */
diff --git a/packfile.c b/packfile.c
index 4a5fe7a..234797c 100644
--- a/packfile.c
+++ b/packfile.c
@@ -8,6 +8,11 @@
 #include "list.h"
 #include "streaming.h"
 #include "sha1-lookup.h"
+#include "commit.h"
+#include "object.h"
+#include "tag.h"
+#include "tree-walk.h"
+#include "tree.h"
 
 char *odb_pack_name(struct strbuf *buf,
const unsigned char *sha1,
@@ -643,10 +648,10 @@ struct packed_git *add_packed_git(const char *path, 
size_t path_len, int local)
return NULL;
 
/*
-* ".pack" is long enough to hold any suffix we're adding (and
+* ".promisor" is long enough to hold any suffix we're adding (and
 * the use xsnprintf double-checks that)
 */
-   alloc = st_add3(path_len, strlen(".pack"), 1);
+   alloc = st_add3(path_len, strlen(".promisor"), 1);
p = alloc_packed_git(alloc);
memcpy(p->pack_name, path, path_len);
 
@@ -654,6 +659,10 @@ struct packed_git *add_packed_git(const char *path, size_t 
path_len, int local)
if (!access(p->pack_name, F_OK))
p->pack_keep = 1;
 
+   xsnprintf(p->pack_name + path_len, alloc - path_len, ".promisor");
+   if (!access(p->pack_name, F_OK))
+   p->pack_promisor = 1;
+
xsnprintf(p->pack_name + path_len, alloc - path_len, ".pack");
if (stat(p->pack_name, &st) || !S_ISREG(st.st_mode)) {
free(p);
@@ -781,7 +790,8 @@ static void prepare_packed_git_one(char *objdir, int local)
if (ends_with(de->d_name, ".idx") ||
ends_with(de->d_name, ".pack") ||
ends_with(de->d_name, ".bitmap") ||
-   ends_with(de->d_name, ".keep"))
+   ends_with(de->d_name, ".keep") ||
+   ends_with(de->d_name, ".promisor"))
string_list_append(&garbage, path.buf);
else
report_garbage(PACKDIR_FILE_GARBAGE, path.buf);
@@ -1889,6 +1899,9 @@ int for_each_packed_object(each_packed_object_fn cb, void 
*data, unsigned flags)
for (p = packed_git; p; p = p->next) {
if ((flags & FOR_EACH_OBJECT_LOCAL_ONLY) && !p->pack_local)
continue;
+   if ((flags & FOR_EACH_OBJECT_PROMISOR_ONLY) &&
+   !p->pack_promisor)
+   continue;
if (open_pack_index(p)) {
pack_errors = 1;
continue;
@@ -1899,3 +1912,61 @@ int for_each_packed_object(each_packed_object_fn cb, 
void *data, unsigned flags)
}
return r ? r : pack_errors;
 }
+
+static int add_promisor_object(const struct object_id *oid,
+   

[PATCH v4 03/10] fsck: support refs pointing to promisor objects

2017-11-16 Thread Jeff Hostetler
From: Jonathan Tan 

Teach fsck to not treat refs referring to missing promisor objects as an
error when extensions.partialclone is set.

For the purposes of warning about no default refs, such refs are still
treated as legitimate refs.

Signed-off-by: Jonathan Tan 
---
 builtin/fsck.c   |  8 
 t/t0410-partial-clone.sh | 24 
 2 files changed, 32 insertions(+)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 2934299..ee937bb 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -434,6 +434,14 @@ static int fsck_handle_ref(const char *refname, const 
struct object_id *oid,
 
obj = parse_object(oid);
if (!obj) {
+   if (is_promisor_object(oid)) {
+   /*
+* Increment default_refs anyway, because this is a
+* valid ref.
+*/
+default_refs++;
+return 0;
+   }
error("%s: invalid sha1 pointer %s", refname, oid_to_hex(oid));
errors_found |= ERROR_REACHABLE;
/* We'll continue with the rest despite the error.. */
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 3ddb3b9..bf75162 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -13,6 +13,14 @@ pack_as_from_promisor () {
>repo/.git/objects/pack/pack-$HASH.promisor
 }
 
+promise_and_delete () {
+   HASH=$(git -C repo rev-parse "$1") &&
+   git -C repo tag -a -m message my_annotated_tag "$HASH" &&
+   git -C repo rev-parse my_annotated_tag | pack_as_from_promisor &&
+   git -C repo tag -d my_annotated_tag &&
+   delete_object repo "$HASH"
+}
+
 test_expect_success 'missing reflog object, but promised by a commit, passes 
fsck' '
test_create_repo repo &&
test_commit -C repo my_commit &&
@@ -78,4 +86,20 @@ test_expect_success 'missing reflog object alone fails fsck, 
even with extension
test_must_fail git -C repo fsck
 '
 
+test_expect_success 'missing ref object, but promised, passes fsck' '
+   rm -rf repo &&
+   test_create_repo repo &&
+   test_commit -C repo my_commit &&
+
+   A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
+
+   # Reference $A only from ref
+   git -C repo branch my_branch "$A" &&
+   promise_and_delete "$A" &&
+
+   git -C repo config core.repositoryformatversion 1 &&
+   git -C repo config extensions.partialclone "arbitrary string" &&
+   git -C repo fsck
+'
+
 test_done
-- 
2.9.3



[PATCH v4 05/10] fsck: support promisor objects as CLI argument

2017-11-16 Thread Jeff Hostetler
From: Jonathan Tan 

Teach fsck to not treat missing promisor objects provided on the CLI as
an error when extensions.partialclone is set.

Signed-off-by: Jonathan Tan 
---
 builtin/fsck.c   |  2 ++
 t/t0410-partial-clone.sh | 13 +
 2 files changed, 15 insertions(+)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 4c2a56d..578a7c8 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -750,6 +750,8 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
struct object *obj = lookup_object(oid.hash);
 
if (!obj || !(obj->flags & HAS_OBJ)) {
+   if (is_promisor_object(&oid))
+   continue;
error("%s: object missing", oid_to_hex(&oid));
errors_found |= ERROR_OBJECT;
continue;
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 4f9931f..e96f436 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -125,4 +125,17 @@ test_expect_success 'missing object, but promised, passes 
fsck' '
git -C repo fsck
 '
 
+test_expect_success 'missing CLI object, but promised, passes fsck' '
+   rm -rf repo &&
+   test_create_repo repo &&
+   test_commit -C repo my_commit &&
+
+   A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
+   promise_and_delete "$A" &&
+
+   git -C repo config core.repositoryformatversion 1 &&
+   git -C repo config extensions.partialclone "arbitrary string" &&
+   git -C repo fsck "$A"
+'
+
 test_done
-- 
2.9.3



[PATCH v4 06/10] index-pack: refactor writing of .keep files

2017-11-16 Thread Jeff Hostetler
From: Jonathan Tan 

In a subsequent commit, index-pack will be taught to write ".promisor"
files which are similar to the ".keep" files it knows how to write.
Refactor the writing of ".keep" files, so that the implementation of
writing ".promisor" files becomes easier.

Signed-off-by: Jonathan Tan 
---
 builtin/index-pack.c | 99 
 1 file changed, 53 insertions(+), 46 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 8ec459f..4f305a7 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1389,15 +1389,58 @@ static void fix_unresolved_deltas(struct sha1file *f)
free(sorted_by_pos);
 }
 
+static const char *derive_filename(const char *pack_name, const char *suffix,
+  struct strbuf *buf)
+{
+   size_t len;
+   if (!strip_suffix(pack_name, ".pack", ))
+   die(_("packfile name '%s' does not end with '.pack'"),
+   pack_name);
+   strbuf_add(buf, pack_name, len);
+   strbuf_addch(buf, '.');
+   strbuf_addstr(buf, suffix);
+   return buf->buf;
+}
+
+static void write_special_file(const char *suffix, const char *msg,
+  const char *pack_name, const unsigned char *sha1,
+  const char **report)
+{
+   struct strbuf name_buf = STRBUF_INIT;
+   const char *filename;
+   int fd;
+   int msg_len = strlen(msg);
+
+   if (pack_name)
+   filename = derive_filename(pack_name, suffix, &name_buf);
+   else
+   filename = odb_pack_name(&name_buf, sha1, suffix);
+
+   fd = odb_pack_keep(filename);
+   if (fd < 0) {
+   if (errno != EEXIST)
+   die_errno(_("cannot write %s file '%s'"),
+ suffix, filename);
+   } else {
+   if (msg_len > 0) {
+   write_or_die(fd, msg, msg_len);
+   write_or_die(fd, "\n", 1);
+   }
+   if (close(fd) != 0)
+   die_errno(_("cannot close written %s file '%s'"),
+ suffix, filename);
+   *report = suffix;
+   }
+   strbuf_release(_buf);
+}
+
 static void final(const char *final_pack_name, const char *curr_pack_name,
  const char *final_index_name, const char *curr_index_name,
- const char *keep_name, const char *keep_msg,
- unsigned char *sha1)
+ const char *keep_msg, unsigned char *sha1)
 {
const char *report = "pack";
struct strbuf pack_name = STRBUF_INIT;
struct strbuf index_name = STRBUF_INIT;
-   struct strbuf keep_name_buf = STRBUF_INIT;
int err;
 
if (!from_stdin) {
@@ -1409,28 +1452,9 @@ static void final(const char *final_pack_name, const 
char *curr_pack_name,
die_errno(_("error while closing pack file"));
}
 
-   if (keep_msg) {
-   int keep_fd, keep_msg_len = strlen(keep_msg);
-
-   if (!keep_name)
-   keep_name = odb_pack_name(&keep_name_buf, sha1, "keep");
-
-   keep_fd = odb_pack_keep(keep_name);
-   if (keep_fd < 0) {
-   if (errno != EEXIST)
-   die_errno(_("cannot write keep file '%s'"),
- keep_name);
-   } else {
-   if (keep_msg_len > 0) {
-   write_or_die(keep_fd, keep_msg, keep_msg_len);
-   write_or_die(keep_fd, "\n", 1);
-   }
-   if (close(keep_fd) != 0)
-   die_errno(_("cannot close written keep file 
'%s'"),
- keep_name);
-   report = "keep";
-   }
-   }
+   if (keep_msg)
+   write_special_file("keep", keep_msg, final_pack_name, sha1,
+  &report);
 
if (final_pack_name != curr_pack_name) {
if (!final_pack_name)
@@ -1472,7 +1496,6 @@ static void final(const char *final_pack_name, const char 
*curr_pack_name,
 
strbuf_release(&pack_name);
strbuf_release(&index_name);
-   strbuf_release(&keep_name_buf);
 }
 
 static int git_index_pack_config(const char *k, const char *v, void *cb)
@@ -1615,26 +1638,13 @@ static void show_pack_info(int stat_only)
}
 }
 
-static const char *derive_filename(const char *pack_name, const char *suffix,
-  struct strbuf *buf)
-{
-   size_t len;
-   if (!strip_suffix(pack_name, ".pack", ))
-   die(_("packfile name '%s' does not end with '.pack'"),
-   pack_name);
-   strbuf_add(buf, pack_name, len);
-   strbuf_addstr(buf, suffix);
-   

[PATCH v4 4/6] list-objects: filter objects in traverse_commit_list

2017-11-16 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Create traverse_commit_list_filtered() and add filtering
interface to allow certain objects to be omitted from the
traversal.

Update traverse_commit_list() to be a wrapper for the above
with a null filter to minimize the number of callers that
needed to be changed.

Object filtering will be used in a future commit by rev-list
and pack-objects for partial clone and fetch to omit unwanted
objects from the result.

traverse_bitmap_commit_list() does not work with filtering.
If a packfile bitmap is present, it will not be used.  It
should be possible to extend such support in the future (at
least to simple filters that do not require object pathnames),
but that is beyond the scope of this patch series.
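
A minimal caller sketch, with the argument order taken from the
pack-objects hunk later in this series (the oidset that records the
omitted objects is optional and may be NULL):

    /* Hedged sketch: filtered traversal that also records omissions. */
    struct oidset omitted = OIDSET_INIT;

    traverse_commit_list_filtered(&filter_options, &revs,
                                  show_commit, show_object,
                                  show_data, &omitted);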

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 Makefile  |   2 +
 list-objects-filter-options.c | 149 
 list-objects-filter-options.h |  57 ++
 list-objects-filter.c | 401 ++
 list-objects-filter.h |  77 
 list-objects.c|  95 --
 list-objects.h|  13 +-
 object.h  |   1 +
 8 files changed, 778 insertions(+), 17 deletions(-)
 create mode 100644 list-objects-filter-options.c
 create mode 100644 list-objects-filter-options.h
 create mode 100644 list-objects-filter.c
 create mode 100644 list-objects-filter.h

diff --git a/Makefile b/Makefile
index cd75985..ca378a4 100644
--- a/Makefile
+++ b/Makefile
@@ -807,6 +807,8 @@ LIB_OBJS += levenshtein.o
 LIB_OBJS += line-log.o
 LIB_OBJS += line-range.o
 LIB_OBJS += list-objects.o
+LIB_OBJS += list-objects-filter.o
+LIB_OBJS += list-objects-filter-options.o
 LIB_OBJS += ll-merge.o
 LIB_OBJS += lockfile.o
 LIB_OBJS += log-tree.o
diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
new file mode 100644
index 000..a9298fd
--- /dev/null
+++ b/list-objects-filter-options.c
@@ -0,0 +1,149 @@
+#include "cache.h"
+#include "commit.h"
+#include "config.h"
+#include "revision.h"
+#include "argv-array.h"
+#include "list-objects.h"
+#include "list-objects-filter.h"
+#include "list-objects-filter-options.h"
+
+/*
+ * Return 1 if the given string needs armoring because of "special"
+ * characters that may cause injection problems when a command passes
+ * the argument to a subordinate command (such as when upload-pack
+ * launches pack-objects).
+ *
+ * The usual alphanumeric and key punctuation do not trigger it.
+ */ 
+static int arg_needs_armor(const char *arg)
+{
+   const unsigned char *p;
+
+   for (p = (const unsigned char *)arg; *p; p++) {
+   if (*p >= 'a' && *p <= 'z')
+   continue;
+   if (*p >= 'A' && *p <= 'Z')
+   continue;
+   if (*p >= '0' && *p <= '9')
+   continue;
+   if (*p == '-' || *p == '_' || *p == '.' || *p == '/')
+   continue;
+
+   return 1;
+   }
+   return 0;
+}
+
+void armor_encode_arg(struct strbuf *buf, const char *arg)
+{
+   static const char hex[] = "0123456789abcdef";
+   const unsigned char *p;
+
+   for (p = (const unsigned char *)arg; *p; p++) {
+   unsigned int val = *p;
+   strbuf_addch(buf, hex[val >> 4]);
+   strbuf_addch(buf, hex[val & 0xf]);
+   }
+}
+
+int armor_decode_arg(struct strbuf *buf, const char *arg)
+{
+   const char *p;
+
+   for (p = arg; *p; p += 2) {
+   int val = hex2chr(p);
+   unsigned char ch;
+   if (val < 0)
+   return -1;
+   ch = val;
+   strbuf_addch(buf, ch);
+   }
+   return 0;
+}
+
+/*
+ * Parse the value of the argument to the "filter" keyword.
+ * On the command line this looks like:
+ *   --filter=
+ * and in the pack protocol as:
+ *   "filter" SP 
+ *
+ * The filter keyword will be used by many commands.
+ * See Documentation/rev-list-options.txt for allowed values for .
+ *
+ * Capture the given arg as the "raw_value".  This can be forwarded to
+ * subordinate commands when necessary.  We also "intern" the arg for
+ * the convenience of the current command.
+ */
+int parse_list_objects_filter(struct list_objects_filter_options 
*filter_options,
+ const char *arg)
+{
+   const char *v0;
+
+   if (filter_options->choice)
+   die(_("multiple object filter types cannot be combined"));
+
+   filter_options->raw_value = strdup(arg);
+
+   if (!strcmp(arg, "blob:none")) {
+   filter_options->choice = LOFC_BLOB_NONE;
+   return 0;
+   }
+
+   if (ski

[PATCH v4 0/6] Partial clone part 1: object filtering

2017-11-16 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Here is V4 of the list-object filtering, rev-list, and pack-objects
series.

This version addresses comments on the V3 series.

This version replaces the code to scan and reject the filter-spec
for injection characters with a new hex-encoding technique.  The
purpose of this is only to guard against injection attacks containing
characters like semicolon, quotes, spaces, and etc. when a filter-spec
is handed to a subordinate command.  It does not eliminate the need
for the recipient to validate the contents.
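
For illustration (not part of the series), the armoring amounts to
plain byte-wise hex encoding; assuming the helpers introduced in patch
4/6:

    /* Hedged sketch: "blob:none" armors to "626c6f623a6e6f6e65". */
    struct strbuf armored = STRBUF_INIT;

    armor_encode_arg(&armored, "blob:none");
    /* forwarded as --filter=626c6f623a6e6f6e65 and decoded, then
       validated, by the receiving command */
    strbuf_release(&armored);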

This version also combines the various command line flags for
handling missing objects into a single --missing={error,print,allow-any}
flag.


Jeff Hostetler (6):
  dir: allow exclusions from blob in addition to file
  oidmap: add oidmap iterator methods
  oidset: add iterator methods to oidset
  list-objects: filter objects in traverse_commit_list
  rev-list: add list-objects filtering support
  pack-objects: add list-objects filtering

 Documentation/git-pack-objects.txt |  12 +-
 Documentation/git-rev-list.txt |   4 +-
 Documentation/rev-list-options.txt |  37 +++
 Makefile   |   2 +
 builtin/pack-objects.c |  64 +-
 builtin/rev-list.c | 108 -
 dir.c  | 132 ---
 dir.h  |   3 +
 list-objects-filter-options.c  | 149 
 list-objects-filter-options.h  |  57 +
 list-objects-filter.c  | 401 +
 list-objects-filter.h  |  77 +++
 list-objects.c |  95 ++--
 list-objects.h |  13 +-
 object.h   |   1 +
 oidmap.h   |  22 ++
 oidset.c   |  10 +
 oidset.h   |  36 +++
 t/t5317-pack-objects-filter-objects.sh | 375 ++
 t/t6112-rev-list-filters-objects.sh| 225 ++
 20 files changed, 1770 insertions(+), 53 deletions(-)
 create mode 100644 list-objects-filter-options.c
 create mode 100644 list-objects-filter-options.h
 create mode 100644 list-objects-filter.c
 create mode 100644 list-objects-filter.h
 create mode 100755 t/t5317-pack-objects-filter-objects.sh
 create mode 100755 t/t6112-rev-list-filters-objects.sh

-- 
2.9.3



[PATCH v4 6/6] pack-objects: add list-objects filtering

2017-11-16 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Teach pack-objects to use the filtering provided by the
traverse_commit_list_filtered() interface to omit unwanted
objects from the resulting packfile.

This feature is intended for partial clone/fetch.

Filtering requires the use of the "--stdout" option.

Add t5317 test.
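
A hedged sketch of the --stdout restriction mentioned above (variable
names follow pack-objects; the exact message in the patch may differ):

    /* Hedged sketch: filtering is refused for on-disk packfiles. */
    if (filter_options.choice && !pack_to_stdout)
        die("cannot use --filter without --stdout");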

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 Documentation/git-pack-objects.txt |  12 +-
 builtin/pack-objects.c |  64 +-
 t/t5317-pack-objects-filter-objects.sh | 375 +
 3 files changed, 449 insertions(+), 2 deletions(-)
 create mode 100755 t/t5317-pack-objects-filter-objects.sh

diff --git a/Documentation/git-pack-objects.txt 
b/Documentation/git-pack-objects.txt
index 473a161..5fad696 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -12,7 +12,8 @@ SYNOPSIS
 'git pack-objects' [-q | --progress | --all-progress] [--all-progress-implied]
[--no-reuse-delta] [--delta-base-offset] [--non-empty]
[--local] [--incremental] [--window=] [--depth=]
-   [--revs [--unpacked | --all]] [--stdout | base-name]
+   [--revs [--unpacked | --all]]
+   [--stdout [--filter=] | base-name]
[--shallow] [--keep-true-parents] < object-list
 
 
@@ -236,6 +237,15 @@ So does `git bundle` (see linkgit:git-bundle[1]) when it 
creates a bundle.
With this option, parents that are hidden by grafts are packed
nevertheless.
 
+--filter=::
+   Requires `--stdout`.  Omits certain objects (usually blobs) from
+   the resulting packfile.  See linkgit:git-rev-list[1] for valid
+   `` forms.
+
+--missing=(error|allow-any):
+   Specifies how missing objects are handled.  This is useful, for
+   example, when there are missing objects from a prior partial clone.
+
 SEE ALSO
 
 linkgit:git-rev-list[1]
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 6e77dfd..45ad35d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -15,6 +15,8 @@
 #include "diff.h"
 #include "revision.h"
 #include "list-objects.h"
+#include "list-objects-filter.h"
+#include "list-objects-filter-options.h"
 #include "pack-objects.h"
 #include "progress.h"
 #include "refs.h"
@@ -79,6 +81,15 @@ static unsigned long cache_max_small_delta_size = 1000;
 
 static unsigned long window_memory_limit = 0;
 
+static struct list_objects_filter_options filter_options;
+
+enum missing_action {
+   MA_ERROR = 0,/* fail if any missing objects are encountered */
+   MA_ALLOW_ANY,/* silently allow ALL missing objects */
+};
+static enum missing_action arg_missing_action;
+static show_object_fn fn_show_object;
+
 /*
  * stats
  */
@@ -2552,6 +2563,42 @@ static void show_object(struct object *obj, const char 
*name, void *data)
obj->flags |= OBJECT_ADDED;
 }
 
+static void show_object__ma_allow_any(struct object *obj, const char *name, 
void *data)
+{
+   assert(arg_missing_action == MA_ALLOW_ANY);
+
+   /*
+* Quietly ignore ALL missing objects.  This avoids problems with
+* staging them now and getting an odd error later.
+*/
+   if (!has_object_file(&obj->oid))
+   return;
+
+   show_object(obj, name, data);
+}
+
+static int option_parse_missing_action(const struct option *opt,
+  const char *arg, int unset)
+{
+   assert(arg);
+   assert(!unset);
+
+   if (!strcmp(arg, "error")) {
+   arg_missing_action = MA_ERROR;
+   fn_show_object = show_object;
+   return 0;
+   }
+
+   if (!strcmp(arg, "allow-any")) {
+   arg_missing_action = MA_ALLOW_ANY;
+   fn_show_object = show_object__ma_allow_any;
+   return 0;
+   }
+
+   die(_("invalid value for --missing"));
+   return 0;
+}
+
 static void show_edge(struct commit *commit)
 {
add_preferred_base(commit->object.oid.hash);
@@ -2816,7 +2863,12 @@ static void get_object_list(int ac, const char **av)
if (prepare_revision_walk())
die("revision walk setup failed");
mark_edges_uninteresting(&revs, show_edge);
-   traverse_commit_list(&revs, show_commit, show_object, NULL);
+
+   if (!fn_show_object)
+   fn_show_object = show_object;
+   traverse_commit_list_filtered(&filter_options, &revs,
+ show_commit, fn_show_object, NULL,
+ NULL);
 
if (unpack_unreachable_expiration) {
revs.ignore_missing_links = 1;
@@ -2952,6 +3004,10 @@ int cmd_pack_objects(int argc, const char **argv, const 
char *prefix)
 N_("use a bitmap index if available to speed up 
counting objects")),
OPT_BOOL(0, &q

[PATCH v4 3/6] oidset: add iterator methods to oidset

2017-11-16 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Add the usual iterator methods to oidset.
Add oidset_remove().
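
A short usage sketch of the new API (the set name "my_set" is
illustrative):

    /* Hedged sketch: walk every oid currently in the set. */
    struct oidset_iter iter;
    struct object_id *oid;

    oidset_iter_init(&my_set, &iter);
    while ((oid = oidset_iter_next(&iter)))
        printf("%s\n", oid_to_hex(oid));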

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 oidset.c | 10 ++
 oidset.h | 36 
 2 files changed, 46 insertions(+)

diff --git a/oidset.c b/oidset.c
index f1f874a..454c54f 100644
--- a/oidset.c
+++ b/oidset.c
@@ -24,6 +24,16 @@ int oidset_insert(struct oidset *set, const struct object_id 
*oid)
return 0;
 }
 
+int oidset_remove(struct oidset *set, const struct object_id *oid)
+{
+   struct oidmap_entry *entry;
+
entry = oidmap_remove(&set->map, oid);
+   free(entry);
+
+   return (entry != NULL);
+}
+
 void oidset_clear(struct oidset *set)
 {
oidmap_free(&set->map, 1);
diff --git a/oidset.h b/oidset.h
index f4c9e0f..783abce 100644
--- a/oidset.h
+++ b/oidset.h
@@ -24,6 +24,12 @@ struct oidset {
 
 #define OIDSET_INIT { OIDMAP_INIT }
 
+
+static inline void oidset_init(struct oidset *set, size_t initial_size)
+{
+   return oidmap_init(&set->map, initial_size);
+}
+
 /**
  * Returns true iff `set` contains `oid`.
  */
@@ -39,9 +45,39 @@ int oidset_contains(const struct oidset *set, const struct 
object_id *oid);
 int oidset_insert(struct oidset *set, const struct object_id *oid);
 
 /**
+ * Remove the oid from the set.
+ *
+ * Returns 1 if the oid was present in the set, 0 otherwise.
+ */
+int oidset_remove(struct oidset *set, const struct object_id *oid);
+
+/**
  * Remove all entries from the oidset, freeing any resources associated with
  * it.
  */
 void oidset_clear(struct oidset *set);
 
+struct oidset_iter {
+   struct oidmap_iter m_iter;
+};
+
+static inline void oidset_iter_init(struct oidset *set,
+   struct oidset_iter *iter)
+{
+   oidmap_iter_init(&set->map, &iter->m_iter);
+}
+
+static inline struct object_id *oidset_iter_next(struct oidset_iter *iter)
+{
+   struct oidmap_entry *e = oidmap_iter_next(&iter->m_iter);
+   return e ? &e->oid : NULL;
+}
+
+static inline struct object_id *oidset_iter_first(struct oidset *set,
+ struct oidset_iter *iter)
+{
+   oidset_iter_init(set, iter);
+   return oidset_iter_next(iter);
+}
+
 #endif /* OIDSET_H */
-- 
2.9.3



[PATCH v4 1/6] dir: allow exclusions from blob in addition to file

2017-11-16 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Refactor add_excludes() to separate the reading of the
exclude file into a buffer and the parsing of the buffer
into exclude_list items.

Add add_excludes_from_blob_to_list() to allow an exclude
file be specified with an OID without assuming a local
worktree or index exists.

Refactor read_skip_worktree_file_from_index() and add
do_read_blob() to eliminate duplication of preliminary
processing of blob contents.
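
A hedged sketch of how the pieces fit together, using only the helpers
visible in the diff below ("oid" and "el" are assumed to be in scope;
error handling trimmed):

    /*
     * Hedged sketch: read an exclude blob without consulting the
     * worktree, then parse it into an exclude_list.
     */
    size_t size;
    char *buf;

    if (do_read_blob(&oid, NULL, &size, &buf) > 0)
        add_excludes_from_buffer(buf, size, "", 0, el);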

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 dir.c | 132 ++
 dir.h |   3 ++
 2 files changed, 104 insertions(+), 31 deletions(-)

diff --git a/dir.c b/dir.c
index 1d17b80..1962374 100644
--- a/dir.c
+++ b/dir.c
@@ -220,6 +220,57 @@ int within_depth(const char *name, int namelen,
return 1;
 }
 
+/*
+ * Read the contents of the blob with the given OID into a buffer.
+ * Append a trailing LF to the end if the last line doesn't have one.
+ *
+ * Returns:
+ *-1 when the OID is invalid or unknown or does not refer to a blob.
+ * 0 when the blob is empty.
+ * 1 along with { data, size } of the (possibly augmented) buffer
+ *   when successful.
+ *
+ * Optionally updates the given sha1_stat with the given OID (when valid).
+ */
+static int do_read_blob(const struct object_id *oid,
+   struct sha1_stat *sha1_stat,
+   size_t *size_out,
+   char **data_out)
+{
+   enum object_type type;
+   unsigned long sz;
+   char *data;
+
+   *size_out = 0;
+   *data_out = NULL;
+
+   data = read_sha1_file(oid->hash, &type, &sz);
+   if (!data || type != OBJ_BLOB) {
+   free(data);
+   return -1;
+   }
+
+   if (sha1_stat) {
+   memset(&sha1_stat->stat, 0, sizeof(sha1_stat->stat));
+   hashcpy(sha1_stat->sha1, oid->hash);
+   }
+
+   if (sz == 0) {
+   free(data);
+   return 0;
+   }
+
+   if (data[sz - 1] != '\n') {
+   data = xrealloc(data, st_add(sz, 1));
+   data[sz++] = '\n';
+   }
+
+   *size_out = xsize_t(sz);
+   *data_out = data;
+
+   return 1;
+}
+
 #define DO_MATCH_EXCLUDE   (1<<0)
 #define DO_MATCH_DIRECTORY (1<<1)
 #define DO_MATCH_SUBMODULE (1<<2)
@@ -600,32 +651,22 @@ void add_exclude(const char *string, const char *base,
x->el = el;
 }
 
-static void *read_skip_worktree_file_from_index(const struct index_state 
*istate,
-   const char *path, size_t *size,
-   struct sha1_stat *sha1_stat)
+static int read_skip_worktree_file_from_index(const struct index_state *istate,
+ const char *path,
+ size_t *size_out,
+ char **data_out,
+ struct sha1_stat *sha1_stat)
 {
int pos, len;
-   unsigned long sz;
-   enum object_type type;
-   void *data;
 
len = strlen(path);
pos = index_name_pos(istate, path, len);
if (pos < 0)
-   return NULL;
+   return -1;
if (!ce_skip_worktree(istate->cache[pos]))
-   return NULL;
-   data = read_sha1_file(istate->cache[pos]->oid.hash, &type, &sz);
-   if (!data || type != OBJ_BLOB) {
-   free(data);
-   return NULL;
-   }
-   *size = xsize_t(sz);
-   if (sha1_stat) {
-   memset(&sha1_stat->stat, 0, sizeof(sha1_stat->stat));
-   hashcpy(sha1_stat->sha1, istate->cache[pos]->oid.hash);
-   }
-   return data;
+   return -1;
+
+   return do_read_blob(&istate->cache[pos]->oid, sha1_stat, size_out, data_out);
 }
 
 /*
@@ -739,6 +780,10 @@ static void invalidate_directory(struct untracked_cache *uc,
dir->dirs[i]->recurse = 0;
 }
 
+static int add_excludes_from_buffer(char *buf, size_t size,
+   const char *base, int baselen,
+   struct exclude_list *el);
+
 /*
  * Given a file with name "fname", read it (either from disk, or from
  * an index if 'istate' is non-null), parse it and store the
@@ -754,9 +799,10 @@ static int add_excludes(const char *fname, const char *base, int baselen,
struct sha1_stat *sha1_stat)
 {
struct stat st;
-   int fd, i, lineno = 1;
+   int r;
+   int fd;
size_t size = 0;
-   char *buf, *entry;
+   char *buf;
 
fd = open(fname, O_RDONLY);
if (fd < 0 || fstat(fd, &st) < 0) {
@@ -764,17 +810,13 @@ static int add_excludes(const char *fname, const char *base, int baselen,
warn_on_fopen_errors(fname);

[PATCH v4 2/6] oidmap: add oidmap iterator methods

2017-11-16 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Add the usual map iterator functions to oidmap.

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 oidmap.h | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/oidmap.h b/oidmap.h
index 18f54cd..d3cd2bb 100644
--- a/oidmap.h
+++ b/oidmap.h
@@ -65,4 +65,26 @@ extern void *oidmap_put(struct oidmap *map, void *entry);
  */
 extern void *oidmap_remove(struct oidmap *map, const struct object_id *key);
 
+
+struct oidmap_iter {
+   struct hashmap_iter h_iter;
+};
+
+static inline void oidmap_iter_init(struct oidmap *map, struct oidmap_iter *iter)
+{
+   hashmap_iter_init(&map->map, &iter->h_iter);
+}
+
+static inline void *oidmap_iter_next(struct oidmap_iter *iter)
+{
+   return hashmap_iter_next(&iter->h_iter);
+}
+
+static inline void *oidmap_iter_first(struct oidmap *map,
+ struct oidmap_iter *iter)
+{
+   oidmap_iter_init(map, iter);
+   return oidmap_iter_next(iter);
+}
+
 #endif
-- 
2.9.3
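
As a usage sketch only (the entry type and function below are hypothetical;
only the iterator API added above is assumed, plus the existing rule that
map entries embed struct oidmap_entry as their first member):

    #include "cache.h"
    #include "oidmap.h"

    struct example_entry {
            struct oidmap_entry entry; /* must be the first member */
            char *note;
    };

    /* Print every oid/note pair currently in the map. */
    static void dump_map(struct oidmap *map)
    {
            struct oidmap_iter iter;
            struct example_entry *e;

            oidmap_iter_init(map, &iter);
            while ((e = oidmap_iter_next(&iter)))
                    printf("%s %s\n", oid_to_hex(&e->entry.oid), e->note);
    }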



[PATCH v4 5/6] rev-list: add list-objects filtering support

2017-11-16 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Teach rev-list to use the filtering provided by the
traverse_commit_list_filtered() interface to omit
unwanted objects from the result.  This feature is
intended to help with partial clone.

Object filtering is only allowed when one of the "--objects*"
options is used.

When the "--filter-print-omitted" option is used, the omitted
objects are printed at the end.  These are marked with a "~".
This option can be combined with "--quiet" to get a list of
just the omitted objects.

Added "--missing=(error|print|omit)" argument to specify how
rev-list should behave when it encounters a missing object
(presumably from a prior partial clone).

When "--missing=print" is used, rev-list will print a list of
any missing objects that should have been included in the output.
These are marked with a "?".

Add t6112 test.

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 Documentation/git-rev-list.txt  |   4 +-
 Documentation/rev-list-options.txt  |  37 ++
 builtin/rev-list.c  | 108 -
 t/t6112-rev-list-filters-objects.sh | 225 
 4 files changed, 371 insertions(+), 3 deletions(-)
 create mode 100755 t/t6112-rev-list-filters-objects.sh

diff --git a/Documentation/git-rev-list.txt b/Documentation/git-rev-list.txt
index ef22f17..397a0dd 100644
--- a/Documentation/git-rev-list.txt
+++ b/Documentation/git-rev-list.txt
@@ -47,7 +47,9 @@ SYNOPSIS
 [ --fixed-strings | -F ]
 [ --date=]
 [ [ --objects | --objects-edge | --objects-edge-aggressive ]
-  [ --unpacked ] ]
+  [ --unpacked ]
+  [ --filter=<filter-spec> [ --filter-print-omitted ] ] ]
+[ --missing=(error|allow-any|print) ]
 [ --pretty | --header ]
 [ --bisect ]
 [ --bisect-vars ]
diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index 13501e1..c84e465 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -706,6 +706,43 @@ ifdef::git-rev-list[]
 --unpacked::
Only useful with `--objects`; print the object IDs that are not
in packs.
+
+--filter=<filter-spec>::
+   Only useful with one of the `--objects*`; omits objects (usually
+   blobs) from the list of printed objects.  The '<filter-spec>'
+   may be one of the following:
++
+The form '--filter=blob:none' omits all blobs.
++
+The form '--filter=blob:limit=<n>[kmg]' omits blobs larger than n bytes
+or units.  The value may be zero.  Special files matching '.git*' are
+always included, regardless of size.
++
+The form '--filter=sparse:oid=<blob-ish>' uses a sparse-checkout
+specification contained in the object (or the object that the expression
+evaluates to) to omit blobs not required by the corresponding sparse
+checkout.
++
+The form '--filter=sparse:path=<path>' similarly uses a sparse-checkout
+specification contained in <path>.
+
+--filter-print-omitted::
+   Only useful with `--filter=`; prints a list of the omitted objects.
+   Object IDs are prefixed with a ``~'' character.
+
+--missing=(error|allow-any|print)::
+   Specifies how missing objects are handled.  The repository may
+   have missing objects after a partial clone, for example.
++
+The value 'error' requests that rev-list stop with an error if a missing
+object is encountered.  This is the default action.
++
+The value 'allow-any' will allow object traversal to continue if a
+missing object is encountered.  Missing objects will silently be omitted
+from the results.
++
+The value 'print' is like 'allow-any', but will also print a list of the
+missing objects.  Object IDs are prefixed with a ``?'' character.
 endif::git-rev-list[]
 
 --no-walk[=(sorted|unsorted)]::
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index c1c74d4..da4a39b 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -4,6 +4,8 @@
 #include "diff.h"
 #include "revision.h"
 #include "list-objects.h"
+#include "list-objects-filter.h"
+#include "list-objects-filter-options.h"
 #include "pack.h"
 #include "pack-bitmap.h"
 #include "builtin.h"
@@ -12,6 +14,7 @@
 #include "bisect.h"
 #include "progress.h"
 #include "reflog-walk.h"
+#include "oidset.h"
 
 static const char rev_list_usage[] =
 "git rev-list [OPTION] ... [ -- paths... ]\n"
@@ -55,6 +58,20 @@ static const char rev_list_usage[] =
 static struct progress *progress;
 static unsigned progress_counter;
 
+static struct list_objects_filter_options filter_options;
+static struct oidset omitted_objects;
+static int arg_print_omitted; /* print objects omitted by filter */
+
+static struct oidset missing_objects;
+enum missing_action {
+   MA_ERROR = 0,/* fail if any missing objects are encount
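
As a usage sketch of the options described above (hypothetical command
lines; the output naturally depends on the repository):

    # Omit all blobs; with --quiet, print only the omitted IDs (prefixed "~"):
    git rev-list --objects --filter=blob:none --filter-print-omitted --quiet HEAD

    # After a partial clone, report missing objects (prefixed "?") instead of erroring:
    git rev-list --objects --missing=print HEAD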

Re: [PATCH 04/14] fetch: add object filtering for partial fetch

2017-11-16 Thread Jeff Hostetler



On 11/3/2017 4:38 PM, Jonathan Tan wrote:

@@ -1242,6 +1249,20 @@ static int fetch_multiple(struct string_list *list)
int i, result = 0;
struct argv_array argv = ARGV_ARRAY_INIT;
  
+	if (filter_options.choice) {

+   /*
+* We currently only support partial-fetches to the remote
+* used for the partial-clone because we only support 1
+* promisor remote, so we DO NOT allow explicit command
+* line filter arguments.
+*
+* Note that the loop below will spawn background fetches
+* for each remote and one of them MAY INHERIT the proper
+* partial-fetch settings, so everything is consistent.
+*/
+   die(_("partial-fetch is not supported on multiple remotes"));
+   }
+
if (!append && !dry_run) {
int errcode = truncate_fetch_head();
if (errcode)


My intention in doing the "fetch: refactor calculation of remote list"
patch is so that the interaction between the provided list of remotes
and the specification of the filter can be handled using the following
diff:

 -  if (remote)
 +  if (remote) {
 +  if (filter_options.choice &&
 +  strcmp(remote->name, repository_format_partial_clone_remote))
 +  die(_("--blob-max-bytes can only be used with the remote configured in core.partialClone"));
result = fetch_one(remote, argc, argv);
 -  else
 +  } else {
 +  if (filter_options.choice)
 +  die(_("--blob-max-bytes can only be used with the remote configured in core.partialClone"));
result = fetch_multiple(&list);
 +  }

(Ignore the "blob-max-bytes" in the error message - that needs to be
updated.)

The GitHub link I provided above has this diff, and it seems to work.



I put the filter_options.choice tests inside the fetch_{one,multiple}
routines because the former needs to be able to register partial clone
with the config and/or inherit the default filter-spec for the
promisor remote, and that took more code than can neatly fit inline
here.  This will be more apparent in my next patch series.

Jeff


Re: [PATCH 02/14] clone, fetch-pack, index-pack, transport: partial clone

2017-11-16 Thread Jeff Hostetler



On 11/8/2017 1:01 PM, Adam Dinwoodie wrote:

On Friday 03 November 2017 at 01:32 pm -0700, Jonathan Tan wrote:

On Thu,  2 Nov 2017 20:31:17 +
Jeff Hostetler <g...@jeffhostetler.com> wrote:

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index a0a35e6..31cd5ba 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -222,6 +222,16 @@ static unsigned check_object(struct object *obj)
if (!(obj->flags & FLAG_CHECKED)) {
unsigned long size;
int type = sha1_object_info(obj->oid.hash, &size);
+
+   if (type <= 0) {
+   /*
+* TODO Use the promisor code to conditionally
+* try to fetch this object -or- assume it is ok.
+*/
+   obj->flags |= FLAG_CHECKED;
+   return 0;
+   }
+
if (type <= 0)
die(_("did not receive expected object %s"),
  oid_to_hex(&obj->oid));


This causes some repo corruption tests to fail.


Confirmed: I see this patch, or at least f7e0dbc38 ("clone, fetch-pack,
index-pack, transport: partial clone", 2017-11-02), causing t5300.26 to
fail on 64-bit Cygwin.

For the sake of anyone trying to reproduce this, I needed to cherry pick
66d4c7a58 ("fixup! upload-pack: add object filtering for partial clone",
2017-11-08) onto that commit before I was able to get it to compile.

Adam



Thanks.  I've removed this from my next version.  I think it was
left over from a pre-promisor version.

Jeff


Re: [PATCH v3 4/6] list-objects: filter objects in traverse_commit_list

2017-11-16 Thread Jeff Hostetler



On 11/8/2017 12:01 AM, Junio C Hamano wrote:

Jonathan Tan  writes:


Having said that, though, it might be safer to still introduce one, and
relax it later if necessary - it is much easier to relax a constraint
than to increase one.


It would also be more error prone to have such a long switch ()
statement, each of whose case arm needs to be carefully looked at.

While protection against attacks over the wire against the process
that receives the request is necessary and doing the quoting right
at this layer is one valuable component of it, we would need to be
careful about what features we allow the other side to request.

For example, an innocent-looking use of get_oid_with_context() can
trigger an expensive operation, e.g. "master^{/sekritCodeName}", may
not just waste resources but also may reveal the presence of an
object that we might not want to leak to a stranger.  Limiting such
an abuse must sit at a lot higher layer than a byte-by-byte check
over the request like the code does.



Right.  I could see adding another server-side variable in the
spirit of the existing "uploadpack.allow*" variables.

My main concern at this point has been avoiding injections.

Jeff



Re: [PATCH 1/9] extension.partialclone: introduce partial clone extension

2017-11-16 Thread Jeff Hostetler



On 11/8/2017 4:51 PM, Jonathan Tan wrote:

On Wed, 8 Nov 2017 15:32:21 -0500
Jeff Hostetler <g...@jeffhostetler.com> wrote:


Thanks Jonathan.

I moved my version of part 2 on top of yesterday's part 1.
There are a few changes between my version and yours. Could
you take a quick look at them and see if they make sense?
(I'll spare the mailing list another patch series until after
I attend to the feedback on part 1.)

https://github.com/jeffhostetler/git/commits/core/pc3_p2


Thanks - the differences are quite minor, and they generally make sense.
The main one is that finish_object() in builtin/rev-list.c now returns
int instead of void, but that makes sense.

Other than that:

  - I think you accidentally squashed the rev-list commit into
"sha1_file: support lazily fetching missing objects".


fixed. thanks.


  - The documentation for --exclude-promisor-objects in
git-pack-objects.txt should be "Omit objects that are known to be in
the promisor remote". (This option has the purpose of operating only
on locally created objects, so that when we repack, we still maintain
a distinction between locally created objects [without .promisor] and
objects from the promisor remote [with .promisor].)



  - The transport options in gitremote-helpers.txt could have the same
documentation as in transport.h.


fixed. thanks.
 


Re: [PATCH v3 4/6] list-objects: filter objects in traverse_commit_list

2017-11-16 Thread Jeff Hostetler



On 11/7/2017 6:20 PM, Jonathan Tan wrote:

On Tue,  7 Nov 2017 19:35:44 +
Jeff Hostetler <g...@jeffhostetler.com> wrote:


+/*
+ * Reject the arg if it contains any characters that might
+ * require quoting or escaping when handing to a sub-command.
+ */
+static int reject_injection_chars(const char *arg)
+{

[snip]

+}


Someone pointed me to quote.{c,h}, which is probably sufficient to
ensure shell safety if we do invoke subcommands through the shell. If
that is so, we probably don't need a blacklist.

Having said that, though, it might be safer to still introduce one, and
relax it later if necessary - it is much easier to relax a constraint
than to increase one.


I couldn't use quote.[ch] because it is more concerned with
quoting pathnames because of LF and CR characters within
them -- rather than semicolons and quotes and the like which
I was concerned about.

Anyway, in my next patch series I've replaced all of the
injection code from my last series with something a little
stronger and less restrictive.




+   } else if (skip_prefix(arg, "sparse:", )) {
+
+   if (skip_prefix(v0, "oid=", )) {
+   struct object_context oc;
+   struct object_id sparse_oid;
+   filter_options->choice = LOFC_SPARSE_OID;
+   if (!get_oid_with_context(v1, GET_OID_BLOB,
+ _oid, ))
+   filter_options->sparse_oid_value =
+   oiddup(_oid);
+   return 0;
+   }


In your recent e-mail [1], you said that you will change it to always pass
the original expression - is that still the plan?

[1] 
https://public-inbox.org/git/f698d5a8-bf31-cea1-a8da-88b755b0b...@jeffhostetler.com/


yes.  I always pass filter_options.raw_value over the wire.
The code above tries to parse it and put it in an OID for
private use by the current process -- just like the size limit
value in the blob:limit filter.


+/* Remember to update object flag allocation in object.h */


You probably can delete this line.


Every other place that defined flag bits included this comment,
so I did too.  (It really made it easier to find the other
random places that define bits, actually.)




+/*
+ * FILTER_SHOWN_BUT_REVISIT -- we set this bit on tree objects
+ * that have been shown, but should be revisited if they appear
+ * in the traversal (until we mark it SEEN).  This is a way to
+ * let us silently de-dup calls to show() in the caller.


This is unclear to me at first reading. Maybe something like:

   FILTER_SHOWN_BUT_REVISIT -- we set this bit on tree objects that have
   been shown, but should not be skipped over if they reappear in the
   traversal. This ensures that the tree's descendants are re-processed
   if the tree reappears subsequently, and that the tree is not shown
   twice.


+ * This
+ * is subtly different from the "revision.h:SHOWN" and the
+ * "sha1_name.c:ONELINE_SEEN" bits.  And also different from
+ * the non-de-dup usage in pack-bitmap.c
+ */


Optional: I'm not sure if this comparison is useful. (Maybe it is useful
to others, though.)


I was thinking the first comment about my FILTER_SHOWN field
would be to ask why I wasn't just using the existing SHOWN bit.
There are subtle differences between the bits and I wanted to
point out that I was not just duplicating the usage of an existing
bit.
 



+/*
+ * A filter driven by a sparse-checkout specification to only
+ * include blobs that a sparse checkout would populate.
+ *
+ * The sparse-checkout spec can be loaded from a blob with the
+ * given OID or from a local pathname.  We allow an OID because
+ * the repo may be bare or we may be doing the filtering on the
+ * server.
+ */
+struct frame {
+   /*
+* defval is the usual default include/exclude value that
+* should be inherited as we recurse into directories based
+* upon pattern matching of the directory itself or of a
+* containing directory.
+*/
+   int defval;


Can this be an "unsigned defval : 1" as well? In the function below, I
see that you assign to an "int val" first (which can take -1, 0, and 1)
before assigning to this, so that is fine.

Also, maybe a better name would be "exclude", with the documentation:

   1 if the directory is excluded, 0 otherwise. Excluded directories will
   still be recursed through, because an "include" rule for an object
   might override an "exclude" rule for one of its ancestors.



The name "defval" is used unpack-trees.c during the clear_ce_flags()
recursion while looking at the exclusion list.  I was just trying to
match that behavior.

Thanks
Jeff


Re: [PATCH 1/9] extension.partialclone: introduce partial clone extension

2017-11-08 Thread Jeff Hostetler



On 11/8/2017 4:51 PM, Jonathan Tan wrote:

On Wed, 8 Nov 2017 15:32:21 -0500
Jeff Hostetler <g...@jeffhostetler.com> wrote:


Thanks Jonathan.

I moved my version of part 2 on top of yesterday's part 1.
There are a few changes between my version and yours. Could
you take a quick look at them and see if they make sense?
(I'll spare the mailing list another patch series until after
I attend to the feedback on part 1.)

https://github.com/jeffhostetler/git/commits/core/pc3_p2


Thanks - the differences are quite minor, and they generally make sense.
The main one is that finish_object() in builtin/rev-list.c now returns
int instead of void, but that makes sense.

Other than that:

  - I think you accidentally squashed the rev-list commit into
"sha1_file: support lazily fetching missing objects".
  - The documentation for --exclude-promisor-objects in
git-pack-objects.txt should be "Omit objects that are known to be in
the promisor remote". (This option has the purpose of operating only
on locally created objects, so that when we repack, we still maintain
a distinction between locally created objects [without .promisor] and
objects from the promisor remote [with .promisor].)
  - The transport options in gitremote-helpers.txt could have the same
documentation as in transport.h.



thanks for the quick turn around.  i'll get these into my next
version next week.

Jeff


Re: Test failures on 'pu' branch

2017-11-08 Thread Jeff Hostetler



On 11/8/2017 3:36 PM, Stefan Beller wrote:

On Wed, Nov 8, 2017 at 12:28 PM, Ramsay Jones
 wrote:


t5300-pack-object.sh (Wstat: 256 Tests: 40 Failed: 2)



t5500-fetch-pack.sh  (Wstat: 256 Tests: 355 Failed: 6)


These are series


t5601-clone.sh   (Wstat: 256 Tests: 102 Failed: 4)


This one is a spurious test. I had that flake on me once in the last weeks, too.
But upon investigation I could not reproduce.
See https://public-inbox.org/git/xmqq376ipdpx@gitster.mtv.corp.google.com/



I suspect that the failures related to the jh/partial-clone-* branches
are probably due to slight differences between yesterday's part-1 and
last week's version of part-2 and part-3.

I'm going to be on vacation until Monday, so can we just pull those
parts out of 'pu' until I can get you new versions of parts 2 and 3 ?

Thanks
Jeff


Re: [PATCH 1/9] extension.partialclone: introduce partial clone extension

2017-11-08 Thread Jeff Hostetler



On 11/6/2017 2:16 PM, Jonathan Tan wrote:

On Mon, 6 Nov 2017 12:32:45 -0500
Jeff Hostetler <g...@jeffhostetler.com> wrote:


Yes, that is a point I wanted to ask about.  I renamed the
extensions.partialclone that you created and then I moved your
remote..blob-max-bytes setting to be in extensions too.
Moving it to core.partialclonefilter is fine.


OK - in that case, it might be easier to just reuse my first patch in
its entirety. "core.partialclonefilter" is not used until the
fetching/cloning part anyway.



Good point.  I'll take a look at refactoring that.
If it looks like the result will be mostly/effectively
your original patches, I'll let you know and hand part 2
back to you.


Sounds good. I uploaded the result of rebasing my part 2 patches on top
of your part 1 patches here, if you would like it as a reference:

https://github.com/jonathantanmy/git/commits/pc20171106



Thanks Jonathan.

I moved my version of part 2 on top of yesterday's part 1.
There are a few changes between my version and yours. Could
you take a quick look at them and see if they make sense?
(I'll spare the mailing list another patch series until after
I attend to the feedback on part 1.)

https://github.com/jeffhostetler/git/commits/core/pc3_p2

Thanks
Jeff



Re: [PATCH v2 0/6] Partial clone part 1: object filtering

2017-11-08 Thread Jeff Hostetler



On 11/7/2017 7:54 PM, Junio C Hamano wrote:

Jonathan Tan  writes:


I can see some use for this parameter - for example, when doing a report
for statistical purposes (percentage of objects missing, for example) or
for a background task that downloads missing objects into a cache. Also,
power users who know what they're doing (or normal users in an
emergency) can use this option when they have no network connection if
they really need to find something out from the local repo.

In these cases, the promisor check (after detecting that the object is
missing) is indeed not so useful, I think. (Or we can do the
--exclude=missing and --exclude=promisor idea that Jeff mentioned -
--exclude=missing now, and --exclude=promisor after we add promisor
support.)


This sounds like a reasonable thing to have in the endgame state to
me.


OK thanks, I'll change it to --exclude=missing in my next version.
 



Having said that, I would be OK if we didn't have tolerance (and/or
reporting) of missing objects right now. As far as I know, for the
initial implementation of partial clone, only the server performs any
filtering, and we assume that the server possesses all objects (so it
does not need to filter out any missing objects).


True.  It does not have to exist in an early part, but I do not
think we would terribly mind if it does, if only to help debugging
and development.

Thanks for thinking it through.



Right, it could come later, but having it here in part 1 as part
of the initial series completes the pre-promisor portion of these
commands.  Having a print-missing option lets rev-list -- as is --
be used in a bulk-fetch-object pre-checkout hook or as part of a
"give me what I need before I go offline" command.  This is useful
by itself.  It does augment the dynamic fetch-object code added in
part 2 and the unpack-trees changes in part 3 to call fetch-object.

Jeff




[PATCH v3 4/6] list-objects: filter objects in traverse_commit_list

2017-11-07 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Create traverse_commit_list_filtered() and add filtering
interface to allow certain objects to be omitted from the
traversal.

Update traverse_commit_list() to be a wrapper for the above
with a null filter to minimize the number of callers that
needed to be changed.

Object filtering will be used in a future commit by rev-list
and pack-objects for partial clone and fetch to omit unwanted
objects from the result.

traverse_bitmap_commit_list() does not work with filtering.
If a packfile bitmap is present, it will not be used.  It
should be possible to extend such support in the future (at
least to simple filters that do not require object pathnames),
but that is beyond the scope of this patch series.

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 Makefile  |   2 +
 list-objects-filter-options.c | 148 
 list-objects-filter-options.h |  50 ++
 list-objects-filter.c | 401 ++
 list-objects-filter.h |  77 
 list-objects.c|  95 --
 list-objects.h|  13 +-
 object.h  |   1 +
 8 files changed, 770 insertions(+), 17 deletions(-)
 create mode 100644 list-objects-filter-options.c
 create mode 100644 list-objects-filter-options.h
 create mode 100644 list-objects-filter.c
 create mode 100644 list-objects-filter.h

diff --git a/Makefile b/Makefile
index cd75985..ca378a4 100644
--- a/Makefile
+++ b/Makefile
@@ -807,6 +807,8 @@ LIB_OBJS += levenshtein.o
 LIB_OBJS += line-log.o
 LIB_OBJS += line-range.o
 LIB_OBJS += list-objects.o
+LIB_OBJS += list-objects-filter.o
+LIB_OBJS += list-objects-filter-options.o
 LIB_OBJS += ll-merge.o
 LIB_OBJS += lockfile.o
 LIB_OBJS += log-tree.o
diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
new file mode 100644
index 000..21c4830
--- /dev/null
+++ b/list-objects-filter-options.c
@@ -0,0 +1,148 @@
+#include "cache.h"
+#include "commit.h"
+#include "config.h"
+#include "revision.h"
+#include "argv-array.h"
+#include "list-objects.h"
+#include "list-objects-filter.h"
+#include "list-objects-filter-options.h"
+
+/*
+ * Reject the arg if it contains any characters that might
+ * require quoting or escaping when handing to a sub-command.
+ */
+static int reject_injection_chars(const char *arg)
+{
+   const unsigned char *p;
+
+   for (p = (const unsigned char *)arg; *p; p++) {
+   if (*p < 0x20) /* control character */
+   return 1;
+   if (*p >= '0' && *p <= '9')
+   continue;
+   if (*p >= 'A' && *p <= 'Z')
+   continue;
+   if (*p >= 'a' && *p <= 'z')
+   continue;
+   if (*p >= 0x80)
+   continue;
+
+   switch (*p) {
+   case ' ': return 1; /* 0x20 */
+   case '!': continue; /* 0x21 */
+   case '"': return 1; /* 0x22 */
+   case '#': return 1; /* 0x23 */
+   case '$': return 1; /* 0x24 */
+   case '%': continue; /* 0x25 */
+   case '&': return 1; /* 0x26 */
+   case '\'':return 1; /* 0x27 */
+   case '(': continue; /* 0x28 */
+   case ')': continue; /* 0x29 */
+   case '*': return 1; /* 0x2a */
+   case '+': return 1; /* 0x2b */
+   case ',': continue; /* 0x2c */
+   case '-': continue; /* 0x2d */
+   case '.': continue; /* 0x2e */
+   case '/': continue; /* 0x2f */
+
+   case ':': continue; /* 0x3a */
+   case ';': return 1; /* 0x3b */
+   case '<': return 1; /* 0x3c */
+   case '=': continue; /* 0x3d */
+   case '>': return 1; /* 0x3e */
+   case '?': continue; /* 0x3f */
+
+   case '@': continue; /* 0x40 */
+
+   case '[': continue; /* 0x5b */
+   case '\\':return 1; /* 0x5c */
+   case ']': continue; /* 0x5d */
+   case '^': continue; /* 0x5e */
+   case '_': continue; /* 0x5f */
+
+   case '`': return 1; /* 0x60 */
+
+   case '{': continue; /* 0x7b */
+   case '|': return 1; /* 0x7c */
+   case '}': continue; /* 0x7d */
+   case '~': continue; /* 0x7e */
+   case 0x7f:return 1; /* 0x7f */
+   default:  continue;
+   }
+   }
+   return 0;
+}
+
+/*
+ * Parse value of the argument to the "filter" keword.
+ * On the command line this looks like:
+ *   --filter=
+ * and in the pack protocol as:
+ *   "filter" SP 
+ *
+ *  ::= blob:n

[PATCH v3 1/6] dir: allow exclusions from blob in addition to file

2017-11-07 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Refactor add_excludes() to separate the reading of the
exclude file into a buffer and the parsing of the buffer
into exclude_list items.

Add add_excludes_from_blob_to_list() to allow an exclude
file to be specified with an OID without assuming a local
worktree or index exists.

Refactor read_skip_worktree_file_from_index() and add
do_read_blob() to eliminate duplication of preliminary
processing of blob contents.

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 dir.c | 132 ++
 dir.h |   3 ++
 2 files changed, 104 insertions(+), 31 deletions(-)

diff --git a/dir.c b/dir.c
index 1d17b80..1962374 100644
--- a/dir.c
+++ b/dir.c
@@ -220,6 +220,57 @@ int within_depth(const char *name, int namelen,
return 1;
 }
 
+/*
+ * Read the contents of the blob with the given OID into a buffer.
+ * Append a trailing LF to the end if the last line doesn't have one.
+ *
+ * Returns:
+ *-1 when the OID is invalid or unknown or does not refer to a blob.
+ * 0 when the blob is empty.
+ * 1 along with { data, size } of the (possibly augmented) buffer
+ *   when successful.
+ *
+ * Optionally updates the given sha1_stat with the given OID (when valid).
+ */
+static int do_read_blob(const struct object_id *oid,
+   struct sha1_stat *sha1_stat,
+   size_t *size_out,
+   char **data_out)
+{
+   enum object_type type;
+   unsigned long sz;
+   char *data;
+
+   *size_out = 0;
+   *data_out = NULL;
+
+   data = read_sha1_file(oid->hash, &type, &sz);
+   if (!data || type != OBJ_BLOB) {
+   free(data);
+   return -1;
+   }
+
+   if (sha1_stat) {
+   memset(&sha1_stat->stat, 0, sizeof(sha1_stat->stat));
+   hashcpy(sha1_stat->sha1, oid->hash);
+   }
+
+   if (sz == 0) {
+   free(data);
+   return 0;
+   }
+
+   if (data[sz - 1] != '\n') {
+   data = xrealloc(data, st_add(sz, 1));
+   data[sz++] = '\n';
+   }
+
+   *size_out = xsize_t(sz);
+   *data_out = data;
+
+   return 1;
+}
+
 #define DO_MATCH_EXCLUDE   (1<<0)
 #define DO_MATCH_DIRECTORY (1<<1)
 #define DO_MATCH_SUBMODULE (1<<2)
@@ -600,32 +651,22 @@ void add_exclude(const char *string, const char *base,
x->el = el;
 }
 
-static void *read_skip_worktree_file_from_index(const struct index_state *istate,
-   const char *path, size_t *size,
-   struct sha1_stat *sha1_stat)
+static int read_skip_worktree_file_from_index(const struct index_state *istate,
+ const char *path,
+ size_t *size_out,
+ char **data_out,
+ struct sha1_stat *sha1_stat)
 {
int pos, len;
-   unsigned long sz;
-   enum object_type type;
-   void *data;
 
len = strlen(path);
pos = index_name_pos(istate, path, len);
if (pos < 0)
-   return NULL;
+   return -1;
if (!ce_skip_worktree(istate->cache[pos]))
-   return NULL;
-   data = read_sha1_file(istate->cache[pos]->oid.hash, &type, &sz);
-   if (!data || type != OBJ_BLOB) {
-   free(data);
-   return NULL;
-   }
-   *size = xsize_t(sz);
-   if (sha1_stat) {
-   memset(&sha1_stat->stat, 0, sizeof(sha1_stat->stat));
-   hashcpy(sha1_stat->sha1, istate->cache[pos]->oid.hash);
-   }
-   return data;
+   return -1;
+
+   return do_read_blob(&istate->cache[pos]->oid, sha1_stat, size_out, data_out);
 }
 
 /*
@@ -739,6 +780,10 @@ static void invalidate_directory(struct untracked_cache *uc,
dir->dirs[i]->recurse = 0;
 }
 
+static int add_excludes_from_buffer(char *buf, size_t size,
+   const char *base, int baselen,
+   struct exclude_list *el);
+
 /*
  * Given a file with name "fname", read it (either from disk, or from
  * an index if 'istate' is non-null), parse it and store the
@@ -754,9 +799,10 @@ static int add_excludes(const char *fname, const char *base, int baselen,
struct sha1_stat *sha1_stat)
 {
struct stat st;
-   int fd, i, lineno = 1;
+   int r;
+   int fd;
size_t size = 0;
-   char *buf, *entry;
+   char *buf;
 
fd = open(fname, O_RDONLY);
if (fd < 0 || fstat(fd, &st) < 0) {
@@ -764,17 +810,13 @@ static int add_excludes(const char *fname, const char *base, int baselen,
warn_on_fopen_errors(fname);

[PATCH v3 6/6] pack-objects: add list-objects filtering

2017-11-07 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Teach pack-objects to use the filtering provided by the
traverse_commit_list_filtered() interface to omit unwanted
objects from the resulting packfile.

This feature is intended for partial clone/fetch.

Filtering requires the use of the "--stdout" option.

Add t5317 test.

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 Documentation/git-pack-objects.txt |  12 +-
 builtin/pack-objects.c |  28 ++-
 t/t5317-pack-objects-filter-objects.sh | 369 +
 3 files changed, 407 insertions(+), 2 deletions(-)
 create mode 100755 t/t5317-pack-objects-filter-objects.sh

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index 473a161..6786351 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -12,7 +12,8 @@ SYNOPSIS
 'git pack-objects' [-q | --progress | --all-progress] [--all-progress-implied]
[--no-reuse-delta] [--delta-base-offset] [--non-empty]
[--local] [--incremental] [--window=<n>] [--depth=<n>]
-   [--revs [--unpacked | --all]] [--stdout | base-name]
+   [--revs [--unpacked | --all]]
+   [--stdout [--filter=<filter-spec>] | base-name]
[--shallow] [--keep-true-parents] < object-list
 
 
@@ -236,6 +237,15 @@ So does `git bundle` (see linkgit:git-bundle[1]) when it creates a bundle.
With this option, parents that are hidden by grafts are packed
nevertheless.
 
+--filter=<filter-spec>::
+   Requires `--stdout`.  Omits certain objects (usually blobs) from
+   the resulting packfile.  See linkgit:git-rev-list[1] for valid
+   `<filter-spec>` forms.
+
+--filter-ignore-missing::
+   Ignore missing objects without error.  This may be used with
+   or without any of the above filtering.
+
 SEE ALSO
 
 linkgit:git-rev-list[1]
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 6e77dfd..e16722f 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -15,6 +15,8 @@
 #include "diff.h"
 #include "revision.h"
 #include "list-objects.h"
+#include "list-objects-filter.h"
+#include "list-objects-filter-options.h"
 #include "pack-objects.h"
 #include "progress.h"
 #include "refs.h"
@@ -79,6 +81,9 @@ static unsigned long cache_max_small_delta_size = 1000;
 
 static unsigned long window_memory_limit = 0;
 
+static struct list_objects_filter_options filter_options;
+static int arg_ignore_missing;
+
 /*
  * stats
  */
@@ -2547,6 +2552,15 @@ static void show_commit(struct commit *commit, void *data)
 
 static void show_object(struct object *obj, const char *name, void *data)
 {
+   /*
+* Quietly ignore missing objects when they are expected.  This
+* avoids staging them and getting an odd error later.  If we are
+* not expecting them, stage it and let the normal error handling
+* deal with it.
+*/
+   if (arg_ignore_missing && !has_object_file(&obj->oid))
+   return;
+
add_preferred_base_object(name);
add_object_entry(obj->oid.hash, obj->type, name, 0);
obj->flags |= OBJECT_ADDED;
@@ -2816,7 +2830,10 @@ static void get_object_list(int ac, const char **av)
if (prepare_revision_walk())
die("revision walk setup failed");
mark_edges_uninteresting(&revs, show_edge);
-   traverse_commit_list(&revs, show_commit, show_object, NULL);
+
+   traverse_commit_list_filtered(&filter_options, &revs,
+ show_commit, show_object, NULL,
+ NULL);
 
if (unpack_unreachable_expiration) {
revs.ignore_missing_links = 1;
@@ -2952,6 +2969,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 N_("use a bitmap index if available to speed up counting objects")),
OPT_BOOL(0, "write-bitmap-index", &write_bitmap_index,
 N_("write a bitmap index together with the pack index")),
+   OPT_PARSE_LIST_OBJECTS_FILTER(&filter_options),
+   OPT_BOOL(0, "filter-ignore-missing", &arg_ignore_missing,
+N_("ignore and omit missing objects from packfile")),
OPT_END(),
};
 
@@ -3028,6 +3048,12 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
if (!rev_list_all || !rev_list_reflog || !rev_list_index)
unpack_unreachable_expiration = 0;
 
+   if (filter_options.choice) {
+   if (!pack_to_stdout)
+   die("cannot use filtering with an indexable pack.");
+   use_bitmap_index = 0;
+   }
+
/*
 * "soft" reasons not to use bitmaps - for on-disk repack by default we 
want
 *
diff --git a/t/t5317
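
As a usage sketch of the filtering added here (hypothetical command line;
--filter requires --stdout as documented above):

    # Stream a pack that omits blobs larger than 1 MB:
    echo HEAD | git pack-objects --revs --stdout \
            --filter=blob:limit=1m >filtered.pack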

[PATCH v3 5/6] rev-list: add list-objects filtering support

2017-11-07 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Teach rev-list to use the filtering provided by the
traverse_commit_list_filtered() interface to omit
unwanted objects from the result.  This feature is
intended to help with partial clone.

Object filtering is only allowed when one of the "--objects*"
options is used.

When the "--filter-print-omitted" option is used, the omitted
objects are printed at the end.  These are marked with a "~".
This option can be combined with "--quiet" to get a list of
just the omitted objects.

Normally, rev-list will stop with an error when there are
missing objects.

When the "--filter-print-missing" option is used, rev-list
will print a list of any missing objects that should have
been included in the output (rather than stopping).
These are marked with a "?".

When the "--filter-ignore-missing" option is used, rev-list
will silently ignore any missing objects and continue without
error.

Add t6112 test.

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 Documentation/git-rev-list.txt  |   6 +-
 Documentation/rev-list-options.txt  |  34 ++
 builtin/rev-list.c  |  75 +++-
 t/t6112-rev-list-filters-objects.sh | 225 
 4 files changed, 337 insertions(+), 3 deletions(-)
 create mode 100755 t/t6112-rev-list-filters-objects.sh

diff --git a/Documentation/git-rev-list.txt b/Documentation/git-rev-list.txt
index ef22f17..b8a3a5b 100644
--- a/Documentation/git-rev-list.txt
+++ b/Documentation/git-rev-list.txt
@@ -47,7 +47,11 @@ SYNOPSIS
 [ --fixed-strings | -F ]
 [ --date=]
 [ [ --objects | --objects-edge | --objects-edge-aggressive ]
-  [ --unpacked ] ]
+  [ --unpacked ]
+  [ --filter=<filter-spec> ] ]
+[ --filter-print-missing ]
+[ --filter-print-omitted ]
+[ --filter-ignore-missing ]
 [ --pretty | --header ]
 [ --bisect ]
 [ --bisect-vars ]
diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index 13501e1..9233134 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -706,6 +706,40 @@ ifdef::git-rev-list[]
 --unpacked::
Only useful with `--objects`; print the object IDs that are not
in packs.
+
+--filter=<filter-spec>::
+   Only useful with one of the `--objects*`; omits objects (usually
+   blobs) from the list of printed objects.  The '<filter-spec>'
+   may be one of the following:
++
+The form '--filter=blob:none' omits all blobs.
++
+The form '--filter=blob:limit=<n>[kmg]' omits blobs larger than n bytes
+or units.  The value may be zero.  Special files matching '.git*' are
+always included, regardless of size.
++
+The form '--filter=sparse:oid=<blob-ish>' uses a sparse-checkout
+specification contained in the object (or the object that the expression
+evaluates to) to omit blobs not required by the corresponding sparse
+checkout.
++
+The form '--filter=sparse:path=<path>' similarly uses a sparse-checkout
+specification contained in <path>.
+
+--filter-print-missing::
+   Prints a list of the missing objects for the requested traversal.
+   Object IDs are prefixed with a ``?'' character.  The object type
+   is printed after the ID.  This may be used with or without any of
+   the above filtering options.
+
+--filter-ignore-missing::
+   Ignores missing objects encountered during the requested traversal.
+   This may be used with or without any of the above filtering options.
+
+--filter-print-omitted::
+   Only useful with one of the above `--filter*`; prints a list
+   of the omitted objects.  Object IDs are prefixed with a ``~''
+   character.
 endif::git-rev-list[]
 
 --no-walk[=(sorted|unsorted)]::
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index c1c74d4..cc9fa40 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -4,6 +4,8 @@
 #include "diff.h"
 #include "revision.h"
 #include "list-objects.h"
+#include "list-objects-filter.h"
+#include "list-objects-filter-options.h"
 #include "pack.h"
 #include "pack-bitmap.h"
 #include "builtin.h"
@@ -12,6 +14,7 @@
 #include "bisect.h"
 #include "progress.h"
 #include "reflog-walk.h"
+#include "oidset.h"
 
 static const char rev_list_usage[] =
 "git rev-list [OPTION] ... [ -- paths... ]\n"
@@ -54,6 +57,15 @@ static const char rev_list_usage[] =
 
 static struct progress *progress;
 static unsigned progress_counter;
+static struct list_objects_filter_options filter_options;
+static struct oidset missing_objects;
+static struct oidset omitted_objects;
+static int arg_print_missing;
+static int arg_print_omitted;
+static int arg_ignore_missing;
+
+#define DEFAULT_OIDSET_SIZE (16*1024)
+
 
 static void finish_commit(

[PATCH v3 2/6] oidmap: add oidmap iterator methods

2017-11-07 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Add the usual map iterator functions to oidmap.

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 oidmap.h | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/oidmap.h b/oidmap.h
index 18f54cd..d3cd2bb 100644
--- a/oidmap.h
+++ b/oidmap.h
@@ -65,4 +65,26 @@ extern void *oidmap_put(struct oidmap *map, void *entry);
  */
 extern void *oidmap_remove(struct oidmap *map, const struct object_id *key);
 
+
+struct oidmap_iter {
+   struct hashmap_iter h_iter;
+};
+
+static inline void oidmap_iter_init(struct oidmap *map, struct oidmap_iter *iter)
+{
+   hashmap_iter_init(&map->map, &iter->h_iter);
+}
+
+static inline void *oidmap_iter_next(struct oidmap_iter *iter)
+{
+   return hashmap_iter_next(&iter->h_iter);
+}
+
+static inline void *oidmap_iter_first(struct oidmap *map,
+ struct oidmap_iter *iter)
+{
+   oidmap_iter_init(map, iter);
+   return oidmap_iter_next(iter);
+}
+
 #endif
-- 
2.9.3



[PATCH v3 3/6] oidset: add iterator methods to oidset

2017-11-07 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Add the usual iterator methods to oidset.
Add oidset_remove().

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 oidset.c | 10 ++
 oidset.h | 36 
 2 files changed, 46 insertions(+)

diff --git a/oidset.c b/oidset.c
index f1f874a..454c54f 100644
--- a/oidset.c
+++ b/oidset.c
@@ -24,6 +24,16 @@ int oidset_insert(struct oidset *set, const struct object_id *oid)
return 0;
 }
 
+int oidset_remove(struct oidset *set, const struct object_id *oid)
+{
+   struct oidmap_entry *entry;
+
+   entry = oidmap_remove(&set->map, oid);
+   free(entry);
+
+   return (entry != NULL);
+}
+
 void oidset_clear(struct oidset *set)
 {
oidmap_free(&set->map, 1);
diff --git a/oidset.h b/oidset.h
index f4c9e0f..783abce 100644
--- a/oidset.h
+++ b/oidset.h
@@ -24,6 +24,12 @@ struct oidset {
 
 #define OIDSET_INIT { OIDMAP_INIT }
 
+
+static inline void oidset_init(struct oidset *set, size_t initial_size)
+{
+   return oidmap_init(&set->map, initial_size);
+}
+
 /**
  * Returns true iff `set` contains `oid`.
  */
@@ -39,9 +45,39 @@ int oidset_contains(const struct oidset *set, const struct object_id *oid);
 int oidset_insert(struct oidset *set, const struct object_id *oid);
 
 /**
+ * Remove the oid from the set.
+ *
+ * Returns 1 if the oid was present in the set, 0 otherwise.
+ */
+int oidset_remove(struct oidset *set, const struct object_id *oid);
+
+/**
  * Remove all entries from the oidset, freeing any resources associated with
  * it.
  */
 void oidset_clear(struct oidset *set);
 
+struct oidset_iter {
+   struct oidmap_iter m_iter;
+};
+
+static inline void oidset_iter_init(struct oidset *set,
+   struct oidset_iter *iter)
+{
+   oidmap_iter_init(&set->map, &iter->m_iter);
+}
+
+static inline struct object_id *oidset_iter_next(struct oidset_iter *iter)
+{
+   struct oidmap_entry *e = oidmap_iter_next(&iter->m_iter);
+   return e ? &e->oid : NULL;
+}
+
+static inline struct object_id *oidset_iter_first(struct oidset *set,
+ struct oidset_iter *iter)
+{
+   oidset_iter_init(set, iter);
+   return oidset_iter_next(iter);
+}
+
 #endif /* OIDSET_H */
-- 
2.9.3



[PATCH v3 0/6] Partial clone part 1: object filtering

2017-11-07 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Here is V3 of the list-object filtering.  This addresses the
comments on the mailing list for the V2 series as well as the
various TODO items I left in the code.  I also documented some
of the bit flags and fields that I added.

In the blob size filter, I removed the ".git*" pattern matching
for special files.  I don't think we need it any more and it
simplifies the code.  This patch series does not support
traverse_bitmap_commit_list() and the --use-bitmap-index feature
in rev-list, but by removing the ".git*" pattern matching now
we should be able to allow filtering and bitmaps to be used
together in a future effort.  (That is beyond the scope of
the current partial-clone effort.)

With this patch series, I think part 1 is complete unless there
are further comments or questions.


Jeff Hostetler (6):
  dir: allow exclusions from blob in addition to file
  oidmap: add oidmap iterator methods
  oidset: add iterator methods to oidset
  list-objects: filter objects in traverse_commit_list
  rev-list: add list-objects filtering support
  pack-objects: add list-objects filtering

 Documentation/git-pack-objects.txt |  12 +-
 Documentation/git-rev-list.txt |   6 +-
 Documentation/rev-list-options.txt |  34 +++
 Makefile   |   2 +
 builtin/pack-objects.c |  28 ++-
 builtin/rev-list.c |  75 +-
 dir.c  | 132 ---
 dir.h  |   3 +
 list-objects-filter-options.c  | 148 
 list-objects-filter-options.h  |  50 
 list-objects-filter.c  | 401 +
 list-objects-filter.h  |  77 +++
 list-objects.c |  95 ++--
 list-objects.h |  13 +-
 object.h   |   1 +
 oidmap.h   |  22 ++
 oidset.c   |  10 +
 oidset.h   |  36 +++
 t/t5317-pack-objects-filter-objects.sh | 369 ++
 t/t6112-rev-list-filters-objects.sh| 225 ++
 20 files changed, 1686 insertions(+), 53 deletions(-)
 create mode 100644 list-objects-filter-options.c
 create mode 100644 list-objects-filter-options.h
 create mode 100644 list-objects-filter.c
 create mode 100644 list-objects-filter.h
 create mode 100755 t/t5317-pack-objects-filter-objects.sh
 create mode 100755 t/t6112-rev-list-filters-objects.sh

-- 
2.9.3



Re: [PATCH v2 4/6] list-objects: filter objects in traverse_commit_list

2017-11-07 Thread Jeff Hostetler



On 11/2/2017 3:32 PM, Jonathan Tan wrote:

On Thu,  2 Nov 2017 17:50:11 +
Jeff Hostetler <g...@jeffhostetler.com> wrote:


+   if (skip_prefix(v0, "oid=", )) {
+   filter_options->choice = LOFC_SPARSE_OID;
+   if (!get_oid_with_context(v1, GET_OID_BLOB,
+ &sparse_oid, &oc)) {
+   /*
+* We successfully converted the 
+* into an actual OID.  Rewrite the raw_value
+* in canonical form with just the OID.
+* (If we send this request to the server, we
+* want an absolute expression rather than a
+* local-ref-relative expression.)
+*/


I think this would lead to confusing behavior - for example, a fetch
with "--filter=oid=mybranch:sparseconfig" would have different results
depending on whether "mybranch" refers to a valid object locally.

The way I see it, this should either (i) only accept full 40-character
OIDs or (ii) retain the raw string to be interpreted only when the
filtering is done. (i) is simpler and safer, but is not so useful. In
both cases, if the user really wants client-side interpretation, they
can still use "$(git rev-parse foo)" to make it explicit.


Good point. I'll change it to always pass the original expression
so that it is evaluated wherever the filtering is actually performed.





+   free((char *)filter_options->raw_value);
+   filter_options->raw_value =
+   xstrfmt("sparse:oid=%s",
+   oid_to_hex(&sparse_oid));
+   filter_options->sparse_oid_value =
+   oiddup(&sparse_oid);
+   } else {
+   /*
+* We could not turn the  into an
+* OID.  Leave the raw_value as is in case
+* the server can parse it.  (It may refer to
+* a branch, commit, or blob we don't have.)
+*/
+   }
+   return 0;
+   }
+
+   if (skip_prefix(v0, "path=", )) {
+   filter_options->choice = LOFC_SPARSE_PATH;
+   filter_options->sparse_path_value = strdup(v1);
+   return 0;
+   }
+   }
+
+   die(_("invalid filter expression '%s'"), arg);
+   return 0;
+}


[snip]


+void arg_format_list_objects_filter(
+   struct argv_array *argv_array,
+   const struct list_objects_filter_options *filter_options)


Is this function used anywhere (in this patch or subsequent patches)?


It is used in upload-pack.c in part 3.  I'll remove it from part 1
and revisit in part 3.
 




diff --git a/list-objects-filter.c b/list-objects-filter.c
+/* See object.h and revision.h */
+#define FILTER_REVISIT (1<<25)


Looking later in the code, this flag indicates that a tree has been
SHOWN, so it might be better to just call this FILTER_SHOWN.


I'll amend this. There are already several SHOWN bits that behave
slightly differently.  I'll update and document this better.  Thanks.




[snip]


+struct frame {
+   int defval;


Document this variable?


+   int child_prov_omit : 1;


I think it's clearer if we use "unsigned" here. Also, document this
(e.g. "1 if any descendant of this tree object was provisionally
omitted").


got it. thanks.



+enum list_objects_filter_type {
+   LOFT_BEGIN_TREE,
+   LOFT_END_TREE,
+   LOFT_BLOB
+};


Optional: probably a better name would be list_objects_filter_situation.


got it. thanks.

 

+void traverse_commit_list_filtered(
+   struct list_objects_filter_options *filter_options,
+   struct rev_info *revs,
+   show_commit_fn show_commit,
+   show_object_fn show_object,
+   void *show_data,
+   struct oidset *omitted)
+{
+   filter_object_fn filter_fn = NULL;
+   filter_free_fn filter_free_fn = NULL;
+   void *filter_data = NULL;
+
+   filter_data = list_objects_filter__init(omitted, filter_options,
+   &filter_fn, &filter_free_fn);
+   do_traverse(revs, show_commit, show_object, show_data,
+   filter_fn, filter_data);
+   if (filter_data && filter_free_fn)
+   filter_free_fn(filter_data);
+}


This function traverse_commit_list_filtered() is in list-objects.c but
in list-objects-filter.h, if I'm reading the diff correctly?


oops.  thanks.




Overall, this looks like a goo

Re: [PATCH v2 4/6] list-objects: filter objects in traverse_commit_list

2017-11-06 Thread Jeff Hostetler



On 11/2/2017 1:50 PM, Jeff Hostetler wrote:

From: Jeff Hostetler <jeffh...@microsoft.com>

Create traverse_commit_list_filtered() and add filtering
interface to allow certain objects to be omitted from the
traversal.
...
diff --git a/list-objects-filter.c b/list-objects-filter.c
new file mode 100644
index 000..7f28425
--- /dev/null
+++ b/list-objects-filter.c
...
+/*
+ * A filter for list-objects to omit large blobs,
+ * but always include ".git*" special files.
+ * And to OPTIONALLY collect a list of the omitted OIDs.
+ */


Jonathan and I were talking off-list about the performance
effects of inspecting the pathnames to identify the ".git*"
special files. I added it in my first draft back in the spring,
thinking that even if you set the blob-limit to a small
number (or zero), you'd probably still always want the
.gitattributes and .gitignore files.  But now with the addition
of the sparse filter and functional dynamic object fetching,
I'm not sure I see the need for this.

Also, if the primary use of the blob-limit is to filter out
giant binary assets, it is unlikely anyone is going to have
a 1MB+ .git* file, so it is unlikely that the is_special_file
would include anything that wouldn't already be included by
the size criteria.

So, if there's no objections, I think I'll remove this and
simplify the blob-limit filter function.  (That would let me
get rid of the provisional omit code here.)

Jeff


Re: [PATCH 1/9] extension.partialclone: introduce partial clone extension

2017-11-06 Thread Jeff Hostetler



On 11/3/2017 2:39 PM, Jonathan Tan wrote:

On Fri, 3 Nov 2017 09:57:18 -0400
Jeff Hostetler <g...@jeffhostetler.com> wrote:


On 11/2/2017 6:24 PM, Jonathan Tan wrote:

On Thu,  2 Nov 2017 20:20:44 +0000
Jeff Hostetler <g...@jeffhostetler.com> wrote:


From: Jeff Hostetler <jeffh...@microsoft.com>

Introduce the ability to have missing objects in a repo.  This
functionality is guarded by new repository extension options:
  `extensions.partialcloneremote` and
  `extensions.partialclonefilter`.


With this, it is unclear what happens if extensions.partialcloneremote
is not set but extensions.partialclonefilter is set. For something as
significant as a repository extension (which Git uses to determine if it
will even attempt to interact with a repo), I think - I would prefer
just extensions.partialclone (or extensions.partialcloneremote, though I
prefer the former) which determines the remote (the important part,
which controls the dynamic object fetching), and have another option
"core.partialclonefilter" which is only useful if
"extensions.partialclone" is set.


Yes, that is a point I wanted to ask about.  I renamed the
extensions.partialclone that you created and then I moved your
remote..blob-max-bytes setting to be in extensions too.
Moving it to core.partialclonefilter is fine.


OK - in that case, it might be easier to just reuse my first patch in
its entirety. "core.partialclonefilter" is not used until the
fetching/cloning part anyway.



Good point.  I'll take a look at refactoring that.
If it looks like the result will be mostly/effectively
your original patches, I'll let you know and hand part 2
back to you.


I agree that "core.partialclonefilter" (or another place not in
"remote") instead of "remote..blob-max-bytes" is a good idea - in
the future, we might want to reuse the same filter setting for
non-fetching functionality.



Jeff


Re: [PATCH v2 0/6] Partial clone part 1: object filtering

2017-11-03 Thread Jeff Hostetler



On 11/3/2017 11:05 AM, Junio C Hamano wrote:

Jeff Hostetler <g...@jeffhostetler.com> writes:


Yes, I thought we should have both (perhaps renamed or combined
into 1 parameter with value, such as --exclude=missing vs --exclude=promisor)
and let the user decide how strict they want to be.


Assuming we eventually get promisor support working, would there be
any use case where "any missing is OK" mode would be useful in a
sense more reasonable than "because we could have such a mode" and
"it is not our business to prevent users from playing with fire"?



For now, I'd like to keep my "any missing is OK" option.
I do think it has value all by itself.

We are essentially using something like that now with our GVFS
users on the gigantic Windows repo and haven't had any issues.

But yes, when we get promisor support working, we could revisit
the need for this parameter.

However, I do have some scaling concerns here.  If for example,
I have 100M missing blobs (because we did an only commits-and-trees
clone), the cost to compute "promisor missing" vs "just missing"
might be prohibitively expensive.  It could be something we want
fsck/gc to be aware of, but other commands may want to just assume
any missing objects are expected and continue.

Hopefully, we won't have a scale problem, but we just don't know
yet.

Jeff


Re: [PATCH] fix an 'dubious one-bit signed bitfield' error

2017-11-03 Thread Jeff Hostetler

d'oh.  thanks!

On 11/3/2017 1:05 PM, Ramsay Jones wrote:


Signed-off-by: Ramsay Jones 
---

Hi Jeff,

If you need to re-roll your 'jh/object-filtering' branch, could
you please squash this into the relevant commit (b87fd93d81,
"list-objects: filter objects in traverse_commit_list", 02-11-2017).

[This error was issued by sparse]

Thanks!

ATB,
Ramsay Jones

  list-objects-filter.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/list-objects-filter.c b/list-objects-filter.c
index 7f2842547..d9e626be8 100644
--- a/list-objects-filter.c
+++ b/list-objects-filter.c
@@ -191,7 +191,7 @@ static void *filter_blobs_limit__init(
   */
  struct frame {
int defval;
-   int child_prov_omit : 1;
+   unsigned int child_prov_omit : 1;
  };
  
  struct filter_sparse_data {
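
For background, a tiny standalone demo of why sparse complains (not part
of the patch): whether a plain 'int' bit-field is signed is
implementation-defined, and when it is signed a one-bit field can only
hold 0 and -1, so storing 1 typically reads back as -1:

    #include <stdio.h>

    struct demo {
            int signed_bit : 1;            /* dubious one-bit signed bit-field */
            unsigned int unsigned_bit : 1; /* holds 0 and 1 as intended */
    };

    int main(void)
    {
            struct demo d = { 0, 0 };
            d.signed_bit = 1;
            d.unsigned_bit = 1;
            /* Typically prints "-1 1"; a test like (d.signed_bit == 1) is false. */
            printf("%d %d\n", d.signed_bit, (int)d.unsigned_bit);
            return 0;
    }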




Re: [PATCH 00/14] WIP Partial clone part 3: clone, fetch, fetch-pack, upload-pack, and tests

2017-11-03 Thread Jeff Hostetler



On 11/2/2017 7:41 PM, Jonathan Tan wrote:

On Thu,  2 Nov 2017 20:31:15 +
Jeff Hostetler <g...@jeffhostetler.com> wrote:


From: Jeff Hostetler <jeffh...@microsoft.com>

This is part 3 of 3 for partial clone.
It assumes that part 1 [1] and part 2 [2] are in place.

Part 3 is concerned with the commands: clone, fetch, upload-pack, fetch-pack,
remote-curl, index-pack, and the pack-protocol.

Jonathan and I independently started on this task.  This is a first
pass at merging those efforts.  So there are several places that need
refactoring and cleanup.  In particular, the test cases should be
squashed and new tests added.


Thanks. What are your future plans with this patch set? In particular, the
tests don't pass at HEAD^.


Patch 14/14 fixed 2 existing tests.  I think I want to merge that with
patch 2/14 as part of the cleanup.

Bigger picture, I would like squash all this down.  But first I wanted
you to see if there was anything I missed during the merge.


I took a quick glance to see if there were any issues that I could
immediately spot, but couldn't find any. I thought of fetch_if_missing,
but it seems that it is indeed used in this patch set (as expected).

I'll look at it more thoroughly, and feel free to let me know if there is
anything in particular you would like comments on.



Thanks, will do.
Jeff



Re: [PATCH 1/9] extension.partialclone: introduce partial clone extension

2017-11-03 Thread Jeff Hostetler



On 11/2/2017 6:24 PM, Jonathan Tan wrote:

On Thu,  2 Nov 2017 20:20:44 +
Jeff Hostetler <g...@jeffhostetler.com> wrote:


From: Jeff Hostetler <jeffh...@microsoft.com>

Introduce the ability to have missing objects in a repo.  This
functionality is guarded by new repository extension options:
 `extensions.partialcloneremote` and
 `extensions.partialclonefilter`.


With this, it is unclear what happens if extensions.partialcloneremote
is not set but extensions.partialclonefilter is set. For something as
significant as a repository extension (which Git uses to determine if it
will even attempt to interact with a repo), I think I would prefer
just extensions.partialclone (or extensions.partialcloneremote, though I
prefer the former), which determines the remote (the important part,
which controls the dynamic object fetching), and have another option
"core.partialclonefilter" which is only useful if
"extensions.partialclone" is set.


Yes, that is a point I wanted to ask about.  I renamed the
extensions.partialclone that you created and then I moved your
remote.<name>.blob-max-bytes setting to be in extensions too.
Moving it to core.partialclonefilter is fine.
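
For concreteness, a sketch of the two config layouts being discussed
(key names as proposed in this thread; none of this is final):

    # layout currently in this series
    git config extensions.partialcloneremote origin
    git config extensions.partialclonefilter blobs:limit=0

    # layout suggested above: only the remote is a repository
    # extension; the default filter is an ordinary config key
    git config extensions.partialclone origin
    git config core.partialclonefilter blobs:limit=0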



I also don't think extensions.partialclonefilter (or
core.partialclonefilter, if we use my suggestion) needs to be introduced
so early in the patch set when it will only be used once we start
fetching/cloning.


+void partial_clone_utils_register(
+   const struct list_objects_filter_options *filter_options,
+   const char *remote,
+   const char *cmd_name)
+{


This function is useful once we have fetch/clone, but probably not
before that. Since the fetch/clone patches are several patches ahead,
could this be moved there?


Sure.




@@ -420,6 +420,19 @@ static int check_repo_format(const char *var, const char 
*value, void *vdata)
;
else if (!strcmp(ext, "preciousobjects"))
data->precious_objects = git_config_bool(var, value);
+
+   else if (!strcmp(ext, KEY_PARTIALCLONEREMOTE))
+   if (!value)
+   return config_error_nonbool(var);
+   else
+   data->partial_clone_remote = xstrdup(value);
+
+   else if (!strcmp(ext, KEY_PARTIALCLONEFILTER))
+   if (!value)
+   return config_error_nonbool(var);
+   else
+   data->partial_clone_filter = xstrdup(value);
+
else
string_list_append(&data->unknown_extensions, ext);
} else if (strcmp(var, "core.bare") == 0) {


With a complicated block, probably better to use braces in these
clauses.



Good point.

Thanks,
Jeff



Re: [PATCH v2 0/6] Partial clone part 1: object filtering

2017-11-03 Thread Jeff Hostetler



On 11/2/2017 3:44 PM, Jonathan Tan wrote:

On Thu,  2 Nov 2017 17:50:07 +
Jeff Hostetler <g...@jeffhostetler.com> wrote:


From: Jeff Hostetler <jeffh...@microsoft.com>

Here is V2 of the list-object filtering. It replaces [1]
and reflect a refactoring and simplification of the original.


Thanks, overall this looks quite good. I reviewed patches 2-6 (skipping
1 since it's already in next), made my comments on 4, and don't have any
for the rest (besides what's below).


I've added a "--filter-ignore-missing" parameter to rev-list and
pack-objects to ignore missing objects rather than error out.
This allows this patch series to better stand on its own and
eliminates the need in part 1 for "patch 9" from V1.

This is a brute-force "ignore all missing objects" option.  Later, in part
2 or part 3 when --exclude-promisor-objects is introduced, we will
be able to ignore EXPECTED missing objects.


(This is regarding patches 5 and 6.) Is the intention to support both
flags? (That is, --ignore-missing to ignore without checking whether the
object being missing is not unexpected, and --exclude-promisor-objects
to check and ignore.)



Yes, I thought we should have both (perhaps renamed or combined
into 1 parameter with value, such as --exclude=missing vs --exclude=promisor)
and let the user decide how strict they want to be.

Jeff



Re: [PATCH v2 4/6] list-objects: filter objects in traverse_commit_list

2017-11-03 Thread Jeff Hostetler



On 11/3/2017 7:54 AM, Johannes Schindelin wrote:

Hi Jonathan,

On Thu, 2 Nov 2017, Jonathan Tan wrote:


On Thu,  2 Nov 2017 17:50:11 +
Jeff Hostetler <g...@jeffhostetler.com> wrote:


+int parse_list_objects_filter(struct list_objects_filter_options 
*filter_options,
+ const char *arg)


Returning void is fine, I think. It seems that all your code paths
either return 0 or die.


Can we please start to encourage libified code, rather than discourage it?



I did that so that I could call it from the opt_parse_... version below
it that is used by the OPT_ macros.

And Johannes is right, it bothers me that there doesn't seem to be a hard
line where one should or should not call die() vs returning an error code.

Jeff



[PATCH 01/14] upload-pack: add object filtering for partial clone

2017-11-02 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Teach upload-pack to negotiate object filtering over the protocol and
to send filter parameters to pack-objects.  This is intended for partial
clone and fetch.

The idea to make upload-pack configurable using uploadpack.allowFilter
comes from Jonathan Tan's work in [1].

[1] 
https://public-inbox.org/git/f211093280b422c32cc1b7034130072f35c5ed51.1506714999.git.jonathanta...@google.com/

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 Documentation/config.txt  |  4 
 Documentation/technical/pack-protocol.txt |  8 
 Documentation/technical/protocol-capabilities.txt |  8 
 upload-pack.c | 20 +++-
 4 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 1ac0ae6..e528210 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -3268,6 +3268,10 @@ uploadpack.packObjectsHook::
was run. I.e., `upload-pack` will feed input intended for
`pack-objects` to the hook, and expects a completed packfile on
stdout.
+
+uploadpack.allowFilter::
+   If this option is set, `upload-pack` will advertise partial
+   clone and partial fetch object filtering.
 +
 Note that this configuration variable is ignored if it is seen in the
 repository-level config (this is a safety measure against fetching from
diff --git a/Documentation/technical/pack-protocol.txt 
b/Documentation/technical/pack-protocol.txt
index ed1eae8..a43a113 100644
--- a/Documentation/technical/pack-protocol.txt
+++ b/Documentation/technical/pack-protocol.txt
@@ -212,6 +212,7 @@ out of what the server said it could do with the first 
'want' line.
   upload-request=  want-list
   *shallow-line
   *1depth-request
+  [filter-request]
   flush-pkt
 
   want-list =  first-want
@@ -227,6 +228,8 @@ out of what the server said it could do with the first 
'want' line.
   additional-want   =  PKT-LINE("want" SP obj-id)
 
   depth =  1*DIGIT
+
+  filter-request=  PKT-LINE("filter" SP filter-spec)
 
 
 Clients MUST send all the obj-ids it wants from the reference
@@ -249,6 +252,11 @@ complete those commits. Commits whose parents are not 
received as a
 result are defined as shallow and marked as such in the server. This
 information is sent back to the client in the next step.
 
+The client can optionally request that pack-objects omit various
+objects from the packfile using one of several filtering techniques.
+These are intended for use with partial clone and partial fetch
+operations.  See `rev-list` for possible "filter-spec" values.
+
 Once all the 'want's and 'shallow's (and optional 'deepen') are
 transferred, clients MUST send a flush-pkt, to tell the server side
 that it is done sending the list.
diff --git a/Documentation/technical/protocol-capabilities.txt 
b/Documentation/technical/protocol-capabilities.txt
index 26dcc6f..332d209 100644
--- a/Documentation/technical/protocol-capabilities.txt
+++ b/Documentation/technical/protocol-capabilities.txt
@@ -309,3 +309,11 @@ to accept a signed push certificate, and asks the  
to be
 included in the push certificate.  A send-pack client MUST NOT
 send a push-cert packet unless the receive-pack server advertises
 this capability.
+
+filter
+--
+
+If the upload-pack server advertises the 'filter' capability,
+fetch-pack may send "filter" commands to request a partial clone
+or partial fetch and request that the server omit various objects
+from the packfile.
diff --git a/upload-pack.c b/upload-pack.c
index e25f725..64a57a4 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -10,6 +10,8 @@
 #include "diff.h"
 #include "revision.h"
 #include "list-objects.h"
+#include "list-objects-filter.h"
+#include "list-objects-filter-options.h"
 #include "run-command.h"
 #include "connect.h"
 #include "sigchain.h"
@@ -64,6 +66,10 @@ static int advertise_refs;
 static int stateless_rpc;
 static const char *pack_objects_hook;
 
+static int filter_capability_requested;
+static int filter_advertise;
+static struct list_objects_filter_options filter_options;
+
 static void reset_timeout(void)
 {
alarm(timeout);
@@ -131,6 +137,7 @@ static void create_pack_file(void)
argv_array_push(&pack_objects.args, "--delta-base-offset");
if (use_include_tag)
argv_array_push(&pack_objects.args, "--include-tag");
+   arg_format_list_objects_filter(&pack_objects.args, &filter_options);
 
pack_objects.in = -1;
pack_objects.out = -1;
@@ -794,6 +801,12 @@ static void receive_needs(void)
deepen_rev_list = 1;
continue;
}
+  

[PATCH 07/14] fetch-pack: test support excluding large blobs

2017-11-02 Thread Jeff Hostetler
From: Jonathan Tan <jonathanta...@google.com>

Created tests to verify fetch-pack and upload-pack support
for excluding large blobs using --filter=blobs:limit=
parameter.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 t/t5500-fetch-pack.sh | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/t/t5500-fetch-pack.sh b/t/t5500-fetch-pack.sh
index 80a1a32..fdb98a8 100755
--- a/t/t5500-fetch-pack.sh
+++ b/t/t5500-fetch-pack.sh
@@ -755,4 +755,31 @@ test_expect_success 'fetching deepen' '
)
 '
 
+test_expect_success 'filtering by size' '
+   rm -rf server client &&
+   test_create_repo server &&
+   test_commit -C server one &&
+   test_config -C server uploadpack.allowfilter 1 &&
+
+   test_create_repo client &&
+   git -C client fetch-pack --filter=blobs:limit=0 ../server HEAD &&
+
+   # Ensure that object is not inadvertently fetched
+   test_must_fail git -C client cat-file -e $(git hash-object server/one.t)
+'
+
+test_expect_success 'filtering by size has no effect if support for it is not 
advertised' '
+   rm -rf server client &&
+   test_create_repo server &&
+   test_commit -C server one &&
+
+   test_create_repo client &&
+   git -C client fetch-pack --filter=blobs:limit=0 ../server HEAD 2> err &&
+
+   # Ensure that object is fetched
+   git -C client cat-file -e $(git hash-object server/one.t) &&
+
+   test_i18ngrep "filtering not recognized by server" err
+'
+
 test_done
-- 
2.9.3



[PATCH 03/14] fetch: refactor calculation of remote list

2017-11-02 Thread Jeff Hostetler
From: Jonathan Tan 

Separate out the calculation of remotes to be fetched from and the
actual fetching. This will allow us to include an additional step before
the actual fetching in a subsequent commit.

Signed-off-by: Jonathan Tan 
---
 builtin/fetch.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 225c734..1b1f039 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1322,7 +1322,7 @@ int cmd_fetch(int argc, const char **argv, const char 
*prefix)
 {
int i;
struct string_list list = STRING_LIST_INIT_DUP;
-   struct remote *remote;
+   struct remote *remote = NULL;
int result = 0;
struct argv_array argv_gc_auto = ARGV_ARRAY_INIT;
 
@@ -1367,17 +1367,14 @@ int cmd_fetch(int argc, const char **argv, const char 
*prefix)
else if (argc > 1)
die(_("fetch --all does not make sense with refspecs"));
(void) for_each_remote(get_one_remote_for_fetch, &list);
-   result = fetch_multiple(&list);
} else if (argc == 0) {
/* No arguments -- use default remote */
remote = remote_get(NULL);
-   result = fetch_one(remote, argc, argv);
} else if (multiple) {
/* All arguments are assumed to be remotes or groups */
for (i = 0; i < argc; i++)
if (!add_remote_or_group(argv[i], &list))
die(_("No such remote or remote group: %s"),
argv[i]);
-   result = fetch_multiple(&list);
} else {
/* Single remote or group */
(void) add_remote_or_group(argv[0], &list);
@@ -1385,14 +1382,19 @@ int cmd_fetch(int argc, const char **argv, const char 
*prefix)
/* More than one remote */
if (argc > 1)
die(_("Fetching a group and specifying refspecs 
does not make sense"));
-   result = fetch_multiple(&list);
} else {
/* Zero or one remotes */
remote = remote_get(argv[0]);
-   result = fetch_one(remote, argc-1, argv+1);
+   argc--;
+   argv++;
}
}
 
+   if (remote)
+   result = fetch_one(remote, argc, argv);
+   else
+   result = fetch_multiple(&list);
+
if (!result && (recurse_submodules != RECURSE_SUBMODULES_OFF)) {
struct argv_array options = ARGV_ARRAY_INIT;
 
-- 
2.9.3



[PATCH 13/14] fetch-pack: restore save_commit_buffer after use

2017-11-02 Thread Jeff Hostetler
From: Jonathan Tan <jonathanta...@google.com>

In fetch-pack, the global variable save_commit_buffer is set to 0, but
not restored to its original value after use.

In particular, if show_log() (in log-tree.c) is invoked after
fetch_pack() in the same process, show_log() will return before printing
out the commit message (because the invocation to
get_cached_commit_buffer() returns NULL, because the commit buffer was
not saved). I discovered this when attempting to run "git log -S" in a
partial clone, triggering the case where revision walking lazily loads
missing objects.

Therefore, restore save_commit_buffer to its original value after use.

An alternative to solve the problem I had is to replace
get_cached_commit_buffer() with get_commit_buffer(). That invocation was
introduced in commit a97934d ("use get_cached_commit_buffer where
appropriate", 2014-06-13) to replace "commit->buffer" introduced in
commit 3131b71 ("Add "--show-all" revision walker flag for debugging",
2008-02-13). In the latter commit, the commit author seems to be
deciding between not showing an unparsed commit at all and showing an
unparsed commit without the message (which is what the commit does), and
did not mention parsing the unparsed commit, so I prefer to preserve the
existing behavior.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 fetch-pack.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fetch-pack.c b/fetch-pack.c
index 895e8f9..121f03e 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -717,6 +717,7 @@ static int everything_local(struct fetch_pack_args *args,
 {
struct ref *ref;
int retval;
+   int old_save_commit_buffer = save_commit_buffer;
timestamp_t cutoff = 0;
 
save_commit_buffer = 0;
@@ -784,6 +785,9 @@ static int everything_local(struct fetch_pack_args *args,
print_verbose(args, _("already have %s (%s)"), 
oid_to_hex(remote),
  ref->name);
}
+
+   save_commit_buffer = old_save_commit_buffer;
+
return retval;
 }
 
-- 
2.9.3



[PATCH 06/14] pack-objects: test support for blob filtering

2017-11-02 Thread Jeff Hostetler
From: Jonathan Tan <jonathanta...@google.com>

As part of an effort to improve Git support for very large repositories
in which clients typically have only a subset of all version-controlled
blobs, test pack-objects support for --filter=blobs:limit=, packing only
blobs not exceeding that size unless the blob corresponds to a file
whose name starts with ".git". upload-pack will eventually be taught to
use this new parameter if needed to exclude certain blobs during a fetch
or clone, potentially drastically reducing network consumption when
serving these very large repositories.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 t/t5300-pack-object.sh  | 45 +
 t/test-lib-functions.sh | 12 
 2 files changed, 57 insertions(+)

diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 9c68b99..0739a07 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -457,6 +457,51 @@ test_expect_success !PTHREADS,C_LOCALE_OUTPUT 
'pack-objects --threads=N or pack.
grep -F "no threads support, ignoring pack.threads" err
 '
 
+lcut () {
+   perl -e '$/ = undef; $_ = <>; s/^.{'$1'}//s; print $_'
+}
+
+test_expect_success 'filtering by size works with multiple excluded' '
+   rm -rf server &&
+   git init server &&
+   printf a > server/a &&
+   printf b > server/b &&
+   printf c-very-long-file > server/c &&
+   printf d-very-long-file > server/d &&
+   git -C server add a b c d &&
+   git -C server commit -m x &&
+
+   git -C server rev-parse HEAD >objects &&
+   git -C server pack-objects --revs --stdout --filter=blobs:limit=10 <objects >my.pack &&
+
+   # Ensure that only the small blobs are in the packfile
+   git index-pack my.pack &&
+   git verify-pack -v my.idx >objectlist &&
+   grep $(git hash-object server/a) objectlist &&
+   grep $(git hash-object server/b) objectlist &&
+   ! grep $(git hash-object server/c) objectlist &&
+   ! grep $(git hash-object server/d) objectlist
+'
+
+test_expect_success 'filtering by size never excludes special files' '
+   rm -rf server &&
+   git init server &&
+   printf a-very-long-file > server/a &&
+   printf a-very-long-file > server/.git-a &&
+   printf b-very-long-file > server/b &&
+   git -C server add a .git-a b &&
+   git -C server commit -m x &&
+
+   git -C server rev-parse HEAD >objects &&
+   git -C server pack-objects --revs --stdout --filter=blobs:limit=10 <objects >my.pack &&
+
+   # Ensure that the .git-a blob is in the packfile, despite also
+   # appearing as a non-.git file
+   git index-pack my.pack &&
+   git verify-pack -v my.idx >objectlist &&
+   grep $(git hash-object server/a) objectlist
+'
+
 #
 # WARNING!
 #
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index 1701fe2..07b79c7 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -1020,3 +1020,15 @@ nongit () {
"$@"
)
 }
+
+# Converts big-endian pairs of hexadecimal digits into bytes. For example,
+# "printf 61620d0a | hex_pack" results in "ab\r\n".
+hex_pack () {
+   perl -e '$/ = undef; $input = <>; print pack("H*", $input)'
+}
+
+# Converts bytes into big-endian pairs of hexadecimal digits. For example,
+# "printf 'ab\r\n' | hex_unpack" results in "61620d0a".
+hex_unpack () {
+   perl -e '$/ = undef; $input = <>; print unpack("H2" x length($input), 
$input)'
+}
-- 
2.9.3



[PATCH 09/14] t5500: add fetch-pack tests for partial clone

2017-11-02 Thread Jeff Hostetler
From: Jonathan Tan <jonathanta...@google.com>

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 t/t5500-fetch-pack.sh | 36 
 1 file changed, 36 insertions(+)

diff --git a/t/t5500-fetch-pack.sh b/t/t5500-fetch-pack.sh
index fdb98a8..7c8339f 100755
--- a/t/t5500-fetch-pack.sh
+++ b/t/t5500-fetch-pack.sh
@@ -782,4 +782,40 @@ test_expect_success 'filtering by size has no effect if 
support for it is not ad
test_i18ngrep "filtering not recognized by server" err
 '
 
+fetch_blob_max_bytes () {
+ SERVER="$1"
+ URL="$2"
+
+   rm -rf "$SERVER" client &&
+   test_create_repo "$SERVER" &&
+   test_commit -C "$SERVER" one &&
+   test_config -C "$SERVER" uploadpack.allowfilter 1 &&
+
+   git clone "$URL" client &&
+   test_config -C client extensions.partialcloneremote origin &&
+
+   test_commit -C "$SERVER" two &&
+
+   git -C client fetch --filter=blobs:limit=0 origin HEAD:somewhere &&
+
+   # Ensure that commit is fetched, but blob is not
+   test_config -C client extensions.partialcloneremote "arbitrary string" 
&&
+   git -C client cat-file -e $(git -C "$SERVER" rev-parse two) &&
+   test_must_fail git -C client cat-file -e $(git hash-object 
"$SERVER/two.t")
+}
+
+test_expect_success 'fetch with filtering' '
+fetch_blob_max_bytes server server
+'
+
+. "$TEST_DIRECTORY"/lib-httpd.sh
+start_httpd
+
+test_expect_success 'fetch with filtering and HTTP' '
+fetch_blob_max_bytes "$HTTPD_DOCUMENT_ROOT_PATH/server" 
"$HTTPD_URL/smart/server"
+'
+
+stop_httpd
+
+
 test_done
-- 
2.9.3



[PATCH 12/14] unpack-trees: batch fetching of missing blobs

2017-11-02 Thread Jeff Hostetler
From: Jonathan Tan <jonathanta...@google.com>

When running checkout, first prefetch all blobs that are to be updated
but are missing. This means that only one pack is downloaded during such
operations, instead of one per missing blob.

This operates only on the blob level - if a repository has missing
trees, they are still fetched one at a time.

This does not use the delayed checkout mechanism introduced in commit
2841e8f ("convert: add "status=delayed" to filter process protocol",
2017-06-30) due to significant conceptual differences - in particular,
for partial clones, we already know what needs to be fetched based on
the contents of the local repo alone, whereas for status=delayed, it is
the filter process that tells us what needs to be checked in the end.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 fetch-object.c   | 27 +++
 fetch-object.h   |  5 +
 t/t5601-clone.sh | 52 
 unpack-trees.c   | 22 ++
 4 files changed, 102 insertions(+), 4 deletions(-)

diff --git a/fetch-object.c b/fetch-object.c
index 369b61c..21b4dfa 100644
--- a/fetch-object.c
+++ b/fetch-object.c
@@ -3,12 +3,12 @@
 #include "pkt-line.h"
 #include "strbuf.h"
 #include "transport.h"
+#include "fetch-object.h"
 
-void fetch_object(const char *remote_name, const unsigned char *sha1)
+static void fetch_refs(const char *remote_name, struct ref *ref)
 {
struct remote *remote;
struct transport *transport;
-   struct ref *ref;
int original_fetch_if_missing = fetch_if_missing;
 
fetch_if_missing = 0;
@@ -17,10 +17,29 @@ void fetch_object(const char *remote_name, const unsigned 
char *sha1)
die(_("Remote with no URL"));
transport = transport_get(remote, remote->url[0]);
 
-   ref = alloc_ref(sha1_to_hex(sha1));
-   hashcpy(ref->old_oid.hash, sha1);
transport_set_option(transport, TRANS_OPT_FROM_PROMISOR, "1");
transport_set_option(transport, TRANS_OPT_NO_HAVES, "1");
transport_fetch_refs(transport, ref);
fetch_if_missing = original_fetch_if_missing;
 }
+
+void fetch_object(const char *remote_name, const unsigned char *sha1)
+{
+   struct ref *ref = alloc_ref(sha1_to_hex(sha1));
+   hashcpy(ref->old_oid.hash, sha1);
+   fetch_refs(remote_name, ref);
+}
+
+void fetch_objects(const char *remote_name, const struct oid_array *to_fetch)
+{
+   struct ref *ref = NULL;
+   int i;
+
+   for (i = 0; i < to_fetch->nr; i++) {
+   struct ref *new_ref = alloc_ref(oid_to_hex(&to_fetch->oid[i]));
+   oidcpy(&new_ref->old_oid, &to_fetch->oid[i]);
+   new_ref->next = ref;
+   ref = new_ref;
+   }
+   fetch_refs(remote_name, ref);
+}
diff --git a/fetch-object.h b/fetch-object.h
index f371300..4b269d0 100644
--- a/fetch-object.h
+++ b/fetch-object.h
@@ -1,6 +1,11 @@
 #ifndef FETCH_OBJECT_H
 #define FETCH_OBJECT_H
 
+#include "sha1-array.h"
+
 extern void fetch_object(const char *remote_name, const unsigned char *sha1);
 
+extern void fetch_objects(const char *remote_name,
+ const struct oid_array *to_fetch);
+
 #endif
diff --git a/t/t5601-clone.sh b/t/t5601-clone.sh
index 567161e..3211f86 100755
--- a/t/t5601-clone.sh
+++ b/t/t5601-clone.sh
@@ -611,6 +611,58 @@ test_expect_success 'partial clone: warn if server does 
not support object filte
test_i18ngrep "filtering not recognized by server" err
 '
 
+test_expect_success 'batch missing blob request during checkout' '
+   rm -rf server client &&
+
+   test_create_repo server &&
+   echo a >server/a &&
+   echo b >server/b &&
+   git -C server add a b &&
+
+   git -C server commit -m x &&
+   echo aa >server/a &&
+   echo bb >server/b &&
+   git -C server add a b &&
+   git -C server commit -m x &&
+
+   test_config -C server uploadpack.allowfilter 1 &&
+   test_config -C server uploadpack.allowanysha1inwant 1 &&
+
+   git clone --filter=blobs:limit=0 "file://$(pwd)/server" client &&
+
+   # Ensure that there is only one negotiation by checking that there is
+   # only "done" line sent. ("done" marks the end of negotiation.)
+   GIT_TRACE_PACKET="$(pwd)/trace" git -C client checkout HEAD^ &&
+   grep "git> done" trace >done_lines &&
+   test_line_count = 1 done_lines
+'
+
+test_expect_success 'batch missing blob request does not inadvertently try to 
fetch gitlinks' '
+   rm -rf server client &&
+
+   test_create_repo repo_

[PATCH 14/14] index-pack: silently assume missing objects are promisor

2017-11-02 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Teach index-pack to not complain about missing objects
when the --promisor flag is given.  The assumption is that
index-pack is currently building the idx and promisor data
and the is_promisor_object() query would fail anyway.

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 builtin/index-pack.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 31cd5ba..51693dc 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -82,6 +82,7 @@ static int verbose;
 static int show_resolving_progress;
 static int show_stat;
 static int check_self_contained_and_connected;
+static int arg_promisor_given;
 
 static struct progress *progress;
 
@@ -223,10 +224,11 @@ static unsigned check_object(struct object *obj)
unsigned long size;
int type = sha1_object_info(obj->oid.hash, &size);
 
-   if (type <= 0) {
+   if (type <= 0 && arg_promisor_given) {
/*
-* TODO Use the promisor code to conditionally
-* try to fetch this object -or- assume it is ok.
+* Assume this missing object is promised.  We can't
+* confirm it because we are indexing the packfile
+* that omitted it.
 */
obj->flags |= FLAG_CHECKED;
return 0;
@@ -1717,8 +1719,10 @@ int cmd_index_pack(int argc, const char **argv, const 
char *prefix)
keep_msg = arg + 7;
} else if (!strcmp(arg, "--promisor")) {
promisor_msg = "";
+   arg_promisor_given = 1;
} else if (starts_with(arg, "--promisor=")) {
promisor_msg = arg + strlen("--promisor=");
+   arg_promisor_given = 1;
} else if (starts_with(arg, "--threads=")) {
char *end;
nr_threads = strtoul(arg+10, &end, 0);
-- 
2.9.3
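
As a rough usage sketch (in the series it is fetch-pack that passes
--promisor to index-pack; running it by hand as below only illustrates
the flag, and "partial.pack" is a made-up file name):

    # index a pack from a promisor remote that deliberately omits
    # filtered-out objects; missing objects are assumed to be promised
    git index-pack --promisor --stdin <partial.pack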



[PATCH 11/14] t5500: more tests for partial clone and fetch

2017-11-02 Thread Jeff Hostetler
From: Jonathan Tan <jonathanta...@google.com>

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 t/t5500-fetch-pack.sh | 60 +++
 1 file changed, 56 insertions(+), 4 deletions(-)

diff --git a/t/t5500-fetch-pack.sh b/t/t5500-fetch-pack.sh
index 7c8339f..86cf653 100755
--- a/t/t5500-fetch-pack.sh
+++ b/t/t5500-fetch-pack.sh
@@ -782,7 +782,7 @@ test_expect_success 'filtering by size has no effect if 
support for it is not ad
test_i18ngrep "filtering not recognized by server" err
 '
 
-fetch_blob_max_bytes () {
+setup_blob_max_bytes () {
  SERVER="$1"
  URL="$2"
 
@@ -794,7 +794,11 @@ fetch_blob_max_bytes () {
git clone "$URL" client &&
test_config -C client extensions.partialcloneremote origin &&
 
-   test_commit -C "$SERVER" two &&
+   test_commit -C "$SERVER" two
+}
+
+do_blob_max_bytes() {
+   SERVER="$1" &&
 
git -C client fetch --filter=blobs:limit=0 origin HEAD:somewhere &&
 
@@ -805,14 +809,62 @@ fetch_blob_max_bytes () {
 }
 
 test_expect_success 'fetch with filtering' '
-fetch_blob_max_bytes server server
+   setup_blob_max_bytes server server &&
+   do_blob_max_bytes server
+'
+
+test_expect_success 'fetch respects configured filtering' '
+   setup_blob_max_bytes server server &&
+
+   test_config -C client extensions.partialclonefilter blobs:limit=0 &&
+
+   git -C client fetch origin HEAD:somewhere &&
+
+   # Ensure that commit is fetched, but blob is not
+   test_config -C client extensions.partialcloneremote "arbitrary string" 
&&
+   git -C client cat-file -e $(git -C server rev-parse two) &&
+   test_must_fail git -C client cat-file -e $(git hash-object server/two.t)
+'
+
+test_expect_success 'pull respects configured filtering' '
+   setup_blob_max_bytes server server &&
+
+   # Hide two.t from tip so that client does not load it upon the
+   # automatic checkout that pull performs
+   git -C server rm two.t &&
+   test_commit -C server three &&
+
+   test_config -C server uploadpack.allowanysha1inwant 1 &&
+   test_config -C client extensions.partialclonefilter blobs:limit=0 &&
+
+   git -C client pull origin &&
+
+   # Ensure that commit is fetched, but blob is not
+   test_config -C client extensions.partialcloneremote "arbitrary string" 
&&
+   git -C client cat-file -e $(git -C server rev-parse two) &&
+   test_must_fail git -C client cat-file -e $(git hash-object server/two.t)
+'
+
+test_expect_success 'clone configures filtering' '
+   rm -rf server client &&
+   test_create_repo server &&
+   test_commit -C server one &&
+   test_commit -C server two &&
+   test_config -C server uploadpack.allowanysha1inwant 1 &&
+
+   git clone --filter=blobs:limit=12345 server client &&
+
+   # Ensure that we can, for example, checkout HEAD^
+   rm -rf client/.git/objects/* &&
+   git -C client checkout HEAD^
 '
 
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
 
 test_expect_success 'fetch with filtering and HTTP' '
-fetch_blob_max_bytes "$HTTPD_DOCUMENT_ROOT_PATH/server" 
"$HTTPD_URL/smart/server"
+   setup_blob_max_bytes "$HTTPD_DOCUMENT_ROOT_PATH/server" 
"$HTTPD_URL/smart/server" &&
+   do_blob_max_bytes "$HTTPD_DOCUMENT_ROOT_PATH/server"
 '
 
 stop_httpd
-- 
2.9.3



[PATCH 00/14] WIP Partial clone part 3: clone, fetch, fetch-pack, upload-pack, and tests

2017-11-02 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

This is part 3 of 3 for partial clone.
It assumes that part 1 [1] and part 2 [2] are in place.

Part 3 is concerned with the commands: clone, fetch, upload-pack, fetch-pack,
remote-curl, index-pack, and the pack-protocol.

Jonathan and I independently started on this task.  This is a first
pass at merging those efforts.  So there are several places that need
refactoring and cleanup.  In particular, the test cases should be
squashed and new tests added.

[1] 
https://public-inbox.org/git/20171102124445.fbffd43521cd35f6a71e1...@google.com/T/
[2] TODO


Jeff Hostetler (5):
  upload-pack: add object filtering for partial clone
  clone, fetch-pack, index-pack, transport: partial clone
  fetch: add object filtering for partial fetch
  remote-curl: add object filtering for partial clone
  index-pack: silently assume missing objects are promisor

Jonathan Tan (9):
  fetch: refactor calculation of remote list
  pack-objects: test support for blob filtering
  fetch-pack: test support excluding large blobs
  fetch: add from_promisor and exclude-promisor-objects parameters
  t5500: add fetch-pack tests for partial clone
  t5601: test for partial clone
  t5500: more tests for partial clone and fetch
  unpack-trees: batch fetching of missing blobs
  fetch-pack: restore save_commit_buffer after use

 Documentation/config.txt  |   4 +
 Documentation/gitremote-helpers.txt   |   4 +
 Documentation/technical/pack-protocol.txt |   8 ++
 Documentation/technical/protocol-capabilities.txt |   8 ++
 builtin/clone.c   |  24 -
 builtin/fetch-pack.c  |   4 +
 builtin/fetch.c   |  83 ++--
 builtin/index-pack.c  |  14 +++
 connected.c   |   3 +
 fetch-object.c|  27 -
 fetch-object.h|   5 +
 fetch-pack.c  |  17 
 fetch-pack.h  |   2 +
 remote-curl.c |  10 +-
 t/t5300-pack-object.sh|  45 +
 t/t5500-fetch-pack.sh | 115 ++
 t/t5601-clone.sh  | 101 +++
 t/test-lib-functions.sh   |  12 +++
 transport-helper.c|   5 +
 transport.c   |   4 +
 transport.h   |   5 +
 unpack-trees.c|  22 +
 upload-pack.c |  20 +++-
 23 files changed, 526 insertions(+), 16 deletions(-)

-- 
2.9.3
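
A rough end-to-end sketch of the flow this series aims to enable, using
the option and config names exactly as they appear in these patches
(subject to the renames discussed elsewhere in the thread):

    # server side: opt in to filtering and to serving arbitrary blobs
    git -C server config uploadpack.allowfilter true
    git -C server config uploadpack.allowanysha1inwant true

    # 1. partial clone, omitting all blobs larger than 0 bytes
    git clone --filter=blobs:limit=0 "file://$(pwd)/server" client

    # 2. the clone records the promisor remote and the default filter
    git -C client config extensions.partialcloneremote   # origin
    git -C client config extensions.partialclonefilter   # blobs:limit=0

    # 3. later fetches reuse the filter; checkout fetches any missing
    #    blobs on demand from the promisor remote
    git -C client fetch origin
    git -C client checkout HEAD^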



[PATCH 02/14] clone, fetch-pack, index-pack, transport: partial clone

2017-11-02 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 builtin/clone.c  |  9 +
 builtin/fetch-pack.c |  4 
 builtin/index-pack.c | 10 ++
 fetch-pack.c | 13 +
 fetch-pack.h |  2 ++
 transport-helper.c   |  5 +
 transport.c  |  4 
 transport.h  |  5 +
 8 files changed, 52 insertions(+)

diff --git a/builtin/clone.c b/builtin/clone.c
index dbddd98..fceb9e7 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -26,6 +26,7 @@
 #include "run-command.h"
 #include "connected.h"
 #include "packfile.h"
+#include "list-objects-filter-options.h"
 
 /*
  * Overall FIXMEs:
@@ -60,6 +61,7 @@ static struct string_list option_optional_reference = 
STRING_LIST_INIT_NODUP;
 static int option_dissociate;
 static int max_jobs = -1;
 static struct string_list option_recurse_submodules = STRING_LIST_INIT_NODUP;
+static struct list_objects_filter_options filter_options;
 
 static int recurse_submodules_cb(const struct option *opt,
 const char *arg, int unset)
@@ -135,6 +137,7 @@ static struct option builtin_clone_options[] = {
TRANSPORT_FAMILY_IPV4),
OPT_SET_INT('6', "ipv6", , N_("use IPv6 addresses only"),
TRANSPORT_FAMILY_IPV6),
+   OPT_PARSE_LIST_OBJECTS_FILTER(&filter_options),
OPT_END()
 };
 
@@ -1073,6 +1076,8 @@ int cmd_clone(int argc, const char **argv, const char 
*prefix)
warning(_("--shallow-since is ignored in local clones; 
use file:// instead."));
if (option_not.nr)
warning(_("--shallow-exclude is ignored in local 
clones; use file:// instead."));
+   if (filter_options.choice)
+   warning(_("--filter is ignored in local clones; use 
file:// instead."));
if (!access(mkpath("%s/shallow", path), F_OK)) {
if (option_local > 0)
warning(_("source repository is shallow, 
ignoring --local"));
@@ -1104,6 +1109,10 @@ int cmd_clone(int argc, const char **argv, const char 
*prefix)
transport_set_option(transport, TRANS_OPT_UPLOADPACK,
 option_upload_pack);
 
+   if (filter_options.choice)
+   transport_set_option(transport, TRANS_OPT_LIST_OBJECTS_FILTER,
+filter_options.raw_value);
+
if (transport->smart_options && !deepen)
transport->smart_options->check_self_contained_and_connected = 
1;
 
diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c
index 9a7ebf6..d0fdaa8 100644
--- a/builtin/fetch-pack.c
+++ b/builtin/fetch-pack.c
@@ -153,6 +153,10 @@ int cmd_fetch_pack(int argc, const char **argv, const char 
*prefix)
args.no_haves = 1;
continue;
}
+   if (skip_prefix(arg, ("--" CL_ARG__FILTER "="), )) {
+   parse_list_objects_filter(_options, arg);
+   continue;
+   }
usage(fetch_pack_usage);
}
if (deepen_not.nr)
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index a0a35e6..31cd5ba 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -222,6 +222,16 @@ static unsigned check_object(struct object *obj)
if (!(obj->flags & FLAG_CHECKED)) {
unsigned long size;
int type = sha1_object_info(obj->oid.hash, &size);
+
+   if (type <= 0) {
+   /*
+* TODO Use the promisor code to conditionally
+* try to fetch this object -or- assume it is ok.
+*/
+   obj->flags |= FLAG_CHECKED;
+   return 0;
+   }
+
if (type <= 0)
die(_("did not receive expected object %s"),
  oid_to_hex(>oid));
diff --git a/fetch-pack.c b/fetch-pack.c
index 4640b4e..895e8f9 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -29,6 +29,7 @@ static int deepen_not_ok;
 static int fetch_fsck_objects = -1;
 static int transfer_fsck_objects = -1;
 static int agent_supported;
+static int server_supports_filtering;
 static struct lock_file shallow_lock;
 static const char *alternate_shallow_file;
 
@@ -379,6 +380,8 @@ static int find_common(struct fetch_pack_args *args,
if (deepen_not_ok)  strbuf_addstr(&c, " deepen-not");
if (agent_supported)strbuf_addf(&c, " agent=%s",
git_user_agent_sanitized());

[PATCH 04/14] fetch: add object filtering for partial fetch

2017-11-02 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Teach fetch to use the list-objects filtering parameters
to allow a "partial fetch" following a "partial clone".

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 builtin/fetch.c | 66 -
 1 file changed, 65 insertions(+), 1 deletion(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 1b1f039..150ca0a 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -18,6 +18,7 @@
 #include "argv-array.h"
 #include "utf8.h"
 #include "packfile.h"
+#include "list-objects-filter-options.h"
 
 static const char * const builtin_fetch_usage[] = {
N_("git fetch [] [ [...]]"),
@@ -55,6 +56,7 @@ static int recurse_submodules_default = 
RECURSE_SUBMODULES_ON_DEMAND;
 static int shown_url = 0;
 static int refmap_alloc, refmap_nr;
 static const char **refmap_array;
+static struct list_objects_filter_options filter_options;
 
 static int git_fetch_config(const char *k, const char *v, void *cb)
 {
@@ -160,6 +162,7 @@ static struct option builtin_fetch_options[] = {
TRANSPORT_FAMILY_IPV4),
OPT_SET_INT('6', "ipv6", , N_("use IPv6 addresses only"),
TRANSPORT_FAMILY_IPV6),
+   OPT_PARSE_LIST_OBJECTS_FILTER(&filter_options),
OPT_END()
 };
 
@@ -754,6 +757,7 @@ static int store_updated_refs(const char *raw_url, const 
char *remote_name,
const char *filename = dry_run ? "/dev/null" : git_path_fetch_head();
int want_status;
int summary_width = transport_summary_width(ref_map);
+   struct check_connected_options opt = CHECK_CONNECTED_INIT;
 
fp = fopen(filename, "a");
if (!fp)
@@ -765,7 +769,7 @@ static int store_updated_refs(const char *raw_url, const 
char *remote_name,
url = xstrdup("foreign");
 
rm = ref_map;
-   if (check_connected(iterate_ref_map, &rm, NULL)) {
+   if (check_connected(iterate_ref_map, &rm, &opt)) {
rc = error(_("%s did not send all necessary objects\n"), url);
goto abort;
}
@@ -1044,6 +1048,9 @@ static struct transport *prepare_transport(struct remote 
*remote, int deepen)
set_option(transport, TRANS_OPT_DEEPEN_RELATIVE, "yes");
if (update_shallow)
set_option(transport, TRANS_OPT_UPDATE_SHALLOW, "yes");
+   if (filter_options.choice)
+   set_option(transport, TRANS_OPT_LIST_OBJECTS_FILTER,
+  filter_options.raw_value);
return transport;
 }
 
@@ -1242,6 +1249,20 @@ static int fetch_multiple(struct string_list *list)
int i, result = 0;
struct argv_array argv = ARGV_ARRAY_INIT;
 
+   if (filter_options.choice) {
+   /*
+* We currently only support partial-fetches to the remote
+* used for the partial-clone because we only support 1
+* promisor remote, so we DO NOT allow explicit command
+* line filter arguments.
+*
+* Note that the loop below will spawn background fetches
+* for each remote and one of them MAY INHERIT the proper
+* partial-fetch settings, so everything is consistent.
+*/
+   die(_("partial-fetch is not supported on multiple remotes"));
+   }
+
if (!append && !dry_run) {
int errcode = truncate_fetch_head();
if (errcode)
@@ -1267,6 +1288,46 @@ static int fetch_multiple(struct string_list *list)
return result;
 }
 
+static inline void partial_fetch_one_setup(struct remote *remote)
+{
+#if 0 /* TODO */
+   if (filter_options.choice) {
+   /*
+* A partial-fetch was explicitly requested.
+*
+* If this is the first partial-* command on
+* this repo, we must register the partial
+* settings in the repository extension.
+*
+* If this follows a previous partial-* command
+* we must ensure the args are consistent with
+* the existing registration (because we don't
+* currently support mixing-and-matching).
+*/
+   partial_clone_utils_register(&filter_options,
+remote->name, "fetch");
+   return;
+   }
+
+   if (is_partial_clone_registered() &&
+   !strcmp(remote->name, repository_format_partial_clone_remote)) {
+   /*
+* If a partial-* command has already been used on
+* this repo and it was to this remote, we should
+* inherit the filter settings used previously.
+  
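
To make the fetch_multiple() restriction above concrete (behavior as
implemented in this patch; the error text is taken from the hunk above):

    # allowed: partial fetch from the single (promisor) remote
    git fetch --filter=blobs:limit=0 origin

    # rejected: an explicit filter combined with multiple remotes
    git fetch --all --filter=blobs:limit=0
    # dies with "partial-fetch is not supported on multiple remotes"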

[PATCH 08/14] fetch: add from_promisor and exclude-promisor-objects parameters

2017-11-02 Thread Jeff Hostetler
From: Jonathan Tan <jonathanta...@google.com>

Teach fetch to use from-promisor and exclude-promisor-objects
parameters with sub-commands.  Initialize fetch_if_missing
global variable.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 builtin/fetch.c | 9 ++---
 connected.c | 3 +++
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 150ca0a..ab53df3 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -19,6 +19,7 @@
 #include "utf8.h"
 #include "packfile.h"
 #include "list-objects-filter-options.h"
+#include "partial-clone-utils.h"
 
 static const char * const builtin_fetch_usage[] = {
N_("git fetch [] [ [...]]"),
@@ -1048,9 +1049,11 @@ static struct transport *prepare_transport(struct remote 
*remote, int deepen)
set_option(transport, TRANS_OPT_DEEPEN_RELATIVE, "yes");
if (update_shallow)
set_option(transport, TRANS_OPT_UPDATE_SHALLOW, "yes");
-   if (filter_options.choice)
+   if (filter_options.choice) {
set_option(transport, TRANS_OPT_LIST_OBJECTS_FILTER,
   filter_options.raw_value);
+   set_option(transport, TRANS_OPT_FROM_PROMISOR, "1");
+   }
return transport;
 }
 
@@ -1290,7 +1293,6 @@ static int fetch_multiple(struct string_list *list)
 
 static inline void partial_fetch_one_setup(struct remote *remote)
 {
-#if 0 /* TODO */
if (filter_options.choice) {
/*
 * A partial-fetch was explicitly requested.
@@ -1325,7 +1327,6 @@ static inline void partial_fetch_one_setup(struct remote 
*remote)
_options,
repository_format_partial_clone_filter);
}
-#endif
 }
 
 static int fetch_one(struct remote *remote, int argc, const char **argv)
@@ -1392,6 +1393,8 @@ int cmd_fetch(int argc, const char **argv, const char 
*prefix)
 
packet_trace_identity("fetch");
 
+   fetch_if_missing = 0;
+
/* Record the command line for the reflog */
strbuf_addstr(&default_rla, "fetch");
for (i = 1; i < argc; i++)
diff --git a/connected.c b/connected.c
index f416b05..6015316 100644
--- a/connected.c
+++ b/connected.c
@@ -4,6 +4,7 @@
 #include "connected.h"
 #include "transport.h"
 #include "packfile.h"
+#include "partial-clone-utils.h"
 
 /*
  * If we feed all the commits we want to verify to this command
@@ -56,6 +57,8 @@ int check_connected(sha1_iterate_fn fn, void *cb_data,
argv_array_push(&rev_list.args, "rev-list");
argv_array_push(&rev_list.args, "--objects");
argv_array_push(&rev_list.args, "--stdin");
+   if (is_partial_clone_registered())
+   argv_array_push(&rev_list.args, "--exclude-promisor-objects");
argv_array_push(&rev_list.args, "--not");
argv_array_push(&rev_list.args, "--all");
argv_array_push(&rev_list.args, "--quiet");
-- 
2.9.3



[PATCH 10/14] t5601: test for partial clone

2017-11-02 Thread Jeff Hostetler
From: Jonathan Tan <jonathanta...@google.com>

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 builtin/clone.c  | 17 ++---
 t/t5601-clone.sh | 49 +
 2 files changed, 63 insertions(+), 3 deletions(-)

diff --git a/builtin/clone.c b/builtin/clone.c
index fceb9e7..08315d8 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -27,6 +27,7 @@
 #include "connected.h"
 #include "packfile.h"
 #include "list-objects-filter-options.h"
+#include "partial-clone-utils.h"
 
 /*
  * Overall FIXMEs:
@@ -889,6 +890,8 @@ int cmd_clone(int argc, const char **argv, const char 
*prefix)
struct refspec *refspec;
const char *fetch_pattern;
 
+   fetch_if_missing = 0;
+
packet_trace_identity("clone");
argc = parse_options(argc, argv, prefix, builtin_clone_options,
 builtin_clone_usage, 0);
@@ -1109,11 +1112,13 @@ int cmd_clone(int argc, const char **argv, const char 
*prefix)
transport_set_option(transport, TRANS_OPT_UPLOADPACK,
 option_upload_pack);
 
-   if (filter_options.choice)
+   if (filter_options.choice) {
transport_set_option(transport, TRANS_OPT_LIST_OBJECTS_FILTER,
 filter_options.raw_value);
+   transport_set_option(transport, TRANS_OPT_FROM_PROMISOR, "1");
+   }
 
-   if (transport->smart_options && !deepen)
+   if (transport->smart_options && !deepen && !filter_options.choice)
transport->smart_options->check_self_contained_and_connected = 
1;
 
refs = transport_get_remote_refs(transport);
@@ -1173,13 +1178,18 @@ int cmd_clone(int argc, const char **argv, const char 
*prefix)
write_refspec_config(src_ref_prefix, our_head_points_at,
remote_head_points_at, _top);
 
+   if (filter_options.choice)
+   partial_clone_utils_register(&filter_options, "origin",
+"clone");
+
if (is_local)
clone_local(path, git_dir);
else if (refs && complete_refs_before_fetch)
transport_fetch_refs(transport, mapped_refs);
 
update_remote_refs(refs, mapped_refs, remote_head_points_at,
-  branch_top.buf, reflog_msg.buf, transport, 
!is_local);
+  branch_top.buf, reflog_msg.buf, transport,
+  !is_local && !filter_options.choice);
 
update_head(our_head_points_at, remote_head, reflog_msg.buf);
 
@@ -1200,6 +1210,7 @@ int cmd_clone(int argc, const char **argv, const char 
*prefix)
}
 
junk_mode = JUNK_LEAVE_REPO;
+   fetch_if_missing = 1;
err = checkout(submodule_progress);
 
strbuf_release(&reflog_msg);
diff --git a/t/t5601-clone.sh b/t/t5601-clone.sh
index 9c56f77..567161e 100755
--- a/t/t5601-clone.sh
+++ b/t/t5601-clone.sh
@@ -571,4 +571,53 @@ test_expect_success 'GIT_TRACE_PACKFILE produces a usable 
pack' '
git -C replay.git index-pack -v --stdin  err &&
+
+   test_i18ngrep "filtering not recognized by server" err
+'
+
+. "$TEST_DIRECTORY"/lib-httpd.sh
+start_httpd
+
+test_expect_success 'partial clone using HTTP' '
+partial_clone "$HTTPD_DOCUMENT_ROOT_PATH/server" 
"$HTTPD_URL/smart/server"
+'
+
+stop_httpd
+
 test_done
-- 
2.9.3



[PATCH 05/14] remote-curl: add object filtering for partial clone

2017-11-02 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 Documentation/gitremote-helpers.txt |  4 
 remote-curl.c   | 10 --
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/Documentation/gitremote-helpers.txt 
b/Documentation/gitremote-helpers.txt
index 6da3f41..d46d561 100644
--- a/Documentation/gitremote-helpers.txt
+++ b/Documentation/gitremote-helpers.txt
@@ -468,6 +468,10 @@ set by Git if the remote helper has the 'option' 
capability.
 
 TODO document 'option from-promisor' and 'option no-haves' ?
 
+'option filter '::
+   An object filter specification for partial clone or fetch
+   as described in rev-list.
+
 SEE ALSO
 
 linkgit:git-remote[1]
diff --git a/remote-curl.c b/remote-curl.c
index 41e8a42..840f3ce 100644
--- a/remote-curl.c
+++ b/remote-curl.c
@@ -13,6 +13,7 @@
 #include "credential.h"
 #include "sha1-array.h"
 #include "send-pack.h"
+#include "list-objects-filter-options.h"
 
 static struct remote *remote;
 /* always ends with a trailing slash */
@@ -22,6 +23,7 @@ struct options {
int verbosity;
unsigned long depth;
char *deepen_since;
+   char *partial_clone_filter;
struct string_list deepen_not;
struct string_list push_options;
unsigned progress : 1,
@@ -163,11 +165,12 @@ static int set_option(const char *name, const char *value)
} else if (!strcmp(name, "from-promisor")) {
options.from_promisor = 1;
return 0;
-
} else if (!strcmp(name, "no-haves")) {
options.no_haves = 1;
return 0;
-
+   } else if (!strcmp(name, "filter")) {
+   options.partial_clone_filter = xstrdup(value);
+   return 0;
} else {
return 1 /* unsupported */;
}
@@ -837,6 +840,9 @@ static int fetch_git(struct discovery *heads,
argv_array_push(&args, "--from-promisor");
if (options.no_haves)
argv_array_push(&args, "--no-haves");
+   if (options.partial_clone_filter)
+   argv_array_pushf(&args, "--%s=%s",
+CL_ARG__FILTER, options.partial_clone_filter);
argv_array_push(&args, url.buf);
 
for (i = 0; i < nr_heads; i++) {
-- 
2.9.3



[PATCH 8/9] sha1_file: support lazily fetching missing objects

2017-11-02 Thread Jeff Hostetler
From: Jonathan Tan <jonathanta...@google.com>

Teach sha1_file to fetch objects from the remote configured in
extensions.partialcloneremote whenever an object is requested but missing.

The fetching of objects can be suppressed through a global variable.
This is used by fsck and index-pack.

However, by default, such fetching is not suppressed. This is meant as a
temporary measure to ensure that all Git commands work in such a
situation. Future patches will update some commands to either tolerate
missing objects (without fetching them) or be more efficient in fetching
them.
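
As a behavioral sketch (the file name below is made up; the commands and
the fsck behavior are as described in this message):

    # in a partial clone, reading a missing blob triggers a fetch from
    # the promisor remote configured in extensions.partialcloneremote
    git cat-file -p HEAD:big-asset.bin

    # fsck suppresses that fetching and instead tolerates missing
    # objects that the promisor remote has promised
    git fsck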

In order to determine the necessary code changes in sha1_file.c, I
investigated the following:
 (1) functions in sha1_file.c that take in a hash, without the user
 regarding how the object is stored (loose or packed)
 (2) functions in packfile.c (because I need to check callers that know
 about the loose/packed distinction and operate on both differently,
 and ensure that they can handle the concept of objects that are
 neither loose nor packed)

(1) is handled by the modification to sha1_object_info_extended().

For (2), I looked at for_each_packed_object and others.  For
for_each_packed_object, the callers either already work or are fixed in
this patch:
 - reachable - only to find recent objects
 - builtin/fsck - already knows about missing objects
 - builtin/cat-file - warning message added in this commit

Callers of the other functions do not need to be changed:
 - parse_pack_index
   - http - indirectly from http_get_info_packs
   - find_pack_entry_one
 - this searches a single pack that is provided as an argument; the
   caller already knows (through other means) that the sought object
   is in a specific pack
 - find_sha1_pack
   - fast-import - appears to be an optimization to not store a file if
 it is already in a pack
   - http-walker - to search through a struct alt_base
   - http-push - to search through remote packs
 - has_sha1_pack
   - builtin/fsck - already knows about promisor objects
   - builtin/count-objects - informational purposes only (check if loose
 object is also packed)
   - builtin/prune-packed - check if object to be pruned is packed (if
 not, don't prune it)
   - revision - used to exclude packed objects if requested by user
   - diff - just for optimization

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 builtin/cat-file.c   |   3 +
 builtin/fetch-pack.c |   2 +
 builtin/fsck.c   |   3 +
 builtin/index-pack.c |   6 ++
 builtin/rev-list.c   |  35 +--
 cache.h  |   8 +++
 fetch-object.c   |   3 +
 list-objects.c   |   8 ++-
 object.c |   2 +-
 revision.c   |  32 +-
 revision.h   |   5 +-
 sha1_file.c  |  39 
 t/t0410-partial-clone.sh | 152 +++
 13 files changed, 277 insertions(+), 21 deletions(-)

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index f5fa4fd..ba77b73 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -13,6 +13,7 @@
 #include "tree-walk.h"
 #include "sha1-array.h"
 #include "packfile.h"
+#include "partial-clone-utils.h"
 
 struct batch_options {
int enabled;
@@ -475,6 +476,8 @@ static int batch_objects(struct batch_options *opt)
 
for_each_loose_object(batch_loose_object, , 0);
for_each_packed_object(batch_packed_object, , 0);
+   if (is_partial_clone_registered())
+   warning("This repository has partial clone enabled. 
Some objects may not be loaded.");
 
cb.opt = opt;
cb.expand = 
diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c
index 9f303cf..9a7ebf6 100644
--- a/builtin/fetch-pack.c
+++ b/builtin/fetch-pack.c
@@ -53,6 +53,8 @@ int cmd_fetch_pack(int argc, const char **argv, const char 
*prefix)
struct oid_array shallow = OID_ARRAY_INIT;
struct string_list deepen_not = STRING_LIST_INIT_DUP;
 
+   fetch_if_missing = 0;
+
packet_trace_identity("fetch-pack");
 
memset(&args, 0, sizeof(args));
diff --git a/builtin/fsck.c b/builtin/fsck.c
index 578a7c8..3b76c0e 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -678,6 +678,9 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
int i;
struct alternate_object_database *alt;
 
+   /* fsck knows how to handle missing promisor objects */
+   fetch_if_missing = 0;
+
errors_found = 0;
check_replace_refs = 0;
 
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 24c2f05..a0a35e6 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1657,6 +1657,12 @@ int cmd_index_pack(int argc, const char **argv, const 
char *prefix)
unsigned foreign_nr = 1;  

[PATCH 1/9] extension.partialclone: introduce partial clone extension

2017-11-02 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Introduce the ability to have missing objects in a repo.  This
functionality is guarded by new repository extension options:
`extensions.partialcloneremote` and
`extensions.partialclonefilter`.

See the update to Documentation/technical/repository-version.txt
in this patch for more information.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 Documentation/technical/repository-version.txt | 22 
 Makefile   |  1 +
 cache.h|  4 ++
 config.h   |  3 +
 environment.c  |  2 +
 partial-clone-utils.c  | 78 ++
 partial-clone-utils.h  | 34 +++
 setup.c| 15 +
 8 files changed, 159 insertions(+)
 create mode 100644 partial-clone-utils.c
 create mode 100644 partial-clone-utils.h

diff --git a/Documentation/technical/repository-version.txt 
b/Documentation/technical/repository-version.txt
index 00ad379..9d488db 100644
--- a/Documentation/technical/repository-version.txt
+++ b/Documentation/technical/repository-version.txt
@@ -86,3 +86,25 @@ for testing format-1 compatibility.
 When the config key `extensions.preciousObjects` is set to `true`,
 objects in the repository MUST NOT be deleted (e.g., by `git-prune` or
 `git repack -d`).
+
+`partialcloneremote`
+
+
+When the config key `extensions.partialcloneremote` is set, it indicates
+that the repo was created with a partial clone (or later performed
+a partial fetch) and that the remote may have omitted sending
+certain unwanted objects.  Such a remote is called a "promisor remote"
+and it promises that all such omitted objects can be fetched from it
+in the future.
+
+The value of this key is the name of the promisor remote.
+
+`partialclonefilter`
+
+
+When the config key `extensions.partialclonefilter` is set, it gives
+the initial filter expression used to create the partial clone.
This value becomes the default filter expression for subsequent
+fetches (called "partial fetches") from the promisor remote.  This
+value may also be set by the first explicit partial fetch following a
+normal clone.
diff --git a/Makefile b/Makefile
index ca378a4..12d141a 100644
--- a/Makefile
+++ b/Makefile
@@ -838,6 +838,7 @@ LIB_OBJS += pack-write.o
 LIB_OBJS += pager.o
 LIB_OBJS += parse-options.o
 LIB_OBJS += parse-options-cb.o
+LIB_OBJS += partial-clone-utils.o
 LIB_OBJS += patch-delta.o
 LIB_OBJS += patch-ids.o
 LIB_OBJS += path.o
diff --git a/cache.h b/cache.h
index 6440e2b..4b785c0 100644
--- a/cache.h
+++ b/cache.h
@@ -860,12 +860,16 @@ extern int grafts_replace_parents;
 #define GIT_REPO_VERSION 0
 #define GIT_REPO_VERSION_READ 1
 extern int repository_format_precious_objects;
+extern char *repository_format_partial_clone_remote;
+extern char *repository_format_partial_clone_filter;
 
 struct repository_format {
int version;
int precious_objects;
int is_bare;
char *work_tree;
+   char *partial_clone_remote; /* value of extensions.partialcloneremote */
+   char *partial_clone_filter; /* value of extensions.partialclonefilter */
struct string_list unknown_extensions;
 };
 
diff --git a/config.h b/config.h
index a49d264..90544ef 100644
--- a/config.h
+++ b/config.h
@@ -34,6 +34,9 @@ struct config_options {
const char *git_dir;
 };
 
+#define KEY_PARTIALCLONEREMOTE "partialcloneremote"
+#define KEY_PARTIALCLONEFILTER "partialclonefilter"
+
 typedef int (*config_fn_t)(const char *, const char *, void *);
 extern int git_default_config(const char *, const char *, void *);
 extern int git_config_from_file(config_fn_t fn, const char *, void *);
diff --git a/environment.c b/environment.c
index 8289c25..2fcf9bb 100644
--- a/environment.c
+++ b/environment.c
@@ -27,6 +27,8 @@ int warn_ambiguous_refs = 1;
 int warn_on_object_refname_ambiguity = 1;
 int ref_paranoia = -1;
 int repository_format_precious_objects;
+char *repository_format_partial_clone_remote;
+char *repository_format_partial_clone_filter;
 const char *git_commit_encoding;
 const char *git_log_output_encoding;
 const char *apply_default_whitespace;
diff --git a/partial-clone-utils.c b/partial-clone-utils.c
new file mode 100644
index 000..32cc20d
--- /dev/null
+++ b/partial-clone-utils.c
@@ -0,0 +1,78 @@
+#include "cache.h"
+#include "config.h"
+#include "partial-clone-utils.h"
+
+int is_partial_clone_registered(void)
+{
+   if (repository_format_partial_clone_remote ||
+   repository_format_partial_clone_filter)
+   return 1;
+
+   return 0;
+}
+
+void partial_clone_utils_register(
+   const struct list_objects_
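
For illustration only (not part of the patch): with the two extensions
described above, a partially cloned repository is registered via config
entries along these lines ("origin" and "blob:none" are placeholder
values; the t0410 tests later in this series set the same keys by hand).
Once either key is set, is_partial_clone_registered() returns true.

	git -C repo config core.repositoryformatversion 1
	git -C repo config extensions.partialcloneremote origin
	git -C repo config extensions.partialclonefilter blob:none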

[PATCH 6/9] index-pack: refactor writing of .keep files

2017-11-02 Thread Jeff Hostetler
From: Jonathan Tan <jonathanta...@google.com>

In a subsequent commit, index-pack will be taught to write ".promisor"
files which are similar to the ".keep" files it knows how to write.
Refactor the writing of ".keep" files, so that the implementation of
writing ".promisor" files becomes easier.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 builtin/index-pack.c | 99 
 1 file changed, 53 insertions(+), 46 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 8ec459f..4f305a7 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1389,15 +1389,58 @@ static void fix_unresolved_deltas(struct sha1file *f)
free(sorted_by_pos);
 }
 
+static const char *derive_filename(const char *pack_name, const char *suffix,
+  struct strbuf *buf)
+{
+   size_t len;
+   if (!strip_suffix(pack_name, ".pack", &len))
+   die(_("packfile name '%s' does not end with '.pack'"),
+   pack_name);
+   strbuf_add(buf, pack_name, len);
+   strbuf_addch(buf, '.');
+   strbuf_addstr(buf, suffix);
+   return buf->buf;
+}
+
+static void write_special_file(const char *suffix, const char *msg,
+  const char *pack_name, const unsigned char *sha1,
+  const char **report)
+{
+   struct strbuf name_buf = STRBUF_INIT;
+   const char *filename;
+   int fd;
+   int msg_len = strlen(msg);
+
+   if (pack_name)
+   filename = derive_filename(pack_name, suffix, &name_buf);
+   else
+   filename = odb_pack_name(&name_buf, sha1, suffix);
+
+   fd = odb_pack_keep(filename);
+   if (fd < 0) {
+   if (errno != EEXIST)
+   die_errno(_("cannot write %s file '%s'"),
+ suffix, filename);
+   } else {
+   if (msg_len > 0) {
+   write_or_die(fd, msg, msg_len);
+   write_or_die(fd, "\n", 1);
+   }
+   if (close(fd) != 0)
+   die_errno(_("cannot close written %s file '%s'"),
+ suffix, filename);
+   *report = suffix;
+   }
+   strbuf_release(&name_buf);
+}
+
 static void final(const char *final_pack_name, const char *curr_pack_name,
  const char *final_index_name, const char *curr_index_name,
- const char *keep_name, const char *keep_msg,
- unsigned char *sha1)
+ const char *keep_msg, unsigned char *sha1)
 {
const char *report = "pack";
struct strbuf pack_name = STRBUF_INIT;
struct strbuf index_name = STRBUF_INIT;
-   struct strbuf keep_name_buf = STRBUF_INIT;
int err;
 
if (!from_stdin) {
@@ -1409,28 +1452,9 @@ static void final(const char *final_pack_name, const 
char *curr_pack_name,
die_errno(_("error while closing pack file"));
}
 
-   if (keep_msg) {
-   int keep_fd, keep_msg_len = strlen(keep_msg);
-
-   if (!keep_name)
-   keep_name = odb_pack_name(&keep_name_buf, sha1, "keep");
-
-   keep_fd = odb_pack_keep(keep_name);
-   if (keep_fd < 0) {
-   if (errno != EEXIST)
-   die_errno(_("cannot write keep file '%s'"),
- keep_name);
-   } else {
-   if (keep_msg_len > 0) {
-   write_or_die(keep_fd, keep_msg, keep_msg_len);
-   write_or_die(keep_fd, "\n", 1);
-   }
-   if (close(keep_fd) != 0)
-   die_errno(_("cannot close written keep file 
'%s'"),
- keep_name);
-   report = "keep";
-   }
-   }
+   if (keep_msg)
+   write_special_file("keep", keep_msg, final_pack_name, sha1,
+  &report);
 
if (final_pack_name != curr_pack_name) {
if (!final_pack_name)
@@ -1472,7 +1496,6 @@ static void final(const char *final_pack_name, const char 
*curr_pack_name,
 
strbuf_release(&pack_name);
strbuf_release(&index_name);
-   strbuf_release(&keep_name_buf);
 }
 
 static int git_index_pack_config(const char *k, const char *v, void *cb)
@@ -1615,26 +1638,13 @@ static void show_pack_info(int stat_only)
}
 }
 
-static const char *derive_filename(const char *pack_name, const char *suffix,
-  struct strbuf *buf)
-{
- 
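
For illustration only: the point of the refactor is that a later patch in
this series (7/9) can write a ".promisor" file the same way it writes a
".keep" file.  A sketch of that reuse, calling the helper exactly as
defined above (promisor_msg is a variable that later patch introduces;
with the code as-is a report pointer must still be supplied):

	if (promisor_msg)
		write_special_file("promisor", promisor_msg,
				   final_pack_name, sha1, &report);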

[PATCH 3/9] fsck: support refs pointing to promisor objects

2017-11-02 Thread Jeff Hostetler
From: Jonathan Tan <jonathanta...@google.com>

Teach fsck to not treat refs referring to missing promisor objects as an
error when extensions.partialclone is set.

For the purposes of warning about no default refs, such refs are still
treated as legitimate refs.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 builtin/fsck.c   |  8 
 t/t0410-partial-clone.sh | 24 
 2 files changed, 32 insertions(+)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 2934299..ee937bb 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -434,6 +434,14 @@ static int fsck_handle_ref(const char *refname, const 
struct object_id *oid,
 
obj = parse_object(oid);
if (!obj) {
+   if (is_promisor_object(oid)) {
+   /*
+* Increment default_refs anyway, because this is a
+* valid ref.
+*/
+default_refs++;
+return 0;
+   }
error("%s: invalid sha1 pointer %s", refname, oid_to_hex(oid));
errors_found |= ERROR_REACHABLE;
/* We'll continue with the rest despite the error.. */
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 52347fb..5a03ead 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -13,6 +13,14 @@ pack_as_from_promisor () {
>repo/.git/objects/pack/pack-$HASH.promisor
 }
 
+promise_and_delete () {
+   HASH=$(git -C repo rev-parse "$1") &&
+   git -C repo tag -a -m message my_annotated_tag "$HASH" &&
+   git -C repo rev-parse my_annotated_tag | pack_as_from_promisor &&
+   git -C repo tag -d my_annotated_tag &&
+   delete_object repo "$HASH"
+}
+
 test_expect_success 'missing reflog object, but promised by a commit, passes 
fsck' '
test_create_repo repo &&
test_commit -C repo my_commit &&
@@ -78,4 +86,20 @@ test_expect_success 'missing reflog object alone fails fsck, 
even with extension
test_must_fail git -C repo fsck
 '
 
+test_expect_success 'missing ref object, but promised, passes fsck' '
+   rm -rf repo &&
+   test_create_repo repo &&
+   test_commit -C repo my_commit &&
+
+   A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
+
+   # Reference $A only from ref
+   git -C repo branch my_branch "$A" &&
+   promise_and_delete "$A" &&
+
+   git -C repo config core.repositoryformatversion 1 &&
+   git -C repo config extensions.partialcloneremote "arbitrary string" &&
+   git -C repo fsck
+'
+
 test_done
-- 
2.9.3



[PATCH 2/9] fsck: introduce partialclone extension

2017-11-02 Thread Jeff Hostetler
From: Jonathan Tan <jonathanta...@google.com>

Currently, Git does not handle well repos with very large numbers of objects
or repos that wish to minimize manipulation of certain blobs (for
example, because they are very large), even if the user
operates mostly on part of the repo, because Git is designed on the
assumption that every referenced object is available somewhere in the
repo storage. In such an arrangement, the full set of objects is usually
available in remote storage, ready to be lazily downloaded.

Introduce the ability to have missing objects in a repo.  This
functionality is guarded behind a new repository extension option
`extensions.partialcloneremote`.
See Documentation/technical/repository-version.txt for more information.

Teach fsck about the new state of affairs. In this commit, teach fsck
that missing promisor objects referenced from the reflog are not an
error case; in future commits, fsck will be taught about other cases.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 builtin/fsck.c   |  2 +-
 cache.h  |  3 +-
 packfile.c   | 78 --
 packfile.h   | 13 
 setup.c  |  3 --
 t/t0410-partial-clone.sh | 81 
 6 files changed, 172 insertions(+), 8 deletions(-)
 create mode 100755 t/t0410-partial-clone.sh

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 56afe40..2934299 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -398,7 +398,7 @@ static void fsck_handle_reflog_oid(const char *refname, 
struct object_id *oid,
xstrfmt("%s@{%"PRItime"}", refname, 
timestamp));
obj->flags |= USED;
mark_object_reachable(obj);
-   } else {
+   } else if (!is_promisor_object(oid)) {
error("%s: invalid reflog entry %s", refname, 
oid_to_hex(oid));
errors_found |= ERROR_REACHABLE;
}
diff --git a/cache.h b/cache.h
index 4b785c0..5f84103 100644
--- a/cache.h
+++ b/cache.h
@@ -1589,7 +1589,8 @@ extern struct packed_git {
unsigned pack_local:1,
 pack_keep:1,
 freshened:1,
-do_not_close:1;
+do_not_close:1,
+pack_promisor:1;
unsigned char sha1[20];
struct revindex_entry *revindex;
/* something like ".git/objects/pack/x.pack" */
diff --git a/packfile.c b/packfile.c
index 4a5fe7a..b015a54 100644
--- a/packfile.c
+++ b/packfile.c
@@ -8,6 +8,12 @@
 #include "list.h"
 #include "streaming.h"
 #include "sha1-lookup.h"
+#include "commit.h"
+#include "object.h"
+#include "tag.h"
+#include "tree-walk.h"
+#include "tree.h"
+#include "partial-clone-utils.h"
 
 char *odb_pack_name(struct strbuf *buf,
const unsigned char *sha1,
@@ -643,10 +649,10 @@ struct packed_git *add_packed_git(const char *path, 
size_t path_len, int local)
return NULL;
 
/*
-* ".pack" is long enough to hold any suffix we're adding (and
+* ".promisor" is long enough to hold any suffix we're adding (and
 * the use xsnprintf double-checks that)
 */
-   alloc = st_add3(path_len, strlen(".pack"), 1);
+   alloc = st_add3(path_len, strlen(".promisor"), 1);
p = alloc_packed_git(alloc);
memcpy(p->pack_name, path, path_len);
 
@@ -654,6 +660,10 @@ struct packed_git *add_packed_git(const char *path, size_t 
path_len, int local)
if (!access(p->pack_name, F_OK))
p->pack_keep = 1;
 
+   xsnprintf(p->pack_name + path_len, alloc - path_len, ".promisor");
+   if (!access(p->pack_name, F_OK))
+   p->pack_promisor = 1;
+
xsnprintf(p->pack_name + path_len, alloc - path_len, ".pack");
if (stat(p->pack_name, &st) || !S_ISREG(st.st_mode)) {
free(p);
@@ -781,7 +791,8 @@ static void prepare_packed_git_one(char *objdir, int local)
if (ends_with(de->d_name, ".idx") ||
ends_with(de->d_name, ".pack") ||
ends_with(de->d_name, ".bitmap") ||
-   ends_with(de->d_name, ".keep"))
+   ends_with(de->d_name, ".keep") ||
+   ends_with(de->d_name, ".promisor"))
string_list_append(&garbage, path.buf);
else
report_garbage(PACKDIR_FILE_GARBAGE, path.buf);
@@ -1889,6 +1900,9 @@ int for_each_packed_object(each_packed_object_fn cb, void 
*
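
For illustration only: the on-disk convention introduced above is simply a
".promisor" file sitting next to the ".pack"/".idx" pair.  A shell sketch
of marking every existing pack as a promisor pack by hand (the t0410
helper pack_as_from_promisor does the same for a single pack, and a later
patch teaches index-pack to write the file itself):

	for idx in repo/.git/objects/pack/pack-*.idx
	do
		>"${idx%.idx}.promisor"
	done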

[PATCH 9/9] gc: do not repack promisor packfiles

2017-11-02 Thread Jeff Hostetler
From: Jonathan Tan <jonathanta...@google.com>

Teach gc to stop traversal at promisor objects, and to leave promisor
packfiles alone. This has the effect of only repacking non-promisor
packfiles, and preserves the distinction between promisor packfiles and
non-promisor packfiles.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 Documentation/git-pack-objects.txt |  4 +++
 builtin/gc.c   |  4 +++
 builtin/pack-objects.c | 14 ++
 builtin/prune.c|  7 +
 builtin/repack.c   | 12 +++--
 t/t0410-partial-clone.sh   | 54 --
 6 files changed, 91 insertions(+), 4 deletions(-)

diff --git a/Documentation/git-pack-objects.txt 
b/Documentation/git-pack-objects.txt
index 6786351..ee462c6 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -246,6 +246,10 @@ So does `git bundle` (see linkgit:git-bundle[1]) when it 
creates a bundle.
Ignore missing objects without error.  This may be used with
or without any of the above filtering.
 
+--exclude-promisor-objects::
+   Silently omit referenced but missing objects from the packfile.
+   This is used with partial clone.
+
 SEE ALSO
 
 linkgit:git-rev-list[1]
diff --git a/builtin/gc.c b/builtin/gc.c
index 3c5eae0..a17806a 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -20,6 +20,7 @@
 #include "argv-array.h"
 #include "commit.h"
 #include "packfile.h"
+#include "partial-clone-utils.h"
 
 #define FAILED_RUN "failed to run %s"
 
@@ -458,6 +459,9 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
argv_array_push(&prune, prune_expire);
if (quiet)
argv_array_push(, "--no-progress");
+   if (is_partial_clone_registered())
+   argv_array_push(&prune,
+   "--exclude-promisor-objects");
if (run_command_v_opt(prune.argv, RUN_GIT_CMD))
return error(FAILED_RUN, prune.argv[0]);
}
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index e16722f..957e459 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -83,6 +83,7 @@ static unsigned long window_memory_limit = 0;
 
 static struct list_objects_filter_options filter_options;
 static int arg_ignore_missing;
+static int arg_exclude_promisor_objects;
 
 /*
  * stats
@@ -2561,6 +2562,11 @@ static void show_object(struct object *obj, const char 
*name, void *data)
if (arg_ignore_missing && !has_object_file(&obj->oid))
return;
 
+   if (arg_exclude_promisor_objects &&
+   !has_object_file(&obj->oid) &&
+   is_promisor_object(&obj->oid))
+   return;
+
add_preferred_base_object(name);
add_object_entry(obj->oid.hash, obj->type, name, 0);
obj->flags |= OBJECT_ADDED;
@@ -2972,6 +2978,8 @@ int cmd_pack_objects(int argc, const char **argv, const 
char *prefix)
OPT_PARSE_LIST_OBJECTS_FILTER(&filter_options),
OPT_BOOL(0, "filter-ignore-missing", &arg_ignore_missing,
 N_("ignore and omit missing objects from packfile")),
+   OPT_BOOL(0, "exclude-promisor-objects", 
&arg_exclude_promisor_objects,
+N_("do not pack objects in promisor packfiles")),
OPT_END(),
};
 
@@ -3017,6 +3025,12 @@ int cmd_pack_objects(int argc, const char **argv, const 
char *prefix)
argv_array_push(, "--unpacked");
}
 
+   if (arg_exclude_promisor_objects) {
+   use_internal_rev_list = 1;
+   fetch_if_missing = 0;
+   argv_array_push(, "--exclude-promisor-objects");
+   }
+
if (!reuse_object)
reuse_delta = 0;
if (pack_compression_level == -1)
diff --git a/builtin/prune.c b/builtin/prune.c
index cddabf2..be34645 100644
--- a/builtin/prune.c
+++ b/builtin/prune.c
@@ -101,12 +101,15 @@ int cmd_prune(int argc, const char **argv, const char 
*prefix)
 {
struct rev_info revs;
struct progress *progress = NULL;
+   int exclude_promisor_objects = 0;
const struct option options[] = {
OPT__DRY_RUN(&show_only, N_("do not remove, show only")),
OPT__VERBOSE(&verbose, N_("report pruned objects")),
OPT_BOOL(0, "progress", &show_progress, N_("show progress")),
OPT_EXPIRY_DATE(0, "expire", &expire,
N_("expire objects older than <time>")),
+   OPT_BOOL(0, "exclude-

[PATCH 5/9] fsck: support promisor objects as CLI argument

2017-11-02 Thread Jeff Hostetler
From: Jonathan Tan <jonathanta...@google.com>

Teach fsck to not treat missing promisor objects provided on the CLI as
an error when extensions.partialcloneremote is set.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 builtin/fsck.c   |  2 ++
 t/t0410-partial-clone.sh | 13 +
 2 files changed, 15 insertions(+)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 4c2a56d..578a7c8 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -750,6 +750,8 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
struct object *obj = lookup_object(oid.hash);
 
if (!obj || !(obj->flags & HAS_OBJ)) {
+   if (is_promisor_object(&oid))
+   continue;
error("%s: object missing", oid_to_hex(&oid));
errors_found |= ERROR_OBJECT;
continue;
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index b1d404e..002e071 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -125,4 +125,17 @@ test_expect_success 'missing object, but promised, passes 
fsck' '
git -C repo fsck
 '
 
+test_expect_success 'missing CLI object, but promised, passes fsck' '
+   rm -rf repo &&
+   test_create_repo repo &&
+   test_commit -C repo my_commit &&
+
+   A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
+   promise_and_delete "$A" &&
+
+   git -C repo config core.repositoryformatversion 1 &&
+   git -C repo config extensions.partialcloneremote "arbitrary string" &&
+   git -C repo fsck "$A"
+'
+
 test_done
-- 
2.9.3



[PATCH 7/9] introduce fetch-object: fetch one promisor object

2017-11-02 Thread Jeff Hostetler
From: Jonathan Tan <jonathanta...@google.com>

Introduce fetch-object, providing the ability to fetch one object from a
promisor remote.

This uses fetch-pack. To do this, the transport mechanism has been
updated with 2 flags, "from-promisor" to indicate that the resulting
pack comes from a promisor remote (and thus should be annotated as such
by index-pack), and "no-haves" to suppress the sending of "have" lines.

This will be tested in a subsequent commit.

NEEDSWORK: update this when we have more information about protocol v2,
which should allow a way to suppress the ref advertisement and
officially allow any object type to be "want"-ed.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 Documentation/gitremote-helpers.txt |  2 ++
 Makefile|  1 +
 builtin/fetch-pack.c|  8 
 builtin/index-pack.c| 16 +---
 fetch-object.c  | 23 +++
 fetch-object.h  |  6 ++
 fetch-pack.c|  8 ++--
 fetch-pack.h|  2 ++
 remote-curl.c   | 17 -
 transport.c |  8 
 transport.h |  8 
 11 files changed, 93 insertions(+), 6 deletions(-)
 create mode 100644 fetch-object.c
 create mode 100644 fetch-object.h

diff --git a/Documentation/gitremote-helpers.txt 
b/Documentation/gitremote-helpers.txt
index 4a584f3..6da3f41 100644
--- a/Documentation/gitremote-helpers.txt
+++ b/Documentation/gitremote-helpers.txt
@@ -466,6 +466,8 @@ set by Git if the remote helper has the 'option' capability.
Transmit <string> as a push option. As the push option
must not contain LF or NUL characters, the string is not encoded.
 
+TODO document 'option from-promisor' and 'option no-haves' ?
+
 SEE ALSO
 
 linkgit:git-remote[1]
diff --git a/Makefile b/Makefile
index 12d141a..7a0679a 100644
--- a/Makefile
+++ b/Makefile
@@ -792,6 +792,7 @@ LIB_OBJS += ewah/ewah_bitmap.o
 LIB_OBJS += ewah/ewah_io.o
 LIB_OBJS += ewah/ewah_rlw.o
 LIB_OBJS += exec_cmd.o
+LIB_OBJS += fetch-object.o
 LIB_OBJS += fetch-pack.o
 LIB_OBJS += fsck.o
 LIB_OBJS += gettext.o
diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c
index 366b9d1..9f303cf 100644
--- a/builtin/fetch-pack.c
+++ b/builtin/fetch-pack.c
@@ -143,6 +143,14 @@ int cmd_fetch_pack(int argc, const char **argv, const char 
*prefix)
args.update_shallow = 1;
continue;
}
+   if (!strcmp("--from-promisor", arg)) {
+   args.from_promisor = 1;
+   continue;
+   }
+   if (!strcmp("--no-haves", arg)) {
+   args.no_haves = 1;
+   continue;
+   }
usage(fetch_pack_usage);
}
if (deepen_not.nr)
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 4f305a7..24c2f05 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1429,14 +1429,16 @@ static void write_special_file(const char *suffix, 
const char *msg,
if (close(fd) != 0)
die_errno(_("cannot close written %s file '%s'"),
  suffix, filename);
-   *report = suffix;
+   if (report)
+   *report = suffix;
}
strbuf_release(_buf);
 }
 
 static void final(const char *final_pack_name, const char *curr_pack_name,
  const char *final_index_name, const char *curr_index_name,
- const char *keep_msg, unsigned char *sha1)
+ const char *keep_msg, const char *promisor_msg,
+ unsigned char *sha1)
 {
const char *report = "pack";
struct strbuf pack_name = STRBUF_INIT;
@@ -1455,6 +1457,9 @@ static void final(const char *final_pack_name, const char 
*curr_pack_name,
if (keep_msg)
write_special_file("keep", keep_msg, final_pack_name, sha1,
   &report);
+   if (promisor_msg)
+   write_special_file("promisor", promisor_msg, final_pack_name,
+  sha1, NULL);
 
if (final_pack_name != curr_pack_name) {
if (!final_pack_name)
@@ -1644,6 +1649,7 @@ int cmd_index_pack(int argc, const char **argv, const 
char *prefix)
const char *curr_index;
const char *index_name = NULL, *pack_name = NULL;
const char *keep_msg = NULL;
+   const char *promisor_msg = NULL;
struct strbuf index_name_buf = STRBUF_INIT;
struct pack_idx_entry **idx_objects;
struct pack_idx_option opts;
@@ -1693,6 +16
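
For illustration only: at the command-line level the two new flags amount
to a fetch-pack request like the one below.  The real request is built
internally by fetch_object(); $PROMISOR_URL and $OID are placeholders, and
asking for a bare OID still depends on the server permitting it (see the
NEEDSWORK note above about protocol v2).

	git fetch-pack --from-promisor --no-haves "$PROMISOR_URL" "$OID"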

[PATCH 4/9] fsck: support referenced promisor objects

2017-11-02 Thread Jeff Hostetler
From: Jonathan Tan <jonathanta...@google.com>

Teach fsck to not treat missing promisor objects indirectly pointed to
by refs as an error when extensions.partialcloneremote is set.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 builtin/fsck.c   | 11 +++
 t/t0410-partial-clone.sh | 23 +++
 2 files changed, 34 insertions(+)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index ee937bb..4c2a56d 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -149,6 +149,15 @@ static int mark_object(struct object *obj, int type, void 
*data, struct fsck_opt
if (obj->flags & REACHABLE)
return 0;
obj->flags |= REACHABLE;
+
+   if (is_promisor_object(&obj->oid))
+   /*
+* Further recursion does not need to be performed on this
+* object since it is a promisor object (so it does not need to
+* be added to "pending").
+*/
+   return 0;
+
if (!(obj->flags & HAS_OBJ)) {
if (parent && !has_object_file(&obj->oid)) {
printf("broken link from %7s %s\n",
@@ -208,6 +217,8 @@ static void check_reachable_object(struct object *obj)
 * do a full fsck
 */
if (!(obj->flags & HAS_OBJ)) {
+   if (is_promisor_object(&obj->oid))
+   return;
if (has_sha1_pack(obj->oid.hash))
return; /* it is in pack - forget about it */
printf("missing %s %s\n", printable_type(obj),
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 5a03ead..b1d404e 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -102,4 +102,27 @@ test_expect_success 'missing ref object, but promised, 
passes fsck' '
git -C repo fsck
 '
 
+test_expect_success 'missing object, but promised, passes fsck' '
+   rm -rf repo &&
+   test_create_repo repo &&
+   test_commit -C repo 1 &&
+   test_commit -C repo 2 &&
+   test_commit -C repo 3 &&
+   git -C repo tag -a annotated_tag -m "annotated tag" &&
+
+   C=$(git -C repo rev-parse 1) &&
+   T=$(git -C repo rev-parse 2^{tree}) &&
+   B=$(git hash-object repo/3.t) &&
+   AT=$(git -C repo rev-parse annotated_tag) &&
+
+   promise_and_delete "$C" &&
+   promise_and_delete "$T" &&
+   promise_and_delete "$B" &&
+   promise_and_delete "$AT" &&
+
+   git -C repo config core.repositoryformatversion 1 &&
+   git -C repo config extensions.partialcloneremote "arbitrary string" &&
+   git -C repo fsck
+'
+
 test_done
-- 
2.9.3



[PATCH 0/9] WIP Partial clone part 2: fsck and promisors

2017-11-02 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

This is part 2 of a proposed 3 part sequence for partial clone.
Part 2 assumes part 1 [1] is in place.

Part 2 is concerned with fsck, gc, initial support for dynamic
object fetching, and tracking promisor objects.  Jonathan Tan
originally developed this code.  I have moved it on top of [1].

[1] 
https://public-inbox.org/git/20171102124445.fbffd43521cd35f6a71e1...@google.com/T/
[2] https://public-inbox.org/git/cover.1506714999.git.jonathanta...@google.com/


Jeff Hostetler (1):
  extension.partialclone: introduce partial clone extension

Jonathan Tan (8):
  fsck: introduce partialclone extension
  fsck: support refs pointing to promisor objects
  fsck: support referenced promisor objects
  fsck: support promisor objects as CLI argument
  index-pack: refactor writing of .keep files
  introduce fetch-object: fetch one promisor object
  sha1_file: support lazily fetching missing objects
  gc: do not repack promisor packfiles

 Documentation/git-pack-objects.txt |   4 +
 Documentation/gitremote-helpers.txt|   2 +
 Documentation/technical/repository-version.txt |  22 ++
 Makefile   |   2 +
 builtin/cat-file.c |   3 +
 builtin/fetch-pack.c   |  10 +
 builtin/fsck.c |  26 +-
 builtin/gc.c   |   4 +
 builtin/index-pack.c   | 113 
 builtin/pack-objects.c |  14 +
 builtin/prune.c|   7 +
 builtin/repack.c   |  12 +-
 builtin/rev-list.c |  35 ++-
 cache.h|  15 +-
 config.h   |   3 +
 environment.c  |   2 +
 fetch-object.c |  26 ++
 fetch-object.h |   6 +
 fetch-pack.c   |   8 +-
 fetch-pack.h   |   2 +
 list-objects.c |   8 +-
 object.c   |   2 +-
 packfile.c |  78 +-
 packfile.h |  13 +
 partial-clone-utils.c  |  78 ++
 partial-clone-utils.h  |  34 +++
 remote-curl.c  |  17 +-
 revision.c |  32 ++-
 revision.h |   5 +-
 setup.c|  12 +
 sha1_file.c|  39 ++-
 t/t0410-partial-clone.sh   | 343 +
 transport.c|   8 +
 transport.h|   8 +
 34 files changed, 917 insertions(+), 76 deletions(-)
 create mode 100644 fetch-object.c
 create mode 100644 fetch-object.h
 create mode 100644 partial-clone-utils.c
 create mode 100644 partial-clone-utils.h
 create mode 100755 t/t0410-partial-clone.sh

-- 
2.9.3



[PATCH v2 4/6] list-objects: filter objects in traverse_commit_list

2017-11-02 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Create traverse_commit_list_filtered() and add filtering
interface to allow certain objects to be omitted from the
traversal.

Update traverse_commit_list() to be a wrapper for the above
with a null filter to minimize the number of callers that
needed to be changed.

Object filtering will be used in a future commit by rev-list
and pack-objects for partial clone and fetch to omit unwanted
objects from the result.

traverse_bitmap_commit_list() does not work with filtering.

If a packfile bitmap is present, it will not be used.

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 Makefile  |   2 +
 list-objects-filter-options.c | 119 
 list-objects-filter-options.h |  55 ++
 list-objects-filter.c | 408 ++
 list-objects-filter.h |  84 +
 list-objects.c|  95 --
 list-objects.h|   2 +-
 7 files changed, 748 insertions(+), 17 deletions(-)
 create mode 100644 list-objects-filter-options.c
 create mode 100644 list-objects-filter-options.h
 create mode 100644 list-objects-filter.c
 create mode 100644 list-objects-filter.h

diff --git a/Makefile b/Makefile
index cd75985..ca378a4 100644
--- a/Makefile
+++ b/Makefile
@@ -807,6 +807,8 @@ LIB_OBJS += levenshtein.o
 LIB_OBJS += line-log.o
 LIB_OBJS += line-range.o
 LIB_OBJS += list-objects.o
+LIB_OBJS += list-objects-filter.o
+LIB_OBJS += list-objects-filter-options.o
 LIB_OBJS += ll-merge.o
 LIB_OBJS += lockfile.o
 LIB_OBJS += log-tree.o
diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
new file mode 100644
index 000..31255e7
--- /dev/null
+++ b/list-objects-filter-options.c
@@ -0,0 +1,119 @@
+#include "cache.h"
+#include "commit.h"
+#include "config.h"
+#include "revision.h"
+#include "argv-array.h"
+#include "list-objects.h"
+#include "list-objects-filter.h"
+#include "list-objects-filter-options.h"
+
+/*
+ * Parse value of the argument to the "filter" keyword.
+ * On the command line this looks like:
+ *   --filter=<arg>
+ * and in the pack protocol as:
+ *   "filter" SP <arg>
+ *
+ * <arg> ::= blob:none
+ *   blob:limit=<n>[kmg]
+ *   sparse:oid=<blob-ish>
+ *   sparse:path=<path>
+ */
+int parse_list_objects_filter(struct list_objects_filter_options 
*filter_options,
+ const char *arg)
+{
+   struct object_context oc;
+   struct object_id sparse_oid;
+   const char *v0;
+   const char *v1;
+
+   if (filter_options->choice)
+   die(_("multiple object filter types cannot be combined"));
+
+   /*
+* TODO consider rejecting 'arg' if it contains any
+* TODO injection characters (since we might send this
+* TODO to a sub-command or to the server and we don't
+* TODO want to deal with legacy quoting/escaping for
+* TODO a new feature).
+*/
+
+   filter_options->raw_value = strdup(arg);
+
+   if (skip_prefix(arg, "blob:", ) || skip_prefix(arg, "blobs:", )) {
+   if (!strcmp(v0, "none")) {
+   filter_options->choice = LOFC_BLOB_NONE;
+   return 0;
+   }
+
+   if (skip_prefix(v0, "limit=", &v1) &&
+   git_parse_ulong(v1, &filter_options->blob_limit_value)) {
+   filter_options->choice = LOFC_BLOB_LIMIT;
+   return 0;
+   }
+   }
+   else if (skip_prefix(arg, "sparse:", &v0)) {
+   if (skip_prefix(v0, "oid=", &v1)) {
+   filter_options->choice = LOFC_SPARSE_OID;
+   if (!get_oid_with_context(v1, GET_OID_BLOB,
+ &sparse_oid, &oc)) {
+   /*
+* We successfully converted the <blob-ish>
+* into an actual OID.  Rewrite the raw_value
+* in canonical form with just the OID.
+* (If we send this request to the server, we
+* want an absolute expression rather than a
+* local-ref-relative expression.)
+*/
+   free((char *)filter_options->raw_value);
+   filter_options->raw_value =
+   xstrfmt("sparse:oid=%s",
+   oid_to_hex(&sparse_oid));
+   filter_options->sparse_oid_value =
+   oiddup(&sparse_oid);
+   } else {
+  
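
For illustration only: a caller that wants filtering switches from
traverse_commit_list() to the new entry point; pack-objects in patch 6/6
of this series passes exactly these arguments, where the last two are the
show_object callback payload and an optional oidset that receives the
omitted objects.

	if (filter_options.choice)
		traverse_commit_list_filtered(&filter_options, &revs,
					      show_commit, show_object,
					      NULL, NULL);
	else
		traverse_commit_list(&revs, show_commit, show_object, NULL);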

[PATCH v2 5/6] rev-list: add list-objects filtering support

2017-11-02 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Teach rev-list to use the filtering provided by the
traverse_commit_list_filtered() interface to omit
unwanted objects from the result.  This feature is
intended to help with partial clone.

Object filtering is only allowed when one of the "--objects*"
options are used.

When the "--filter-print-omitted" option is used, the omitted
objects are printed at the end.  These are marked with a "~".
This option can be combined with "--quiet" to get a list of
just the omitted objects.

Normally, rev-list will stop with an error when there are
missing objects.

When the "--filter-print-missing" option is used, rev-list
will print a list of any missing objects that should have
been included in the output (rather than stopping).
These are marked with a "?".

When the "--filter-ignore-missing" option is used, rev-list
will silently ignore any missing objects and continue without
error.

Add t6112 test.

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 Documentation/git-rev-list.txt  |   6 +-
 Documentation/rev-list-options.txt  |  34 ++
 builtin/rev-list.c  |  75 +++-
 t/t6112-rev-list-filters-objects.sh | 225 
 4 files changed, 337 insertions(+), 3 deletions(-)
 create mode 100755 t/t6112-rev-list-filters-objects.sh

diff --git a/Documentation/git-rev-list.txt b/Documentation/git-rev-list.txt
index ef22f17..b8a3a5b 100644
--- a/Documentation/git-rev-list.txt
+++ b/Documentation/git-rev-list.txt
@@ -47,7 +47,11 @@ SYNOPSIS
 [ --fixed-strings | -F ]
 [ --date=<date>]
 [ [ --objects | --objects-edge | --objects-edge-aggressive ]
-  [ --unpacked ] ]
+  [ --unpacked ]
+  [ --filter=<filter-spec> ] ]
+[ --filter-print-missing ]
+[ --filter-print-omitted ]
+[ --filter-ignore-missing ]
 [ --pretty | --header ]
 [ --bisect ]
 [ --bisect-vars ]
diff --git a/Documentation/rev-list-options.txt 
b/Documentation/rev-list-options.txt
index 13501e1..9233134 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -706,6 +706,40 @@ ifdef::git-rev-list[]
 --unpacked::
Only useful with `--objects`; print the object IDs that are not
in packs.
+
+--filter=<filter-spec>::
+   Only useful with one of the `--objects*`; omits objects (usually
+   blobs) from the list of printed objects.  The '<filter-spec>'
+   may be one of the following:
++
+The form '--filter=blob:none' omits all blobs.
++
+The form '--filter=blob:limit=<n>[kmg]' omits blobs larger than n bytes
+or units.  The value may be zero.  Special files matching '.git*' are
+always included, regardless of size.
++
+The form '--filter=sparse:oid=<blob-ish>' uses a sparse-checkout
+specification contained in the object (or the object that the expression
+evaluates to) to omit blobs not required by the corresponding sparse
+checkout.
++
+The form '--filter=sparse:path=<path>' similarly uses a sparse-checkout
+specification contained in <path>.
+
+--filter-print-missing::
+   Prints a list of the missing objects for the requested traversal.
+   Object IDs are prefixed with a ``?'' character.  The object type
+   is printed after the ID.  This may be used with or without any of
+   the above filtering options.
+
+--filter-ignore-missing::
+   Ignores missing objects encountered during the requested traversal.
+   This may be used with or without any of the above filtering options.
+
+--filter-print-omitted::
+   Only useful with one of the above `--filter*`; prints a list
+   of the omitted objects.  Object IDs are prefixed with a ``~''
+   character.
 endif::git-rev-list[]
 
 --no-walk[=(sorted|unsorted)]::
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index c1c74d4..cc9fa40 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -4,6 +4,8 @@
 #include "diff.h"
 #include "revision.h"
 #include "list-objects.h"
+#include "list-objects-filter.h"
+#include "list-objects-filter-options.h"
 #include "pack.h"
 #include "pack-bitmap.h"
 #include "builtin.h"
@@ -12,6 +14,7 @@
 #include "bisect.h"
 #include "progress.h"
 #include "reflog-walk.h"
+#include "oidset.h"
 
 static const char rev_list_usage[] =
 "git rev-list [OPTION] ... [ -- paths... ]\n"
@@ -54,6 +57,15 @@ static const char rev_list_usage[] =
 
 static struct progress *progress;
 static unsigned progress_counter;
+static struct list_objects_filter_options filter_options;
+static struct oidset missing_objects;
+static struct oidset omitted_objects;
+static int arg_print_missing;
+static int arg_print_omitted;
+static int arg_ignore_missing;
+
+#define DEFAULT_OIDSET_SIZE (16*1024)
+
 
 static void finish_commit(
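
For illustration only, a few invocations consistent with the description
above (omitted objects are printed with a "~" prefix, missing ones with a
"?" prefix; the revision argument is a placeholder):

	git rev-list --objects --filter=blob:none HEAD
	git rev-list --objects --filter=blob:none --filter-print-omitted --quiet HEAD
	git rev-list --objects --filter-print-missing HEAD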

[PATCH v2 3/6] oidset: add iterator methods to oidset

2017-11-02 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Add the usual iterator methods to oidset.
Add oidset_remove().

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 oidset.c | 10 ++
 oidset.h | 36 
 2 files changed, 46 insertions(+)

diff --git a/oidset.c b/oidset.c
index f1f874a..454c54f 100644
--- a/oidset.c
+++ b/oidset.c
@@ -24,6 +24,16 @@ int oidset_insert(struct oidset *set, const struct object_id 
*oid)
return 0;
 }
 
+int oidset_remove(struct oidset *set, const struct object_id *oid)
+{
+   struct oidmap_entry *entry;
+
+   entry = oidmap_remove(&set->map, oid);
+   free(entry);
+
+   return (entry != NULL);
+}
+
 void oidset_clear(struct oidset *set)
 {
oidmap_free(&set->map, 1);
diff --git a/oidset.h b/oidset.h
index f4c9e0f..783abce 100644
--- a/oidset.h
+++ b/oidset.h
@@ -24,6 +24,12 @@ struct oidset {
 
 #define OIDSET_INIT { OIDMAP_INIT }
 
+
+static inline void oidset_init(struct oidset *set, size_t initial_size)
+{
+   return oidmap_init(&set->map, initial_size);
+}
+
 /**
  * Returns true iff `set` contains `oid`.
  */
@@ -39,9 +45,39 @@ int oidset_contains(const struct oidset *set, const struct 
object_id *oid);
 int oidset_insert(struct oidset *set, const struct object_id *oid);
 
 /**
+ * Remove the oid from the set.
+ *
+ * Returns 1 if the oid was present in the set, 0 otherwise.
+ */
+int oidset_remove(struct oidset *set, const struct object_id *oid);
+
+/**
  * Remove all entries from the oidset, freeing any resources associated with
  * it.
  */
 void oidset_clear(struct oidset *set);
 
+struct oidset_iter {
+   struct oidmap_iter m_iter;
+};
+
+static inline void oidset_iter_init(struct oidset *set,
+   struct oidset_iter *iter)
+{
+   oidmap_iter_init(&set->map, &iter->m_iter);
+}
+
+static inline struct object_id *oidset_iter_next(struct oidset_iter *iter)
+{
+   struct oidmap_entry *e = oidmap_iter_next(&iter->m_iter);
+   return e ? &e->oid : NULL;
+}
+
+static inline struct object_id *oidset_iter_first(struct oidset *set,
+ struct oidset_iter *iter)
+{
+   oidset_iter_init(set, iter);
+   return oidset_iter_next(iter);
+}
+
 #endif /* OIDSET_H */
-- 
2.9.3
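
For illustration only, a sketch of the new iteration API (some_oid stands
for any struct object_id pointer the caller already has):

	struct oidset set = OIDSET_INIT;
	struct oidset_iter iter;
	struct object_id *oid;

	oidset_insert(&set, some_oid);
	for (oid = oidset_iter_first(&set, &iter);
	     oid;
	     oid = oidset_iter_next(&iter))
		printf("%s\n", oid_to_hex(oid));
	oidset_clear(&set);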



[PATCH v2 6/6] pack-objects: add list-objects filtering

2017-11-02 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Teach pack-objects to use the filtering provided by the
traverse_commit_list_filtered() interface to omit unwanted
objects from the resulting packfile.

This feature is intended for partial clone/fetch.

Filtering requires the use of the "--stdout" option.

Add t5317 test.

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 Documentation/git-pack-objects.txt |  12 +-
 builtin/pack-objects.c |  28 ++-
 t/t5317-pack-objects-filter-objects.sh | 369 +
 3 files changed, 407 insertions(+), 2 deletions(-)
 create mode 100755 t/t5317-pack-objects-filter-objects.sh

diff --git a/Documentation/git-pack-objects.txt 
b/Documentation/git-pack-objects.txt
index 473a161..6786351 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -12,7 +12,8 @@ SYNOPSIS
 'git pack-objects' [-q | --progress | --all-progress] [--all-progress-implied]
[--no-reuse-delta] [--delta-base-offset] [--non-empty]
[--local] [--incremental] [--window=<n>] [--depth=<n>]
-   [--revs [--unpacked | --all]] [--stdout | base-name]
+   [--revs [--unpacked | --all]]
+   [--stdout [--filter=<filter-spec>] | base-name]
[--shallow] [--keep-true-parents] < object-list
 
 
@@ -236,6 +237,15 @@ So does `git bundle` (see linkgit:git-bundle[1]) when it 
creates a bundle.
With this option, parents that are hidden by grafts are packed
nevertheless.
 
+--filter=<filter-spec>::
+   Requires `--stdout`.  Omits certain objects (usually blobs) from
+   the resulting packfile.  See linkgit:git-rev-list[1] for valid
+   `<filter-spec>` forms.
+
+--filter-ignore-missing::
+   Ignore missing objects without error.  This may be used with
+   or without any of the above filtering.
+
 SEE ALSO
 
 linkgit:git-rev-list[1]
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 6e77dfd..e16722f 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -15,6 +15,8 @@
 #include "diff.h"
 #include "revision.h"
 #include "list-objects.h"
+#include "list-objects-filter.h"
+#include "list-objects-filter-options.h"
 #include "pack-objects.h"
 #include "progress.h"
 #include "refs.h"
@@ -79,6 +81,9 @@ static unsigned long cache_max_small_delta_size = 1000;
 
 static unsigned long window_memory_limit = 0;
 
+static struct list_objects_filter_options filter_options;
+static int arg_ignore_missing;
+
 /*
  * stats
  */
@@ -2547,6 +2552,15 @@ static void show_commit(struct commit *commit, void 
*data)
 
 static void show_object(struct object *obj, const char *name, void *data)
 {
+   /*
+* Quietly ignore missing objects when they are expected.  This
+* avoids staging them and getting an odd error later.  If we are
+* not expecting them, stage it and let the normal error handling
+* deal with it.
+*/
+   if (arg_ignore_missing && !has_object_file(&obj->oid))
+   return;
+
add_preferred_base_object(name);
add_object_entry(obj->oid.hash, obj->type, name, 0);
obj->flags |= OBJECT_ADDED;
@@ -2816,7 +2830,10 @@ static void get_object_list(int ac, const char **av)
if (prepare_revision_walk())
die("revision walk setup failed");
mark_edges_uninteresting(, show_edge);
-   traverse_commit_list(&revs, show_commit, show_object, NULL);
+
+   traverse_commit_list_filtered(&filter_options, &revs,
+ show_commit, show_object, NULL,
+ NULL);
 
if (unpack_unreachable_expiration) {
revs.ignore_missing_links = 1;
@@ -2952,6 +2969,9 @@ int cmd_pack_objects(int argc, const char **argv, const 
char *prefix)
 N_("use a bitmap index if available to speed up 
counting objects")),
OPT_BOOL(0, "write-bitmap-index", _bitmap_index,
 N_("write a bitmap index together with the pack 
index")),
+   OPT_PARSE_LIST_OBJECTS_FILTER(&filter_options),
+   OPT_BOOL(0, "filter-ignore-missing", &arg_ignore_missing,
+N_("ignore and omit missing objects from packfile")),
OPT_END(),
};
 
@@ -3028,6 +3048,12 @@ int cmd_pack_objects(int argc, const char **argv, const 
char *prefix)
if (!rev_list_all || !rev_list_reflog || !rev_list_index)
unpack_unreachable_expiration = 0;
 
+   if (filter_options.choice) {
+   if (!pack_to_stdout)
+   die("cannot use filtering with an indexable pack.");
+   use_bitmap_index = 0;
+   }
+
/*
 * "soft" reasons not to use bitmaps - for on-disk repack by default we 
want
 *
diff --git a/t/t5317
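
For illustration only, a minimal invocation consistent with the synopsis
above (--filter requires --stdout; the output file name is a placeholder):

	git rev-parse HEAD |
	git pack-objects --revs --stdout --filter=blob:none >partial.pack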

[PATCH v2 1/6] dir: allow exclusions from blob in addition to file

2017-11-02 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Refactor add_excludes() to separate the reading of the
exclude file into a buffer and the parsing of the buffer
into exclude_list items.

Add add_excludes_from_blob_to_list() to allow an exclude
file be specified with an OID without assuming a local
worktree or index exists.

Refactor read_skip_worktree_file_from_index() and add
do_read_blob() to eliminate duplication of preliminary
processing of blob contents.

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 dir.c | 132 ++
 dir.h |   3 ++
 2 files changed, 104 insertions(+), 31 deletions(-)

diff --git a/dir.c b/dir.c
index 1d17b80..1962374 100644
--- a/dir.c
+++ b/dir.c
@@ -220,6 +220,57 @@ int within_depth(const char *name, int namelen,
return 1;
 }
 
+/*
+ * Read the contents of the blob with the given OID into a buffer.
+ * Append a trailing LF to the end if the last line doesn't have one.
+ *
+ * Returns:
+ *-1 when the OID is invalid or unknown or does not refer to a blob.
+ * 0 when the blob is empty.
+ * 1 along with { data, size } of the (possibly augmented) buffer
+ *   when successful.
+ *
+ * Optionally updates the given sha1_stat with the given OID (when valid).
+ */
+static int do_read_blob(const struct object_id *oid,
+   struct sha1_stat *sha1_stat,
+   size_t *size_out,
+   char **data_out)
+{
+   enum object_type type;
+   unsigned long sz;
+   char *data;
+
+   *size_out = 0;
+   *data_out = NULL;
+
+   data = read_sha1_file(oid->hash, &type, &sz);
+   if (!data || type != OBJ_BLOB) {
+   free(data);
+   return -1;
+   }
+
+   if (sha1_stat) {
+   memset(&sha1_stat->stat, 0, sizeof(sha1_stat->stat));
+   hashcpy(sha1_stat->sha1, oid->hash);
+   }
+
+   if (sz == 0) {
+   free(data);
+   return 0;
+   }
+
+   if (data[sz - 1] != '\n') {
+   data = xrealloc(data, st_add(sz, 1));
+   data[sz++] = '\n';
+   }
+
+   *size_out = xsize_t(sz);
+   *data_out = data;
+
+   return 1;
+}
+
 #define DO_MATCH_EXCLUDE   (1<<0)
 #define DO_MATCH_DIRECTORY (1<<1)
 #define DO_MATCH_SUBMODULE (1<<2)
@@ -600,32 +651,22 @@ void add_exclude(const char *string, const char *base,
x->el = el;
 }
 
-static void *read_skip_worktree_file_from_index(const struct index_state 
*istate,
-   const char *path, size_t *size,
-   struct sha1_stat *sha1_stat)
+static int read_skip_worktree_file_from_index(const struct index_state *istate,
+ const char *path,
+ size_t *size_out,
+ char **data_out,
+ struct sha1_stat *sha1_stat)
 {
int pos, len;
-   unsigned long sz;
-   enum object_type type;
-   void *data;
 
len = strlen(path);
pos = index_name_pos(istate, path, len);
if (pos < 0)
-   return NULL;
+   return -1;
if (!ce_skip_worktree(istate->cache[pos]))
-   return NULL;
-   data = read_sha1_file(istate->cache[pos]->oid.hash, &type, &sz);
-   if (!data || type != OBJ_BLOB) {
-   free(data);
-   return NULL;
-   }
-   *size = xsize_t(sz);
-   if (sha1_stat) {
-   memset(&sha1_stat->stat, 0, sizeof(sha1_stat->stat));
-   hashcpy(sha1_stat->sha1, istate->cache[pos]->oid.hash);
-   }
-   return data;
+   return -1;
+
+   return do_read_blob(&istate->cache[pos]->oid, sha1_stat, size_out, 
data_out);
 }
 
 /*
@@ -739,6 +780,10 @@ static void invalidate_directory(struct untracked_cache 
*uc,
dir->dirs[i]->recurse = 0;
 }
 
+static int add_excludes_from_buffer(char *buf, size_t size,
+   const char *base, int baselen,
+   struct exclude_list *el);
+
 /*
  * Given a file with name "fname", read it (either from disk, or from
  * an index if 'istate' is non-null), parse it and store the
@@ -754,9 +799,10 @@ static int add_excludes(const char *fname, const char 
*base, int baselen,
struct sha1_stat *sha1_stat)
 {
struct stat st;
-   int fd, i, lineno = 1;
+   int r;
+   int fd;
size_t size = 0;
-   char *buf, *entry;
+   char *buf;
 
fd = open(fname, O_RDONLY);
if (fd < 0 || fstat(fd, &st) < 0) {
@@ -764,17 +810,13 @@ static int add_excludes(const char *fname, const char 
*base, int baselen,
warn_on_fopen_errors(fname);

[PATCH v2 0/6] Partial clone part 1: object filtering

2017-11-02 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Here is V2 of the list-object filtering. It replaces [1]
and reflects a refactoring and simplification of the original.

After much discussion on the "list-object-filter-map" I've replaced
it with a regular oidset -- the only need for the map was to store
the first observed pathname for each blob, but that itself was of
questionable value.

I've extended oidmap and oidset to have iterators.  These 2 commits
could be pulled out and applied on their own, but for now I need
them here.

There were also several comments on the layout of the filtering
API and the layout of the filter source code.  I've restructured
the filtering routines to put them in the same source file, and
made them all static.  These are now hidden behind a "factory-like"
function with a vtable.  This greatly simplifies the code in
traverse_commit_list_filtered().

I've added "--filter-ignore-missing" parameter to rev-list and
pack-objects to ignore missing objects rather than error out.
This allows this patch series to better stand on its own and eliminates
the need in part 1 for "patch 9" from V1.

This is a brute-force approach that ignores all missing objects.  Later, in part
2 or part 3 when --exclude-promisor-objects is introduced, we will
be able to ignore EXPECTED missing objects.

Finally, patch 1 in this series is the same as [2], which is currently
cooking in next.

[1] https://public-inbox.org/git/20171024185332.57261-1-...@jeffhostetler.com/

[2] * jh/dir-add-exclude-from-blob (2017-10-27) 1 commit
- dir: allow exclusions from blob in addition to file


Jeff Hostetler (6):
  dir: allow exclusions from blob in addition to file
  oidmap: add oidmap iterator methods
  oidset: add iterator methods to oidset
  list-objects: filter objects in traverse_commit_list
  rev-list: add list-objects filtering support
  pack-objects: add list-objects filtering

 Documentation/git-pack-objects.txt |  12 +-
 Documentation/git-rev-list.txt |   6 +-
 Documentation/rev-list-options.txt |  34 +++
 Makefile   |   2 +
 builtin/pack-objects.c |  28 ++-
 builtin/rev-list.c |  75 +-
 dir.c  | 132 ---
 dir.h  |   3 +
 list-objects-filter-options.c  | 119 ++
 list-objects-filter-options.h  |  55 +
 list-objects-filter.c  | 408 +
 list-objects-filter.h  |  84 +++
 list-objects.c |  95 ++--
 list-objects.h |   2 +-
 oidmap.h   |  22 ++
 oidset.c   |  10 +
 oidset.h   |  36 +++
 t/t5317-pack-objects-filter-objects.sh | 369 +
 t/t6112-rev-list-filters-objects.sh| 225 ++
 19 files changed, 1664 insertions(+), 53 deletions(-)
 create mode 100644 list-objects-filter-options.c
 create mode 100644 list-objects-filter-options.h
 create mode 100644 list-objects-filter.c
 create mode 100644 list-objects-filter.h
 create mode 100755 t/t5317-pack-objects-filter-objects.sh
 create mode 100755 t/t6112-rev-list-filters-objects.sh

-- 
2.9.3



[PATCH v2 2/6] oidmap: add oidmap iterator methods

2017-11-02 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Add the usual map iterator functions to oidmap.

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 oidmap.h | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/oidmap.h b/oidmap.h
index 18f54cd..d3cd2bb 100644
--- a/oidmap.h
+++ b/oidmap.h
@@ -65,4 +65,26 @@ extern void *oidmap_put(struct oidmap *map, void *entry);
  */
 extern void *oidmap_remove(struct oidmap *map, const struct object_id *key);
 
+
+struct oidmap_iter {
+   struct hashmap_iter h_iter;
+};
+
+static inline void oidmap_iter_init(struct oidmap *map, struct oidmap_iter 
*iter)
+{
+   hashmap_iter_init(&map->map, &iter->h_iter);
+}
+
+static inline void *oidmap_iter_next(struct oidmap_iter *iter)
+{
+   return hashmap_iter_next(&iter->h_iter);
+}
+
+static inline void *oidmap_iter_first(struct oidmap *map,
+ struct oidmap_iter *iter)
+{
+   oidmap_iter_init(map, iter);
+   return oidmap_iter_next(iter);
+}
+
 #endif
-- 
2.9.3



RE: What's cooking in git.git (Oct 2017, #07; Mon, 30)

2017-10-31 Thread Jeff Hostetler


From: Junio C Hamano [mailto:gits...@pobox.com] 
Subject: Re: What's cooking in git.git (Oct 2017, #07; Mon, 30)

> Jeff Hostetler <g...@jeffhostetler.com> writes:
> 
>> I've been assuming that the jt/partial-clone-lazy-fetch is a 
>> placeholder for our next combined patch series.
>
> Yes, that, together with the expectation that I will hear from both you and 
> JTan 
> once the result of combined effort becomes ready to replace this placeholder, 
> matches my assumption.
> 
> Is that happening now?

Yes, I'm merging them now and hope to have a version to
send to Jonathan and/or the list sometime this week.

Jeff



Re: What's cooking in git.git (Oct 2017, #07; Mon, 30)

2017-10-30 Thread Jeff Hostetler



On 10/30/2017 1:31 PM, Johannes Schindelin wrote:

Hi Junio,

On Mon, 30 Oct 2017, Junio C Hamano wrote:


* jt/partial-clone-lazy-fetch (2017-10-02) 18 commits
  - fetch-pack: restore save_commit_buffer after use
  - unpack-trees: batch fetching of missing blobs
  - clone: configure blobmaxbytes in created repos
  - clone: support excluding large blobs
  - fetch: support excluding large blobs
  - fetch: refactor calculation of remote list
  - fetch-pack: support excluding large blobs
  - pack-objects: support --blob-max-bytes
  - pack-objects: rename want_.* to ignore_.*
  - gc: do not repack promisor packfiles
  - rev-list: support termination at promisor objects
  - sha1_file: support lazily fetching missing objects
  - introduce fetch-object: fetch one promisor object
  - index-pack: refactor writing of .keep files
  - fsck: support promisor objects as CLI argument
  - fsck: support referenced promisor objects
  - fsck: support refs pointing to promisor objects
  - fsck: introduce partialclone extension

  A journey for "git clone" and "git fetch" to become "lazier" by
  depending more on its remote repository---this is the beginning of
  it.

  Expecting a reroll.
  cf. 


It was my understanding that Jeff's heavy-lifting produced a shorter,
initial patch series with parts of this, that was already reviewed
internally by Jonathan.

Am I mistaken?

Ciao,
Dscho



Right.  I posted a "part 1" of this last week and am currently
rerolling that.  I should also have a followup "part 2" patch
series shortly.

https://public-inbox.org/git/20171024185332.57261-1-...@jeffhostetler.com/

I've been assuming that the jt/partial-clone-lazy-fetch is a
placeholder for our next combined patch series.

Jeff


Re: [PATCH] dir: allow exclusions from blob in addition to file

2017-10-27 Thread Jeff Hostetler



On 10/26/2017 9:20 PM, Junio C Hamano wrote:

Jeff Hostetler <g...@jeffhostetler.com> writes:


From: Jeff Hostetler <jeffh...@microsoft.com>

Refactor add_excludes() to separate the reading of the
exclude file into a buffer and the parsing of the buffer
into exclude_list items.

Add add_excludes_from_blob_to_list() to allow an exclude
file be specified with an OID without assuming a local
worktree or index exists.

Refactor read_skip_worktree_file_from_index() and add
do_read_blob() to eliminate duplication of preliminary
processing of blob contents.

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---


Yeah, with a separate do_read_blob() helper, this one looks a lot
easier to follow, at least to me---as the author, you might find the
earlier one just as easy, I suspect, though ;-)

Thanks.  Will queue.



Yeah, I think the net result is better and easier to follow.
Thanks,
Jeff


Re: [PATCH 01/13] dir: allow exclusions from blob in addition to file

2017-10-26 Thread Jeff Hostetler



On 10/25/2017 11:47 PM, Junio C Hamano wrote:

Jeff Hostetler <g...@jeffhostetler.com> writes:


The existing code handles use cases where you want to read the
exclusion list from a pathname in the worktree -- or from blob
named in the index when the pathname is not populated (presumably
because of the skip-worktree bit).

I was wanting to add a more general case (and perhaps my commit
message should be improved).  I want to be able to read it from
a blob not necessarily associated with the current commit or
not necessarily available on the local client, but yet known to
exist.


Oh, I understand the above two paragraphs perfectly well, and I
agree with you that such a helper to read from an arbitrary blob is
a worthy thing to have.  I was merely commenting on the fact that
such a helper that is meant to be able to handle more general cases
is not used to help the more specific case that we already have,
which was a bit curious.

I guess the reason why it is not done is (besides expediency)
because the model the new helper operates in would not fit well with
the existing logic flow, where everything is loaded into core
(either from the filesystem or from a blob) and then a common code
parses and registers; the helper wants to do the reading (only) from
the blob, the parsing and the registration all by itself, so there
is not much that can be shared even if the existing code wanted to
reuse what the helper offers.

The new helper mimics the read_skip_worktree_file_from_index()
codepath to massage the data it reads from the blob to buf[] but not
really (e.g. even though it copies and pastes a lot, it forgets to
call skip_utf8_bom(), for example).  We may still want to see if we
can share more so that we do not have to worry about these tiny
differences between codepaths.


I'm going to extract this commit, refactor it to try to share
more code with the existing read_skip_worktree_file_from_index()
and submit it as a separate patch series so that we can discuss
it in isolation without the rest of the partial-clone code getting
in the way.

The call to skip_utf8_bom() was captured in the new
add_excludes_from_buffer() routine that both add_excludes()
and my new add_excludes_from_blob_to_list() call.




With my "add_excludes_from_blob_to_list()", we can request a
blob-ish expression, such as "master:enlistments/foo".  In my
later commits associated with clone and fetch, we can use this
mechanism to let the client ask the server to filter using the
blob associated with this blob-ish.  If the client has the blob
(such as during a later fetch) and can resolve it, then it can
and send the server the OID, but it can also send the blob-ish
to the server and let it resolve it.


Security-minded people may want to keep an eye or two open for these
later patches---extended SHA-1 expressions is a new attack surface
we would want to carefully polish and protect.
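
For readers following the thread, the two sparse forms under discussion
look like this at the rev-list level (a sketch only; the blob-ish
"master:enlistments/foo" and the path are placeholder values):

	git rev-list --objects --filter=sparse:oid=master:enlistments/foo HEAD
	git rev-list --objects --filter=sparse:path=.git/info/sparse-checkout HEAD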



[PATCH] dir: allow exclusions from blob in addition to file

2017-10-26 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

I pulled commit 01/13 from Tuesday's partial clone part 1 patch series [1]
and refactored the changes in dir.c to try to address Junio's comments in [2]
WRT sharing more code with the existing read_skip_worktree_file_from_index().

This patch can be discussed independently of the partial clone series.

[1] https://public-inbox.org/git/20171024185332.57261-2-...@jeffhostetler.com/
[2] https://public-inbox.org/git/xmqqpo9afu3s@gitster.mtv.corp.google.com/

Jeff Hostetler (1):
  dir: allow exclusions from blob in addition to file

 dir.c | 132 ++
 dir.h |   3 ++
 2 files changed, 104 insertions(+), 31 deletions(-)

-- 
2.9.3



[PATCH] dir: allow exclusions from blob in addition to file

2017-10-26 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Refactor add_excludes() to separate the reading of the
exclude file into a buffer and the parsing of the buffer
into exclude_list items.

Add add_excludes_from_blob_to_list() to allow an exclude
file be specified with an OID without assuming a local
worktree or index exists.

Refactor read_skip_worktree_file_from_index() and add
do_read_blob() to eliminate duplication of preliminary
processing of blob contents.

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 dir.c | 132 ++
 dir.h |   3 ++
 2 files changed, 104 insertions(+), 31 deletions(-)

diff --git a/dir.c b/dir.c
index 1d17b80..1962374 100644
--- a/dir.c
+++ b/dir.c
@@ -220,6 +220,57 @@ int within_depth(const char *name, int namelen,
return 1;
 }
 
+/*
+ * Read the contents of the blob with the given OID into a buffer.
+ * Append a trailing LF to the end if the last line doesn't have one.
+ *
+ * Returns:
+ *-1 when the OID is invalid or unknown or does not refer to a blob.
+ * 0 when the blob is empty.
+ * 1 along with { data, size } of the (possibly augmented) buffer
+ *   when successful.
+ *
+ * Optionally updates the given sha1_stat with the given OID (when valid).
+ */
+static int do_read_blob(const struct object_id *oid,
+   struct sha1_stat *sha1_stat,
+   size_t *size_out,
+   char **data_out)
+{
+   enum object_type type;
+   unsigned long sz;
+   char *data;
+
+   *size_out = 0;
+   *data_out = NULL;
+
+   data = read_sha1_file(oid->hash, &type, &sz);
+   if (!data || type != OBJ_BLOB) {
+   free(data);
+   return -1;
+   }
+
+   if (sha1_stat) {
+   memset(&sha1_stat->stat, 0, sizeof(sha1_stat->stat));
+   hashcpy(sha1_stat->sha1, oid->hash);
+   }
+
+   if (sz == 0) {
+   free(data);
+   return 0;
+   }
+
+   if (data[sz - 1] != '\n') {
+   data = xrealloc(data, st_add(sz, 1));
+   data[sz++] = '\n';
+   }
+
+   *size_out = xsize_t(sz);
+   *data_out = data;
+
+   return 1;
+}
+
 #define DO_MATCH_EXCLUDE   (1<<0)
 #define DO_MATCH_DIRECTORY (1<<1)
 #define DO_MATCH_SUBMODULE (1<<2)
@@ -600,32 +651,22 @@ void add_exclude(const char *string, const char *base,
x->el = el;
 }
 
-static void *read_skip_worktree_file_from_index(const struct index_state 
*istate,
-   const char *path, size_t *size,
-   struct sha1_stat *sha1_stat)
+static int read_skip_worktree_file_from_index(const struct index_state *istate,
+ const char *path,
+ size_t *size_out,
+ char **data_out,
+ struct sha1_stat *sha1_stat)
 {
int pos, len;
-   unsigned long sz;
-   enum object_type type;
-   void *data;
 
len = strlen(path);
pos = index_name_pos(istate, path, len);
if (pos < 0)
-   return NULL;
+   return -1;
if (!ce_skip_worktree(istate->cache[pos]))
-   return NULL;
-   data = read_sha1_file(istate->cache[pos]->oid.hash, &type, &sz);
-   if (!data || type != OBJ_BLOB) {
-   free(data);
-   return NULL;
-   }
-   *size = xsize_t(sz);
-   if (sha1_stat) {
-   memset(&sha1_stat->stat, 0, sizeof(sha1_stat->stat));
-   hashcpy(sha1_stat->sha1, istate->cache[pos]->oid.hash);
-   }
-   return data;
+   return -1;
+
+   return do_read_blob(&istate->cache[pos]->oid, sha1_stat, size_out, data_out);
 }
 
 /*
@@ -739,6 +780,10 @@ static void invalidate_directory(struct untracked_cache 
*uc,
dir->dirs[i]->recurse = 0;
 }
 
+static int add_excludes_from_buffer(char *buf, size_t size,
+   const char *base, int baselen,
+   struct exclude_list *el);
+
 /*
  * Given a file with name "fname", read it (either from disk, or from
  * an index if 'istate' is non-null), parse it and store the
@@ -754,9 +799,10 @@ static int add_excludes(const char *fname, const char 
*base, int baselen,
struct sha1_stat *sha1_stat)
 {
struct stat st;
-   int fd, i, lineno = 1;
+   int r;
+   int fd;
size_t size = 0;
-   char *buf, *entry;
+   char *buf;
 
fd = open(fname, O_RDONLY);
if (fd < 0 || fstat(fd, &st) < 0) {
@@ -764,17 +810,13 @@ static int add_excludes(const char *fname, const char 
*base, int baselen,
warn_on_fopen_errors(fname);

Re: [PATCH 10/13] rev-list: add list-objects filtering support

2017-10-25 Thread Jeff Hostetler



On 10/25/2017 12:41 AM, Jonathan Tan wrote:

On Tue, Oct 24, 2017 at 11:53 AM, Jeff Hostetler <g...@jeffhostetler.com> wrote:

  static void finish_object(struct object *obj, const char *name, void *cb_data)
  {
 struct rev_list_info *info = cb_data;
-   if (obj->type == OBJ_BLOB && !has_object_file(&obj->oid))
+   if (obj->type == OBJ_BLOB && !has_object_file(&obj->oid)) {
+   if (arg_print_missing) {
+   list_objects_filter_map_insert(
+   &missing_objects, &obj->oid, name, obj->type);
+   return;
+   }
+
+   /*
+* Relax consistency checks when we expect missing
+* objects because of partial-clone or a previous
+* partial-fetch.
+*
+* Note that this is independent of any filtering that
+* we are doing in this run.
+*/
+   if (is_partial_clone_registered())
+   return;
+
 die("missing blob object '%s'", oid_to_hex(>oid));


I'm fine with arg_print_missing suppressing lazy fetching (when I
rebase my patches on this, I'll have to ensure that fetch_if_missing
is set to 0 if arg_print_missing is true), but I think that the
behavior when arg_print_missing is false should be the opposite - we
should let has_object_file() perform the lazy fetching, and die if it
returns false (that is, if the fetching failed).


Right. This is a point where our different approaches need
to come together.  My "is_partial_clone_registered" is essentially
a placeholder for your lazy fetching.  so we can delete this call
when your changes are in.  Basically, you set:
fetch_if_missing = !arg_print_missing
at the top.




+   }
 if (info->revs->verify_objects && !obj->parsed && obj->type != 
OBJ_COMMIT)
 parse_object(&obj->oid);
  }


Re: [PATCH 08/13] list-objects: add traverse_commit_list_filtered method

2017-10-25 Thread Jeff Hostetler



On 10/25/2017 12:24 AM, Jonathan Tan wrote:

On Tue, Oct 24, 2017 at 11:53 AM, Jeff Hostetler <g...@jeffhostetler.com> wrote:

+void traverse_commit_list_filtered(
+   struct list_objects_filter_options *filter_options,
+   struct rev_info *revs,
+   show_commit_fn show_commit,
+   show_object_fn show_object,
+   list_objects_filter_map_foreach_cb print_omitted_object,
+   void *show_data);


So the function call chain, if we wanted a filtered traversal, is:
traverse_commit_list_filtered -> traverse_commit_list__sparse_path
(and friends, and each algorithm is in its own file) ->
traverse_commit_list_worker

This makes the implementation of each algorithm more easily understood
(since they are all in their own files), but also increases the number
of global functions and code files. I personally would combine the
traverse_commit_list__* functions into one file
(list-objects-filtered.c), make them static, and also put
traverse_commit_list_filtered in there, but I understand that other
people in the Git project may differ on this.



I'll do a round of refactoring to include your suggestion of
a default null filter.  Then with that see what collapsing this
looks like.

Thanks,
Jeff


Re: [PATCH 07/13] list-objects-filter-options: common argument parsing

2017-10-25 Thread Jeff Hostetler



On 10/25/2017 12:14 AM, Jonathan Tan wrote:

On Tue, Oct 24, 2017 at 11:53 AM, Jeff Hostetler <g...@jeffhostetler.com> wrote:

+ *  <filter-spec> ::= blob:none
+ *                    blob:limit:<n>[kmg]
+ *                    sparse:oid:<oid-ish>
+ *                    sparse:path:<path>


I notice in the code below that there are some usages of "=" instead
of ":" - could you clarify which one it is? (Ideally this would point
to one point of documentation which serves as both user and technical
documentation.)


good catch.  thanks!
 

+ */
+int parse_list_objects_filter(struct list_objects_filter_options 
*filter_options,
+ const char *arg)
+{
+   struct object_context oc;
+   struct object_id sparse_oid;
+   const char *v0;
+   const char *v1;
+
+   if (filter_options->choice)
+   die(_("multiple object filter types cannot be combined"));
+
+   /*
+* TODO consider rejecting 'arg' if it contains any
+* TODO injection characters (since we might send this
+* TODO to a sub-command or to the server and we don't
+* TODO want to deal with legacy quoting/escaping for
+* TODO a new feature).
+*/
+
+   filter_options->raw_value = strdup(arg);
+
+   if (skip_prefix(arg, "blob:", ) || skip_prefix(arg, "blobs:", )) {


I know that some people prefer leniency, but I think it's better to
standardize on one form ("blob" instead of both "blob" and "blobs").


I could go either way on this.  (I kept mistyping it during interactive testing,
so I added both cases...)




+   if (!strcmp(v0, "none")) {
+   filter_options->choice = LOFC_BLOB_NONE;
+   return 0;
+   }
+
+   if (skip_prefix(v0, "limit=", ) &&
+   git_parse_ulong(v1, _options->blob_limit_value)) {
+   filter_options->choice = LOFC_BLOB_LIMIT;
+   return 0;
+   }
+   }
+   else if (skip_prefix(arg, "sparse:", )) {
+   if (skip_prefix(v0, "oid=", )) {
+   filter_options->choice = LOFC_SPARSE_OID;
+   if (!get_oid_with_context(v1, GET_OID_BLOB,
+ _oid, )) {
+   /*
+* We successfully converted the <oid-ish>
+* into an actual OID.  Rewrite the raw_value
+* in canonical form with just the OID.
+* (If we send this request to the server, we
+* want an absolute expression rather than a
+* local-ref-relative expression.)
+*/
+   free((char *)filter_options->raw_value);
+   filter_options->raw_value =
+   xstrfmt("sparse:oid=%s",
+   oid_to_hex(&sparse_oid));
+   filter_options->sparse_oid_value =
+   oiddup(&sparse_oid);
+   } else {
+   /*
+* We could not turn the <oid-ish> into an
+* OID.  Leave the raw_value as is in case
+* the server can parse it.  (It may refer to
+* a branch, commit, or blob we don't have.)
+*/
+   }
+   return 0;
+   }
+
+   if (skip_prefix(v0, "path=", )) {
+   filter_options->choice = LOFC_SPARSE_PATH;
+   filter_options->sparse_path_value = strdup(v1);
+   return 0;
+   }
+   }
+
+   die(_("invalid filter expression '%s'"), arg);
+   return 0;
+}
+
+int opt_parse_list_objects_filter(const struct option *opt,
+ const char *arg, int unset)
+{
+   struct list_objects_filter_options *filter_options = opt->value;
+
+   assert(arg);
+   assert(!unset);
+
+   return parse_list_objects_filter(filter_options, arg);
+}
diff --git a/list-objects-filter-options.h b/list-objects-filter-options.h
new file mode 100644
index 000..23bd68e
--- /dev/null
+++ b/list-objects-filter-options.h
@@ -0,0 +1,50 @@
+#ifndef LIST_OBJECTS_FILTER_OPTIONS_H
+#define LIST_OBJECTS_FILTER_OPTIONS_H
+
+#include "parse-options.h"
+
+/*
+ * Common declarations and utilities for filtering objects (such as omitting
+ * large blobs) in list_objects:traverse_commit_list() and git-rev-list.
+ */
+
+enum list_objects_filter_choice {
+   LOF

Re: [PATCH 03/13] list-objects: filter objects in traverse_commit_list

2017-10-25 Thread Jeff Hostetler



On 10/25/2017 12:05 AM, Jonathan Tan wrote:

On Tue, Oct 24, 2017 at 11:53 AM, Jeff Hostetler <g...@jeffhostetler.com> wrote:


+enum list_objects_filter_result {
+   LOFR_ZERO  = 0,
+   LOFR_MARK_SEEN = 1<<0,


Probably worth documenting, something like /* Mark this object so that
it is skipped for the rest of the traversal. */


+   LOFR_SHOW  = 1<<1,


And something like /* Invoke show_object_fn on this object. This
object may be revisited unless LOFR_MARK_SEEN is also set. */


+};
+
+/* See object.h and revision.h */
+#define FILTER_REVISIT (1<<25)


I think this should be declared closer to its use - in the sparse
filter code or in the file that uses it. Wherever it is, also update
the chart in object.h to indicate that we're using this 25th bit.


+
+enum list_objects_filter_type {
+   LOFT_BEGIN_TREE,
+   LOFT_END_TREE,
+   LOFT_BLOB
+};
+
+typedef enum list_objects_filter_result list_objects_filter_result;
+typedef enum list_objects_filter_type list_objects_filter_type;


I don't think we typedef enums in Git code.


+
+typedef list_objects_filter_result (*filter_object_fn)(
+   list_objects_filter_type filter_type,
+   struct object *obj,
+   const char *pathname,
+   const char *filename,
+   void *filter_data);
+
+void traverse_commit_list_worker(
+   struct rev_info *,
+   show_commit_fn, show_object_fn, void *show_data,
+   filter_object_fn filter, void *filter_data);


I think things would be much clearer if a default filter was declared
(matching the behavior as of this patch when filter == NULL), say:
static inline default_filter(args) { switch(filter_type) { case
LOFT_BEGIN_TREE: return LOFR_MARK_SEEN | LOFR_SHOW; case
LOFT_END_TREE: return LOFR_ZERO; ...

And inline traverse_commit_list() instead of putting it in the .c file.

This would reduce or eliminate the need to document
traverse_commit_list_worker, including what happens if filter is NULL,
and explain how a user would make their own filter_object_fn.


+
+#endif /* LIST_OBJECTS_H */
--
2.9.3



I'll give that a try.  Thanks!

Jeff


Re: [PATCH 02/13] list-objects-filter-map: extend oidmap to collect omitted objects

2017-10-25 Thread Jeff Hostetler



On 10/25/2017 3:10 AM, Junio C Hamano wrote:

Jeff Hostetler <g...@jeffhostetler.com> writes:


From: Jeff Hostetler <jeffh...@microsoft.com>

Create helper class to extend oidmap to collect a list of
omitted or missing objects during traversal.


The reason why oidmap itself cannot be used is because the code
wants to record not just the object name but something else about
the object.  And attributes that the code may care about we can see
in this patch are the object type and the path it found.


I recently simplified the code in this version to not completely
sub-class oidmap, but to just use it along with a custom
_insert method that takes care of allocating the _entry
data.  I should update the commit message to reflect that.



Is the plan to extend this set of attributes over time as different
"omitter"s are added?  Why was "path" chosen as a member of the
initial set and how it will be useful (also, what path would we
record for tags and commits)?


I envisioned this to let rev-list print the pathname of omitted
objects -- like "rev-list --objects" does for regular blobs.
I would leave the pathname NULL for tags and commits.

The pathname helps with debugging and testing, but also is
used by the sparse filter to avoid some expensive duplicate
is-excluded lookups.

Currently the 3 filters I have defined all use the same extra
data.  I suppose a future filter could want additional fields,
so maybe it would be better to refactor my "map-entry" to be
per-filter specific.



These "future plans" needs revealed upfront, instead of (or in
addition to) "will be used in a later commit".  As it is hard to
judge if "filter map" is an appropriate name for this thing without
knowing _how_ it is envisioned to be used.  "filter map" sounds more
like a map function that is consulted when we decide if we want to
drop the object, but from the looks of the code, it is used more to
record what was done to these objects.


Sorry, I meant a later commit in this patch series.  It is used by
commits 4, 5, 6, and 10 to actually do the filtering and collect a
list of omitted or missing objects.



Is it really a "map" (i.e. whose primary focus is to find out what
an object name is "mapped to" when we get an object name---e.g. we
notice an otherwise connected object is missing, and consult this
"map" to learn what the type/path is because we want to do X)?  Or
is it more like a "set of known-to-be-missing object" (i.e. whose
primary point is to serve as a set of object names and what a name
maps to is primarily for debugging)?  These are easier to answer if
we know how it will be used.


I think of a "set" as a member? or not-member? class.
I think of a "map" as a member? or not-member? class but where each
member also has a value.  Sometimes map lookups just want to know
membership and sometimes the lookup wants the value.

Granted, having the key and value data stuffed into the same entry
(from hashmap's point of view, rather than a key having a pointer
to a value) does kind of blur the line, but I was thinking about
a map here.  (And I was building on oidmap which builds on hashmap,
so it seemed appropriate.)




This will be used in a later commit by the list-object filtering
code.

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
diff --git a/list-objects-filter-map.c b/list-objects-filter-map.c
new file mode 100644
index 000..7e496b3
--- /dev/null
+++ b/list-objects-filter-map.c
@@ -0,0 +1,63 @@
+#include "cache.h"
+#include "list-objects-filter-map.h"
+
+int list_objects_filter_map_insert(struct oidmap *map,
+  const struct object_id *oid,
+  const char *pathname, enum object_type type)
+{
+   size_t len, size;
+   struct list_objects_filter_map_entry *e;
+
+   if (oidmap_get(map, oid))
+   return 1;


It is OK for the existing entry to record a path that is totally
different from what the caller has.  It is hard to judge without
knowing what pathname the callers are expected to call this function
with, but I am guessing that it is similar to the path shown in the
output from "rev-list --objects"---and if that is the case, it is
correct that the same object may be reached at different paths
depending on what tree the traversal begins at, so pathname recorded
in the map is merely "there is one tree somewhere that has this
object at this path".


Right, the first observed pathname is as good as any.



For that matter, the caller may have a completely different type
from the object we saw earlier; not checking and flagging it as a
possible error makes me feel somewhat uneasy, but there probably is
little you can do at this layer of the code if you noticed such a
discrepancy so it may be OK to punt.


I

Re: [PATCH 00/13] WIP Partial clone part 1: object filtering

2017-10-25 Thread Jeff Hostetler



On 10/25/2017 2:46 AM, Jonathan Tan wrote:

On Tue, Oct 24, 2017 at 10:00 PM, Junio C Hamano <gits...@pobox.com> wrote:

OK, thanks for working well together.  So does this (1) build on
Jonathan's fsck-squelching series, or (2) ignores that and builds
filtering first, potentially leaving the codebase to a broken state
where it can create fsck-unclean repository until Jonathan's series
is rebased on top of this, or (3) something else?  [*1*]


Excluding the partialclone patch (patch 9), I think that the answer is
(2), but I don't think that it leaves the codebase in a broken state.
In particular, none of the code modifies the repo, so it can't create
a fsck-unclean one.


My part 1 series starts with filtering, rev-list, and pack-objects.
So, yes, it adds new features that no one will use yet.  But it is useful
by itself.  For example, you can use rev-list to ask for the set of
missing objects that you would need to do a checkout (assuming you had
commits and trees, but no blobs or no large blobs) *before* actually
starting the checkout.
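
For example (a sketch using the --filter-print-missing option added in
patch 10; the traversal here is purely local):

    # In a repo that has the commits and trees but lacks some blobs,
    # list what a checkout of HEAD would still need; missing objects
    # are printed with a leading "?" followed by their type.
    git rev-list --objects --filter-print-missing HEAD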

I was then going to lay Jonathan's fsck/gc/dynamic object fetch
on top of that.  I started that here:
https://github.com/jeffhostetler/git/pull/7

Patch 9 just adds the "extensions.partialclone*" fields and is prep
for my rev-list and his fsck changes.  I included it sooner rather than
later so I can test rev-list on a repo with hand-deleted blobs.
But yes, it can be omitted from this series and included with the fsck
changes.




Maybe one could say that this leaves the codebase with features that
no one will ever use in the absence of partial clone, but I don't
think that's true - rev-list with blob-size/sparse-specification
filter seems independently useful, at least (for example, when
desiring to operate on a repo in a sparse way without going through a
workdir), and if we're planning to allow listing of objects, we
probably should allow packing as well (especially since this doesn't
add much code).

The above is relevant only if we can exclude the partialclone patch,
but I think that we can and we should, as I wrote in my reply to Jeff
Hostetler [1].

As for how this patch set (excluding the partialclone patch) interacts
with my fsck series, they are relatively independent, as far as I can
tell. I'll rebase my fsck, gc, and lazy object fetch patches (but not
the fetch and clone parts, which we plan to instead adapt from Jeff
Hostetler's patches, as far as I know) on top of these and resend
those out once discussion on this has settled.


Yes, I want to get Jonathan's fsck/gc/lazy changes built into part 2.
They came over easily and are independent of how/why there are missing
objects.

For part 3, I'd like to take my version of clone, fetch, fetch-pack,
and upload-pack (that talks with the filters from part 1) and incorporate
Jonathan's promisor concepts.  That merge is a little messier, so I'd
like to give parts 1 and 2 a chance to get some feedback first.



[1] 
https://public-inbox.org/git/CAGf8dg+8AR=xfsv92odatktnjbnd1+ovzp9rs4y4otz_ezy...@mail.gmail.com/


I also saw a patch marked as "this is from Jonathan's earlier work",
taking the authorship (which to me implies that the changes were
extensive enough), so I am a bit at a loss envisioning how this piece
fits in the bigger picture together with the other piece.


The patch you mentioned is the partialclone patch, which I think can
be considered separately from the rest (as I said above).


A question of mailing-list etiquette: in patch 9, I took Jonathan's
ideas for adding the "extensions.partialclone" setting and extended it
with some helper functions.  His change was part of a larger change
with other code (fsck, IIRC) that I wasn't ready for.  What is the
preferred way to give credit for something like this?


Thanks
Jeff




Re: [PATCH 01/13] dir: allow exclusions from blob in addition to file

2017-10-25 Thread Jeff Hostetler



On 10/25/2017 2:43 AM, Junio C Hamano wrote:

Jeff Hostetler <g...@jeffhostetler.com> writes:


+static int add_excludes_from_buffer(char *buf, size_t size,
+   const char *base, int baselen,
+   struct exclude_list *el);
+
  /*
   * Given a file with name "fname", read it (either from disk, or from
   * an index if 'istate' is non-null), parse it and store the
@@ -754,9 +758,9 @@ static int add_excludes(const char *fname, const char 
*base, int baselen,
struct sha1_stat *sha1_stat)
  {
struct stat st;
-   int fd, i, lineno = 1;
+   int fd;
size_t size = 0;
-   char *buf, *entry;
+   char *buf;
  
  	fd = open(fname, O_RDONLY);

if (fd < 0 || fstat(fd, &st) < 0) {


The post-context of this hunk is quite interesting in that there is
a call to read_skip_worktree_file_from_index(); which essentially
pretends as if we read from the filesystem but in fact it grabs the
blob object name registered in the index and reads from it.

The reason why it is interesting is because this patch adds yet
another "let's instead read from a blob object" function and there is
no sign of making the existing one take advantage of the new function
seen in this patch.



The existing code handles use cases where you want to read the
exclusion list from a pathname in the worktree -- or from blob
named in the index when the pathname is not populated (presumably
because of the skip-worktree bit).

I was wanting to add a more general case (and perhaps my commit
message should be improved).  I want to be able to read it from
a blob not necessarily associated with the current commit or
not necessarily available on the local client, but yet known to
exist.  I'm thinking of the case where the client could ask the server
to do a partial clone using a sparse-checkout specification stored
in a well-known location on the server.  The reason for this is
that, in this case, the client is pre-clone and doesn't have a
worktree or index.

With my "add_excludes_from_blob_to_list()", we can request a
blob-ish expression, such as "master:enlistments/foo".  In my
later commits associated with clone and fetch, we can use this
mechanism to let the client ask the server to filter using the
blob associated with this blob-ish.  If the client has the blob
(such as during a later fetch) and can resolve it, then it can
send the server the OID; but it can also send the blob-ish
to the server and let the server resolve it.

Jeff




[PATCH 10/13] rev-list: add list-objects filtering support

2017-10-24 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Teach rev-list to use the filtering provided by the
traverse_commit_list_filtered() interface to omit
unwanted objects from the result.

This feature is only enabled when one of the "--objects*"
options is used.

Furthermore, when the "--filter-print-omitted" option is
used, the omitted objects are printed at the end.  These
are marked with a "~".  This option can be combined with
"--quiet" to get a list of just the omitted objects.

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 Documentation/git-rev-list.txt |  5 ++-
 Documentation/rev-list-options.txt | 30 ++
 builtin/rev-list.c | 84 +-
 3 files changed, 116 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-rev-list.txt b/Documentation/git-rev-list.txt
index ef22f17..6d2e60d 100644
--- a/Documentation/git-rev-list.txt
+++ b/Documentation/git-rev-list.txt
@@ -47,7 +47,10 @@ SYNOPSIS
 [ --fixed-strings | -F ]
 [ --date=]
 [ [ --objects | --objects-edge | --objects-edge-aggressive ]
-  [ --unpacked ] ]
+  [ --unpacked ]
+  [ --filter=<filter-spec> ] ]
+[ --filter-print-missing ]
+[ --filter-print-omitted ]
 [ --pretty | --header ]
 [ --bisect ]
 [ --bisect-vars ]
diff --git a/Documentation/rev-list-options.txt 
b/Documentation/rev-list-options.txt
index 7d860bf..88f8878 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -706,6 +706,36 @@ ifdef::git-rev-list[]
 --unpacked::
Only useful with `--objects`; print the object IDs that are not
in packs.
+
+--filter=<filter-spec>::
+   Only useful with one of the `--objects*`; omits objects (usually
+   blobs) from the list of printed objects.  The '<filter-spec>'
+   may be one of the following:
++
+The form '--filter=blob:none' omits all blobs.
++
+The form '--filter=blob:limit=<n>[kmg]' omits blobs larger than n bytes
+or units.  The value may be zero.  Special files matching '.git*' are
+always included, regardless of size.
++
+The form '--filter=sparse:oid=<blob-ish>' uses a sparse-checkout
+specification contained in the object (or the object that the expression
+evaluates to) to omit blobs not required by the corresponding sparse
+checkout.
++
+The form '--filter=sparse:path=<path>' similarly uses a sparse-checkout
+specification contained in <path>.
+
+--filter-print-missing::
+   Prints a list of the missing objects for the requested traversal.
+   Object IDs are prefixed with a ``?'' character.  The object type
+   is printed after the ID.  This may be used with or without any of
+   the above filtering options.
+
+--filter-print-omitted::
+   Only useful with one of the above `--filter*`; prints a list
+   of the omitted objects.  Object IDs are prefixed with a ``~''
+   character.
 endif::git-rev-list[]
 
 --no-walk[=(sorted|unsorted)]::
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index c1c74d4..7a0353f 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -12,6 +12,7 @@
 #include "bisect.h"
 #include "progress.h"
 #include "reflog-walk.h"
+#include "partial-clone-utils.h"
 
 static const char rev_list_usage[] =
 "git rev-list [OPTION] ... [ -- paths... ]\n"
@@ -54,6 +55,11 @@ static const char rev_list_usage[] =
 
 static struct progress *progress;
 static unsigned progress_counter;
+static struct list_objects_filter_options filter_options;
+static struct oidmap missing_objects;
+static int arg_print_missing;
+static int arg_print_omitted;
+#define DEFAULT_MAP_SIZE (16*1024)
 
 static void finish_commit(struct commit *commit, void *data);
 static void show_commit(struct commit *commit, void *data)
@@ -181,8 +187,26 @@ static void finish_commit(struct commit *commit, void 
*data)
 static void finish_object(struct object *obj, const char *name, void *cb_data)
 {
struct rev_list_info *info = cb_data;
-   if (obj->type == OBJ_BLOB && !has_object_file(&obj->oid))
+   if (obj->type == OBJ_BLOB && !has_object_file(&obj->oid)) {
+   if (arg_print_missing) {
+   list_objects_filter_map_insert(
+   &missing_objects, &obj->oid, name, obj->type);
+   return;
+   }
+
+   /*
+* Relax consistency checks when we expect missing
+* objects because of partial-clone or a previous
+* partial-fetch.
+*
+* Note that this is independent of any filtering that
+* we are doing in this run.
+*/
+   if (is_partial_clone_registered())
+   return;
+
die("missing blob object '%s'", oid_to_hex(>oid));
+   }
  

[PATCH 12/13] pack-objects: add list-objects filtering

2017-10-24 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Teach pack-objects to use the filtering provided by the
traverse_commit_list_filtered() interface to omit unwanted
objects from the resulting packfile.

This feature is intended for partial clone/fetch.

Filtering requires the use of the "--stdout" option.
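
For example, a filtered pack could be produced like this (revisions are
fed on stdin via `--revs`; the output file name here is arbitrary):

    echo HEAD |
    git pack-objects --revs --stdout --filter=blob:none >commits-and-trees.pack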

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 Documentation/git-pack-objects.txt |  8 +++-
 builtin/pack-objects.c | 18 +-
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-pack-objects.txt 
b/Documentation/git-pack-objects.txt
index 473a161..8b4a223 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -12,7 +12,8 @@ SYNOPSIS
 'git pack-objects' [-q | --progress | --all-progress] [--all-progress-implied]
[--no-reuse-delta] [--delta-base-offset] [--non-empty]
[--local] [--incremental] [--window=] [--depth=]
-   [--revs [--unpacked | --all]] [--stdout | base-name]
+   [--revs [--unpacked | --all]]
+   [--stdout [--filter=<filter-spec>] | base-name]
[--shallow] [--keep-true-parents] < object-list
 
 
@@ -236,6 +237,11 @@ So does `git bundle` (see linkgit:git-bundle[1]) when it 
creates a bundle.
With this option, parents that are hidden by grafts are packed
nevertheless.
 
+--filter=<filter-spec>::
+   Requires `--stdout`.  Omits certain objects (usually blobs) from
+   the resulting packfile.  See linkgit:git-rev-list[1] for valid
+   `<filter-spec>` forms.
+
 SEE ALSO
 
 linkgit:git-rev-list[1]
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 6e77dfd..a251850 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -79,6 +79,8 @@ static unsigned long cache_max_small_delta_size = 1000;
 
 static unsigned long window_memory_limit = 0;
 
+static struct list_objects_filter_options filter_options;
+
 /*
  * stats
  */
@@ -2816,7 +2818,12 @@ static void get_object_list(int ac, const char **av)
if (prepare_revision_walk())
die("revision walk setup failed");
mark_edges_uninteresting(&revs, show_edge);
-   traverse_commit_list(&revs, show_commit, show_object, NULL);
+   if (filter_options.choice)
+   traverse_commit_list_filtered(&filter_options, &revs,
+ show_commit, show_object,
+ NULL, NULL);
+   else
+   traverse_commit_list(&revs, show_commit, show_object, NULL);
 
if (unpack_unreachable_expiration) {
revs.ignore_missing_links = 1;
@@ -2952,6 +2959,9 @@ int cmd_pack_objects(int argc, const char **argv, const 
char *prefix)
 N_("use a bitmap index if available to speed up 
counting objects")),
OPT_BOOL(0, "write-bitmap-index", _bitmap_index,
 N_("write a bitmap index together with the pack 
index")),
+
+   OPT_PARSE_LIST_OBJECTS_FILTER(&filter_options),
+
OPT_END(),
};
 
@@ -3028,6 +3038,12 @@ int cmd_pack_objects(int argc, const char **argv, const 
char *prefix)
if (!rev_list_all || !rev_list_reflog || !rev_list_index)
unpack_unreachable_expiration = 0;
 
+   if (filter_options.choice) {
+   if (!pack_to_stdout)
+   die("cannot use filtering with an indexable pack.");
+   use_bitmap_index = 0;
+   }
+
/*
 * "soft" reasons not to use bitmaps - for on-disk repack by default we 
want
 *
-- 
2.9.3



[PATCH 09/13] extension.partialclone: introduce partial clone extension

2017-10-24 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Introduce the ability to have missing objects in a repo.  This
functionality is guarded by new repository extension options:
`extensions.partialcloneremote` and
`extensions.partialclonefilter`.

See the update to Documentation/technical/repository-version.txt
in this patch for more information.
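
For illustration, a partially-cloned repository would end up carrying
settings along these lines (the values are only examples; the clone and
fetch patches in later parts set them automatically, and `extensions.*`
keys are only honored when `core.repositoryformatversion` is 1):

    git config core.repositoryformatversion 1
    git config extensions.partialcloneremote origin
    git config extensions.partialclonefilter blob:none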

This patch is part of a patch originally authored by:
Jonathan Tan <jonathanta...@google.com>

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 Documentation/technical/repository-version.txt | 22 ++
 Makefile   |  1 +
 cache.h|  4 ++
 config.h   |  3 +
 environment.c  |  2 +
 partial-clone-utils.c  | 99 ++
 partial-clone-utils.h  | 34 +
 setup.c| 15 
 8 files changed, 180 insertions(+)
 create mode 100644 partial-clone-utils.c
 create mode 100644 partial-clone-utils.h

diff --git a/Documentation/technical/repository-version.txt 
b/Documentation/technical/repository-version.txt
index 00ad379..9d488db 100644
--- a/Documentation/technical/repository-version.txt
+++ b/Documentation/technical/repository-version.txt
@@ -86,3 +86,25 @@ for testing format-1 compatibility.
 When the config key `extensions.preciousObjects` is set to `true`,
 objects in the repository MUST NOT be deleted (e.g., by `git-prune` or
 `git repack -d`).
+
+`partialcloneremote`
+
+
+When the config key `extensions.partialcloneremote` is set, it indicates
+that the repo was created with a partial clone (or later performed
+a partial fetch) and that the remote may have omitted sending
+certain unwanted objects.  Such a remote is called a "promisor remote"
+and it promises that all such omitted objects can be fetched from it
+in the future.
+
+The value of this key is the name of the promisor remote.
+
+`partialclonefilter`
+
+
+When the config key `extensions.partialclonefilter` is set, it gives
+the initial filter expression used to create the partial clone.
+This value becomes the default filter expression for subsequent
+fetches (called "partial fetches") from the promisor remote.  This
+value may also be set by the first explicit partial fetch following a
+normal clone.
diff --git a/Makefile b/Makefile
index b9ff0b4..38632fb 100644
--- a/Makefile
+++ b/Makefile
@@ -841,6 +841,7 @@ LIB_OBJS += pack-write.o
 LIB_OBJS += pager.o
 LIB_OBJS += parse-options.o
 LIB_OBJS += parse-options-cb.o
+LIB_OBJS += partial-clone-utils.o
 LIB_OBJS += patch-delta.o
 LIB_OBJS += patch-ids.o
 LIB_OBJS += path.o
diff --git a/cache.h b/cache.h
index 6440e2b..4b785c0 100644
--- a/cache.h
+++ b/cache.h
@@ -860,12 +860,16 @@ extern int grafts_replace_parents;
 #define GIT_REPO_VERSION 0
 #define GIT_REPO_VERSION_READ 1
 extern int repository_format_precious_objects;
+extern char *repository_format_partial_clone_remote;
+extern char *repository_format_partial_clone_filter;
 
 struct repository_format {
int version;
int precious_objects;
int is_bare;
char *work_tree;
+   char *partial_clone_remote; /* value of extensions.partialcloneremote */
+   char *partial_clone_filter; /* value of extensions.partialclonefilter */
struct string_list unknown_extensions;
 };
 
diff --git a/config.h b/config.h
index a49d264..90544ef 100644
--- a/config.h
+++ b/config.h
@@ -34,6 +34,9 @@ struct config_options {
const char *git_dir;
 };
 
+#define KEY_PARTIALCLONEREMOTE "partialcloneremote"
+#define KEY_PARTIALCLONEFILTER "partialclonefilter"
+
 typedef int (*config_fn_t)(const char *, const char *, void *);
 extern int git_default_config(const char *, const char *, void *);
 extern int git_config_from_file(config_fn_t fn, const char *, void *);
diff --git a/environment.c b/environment.c
index 8289c25..2fcf9bb 100644
--- a/environment.c
+++ b/environment.c
@@ -27,6 +27,8 @@ int warn_ambiguous_refs = 1;
 int warn_on_object_refname_ambiguity = 1;
 int ref_paranoia = -1;
 int repository_format_precious_objects;
+char *repository_format_partial_clone_remote;
+char *repository_format_partial_clone_filter;
 const char *git_commit_encoding;
 const char *git_log_output_encoding;
 const char *apply_default_whitespace;
diff --git a/partial-clone-utils.c b/partial-clone-utils.c
new file mode 100644
index 000..8c925ae
--- /dev/null
+++ b/partial-clone-utils.c
@@ -0,0 +1,99 @@
+#include "cache.h"
+#include "config.h"
+#include "partial-clone-utils.h"
+
+int is_partial_clone_registered(void)
+{
+   if (repository_format_partial_clone_remote ||
+   repository_format_partial_clone_filter)
+   return 1;
+
+   return 0;
+}
+
+void partial_clone_utils_register(
+   

[PATCH 05/13] list-objects-filter-blobs-limit: add large blob filtering

2017-10-24 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Create a filter for traverse_commit_list_worker() to omit blobs
larger than a requested size from the result, but always include
".git*" special files.

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 Makefile  |   1 +
 list-objects-filter-blobs-limit.c | 146 ++
 list-objects-filter-blobs-limit.h |  18 +
 3 files changed, 165 insertions(+)
 create mode 100644 list-objects-filter-blobs-limit.c
 create mode 100644 list-objects-filter-blobs-limit.h

diff --git a/Makefile b/Makefile
index 7e9d1f4..0fdeabb 100644
--- a/Makefile
+++ b/Makefile
@@ -807,6 +807,7 @@ LIB_OBJS += levenshtein.o
 LIB_OBJS += line-log.o
 LIB_OBJS += line-range.o
 LIB_OBJS += list-objects.o
+LIB_OBJS += list-objects-filter-blobs-limit.o
 LIB_OBJS += list-objects-filter-blobs-none.o
 LIB_OBJS += list-objects-filter-map.o
 LIB_OBJS += ll-merge.o
diff --git a/list-objects-filter-blobs-limit.c 
b/list-objects-filter-blobs-limit.c
new file mode 100644
index 000..f68963d
--- /dev/null
+++ b/list-objects-filter-blobs-limit.c
@@ -0,0 +1,146 @@
+#include "cache.h"
+#include "dir.h"
+#include "tag.h"
+#include "commit.h"
+#include "tree.h"
+#include "blob.h"
+#include "diff.h"
+#include "tree-walk.h"
+#include "revision.h"
+#include "list-objects.h"
+#include "list-objects-filter-blobs-limit.h"
+
+#define DEFAULT_MAP_SIZE (16*1024)
+
+/*
+ * A filter for list-objects to omit large blobs,
+ * but always include ".git*" special files.
+ * And to OPTIONALLY collect a list of the omitted OIDs.
+ */
+struct filter_blobs_limit_data {
+   struct oidmap *omits;
+   unsigned long max_bytes;
+};
+
+static list_objects_filter_result filter_blobs_limit(
+   list_objects_filter_type filter_type,
+   struct object *obj,
+   const char *pathname,
+   const char *filename,
+   void *filter_data_)
+{
+   struct filter_blobs_limit_data *filter_data = filter_data_;
+   struct list_objects_filter_data_entry *entry;
+   unsigned long object_length;
+   enum object_type t;
+   int is_special_filename;
+
+   switch (filter_type) {
+   default:
+   die("unkown filter_type");
+   return LOFR_ZERO;
+
+   case LOFT_BEGIN_TREE:
+   assert(obj->type == OBJ_TREE);
+   /* always include all tree objects */
+   return LOFR_MARK_SEEN | LOFR_SHOW;
+
+   case LOFT_END_TREE:
+   assert(obj->type == OBJ_TREE);
+   return LOFR_ZERO;
+
+   case LOFT_BLOB:
+   assert(obj->type == OBJ_BLOB);
+   assert((obj->flags & SEEN) == 0);
+
+   is_special_filename = ((strncmp(filename, ".git", 4) == 0) &&
+  filename[4]);
+
+   /*
+* If we are keeping a list of the omitted objects
+* for the caller *AND* we previously "provisionally"
+* omitted this object (because of size) *AND* it now
+* has a special filename, make it not-omitted.
+* Otherwise, continue to provisionally omit it.
+*/
+   if (filter_data->omits &&
+   oidmap_get(filter_data->omits, &obj->oid)) {
+   if (!is_special_filename)
+   return LOFR_ZERO;
+   entry = oidmap_remove(filter_data->omits, &obj->oid);
+   free(entry);
+   return LOFR_MARK_SEEN | LOFR_SHOW;
+   }
+
+   /*
+* If filename matches ".git*", always include it (regardless
+* of size).  (This may include blobs that we do not have
+* locally.)
+*/
+   if (is_special_filename)
+   return LOFR_MARK_SEEN | LOFR_SHOW;
+
+   t = sha1_object_info(obj->oid.hash, &object_length);
+   if (t != OBJ_BLOB) { /* probably OBJ_NONE */
+   /*
+* We DO NOT have the blob locally, so we cannot
+* apply the size filter criteria.  Be conservative
+* and force show it (and let the caller deal with
+* the ambiguity).  (This matches the behavior above
+* when the special filename matches.)
+*/
+   return LOFR_MARK_SEEN | LOFR_SHOW;
+   }
+
+   if (object_length < filter_data->max_bytes)
+   return LOFR_MARK_SEEN | LOFR_SHOW;
+
+   /*
+* Provisionally omit it.  We've already established
+  

[PATCH 04/13] list-objects-filter-blobs-none: add filter to omit all blobs

2017-10-24 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Create a simple filter for traverse_commit_list_worker() to omit
all blobs from the result.

This filter will be used in a future commit by rev-list and pack-objects
to create a "commits and trees" result.  This is intended for partial
clone and fetch support.

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 Makefile |  1 +
 list-objects-filter-blobs-none.c | 83 
 list-objects-filter-blobs-none.h | 18 +
 3 files changed, 102 insertions(+)
 create mode 100644 list-objects-filter-blobs-none.c
 create mode 100644 list-objects-filter-blobs-none.h

diff --git a/Makefile b/Makefile
index e59f12d..7e9d1f4 100644
--- a/Makefile
+++ b/Makefile
@@ -807,6 +807,7 @@ LIB_OBJS += levenshtein.o
 LIB_OBJS += line-log.o
 LIB_OBJS += line-range.o
 LIB_OBJS += list-objects.o
+LIB_OBJS += list-objects-filter-blobs-none.o
 LIB_OBJS += list-objects-filter-map.o
 LIB_OBJS += ll-merge.o
 LIB_OBJS += lockfile.o
diff --git a/list-objects-filter-blobs-none.c b/list-objects-filter-blobs-none.c
new file mode 100644
index 000..1b548b9
--- /dev/null
+++ b/list-objects-filter-blobs-none.c
@@ -0,0 +1,83 @@
+#include "cache.h"
+#include "dir.h"
+#include "tag.h"
+#include "commit.h"
+#include "tree.h"
+#include "blob.h"
+#include "diff.h"
+#include "tree-walk.h"
+#include "revision.h"
+#include "list-objects.h"
+#include "list-objects-filter-blobs-none.h"
+
+#define DEFAULT_MAP_SIZE (16*1024)
+
+/*
+ * A filter for list-objects to omit ALL blobs from the traversal.
+ * And to OPTIONALLY collect a list of the omitted OIDs.
+ */
+struct filter_blobs_none_data {
+   struct oidmap *omits;
+};
+
+static list_objects_filter_result filter_blobs_none(
+   list_objects_filter_type filter_type,
+   struct object *obj,
+   const char *pathname,
+   const char *filename,
+   void *filter_data_)
+{
+   struct filter_blobs_none_data *filter_data = filter_data_;
+
+   switch (filter_type) {
+   default:
+   die("unkown filter_type");
+   return LOFR_ZERO;
+
+   case LOFT_BEGIN_TREE:
+   assert(obj->type == OBJ_TREE);
+   /* always include all tree objects */
+   return LOFR_MARK_SEEN | LOFR_SHOW;
+
+   case LOFT_END_TREE:
+   assert(obj->type == OBJ_TREE);
+   return LOFR_ZERO;
+
+   case LOFT_BLOB:
+   assert(obj->type == OBJ_BLOB);
+   assert((obj->flags & SEEN) == 0);
+
+   if (filter_data->omits)
+   list_objects_filter_map_insert(
+   filter_data->omits, &obj->oid, pathname,
+   obj->type);
+
+   return LOFR_MARK_SEEN; /* but not LOFR_SHOW (hard omit) */
+   }
+}
+
+void traverse_commit_list__blobs_none(
+   struct rev_info *revs,
+   show_commit_fn show_commit,
+   show_object_fn show_object,
+   list_objects_filter_map_foreach_cb print_omitted_object,
+   void *ctx_data)
+{
+   struct filter_blobs_none_data d;
+
+   memset(&d, 0, sizeof(d));
+   if (print_omitted_object) {
+   d.omits = xcalloc(1, sizeof(*d.omits));
+   oidmap_init(d.omits, DEFAULT_MAP_SIZE);
+   }
+
+   traverse_commit_list_worker(revs, show_commit, show_object, ctx_data,
+   filter_blobs_none, &d);
+
+   if (print_omitted_object) {
+   list_objects_filter_map_foreach(d.omits,
+   print_omitted_object,
+   ctx_data);
+   oidmap_free(d.omits, 1);
+   }
+}
diff --git a/list-objects-filter-blobs-none.h b/list-objects-filter-blobs-none.h
new file mode 100644
index 000..363c9de
--- /dev/null
+++ b/list-objects-filter-blobs-none.h
@@ -0,0 +1,18 @@
+#ifndef LIST_OBJECTS_FILTER_BLOBS_NONE_H
+#define LIST_OBJECTS_FILTER_BLOBS_NONE_H
+
+#include "list-objects-filter-map.h"
+
+/*
+ * A filter for list-objects to omit ALL blobs
+ * from the traversal.
+ */
+void traverse_commit_list__blobs_none(
+   struct rev_info *revs,
+   show_commit_fn show_commit,
+   show_object_fn show_object,
+   list_objects_filter_map_foreach_cb print_omitted_object,
+   void *ctx_data);
+
+#endif /* LIST_OBJECTS_FILTER_BLOBS_NONE_H */
+
-- 
2.9.3



[PATCH 13/13] t5317: pack-objects object filtering test

2017-10-24 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 t/t5317-pack-objects-filter-objects.sh | 384 +
 1 file changed, 384 insertions(+)
 create mode 100755 t/t5317-pack-objects-filter-objects.sh

diff --git a/t/t5317-pack-objects-filter-objects.sh 
b/t/t5317-pack-objects-filter-objects.sh
new file mode 100755
index 000..ef7a8f6
--- /dev/null
+++ b/t/t5317-pack-objects-filter-objects.sh
@@ -0,0 +1,384 @@
+#!/bin/sh
+
+test_description='git pack-objects with object filtering for partial clone'
+
+. ./test-lib.sh
+
+# Test blob:none filter.
+
+test_expect_success 'setup r1' '
+   echo "{print \$1}" >print_1.awk &&
+   echo "{print \$2}" >print_2.awk &&
+
+   git init r1 &&
+   for n in 1 2 3 4 5
+   do
+   echo "This is file: $n" > r1/file.$n
+   git -C r1 add file.$n
+   git -C r1 commit -m "$n"
+   done
+'
+
+test_expect_success 'verify blob count in normal packfile' '
+   git -C r1 ls-files -s file.1 file.2 file.3 file.4 file.5 \
+   | awk -f print_2.awk \
+   | sort >expected &&
+   git -C r1 pack-objects --rev --stdout >all.pack <<-EOF &&
+   HEAD
+   EOF
+   git -C r1 index-pack ../all.pack &&
+   git -C r1 verify-pack -v ../all.pack \
+   | grep blob \
+   | awk -f print_1.awk \
+   | sort >observed &&
+   test_cmp observed expected
+'
+
+test_expect_success 'verify blob:none packfile has no blobs' '
+   git -C r1 pack-objects --rev --stdout --filter=blob:none >filter.pack 
<<-EOF &&
+   HEAD
+   EOF
+   git -C r1 index-pack ../filter.pack &&
+   git -C r1 verify-pack -v ../filter.pack \
+   | grep blob \
+   | awk -f print_1.awk \
+   | sort >observed &&
+   nr=$(wc -l <observed) &&
+   test 0 -eq $nr
+'
+
+test_expect_success 'verify normal and blob:none packfiles have same 
commits/trees' '
+   git -C r1 verify-pack -v ../all.pack \
+   | grep -E "commit|tree" \
+   | awk -f print_1.awk \
+   | sort >expected &&
+   git -C r1 verify-pack -v ../filter.pack \
+   | grep -E "commit|tree" \
+   | awk -f print_1.awk \
+   | sort >observed &&
+   test_cmp observed expected
+'
+
+# Test blob:limit=<n>[kmg] filter.
+# We boundary test around the size parameter.  The filter is strictly less than
+# the value, so size 500 and 1000 should have the same results, but 1001 should
+# filter less (the 1000-byte blob is no longer omitted).
+
+test_expect_success 'setup r2' '
+   git init r2 &&
+   for n in 1000 1
+   do
+   printf "%"$n"s" X > r2/large.$n
+   git -C r2 add large.$n
+   git -C r2 commit -m "$n"
+   done
+'
+
+test_expect_success 'verify blob count in normal packfile' '
+   git -C r2 ls-files -s large.1000 large.1 \
+   | awk -f print_2.awk \
+   | sort >expected &&
+   git -C r2 pack-objects --rev --stdout >all.pack <<-EOF &&
+   HEAD
+   EOF
+   git -C r2 index-pack ../all.pack &&
+   git -C r2 verify-pack -v ../all.pack \
+   | grep blob \
+   | awk -f print_1.awk \
+   | sort >observed &&
+   test_cmp observed expected
+'
+
+test_expect_success 'verify blob:limit=500 omits all blobs' '
+   git -C r2 pack-objects --rev --stdout --filter=blob:limit=500 
>filter.pack <<-EOF &&
+   HEAD
+   EOF
+   git -C r2 index-pack ../filter.pack &&
+   git -C r2 verify-pack -v ../filter.pack \
+   | grep blob \
+   | awk -f print_1.awk \
+   | sort >observed &&
+   nr=$(wc -l <observed) &&
+   test 0 -eq $nr
+'
+
+test_expect_success 'verify blob:limit=1000' '
+   git -C r2 pack-objects --rev --stdout --filter=blob:limit=1000 
>filter.pack <<-EOF &&
+   HEAD
+   EOF
+   git -C r2 index-pack ../filter.pack &&
+   git -C r2 verify-pack -v ../filter.pack \
+   | grep blob \
+   | awk -f print_1.awk \
+   | sort >observed &&
+   nr=$(wc -l <observed) &&
+   test 0 -eq $nr
+'
+
+test_expect_success 'verify blob:limit=1001' '
+   git -C r2 ls-files -s large.1000 \
+   | awk -f print_2.awk \
+   | sort >expected &&
+   git -C r2 pack-objects --rev --stdout --filter=blob:limit=1001 
>filter.pack <<-EOF &&
+   HEAD
+   EOF
+   git -C r2 index-pa

[PATCH 01/13] dir: allow exclusions from blob in addition to file

2017-10-24 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Refactor add_excludes() to separate the reading of the
exclude file into a buffer and the parsing of the buffer
into exclude_list items.

Add add_excludes_from_blob_to_list() to allow an exclude
file to be specified by an OID.

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 dir.c | 51 +--
 dir.h |  3 +++
 2 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/dir.c b/dir.c
index 1d17b80..d848f2b 100644
--- a/dir.c
+++ b/dir.c
@@ -739,6 +739,10 @@ static void invalidate_directory(struct untracked_cache 
*uc,
dir->dirs[i]->recurse = 0;
 }
 
+static int add_excludes_from_buffer(char *buf, size_t size,
+   const char *base, int baselen,
+   struct exclude_list *el);
+
 /*
  * Given a file with name "fname", read it (either from disk, or from
  * an index if 'istate' is non-null), parse it and store the
@@ -754,9 +758,9 @@ static int add_excludes(const char *fname, const char 
*base, int baselen,
struct sha1_stat *sha1_stat)
 {
struct stat st;
-   int fd, i, lineno = 1;
+   int fd;
size_t size = 0;
-   char *buf, *entry;
+   char *buf;
 
fd = open(fname, O_RDONLY);
if (fd < 0 || fstat(fd, &st) < 0) {
@@ -813,6 +817,17 @@ static int add_excludes(const char *fname, const char 
*base, int baselen,
}
}
 
+   add_excludes_from_buffer(buf, size, base, baselen, el);
+   return 0;
+}
+
+static int add_excludes_from_buffer(char *buf, size_t size,
+   const char *base, int baselen,
+   struct exclude_list *el)
+{
+   int i, lineno = 1;
+   char *entry;
+
el->filebuf = buf;
 
if (skip_utf8_bom(&buf, size))
@@ -841,6 +856,38 @@ int add_excludes_from_file_to_list(const char *fname, 
const char *base,
return add_excludes(fname, base, baselen, el, istate, NULL);
 }
 
+int add_excludes_from_blob_to_list(
+   struct object_id *oid,
+   const char *base, int baselen,
+   struct exclude_list *el)
+{
+   char *buf;
+   unsigned long size;
+   enum object_type type;
+
+   buf = read_sha1_file(oid->hash, &type, &size);
+   if (!buf)
+   return -1;
+
+   if (type != OBJ_BLOB) {
+   free(buf);
+   return -1;
+   }
+
+   if (size == 0) {
+   free(buf);
+   return 0;
+   }
+
+   if (buf[size - 1] != '\n') {
+   buf = xrealloc(buf, st_add(size, 1));
+   buf[size++] = '\n';
+   }
+
+   add_excludes_from_buffer(buf, size, base, baselen, el);
+   return 0;
+}
+
 struct exclude_list *add_exclude_list(struct dir_struct *dir,
  int group_type, const char *src)
 {
diff --git a/dir.h b/dir.h
index e371705..1bcf391 100644
--- a/dir.h
+++ b/dir.h
@@ -256,6 +256,9 @@ extern struct exclude_list *add_exclude_list(struct 
dir_struct *dir,
 extern int add_excludes_from_file_to_list(const char *fname, const char *base, 
int baselen,
  struct exclude_list *el, struct  
index_state *istate);
 extern void add_excludes_from_file(struct dir_struct *, const char *fname);
+extern int add_excludes_from_blob_to_list(struct object_id *oid,
+ const char *base, int baselen,
+ struct exclude_list *el);
 extern void parse_exclude_pattern(const char **string, int *patternlen, 
unsigned *flags, int *nowildcardlen);
 extern void add_exclude(const char *string, const char *base,
int baselen, struct exclude_list *el, int srcpos);
-- 
2.9.3



[PATCH 03/13] list-objects: filter objects in traverse_commit_list

2017-10-24 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Create traverse_commit_list_filtered() and add filtering
interface to allow certain objects to be omitted (not shown)
during a traversal.

Update traverse_commit_list() to be a wrapper for the above.

Filtering will be used in a future commit by rev-list and
pack-objects for narrow/partial clone/fetch to omit certain
blobs from the output.

traverse_bitmap_commit_list() does not work with filtering.
If a packfile bitmap is present, it will not be used.

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 list-objects.c | 66 --
 list-objects.h | 32 +++-
 2 files changed, 81 insertions(+), 17 deletions(-)

diff --git a/list-objects.c b/list-objects.c
index b3931fa..3e86008 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -13,10 +13,13 @@ static void process_blob(struct rev_info *revs,
 show_object_fn show,
 struct strbuf *path,
 const char *name,
-void *cb_data)
+void *cb_data,
+filter_object_fn filter,
+void *filter_data)
 {
struct object *obj = &blob->object;
size_t pathlen;
+   list_objects_filter_result r = LOFR_MARK_SEEN | LOFR_SHOW;
 
if (!revs->blob_objects)
return;
@@ -24,11 +27,15 @@ static void process_blob(struct rev_info *revs,
die("bad blob object");
if (obj->flags & (UNINTERESTING | SEEN))
return;
-   obj->flags |= SEEN;
 
pathlen = path->len;
strbuf_addstr(path, name);
-   show(obj, path->buf, cb_data);
+   if (filter)
+   r = filter(LOFT_BLOB, obj, path->buf, &path->buf[pathlen], filter_data);
+   if (r & LOFR_MARK_SEEN)
+   obj->flags |= SEEN;
+   if (r & LOFR_SHOW)
+   show(obj, path->buf, cb_data);
strbuf_setlen(path, pathlen);
 }
 
@@ -69,7 +76,9 @@ static void process_tree(struct rev_info *revs,
 show_object_fn show,
 struct strbuf *base,
 const char *name,
-void *cb_data)
+void *cb_data,
+filter_object_fn filter,
+void *filter_data)
 {
struct object *obj = &tree->object;
struct tree_desc desc;
@@ -77,6 +86,7 @@ static void process_tree(struct rev_info *revs,
enum interesting match = revs->diffopt.pathspec.nr == 0 ?
all_entries_interesting: entry_not_interesting;
int baselen = base->len;
+   list_objects_filter_result r = LOFR_MARK_SEEN | LOFR_SHOW;
 
if (!revs->tree_objects)
return;
@@ -90,9 +100,13 @@ static void process_tree(struct rev_info *revs,
die("bad tree object %s", oid_to_hex(>oid));
}
 
-   obj->flags |= SEEN;
strbuf_addstr(base, name);
-   show(obj, base->buf, cb_data);
+   if (filter)
+   r = filter(LOFT_BEGIN_TREE, obj, base->buf, &base->buf[baselen], filter_data);
+   if (r & LOFR_MARK_SEEN)
+   obj->flags |= SEEN;
+   if (r & LOFR_SHOW)
+   show(obj, base->buf, cb_data);
if (base->len)
strbuf_addch(base, '/');
 
@@ -112,7 +126,7 @@ static void process_tree(struct rev_info *revs,
process_tree(revs,
 lookup_tree(entry.oid),
 show, base, entry.path,
-cb_data);
+cb_data, filter, filter_data);
else if (S_ISGITLINK(entry.mode))
process_gitlink(revs, entry.oid->hash,
show, base, entry.path,
@@ -121,8 +135,17 @@ static void process_tree(struct rev_info *revs,
process_blob(revs,
 lookup_blob(entry.oid),
 show, base, entry.path,
-cb_data);
+cb_data, filter, filter_data);
}
+
+   if (filter) {
+   r = filter(LOFT_END_TREE, obj, base->buf, &base->buf[baselen], filter_data);
+   if (r & LOFR_MARK_SEEN)
+   obj->flags |= SEEN;
+   if (r & LOFR_SHOW)
+   show(obj, base->buf, cb_data);
+   }
+
strbuf_setlen(base, baselen);
free_tree_buffer(tree);
 }
@@ -183,10 +206,10 @@ static void add_pending_tree(struct rev_info *revs, 
struct tree *tree)
add_pending_object(revs, &tree->object, "");
 }
 
-void traverse_commit_list(struct rev_in

[PATCH 07/13] list-objects-filter-options: common argument parsing

2017-10-24 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Create common routines and defines for parsing
list-objects-filter-related command line arguments and
pack-protocol fields.
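
As implemented below, the accepted command-line spellings use '=' for
the second separator (the grammar comment in the code still shows ':'),
for example:

    git rev-list --objects --filter=blob:none HEAD
    git rev-list --objects --filter=blob:limit=1m HEAD
    git rev-list --objects --filter=sparse:oid=<blob-ish> HEAD
    git rev-list --objects --filter=sparse:path=<path> HEAD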

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 Makefile  |   1 +
 list-objects-filter-options.c | 101 ++
 list-objects-filter-options.h |  50 +
 3 files changed, 152 insertions(+)
 create mode 100644 list-objects-filter-options.c
 create mode 100644 list-objects-filter-options.h

diff --git a/Makefile b/Makefile
index fc82664..b9ff0b4 100644
--- a/Makefile
+++ b/Makefile
@@ -810,6 +810,7 @@ LIB_OBJS += list-objects.o
 LIB_OBJS += list-objects-filter-blobs-limit.o
 LIB_OBJS += list-objects-filter-blobs-none.o
 LIB_OBJS += list-objects-filter-map.o
+LIB_OBJS += list-objects-filter-options.o
 LIB_OBJS += list-objects-filter-sparse.o
 LIB_OBJS += ll-merge.o
 LIB_OBJS += lockfile.o
diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
new file mode 100644
index 000..40f48ac
--- /dev/null
+++ b/list-objects-filter-options.c
@@ -0,0 +1,101 @@
+#include "cache.h"
+#include "commit.h"
+#include "config.h"
+#include "revision.h"
+#include "list-objects.h"
+#include "list-objects-filter-options.h"
+
+/*
+ * Parse value of the argument to the "filter" keyword.
+ * On the command line this looks like: --filter=<filter-spec>
+ * and in the pack protocol as: filter <filter-spec>
+ *
+ *  <filter-spec> ::= blob:none
+ *                    blob:limit:<n>[kmg]
+ *                    sparse:oid:<oid-ish>
+ *                    sparse:path:<path>
+ */
+int parse_list_objects_filter(struct list_objects_filter_options 
*filter_options,
+ const char *arg)
+{
+   struct object_context oc;
+   struct object_id sparse_oid;
+   const char *v0;
+   const char *v1;
+
+   if (filter_options->choice)
+   die(_("multiple object filter types cannot be combined"));
+
+   /*
+* TODO consider rejecting 'arg' if it contains any
+* TODO injection characters (since we might send this
+* TODO to a sub-command or to the server and we don't
+* TODO want to deal with legacy quoting/escaping for
+* TODO a new feature).
+*/
+
+   filter_options->raw_value = strdup(arg);
+
+   if (skip_prefix(arg, "blob:", ) || skip_prefix(arg, "blobs:", )) {
+   if (!strcmp(v0, "none")) {
+   filter_options->choice = LOFC_BLOB_NONE;
+   return 0;
+   }
+
+   if (skip_prefix(v0, "limit=", ) &&
+   git_parse_ulong(v1, _options->blob_limit_value)) {
+   filter_options->choice = LOFC_BLOB_LIMIT;
+   return 0;
+   }
+   }
+   else if (skip_prefix(arg, "sparse:", )) {
+   if (skip_prefix(v0, "oid=", )) {
+   filter_options->choice = LOFC_SPARSE_OID;
+   if (!get_oid_with_context(v1, GET_OID_BLOB,
+ _oid, )) {
+   /*
+* We successfully converted the <oid-ish>
+* into an actual OID.  Rewrite the raw_value
+* in canonical form with just the OID.
+* (If we send this request to the server, we
+* want an absolute expression rather than a
+* local-ref-relative expression.)
+*/
+   free((char *)filter_options->raw_value);
+   filter_options->raw_value =
+   xstrfmt("sparse:oid=%s",
+   oid_to_hex(&sparse_oid));
+   filter_options->sparse_oid_value =
+   oiddup(&sparse_oid);
+   } else {
+   /*
+* We could not turn the <oid-ish> into an
+* OID.  Leave the raw_value as is in case
+* the server can parse it.  (It may refer to
+* a branch, commit, or blob we don't have.)
+*/
+   }
+   return 0;
+   }
+
+   if (skip_prefix(v0, "path=", )) {
+   filter_options->choice = LOFC_SPARSE_PATH;
+   filter_options->sparse_path_value = strdup(v1);
+   return 0;
+   }
+   }
+
+   die(_("invalid filter expression '%s'"), arg);
+   return 0;
+

[PATCH 11/13] t6112: rev-list object filtering test

2017-10-24 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 t/t6112-rev-list-filters-objects.sh | 223 
 1 file changed, 223 insertions(+)
 create mode 100755 t/t6112-rev-list-filters-objects.sh

diff --git a/t/t6112-rev-list-filters-objects.sh b/t/t6112-rev-list-filters-objects.sh
new file mode 100755
index 000..26fa12f
--- /dev/null
+++ b/t/t6112-rev-list-filters-objects.sh
@@ -0,0 +1,223 @@
+#!/bin/sh
+
+test_description='git rev-list with object filtering for partial clone'
+
+. ./test-lib.sh
+
+# Test the blob:none filter.
+
+test_expect_success 'setup r1' '
+   echo "{print \$1}" >print_1.awk &&
+   echo "{print \$2}" >print_2.awk &&
+
+   git init r1 &&
+   for n in 1 2 3 4 5
+   do
+   echo "This is file: $n" > r1/file.$n
+   git -C r1 add file.$n
+   git -C r1 commit -m "$n"
+   done
+'
+
+test_expect_success 'verify blob:none omits all 5 blobs' '
+   git -C r1 ls-files -s file.1 file.2 file.3 file.4 file.5 \
+   | awk -f print_2.awk \
+   | sort >expected &&
+   git -C r1 rev-list HEAD --quiet --objects --filter-print-omitted --filter=blob:none \
+   | awk -f print_1.awk \
+   | sed "s/~//" >observed &&
+   test_cmp observed expected
+'
+
+test_expect_success 'verify emitted+omitted == all' '
+   git -C r1 rev-list HEAD --objects \
+   | awk -f print_1.awk \
+   | sort >expected &&
+   git -C r1 rev-list HEAD --objects --filter-print-omitted --filter=blob:none \
+   | awk -f print_1.awk \
+   | sed "s/~//" \
+   | sort >observed &&
+   test_cmp observed expected
+'
+
+
+# Test blob:limit=<n>[kmg] filter.
+# We boundary test around the size parameter.  The filter keeps only blobs
+# strictly smaller than the limit, so limits of 500 and 1000 give the same
+# results here, while 1001 lets the 1000-byte blob through (one fewer omitted).
+
+test_expect_success 'setup r2' '
+   git init r2 &&
+   for n in 1000 10000
+   do
+   printf "%"$n"s" X > r2/large.$n
+   git -C r2 add large.$n
+   git -C r2 commit -m "$n"
+   done
+'
+
+test_expect_success 'verify blob:limit=500 omits all blobs' '
+   git -C r2 ls-files -s large.1000 large.10000 \
+   | awk -f print_2.awk \
+   | sort >expected &&
+   git -C r2 rev-list HEAD --quiet --objects --filter-print-omitted --filter=blob:limit=500 \
+   | awk -f print_1.awk \
+   | sed "s/~//" >observed &&
+   test_cmp observed expected
+'
+
+test_expect_success 'verify emitted+omitted == all' '
+   git -C r2 rev-list HEAD --objects \
+   | awk -f print_1.awk \
+   | sort >expected &&
+   git -C r2 rev-list HEAD --objects --filter-print-omitted --filter=blob:limit=500 \
+   | awk -f print_1.awk \
+   | sed "s/~//" \
+   | sort >observed &&
+   test_cmp observed expected
+'
+
+test_expect_success 'verify blob:limit=1000' '
+   git -C r2 ls-files -s large.1000 large.10000 \
+   | awk -f print_2.awk \
+   | sort >expected &&
+   git -C r2 rev-list HEAD --quiet --objects --filter-print-omitted --filter=blob:limit=1000 \
+   | awk -f print_1.awk \
+   | sed "s/~//" >observed &&
+   test_cmp observed expected
+'
+
+test_expect_success 'verify blob:limit=1001' '
+   git -C r2 ls-files -s large.10000 \
+   | awk -f print_2.awk \
+   | sort >expected &&
+   git -C r2 rev-list HEAD --quiet --objects --filter-print-omitted --filter=blob:limit=1001 \
+   | awk -f print_1.awk \
+   | sed "s/~//" >observed &&
+   test_cmp observed expected
+'
+
+test_expect_success 'verify blob:limit=1k' '
+   git -C r2 ls-files -s large.10000 \
+   | awk -f print_2.awk \
+   | sort >expected &&
+   git -C r2 rev-list HEAD --quiet --objects --filter-print-omitted --filter=blob:limit=1k \
+   | awk -f print_1.awk \
+   | sed "s/~//" >observed &&
+   test_cmp observed expected
+'
+
+test_expect_success 'verify blob:limit=1m' '
+   cat </dev/null >expected &&
+   git -C r2 rev-list HEAD --quiet --objects --filter-print-omitted --filter=blob:limit=1m \
+   | awk -f print_1.awk \
+   | sed "s/~//" >observed &&
+   test_cmp observed expected
+'
+
+# Test sparse:path=<path> filter.
+# Use a local file containing a sparse-checkout s
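
The sed "s/~//" steps in the tests above rely on --filter-print-omitted tagging
each omitted object in the rev-list output with a leading "~"; a hedged sketch
of inspecting those markers directly (the exact output format is assumed, since
the rev-list side of the series is not shown in this excerpt):

    # Without --quiet both the included objects and the omitted ones appear;
    # omitted entries carry the "~" prefix on the object id.
    git -C r2 rev-list HEAD --objects --filter-print-omitted --filter=blob:limit=1001 |
    grep "^~" |        # keep only the omitted-object records
    sed "s/~//" |      # strip the marker, leaving bare OIDs
    sort

With blob:limit=1001 only the 10000-byte blob should appear here, matching the
boundary expectations in the tests above.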

[PATCH 06/13] list-objects-filter-sparse: add sparse filter

2017-10-24 Thread Jeff Hostetler
From: Jeff Hostetler <jeffh...@microsoft.com>

Create a filter for traverse_commit_list_worker() to only include
the blobs that would be referenced by a sparse-checkout using the
given specification.

Signed-off-by: Jeff Hostetler <jeffh...@microsoft.com>
---
 Makefile |   1 +
 list-objects-filter-sparse.c | 241 +++
 list-objects-filter-sparse.h |  30 ++
 3 files changed, 272 insertions(+)
 create mode 100644 list-objects-filter-sparse.c
 create mode 100644 list-objects-filter-sparse.h

diff --git a/Makefile b/Makefile
index 0fdeabb..fc82664 100644
--- a/Makefile
+++ b/Makefile
@@ -810,6 +810,7 @@ LIB_OBJS += list-objects.o
 LIB_OBJS += list-objects-filter-blobs-limit.o
 LIB_OBJS += list-objects-filter-blobs-none.o
 LIB_OBJS += list-objects-filter-map.o
+LIB_OBJS += list-objects-filter-sparse.o
 LIB_OBJS += ll-merge.o
 LIB_OBJS += lockfile.o
 LIB_OBJS += log-tree.o
diff --git a/list-objects-filter-sparse.c b/list-objects-filter-sparse.c
new file mode 100644
index 000..386b667
--- /dev/null
+++ b/list-objects-filter-sparse.c
@@ -0,0 +1,241 @@
+#include "cache.h"
+#include "dir.h"
+#include "tag.h"
+#include "commit.h"
+#include "tree.h"
+#include "blob.h"
+#include "diff.h"
+#include "tree-walk.h"
+#include "revision.h"
+#include "list-objects.h"
+#include "list-objects-filter-sparse.h"
+
+#define DEFAULT_MAP_SIZE (16*1024)
+
+/*
+ * A filter driven by a sparse-checkout specification to only
+ * include blobs that a sparse checkout would populate.
+ *
+ * The sparse-checkout spec can be loaded from a blob with the
+ * given OID or from a local pathname.  We allow an OID because
+ * the repo may be bare or we may be doing the filtering on the
+ * server.
+ */
+struct frame {
+   int defval;
+   int child_prov_omit : 1;
+};
+
+struct filter_use_sparse_data {
+   struct oidmap *omits;
+   struct exclude_list el;
+
+   size_t nr, alloc;
+   struct frame *array_frame;
+};
+
+static list_objects_filter_result filter_use_sparse(
+   list_objects_filter_type filter_type,
+   struct object *obj,
+   const char *pathname,
+   const char *filename,
+   void *filter_data_)
+{
+   struct filter_use_sparse_data *filter_data = filter_data_;
+   struct list_objects_filter_map_entry *entry_prev = NULL;
+   int val, dtype;
+   struct frame *frame;
+
+   switch (filter_type) {
+   default:
+   die("unkown filter_type");
+   return LOFR_ZERO;
+
+   case LOFT_BEGIN_TREE:
+   assert(obj->type == OBJ_TREE);
+   dtype = DT_DIR;
+   val = is_excluded_from_list(pathname, strlen(pathname),
+   filename, &dtype, &filter_data->el,
+   &the_index);
+   if (val < 0)
+   val = filter_data->array_frame[filter_data->nr].defval;
+
+   ALLOC_GROW(filter_data->array_frame, filter_data->nr + 1,
+  filter_data->alloc);
+   filter_data->nr++;
+   filter_data->array_frame[filter_data->nr].defval = val;
+   filter_data->array_frame[filter_data->nr].child_prov_omit = 0;
+
+   /*
+* A directory with this tree OID may appear in multiple
+* places in the tree. (Think of a directory move, with
+* no other changes.)  And with a different pathname, the
+* is_excluded...() results for this directory and items
+* contained within it may be different.  So we cannot
+* mark it SEEN (yet), since that will prevent process_tree()
+* from revisiting this tree object with other pathnames.
+*
+* Only SHOW the tree object the first time we visit this
+* tree object.
+*
+* We always show all tree objects.  A future optimization
+* may want to attempt to narrow this.
+*/
+   if (obj->flags & FILTER_REVISIT)
+   return LOFR_ZERO;
+   obj->flags |= FILTER_REVISIT;
+   return LOFR_SHOW;
+
+   case LOFT_END_TREE:
+   assert(obj->type == OBJ_TREE);
+   assert(filter_data->nr > 0);
+
+   frame = &filter_data->array_frame[filter_data->nr];
+   filter_data->nr--;
+
+   /*
+* Tell our parent directory if any of our children were
+* provisionally omitted.
+*/
+   filter_data->array_frame[filter_data->nr].child_prov_omit |=
+   frame->child_prov_omit;
+
+   /*
+* 

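To connect the sparse filter back to the parser earlier in the thread, a hedged
usage sketch; the directory names and pattern contents are invented, and the
filter is expected to keep only the blobs a sparse checkout of those patterns
would populate while still showing all commits and trees:

    # A sparse-checkout style specification keeping only dir1/ and docs/:
    printf "/dir1/\n/docs/\n" >pattern &&
    git add pattern &&
    git commit -m "add sparse pattern" &&

    # Spec loaded from a committed blob (usable on a bare repo or server side):
    git rev-list HEAD --objects --filter=sparse:oid=HEAD:pattern &&

    # Spec loaded from a local file:
    git rev-list HEAD --objects --filter=sparse:path=pattern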