Re: fetch-any-blob / ref-in-want proposal

2017-07-24 Thread Jonathan Tan
On Sun, 23 Jul 2017 09:41:50 +0300
Orgad Shaneh <org...@gmail.com> wrote:

> Hi,
> 
> Jonathan Tan proposed a design and a patch series for requesting a
> specific ref on fetch 4 months ago[1].
> 
> Is there any progress with this?
> 
> - Orgad
> 
> [1] 
> https://public-inbox.org/git/ffd92ad9-39fe-c76b-178d-6e3d6a425...@google.com/

Do you mean requesting a specific blob (as referenced by your link)? If yes, it 
is still being discussed. One such discussion is here: [1]

If you mean ref-in-want, I don't recall anything being done since then.

[1] https://public-inbox.org/git/cover.1499800530.git.jonathanta...@google.com/


Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects

2017-07-21 Thread Jonathan Tan
On Fri, 21 Jul 2017 12:24:52 -0400
Ben Peart  wrote:

> Today we have 3.5 million objects * 30 bytes per entry = 105 MB of 
> promises. Given the average developer only hydrates 56K files (2 MB 
> promises) that is 103 MB to download that no one will ever need. We 
> would like to avoid that if possible as this would be a significant 
> regression in clone times from where we are today.
> 
> I'm also concerned about the performance of merging in promises given we 
> have 100M objects today and growing so the number of promises over time 
> could get pretty large.

After some thought, maybe a hybrid solution is best, in which it is
permissible but optional for some missing objects to have promises. In
that case, it is more of a "size cache" (which stores the type as well)
rather than a true promise. When fetching, the client can optionally
request the sizes and types of missing objects.

This is good for the large-blob case, in which we can always have size
information of missing blobs, and we can subsequently add blob-size
filtering (as a parameter) to "git log -S" and friends to avoid needing
to resolve a missing object. And this is, as far as I can tell, also
good for the many-blob case - just have an empty size cache all the
time. (And in the future, use cases could come up that desire non-empty
but non-comprehensive caches - for example, a directory lister working
on a partial clone that only needs to cache the sizes of frequently
accessed directories.)
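
To illustrate, a lookup in such a hybrid scheme could behave like this
(rough sketch - the function and parameter names are mine, not from any
patch):

  /*
   * Returns 1 and fills in *type and *size if the missing object has an
   * entry in the size cache; returns 0 if it has no entry. In the hybrid
   * design, an absent entry is not an error.
   */
  static int size_cache_lookup(const unsigned char *sha1,
                               enum object_type *type, unsigned long *size);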

Another option is to have a repo-wide option that toggles between
mandatory entries in the "size cache" and prohibited entries. Switching
to mandatory provides stricter fsck and negative lookups, but I think
it's not worth it for both the developers and users of Git to have to
know about these two modes.

> >> I think we should have a flag (off by default) that enables someone to
> >> say that promised objects are optional. If the flag is set,
> >> "is_promised_object" will return success and pass the OBJ_ANY type and a
> >> size of -1.
> >>
> >> Nothing today is using the size and in the two places where the object
> >> type is being checked for consistency (fsck_cache_tree and
> >> fsck_handle_ref) the test can add a test for OBJ_ANY as well.
> >>
> >> This will enable very large numbers of objects to be omitted from the
> >> clone without triggering a download of the corresponding number of
> >> promised objects.
> > 
> > Eventually I plan to use the size when implementing parameters for
> > history-searching commands (e.g. "git log -S"), but it's true that
> > that's in the future.
> > 
> > Allowing promised objects to be optional would indeed solve the issue of
> > downloading too many promises. It would make the code more complicated,
> > but I'm not sure by how much.
> > 
> > For example, in this fsck patch, the easiest way I could think of to
> > have promised objects was to introduce a 3rd state, called "promised",
> > of "struct object" - one in which the type is known, but we don't have
> > access to the full "struct commit" or equivalent. And thus fsck could
> > assume that if the "struct object" is "parsed" or "promised", the type
> > is known. Having optional promised objects would require that we let
> > this "promised" state have a type of OBJ_UNKNOWN (or something like
> > that) - maybe that would be fine, but I haven't looked into this in
> > detail.
> > 
> 
> Caveats apply as I only did a quick look but I only found the two 
> locations that were checking the object type for consistency.

I haven't looked into it in detail, but you are probably right.


Re: [RFC PATCH v2 4/4] sha1_file: support promised object hook

2017-07-20 Thread Jonathan Tan
On Thu, 20 Jul 2017 16:58:16 -0400
Ben Peart  wrote:

> >> This is meant as a temporary measure to ensure that all Git commands
> >> work in such a situation. Future patches will update some commands to
> >> either tolerate promised objects (without invoking the hook) or be more
> >> efficient in invoking the promised objects hook.
> 
> I agree that making git more tolerant of promised objects if possible 
> and precomputing a list of promised objects required to complete a 
> particular command and downloading them with a single request are good 
> optimizations to add over time.

That's good to know!

> has_sha1_file also takes a hash "whether local or in an alternate object 
> database, and whether packed or loose" but never calls 
> sha1_object_info_extended.  As a result, we had to add support in 
> check_and_freshen to download missing objects to get proper behavior in 
> all cases.  I don't think this will work correctly without it.

Thanks for the attention to detail. Is this before or after commit
e83e71c ("sha1_file: refactor has_sha1_file_with_flags", 2017-06-26)?


Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects

2017-07-20 Thread Jonathan Tan
On Thu, 20 Jul 2017 15:58:51 -0400
Ben Peart <peart...@gmail.com> wrote:

> On 7/19/2017 8:21 PM, Jonathan Tan wrote:
> > Currently, Git does not support repos with very large numbers of objects
> > or repos that wish to minimize manipulation of certain blobs (for
> > example, because they are very large) very well, even if the user
> > operates mostly on part of the repo, because Git is designed on the
> > assumption that every referenced object is available somewhere in the
> > repo storage.
> > 
> 
> Great to see this idea making progress. Making git able to gracefully 
> handle partial clones (beyond the existing shallow clone support) is a 
> key piece of dealing with very large objects and repos.

Thanks.

> > As a first step to reducing this problem, introduce the concept of
> > promised objects. Each Git repo can contain a list of promised objects
> > and their sizes (if blobs) at $GIT_DIR/objects/promised. This patch
> > contains functions to query them; functions for creating and modifying
> > that file will be introduced in later patches.
> 
> If I'm reading this patch correctly, for a repo to successfully pass 
> "git fsck" either the object or a promise must exist for everything fsck 
> checks.  From the documentation for fsck it says "git fsck defaults to 
> using the index file, all SHA-1 references in refs namespace, and all 
> reflogs (unless --no-reflogs is given) as heads." Doesn't this then 
> imply objects or promises must exist for all objects referenced by any 
> of these locations?
> 
> We're currently in the hundreds of millions of objects on some of our 
> repos so even downloading the promises for all the objects in the index 
> is unreasonable as it is gigabytes of data and growing.

For the index to contain all the files, the repo must already have
downloaded all the trees for HEAD (at least). The trees collectively
contain entries for all the relevant blobs. We need one promise for each
blob, and the size of a promise is comparable to the size of a tree
entry, so the size (of download and storage) needed would be just double
of what we would need if we didn't need promises. This is still only
linear growth, unless you have found that the absolute numbers are too
large?

Also, if the index is ever changed to not have one entry for every file,
we also wouldn't need one promise for every file.

> I think we should have a flag (off by default) that enables someone to 
> say that promised objects are optional. If the flag is set, 
> "is_promised_object" will return success and pass the OBJ_ANY type and a 
> size of -1.
> 
> Nothing today is using the size and in the two places where the object 
> type is being checked for consistency (fsck_cache_tree and 
> fsck_handle_ref) the test can add a test for OBJ_ANY as well.
> 
> This will enable very large numbers of objects to be omitted from the 
> clone without triggering a download of the corresponding number of 
> promised objects.

Eventually I plan to use the size when implementing parameters for
history-searching commands (e.g. "git log -S"), but it's true that
that's in the future.

Allowing promised objects to be optional would indeed solve the issue of
downloading too many promises. It would make the code more complicated,
but I'm not sure by how much.

For example, in this fsck patch, the easiest way I could think of to
have promised objects was to introduce a 3rd state, called "promised",
of "struct object" - one in which the type is known, but we don't have
access to the full "struct commit" or equivalent. And thus fsck could
assume that if the "struct object" is "parsed" or "promised", the type
is known. Having optional promised objects would require that we let
this "promised" state have a type of OBJ_UNKNOWN (or something like
that) - maybe that would be fine, but I haven't looked into this in
detail.

> > A repository that is missing an object but has that object promised is not
> > considered to be in error, so also teach fsck this. As part of doing
> > this, object.{h,c} has been modified to generate "struct object" based
> > on only the information available to promised objects, without requiring
> > the object itself.
> 
> In your work on this, did you investigate if there are other commands 
> (ie repack/gc) that will need to learn about promised objects? Have you 
> had a chance (or have plans) to hack up the test suite so that it runs 
> all tests with promised objects and see what (if anything) breaks?

In one of the subsequent patches, I tried to ensure that all
object-reading functions in sha1_file.c somewhat work (albeit slowly)
in the presence of promised objects - that would cover the functionality
of the other commands. As for hacking up the test suite to run with
promised objects, that would be ideal, but I haven't figured out how to
do that yet.


Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects

2017-07-20 Thread Jonathan Tan
On Thu, 20 Jul 2017 11:07:29 -0700
Stefan Beller  wrote:

> > +   if (fsck_promised_objects()) {
> > +   error("Errors found in promised object list");
> > +   errors_found |= ERROR_PROMISED_OBJECT;
> > +   }
> 
> This got me thinking: It is an error if we do not have an object
> and also do not promise it, but what about the other case:
> having an object and promising it, too?
> That is not harmful to the operation, except that the promise
> machinery may be slower due to its size. (Should that be a soft
> warning then? Do we have warnings in fsck?)

Good question - having an object and also having it promised is not an
error condition (and I don't think it's a good idea to make it so,
because objects can appear quite easily from various sources). In the
future, I expect "git gc" to be extended to remove such redundant lines
from the "promised" list.

> >   * The object type is stored in 3 bits.
> >   */
> 
> We may want to remove this comment while we're here as it
> sounds stale despite being technically correct.
> 1974632c66 (Remove TYPE_* constant macros and use
> object_type enums consistently., 2006-07-11)

I agree that the comment is unnecessary, but in this commit I didn't
modify anything to do with the type, so I left it there.

> >  struct object {
> > +   /*
> > +    * Set if this object is parsed. If set, "type" is populated and
> > +    * this object can be casted to "struct commit" or an equivalent.
> > +    */
> > unsigned parsed : 1;
> > +   /*
> > +    * Set if this object is not in the repo but is promised. If set,
> > +    * "type" is populated, but this object cannot be casted to
> > +    * "struct commit" or an equivalent.
> > +    */
> > +   unsigned promised : 1;
> 
> Would it make sense to have a bit field instead:
> 
> #define STATE_BITS 2
> #define STATE_PARSED (1<<0)
> #define STATE_PROMISED (1<<1)
> 
> unsigned state : STATE_BITS
> 
> This would be similar to the types and flags?

Both type and flag have to be bit fields (for different reasons), but
since we don't need such a combined field for "parsed" and "promised", I
prefer separating them each into their own field.

> > +test_expect_success 'fsck fails on missing objects' '
> > +   test_create_repo repo &&
> > +
> > +   test_commit -C repo 1 &&
> > +   test_commit -C repo 2 &&
> > +   test_commit -C repo 3 &&
> > +   git -C repo tag -a annotated_tag -m "annotated tag" &&
> > +   C=$(git -C repo rev-parse 1) &&
> > +   T=$(git -C repo rev-parse 2^{tree}) &&
> > +   B=$(git hash-object repo/3.t) &&
> > +   AT=$(git -C repo rev-parse annotated_tag) &&
> > +
> > +   # missing commit, tree, blob, and tag
> > +   rm repo/.git/objects/$(echo $C | cut -c1-2)/$(echo $C | cut -c3-40) &&
> > +   rm repo/.git/objects/$(echo $T | cut -c1-2)/$(echo $T | cut -c3-40) &&
> > +   rm repo/.git/objects/$(echo $B | cut -c1-2)/$(echo $B | cut -c3-40) &&
> > +   rm repo/.git/objects/$(echo $AT | cut -c1-2)/$(echo $AT | cut -c3-40) &&
> 
> This is a pretty cool test as it promises all sorts of objects
> from different parts of the graph.

Thanks.


Re: [RFC PATCH v2 1/4] object: remove "used" field from struct object

2017-07-19 Thread Jonathan Tan
On Wed, 19 Jul 2017 17:36:39 -0700
Stefan Beller <sbel...@google.com> wrote:

> On Wed, Jul 19, 2017 at 5:21 PM, Jonathan Tan <jonathanta...@google.com> 
> wrote:
> > The "used" field in struct object is only used by builtin/fsck. Remove
> > that field and modify builtin/fsck to use a flag instead.
> 
> The patch looks good to me (I would even claim this could
> go in as an independent cleanup, not tied to the RFCish nature
> of the later patches), though I have a question:
> How did you select 0x0008 for USED, i.e. does it
> collide with other flags (theoretically?), and if so
> how do we make sure to avoid the collusion in
> the future?

Thanks. 0x0008 was the next one in the series (as you can see in the
context). As for whether it collides with other flags, that is what the
chart in object.h is for (which I have added to in this patch), I
presume. As far as I can tell, each component must make sure not to
overlap with any other component running concurrently.


[RFC PATCH v2 3/4] sha1-array: support appending unsigned char hash

2017-07-19 Thread Jonathan Tan
In a subsequent patch, sha1_file will need to append object names in the
form of "unsigned char *" to oid arrays. Teach sha1-array support for
that.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 sha1-array.c | 7 +++
 sha1-array.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/sha1-array.c b/sha1-array.c
index 838b3bf84..6e0e35391 100644
--- a/sha1-array.c
+++ b/sha1-array.c
@@ -9,6 +9,13 @@ void oid_array_append(struct oid_array *array, const struct object_id *oid)
array->sorted = 0;
 }
 
+void oid_array_append_sha1(struct oid_array *array, const unsigned char *sha1)
+{
+   ALLOC_GROW(array->oid, array->nr + 1, array->alloc);
+   hashcpy(array->oid[array->nr++].hash, sha1);
+   array->sorted = 0;
+}
+
 static int void_hashcmp(const void *a, const void *b)
 {
return oidcmp(a, b);
diff --git a/sha1-array.h b/sha1-array.h
index 04b075633..3479959e4 100644
--- a/sha1-array.h
+++ b/sha1-array.h
@@ -11,6 +11,7 @@ struct oid_array {
 #define OID_ARRAY_INIT { NULL, 0, 0, 0 }
 
 void oid_array_append(struct oid_array *array, const struct object_id *oid);
+void oid_array_append_sha1(struct oid_array *array, const unsigned char *sha1);
 int oid_array_lookup(struct oid_array *array, const struct object_id *oid);
 void oid_array_clear(struct oid_array *array);
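
For illustration, a caller that still holds object names as "unsigned
char *" could use the new function like this (sketch, not part of the
patch):

  struct oid_array arr = OID_ARRAY_INIT;
  oid_array_append_sha1(&arr, sha1); /* one hashcpy into the array slot */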
 
-- 
2.14.0.rc0.284.gd933b75aa4-goog



[RFC PATCH v2 4/4] sha1_file: support promised object hook

2017-07-19 Thread Jonathan Tan
Teach sha1_file to invoke a hook whenever an object is requested and
unavailable but is promised. The hook is a shell command that can be
configured through "git config"; this hook takes in a list of hashes and
writes (if successful) the corresponding objects to the repo's local
storage.
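
To make the mechanism concrete, the invocation boils down to something
like the following standalone sketch (this is not the patch's actual
code - it is only meant to show the hook's input contract):

  #include <stdio.h>

  /*
   * Feed newline-terminated hashes to the configured shell command on
   * its standard input; a nonzero exit code means the hook failed to
   * write the corresponding objects to local storage.
   */
  static int run_promised_object_hook(const char *cmd,
                                      const char **hex, int nr)
  {
          FILE *f = popen(cmd, "w");
          int i;
          if (!f)
                  return -1;
          for (i = 0; i < nr; i++)
                  fprintf(f, "%s\n", hex[i]);
          return pclose(f);
  }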

The usage of the hook can be suppressed through a flag when invoking
has_object_file_with_flags() and other similar functions.
parse_or_promise_object() in object.c requires this functionality, and
has been modified to use it.

This is meant as a temporary measure to ensure that all Git commands
work in such a situation. Future patches will update some commands to
either tolerate promised objects (without invoking the hook) or be more
efficient in invoking the promised objects hook.

In order to determine the code changes in sha1_file.c necessary, I
investigated the following:
 (1) functions in sha1_file that take in a hash, without the user
 regarding how the object is stored (loose or packed)
 (2) functions in sha1_file that operate on packed objects (because I
 need to check callers that know about the loose/packed distinction
 and operate on both differently, and ensure that they can handle
 the concept of objects that are neither loose nor packed)

(1) is handled by the modification to sha1_object_info_extended().

For (2), I looked at for_each_packed_object and at the packed-related
functions that take in a hash. For for_each_packed_object, the callers
either already work or are fixed in this patch:
 - reachable - only to find recent objects
 - builtin/fsck - already knows about promised objects
 - builtin/cat-file - fixed in this commit

Callers of the other functions do not need to be changed:
 - parse_pack_index
   - http - indirectly from http_get_info_packs
 - find_pack_entry_one
   - this searches a single pack that is provided as an argument; the
 caller already knows (through other means) that the sought object
 is in a specific pack
 - find_sha1_pack
   - fast-import - appears to be an optimization to not store a
 file if it is already in a pack
   - http-walker - to search through a struct alt_base
   - http-push - to search through remote packs
 - has_sha1_pack
   - builtin/fsck - already knows about promised objects
   - builtin/count-objects - informational purposes only (check if loose
 object is also packed)
   - builtin/prune-packed - check if object to be pruned is packed (if
 not, don't prune it)
   - revision - used to exclude packed objects if requested by user
   - diff - just for optimization

An alternative design that I considered but rejected:

 - Adding a hook whenever a packed object is requested, not on any
   object.  That is, whenever we attempt to search the packfiles for an
   object, if it is missing (from the packfiles and from the loose
   object storage), to invoke the hook (which must then store it as a
   packfile), open the packfile the hook generated, and report that the
   object is found in that new packfile. This reduces the amount of
   analysis needed (in that we only need to look at how packed objects
   are handled), but requires that the hook generate packfiles (or for
   sha1_file to pack whatever loose objects are generated), creating one
   packfile for each missing object and potentially very many packfiles
   that must be linearly searched. This may be tolerable now for repos
   that only have a few missing objects (for example, repos that only
   want to exclude large blobs), and might be tolerable in the future if
   we have batching support for the most commonly used commands, but is
   not tolerable now for repos that exclude a large amount of objects.

Helped-by: Ben Peart <benpe...@microsoft.com>
Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 Documentation/config.txt                          |   8 +
 Documentation/gitrepository-layout.txt            |   8 +
 Documentation/technical/read-object-protocol.txt  | 102 
 builtin/cat-file.c                                |   9 ++
 cache.h                                           |   2 +
 object.c                                          |   3 +-
 promised-object.c                                 | 194 +++
 promised-object.h                                 |  12 ++
 sha1_file.c                                       |  44 +++--
 t/t3907-promised-object.sh                        |  32 
 t/t3907/read-object                               | 114 +
 11 files changed, 513 insertions(+), 15 deletions(-)
 create mode 100644 Documentation/technical/read-object-protocol.txt
 create mode 100755 t/t3907/read-object

diff --git a/Documentation/config.txt b/Documentation/config.txt
index d5c9c4cab..c293ac921 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -393,6 +393,14 @@ The default is false, except linkgit:git-clone[1] or linkgit:git-init[1]
 will probe and set core.ignoreCase true if appropriate when the repository
 is created.

[RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects

2017-07-19 Thread Jonathan Tan
Currently, Git does not support repos with very large numbers of objects
or repos that wish to minimize manipulation of certain blobs (for
example, because they are very large) very well, even if the user
operates mostly on part of the repo, because Git is designed on the
assumption that every referenced object is available somewhere in the
repo storage.

As a first step to reducing this problem, introduce the concept of
promised objects. Each Git repo can contain a list of promised objects
and their sizes (if blobs) at $GIT_DIR/objects/promised. This patch
contains functions to query them; functions for creating and modifying
that file will be introduced in later patches.

A repository that is missing an object but has that object promised is not
considered to be in error, so also teach fsck this. As part of doing
this, object.{h,c} has been modified to generate "struct object" based
on only the information available to promised objects, without requiring
the object itself.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 Documentation/technical/repository-version.txt |   6 ++
 Makefile                                       |   1 +
 builtin/fsck.c                                 |  18 +++-
 cache.h                                        |   2 +
 environment.c                                  |   1 +
 fsck.c                                         |   6 +-
 object.c                                       |  19 
 object.h                                       |  19 
 promised-object.c                              | 130 +
 promised-object.h                              |  22 +
 setup.c                                        |   7 +-
 t/t3907-promised-object.sh                     |  41 
 t/test-lib-functions.sh                        |   6 ++
 13 files changed, 273 insertions(+), 5 deletions(-)
 create mode 100644 promised-object.c
 create mode 100644 promised-object.h
 create mode 100755 t/t3907-promised-object.sh

diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt
index 00ad37986..f8b82c1c7 100644
--- a/Documentation/technical/repository-version.txt
+++ b/Documentation/technical/repository-version.txt
@@ -86,3 +86,9 @@ for testing format-1 compatibility.
 When the config key `extensions.preciousObjects` is set to `true`,
 objects in the repository MUST NOT be deleted (e.g., by `git-prune` or
 `git repack -d`).
+
+`promisedObjects`
+~~~~~~~~~~~~~~~~~
+
+(Explain this - basically a string containing a command to be run
+whenever a missing object needs to be fetched.)
diff --git a/Makefile b/Makefile
index 9c9c42f8f..c1446d5ef 100644
--- a/Makefile
+++ b/Makefile
@@ -828,6 +828,7 @@ LIB_OBJS += preload-index.o
 LIB_OBJS += pretty.o
 LIB_OBJS += prio-queue.o
 LIB_OBJS += progress.o
+LIB_OBJS += promised-object.o
 LIB_OBJS += prompt.o
 LIB_OBJS += quote.o
 LIB_OBJS += reachable.o
diff --git a/builtin/fsck.c b/builtin/fsck.c
index 462b8643b..49e21f361 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -15,6 +15,7 @@
 #include "progress.h"
 #include "streaming.h"
 #include "decorate.h"
+#include "promised-object.h"
 
 #define REACHABLE 0x0001
 #define SEEN  0x0002
@@ -44,6 +45,7 @@ static int name_objects;
 #define ERROR_REACHABLE 02
 #define ERROR_PACK 04
 #define ERROR_REFS 010
+#define ERROR_PROMISED_OBJECT 020
 
 static const char *describe_object(struct object *obj)
 {
@@ -436,7 +438,7 @@ static int fsck_handle_ref(const char *refname, const struct object_id *oid,
 {
struct object *obj;
 
-   obj = parse_object(oid);
+   obj = parse_or_promise_object(oid);
if (!obj) {
error("%s: invalid sha1 pointer %s", refname, oid_to_hex(oid));
errors_found |= ERROR_REACHABLE;
@@ -592,7 +594,7 @@ static int fsck_cache_tree(struct cache_tree *it)
fprintf(stderr, "Checking cache tree\n");
 
if (0 <= it->entry_count) {
-   struct object *obj = parse_object(&it->oid);
+   struct object *obj = parse_or_promise_object(&it->oid);
if (!obj) {
error("%s: invalid sha1 pointer in cache-tree",
  oid_to_hex(&it->oid));
@@ -635,6 +637,12 @@ static int mark_packed_for_connectivity(const struct object_id *oid,
return 0;
 }
 
+static int mark_have_promised_object(const struct object_id *oid, void *data)
+{
+   mark_object_for_connectivity(oid);
+   return 0;
+}
+
 static char const * const fsck_usage[] = {
N_("git fsck [] [...]"),
NULL
@@ -690,6 +698,11 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 
git_config(fsck_config, NULL);
 
+   if (fsck_promised_objects()) {
+   error("Errors found in promised object list");
+   errors_found |= ERROR_PROMISED_OBJECT;
+   }

[RFC PATCH v2 1/4] object: remove "used" field from struct object

2017-07-19 Thread Jonathan Tan
The "used" field in struct object is only used by builtin/fsck. Remove
that field and modify builtin/fsck to use a flag instead.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 builtin/fsck.c | 24 ++--
 object.c       |  1 -
 object.h       |  2 +-
 3 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 4ba311cda..462b8643b 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -19,6 +19,8 @@
 #define REACHABLE 0x0001
 #define SEEN  0x0002
 #define HAS_OBJ   0x0004
+/* This flag is set if something points to this object. */
+#define USED  0x0008
 
 static int show_root;
 static int show_tags;
@@ -195,7 +197,7 @@ static int mark_used(struct object *obj, int type, void *data, struct fsck_options *options)
 {
if (!obj)
return 1;
-   obj->used = 1;
+   obj->flags |= USED;
return 0;
 }
 
@@ -244,7 +246,7 @@ static void check_unreachable_object(struct object *obj)
}
 
/*
-* "!used" means that nothing at all points to it, including
+* "!USED" means that nothing at all points to it, including
 * other unreachable objects. In other words, it's the "tip"
 * of some set of unreachable objects, usually a commit that
 * got dropped.
@@ -255,7 +257,7 @@ static void check_unreachable_object(struct object *obj)
 * deleted a branch by mistake, this is a prime candidate to
 * start looking at, for example.
 */
-   if (!obj->used) {
+   if (!(obj->flags & USED)) {
if (show_dangling)
printf("dangling %s %s\n", printable_type(obj),
   describe_object(obj));
@@ -379,7 +381,8 @@ static int fsck_obj_buffer(const struct object_id *oid, enum object_type type,
errors_found |= ERROR_OBJECT;
return error("%s: object corrupt or missing", oid_to_hex(oid));
}
-   obj->flags = HAS_OBJ;
+   obj->flags &= ~(REACHABLE | SEEN);
+   obj->flags |= HAS_OBJ;
return fsck_obj(obj);
 }
 
@@ -397,7 +400,7 @@ static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid,
add_decoration(fsck_walk_options.object_names,
obj,
xstrfmt("%s@{%"PRItime"}", refname, timestamp));
-   obj->used = 1;
+   obj->flags |= USED;
mark_object_reachable(obj);
} else {
error("%s: invalid reflog entry %s", refname, oid_to_hex(oid));
@@ -445,7 +448,7 @@ static int fsck_handle_ref(const char *refname, const struct object_id *oid,
errors_found |= ERROR_REFS;
}
default_refs++;
-   obj->used = 1;
+   obj->flags |= USED;
if (name_objects)
add_decoration(fsck_walk_options.object_names,
obj, xstrdup(refname));
@@ -513,7 +516,8 @@ static int fsck_loose(const struct object_id *oid, const char *path, void *data)
return 0; /* keep checking other objects */
}
 
-   obj->flags = HAS_OBJ;
+   obj->flags &= ~(REACHABLE | SEEN);
+   obj->flags |= HAS_OBJ;
if (fsck_obj(obj))
errors_found |= ERROR_OBJECT;
return 0;
@@ -595,7 +599,7 @@ static int fsck_cache_tree(struct cache_tree *it)
errors_found |= ERROR_REFS;
return 1;
}
-   obj->used = 1;
+   obj->flags |= USED;
if (name_objects)
add_decoration(fsck_walk_options.object_names,
obj, xstrdup(":"));
@@ -737,7 +741,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
continue;
}
 
-   obj->used = 1;
+   obj->flags |= USED;
if (name_objects)
add_decoration(fsck_walk_options.object_names,
obj, xstrdup(arg));
@@ -774,7 +778,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
if (!blob)
continue;
obj = &blob->object;
-   obj->used = 1;
+   obj->flags |= USED;
if (name_objects)
add_decoration(fsck_walk_options.object_names,
obj,
diff --git a/object.c b/object.c
index f81877741..321d7e920 100644
--- a/object.c
+++ b/object.c
@@ -141,7 +141,6 @@ void *create_object(const unsigned char *sha1, void *obj)

[RFC PATCH v2 0/4] Partial clone: promised objects (not only blobs)

2017-07-19 Thread Jonathan Tan
Thanks for all your comments on the earlier version. This is a
substantially different version. In particular:
 - Now supports all types (tag, commit, tree) of objects, not only blobs
 - fsck works
 - Incorporates Ben Peart's code that uses a long-living process
   (lifetime of the Git invocation) to obtain objects
 - Implemented as a repository extension

If anyone would like to comment on the overall direction of this
approach, that would be great. In particular, I'm not too sure on the
names of things. I know that I have some things to still work on
(documentation and coding style, more improvements to and tests for
fsck, maybe division of commits?) but I would like some early feedback
on the bigger picture first.

If you want to patch this in, this is built off my recent cleanup patch
[1].

About inability to scale if we have the list of promised blobs: In this
design, we will have a promised blob entry only if we have a concrete
tree, so in a repository in which many trees are omitted, there will not
be many promised blob entries. In fact, the minimal partial clone
will be one in which there will be one promised commit for each ref
(assuming no duplicates), no promised trees, and no promised blobs.
(I'm not planning to implement such a clone, but someone else could do
so.)

About having multiple promise lists: I have retained the single list,
but am still open to change. The support of all the object types might
be sufficient mitigation for the issues that caused us to investigate
multiple promise lists in the first place.

[1] https://public-inbox.org/git/20170718222848.1453-1-jonathanta...@google.com/

Jonathan Tan (4):
  object: remove "used" field from struct object
  promised-object, fsck: introduce promised objects
  sha1-array: support appending unsigned char hash
  sha1_file: support promised object hook

 Documentation/config.txt                         |   8 +
 Documentation/gitrepository-layout.txt           |   8 +
 Documentation/technical/read-object-protocol.txt | 102 +++
 Documentation/technical/repository-version.txt   |   6 +
 Makefile                                         |   1 +
 builtin/cat-file.c                               |   9 +
 builtin/fsck.c                                   |  42 ++-
 cache.h                                          |   4 +
 environment.c                                    |   1 +
 fsck.c                                           |   6 +-
 object.c                                         |  21 +-
 object.h                                         |  21 +-
 promised-object.c                                | 324 +++
 promised-object.h                                |  34 +++
 setup.c                                          |   7 +-
 sha1-array.c                                     |   7 +
 sha1-array.h                                     |   1 +
 sha1_file.c                                      |  44 ++-
 t/t3907-promised-object.sh                       |  73 +
 t/t3907/read-object                              | 114 
 t/test-lib-functions.sh                          |   6 +
 21 files changed, 808 insertions(+), 31 deletions(-)
 create mode 100644 Documentation/technical/read-object-protocol.txt
 create mode 100644 promised-object.c
 create mode 100644 promised-object.h
 create mode 100755 t/t3907-promised-object.sh
 create mode 100755 t/t3907/read-object

-- 
2.14.0.rc0.284.gd933b75aa4-goog



Re: [RFC PATCH 2/3] sha1-array: support appending unsigned char hash

2017-07-19 Thread Jonathan Tan
On Tue, 11 Jul 2017 15:06:11 -0700
Stefan Beller <sbel...@google.com> wrote:

> On Tue, Jul 11, 2017 at 12:48 PM, Jonathan Tan <jonathanta...@google.com> 
> wrote:
> > In a subsequent patch, sha1_file will need to append object names in the
> > form of "unsigned char *" to oid arrays. Teach sha1-array support for
> > that.
> >
> > Signed-off-by: Jonathan Tan <jonathanta...@google.com>
> 
> This breaks the oid/sha1 barrier?

Not sure what you mean by this. This patch is meant to be a change to
make "unsigned char *"-using code able to create/modify oid arrays,
while we migrate away from "unsigned char *" - I don't know of any
barrier.

> I would have expected the caller to do a
> 
>   oid_array_append_oid(&array, sha1_to_oid(sha1));
> 
> with sha1_to_oid working off some static memory, such that the
> call to oid_array_append_oid (which we have as oid_array_append)
> is just as cheap?

The solution you present would need 2 copies (one in sha1_to_oid, and
another in oid_array_append), but the solution in this patch only
requires one.
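
To spell out the copy counts (sketch; sha1_to_oid is the hypothetical
helper from your suggestion):

  /* two copies: sha1 -> static struct object_id, then -> array slot */
  oid_array_append(&array, sha1_to_oid(sha1));

  /* one copy: hashcpy() straight from sha1 into the array slot */
  oid_array_append_sha1(&array, sha1);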

> Could you say more in the commit message on why we collude
> sha1 and oids here?

I also don't understand this, but the answer to this is probably the
same as the answer to your first question.


Re: [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs

2017-07-19 Thread Jonathan Tan
On Tue, 11 Jul 2017 15:02:09 -0700
Stefan Beller  wrote:

> Here I wondered what this file looks like, in a later patch you
> add documentation:
> 
>   +objects/promisedblob::
>   +   This file records the sha1 object names and sizes of promised
>   +   blobs.
>   +
> 
> but that leaves more fundamental questions:
> * Is that a list of sha1s, separated by LF? (CRLF? How would Windows
>   users interact with it, are they supposed to ever modify this file?)
> * Similar to .git/packed-refs, would we want to have a first line
>   that has some sort of comment?
> * In the first question I assumed a linear list, will that be a sorted list,
>   or do we want to have some fancy data structure here?
>   (critbit tree, bloom filter)
> * is the contents in ASCII or binary? (What are the separators?)
> * In the future I presume we'd want to quickly answer "Is X in the
>   promised blobs list?" so would it be possible (in the future) to
>   improve the searching e.g. binary search?
> * ... I'll read on to see my questions answered, but I'd guess
>   others would prefer to see it in the docs. :)

I'll write more documentation once the file format is finalized. At the
time this patch was written, this was a sorted binary file, where each
entry consists of a 20-byte SHA-1 and an 8-byte size. Now each entry is
a 20-byte SHA-1, a 1-byte type, and an 8-byte size (I will send this
patch out soon).
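
In C terms, an entry in the current iteration corresponds to something
like this (illustrative only - the file stores the fields packed and
sorted, it does not literally contain this struct):

  struct promised_object_entry {
          unsigned char sha1[20]; /* object name */
          unsigned char type;     /* OBJ_COMMIT, OBJ_TREE, OBJ_BLOB, OBJ_TAG */
          uint64_t size;          /* object size; meaningful for blobs */
  };                              /* 29 bytes per entry */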

> Similar to other files, would we want to prefix the file with
> a 4 letter magic number and a version number?

That's a good idea.

> Later down the road, do we want to have a
> (plumbing) command to move an object from
> standard blob to promised blob (as of now I'd think
> it would perform this rm call as well as an insert into
> the promised blobs file) ?
> (Well, we'd also have to think about how to get objects
> out of a pack)
> 
> With such a command you can easily write your own custom
> filter to free up blobs again.

This sounds reasonable, but probably not in the initial set of patches.

> > +   test_must_fail git -C repo fsck
> 
> test_i18n_grep missing out ?
> 
> maybe, too? (Maybe that is already tested elsewhere,
> so no need for it)

This test is just meant to show that configuration can make fsck pass,
not the existing behavior of fsck when it fails (which is tested
elsewhere - I guess that was your question).


[PATCH] sha1_file: use access(), not lstat(), if possible

2017-07-19 Thread Jonathan Tan
In sha1_loose_object_info(), use access() (indirectly invoked through
has_loose_object()) instead of lstat() if we do not need the on-disk
size, as it should be faster on Windows [1].

[1] 
https://public-inbox.org/git/alpine.DEB.2.21.1.1707191450570.4193@virtualbox/

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
Thanks for the information - here's a patch. Do you, by any chance, know
of a web page (or similar thing) that I can cite for this?
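
To sketch the difference this patch relies on (a standalone
illustration, not the patch itself):

  #include <unistd.h>
  #include <sys/stat.h>

  /*
   * access(F_OK) only answers "does it exist?", while lstat() must also
   * fill in a struct stat, which is reportedly slower on Windows.
   */
  static int loose_object_exists(const char *path, off_t *disk_size)
  {
          struct stat st;
          if (!disk_size)
                  return access(path, F_OK) == 0; /* cheap existence check */
          if (lstat(path, &st) < 0)
                  return 0;
          *disk_size = st.st_size;
          return 1;
  }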
---
 sha1_file.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index fca165f13..81962b019 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2920,20 +2920,19 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 
/*
 * If we don't care about type or size, then we don't
-* need to look inside the object at all. Note that we
-* do not optimize out the stat call, even if the
-* caller doesn't care about the disk-size, since our
-* return value implicitly indicates whether the
-* object even exists.
+* need to look inside the object at all. We only check
+* for its existence.
 */
if (!oi->typep && !oi->typename && !oi->sizep && !oi->contentp) {
-   const char *path;
-   struct stat st;
-   if (stat_sha1_file(sha1, &st, &path) < 0)
-   return -1;
-   if (oi->disk_sizep)
+   if (oi->disk_sizep) {
+   const char *path;
+   struct stat st;
+   if (stat_sha1_file(sha1, &st, &path) < 0)
+   return -1;
*oi->disk_sizep = st.st_size;
-   return 0;
+   return 0;
+   }
+   return has_loose_object(sha1) ? 0 : -1;
}
 
map = map_sha1_file(sha1, &mapsize);
-- 
2.14.0.rc0.284.gd933b75aa4-goog



[PATCH] fsck: remove redundant parse_tree() invocation

2017-07-18 Thread Jonathan Tan
If obj->type == OBJ_TREE, an invocation of fsck_walk() will invoke
parse_tree() and return quickly if that returns nonzero, so it is of no
use for traverse_one_object() to invoke parse_tree() in this situation
before invoking fsck_walk(). Remove that code.

The behavior of traverse_one_object() is changed slightly in that it now
returns -1 instead of 1 in the case that parse_tree() fails, but this is
not an issue because its only caller (traverse_reachable) does not care
about the value as long as it is nonzero.

This code was introduced in commit 271b8d2 ("builtin-fsck: move away
from object-refs to fsck_walk", 2008-02-25). The same issue existed in
that commit.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
Here's a code cleanup. I noticed this while looking at modifying fsck.
---
 builtin/fsck.c | 13 +
 1 file changed, 1 insertion(+), 12 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 99dea7adf..4ba311cda 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -168,18 +168,7 @@ static void mark_object_reachable(struct object *obj)
 
 static int traverse_one_object(struct object *obj)
 {
-   int result;
-   struct tree *tree = NULL;
-
-   if (obj->type == OBJ_TREE) {
-   tree = (struct tree *)obj;
-   if (parse_tree(tree) < 0)
-   return 1; /* error already displayed */
-   }
-   result = fsck_walk(obj, obj, &fsck_walk_options);
-   if (tree)
-   free_tree_buffer(tree);
-   return result;
+   return fsck_walk(obj, obj, &fsck_walk_options);
 }
 
 static int traverse_reachable(void)
-- 
2.13.2.932.g7449e964c-goog



Re: [PATCH v5 8/8] sha1_file: refactor has_sha1_file_with_flags

2017-07-18 Thread Jonathan Tan
On Tue, 18 Jul 2017 12:30:46 +0200
Christian Couder <christian.cou...@gmail.com> wrote:

> On Thu, Jun 22, 2017 at 2:40 AM, Jonathan Tan <jonathanta...@google.com> 
> wrote:
> 
> > diff --git a/sha1_file.c b/sha1_file.c
> > index bf6b64ec8..778f01d92 100644
> > --- a/sha1_file.c
> > +++ b/sha1_file.c
> > @@ -3494,18 +3494,10 @@ int has_sha1_pack(const unsigned char *sha1)
> >
> >  int has_sha1_file_with_flags(const unsigned char *sha1, int flags)
> >  {
> > -   struct pack_entry e;
> > -
> > if (!startup_info->have_repository)
> > return 0;
> > -   if (find_pack_entry(sha1, &e))
> > -   return 1;
> > -   if (has_loose_object(sha1))
> > -   return 1;
> > -   if (flags & HAS_SHA1_QUICK)
> > -   return 0;
> > -   reprepare_packed_git();
> > -   return find_pack_entry(sha1, &e);
> > +   return sha1_object_info_extended(sha1, NULL,
> > +   flags | OBJECT_INFO_SKIP_CACHED) >= 0;
> >  }
> 
> I am not sure if it could affect performance (in one way or another) a
> lot or not but I just wanted to note that has_loose_object() calls
> check_and_freshen() which calls access() on loose object files, while
> sha1_object_info_extended() calls sha1_loose_object_info() which calls
> stat_sha1_file() which calls lstat() on loose object files.
> 
> So depending on the relative performance of access() and lstat() there
> could be a performance impact on repos that have a lot of loose object
> files.

That is true, but from what little I have read online, they have about
the same performance.


Re: [PATCH v2 1/1] sha1_file: Add support for downloading blobs on demand

2017-07-17 Thread Jonathan Tan
On Mon, 17 Jul 2017 16:09:17 -0400
Ben Peart  wrote:

> > Is this change meant to ensure that Git code that operates on loose
> > objects directly (bypassing storage-agnostic functions such as
> > sha1_object_info_extended() and has_sha1_file()) still work? If yes,
> > this patch appears incomplete (for example, read_loose_object() needs to
> > be changed too), and this seems like a difficult task - in my patch set
> > [1], I ended up deciding to create a separate type of storage and
> > instead looked at the code that operates on *packed* objects directly
> > (because there were fewer such methods) to ensure that they would work
> > correctly in the presence of a separate type of storage.
> > 
> 
> Yes, with this set of patches, we've been running successfully on 
> completely sparse clones (no commits, trees, or blobs) for several 
> months.  read_loose_object() is only called by fsck when it is 
> enumerating existing loose objects so does not need to be updated.

Ah, that's good to know. I think such an analysis (of the other
loose-related functions) in the commit message would be useful, like I
did for the packed-related functions [1].

[1] 
https://public-inbox.org/git/34efd9e9936fdab331655f5a33a098a72dc134f4.1499800530.git.jonathanta...@google.com/

> We have a few thousand developers making ~100K commits per week so in 
> our particular usage, I'm fairly confident it works correctly.  That 
> said, it is possible there is some code path I've missed. :)

I think that code paths like the history-searching ones ("git log -S",
for example) should still work offline if possible - one of my ideas is
to have these commands take a size threshold parameter so that we do not
need to fetch large blobs during the invocation. (Hence my preference
for size information to be already available to the repo.)

Aside from that, does fsck of a partial repo work? Ability to fsck seems
quite important (or, at least, useful). I tried updating fsck to support
omitting commits and trees (in addition to blobs), and it seems
relatively involved (although I didn't look into it deeply yet).

(Also, that investigation also made me realize that, in my patch set, I
didn't handle the case where a tag references a blob - fsck doesn't work
when the blob is missing, even if it is promised. That's something for
me to look into.)


Re: [PATCH v2 1/1] sha1_file: Add support for downloading blobs on demand

2017-07-17 Thread Jonathan Tan
About the difference between this patch and my patch set [1], besides
the fact that this patch does not spawn separate processes for each
missing object, which does seem like an improvement to me, this patch
(i) does not use a list of promised objects (but instead communicates
with the hook for each missing object), and (ii) provides backwards
compatibility with other Git code (that does not know about promised
objects) in a different way.

The costs and benefits of (i) are being discussed here [2]. As for (ii),
I still think that my approach is better - I have commented more about
this below.

Maybe the best approach is a combination of both our approaches.

[1] 
https://public-inbox.org/git/34efd9e9936fdab331655f5a33a098a72dc134f4.1499800530.git.jonathanta...@google.com/

[2] 
https://public-inbox.org/git/20170713123951.5cab1...@twelve2.svl.corp.google.com/

On Fri, 14 Jul 2017 09:26:51 -0400
Ben Peart  wrote:

> +
> +packet: git> command=get
> +packet: git> sha1=0a214a649e1b3d5011e14a3dc227753f2bd2be05
> +packet: git> 
> +

It would be useful to have this command support more than one SHA-1, so
that hooks that know how to batch can do so.
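
For example, a batched request could then look like this (a hypothetical
extension of the trace above; the second hash is a placeholder):

  packet: git> command=get
  packet: git> sha1=0a214a649e1b3d5011e14a3dc227753f2bd2be05
  packet: git> sha1=<another hash>
  packet: git> <flush>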

> +static int subprocess_map_initialized;
> +static struct hashmap subprocess_map;

The documentation of "tablesize" in "struct hashmap" states that it can
be used to check if the hashmap is initialized, so
subprocess_map_initialized is probably unnecessary.
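
That is, something like this should suffice (sketch; cmp stands in for
the existing comparison function passed to the hashmap):

  if (!subprocess_map.tablesize)
          hashmap_init(&subprocess_map, cmp, 0);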

>  static int check_and_freshen(const unsigned char *sha1, int freshen)
>  {
> - return check_and_freshen_local(sha1, freshen) ||
> -check_and_freshen_nonlocal(sha1, freshen);
> + int ret;
> + int already_retried = 0;
> +
> +retry:
> + ret = check_and_freshen_local(sha1, freshen) ||
> + check_and_freshen_nonlocal(sha1, freshen);
> + if (!ret && core_virtualize_objects && !already_retried) {
> + already_retried = 1;
> + if (!read_object_process(sha1))
> + goto retry;
> + }
> +
> + return ret;
>  }

Is this change meant to ensure that Git code that operates on loose
objects directly (bypassing storage-agnostic functions such as
sha1_object_info_extended() and has_sha1_file()) still work? If yes,
this patch appears incomplete (for example, read_loose_object() needs to
be changed too), and this seems like a difficult task - in my patch set
[1], I ended up deciding to create a separate type of storage and
instead looked at the code that operates on *packed* objects directly
(because there were fewer such methods) to ensure that they would work
correctly in the presence of a separate type of storage.

[1] 
https://public-inbox.org/git/34efd9e9936fdab331655f5a33a098a72dc134f4.1499800530.git.jonathanta...@google.com/


Re: [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs

2017-07-13 Thread Jonathan Tan
On Wed, 12 Jul 2017 13:29:11 -0400
Jeff Hostetler  wrote:

> My primary concern is scale and managing the list of objects over time.
> 
> My fear is that this list will be quite large.  If we only want to omit
> the very large blobs, then maybe not.  But if we want to expand that
> scope to also omit other objects (such as a clone synchronized with a
> sparse checkout), then that list will get large on large repos.  For
> example, on the Windows repo we have (conservatively) 100M+ blobs (and
> growing).  Assuming 28 bytes per, gives a 2.8GB list to be manipulated.
> 
> If I understand your proposal, newly-omitted blobs would need to be
> merged into the promised-blob list after each fetch.  The fetch itself
> may not have that many new entries, but inserting them into the existing
> list will be slow.  Also, mmap'ing and bsearch'ing will likely have
> issues.  And there's likely to be a very expensive step to remove
> entries from the list as new blobs are received (or locally created).
> 
> In such a "sparse clone", it would be nice to omit unneeded tree objects
> in addition to just blobs.   I say that because we are finding with GVFS
> on the Windows repo, that even with commits-and-trees-only filtering,
> the number of tree objects is overwhelming.

I know that discussion has shifted to the possibility of not having this
list at all, and not sending size information together with the fetch,
but going back to this...maybe omitting trees *is* the solution to both
the large local list and the large amount of size information needing to
be transferred.

So the large-blob (e.g. Android) and many-blob (e.g. Windows) cases
would look like this:

 * Large-blob repositories have no trees omitted and a few blobs
   omitted, and we have sizes for all of them.
 * Many-blob repositories have many trees omitted and either all
   blobs omitted (and we have size information for them, useful for FUSE
   or FUSE-like things, for example) or possibly no blobs omitted (for
   example, if shallow clones are going to be the norm, there won't be
   many blobs to begin with if trees are omitted).

This seems better than an intermediate solution for the many-blob
repository case in which we still keep all the trees but also try to
avoid sending and storing as much information about the blobs as
possible, because that doesn't seem to provide us with much savings
(because the trees as a whole are just as large, if not larger, than the
blob information).

> So I'm also concerned about
> limiting the list to just blobs.  If we need to have this list, it
> should be able to contain any object.  (Suggesting having an object type
> in the entry.)

This makes sense - I'll add it in.

> I also have to wonder about the need to have a complete list of omitted
> blobs up front.  It may be better to just relax the consistency checks
> and assume a missing blob is "intentionally missing" rather than
> indicating a corruption somewhere.  And then let the client do a later
> round-trip to either demand-load the object -or- demand-load the
> existence/size info if/when it really matters.
> 
> Maybe we should add a verb to your new fetch-blob endpoint to just get
> the size of one or more objects to help with this.

If we allow the omission of trees, I don't think the added complexity of
demand-loading sizes is worth it.

What do you think of doing this:
 * add a "type" field to the list of promised objects (formerly the list
   of promised blobs)
 * retain mandatory size for blobs
 * retain single file containing list of promised objects (I don't feel
   too strongly about this, but it has a slight simplicity and
   in-between-GC performance advantage)


Re: [PATCH 1/3] trailers: create struct trailer_opts

2017-07-12 Thread Jonathan Tan
On Wed, 12 Jul 2017 15:46:44 +0200
Paolo Bonzini  wrote:

> -static void print_all(FILE *outfile, struct list_head *head, int trim_empty)
> +static void print_all(FILE *outfile, struct list_head *head,
> +   struct trailer_opts *opts)

This can be "const struct trailer_opts *", I think. (Same for the other
functions in this patch.)
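
That is (sketch):

  static void print_all(FILE *outfile, struct list_head *head,
                        const struct trailer_opts *opts);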

> +struct trailer_opts {
> + int in_place;
> + int trim_empty;
> +};

I was going to suggest that you make these "unsigned in_place : 1" etc.,
but I think OPT_BOOL doesn't support those. (And OPT_BIT is probably not
worth it.)


Re: [PATCH 2/3] trailers: export action enums and corresponding lookup functions

2017-07-12 Thread Jonathan Tan
On Wed, 12 Jul 2017 15:46:45 +0200
Paolo Bonzini  wrote:

> -static struct conf_info default_conf_info;
> +static struct conf_info default_conf_info = {
> + .where = WHERE_END,
> + .if_exists = EXISTS_ADD_IF_DIFFERENT_NEIGHBOR,
> + .if_missing = MISSING_ADD,
> +};

I'm not sure if Git is ready to commit to using designated initializers,
so maybe avoid those for now. More information here [1].

[1] 
https://public-inbox.org/git/20170710070342.txmlwwq6gvjkw...@sigill.intra.peff.net/
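
One portable alternative is to fill in the defaults at runtime instead
(sketch):

  static struct conf_info default_conf_info;

  static void init_default_conf_info(void)
  {
          default_conf_info.where = WHERE_END;
          default_conf_info.if_exists = EXISTS_ADD_IF_DIFFERENT_NEIGHBOR;
          default_conf_info.if_missing = MISSING_ADD;
  }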


Re: [PATCH 3/3] interpret-trailers: add options for actions

2017-07-12 Thread Jonathan Tan
On Wed, 12 Jul 2017 15:46:46 +0200
Paolo Bonzini  wrote:

> +static int option_parse_where(const struct option *opt,
> +   const char *arg, int unset)
> +{
> + enum action_where *where = opt->value;
> +
> + if (unset)
> + return 0;
> +
> + return set_where(where, arg);
> +}

This means that we have the following:

$ cat message
Hello

a: a

$ ./git interpret-trailers --trailer a=b message
Hello

a: a
a: b

$ ./git interpret-trailers --where start --no-where --trailer a=b message
Hello

a: b
a: a

whereas I would expect the last 2 commands to produce the same output. Maybe
invoke set_where(where, NULL) when "unset" is true? And change set_where()
accordingly. Same for the other two option parsing functions.
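
Concretely, something like this (sketch; set_where() would need to
accept NULL and restore the default):

  static int option_parse_where(const struct option *opt,
                                const char *arg, int unset)
  {
          enum action_where *where = opt->value;

          if (unset)
                  return set_where(where, NULL); /* reset to the default */
          return set_where(where, arg);
  }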


Re: [RFC/PATCH v4 00/49] Add initial experimental external ODB support

2017-07-12 Thread Jonathan Tan
On Tue, 20 Jun 2017 09:54:34 +0200
Christian Couder  wrote:

> Git can store its objects only in the form of loose objects in
> separate files or packed objects in a pack file.
> 
> To be able to better handle some kind of objects, for example big
> blobs, it would be nice if Git could store its objects in other object
> databases (ODB).

Thanks for this, and sorry for the late reply. It's good to know that
others are thinking about "missing" objects in repos too.

>   - "have": the helper should respond with the sha1, size and type of
> all the objects the external ODB contains, one object per line.

This should work well if we are not caching this "have" information
locally (that is, if the object store can be accessed with low latency),
but I am not sure if this will work otherwise. I see that you have
proposed a local cache-using method later in the e-mail - my comments on
that are below.

>   - "get ": the helper should then read from the external ODB
> the content of the object corresponding to  and pass it to
> Git.

This makes sense - I have some patches [1] that implement this with the
"fault_in" mechanism described in your e-mail.

[1] https://public-inbox.org/git/cover.1499800530.git.jonathanta...@google.com/

> * Transfering information
> 
> To transfer information about the blobs stored in the external ODB, some
> special refs, called "odb ref", similar as replace refs, are used in
> the tests of this series, but in general nothing forces the helper to
> use that mechanism.
> 
> The external odb helper is responsible for using and creating the refs
> in refs/odbs/<odbname>/, if it wants to do that. It is free for
> example to just create one ref, as it is also free to create many
> refs. Git would just transmit the refs that have been created by this
> helper, if Git is asked to do so.
> 
> For now in the tests there is one odb ref per blob, as it is simple
> and as it is similar to what git-lfs does. Each ref name is
> refs/odbs// where  is the sha1 of the blob stored
> in the external odb named .
> 
> These odb refs point to a blob that is stored in the Git
> repository and contain information about the blob stored in the
> external odb. This information can be specific to the external odb.
> The repos can then share this information using commands like:
> 
> `git fetch origin "refs/odbs/<odbname>/*:refs/odbs/<odbname>/*"`
> 
> At the end of the current patch series, "git clone" is taught a
> "--initial-refspec" option, that asks it to first fetch some specified
> refs. This is used in the tests to fetch the odb refs first.
> 
> This way only one "git clone" command can setup a repo using the
> external ODB mechanism as long as the right helper is installed on the
> machine and as long as the following options are used:
> 
>   - "--initial-refspec " to fetch the odb refspec
>   - "-c odb..command=" to configure the helper

A method like this means that information about every object is
downloaded, regardless of which branches were actually cloned, and
regardless of what parameters (e.g. max blob size) were used to control
the objects that were actually cloned.

We could make, say, one "odb ref" per size and branch - for example,
"refs/odbs/master/0", "refs/odbs/master/1k", "refs/odbs/master/1m", etc.
- and have the client know which one to download. But this wouldn't
scale if we introduce different object filters in the clone and fetch
commands.

I think that it is best to have upload-pack send this information
together with the packfile, since it knows exactly what objects were
omitted, and therefore what information the client needs. As discussed
in a sibling e-mail, clone/fetch already needs to be modified to omit
objects anyway.


[RFC PATCH 3/3] sha1_file: add promised blob hook support

2017-07-11 Thread Jonathan Tan
Teach sha1_file to invoke a hook whenever a blob is requested and
unavailable but is promised. The hook is a shell command that can be
configured through "git config"; this hook takes in a list of hashes and
writes (if successful) the corresponding objects to the repo's local
storage.

This is meant as a temporary measure to ensure that all Git commands
work in such a situation. Future patches will update some commands to
either tolerate promised blobs (without invoking the hook) or be more
efficient in invoking the promised blob hook.

In order to determine the code changes in sha1_file.c necessary, I
investigated the following:
 (1) functions in sha1_file that take in a hash, without the user
 regarding how the object is stored (loose or packed)
 (2) functions in sha1_file that operate on packed objects (because I
 need to check callers that know about the loose/packed distinction
 and operate on both differently, and ensure that they can handle
 the concept of objects that are neither loose nor packed)

(1) is handled by the modification to sha1_object_info_extended().

For (2), I looked at for_each_packed_object and at the packed-related
functions that take in a hash. For for_each_packed_object, the callers
either already work or are fixed in this patch:
 - reachable - only to find recent objects
 - builtin/fsck - already knows about promised blobs
 - builtin/cat-file - fixed in this commit

Callers of the other functions do not need to be changed:
 - parse_pack_index
   - http - indirectly from http_get_info_packs
 - find_pack_entry_one
   - this searches a single pack that is provided as an argument; the
 caller already knows (through other means) that the sought object
 is in a specific pack
 - find_sha1_pack
   - fast-import - appears to be an optimization to not store a
 file if it is already in a pack
   - http-walker - to search through a struct alt_base
   - http-push - to search through remote packs
 - has_sha1_pack
   - builtin/fsck - already knows about promised blobs
   - builtin/count-objects - informational purposes only (check if loose
 object is also packed)
   - builtin/prune-packed - check if object to be pruned is packed (if
 not, don't prune it)
   - revision - used to exclude packed objects if requested by user
   - diff - just for optimization

An alternative design that I considered but rejected:

 - Adding a hook whenever a packed blob is requested, not on any blob.
   That is, whenever we attempt to search the packfiles for a blob, if
   it is missing (from the packfiles and from the loose object storage),
   to invoke the hook (which must then store it as a packfile), open the
   packfile the hook generated, and report that the blob is found in
   that new packfile. This reduces the amount of analysis needed (in
   that we only need to look at how packed blobs are handled), but
   requires that the hook generate packfiles (or for sha1_file to pack
   whatever loose objects are generated), creating one packfile for each
   missing blob and potentially very many packfiles that must be
   linearly searched. This may be tolerable now for repos that only have
   a few missing blobs (for example, repos that only want to exclude
   large blobs), and might be tolerable in the future if we have
   batching support for the most commonly used commands, but is not
   tolerable now for repos that exclude a large amount of blobs.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 Documentation/config.txt   |  8 
 Documentation/gitrepository-layout.txt |  8 
 builtin/cat-file.c |  9 
 promised-blob.c| 75 ++
 promised-blob.h| 13 ++
 sha1_file.c| 44 +---
 t/t3907-promised-blob.sh   | 36 
 7 files changed, 179 insertions(+), 14 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index d5c9c4cab..c293ac921 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -393,6 +393,14 @@ The default is false, except linkgit:git-clone[1] or 
linkgit:git-init[1]
 will probe and set core.ignoreCase true if appropriate when the repository
 is created.
 
+core.promisedBlobCommand::
+   If set, whenever a blob in the local repo is attempted to be read, but
+   is both missing and a promised blob, invoke this shell command to
+   generate or obtain that blob before reporting an error. This shell
+   command should take one or more hashes, each terminated by a newline,
+   as standard input, and (if successful) should write the corresponding
+   objects to the local repo (packed or loose).
+
 core.precomposeUnicode::
This option is only used by Mac OS implementation of Git.
When core.precomposeUnicode=true, Git reverts the unicode decomposition
diff --git a/Documentati

[RFC PATCH 2/3] sha1-array: support appending unsigned char hash

2017-07-11 Thread Jonathan Tan
In a subsequent patch, sha1_file will need to append object names in the
form of "unsigned char *" to oid arrays. Teach sha1-array support for
that.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 sha1-array.c | 7 +++
 sha1-array.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/sha1-array.c b/sha1-array.c
index 838b3bf84..6e0e35391 100644
--- a/sha1-array.c
+++ b/sha1-array.c
@@ -9,6 +9,13 @@ void oid_array_append(struct oid_array *array, const struct 
object_id *oid)
array->sorted = 0;
 }
 
+void oid_array_append_sha1(struct oid_array *array, const unsigned char *sha1)
+{
+   ALLOC_GROW(array->oid, array->nr + 1, array->alloc);
+   hashcpy(array->oid[array->nr++].hash, sha1);
+   array->sorted = 0;
+}
+
 static int void_hashcmp(const void *a, const void *b)
 {
return oidcmp(a, b);
diff --git a/sha1-array.h b/sha1-array.h
index 04b075633..3479959e4 100644
--- a/sha1-array.h
+++ b/sha1-array.h
@@ -11,6 +11,7 @@ struct oid_array {
 #define OID_ARRAY_INIT { NULL, 0, 0, 0 }
 
 void oid_array_append(struct oid_array *array, const struct object_id *oid);
+void oid_array_append_sha1(struct oid_array *array, const unsigned char *sha1);
 int oid_array_lookup(struct oid_array *array, const struct object_id *oid);
 void oid_array_clear(struct oid_array *array);
 
-- 
2.13.2.932.g7449e964c-goog
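
For context, a sketch of a hypothetical caller of the new function;
report_blobs() and its arguments are made up for illustration:

#include "cache.h"
#include "sha1-array.h"

static void report_blobs(const unsigned char *sha1s, int nr)
{
        struct oid_array blobs = OID_ARRAY_INIT;
        int i;

        /* append raw "unsigned char *" hashes without first converting
           each one to a struct object_id at the call site */
        for (i = 0; i < nr; i++)
                oid_array_append_sha1(&blobs, sha1s + i * GIT_SHA1_RAWSZ);

        for (i = 0; i < blobs.nr; i++)
                printf("%s\n", oid_to_hex(&blobs.oid[i]));
        oid_array_clear(&blobs);
}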



[RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs

2017-07-11 Thread Jonathan Tan
Currently, Git does not support repos with very large numbers of blobs
or repos that wish to minimize manipulation of certain blobs (for
example, because they are very large) very well, even if the user
operates mostly on part of the repo, because Git is designed on the
assumption that every blob referenced by a tree object is available
somewhere in the repo storage.

As a first step to reducing this problem, introduce the concept of
promised blobs. Each Git repo can contain a list of promised blobs and
their sizes at $GIT_DIR/objects/promisedblob. This patch contains
functions to query them; functions for creating and modifying that file
will be introduced in later patches.

A repository that is missing a blob but has that blob promised is not
considered to be in error, so also teach fsck this.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 Makefile |  1 +
 builtin/fsck.c   | 13 +++
 promised-blob.c  | 95 
 promised-blob.h  | 14 +++
 t/t3907-promised-blob.sh | 29 +++
 t/test-lib-functions.sh  |  6 +++
 6 files changed, 158 insertions(+)
 create mode 100644 promised-blob.c
 create mode 100644 promised-blob.h
 create mode 100755 t/t3907-promised-blob.sh

diff --git a/Makefile b/Makefile
index 9c9c42f8f..e96163269 100644
--- a/Makefile
+++ b/Makefile
@@ -828,6 +828,7 @@ LIB_OBJS += preload-index.o
 LIB_OBJS += pretty.o
 LIB_OBJS += prio-queue.o
 LIB_OBJS += progress.o
+LIB_OBJS += promised-blob.o
 LIB_OBJS += prompt.o
 LIB_OBJS += quote.o
 LIB_OBJS += reachable.o
diff --git a/builtin/fsck.c b/builtin/fsck.c
index 99dea7adf..7454be7f1 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -15,6 +15,7 @@
 #include "progress.h"
 #include "streaming.h"
 #include "decorate.h"
+#include "promised-blob.h"
 
 #define REACHABLE 0x0001
 #define SEEN  0x0002
@@ -223,6 +224,9 @@ static void check_reachable_object(struct object *obj)
if (!(obj->flags & HAS_OBJ)) {
if (has_sha1_pack(obj->oid.hash))
return; /* it is in pack - forget about it */
+   if (obj->type == OBJ_BLOB &&
+   is_promised_blob(&obj->oid, NULL))
+   return;
printf("missing %s %s\n", printable_type(obj),
describe_object(obj));
errors_found |= ERROR_REACHABLE;
@@ -642,6 +646,13 @@ static int mark_packed_for_connectivity(const struct 
object_id *oid,
return 0;
 }
 
+static int mark_promised_blob_for_connectivity(const struct object_id *oid,
+  void *data)
+{
+   mark_object_for_connectivity(oid);
+   return 0;
+}
+
 static char const * const fsck_usage[] = {
N_("git fsck [] [...]"),
NULL
@@ -701,6 +712,8 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
if (connectivity_only) {
for_each_loose_object(mark_loose_for_connectivity, NULL, 0);
for_each_packed_object(mark_packed_for_connectivity, NULL, 0);
+   for_each_promised_blob(mark_promised_blob_for_connectivity,
+  NULL);
} else {
fsck_object_dir(get_object_directory());
 
diff --git a/promised-blob.c b/promised-blob.c
new file mode 100644
index 0..493808ed2
--- /dev/null
+++ b/promised-blob.c
@@ -0,0 +1,95 @@
+#include "cache.h"
+#include "promised-blob.h"
+#include "sha1-lookup.h"
+#include "strbuf.h"
+
+#define ENTRY_SIZE (GIT_SHA1_RAWSZ + 8)
+/*
+ * A mmap-ed byte array of size (missing_blob_nr * ENTRY_SIZE). Each
+ * ENTRY_SIZE-sized entry consists of the SHA-1 of the promised blob and its
+ * 64-bit size in network byte order. The entries are sorted in ascending SHA-1
+ * order.
+ */
+static char *promised_blobs;
+static int64_t promised_blob_nr = -1;
+
+static void prepare_promised_blobs(void)
+{
+   char *filename;
+   int fd;
+   struct stat st;
+
+   if (promised_blob_nr >= 0)
+   return;
+
+   if (getenv("GIT_IGNORE_PROMISED_BLOBS")) {
+   promised_blob_nr = 0;
+   return;
+   }
+   
+   filename = xstrfmt("%s/promisedblob", get_object_directory());
+   fd = git_open(filename);
+   if (fd < 0) {
+   if (errno == ENOENT) {
+   promised_blob_nr = 0;
+   goto cleanup;
+   }
+   perror("prepare_promised_blobs");
+   die("Could not open %s", filename);
+   }
+   if (fstat(fd, &st)) {
+   perror("prepare_promised_blobs");
+   die("Could not stat %s", filename);
+   }
+   if (st.st_size == 0) {
+   promised_blob_nr = 0;
+ 
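
The message is truncated above. For illustration, the sorted
fixed-width entries lend themselves to binary search; here is a sketch
of a lookup built on sha1_pos() from sha1-lookup.h (which
promised-blob.c includes). The is_promised_blob() prototype is assumed
from the fsck caller and may differ from the real one:

static const unsigned char *promised_blob_access(size_t index, void *table)
{
        return (unsigned char *)table + index * ENTRY_SIZE;
}

static uint64_t decode_be64(const unsigned char *p)
{
        uint64_t v = 0;
        int i;
        for (i = 0; i < 8; i++)
                v = (v << 8) | p[i];   /* size is in network byte order */
        return v;
}

int is_promised_blob(const struct object_id *oid, uint64_t *size)
{
        int pos;

        prepare_promised_blobs();
        pos = sha1_pos(oid->hash, promised_blobs, promised_blob_nr,
                       promised_blob_access);
        if (pos < 0)
                return 0;
        if (size)
                *size = decode_be64((unsigned char *)promised_blobs +
                                    pos * ENTRY_SIZE + GIT_SHA1_RAWSZ);
        return 1;
}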

[RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs")

2017-07-11 Thread Jonathan Tan
These patches are part of a set of patches implementing partial clone,
as you can see here:

https://github.com/jonathantanmy/git/tree/partialclone

In that branch, clone with batch checkout works, as you can see in the
README. The code and tests are generally done, but some patches are
still missing documentation and commit messages.

These 3 patches implement the foundational concept - formerly known as
"missing blobs" in the "missing blob manifest", I decided to call them
"promised blobs". The repo knows their object names and sizes. It also
does not have the blobs themselves, but can be configured to know how to
fetch them.

An older version of these patches was sent as a single demonstration
patch in versions 1 to 3 of [1]. In there, Junio suggested that I have
only one file containing missing blob information. I have made that
suggested change in this version.

One thing remaining is to add a repository extension [2] so that older
versions of Git fail immediately instead of trying to read missing
blobs, but I thought I'd send these first in order to get some initial
feedback.

[1] https://public-inbox.org/git/cover.1497035376.git.jonathanta...@google.com/
[2] Documentation/technical/repository-version.txt

Jonathan Tan (3):
  promised-blob, fsck: introduce promised blobs
  sha1-array: support appending unsigned char hash
  sha1_file: add promised blob hook support

 Documentation/config.txt   |   8 ++
 Documentation/gitrepository-layout.txt |   8 ++
 Makefile   |   1 +
 builtin/cat-file.c |   9 ++
 builtin/fsck.c |  13 +++
 promised-blob.c| 170 +
 promised-blob.h|  27 ++
 sha1-array.c   |   7 ++
 sha1-array.h   |   1 +
 sha1_file.c|  44 ++---
 t/t3907-promised-blob.sh   |  65 +
 t/test-lib-functions.sh|   6 ++
 12 files changed, 345 insertions(+), 14 deletions(-)
 create mode 100644 promised-blob.c
 create mode 100644 promised-blob.h
 create mode 100755 t/t3907-promised-blob.sh

-- 
2.13.2.932.g7449e964c-goog



Re: speeding up git pull from a busy gerrit instance over a slow link?

2017-06-30 Thread Jonathan Tan
On Fri, 30 Jun 2017 14:28:15 +0200
Noel Grandin  wrote:

> -
> snippet of packet trace
> ---
> 
> 14:20:45.705091 pkt-line.c:80   packet:fetch< 
> c5b026801c729ab37e2af6a610f31ca2e28b51fe 
> refs/changes/99/29099/2
> 14:20:45.705093 pkt-line.c:80   packet:fetch< 
> 931e2c40aeb4cf4591ae9fcfea1b352b966f0a32 
> refs/changes/99/29199/1

Out of curiosity, what is the timestamp difference between the first and
last GIT_TRACE_PACKET log message containing "refs/changes"?

This is an issue for the Android repository too (which also uses
Gerrit). I have some work in progress to avoid the extra refs from being
sent [1], but haven't been working on it recently.

[1] https://public-inbox.org/git/cover.1491851452.git.jonathanta...@google.com/


RFC: Missing blob hook might be invoked infinitely recursively

2017-06-29 Thread Jonathan Tan
As some of you may know, I'm currently working on support for partial
clones/fetches in Git (where blobs above a user-specified size threshold
are not downloaded - only their names and sizes are downloaded). To do
this, the client repository needs to be able to download blobs at will
whenever it needs a missing one (for example, upon checkout).

So I have done this by adding support for a hook in Git [1], and
updating the object-reading code in Git to, by default, automatically
invoke this hook whenever necessary. (This means that existing
subsystems will all work by default, in theory at least.) My current
design is for the hook to have maximum flexibility - when invoked with a
list of SHA-1s, it must merely ensure that those objects are in the
local repo, whether packed or loose.

I am also working on a command (fetch-blob) to be bundled with Git to be
used as a default hook, and here is where the problem lies.

Suppose you have missing blobs AB12 and CD34 that you now need, so
fetch-blob is invoked. It sends the literals AB12 and CD34 to a new
server endpoint and obtains a packfile, which it then pipes through "git
index-pack". The issue is that "git index-pack" wants to try to access
AB12 and CD34 in the local repo in order to do a SHA-1 collision check,
and therefore fetch-blob is invoked once again, creating infinite
recursion.

This is straightforwardly fixed by making "git index-pack" understand
about missing blobs, but this might be a symptom of this approach being
error-prone (custom hooks that invoke any Git command must be extra
careful).

So I have thought of a few solutions, each with its pros and cons:

1. Require the hook to instead output a packfile to stdout. This means
that the hook no longer needs to access the local repo, and thus has
less dependence on Git commands. But this reduces the flexibility in
that its output must be packed, not loose. (This is fine for the use
cases I'm thinking of, but probably not so for others.)

2. Add support for an environment variable to Git that suppresses access
to the missing blob manifest, in effect, suppressing invocation of the
hook. This allows anyone (the person configuring Git or the hook writer)
to suppress this access, although they might need in-depth knowledge to
know whether the hook is meant to be run with such access suppressed or
required.

3. Like the above, except for a command-line argument to Git.

What do you think? Any solutions that I am missing?

[1] Work in progress, but you can see an earlier version here: 
https://public-inbox.org/git/b917a463f0ad4ce0ab115203b3f24894961a2e75.1497558851.git.jonathanta...@google.com/
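
To make option 2 concrete, here is a sketch of fetch-blob handing the
received packfile to "git index-pack" with the suppressing variable
set; the variable name is the one used in the RFC patches, the rest is
illustrative:

#include "cache.h"
#include "run-command.h"

static int index_pack_without_promises(int packfile_fd)
{
        struct child_process cp = CHILD_PROCESS_INIT;

        argv_array_pushl(&cp.args, "index-pack", "--stdin", NULL);
        /* suppress promised-blob lookups in the child, avoiding the
           recursive hook invocation described above */
        argv_array_push(&cp.env_array, "GIT_IGNORE_PROMISED_BLOBS=1");
        cp.git_cmd = 1;
        cp.in = packfile_fd;    /* becomes the child's stdin */

        return run_command(&cp);
}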


Re: [PATCH v4 6/8] sha1_file: improve sha1_object_info_extended

2017-06-26 Thread Jonathan Tan
On Mon, Jun 26, 2017 at 10:28 AM, Junio C Hamano <gits...@pobox.com> wrote:
> Jonathan Tan <jonathanta...@google.com> writes:
>
>> On Sat, Jun 24, 2017 at 5:45 AM, Jeff King <p...@peff.net> wrote:
>>> On Mon, Jun 19, 2017 at 06:03:13PM -0700, Jonathan Tan wrote:
>>>
>>>> Subject: [PATCH v4 6/8] sha1_file: improve sha1_object_info_extended
>>>> Improve sha1_object_info_extended() by supporting additional flags. This
>>>> allows has_sha1_file_with_flags() to be modified to use
>>>> sha1_object_info_extended() in a subsequent patch.
>>>
>>> A minor nit, but try to avoid vague words like "improve" in your subject
>>> lines. Something like:
>>>
>>>   sha1_file: teach sha1_object_info_extended more flags
>>>
>>> That's not too specific either, but I think in --oneline output it gives
>>> you a much better clue about what part of the function it touches.
>>>
>>>> ---
>>>>  cache.h |  4 
>>>>  sha1_file.c | 43 ---
>>>>  2 files changed, 28 insertions(+), 19 deletions(-)
>>>
>>> The patch itself looks good.
>>
>> Thanks. I did try, but all my attempts exceeded 50 characters. Maybe
>> "sha1_file: support more flags in object info query" is good enough.
>
> Between the two, I personally find that Peff's is more descriptive,
> so unless there are other changes planned, let me "rebase -i" to
> retitle the commit.
>
> Thanks.

His suggestion does exceed 50 characters, but I see that that's a soft
limit. Either title is fine with me, thanks.


Re: What's cooking in git.git (Jun 2017, #07; Sat, 24)

2017-06-26 Thread Jonathan Tan
On Sat, 24 Jun 2017 16:25:13 -0700
Junio C Hamano  wrote:

> * jt/unify-object-info (2017-06-21) 8 commits
>  - sha1_file: refactor has_sha1_file_with_flags
>  - sha1_file: do not access pack if unneeded
>  - sha1_file: improve sha1_object_info_extended
>  - sha1_file: refactor read_object
>  - sha1_file: move delta base cache code up
>  - sha1_file: rename LOOKUP_REPLACE_OBJECT
>  - sha1_file: rename LOOKUP_UNKNOWN_OBJECT
>  - sha1_file: teach packed_object_info about typename
> 
>  Code clean-ups.
> 
>  Looked sensible to me.  Any further comments?
>  cf. <20170624124522.p2dnwdah75e4n...@sigill.intra.peff.net>

In a reply that I just sent out [1], I suggested "sha1_file: support
more flags in object info query" to replace "sha1_file: improve
sha1_object_info_extended" (the 6th patch from the bottom). If you are
OK with that commit message title, and if you prefer not to make the
change locally yourself, I can send out a reroll with the updated commit
message title.

[1] 
https://public-inbox.org/git/cagf8dgj8c0choxzo5cpt56225xgbgrjagy8hdast+0-usfw...@mail.gmail.com/


Re: [PATCH v4 6/8] sha1_file: improve sha1_object_info_extended

2017-06-26 Thread Jonathan Tan
On Sat, Jun 24, 2017 at 5:45 AM, Jeff King <p...@peff.net> wrote:
> On Mon, Jun 19, 2017 at 06:03:13PM -0700, Jonathan Tan wrote:
>
>> Subject: [PATCH v4 6/8] sha1_file: improve sha1_object_info_extended
>> Improve sha1_object_info_extended() by supporting additional flags. This
>> allows has_sha1_file_with_flags() to be modified to use
>> sha1_object_info_extended() in a subsequent patch.
>
> A minor nit, but try to avoid vague words like "improve" in your subject
> lines. Something like:
>
>   sha1_file: teach sha1_object_info_extended more flags
>
> That's not too specific either, but I think in --oneline output it gives
> you a much better clue about what part of the function it touches.
>
>> ---
>>  cache.h |  4 
>>  sha1_file.c | 43 ---
>>  2 files changed, 28 insertions(+), 19 deletions(-)
>
> The patch itself looks good.

Thanks. I did try, but all my attempts exceeded 50 characters. Maybe
"sha1_file: support more flags in object info query" is good enough.


Re: [PATCH v4 7/8] sha1_file: do not access pack if unneeded

2017-06-26 Thread Jonathan Tan
On Sat, Jun 24, 2017 at 1:39 PM, Jeff King  wrote:
> On Sat, Jun 24, 2017 at 11:41:39AM -0700, Junio C Hamano wrote:
> If we are open to writing anything, then I think it should follow the
> same pointer-to-data pattern that the rest of the struct does. I.e.,
> declare the extra pack-data struct as a pointer, and let callers fill it
> in or not. It's excessive in the sense that it's not a variable-sized
> answer, but it at least makes the interface consistent.
>
> And no callers who read it would be silently broken; the actual data
> type in "struct object_info" would change, so they'd get a noisy compile
> error.

I considered that, but there was some trickiness in streaming.c -
open_istream() would need to establish that pointer even though that
is not its responsibility, or istream_source would need to
heap-allocate some memory then point to it from `oi` (it has to be
heap-allocated since the memory must outlive the function).

Also, it does not solve the "maintenance nightmare" issue that Junio
described (in that in order to optimize the pack read away, we would
need a big "if" statement).

Those issues are probably surmountable, but in the end I settled on
just allowing the caller to pass NULL and having
sha1_object_info_extended() substitute an empty struct when that
happens, as you can see in v5 [2]. That allows most of
sha1_object_info_extended() to not have to handle NULL, but also
allows for the specific optimization (optimizing the pack read away)
that I want.

[1] https://public-inbox.org/git/xmqq8tklqkbe@gitster.mtv.corp.google.com/
[2] 
https://public-inbox.org/git/ddbbc86204c131c83b3a1ff3b52789be9ed2a5b1.1498091579.git.jonathanta...@google.com/


Re: [PATCH 1/3] list-objects: add filter_blob to traverse_commit_list

2017-06-22 Thread Jonathan Tan
On Thu, 22 Jun 2017 14:45:26 -0700
Jonathan Tan <jonathanta...@google.com> wrote:

> On Thu, 22 Jun 2017 20:36:13 +
> Jeff Hostetler <g...@jeffhostetler.com> wrote:
> 
> > From: Jeff Hostetler <jeffh...@microsoft.com>
> > 
> > In preparation for partial/sparse clone/fetch where the
> > server is allowed to omit large/all blobs from the packfile,
> > teach traverse_commit_list() to take a blob filter-proc that
> > controls when blobs are shown and marked as SEEN.
> > 
> > Normally, traverse_commit_list() always marks visited blobs
> > as SEEN, so that the show_object() callback will never see
> > duplicates.  Since a single blob OID may be referenced by
> > multiple pathnames, we may not be able to decide if a blob
> > should be excluded until later pathnames have been traversed.
> > With the filter-proc, the automatic deduping is disabled.
> 
> Comparing this approach (this patch and the next one) against mine [1],
> I see that this has the advantage of (in pack-objects) avoiding the
> invocation of add_preferred_base_object() on blobs that are filtered
> out, avoiding polluting the "to_pack" data structure with information
> that we do not need.
> 
> But it does add an extra place where blobs are filtered out (besides
> add_object_entry()), which makes it harder for the reader to keep track
> of what's going on. I took a brief look to see if filtering could be
> eliminated from add_object_entry(), but that function is called from
> many places, so I couldn't tell.
> 
> Anyway, I think both approaches will work (this patch's and mine [1]).
> 
> [1] 
> https://public-inbox.org/git/6f7934621717141ce3bb6bc05cf1d59c7900ccc5.1496432147.git.jonathanta...@google.com/

Also it should be mentioned somewhere that this does not cover the
bitmap case - a similar discussion should be included in one of the
patches, like I did in [1].

And looking back at my original cover letter [2], I wrote:

> This is similar to [1] except that this
[...]
> is slightly more comprehensive in
> that the read_object_list_from_stdin() codepath is also covered in
> addition to the get_object_list() codepath. (Although, to be clear,
> upload-pack always passes "--revs" and thus only needs the
> get_object_list() codepath).

(The [1] in the quote above refers to one of Jeff Hostetler's patches,
[QUOTE 1].)

The same issue applies to this patch (since
read_object_list_from_stdin() invokes add_object_entry() directly
without going through the traversal mechanism) and probably warrants at
least some description in one of the commit messages.

[1] 
https://public-inbox.org/git/6f7934621717141ce3bb6bc05cf1d59c7900ccc5.1496432147.git.jonathanta...@google.com/
[2] https://public-inbox.org/git/cover.1496361873.git.jonathanta...@google.com/

[QUOTE 1] 
https://public-inbox.org/git/1488994685-37403-3-git-send-email-jeffh...@microsoft.com/


Re: [PATCH 2/3] pack-objects: WIP add max-blob-size filtering

2017-06-22 Thread Jonathan Tan
On Thu, 22 Jun 2017 20:36:14 +
Jeff Hostetler  wrote:

> +static signed long max_blob_size = -1;

FYI Junio suggested "blob-max-bytes" when he looked at my patch [1].

[1] https://public-inbox.org/git/xmqqmv9ryoym@gitster.mtv.corp.google.com/

[snip]

> +/*
> + * Filter blobs by pathname or size.
> + * Return 1 to mark the blob SEEN so that it will not be reported again.
> + * Return 0 to allow it to be presented again.
> + */
> +static int filter_blob(
> + struct object *obj,
> + const char *pathname,
> + const char *entryname,
> + void *data)
> +{
> + assert(obj->type == OBJ_BLOB);
> + assert((obj->flags & SEEN) == 0);
> + assert((obj->flags & OBJECT_ADDED) == 0);
> + assert(max_blob_size >= 0);
> +
> + /*
> +  * Always include blobs for special files of the form ".git*".
> +  */
> + if ((strncmp(entryname, ".git", 4) == 0) && entryname[4]) {
> + if (obj->flags & BLOB_OMITTED) {
> + /*
> +  * TODO
> +  * TODO Remove this blob from the omitted blob list.
> +  * TODO
> +  */
> + obj->flags &= ~BLOB_OMITTED;
> + }
> + show_object(obj, pathname, data);
> + return 1;
> + }
> +
> + /*
> +  * We already know the blob is too big because it was previously
> +  * omitted.  We still don't want it yet.  DO NOT mark it SEEN
> +  * in case it is associated with a ".git*" path in another tree
> +  * or commit.
> +  */
> + if (obj->flags & BLOB_OMITTED)
> + return 0;
> +
> + /*
> +  * We only want blobs that are LESS THAN the maximum.
> +  * This allows zero to mean NO BLOBS.

That is not what maximum means...

This is the reason why I originally called it "limit", but after some
reflection, I decided that it is no big deal to always send zero-sized
blobs. The client must be able to handle the presence of blobs anyway
(because it will receive the ".git" ones), and excluding all blobs
regardless of size does not remove the necessity of obtaining their
sizes, since we need those sizes to put in the omitted blob list.

> +  */
> + if (max_blob_size > 0) {
> + unsigned long s;
> + enum object_type t = sha1_object_info(obj->oid.hash, &s);
> + assert(t == OBJ_BLOB);
> + if (s < max_blob_size) {
> + show_object(obj, pathname, data);
> + return 1;
> + }
> + }
> +
> + /*
> +  * TODO
> +  * TODO (Provisionally) add this blob to the omitted blob list.
> +  * TODO

As for the omitted blob list itself, you can see from my patch [2] that I
used a hashmap, but the decorate.h functionality might work too (I
haven't looked into that in detail though).

[2] 
https://public-inbox.org/git/6f7934621717141ce3bb6bc05cf1d59c7900ccc5.1496432147.git.jonathanta...@google.com/
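
For comparison, a sketch of what the decorate.h route could look like;
all names here are illustrative, and a zero-sized blob would need
special handling since its decoration would be NULL:

#include "cache.h"
#include "decorate.h"

static struct decoration omitted_blobs;

static void mark_omitted(struct object *obj, unsigned long size)
{
        /* a decoration maps an object to a void *; smuggle the size */
        add_decoration(&omitted_blobs, obj, (void *)(uintptr_t)size);
}

static int get_omitted_size(struct object *obj, unsigned long *size)
{
        void *d = lookup_decoration(&omitted_blobs, obj);
        if (!d)
                return 0;
        *size = (unsigned long)(uintptr_t)d;
        return 1;
}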


Re: [PATCH 1/3] list-objects: add filter_blob to traverse_commit_list

2017-06-22 Thread Jonathan Tan
On Thu, 22 Jun 2017 20:36:13 +
Jeff Hostetler  wrote:

> From: Jeff Hostetler 
> 
> In preparation for partial/sparse clone/fetch where the
> server is allowed to omit large/all blobs from the packfile,
> teach traverse_commit_list() to take a blob filter-proc that
> controls when blobs are shown and marked as SEEN.
> 
> Normally, traverse_commit_list() always marks visited blobs
> as SEEN, so that the show_object() callback will never see
> duplicates.  Since a single blob OID may be referenced by
> multiple pathnames, we may not be able to decide if a blob
> should be excluded until later pathnames have been traversed.
> With the filter-proc, the automatic deduping is disabled.

Comparing this approach (this patch and the next one) against mine [1],
I see that this has the advantage of (in pack-objects) avoiding the
invocation of add_preferred_base_object() on blobs that are filtered
out, avoiding polluting the "to_pack" data structure with information
that we do not need.

But it does add an extra place where blobs are filtered out (besides
add_object_entry()), which makes it harder for the reader to keep track
of what's going on. I took a brief look to see if filtering could be
eliminated from add_object_entry(), but that function is called from
many places, so I couldn't tell.

Anyway, I think both approaches will work (this patch's and mine [1]).

[1] 
https://public-inbox.org/git/6f7934621717141ce3bb6bc05cf1d59c7900ccc5.1496432147.git.jonathanta...@google.com/

[snip]

>  static void process_blob(struct rev_info *revs,
>struct blob *blob,
>show_object_fn show,
> +  filter_blob_fn filter_blob,
>struct strbuf *path,
>const char *name,
>void *cb_data)
> @@ -24,11 +25,28 @@ static void process_blob(struct rev_info *revs,
>   die("bad blob object");
>   if (obj->flags & (UNINTERESTING | SEEN))
>   return;
> - obj->flags |= SEEN;
>  
>   pathlen = path->len;
>   strbuf_addstr(path, name);
> - show(obj, path->buf, cb_data);
> + if (!filter_blob) {
> + /*
> +  * Normal processing is to immediately dedup blobs
> +  * during commit traversal, regardless of how many
> +  * times it appears in a single or multiple commits,
> +  * so we always set SEEN.
> +  */
> + obj->flags |= SEEN;
> + show(obj, path->buf, cb_data);
> + } else {
> + /*
> +  * Use the filter-proc to decide whether to show
> +  * the blob.  We only set SEEN if requested.  For
> +  * example, this could be used to omit a specific
> +  * blob until it appears with a ".git*" entryname.
> +  */
> + if (filter_blob(obj, path->buf, &path->buf[pathlen], cb_data))
> + obj->flags |= SEEN;
> + }
>   strbuf_setlen(path, pathlen);
>  }

After looking at this code again, I wonder if we should explicitly
document that SEEN will be set on the object before show(), and that
show() is allowed to unset SEEN if it wants traversal to process that
object again. That seems to accomplish what we want without this
expansion of the API.

(This does mean that the next patch will need to call strrchr() itself,
since show() does not provide the basename of the file name.)
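
Concretely, under that convention a callback could look like the
following sketch, where blob_is_omittable() is a made-up predicate:

/* hypothetical predicate, not part of the posted patches */
static int blob_is_omittable(struct object *obj, const char *name);

static void show_object_filtered(struct object *obj, const char *name,
                                 void *data)
{
        if (obj->type == OBJ_BLOB && blob_is_omittable(obj, name)) {
                /* clear SEEN so traversal may present this blob again
                   at another path (e.g. a later ".git*" entry) */
                obj->flags &= ~SEEN;
                return;
        }
        show_object(obj, name, data);
}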

Having said that, if we keep with the current approach, I think there
should be a show() call after the invocation to filter_blob(). That is
much less surprising to me, and also allows us to remove some
show_object() invocations in the next patch.

> +void traverse_commit_list_filtered(
> + struct rev_info *revs,
> + show_commit_fn show_commit,
> + show_object_fn show_object,
> + filter_blob_fn filter_blob,
> + void *data)
> +{
>   int i;
>   struct commit *commit;
>   struct strbuf base;

Git style is to indent to the location after the first "(", I believe.
Likewise in the header file below.

>  typedef void (*show_commit_fn)(struct commit *, void *);
>  typedef void (*show_object_fn)(struct object *, const char *, void *);
> +typedef int  (*filter_blob_fn)(struct object *, const char *, const char *, 
> void *);

Can this have parameter names added (especially both the const char *
ones, otherwise indistinguishable) and, if necessary, documented?


[PATCH v5 5/8] sha1_file: refactor read_object

2017-06-21 Thread Jonathan Tan
read_object() and sha1_object_info_extended() both implement mechanisms
such as object replacement, retrying the packed store after failing to
find the object in the packed store then the loose store, and being able
to mark a packed object as bad and then retrying the whole process.
Consolidating these mechanisms would be a great help to maintainability.

Therefore, consolidate them by extending sha1_object_info_extended() to
support the functionality needed, and then modifying read_object() to
use sha1_object_info_extended().

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 cache.h |  1 +
 sha1_file.c | 84 ++---
 2 files changed, 43 insertions(+), 42 deletions(-)

diff --git a/cache.h b/cache.h
index a3631b237..48aea923b 100644
--- a/cache.h
+++ b/cache.h
@@ -1827,6 +1827,7 @@ struct object_info {
off_t *disk_sizep;
unsigned char *delta_base_sha1;
struct strbuf *typename;
+   void **contentp;
 
/* Response */
enum {
diff --git a/sha1_file.c b/sha1_file.c
index 0c996370d..615a27dac 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2005,19 +2005,6 @@ int parse_sha1_header(const char *hdr, unsigned long 
*sizep)
return parse_sha1_header_extended(hdr, &oi, 0);
 }
 
-static void *unpack_sha1_file(void *map, unsigned long mapsize, enum 
object_type *type, unsigned long *size, const unsigned char *sha1)
-{
-   int ret;
-   git_zstream stream;
-   char hdr[8192];
-
-   ret = unpack_sha1_header(&stream, map, mapsize, hdr, sizeof(hdr));
-   if (ret < Z_OK || (*type = parse_sha1_header(hdr, size)) < 0)
-   return NULL;
-
-   return unpack_sha1_rest(&stream, hdr, *size, sha1);
-}
-
 unsigned long get_size_from_delta(struct packed_git *p,
  struct pack_window **w_curs,
  off_t curpos)
@@ -2326,8 +2313,10 @@ static void *cache_or_unpack_entry(struct packed_git *p, 
off_t base_offset,
if (!ent)
return unpack_entry(p, base_offset, type, base_size);
 
-   *type = ent->type;
-   *base_size = ent->size;
+   if (type)
+   *type = ent->type;
+   if (base_size)
+   *base_size = ent->size;
return xmemdupz(ent->data, ent->size);
 }
 
@@ -2388,9 +2377,16 @@ int packed_object_info(struct packed_git *p, off_t 
obj_offset,
 * We always get the representation type, but only convert it to
 * a "real" type later if the caller is interested.
 */
-   type = unpack_object_header(p, &w_curs, &curpos, &size);
+   if (oi->contentp) {
+   *oi->contentp = cache_or_unpack_entry(p, obj_offset, oi->sizep,
+ &type);
+   if (!*oi->contentp)
+   type = OBJ_BAD;
+   } else {
+   type = unpack_object_header(p, &w_curs, &curpos, &size);
+   }
 
-   if (oi->sizep) {
+   if (!oi->contentp && oi->sizep) {
if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
off_t tmp_pos = curpos;
off_t base_offset = get_delta_base(p, _curs, _pos,
@@ -2679,8 +2675,10 @@ void *unpack_entry(struct packed_git *p, off_t 
obj_offset,
free(external_base);
}
 
-   *final_type = type;
-   *final_size = size;
+   if (final_type)
+   *final_type = type;
+   if (final_size)
+   *final_size = size;
 
unuse_pack(&w_curs);
 
@@ -2914,6 +2912,7 @@ static int sha1_loose_object_info(const unsigned char 
*sha1,
git_zstream stream;
char hdr[32];
struct strbuf hdrbuf = STRBUF_INIT;
+   unsigned long size_scratch;
 
if (oi->delta_base_sha1)
hashclr(oi->delta_base_sha1);
@@ -2926,7 +2925,7 @@ static int sha1_loose_object_info(const unsigned char 
*sha1,
 * return value implicitly indicates whether the
 * object even exists.
 */
-   if (!oi->typep && !oi->typename && !oi->sizep) {
+   if (!oi->typep && !oi->typename && !oi->sizep && !oi->contentp) {
const char *path;
struct stat st;
if (stat_sha1_file(sha1, &st, &path) < 0)
@@ -2939,6 +2938,10 @@ static int sha1_loose_object_info(const unsigned char 
*sha1,
map = map_sha1_file(sha1, &mapsize);
if (!map)
return -1;
+
+   if (!oi->sizep)
+   oi->sizep = &size_scratch;
+
if (oi->disk_sizep)
*oi->disk_sizep = mapsize;
if ((flags & OBJECT_INFO_ALLOW_UNKNOWN_TYPE)) {
@@ -2956,10 +2959,18 @@ static int sha1_loose_object_info(const unsigned char 
*sha1,
   sha1_to_hex(sha1));
} else if ((status = parse_sha1_header_extended(hdr, oi, flags)
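
The hunk that rewrites read_object() itself is cut off above; per the
commit message, its consolidated shape is approximately the following
sketch (not the verbatim patch):

static void *read_object(const unsigned char *sha1, enum object_type *type,
                         unsigned long *size)
{
        struct object_info oi = OBJECT_INFO_INIT;
        void *content;

        oi.typep = type;
        oi.sizep = size;
        oi.contentp = &content;
        if (sha1_object_info_extended(sha1, &oi, 0) < 0)
                return NULL;
        return content;
}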

[PATCH v5 7/8] sha1_file: do not access pack if unneeded

2017-06-21 Thread Jonathan Tan
Currently, regardless of the contents of the "struct object_info" passed
to sha1_object_info_extended(), that function always accesses the
packfile whenever it returns information about a packed object, since it
needs to populate "u.packed".

Add the ability to pass NULL, and use NULL-ness of the argument to
activate an optimization in which sha1_object_info_extended() does not
needlessly access the packfile. A subsequent patch will make use of this
optimization.

A similar optimization is not made for the cached and loose cases as it
would not cause a significant performance improvement.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 sha1_file.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/sha1_file.c b/sha1_file.c
index b6bc02f09..bf6b64ec8 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2977,12 +2977,16 @@ static int sha1_loose_object_info(const unsigned char 
*sha1,
 
 int sha1_object_info_extended(const unsigned char *sha1, struct object_info 
*oi, unsigned flags)
 {
+   static struct object_info blank_oi = OBJECT_INFO_INIT;
struct pack_entry e;
int rtype;
const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
lookup_replace_object(sha1) :
sha1;
 
+   if (!oi)
+   oi = &blank_oi;
+
if (!(flags & OBJECT_INFO_SKIP_CACHED)) {
struct cached_object *co = find_cached_object(real);
if (co) {
@@ -3020,6 +3024,13 @@ int sha1_object_info_extended(const unsigned char *sha1, 
struct object_info *oi,
}
}
 
+   if (oi == &blank_oi)
+   /*
+* We know that the caller doesn't actually need the
+* information below, so return early.
+*/
+   return 0;
+
rtype = packed_object_info(e.p, e.offset, oi);
if (rtype < 0) {
mark_bad_packed_object(e.p, real);
-- 
2.13.1.611.g7e3b11ae1-goog



[PATCH v5 6/8] sha1_file: improve sha1_object_info_extended

2017-06-21 Thread Jonathan Tan
Improve sha1_object_info_extended() by supporting additional flags. This
allows has_sha1_file_with_flags() to be modified to use
sha1_object_info_extended() in a subsequent patch.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 cache.h |  4 
 sha1_file.c | 43 ---
 2 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/cache.h b/cache.h
index 48aea923b..7cf2ca466 100644
--- a/cache.h
+++ b/cache.h
@@ -1863,6 +1863,10 @@ struct object_info {
 #define OBJECT_INFO_LOOKUP_REPLACE 1
 /* Allow reading from a loose object file of unknown/bogus type */
 #define OBJECT_INFO_ALLOW_UNKNOWN_TYPE 2
+/* Do not check cached storage */
+#define OBJECT_INFO_SKIP_CACHED 4
+/* Do not retry packed storage after checking packed and loose storage */
+#define OBJECT_INFO_QUICK 8
 extern int sha1_object_info_extended(const unsigned char *, struct object_info 
*, unsigned flags);
 extern int packed_object_info(struct packed_git *pack, off_t offset, struct 
object_info *);
 
diff --git a/sha1_file.c b/sha1_file.c
index 615a27dac..b6bc02f09 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2977,29 +2977,30 @@ static int sha1_loose_object_info(const unsigned char 
*sha1,
 
 int sha1_object_info_extended(const unsigned char *sha1, struct object_info 
*oi, unsigned flags)
 {
-   struct cached_object *co;
struct pack_entry e;
int rtype;
const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
lookup_replace_object(sha1) :
sha1;
 
-   co = find_cached_object(real);
-   if (co) {
-   if (oi->typep)
-   *(oi->typep) = co->type;
-   if (oi->sizep)
-   *(oi->sizep) = co->size;
-   if (oi->disk_sizep)
-   *(oi->disk_sizep) = 0;
-   if (oi->delta_base_sha1)
-   hashclr(oi->delta_base_sha1);
-   if (oi->typename)
-   strbuf_addstr(oi->typename, typename(co->type));
-   if (oi->contentp)
-   *oi->contentp = xmemdupz(co->buf, co->size);
-   oi->whence = OI_CACHED;
-   return 0;
+   if (!(flags & OBJECT_INFO_SKIP_CACHED)) {
+   struct cached_object *co = find_cached_object(real);
+   if (co) {
+   if (oi->typep)
+   *(oi->typep) = co->type;
+   if (oi->sizep)
+   *(oi->sizep) = co->size;
+   if (oi->disk_sizep)
+   *(oi->disk_sizep) = 0;
+   if (oi->delta_base_sha1)
+   hashclr(oi->delta_base_sha1);
+   if (oi->typename)
+   strbuf_addstr(oi->typename, typename(co->type));
+   if (oi->contentp)
+   *oi->contentp = xmemdupz(co->buf, co->size);
+   oi->whence = OI_CACHED;
+   return 0;
+   }
}
 
if (!find_pack_entry(real, &e)) {
@@ -3010,9 +3011,13 @@ int sha1_object_info_extended(const unsigned char *sha1, 
struct object_info *oi,
}
 
/* Not a loose object; someone else may have just packed it. */
-   reprepare_packed_git();
-   if (!find_pack_entry(real, &e))
+   if (flags & OBJECT_INFO_QUICK) {
return -1;
+   } else {
+   reprepare_packed_git();
+   if (!find_pack_entry(real, &e))
+   return -1;
+   }
}
 
rtype = packed_object_info(e.p, e.offset, oi);
-- 
2.13.1.611.g7e3b11ae1-goog



[PATCH v5 4/8] sha1_file: move delta base cache code up

2017-06-21 Thread Jonathan Tan
In a subsequent patch, packed_object_info() will be modified to use the
delta base cache, so move the relevant code to before
packed_object_info().

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 sha1_file.c | 220 ++--
 1 file changed, 110 insertions(+), 110 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 71296e6cd..0c996370d 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2239,116 +2239,6 @@ static enum object_type packed_to_object_type(struct 
packed_git *p,
goto out;
 }
 
-int packed_object_info(struct packed_git *p, off_t obj_offset,
-  struct object_info *oi)
-{
-   struct pack_window *w_curs = NULL;
-   unsigned long size;
-   off_t curpos = obj_offset;
-   enum object_type type;
-
-   /*
-* We always get the representation type, but only convert it to
-* a "real" type later if the caller is interested.
-*/
-   type = unpack_object_header(p, &w_curs, &curpos, &size);
-
-   if (oi->sizep) {
-   if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
-   off_t tmp_pos = curpos;
-   off_t base_offset = get_delta_base(p, &w_curs, &tmp_pos,
-  type, obj_offset);
-   if (!base_offset) {
-   type = OBJ_BAD;
-   goto out;
-   }
-   *oi->sizep = get_size_from_delta(p, &w_curs, tmp_pos);
-   if (*oi->sizep == 0) {
-   type = OBJ_BAD;
-   goto out;
-   }
-   } else {
-   *oi->sizep = size;
-   }
-   }
-
-   if (oi->disk_sizep) {
-   struct revindex_entry *revidx = find_pack_revindex(p, 
obj_offset);
-   *oi->disk_sizep = revidx[1].offset - obj_offset;
-   }
-
-   if (oi->typep || oi->typename) {
-   enum object_type ptot;
-   ptot = packed_to_object_type(p, obj_offset, type, &w_curs,
-curpos);
-   if (oi->typep)
-   *oi->typep = ptot;
-   if (oi->typename) {
-   const char *tn = typename(ptot);
-   if (tn)
-   strbuf_addstr(oi->typename, tn);
-   }
-   if (ptot < 0) {
-   type = OBJ_BAD;
-   goto out;
-   }
-   }
-
-   if (oi->delta_base_sha1) {
-   if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
-   const unsigned char *base;
-
-   base = get_delta_base_sha1(p, &w_curs, curpos,
-  type, obj_offset);
-   if (!base) {
-   type = OBJ_BAD;
-   goto out;
-   }
-
-   hashcpy(oi->delta_base_sha1, base);
-   } else
-   hashclr(oi->delta_base_sha1);
-   }
-
-out:
-   unuse_pack(&w_curs);
-   return type;
-}
-
-static void *unpack_compressed_entry(struct packed_git *p,
-   struct pack_window **w_curs,
-   off_t curpos,
-   unsigned long size)
-{
-   int st;
-   git_zstream stream;
-   unsigned char *buffer, *in;
-
-   buffer = xmallocz_gently(size);
-   if (!buffer)
-   return NULL;
-   memset(&stream, 0, sizeof(stream));
-   stream.next_out = buffer;
-   stream.avail_out = size + 1;
-
-   git_inflate_init(&stream);
-   do {
-   in = use_pack(p, w_curs, curpos, &stream.avail_in);
-   stream.next_in = in;
-   st = git_inflate(&stream, Z_FINISH);
-   if (!stream.avail_out)
-   break; /* the payload is larger than it should be */
-   curpos += stream.next_in - in;
-   } while (st == Z_OK || st == Z_BUF_ERROR);
-   git_inflate_end(&stream);
-   if ((st != Z_STREAM_END) || stream.total_out != size) {
-   free(buffer);
-   return NULL;
-   }
-
-   return buffer;
-}
-
 static struct hashmap delta_base_cache;
 static size_t delta_base_cached;
 
@@ -2486,6 +2376,116 @@ static void add_delta_base_cache(struct packed_git *p, 
off_t base_offset,
hashmap_add(&delta_base_cache, ent);
 }
 
+int packed_object_info(struct packed_git *p, off_t obj_offset,
+  struct object_info *oi)
+{
+   struct pack_window *w_curs = NULL;
+   unsigned long size;
+   off_t curpos = obj_offset;
+   enum object_type type;
+
+   /*
+* We always get the representa

[PATCH v5 3/8] sha1_file: rename LOOKUP_REPLACE_OBJECT

2017-06-21 Thread Jonathan Tan
The LOOKUP_REPLACE_OBJECT flag controls whether the
lookup_replace_object() function is invoked by
sha1_object_info_extended(), read_sha1_file_extended(), and
lookup_replace_object_extended(), but it is not immediately clear which
functions accept that flag.

Therefore restrict this flag to only sha1_object_info_extended(),
renaming it appropriately to OBJECT_INFO_LOOKUP_REPLACE and adding some
documentation. Update read_sha1_file_extended() to have a boolean
parameter instead, and delete lookup_replace_object_extended().

parse_sha1_header() also passes this flag to
parse_sha1_header_extended() since commit 46f0344 ("sha1_file: support
reading from a loose object of unknown type", 2015-05-03), but that has
had no effect since that commit. Therefore this patch also removes this
flag from that invocation.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 builtin/cat-file.c |  5 +++--
 cache.h| 17 ++---
 sha1_file.c| 14 +-
 3 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 209374b3c..a58b8c820 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -56,7 +56,7 @@ static int cat_one_file(int opt, const char *exp_type, const 
char *obj_name,
struct object_context obj_context;
struct object_info oi = OBJECT_INFO_INIT;
struct strbuf sb = STRBUF_INIT;
-   unsigned flags = LOOKUP_REPLACE_OBJECT;
+   unsigned flags = OBJECT_INFO_LOOKUP_REPLACE;
const char *path = force_path;
 
if (unknown_type)
@@ -337,7 +337,8 @@ static void batch_object_write(const char *obj_name, struct 
batch_options *opt,
struct strbuf buf = STRBUF_INIT;
 
if (!data->skip_object_info &&
+   sha1_object_info_extended(data->oid.hash, &data->info, 
LOOKUP_REPLACE_OBJECT) < 0) {
+   sha1_object_info_extended(data->oid.hash, >info,
+ OBJECT_INFO_LOOKUP_REPLACE) < 0) {
printf("%s missing\n",
   obj_name ? obj_name : oid_to_hex(>oid));
fflush(stdout);
diff --git a/cache.h b/cache.h
index e2ec45dfe..a3631b237 100644
--- a/cache.h
+++ b/cache.h
@@ -1205,12 +1205,12 @@ extern char *xdg_config_home(const char *filename);
  */
 extern char *xdg_cache_home(const char *filename);
 
-/* object replacement */
-#define LOOKUP_REPLACE_OBJECT 1
-extern void *read_sha1_file_extended(const unsigned char *sha1, enum 
object_type *type, unsigned long *size, unsigned flag);
+extern void *read_sha1_file_extended(const unsigned char *sha1,
+enum object_type *type,
+unsigned long *size, int lookup_replace);
 static inline void *read_sha1_file(const unsigned char *sha1, enum object_type 
*type, unsigned long *size)
 {
-   return read_sha1_file_extended(sha1, type, size, LOOKUP_REPLACE_OBJECT);
+   return read_sha1_file_extended(sha1, type, size, 1);
 }
 
 /*
@@ -1232,13 +1232,6 @@ static inline const unsigned char 
*lookup_replace_object(const unsigned char *sh
return do_lookup_replace_object(sha1);
 }
 
-static inline const unsigned char *lookup_replace_object_extended(const 
unsigned char *sha1, unsigned flag)
-{
-   if (!(flag & LOOKUP_REPLACE_OBJECT))
-   return sha1;
-   return lookup_replace_object(sha1);
-}
-
 /* Read and unpack a sha1 file into memory, write memory to a sha1 file */
 extern int sha1_object_info(const unsigned char *, unsigned long *);
 extern int hash_sha1_file(const void *buf, unsigned long len, const char 
*type, unsigned char *sha1);
@@ -1865,6 +1858,8 @@ struct object_info {
  */
 #define OBJECT_INFO_INIT {NULL}
 
+/* Invoke lookup_replace_object() on the given hash */
+#define OBJECT_INFO_LOOKUP_REPLACE 1
 /* Allow reading from a loose object file of unknown/bogus type */
 #define OBJECT_INFO_ALLOW_UNKNOWN_TYPE 2
 extern int sha1_object_info_extended(const unsigned char *, struct object_info 
*, unsigned flags);
diff --git a/sha1_file.c b/sha1_file.c
index ad04ea8e0..71296e6cd 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2002,7 +2002,7 @@ int parse_sha1_header(const char *hdr, unsigned long 
*sizep)
struct object_info oi = OBJECT_INFO_INIT;
 
oi.sizep = sizep;
-   return parse_sha1_header_extended(hdr, &oi, LOOKUP_REPLACE_OBJECT);
+   return parse_sha1_header_extended(hdr, &oi, 0);
 }
 
 static void *unpack_sha1_file(void *map, unsigned long mapsize, enum 
object_type *type, unsigned long *size, const unsigned char *sha1)
@@ -2969,7 +2969,9 @@ int sha1_object_info_extended(const unsigned char *sha1, 
struct object_info *oi,
struct cached_object *co;
struct pack_entry e;
int rtype;
-   const unsigned char *real = lookup_replace_object_extended(sha1, flags);
+   const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE

[PATCH v5 8/8] sha1_file: refactor has_sha1_file_with_flags

2017-06-21 Thread Jonathan Tan
has_sha1_file_with_flags() implements many mechanisms in common with
sha1_object_info_extended(). Make has_sha1_file_with_flags() a
convenience function for sha1_object_info_extended() instead.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 builtin/fetch.c  | 10 ++
 builtin/index-pack.c |  3 ++-
 cache.h  | 11 +++
 sha1_file.c  | 12 ++--
 4 files changed, 13 insertions(+), 23 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 47708451b..96d5146c4 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -242,9 +242,11 @@ static void find_non_local_tags(struct transport 
*transport,
 */
if (ends_with(ref->name, "^{}")) {
if (item &&
-   !has_object_file_with_flags(&ref->old_oid, 
HAS_SHA1_QUICK) &&
+   !has_object_file_with_flags(>old_oid,
+   OBJECT_INFO_QUICK) &&
!will_fetch(head, ref->old_oid.hash) &&
-   !has_sha1_file_with_flags(item->util, 
HAS_SHA1_QUICK) &&
+   !has_sha1_file_with_flags(item->util,
+ OBJECT_INFO_QUICK) &&
!will_fetch(head, item->util))
item->util = NULL;
item = NULL;
@@ -258,7 +260,7 @@ static void find_non_local_tags(struct transport *transport,
 * fetch.
 */
if (item &&
-   !has_sha1_file_with_flags(item->util, HAS_SHA1_QUICK) &&
+   !has_sha1_file_with_flags(item->util, OBJECT_INFO_QUICK) &&
!will_fetch(head, item->util))
item->util = NULL;
 
@@ -279,7 +281,7 @@ static void find_non_local_tags(struct transport *transport,
 * checked to see if it needs fetching.
 */
if (item &&
-   !has_sha1_file_with_flags(item->util, HAS_SHA1_QUICK) &&
+   !has_sha1_file_with_flags(item->util, OBJECT_INFO_QUICK) &&
!will_fetch(head, item->util))
item->util = NULL;
 
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 04b9dcaf0..587bc80c9 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -794,7 +794,8 @@ static void sha1_object(const void *data, struct 
object_entry *obj_entry,
 
if (startup_info->have_repository) {
read_lock();
-   collision_test_needed = has_sha1_file_with_flags(oid->hash, 
HAS_SHA1_QUICK);
+   collision_test_needed =
+   has_sha1_file_with_flags(oid->hash, OBJECT_INFO_QUICK);
read_unlock();
}
 
diff --git a/cache.h b/cache.h
index 7cf2ca466..3ae9769aa 100644
--- a/cache.h
+++ b/cache.h
@@ -1268,15 +1268,10 @@ int read_loose_object(const char *path,
  void **contents);
 
 /*
- * Return true iff we have an object named sha1, whether local or in
- * an alternate object database, and whether packed or loose.  This
- * function does not respect replace references.
- *
- * If the QUICK flag is set, do not re-check the pack directory
- * when we cannot find the object (this means we may give a false
- * negative answer if another process is simultaneously repacking).
+ * Convenience for sha1_object_info_extended() with a NULL struct
+ * object_info. OBJECT_INFO_SKIP_CACHED is automatically set; pass
+ * nonzero flags to also set other flags.
  */
-#define HAS_SHA1_QUICK 0x1
 extern int has_sha1_file_with_flags(const unsigned char *sha1, int flags);
 static inline int has_sha1_file(const unsigned char *sha1)
 {
diff --git a/sha1_file.c b/sha1_file.c
index bf6b64ec8..778f01d92 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3494,18 +3494,10 @@ int has_sha1_pack(const unsigned char *sha1)
 
 int has_sha1_file_with_flags(const unsigned char *sha1, int flags)
 {
-   struct pack_entry e;
-
if (!startup_info->have_repository)
return 0;
-   if (find_pack_entry(sha1, &e))
-   return 1;
-   if (has_loose_object(sha1))
-   return 1;
-   if (flags & HAS_SHA1_QUICK)
-   return 0;
-   reprepare_packed_git();
-   return find_pack_entry(sha1, &e);
+   return sha1_object_info_extended(sha1, NULL,
+flags | OBJECT_INFO_SKIP_CACHED) >= 0;
 }
 
 int has_object_file(const struct object_id *oid)
-- 
2.13.1.611.g7e3b11ae1-goog



[PATCH v5 2/8] sha1_file: rename LOOKUP_UNKNOWN_OBJECT

2017-06-21 Thread Jonathan Tan
The LOOKUP_UNKNOWN_OBJECT flag was introduced in commit 46f0344
("sha1_file: support reading from a loose object of unknown type",
2015-05-03) in order to support a feature in cat-file subsequently
introduced in commit 39e4ae3 ("cat-file: teach cat-file a
'--allow-unknown-type' option", 2015-05-03). Despite its name and
location in cache.h, this flag is used neither in
read_sha1_file_extended() nor in any of the lookup functions, but used
only in sha1_object_info_extended().

Therefore rename this flag to OBJECT_INFO_ALLOW_UNKNOWN_TYPE, taking the
name of the cat-file flag that invokes this feature, and move it closer
to the declaration of sha1_object_info_extended(). Also add
documentation for this flag.

OBJECT_INFO_ALLOW_UNKNOWN_TYPE is defined to 2, not 1, to avoid
conflicting with LOOKUP_REPLACE_OBJECT. Avoidance of this conflict is
necessary because sha1_object_info_extended() supports both flags.
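
For illustration, the two bits can then be combined in one flags word,
as cat-file does; the wrapper function here is made up:

static int object_info_allowing_unknown(const unsigned char *sha1,
                                        struct object_info *oi,
                                        int unknown_type)
{
        unsigned flags = LOOKUP_REPLACE_OBJECT;          /* == 1 */

        if (unknown_type)
                flags |= OBJECT_INFO_ALLOW_UNKNOWN_TYPE; /* == 2 */

        /* distinct bits, so both requests survive the OR */
        return sha1_object_info_extended(sha1, oi, flags);
}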

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 builtin/cat-file.c | 2 +-
 cache.h| 3 ++-
 sha1_file.c| 4 ++--
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 4bffd7a2d..209374b3c 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -60,7 +60,7 @@ static int cat_one_file(int opt, const char *exp_type, const 
char *obj_name,
const char *path = force_path;
 
if (unknown_type)
-   flags |= LOOKUP_UNKNOWN_OBJECT;
+   flags |= OBJECT_INFO_ALLOW_UNKNOWN_TYPE;
 
if (get_sha1_with_context(obj_name, GET_SHA1_RECORD_PATH,
  oid.hash, &obj_context))
diff --git a/cache.h b/cache.h
index 4d92aae0e..e2ec45dfe 100644
--- a/cache.h
+++ b/cache.h
@@ -1207,7 +1207,6 @@ extern char *xdg_cache_home(const char *filename);
 
 /* object replacement */
 #define LOOKUP_REPLACE_OBJECT 1
-#define LOOKUP_UNKNOWN_OBJECT 2
 extern void *read_sha1_file_extended(const unsigned char *sha1, enum 
object_type *type, unsigned long *size, unsigned flag);
 static inline void *read_sha1_file(const unsigned char *sha1, enum object_type 
*type, unsigned long *size)
 {
@@ -1866,6 +1865,8 @@ struct object_info {
  */
 #define OBJECT_INFO_INIT {NULL}
 
+/* Allow reading from a loose object file of unknown/bogus type */
+#define OBJECT_INFO_ALLOW_UNKNOWN_TYPE 2
 extern int sha1_object_info_extended(const unsigned char *, struct object_info 
*, unsigned flags);
 extern int packed_object_info(struct packed_git *pack, off_t offset, struct 
object_info *);
 
diff --git a/sha1_file.c b/sha1_file.c
index a52b27541..ad04ea8e0 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1964,7 +1964,7 @@ static int parse_sha1_header_extended(const char *hdr, 
struct object_info *oi,
 * we're obtaining the type using '--allow-unknown-type'
 * option.
 */
-   if ((flags & LOOKUP_UNKNOWN_OBJECT) && (type < 0))
+   if ((flags & OBJECT_INFO_ALLOW_UNKNOWN_TYPE) && (type < 0))
type = 0;
else if (type < 0)
die("invalid object type");
@@ -2941,7 +2941,7 @@ static int sha1_loose_object_info(const unsigned char 
*sha1,
return -1;
if (oi->disk_sizep)
*oi->disk_sizep = mapsize;
-   if ((flags & LOOKUP_UNKNOWN_OBJECT)) {
+   if ((flags & OBJECT_INFO_ALLOW_UNKNOWN_TYPE)) {
if (unpack_sha1_header_to_strbuf(&stream, map, mapsize, hdr, 
sizeof(hdr), &hdrbuf) < 0)
status = error("unable to unpack %s header with 
--allow-unknown-type",
   sha1_to_hex(sha1));
-- 
2.13.1.611.g7e3b11ae1-goog



[PATCH v5 1/8] sha1_file: teach packed_object_info about typename

2017-06-21 Thread Jonathan Tan
In commit 46f0344 ("sha1_file: support reading from a loose object of
unknown type", 2015-05-06), "struct object_info" gained a "typename"
field that could represent a type name from a loose object file, whether
valid or invalid, as opposed to the existing "typep" which could only
represent valid types. Some relatively complex manipulations were added
to avoid breaking packed_object_info() without modifying it, but it is
much easier to just teach packed_object_info() about the new field.
Therefore, teach packed_object_info() as described above.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 sha1_file.c | 29 -
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 59a4ed2ed..a52b27541 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2277,9 +2277,18 @@ int packed_object_info(struct packed_git *p, off_t 
obj_offset,
*oi->disk_sizep = revidx[1].offset - obj_offset;
}
 
-   if (oi->typep) {
-   *oi->typep = packed_to_object_type(p, obj_offset, type, 
&w_curs, curpos);
-   if (*oi->typep < 0) {
+   if (oi->typep || oi->typename) {
+   enum object_type ptot;
+   ptot = packed_to_object_type(p, obj_offset, type, &w_curs,
+curpos);
+   if (oi->typep)
+   *oi->typep = ptot;
+   if (oi->typename) {
+   const char *tn = typename(ptot);
+   if (tn)
+   strbuf_addstr(oi->typename, tn);
+   }
+   if (ptot < 0) {
type = OBJ_BAD;
goto out;
}
@@ -2960,7 +2969,6 @@ int sha1_object_info_extended(const unsigned char *sha1, 
struct object_info *oi,
struct cached_object *co;
struct pack_entry e;
int rtype;
-   enum object_type real_type;
const unsigned char *real = lookup_replace_object_extended(sha1, flags);
 
co = find_cached_object(real);
@@ -2992,18 +3000,9 @@ int sha1_object_info_extended(const unsigned char *sha1, 
struct object_info *oi,
return -1;
}
 
-   /*
-* packed_object_info() does not follow the delta chain to
-* find out the real type, unless it is given oi->typep.
-*/
-   if (oi->typename && !oi->typep)
-   oi->typep = &real_type;
-
rtype = packed_object_info(e.p, e.offset, oi);
if (rtype < 0) {
mark_bad_packed_object(e.p, real);
-   if (oi->typep == &real_type)
-   oi->typep = NULL;
return sha1_object_info_extended(real, oi, 0);
} else if (in_delta_base_cache(e.p, e.offset)) {
oi->whence = OI_DBCACHED;
@@ -3014,10 +3013,6 @@ int sha1_object_info_extended(const unsigned char *sha1, 
struct object_info *oi,
oi->u.packed.is_delta = (rtype == OBJ_REF_DELTA ||
 rtype == OBJ_OFS_DELTA);
}
-   if (oi->typename)
-   strbuf_addstr(oi->typename, typename(*oi->typep));
-   if (oi->typep == &real_type)
-   oi->typep = NULL;
 
return 0;
 }
-- 
2.13.1.611.g7e3b11ae1-goog



[PATCH v5 0/8] Improvements to sha1_file

2017-06-21 Thread Jonathan Tan
Thanks, Junio. A reply to your comment on patch 7:

> ... this "if" statement feels like a maintenance nightmare.  The
> intent of the guard, I think, is "when the call wants absolutely
> nothing but whence", but the implementation of the guard will not
> stay true to the intent whenever somebody adds a new field to oi.
> 
> I wonder if it makes more sense to have a new field "whence_only",
> which is set only by such a specialized caller, which this guard
> checks (and no other fields).

After some more thought, I think I came up with a better solution -
allow sha1_object_info_extended() to take a NULL struct object_info
pointer, and immediately assign it (if NULL) a blank struct, but use the
NULL-ness as an indication that we can skip accessing the packfile. The
last patch actually doesn't even need the "whence", so we can do this.
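
Roughly, the idea is this (a sketch only - the loose- and cached-object
lookups are elided, and names may differ in the actual patches):

static struct object_info blank;

int sha1_object_info_extended(const unsigned char *sha1,
                              struct object_info *oi, unsigned flags)
{
	struct pack_entry e;

	if (!oi)
		oi = &blank; /* the caller only cares about existence */

	/* ... loose- and cached-object lookups go here ... */

	if (!find_pack_entry(sha1, &e))
		return -1;

	/*
	 * A blank oi means that no field needs to be populated, so the
	 * pack data itself never has to be accessed.
	 */
	if (oi == &blank)
		return 0;

	return packed_object_info(e.p, e.offset, oi) < 0 ? -1 : 0;
}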

Changes from v4:
 - patch 2
   - Updated commit message to explain why
 OBJECT_INFO_ALLOW_UNKNOWN_TYPE is defined to be 2, not 1.
 - patch 3
   - Made all invocations of sha1_object_info_extended() compare "< 0".
 - patch 5
   - Made all invocations of sha1_object_info_extended() compare "< 0".
 - patch 7
   - Rewrote patch to make sha1_object_info_extended() accept NULL
 struct object_info pointer.
 - patch 8
   - Made has_sha1_file_with_flags send NULL instead of blank struct
 object_info.

Jonathan Tan (8):
  sha1_file: teach packed_object_info about typename
  sha1_file: rename LOOKUP_UNKNOWN_OBJECT
  sha1_file: rename LOOKUP_REPLACE_OBJECT
  sha1_file: move delta base cache code up
  sha1_file: refactor read_object
  sha1_file: improve sha1_object_info_extended
  sha1_file: do not access pack if unneeded
  sha1_file: refactor has_sha1_file_with_flags

 builtin/cat-file.c   |   7 +-
 builtin/fetch.c  |  10 +-
 builtin/index-pack.c |   3 +-
 cache.h  |  36 +++--
 sha1_file.c  | 385 ++-
 5 files changed, 224 insertions(+), 217 deletions(-)

-- 
2.13.1.611.g7e3b11ae1-goog



Re: [PATCH v3 19/20] repository: enable initialization of submodules

2017-06-21 Thread Jonathan Tan
On Tue, 20 Jun 2017 12:19:50 -0700
Brandon Williams  wrote:

> Introduce 'repo_submodule_init()' which performs initialization of a
> 'struct repository' as a submodule of another 'struct repository'.
> 
> The resulting submodule can be in one of three states:
> 
>   1. The submodule is initialized and has a worktree.
> 
>   2. The submodule is initialized but does not have a worktree.  This
>  would occur when the submodule's gitdir is present in the
>  superproject's 'gitdir/modules/' directory yet the submodule has not
>  been checked out in superproject's worktree.

In a recent proposal [1] to update the submodule documentation, an
"initialized submodule" is one that has a working directory, which seems
to have a different meaning of "initialized" (to the paragraphs above).

Or did you mean the "struct repository" is initialized etc.? In which
case, it does not seem strange to me that a repository is initialized
but does not have a worktree, since bare repositories are like that too.

[1] https://public-inbox.org/git/20170621173756.-1-sbel...@google.com/

>   3. The submodule remains uninitialized due to an error in the
>  initialization process or there is no matching submodule at the
>  provided path in the superproject.
> 
> Signed-off-by: Brandon Williams 

[snip]

> +/*
> + * Initialize 'submodule' as the submodule given by 'path' in parent 
> repository
> + * 'superproject'.
> + * Return 0 upon success and a non-zero value upon failure.
> + */
> +int repo_submodule_init(struct repository *submodule,
> + struct repository *superproject,
> + const char *path)
> +{
> + const struct submodule *sub;
> + struct strbuf submodule_path = STRBUF_INIT;
> + int ret = 0;
> +
> + sub = submodule_from_cache(superproject, null_sha1, path);
> + if (!sub) {
> + ret = -1;
> + goto out;
> + }
> +
> + strbuf_repo_worktree_path(&submodule_path, superproject, "%s", path);
> +
> + if (repo_init(submodule, submodule_path.buf, submodule_path.buf)) {

This works because the 2nd parameter (git_dir) can take in either the
Git directory itself or its parent, but it does make the call site look
strange. Would it be a good idea to make it mandatory to specify the Git
directory? That would make call sites clearer but require more code
there.
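
For example (a sketch only - I'm assuming resolve_gitdir(), which maps
both a ".git" file and a ".git" directory to the real git directory, is
usable here):

struct strbuf gitdir = STRBUF_INIT;

strbuf_addf(&gitdir, "%s/.git", submodule_path.buf);
/* hand repo_init() the git directory itself, never its parent */
if (repo_init(submodule, resolve_gitdir(gitdir.buf), submodule_path.buf))
	ret = -1;
strbuf_release(&gitdir);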

> + strbuf_reset(&submodule_path);
> + strbuf_repo_git_path(&submodule_path, superproject,
> +  "modules/%s", sub->name);
> +
> + if (repo_init(submodule, submodule_path.buf, NULL)) {
> + ret = -1;
> + goto out;
> + }
> + }
> +
> + submodule->submodule_prefix = xstrfmt("%s%s/",
> +   superproject->submodule_prefix ?
> +   superproject->submodule_prefix :
> +   "", path);
> +
> +out:
> + strbuf_release(&submodule_path);
> + return ret;
> +}


Re: [PATCH v3 15/20] repository: add index_state to struct repo

2017-06-21 Thread Jonathan Tan
On Tue, 20 Jun 2017 12:19:46 -0700
Brandon Williams  wrote:

> +int repo_read_index(struct repository *repo)
> +{
> + if (!repo->index)
> + repo->index = xcalloc(1, sizeof(struct index_state));

sizeof(*repo->index)?

[snip]

> + /* Repository's in-memory index */
> + struct index_state *index;
> +

Might be worth commenting that repo_read_index() can populate this.


Re: [PATCH v3 20/20] ls-files: use repository object

2017-06-21 Thread Jonathan Tan
On Tue, 20 Jun 2017 12:19:51 -0700
Brandon Williams  wrote:

> -static void show_ce_entry(const struct index_state *istate,
> -   const char *tag, const struct cache_entry *ce)
> +static void show_ce(struct repository *repo, struct dir_struct *dir,
> + const struct cache_entry *ce, const char *fullname,
> + const char *tag)

This function is getting complicated - in particular, it's getting a
"fullname" which is, in a sense, redundant with "repo" and "ce" (but
that seems necessary for efficiency). Maybe add a comment describing the
function?
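
Something along these lines, perhaps (wording is only a suggestion):

/*
 * Print (or, for a submodule, recurse into) the cache entry "ce" of
 * repository "repo". "fullname" must be the path of "ce" as seen from
 * the top-level superproject; it could be recomputed from "repo" and
 * "ce", but is passed in for efficiency.
 */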

[snip]

> @@ -651,8 +610,9 @@ int cmd_ls_files(int argc, const char **argv, const char 
> *cmd_prefix)
>   if (require_work_tree && !is_inside_work_tree())
>   setup_work_tree();
>  
> - if (recurse_submodules)
> - compile_submodule_options(argv, &dir, show_tag);
> + if (recurse_submodules) {
> + repo_read_gitmodules(the_repository);
> + }

No need for braces.


Re: [PATCH v4 2/8] sha1_file: rename LOOKUP_UNKNOWN_OBJECT

2017-06-21 Thread Jonathan Tan
On Wed, 21 Jun 2017 10:22:38 -0700
Junio C Hamano <gits...@pobox.com> wrote:

> Jonathan Tan <jonathanta...@google.com> writes:
> 
> > The LOOKUP_UNKNOWN_OBJECT flag was introduced in commit 46f0344
> > ("sha1_file: support reading from a loose object of unknown type",
> > 2015-05-03) in order to support a feature in cat-file subsequently
> > introduced in commit 39e4ae3 ("cat-file: teach cat-file a
> > '--allow-unknown-type' option", 2015-05-03). Despite its name and
> > location in cache.h, this flag is used neither in
> > read_sha1_file_extended() nor in any of the lookup functions, but used
> > only in sha1_object_info_extended().
> >
> > Therefore rename this flag to OBJECT_INFO_ALLOW_UNKNOWN_TYPE, taking the
> > name of the cat-file flag that invokes this feature, and move it closer
> > to the declaration of sha1_object_info_extended(). Also add
> > documentation for this flag.
> 
> All of the above makes sense, but ...
> 
> > diff --git a/cache.h b/cache.h
> > index 4d92aae0e..e2ec45dfe 100644
> > --- a/cache.h
> > +++ b/cache.h
> > @@ -1207,7 +1207,6 @@ extern char *xdg_cache_home(const char *filename);
> >  
> >  /* object replacement */
> >  #define LOOKUP_REPLACE_OBJECT 1
> > -#define LOOKUP_UNKNOWN_OBJECT 2
> >  extern void *read_sha1_file_extended(const unsigned char *sha1, enum 
> > object_type *type, unsigned long *size, unsigned flag);
> >  static inline void *read_sha1_file(const unsigned char *sha1, enum 
> > object_type *type, unsigned long *size)
> >  {
> > @@ -1866,6 +1865,8 @@ struct object_info {
> >   */
> >  #define OBJECT_INFO_INIT {NULL}
> >  
> > +/* Allow reading from a loose object file of unknown/bogus type */
> > +#define OBJECT_INFO_ALLOW_UNKNOWN_TYPE 2
> 
> ... this contradicts the analysis given, doesn't it?
> 
> Does something break if we change this to 1 (perhaps because in some
> cases this bit reach read_sha1_file_extended())?  I doubt it, but
> leaving this to still define the bit to 2 makes readers wonder why.

The issue is that LOOKUP_REPLACE_OBJECT (which is 1) is also used by
sha1_object_info_extended(). So yes, it will break if
OBJECT_INFO_ALLOW_UNKNOWN_TYPE is changed to 1. I'm resolving this in
the next patch by also renaming LOOKUP_REPLACE_OBJECT and making it only
used by sha1_object_info_extended().

I'll add a comment in the commit message locally, and will resend it out
tomorrow (in case more comments come).


Re: [PATCHv2] submodules: overhaul documentation

2017-06-20 Thread Jonathan Tan
Thanks, this looks like a good explanation. Some more nits, but
overall I feel like I understand this and have learned something from
it.

On Tue, Jun 20, 2017 at 3:56 PM, Stefan Beller  wrote:
> +A submodule is another Git repository tracked inside a repository.
> +The tracked repository has its own history, which does not
> +interfere with the history of the current repository.
> +
> +It consists of a tracking subdirectory in the working directory,
> +a 'gitlink' in the working tree and an entry in the `.gitmodules`

Probably should be `gitlink` (the special quotes), and (optional)
s/`gitlink`/`gitlink` object/ because it might not be apparent that
gitlink is a type of object.

> +file (see linkgit:gitmodules[5]) at the root of the source tree.

After reading below, maybe we should mention the Git directory in
$GIT_DIR/modules/ as part of what a submodule consists
of too.

> +The tracking subdirectory appears in the main repositorys working

s/repositorys/repository's/ (apostrophe is also missing in some other
places below)

> +tree at the point where the submodules gitlink is tracked in the
> +tree.  It is empty when the submodule is not populated, otherwise
> +it contains the content of the submodule repository.
> +The main repository is often referred to as superproject.
> +
> +The gitlink contains the object name of a particular commit
> +of the submodule.
> +
> +The `.gitmodules` file establishes a relationship between the
> +path, which is where the gitlink is in the tree, and the logical
> +name, which is used for the location of the submodules git
> +directory. The `.gitmodules` file has the same syntax as the
> +$Git_DIR/config file and the mapping of path to name

Capitalization of $GIT_DIR

> +is done via setting `submodule.<name>.path = <path>`.

(Optional) I would prefer <name> and <path> to be consistent with the
following paragraph.

> +The submodules git directory is found in the main repositories
> +'$GIT_DIR/modules/<name>' or inside the tracking subdirectory.
> +
> +Submodules can be used for at least two different use cases:
> +
> +1. Using another project while maintaining independent history.
> +  Submodules allow you to contain the working tree of another project
> +  within your own working tree while keeping the history of both
> +  projects separate. Also, since submodules are fixed to an arbitrary
> +  version, the other project can be independently developed without
> +  affecting the superproject, allowing the superproject to
> +  fix itself to new versions only whenever desired.
> +
> +2. Splitting a (logically single) project into multiple
> +   repositories and tying them back together. This can be used to
> +   overcome current limitations of Gits implementation to have
> +   finer grained access:
> +
> +* Size of the git repository
> +  In its current form Git scales up poorly for very large repositories that
> +  change a lot, as the history grows very large.
> +  However you can also use submodules to e.g. hold large binary assets
> +  and these repositories are then shallowly cloned such that you do not
> +  have a large history locally.
> +
> +* Transfer size
> +  In its current form Git requires the whole working tree present. It
> +  does not allow partial trees to be transferred in fetch or clone.
> +
> +* Access control
> +  By restricting user access to submodules, this can be used to implement
> +  read/write policies for different users.

The bullet points should probably be indented more.

[snip]

> +Deleting a submodule
> +
> +
> +Deleting a submodule can happen on different levels:
> +
> +1) Removing it from the local working tree without tampering with
> +   the history of the superproject.
> +
> +You may no longer need the submodule, but still want to keep it recorded
> +in the superproject history as others may have use for it. The command
> +`git submodule deinit <path>` will remove any configuration
> +entries from the config file, such that the submodule becomes

s=config=$GIT_DIR/config= (since there are multiple relevant config files)

> +uninitialized. The tracking directory in the superprojects working
> +tree that holds the submodules working directory is emptied.
> +This step can be undone via `git submodule init`.
> +
> +2) Remove it from history:
> +--
> +   git rm <path>
> +   git commit
> +--
> +This removes the submodules gitlink from the superprojects tree, as well
> +as removing the entries from the `.gitmodules` file, but keeps the
> +local configuration for the submodule. This can be undone using `git revert`.
> +
> +
> +3) Remove the submodules git directory:
> +
> +When you also want to free up the disk space that the submodules git
> +directory uses, you have to delete it manually as this
> +step cannot be undone using git tools. It is found in `$GIT_DIR/modules`.


Re: [PATCH v3 10/20] path: convert do_git_path to take a 'struct repository'

2017-06-20 Thread Jonathan Tan
On Tue, 20 Jun 2017 12:19:41 -0700
Brandon Williams  wrote:

> +static void do_git_path(const struct repository *repo,
> + const struct worktree *wt, struct strbuf *buf,
>   const char *fmt, va_list args)
>  {
>   int gitdir_len;
> - strbuf_addstr(buf, get_worktree_git_dir(wt));

With this change, the get_worktree_git_dir() function no longer seems to
be used from outside - could it be marked static?

> + strbuf_worktree_gitdir(buf, repo, wt);
>   if (buf->len && !is_dir_sep(buf->buf[buf->len - 1]))
>   strbuf_addch(buf, '/');
>   gitdir_len = buf->len;
>   strbuf_vaddf(buf, fmt, args);
> - adjust_git_path(buf, gitdir_len);
> + adjust_git_path(repo, buf, gitdir_len);
>   strbuf_cleanup_path(buf);
>  }


Re: [PATCH v3 05/20] environment: place key repository state in the_repository

2017-06-20 Thread Jonathan Tan
On Tue, 20 Jun 2017 12:19:36 -0700
Brandon Williams  wrote:

> Migrate 'git_dir', 'git_common_dir', 'git_object_dir', 'git_index_file',
> 'git_graft_file', and 'namespace' to be stored in 'the_repository'.
> 
> Signed-off-by: Brandon Williams 
> ---
>  cache.h   |  1 -
>  environment.c | 58 +-
>  path.c| 11 ++-
>  setup.c   | 17 +++--
>  4 files changed, 34 insertions(+), 53 deletions(-)
> 
> diff --git a/cache.h b/cache.h
> index 7c81749a9..cd64cbc81 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -771,7 +771,6 @@ extern int core_apply_sparse_checkout;
>  extern int precomposed_unicode;
>  extern int protect_hfs;
>  extern int protect_ntfs;
> -extern int git_db_env, git_index_env, git_graft_env, git_common_dir_env;

In the commit message, it is probably worth mentioning commit 557bd83
which added these fields to attempt rewriting a path in do_git_path()
only if the appropriate _env flag is set, and that this patch removes
this optimization.


Re: [PATCH v3 04/20] repository: introduce the repository object

2017-06-20 Thread Jonathan Tan
On Tue, 20 Jun 2017 12:19:35 -0700
Brandon Williams  wrote:

> Introduce the repository object 'struct repository' which can be used to
> hold all state pertaining to a git repository.
> 
> Some of the benefits of object-ifying a repository are:
> 
>   1. Make the code base more readable and easier to reason about.
> 
>   2. Allow for working on multiple repositories, specifically
>  submodules, within the same process.  Currently the process for
>  working on a submodule involves setting up an argv_array of options
>  for a particular command and then launching a child process to
>  execute the command in the context of the submodule.  This is
>  clunky and can require lots of little hacks in order to ensure
>  correctness.  Ideally it would be nice to simply pass a repository
>  and an options struct to a command.
> 
>   3. Eliminating reliance on global state will make it easier to
>  enable the use of threading to improve performance.

These would indeed be nice to have.

> +/* called after setting gitdir */
> +static void repo_setup_env(struct repository *repo)
> +{
> + struct strbuf sb = STRBUF_INIT;
> +
> + if (!repo->gitdir)
> + BUG("gitdir wasn't set before setting up the environment");
> +
> + repo->different_commondir = find_common_dir(&sb, repo->gitdir,
> + !repo->ignore_env);
> + repo->commondir = strbuf_detach(&sb, NULL);
> + repo->objectdir = git_path_from_env(DB_ENVIRONMENT, repo->commondir,
> + "objects", !repo->ignore_env);
> + repo->graft_file = git_path_from_env(GRAFT_ENVIRONMENT, repo->commondir,
> +  "info/grafts", !repo->ignore_env);
> + repo->index_file = git_path_from_env(INDEX_ENVIRONMENT, repo->gitdir,
> +  "index", !repo->ignore_env);
> +}

It seems that this function is only used once in repo_set_gitdir() -
could this be inlined there instead? Then you wouldn't need the comment
and the !repo->gitdir check.
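
That is, something like this (a sketch only; I'm guessing at the rest
of the repo_set_gitdir() body):

void repo_set_gitdir(struct repository *repo, const char *path)
{
	struct strbuf sb = STRBUF_INIT;

	free(repo->gitdir);
	repo->gitdir = xstrdup(path);

	/* formerly repo_setup_env(): gitdir is necessarily set by now */
	repo->different_commondir = find_common_dir(&sb, repo->gitdir,
						    !repo->ignore_env);
	repo->commondir = strbuf_detach(&sb, NULL);
	repo->objectdir = git_path_from_env(DB_ENVIRONMENT, repo->commondir,
					    "objects", !repo->ignore_env);
	repo->graft_file = git_path_from_env(GRAFT_ENVIRONMENT, repo->commondir,
					     "info/grafts", !repo->ignore_env);
	repo->index_file = git_path_from_env(INDEX_ENVIRONMENT, repo->gitdir,
					     "index", !repo->ignore_env);
}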

> +static int verify_repo_format(struct repository_format *format,
> +   const char *commondir)
> +{
> + int ret = 0;
> + struct strbuf sb = STRBUF_INIT;
> +
> + strbuf_addf(&sb, "%s/config", commondir);
> + read_repository_format(format, sb.buf);
> + strbuf_reset(&sb);
> +
> + if (verify_repository_format(format, &sb) < 0) {
> + warning("%s", sb.buf);
> + ret = -1;
> + }
> +
> + strbuf_release(&sb);
> + return ret;
> +}

This function is confusingly named - firstly, there is already a
verify_repository_format(), and secondly, this function both reads and
verifies it.

> +void repo_clear(struct repository *repo)
> +{
> + free(repo->gitdir);
> + repo->gitdir = NULL;
> + free(repo->commondir);
> + repo->commondir = NULL;
> + free(repo->objectdir);
> + repo->objectdir = NULL;
> + free(repo->graft_file);
> + repo->graft_file = NULL;
> + free(repo->index_file);
> + repo->index_file = NULL;
> + free(repo->worktree);
> + repo->worktree = NULL;
> +
> + memset(repo, 0, sizeof(*repo));
> +}

If you're going to memset, you probably don't need to set everything to
NULL.
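
I.e. (sketch):

void repo_clear(struct repository *repo)
{
	free(repo->gitdir);
	free(repo->commondir);
	free(repo->objectdir);
	free(repo->graft_file);
	free(repo->index_file);
	free(repo->worktree);

	/* one memset resets every field, including the pointers above */
	memset(repo, 0, sizeof(*repo));
}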

> + /* Configurations */
> + /*
> +  * Bit used during initialization to indicate if repository state (like
> +  * the location of the 'objectdir') should be read from the
> +  * environment.  By default this bit will be set at the beginning of
> +  * 'repo_init()' so that all repositories will ignore the environment.
> +  * The exception to this is 'the_repository', which doesn't go through
> +  * the normal 'repo_init()' process.
> +  */
> + unsigned ignore_env:1;

If this is only used during initialization, could this be passed as a
separate parameter internally instead of having it here?


Re: [PATCH 22/26] diff.c: color moved lines differently

2017-06-20 Thread Jonathan Tan
I just glanced through this file, because it seems similar to the
versions I have previously reviewed.

I'll skip patches 23 onwards in this round of review because (i) I would
be happy if just patches 1-22 were included in the tree and (ii) those
patches might end up changing anyway because of review comments in the
prior patches.

On Mon, 19 Jun 2017 19:48:12 -0700
Stefan Beller  wrote:

> +/* Find blocks of moved code, delegate actual coloring decision to helper */
> +static void mark_color_as_moved(struct diff_options *o,
> + struct hashmap *add_lines,
> + struct hashmap *del_lines)
> +{

[snip]

> + if (flipped_block)
> + l->flags |= DIFF_SYMBOL_MOVED_LINE_ZEBRA;

This should probably be DIFF_SYMBOL_MOVED_LINE_ALT. "Zebra" refers to
both the stripes, not just the alternate stripe.


Re: [PATCH 15/26] submodule.c: migrate diff output to use emit_diff_symbol

2017-06-20 Thread Jonathan Tan
On Mon, 19 Jun 2017 19:48:05 -0700
Stefan Beller  wrote:

> As the submodule process is no longer attached to the same stdout as
> the superprojects process, we need to pass coloring explicitly.

I found this confusing - what difference does the stdout make? If they
were the same stdout, one process could set a color for the other
process to follow, but it wouldn't be able to affect color changes
inside the output, right?

> Remove the colors from the function signatures, as all the coloring
> decisions will be made either inside the child process or the final
> emit_diff_symbol.

This seems to be of the same pattern as the others. I would write this
as:

Create diff_emit_submodule_.* functions that make its own coloring
decisions and migrate submodule.c to use them. This centralizes
diff output, including coloring information, in one file.

or something like that.


Re: [PATCH 11/26] diff.c: emit_diff_symbol learns DIFF_SYMBOL_FILEPAIR

2017-06-20 Thread Jonathan Tan
On Mon, 19 Jun 2017 19:48:01 -0700
Stefan Beller  wrote:

> @@ -676,6 +677,14 @@ static void emit_diff_symbol(struct diff_options *o, 
> enum diff_symbol s,
>   }
>   emit_line(o, context, reset, line, len);
>   break;
> + case DIFF_SYMBOL_FILEPAIR:
> + meta = diff_get_color_opt(o, DIFF_METAINFO);
> + reset = diff_get_color_opt(o, DIFF_RESET);
> + fprintf(o->file, "%s%s%s%s%s%s\n", diff_line_prefix(o), meta,
> + flags ? "+++ " : "--- ",

I think that a constant should be defined for this flag, just like for
content lines. If not it is not immediately obvious that a flag should
be set for the +++ lines.

> + line, reset,
> + strchr(line, ' ') ? "\t" : "");
> + break;
>   default:
>   die("BUG: unknown diff symbol");
>   }


Re: [RFC/PATCH] submodules: overhaul documentation

2017-06-20 Thread Jonathan Tan
On Wed,  7 Jun 2017 11:53:54 -0700
Stefan Beller  wrote:

[snip]

> +DESCRIPTION
> +---
> +
> +A submodule is another Git repository tracked in a subdirectory of your
> +repository. The tracked repository has its own history, which does not
> +interfere with the history of the current repository.
> +
> +Submodules are composed from a so-called `gitlink` tree entry
> +in the main repository that refers to a particular commit object
> +within the inner repository.
> +
> +Additionally to the gitlink entry the `.gitmodules` file (see
> +linkgit:gitmodules[5]) at the root of the source tree contains
> +information needed for submodules. The only required information
> +is the path setting, which establishes a logical name for the submodule.

I know that this was copied over, but this is confusing to me, so it
might be worthwhile to change it. In particular, `gitlink` is, as far as
I know, the type of the child object, not any sort of tree entry. I
would rewrite this as:

A submodule consists of a tracking subdirectory and an entry in the
`.gitmodules` file (see linkgit:gitmodules[5]).

The tracking subdirectory appears in the main repository as a
`gitlink` object (instead of a `tree` object). The parent of the
tracking subdirectory links to this `gitlink` object through its
hash, just like linking to a `tree` or `blob`. This `gitlink` object
contains the hash of a particular commit object of the submodule.

The entry in the `.gitmodules` file establishes the name of the
submodule (to be used as a reference by other entries and commands)
and references the tracking subdirectory as follows:

submodule.my_submodule_name.path = path/to/my_submodule

There might also be a need to mention that when the submodule is
populated (or whatever the right term is), the tracking subdirectory in
the working tree contains both the submodule's working tree and the
submodule's Git directory.

> +The usual git configuration (see linkgit:git-config[1]) can be used to
> +override settings given by the `.gitmodules` file.

Not sure if this is relevant here.

> +Submodules can be used for two different use cases:

A creative person might come up with more, so this might be better as:

Submodules can be used for at least these use cases:

> +1. Using another project that stands on its own.
> +  When you want to use a third party library, submodules allow you to
> +  have a clean history for your own project as well as for the library.
> +  This also allows for updating the third party library as needed.

Probably better as:

1. Using another project while maintaining independent history.
Submodules allow you to contain the working tree of another project
within your own working tree while keeping the history of both
projects separate. Also, since submodules are fixed to a hash, the
other project can be independently developed without affecting the
parent project, allowing the parent project to fix itself to new
versions only whenever desired.

> +2. Artificially split a (logically single) project into multiple
> +   repositories and tying them back together. This can be used to
> +   overcome deficiences in the data model of Git, such as:

This should match the gerund used in point 1:

2. Splitting a (logically single) project into multiple
repositories. This can be used to overcome deficiencies in the data
model of Git, such as:

> +* To have finer grained access control.
> +  The design principles of Git do not allow for partial repositories to be
> +  checked out or transferred. A repository is the smallest unit that a user
> +  can be given access to. Submodules are separate repositories, such that
> +  you can restrict access to parts of your project via the use of submodules.

Not sure about this point - if the project is logically single, you
would probably need to see the entire project. If this is about
different teams independently working on different subcomponents, this
seems more like point 1 (inclusion of other projects).

> +* In its current form Git scales up poorly for very large repositories that
> +  change a lot, as the history grows very large. For that you may want to 
> look
> +  at shallow clone, sparse checkout, or git-LFS.
> +  However you can also use submodules to e.g. hold large binary assets
> +  and these repositories are then shallowly cloned such that you do not
> +  have a large history locally.
> +
> +The data model
> +--
> +
> +A submodule can be considered its own autonomous repository, that has a
> +worktree and a git directory at a different place than the superproject.

Isn't the worktree inside the superproject's worktree? I would write
this as:

A submodule is its own repository, having its own working tree and
Git directory.

> +The superproject only records the commit object name in its tree, such that
> +any other information, e.g. where to obtain a copy 

[PATCH v4 8/8] sha1_file: refactor has_sha1_file_with_flags

2017-06-19 Thread Jonathan Tan
has_sha1_file_with_flags() implements many mechanisms in common with
sha1_object_info_extended(). Make has_sha1_file_with_flags() a
convenience function for sha1_object_info_extended() instead.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 builtin/fetch.c  | 10 ++
 builtin/index-pack.c |  3 ++-
 cache.h  | 11 +++
 sha1_file.c  | 13 +++--
 4 files changed, 14 insertions(+), 23 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 47708451b..96d5146c4 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -242,9 +242,11 @@ static void find_non_local_tags(struct transport 
*transport,
 */
if (ends_with(ref->name, "^{}")) {
if (item &&
-   !has_object_file_with_flags(&ref->old_oid, HAS_SHA1_QUICK) &&
+   !has_object_file_with_flags(&ref->old_oid,
+   OBJECT_INFO_QUICK) &&
!will_fetch(head, ref->old_oid.hash) &&
-   !has_sha1_file_with_flags(item->util, 
HAS_SHA1_QUICK) &&
+   !has_sha1_file_with_flags(item->util,
+ OBJECT_INFO_QUICK) &&
!will_fetch(head, item->util))
item->util = NULL;
item = NULL;
@@ -258,7 +260,7 @@ static void find_non_local_tags(struct transport *transport,
 * fetch.
 */
if (item &&
-   !has_sha1_file_with_flags(item->util, HAS_SHA1_QUICK) &&
+   !has_sha1_file_with_flags(item->util, OBJECT_INFO_QUICK) &&
!will_fetch(head, item->util))
item->util = NULL;
 
@@ -279,7 +281,7 @@ static void find_non_local_tags(struct transport *transport,
 * checked to see if it needs fetching.
 */
if (item &&
-   !has_sha1_file_with_flags(item->util, HAS_SHA1_QUICK) &&
+   !has_sha1_file_with_flags(item->util, OBJECT_INFO_QUICK) &&
!will_fetch(head, item->util))
item->util = NULL;
 
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 04b9dcaf0..587bc80c9 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -794,7 +794,8 @@ static void sha1_object(const void *data, struct 
object_entry *obj_entry,
 
if (startup_info->have_repository) {
read_lock();
-   collision_test_needed = has_sha1_file_with_flags(oid->hash, 
HAS_SHA1_QUICK);
+   collision_test_needed =
+   has_sha1_file_with_flags(oid->hash, OBJECT_INFO_QUICK);
read_unlock();
}
 
diff --git a/cache.h b/cache.h
index 2e1cc3fe2..387694b25 100644
--- a/cache.h
+++ b/cache.h
@@ -1268,15 +1268,10 @@ int read_loose_object(const char *path,
  void **contents);
 
 /*
- * Return true iff we have an object named sha1, whether local or in
- * an alternate object database, and whether packed or loose.  This
- * function does not respect replace references.
- *
- * If the QUICK flag is set, do not re-check the pack directory
- * when we cannot find the object (this means we may give a false
- * negative answer if another process is simultaneously repacking).
+ * Convenience for sha1_object_info_extended() with a blank struct
+ * object_info. OBJECT_INFO_SKIP_CACHED is automatically set; pass
+ * nonzero flags to also set other flags.
  */
-#define HAS_SHA1_QUICK 0x1
 extern int has_sha1_file_with_flags(const unsigned char *sha1, int flags);
 static inline int has_sha1_file(const unsigned char *sha1)
 {
diff --git a/sha1_file.c b/sha1_file.c
index 68e3a3400..20db9b510 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3491,18 +3491,11 @@ int has_sha1_pack(const unsigned char *sha1)
 
 int has_sha1_file_with_flags(const unsigned char *sha1, int flags)
 {
-   struct pack_entry e;
-
+   static struct object_info blank;
if (!startup_info->have_repository)
return 0;
-   if (find_pack_entry(sha1, &e))
-   return 1;
-   if (has_loose_object(sha1))
-   return 1;
-   if (flags & HAS_SHA1_QUICK)
-   return 0;
-   reprepare_packed_git();
-   return find_pack_entry(sha1, &e);
+   return !sha1_object_info_extended(sha1, &blank,
+ flags | OBJECT_INFO_SKIP_CACHED);
 }
 
 int has_object_file(const struct object_id *oid)
-- 
2.13.1.611.g7e3b11ae1-goog



[PATCH v4 6/8] sha1_file: improve sha1_object_info_extended

2017-06-19 Thread Jonathan Tan
Improve sha1_object_info_extended() by supporting additional flags. This
allows has_sha1_file_with_flags() to be modified to use
sha1_object_info_extended() in a subsequent patch.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 cache.h |  4 
 sha1_file.c | 43 ---
 2 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/cache.h b/cache.h
index 48aea923b..7cf2ca466 100644
--- a/cache.h
+++ b/cache.h
@@ -1863,6 +1863,10 @@ struct object_info {
 #define OBJECT_INFO_LOOKUP_REPLACE 1
 /* Allow reading from a loose object file of unknown/bogus type */
 #define OBJECT_INFO_ALLOW_UNKNOWN_TYPE 2
+/* Do not check cached storage */
+#define OBJECT_INFO_SKIP_CACHED 4
+/* Do not retry packed storage after checking packed and loose storage */
+#define OBJECT_INFO_QUICK 8
 extern int sha1_object_info_extended(const unsigned char *, struct object_info 
*, unsigned flags);
 extern int packed_object_info(struct packed_git *pack, off_t offset, struct 
object_info *);
 
diff --git a/sha1_file.c b/sha1_file.c
index 4d5033c48..24f7a146e 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2977,29 +2977,30 @@ static int sha1_loose_object_info(const unsigned char 
*sha1,
 
 int sha1_object_info_extended(const unsigned char *sha1, struct object_info 
*oi, unsigned flags)
 {
-   struct cached_object *co;
struct pack_entry e;
int rtype;
const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
lookup_replace_object(sha1) :
sha1;
 
-   co = find_cached_object(real);
-   if (co) {
-   if (oi->typep)
-   *(oi->typep) = co->type;
-   if (oi->sizep)
-   *(oi->sizep) = co->size;
-   if (oi->disk_sizep)
-   *(oi->disk_sizep) = 0;
-   if (oi->delta_base_sha1)
-   hashclr(oi->delta_base_sha1);
-   if (oi->typename)
-   strbuf_addstr(oi->typename, typename(co->type));
-   if (oi->contentp)
-   *oi->contentp = xmemdupz(co->buf, co->size);
-   oi->whence = OI_CACHED;
-   return 0;
+   if (!(flags & OBJECT_INFO_SKIP_CACHED)) {
+   struct cached_object *co = find_cached_object(real);
+   if (co) {
+   if (oi->typep)
+   *(oi->typep) = co->type;
+   if (oi->sizep)
+   *(oi->sizep) = co->size;
+   if (oi->disk_sizep)
+   *(oi->disk_sizep) = 0;
+   if (oi->delta_base_sha1)
+   hashclr(oi->delta_base_sha1);
+   if (oi->typename)
+   strbuf_addstr(oi->typename, typename(co->type));
+   if (oi->contentp)
+   *oi->contentp = xmemdupz(co->buf, co->size);
+   oi->whence = OI_CACHED;
+   return 0;
+   }
}
 
if (!find_pack_entry(real, &e)) {
@@ -3010,9 +3011,13 @@ int sha1_object_info_extended(const unsigned char *sha1, 
struct object_info *oi,
}
 
/* Not a loose object; someone else may have just packed it. */
-   reprepare_packed_git();
-   if (!find_pack_entry(real, &e))
+   if (flags & OBJECT_INFO_QUICK) {
return -1;
+   } else {
+   reprepare_packed_git();
+   if (!find_pack_entry(real, &e))
+   return -1;
+   }
}
 
rtype = packed_object_info(e.p, e.offset, oi);
-- 
2.13.1.611.g7e3b11ae1-goog



[PATCH v4 7/8] sha1_file: do not access pack if unneeded

2017-06-19 Thread Jonathan Tan
Add an option to struct object_info to suppress population of additional
information about a packed object if unneeded. This allows an
optimization in which sha1_object_info_extended() does not even need to
access the pack if no information besides provenance is requested. A
subsequent patch will make use of this optimization.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 cache.h |  1 +
 sha1_file.c | 17 +
 streaming.c |  1 +
 3 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/cache.h b/cache.h
index 7cf2ca466..2e1cc3fe2 100644
--- a/cache.h
+++ b/cache.h
@@ -1828,6 +1828,7 @@ struct object_info {
unsigned char *delta_base_sha1;
struct strbuf *typename;
void **contentp;
+   unsigned populate_u : 1;
 
/* Response */
enum {
diff --git a/sha1_file.c b/sha1_file.c
index 24f7a146e..68e3a3400 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3020,6 +3020,13 @@ int sha1_object_info_extended(const unsigned char *sha1, 
struct object_info *oi,
}
}
 
+   if (!oi->typep && !oi->sizep && !oi->disk_sizep &&
+   !oi->delta_base_sha1 && !oi->typename && !oi->contentp &&
+   !oi->populate_u) {
+   oi->whence = OI_PACKED;
+   return 0;
+   }
+
rtype = packed_object_info(e.p, e.offset, oi);
if (rtype < 0) {
mark_bad_packed_object(e.p, real);
@@ -3028,10 +3035,12 @@ int sha1_object_info_extended(const unsigned char 
*sha1, struct object_info *oi,
oi->whence = OI_DBCACHED;
} else {
oi->whence = OI_PACKED;
-   oi->u.packed.offset = e.offset;
-   oi->u.packed.pack = e.p;
-   oi->u.packed.is_delta = (rtype == OBJ_REF_DELTA ||
-rtype == OBJ_OFS_DELTA);
+   if (oi->populate_u) {
+   oi->u.packed.offset = e.offset;
+   oi->u.packed.pack = e.p;
+   oi->u.packed.is_delta = (rtype == OBJ_REF_DELTA ||
+rtype == OBJ_OFS_DELTA);
+   }
}
 
return 0;
diff --git a/streaming.c b/streaming.c
index 9afa66b8b..deebc18a8 100644
--- a/streaming.c
+++ b/streaming.c
@@ -113,6 +113,7 @@ static enum input_source istream_source(const unsigned char 
*sha1,
 
oi->typep = type;
oi->sizep = &size;
+   oi->populate_u = 1;
status = sha1_object_info_extended(sha1, oi, 0);
if (status < 0)
return stream_error;
-- 
2.13.1.611.g7e3b11ae1-goog



[PATCH v4 4/8] sha1_file: move delta base cache code up

2017-06-19 Thread Jonathan Tan
In a subsequent patch, packed_object_info() will be modified to use the
delta base cache, so move the relevant code to before
packed_object_info().

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 sha1_file.c | 220 ++--
 1 file changed, 110 insertions(+), 110 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index ae44b32f3..a7be45efe 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2239,116 +2239,6 @@ static enum object_type packed_to_object_type(struct 
packed_git *p,
goto out;
 }
 
-int packed_object_info(struct packed_git *p, off_t obj_offset,
-  struct object_info *oi)
-{
-   struct pack_window *w_curs = NULL;
-   unsigned long size;
-   off_t curpos = obj_offset;
-   enum object_type type;
-
-   /*
-* We always get the representation type, but only convert it to
-* a "real" type later if the caller is interested.
-*/
-   type = unpack_object_header(p, &w_curs, &curpos, &size);
-
-   if (oi->sizep) {
-   if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
-   off_t tmp_pos = curpos;
-   off_t base_offset = get_delta_base(p, &w_curs, &tmp_pos,
-  type, obj_offset);
-   if (!base_offset) {
-   type = OBJ_BAD;
-   goto out;
-   }
-   *oi->sizep = get_size_from_delta(p, &w_curs, tmp_pos);
-   if (*oi->sizep == 0) {
-   type = OBJ_BAD;
-   goto out;
-   }
-   } else {
-   *oi->sizep = size;
-   }
-   }
-
-   if (oi->disk_sizep) {
-   struct revindex_entry *revidx = find_pack_revindex(p, 
obj_offset);
-   *oi->disk_sizep = revidx[1].offset - obj_offset;
-   }
-
-   if (oi->typep || oi->typename) {
-   enum object_type ptot;
-   ptot = packed_to_object_type(p, obj_offset, type, &w_curs, curpos);
-   if (oi->typep)
-   *oi->typep = ptot;
-   if (oi->typename) {
-   const char *tn = typename(ptot);
-   if (tn)
-   strbuf_addstr(oi->typename, tn);
-   }
-   if (ptot < 0) {
-   type = OBJ_BAD;
-   goto out;
-   }
-   }
-
-   if (oi->delta_base_sha1) {
-   if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
-   const unsigned char *base;
-
-   base = get_delta_base_sha1(p, &w_curs, curpos,
-  type, obj_offset);
-   if (!base) {
-   type = OBJ_BAD;
-   goto out;
-   }
-
-   hashcpy(oi->delta_base_sha1, base);
-   } else
-   hashclr(oi->delta_base_sha1);
-   }
-
-out:
-   unuse_pack(&w_curs);
-   return type;
-}
-
-static void *unpack_compressed_entry(struct packed_git *p,
-   struct pack_window **w_curs,
-   off_t curpos,
-   unsigned long size)
-{
-   int st;
-   git_zstream stream;
-   unsigned char *buffer, *in;
-
-   buffer = xmallocz_gently(size);
-   if (!buffer)
-   return NULL;
-   memset(&stream, 0, sizeof(stream));
-   stream.next_out = buffer;
-   stream.avail_out = size + 1;
-
-   git_inflate_init(&stream);
-   do {
-   in = use_pack(p, w_curs, curpos, &stream.avail_in);
-   stream.next_in = in;
-   st = git_inflate(&stream, Z_FINISH);
-   if (!stream.avail_out)
-   break; /* the payload is larger than it should be */
-   curpos += stream.next_in - in;
-   } while (st == Z_OK || st == Z_BUF_ERROR);
-   git_inflate_end(&stream);
-   if ((st != Z_STREAM_END) || stream.total_out != size) {
-   free(buffer);
-   return NULL;
-   }
-
-   return buffer;
-}
-
 static struct hashmap delta_base_cache;
 static size_t delta_base_cached;
 
@@ -2486,6 +2376,116 @@ static void add_delta_base_cache(struct packed_git *p, 
off_t base_offset,
hashmap_add(&delta_base_cache, ent);
 }
 
+int packed_object_info(struct packed_git *p, off_t obj_offset,
+  struct object_info *oi)
+{
+   struct pack_window *w_curs = NULL;
+   unsigned long size;
+   off_t curpos = obj_offset;
+   enum object_type type;
+
+   /*
+* We always get the representa

[PATCH v4 5/8] sha1_file: refactor read_object

2017-06-19 Thread Jonathan Tan
read_object() and sha1_object_info_extended() both implement mechanisms
such as object replacement, retrying the packed store after failing to
find the object in the packed store then the loose store, and being able
to mark a packed object as bad and then retrying the whole process.
Consolidating these mechanisms would be a great help to maintainability.

Therefore, consolidate them by extending sha1_object_info_extended() to
support the functionality needed, and then modifying read_object() to
use sha1_object_info_extended().

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 cache.h |  1 +
 sha1_file.c | 84 ++---
 2 files changed, 43 insertions(+), 42 deletions(-)

diff --git a/cache.h b/cache.h
index a3631b237..48aea923b 100644
--- a/cache.h
+++ b/cache.h
@@ -1827,6 +1827,7 @@ struct object_info {
off_t *disk_sizep;
unsigned char *delta_base_sha1;
struct strbuf *typename;
+   void **contentp;
 
/* Response */
enum {
diff --git a/sha1_file.c b/sha1_file.c
index a7be45efe..4d5033c48 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2005,19 +2005,6 @@ int parse_sha1_header(const char *hdr, unsigned long 
*sizep)
return parse_sha1_header_extended(hdr, &oi, 0);
 }
 
-static void *unpack_sha1_file(void *map, unsigned long mapsize, enum 
object_type *type, unsigned long *size, const unsigned char *sha1)
-{
-   int ret;
-   git_zstream stream;
-   char hdr[8192];
-
-   ret = unpack_sha1_header(&stream, map, mapsize, hdr, sizeof(hdr));
-   if (ret < Z_OK || (*type = parse_sha1_header(hdr, size)) < 0)
-   return NULL;
-
-   return unpack_sha1_rest(&stream, hdr, *size, sha1);
-}
-
 unsigned long get_size_from_delta(struct packed_git *p,
  struct pack_window **w_curs,
  off_t curpos)
@@ -2326,8 +2313,10 @@ static void *cache_or_unpack_entry(struct packed_git *p, 
off_t base_offset,
if (!ent)
return unpack_entry(p, base_offset, type, base_size);
 
-   *type = ent->type;
-   *base_size = ent->size;
+   if (type)
+   *type = ent->type;
+   if (base_size)
+   *base_size = ent->size;
return xmemdupz(ent->data, ent->size);
 }
 
@@ -2388,9 +2377,16 @@ int packed_object_info(struct packed_git *p, off_t 
obj_offset,
 * We always get the representation type, but only convert it to
 * a "real" type later if the caller is interested.
 */
-   type = unpack_object_header(p, &w_curs, &curpos, &size);
+   if (oi->contentp) {
+   *oi->contentp = cache_or_unpack_entry(p, obj_offset, oi->sizep, &type);
+   if (!*oi->contentp)
+   type = OBJ_BAD;
+   } else {
+   type = unpack_object_header(p, &w_curs, &curpos, &size);
+   }
 
-   if (oi->sizep) {
+   if (!oi->contentp && oi->sizep) {
if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
off_t tmp_pos = curpos;
off_t base_offset = get_delta_base(p, &w_curs, &tmp_pos,
@@ -2679,8 +2675,10 @@ void *unpack_entry(struct packed_git *p, off_t 
obj_offset,
free(external_base);
}
 
-   *final_type = type;
-   *final_size = size;
+   if (final_type)
+   *final_type = type;
+   if (final_size)
+   *final_size = size;
 
unuse_pack(&w_curs);
 
@@ -2914,6 +2912,7 @@ static int sha1_loose_object_info(const unsigned char 
*sha1,
git_zstream stream;
char hdr[32];
struct strbuf hdrbuf = STRBUF_INIT;
+   unsigned long size_scratch;
 
if (oi->delta_base_sha1)
hashclr(oi->delta_base_sha1);
@@ -2926,7 +2925,7 @@ static int sha1_loose_object_info(const unsigned char 
*sha1,
 * return value implicitly indicates whether the
 * object even exists.
 */
-   if (!oi->typep && !oi->typename && !oi->sizep) {
+   if (!oi->typep && !oi->typename && !oi->sizep && !oi->contentp) {
const char *path;
struct stat st;
if (stat_sha1_file(sha1, &st, &path) < 0)
@@ -2939,6 +2938,10 @@ static int sha1_loose_object_info(const unsigned char 
*sha1,
map = map_sha1_file(sha1, &mapsize);
if (!map)
return -1;
+
+   if (!oi->sizep)
+   oi->sizep = &size_scratch;
+
if (oi->disk_sizep)
*oi->disk_sizep = mapsize;
if ((flags & OBJECT_INFO_ALLOW_UNKNOWN_TYPE)) {
@@ -2956,10 +2959,18 @@ static int sha1_loose_object_info(const unsigned char 
*sha1,
   sha1_to_hex(sha1));
} else if ((status = parse_sha1_header_extended(hdr, oi, flags)

[PATCH v4 0/8] Improvements to sha1_file

2017-06-19 Thread Jonathan Tan
Thanks, Peff and Junio for your comments. Here's an updated version and
some replies to comments.

> I also found this quite subtle. However, I don't think that
> has_sha1_file() actually freshens. It's a bit confusing because
> has_loose_object() reuses the check_and_freshen() function to do the
> lookup, but it actually sets the "freshen" flag to false.
> 
> That's why in 33d4221c7 (write_sha1_file: freshen existing objects,
> 2014-10-15), which introduced the freshening functions and converted
> has_loose_object(), the actual write_sha1_file() function switched to
> using the freshening functions directly (and obviously sets the freshen
> parameter to true).

Good catch.

> I actually think all of that infrastructure could become part of
> Jonathan's consolidated lookup, too. We would just need:
> 
>   1. A QUICK flag to avoid re-reading objects/pack when we don't find
>  anything (which it looks like he already has).
> 
>   2. A FRESHEN flag to update the mtime of any item that we do find.
> 
> I suspect we may also need something like ONLY_LOOSE and ONLY_NONLOCAL
> to meet all the callers (e.g., has_loose_object_nonlocal). Those should
> be easy to implement, I'd think.

For things like FRESHEN, ONLY_LOOSE, and ONLY_NONLOCAL, I was thinking
that I would like to restrict these patches to only handle the cases
that are agnostic to the type of storage (in preparation for missing
blob handling patches).
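
(For concreteness, the kind of flags being deferred would look roughly
like the following - the names and bit values are purely illustrative
and not part of this series:)

/* hypothetical future flags - none of these exist in this series */
#define OBJECT_INFO_FRESHEN       16 /* update mtime of any file found */
#define OBJECT_INFO_ONLY_LOOSE    32 /* consider loose storage only */
#define OBJECT_INFO_ONLY_NONLOCAL 64 /* skip the local object store */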

> I had the same thoughts (both on the name and the "vocabularies"). IMHO
> we should consider allocating the bits from the same set. There's only
> one HAS_SHA1 flag, and it has an exact match in OBJECT_INFO_QUICK.

Agreed - in this patch set, I have also consolidated the relevant flags,
including LOOKUP_REPLACE_OBJECT and LOOKUP_UNKNOWN_OBJECT.

In addition, Junio has mentioned the potential confusion in behavior
between a NULL and an empty struct passed to
sha1_object_info_extended(). In this patch set, I require non-NULL, and
have added an optimization that avoids accessing the pack in certain
situations, but this optimization requires checking a lot of fields. Let
me know what you think.

Jonathan Tan (8):
  sha1_file: teach packed_object_info about typename
  sha1_file: rename LOOKUP_UNKNOWN_OBJECT
  sha1_file: rename LOOKUP_REPLACE_OBJECT
  sha1_file: move delta base cache code up
  sha1_file: refactor read_object
  sha1_file: improve sha1_object_info_extended
  sha1_file: do not access pack if unneeded
  sha1_file: refactor has_sha1_file_with_flags

 builtin/cat-file.c   |   7 +-
 builtin/fetch.c  |  10 +-
 builtin/index-pack.c |   3 +-
 cache.h  |  37 +++--
 sha1_file.c  | 391 ++-
 streaming.c  |   1 +
 6 files changed, 228 insertions(+), 221 deletions(-)

-- 
2.13.1.611.g7e3b11ae1-goog



[PATCH v4 1/8] sha1_file: teach packed_object_info about typename

2017-06-19 Thread Jonathan Tan
In commit 46f0344 ("sha1_file: support reading from a loose object of
unknown type", 2015-05-06), "struct object_info" gained a "typename"
field that could represent a type name from a loose object file, whether
valid or invalid, as opposed to the existing "typep" which could only
represent valid types. Some relatively complex manipulations were added
to avoid breaking packed_object_info() without modifying it, but it is
much easier to just teach packed_object_info() about the new field.
Therefore, teach packed_object_info() as described above.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 sha1_file.c | 29 -
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 59a4ed2ed..a52b27541 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2277,9 +2277,18 @@ int packed_object_info(struct packed_git *p, off_t 
obj_offset,
*oi->disk_sizep = revidx[1].offset - obj_offset;
}
 
-   if (oi->typep) {
-   *oi->typep = packed_to_object_type(p, obj_offset, type, &w_curs, curpos);
-   if (*oi->typep < 0) {
+   if (oi->typep || oi->typename) {
+   enum object_type ptot;
+   ptot = packed_to_object_type(p, obj_offset, type, &w_curs, curpos);
+   if (oi->typep)
+   *oi->typep = ptot;
+   if (oi->typename) {
+   const char *tn = typename(ptot);
+   if (tn)
+   strbuf_addstr(oi->typename, tn);
+   }
+   if (ptot < 0) {
type = OBJ_BAD;
goto out;
}
@@ -2960,7 +2969,6 @@ int sha1_object_info_extended(const unsigned char *sha1, 
struct object_info *oi,
struct cached_object *co;
struct pack_entry e;
int rtype;
-   enum object_type real_type;
const unsigned char *real = lookup_replace_object_extended(sha1, flags);
 
co = find_cached_object(real);
@@ -2992,18 +3000,9 @@ int sha1_object_info_extended(const unsigned char *sha1, 
struct object_info *oi,
return -1;
}
 
-   /*
-* packed_object_info() does not follow the delta chain to
-* find out the real type, unless it is given oi->typep.
-*/
-   if (oi->typename && !oi->typep)
-   oi->typep = &real_type;
-
rtype = packed_object_info(e.p, e.offset, oi);
if (rtype < 0) {
mark_bad_packed_object(e.p, real);
-   if (oi->typep == &real_type)
-   oi->typep = NULL;
return sha1_object_info_extended(real, oi, 0);
} else if (in_delta_base_cache(e.p, e.offset)) {
oi->whence = OI_DBCACHED;
@@ -3014,10 +3013,6 @@ int sha1_object_info_extended(const unsigned char *sha1, 
struct object_info *oi,
oi->u.packed.is_delta = (rtype == OBJ_REF_DELTA ||
 rtype == OBJ_OFS_DELTA);
}
-   if (oi->typename)
-   strbuf_addstr(oi->typename, typename(*oi->typep));
-   if (oi->typep == &real_type)
-   oi->typep = NULL;
 
return 0;
 }
-- 
2.13.1.611.g7e3b11ae1-goog



[PATCH v4 2/8] sha1_file: rename LOOKUP_UNKNOWN_OBJECT

2017-06-19 Thread Jonathan Tan
The LOOKUP_UNKNOWN_OBJECT flag was introduced in commit 46f0344
("sha1_file: support reading from a loose object of unknown type",
2015-05-03) in order to support a feature in cat-file subsequently
introduced in commit 39e4ae3 ("cat-file: teach cat-file a
'--allow-unknown-type' option", 2015-05-03). Despite its name and
location in cache.h, this flag is used neither in
read_sha1_file_extended() nor in any of the lookup functions, but used
only in sha1_object_info_extended().

Therefore rename this flag to OBJECT_INFO_ALLOW_UNKNOWN_TYPE, taking the
name of the cat-file flag that invokes this feature, and move it closer
to the declaration of sha1_object_info_extended(). Also add
documentation for this flag.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 builtin/cat-file.c | 2 +-
 cache.h| 3 ++-
 sha1_file.c| 4 ++--
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 4bffd7a2d..209374b3c 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -60,7 +60,7 @@ static int cat_one_file(int opt, const char *exp_type, const 
char *obj_name,
const char *path = force_path;
 
if (unknown_type)
-   flags |= LOOKUP_UNKNOWN_OBJECT;
+   flags |= OBJECT_INFO_ALLOW_UNKNOWN_TYPE;
 
if (get_sha1_with_context(obj_name, GET_SHA1_RECORD_PATH,
  oid.hash, &obj_context))
diff --git a/cache.h b/cache.h
index 4d92aae0e..e2ec45dfe 100644
--- a/cache.h
+++ b/cache.h
@@ -1207,7 +1207,6 @@ extern char *xdg_cache_home(const char *filename);
 
 /* object replacement */
 #define LOOKUP_REPLACE_OBJECT 1
-#define LOOKUP_UNKNOWN_OBJECT 2
 extern void *read_sha1_file_extended(const unsigned char *sha1, enum 
object_type *type, unsigned long *size, unsigned flag);
 static inline void *read_sha1_file(const unsigned char *sha1, enum object_type 
*type, unsigned long *size)
 {
@@ -1866,6 +1865,8 @@ struct object_info {
  */
 #define OBJECT_INFO_INIT {NULL}
 
+/* Allow reading from a loose object file of unknown/bogus type */
+#define OBJECT_INFO_ALLOW_UNKNOWN_TYPE 2
 extern int sha1_object_info_extended(const unsigned char *, struct object_info 
*, unsigned flags);
 extern int packed_object_info(struct packed_git *pack, off_t offset, struct 
object_info *);
 
diff --git a/sha1_file.c b/sha1_file.c
index a52b27541..ad04ea8e0 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1964,7 +1964,7 @@ static int parse_sha1_header_extended(const char *hdr, 
struct object_info *oi,
 * we're obtaining the type using '--allow-unknown-type'
 * option.
 */
-   if ((flags & LOOKUP_UNKNOWN_OBJECT) && (type < 0))
+   if ((flags & OBJECT_INFO_ALLOW_UNKNOWN_TYPE) && (type < 0))
type = 0;
else if (type < 0)
die("invalid object type");
@@ -2941,7 +2941,7 @@ static int sha1_loose_object_info(const unsigned char 
*sha1,
return -1;
if (oi->disk_sizep)
*oi->disk_sizep = mapsize;
-   if ((flags & LOOKUP_UNKNOWN_OBJECT)) {
+   if ((flags & OBJECT_INFO_ALLOW_UNKNOWN_TYPE)) {
if (unpack_sha1_header_to_strbuf(&stream, map, mapsize, hdr,
sizeof(hdr), &hdrbuf) < 0)
status = error("unable to unpack %s header with --allow-unknown-type",
   sha1_to_hex(sha1));
-- 
2.13.1.611.g7e3b11ae1-goog



[PATCH v4 3/8] sha1_file: rename LOOKUP_REPLACE_OBJECT

2017-06-19 Thread Jonathan Tan
The LOOKUP_REPLACE_OBJECT flag controls whether the
lookup_replace_object() function is invoked by
sha1_object_info_extended(), read_sha1_file_extended(), and
lookup_replace_object_extended(), but it is not immediately clear which
functions accept that flag.

Therefore restrict this flag to only sha1_object_info_extended(),
renaming it appropriately to OBJECT_INFO_LOOKUP_REPLACE and adding some
documentation. Update read_sha1_file_extended() to have a boolean
parameter instead, and delete lookup_replace_object_extended().

parse_sha1_header() also passes this flag to
parse_sha1_header_extended() since commit 46f0344 ("sha1_file: support
reading from a loose object of unknown type", 2015-05-03), but that has
had no effect since that commit. Therefore this patch also removes this
flag from that invocation.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 builtin/cat-file.c |  5 +++--
 cache.h| 17 ++---
 sha1_file.c| 13 -
 3 files changed, 17 insertions(+), 18 deletions(-)

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 209374b3c..923786c00 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -56,7 +56,7 @@ static int cat_one_file(int opt, const char *exp_type, const 
char *obj_name,
struct object_context obj_context;
struct object_info oi = OBJECT_INFO_INIT;
struct strbuf sb = STRBUF_INIT;
-   unsigned flags = LOOKUP_REPLACE_OBJECT;
+   unsigned flags = OBJECT_INFO_LOOKUP_REPLACE;
const char *path = force_path;
 
if (unknown_type)
@@ -337,7 +337,8 @@ static void batch_object_write(const char *obj_name, struct 
batch_options *opt,
struct strbuf buf = STRBUF_INIT;
 
if (!data->skip_object_info &&
-   sha1_object_info_extended(data->oid.hash, &data->info, LOOKUP_REPLACE_OBJECT) < 0) {
+   sha1_object_info_extended(data->oid.hash, &data->info,
+ OBJECT_INFO_LOOKUP_REPLACE)) {
printf("%s missing\n",
   obj_name ? obj_name : oid_to_hex(&data->oid));
fflush(stdout);
diff --git a/cache.h b/cache.h
index e2ec45dfe..a3631b237 100644
--- a/cache.h
+++ b/cache.h
@@ -1205,12 +1205,12 @@ extern char *xdg_config_home(const char *filename);
  */
 extern char *xdg_cache_home(const char *filename);
 
-/* object replacement */
-#define LOOKUP_REPLACE_OBJECT 1
-extern void *read_sha1_file_extended(const unsigned char *sha1, enum 
object_type *type, unsigned long *size, unsigned flag);
+extern void *read_sha1_file_extended(const unsigned char *sha1,
+enum object_type *type,
+unsigned long *size, int lookup_replace);
 static inline void *read_sha1_file(const unsigned char *sha1, enum object_type 
*type, unsigned long *size)
 {
-   return read_sha1_file_extended(sha1, type, size, LOOKUP_REPLACE_OBJECT);
+   return read_sha1_file_extended(sha1, type, size, 1);
 }
 
 /*
@@ -1232,13 +1232,6 @@ static inline const unsigned char 
*lookup_replace_object(const unsigned char *sh
return do_lookup_replace_object(sha1);
 }
 
-static inline const unsigned char *lookup_replace_object_extended(const 
unsigned char *sha1, unsigned flag)
-{
-   if (!(flag & LOOKUP_REPLACE_OBJECT))
-   return sha1;
-   return lookup_replace_object(sha1);
-}
-
 /* Read and unpack a sha1 file into memory, write memory to a sha1 file */
 extern int sha1_object_info(const unsigned char *, unsigned long *);
 extern int hash_sha1_file(const void *buf, unsigned long len, const char 
*type, unsigned char *sha1);
@@ -1865,6 +1858,8 @@ struct object_info {
  */
 #define OBJECT_INFO_INIT {NULL}
 
+/* Invoke lookup_replace_object() on the given hash */
+#define OBJECT_INFO_LOOKUP_REPLACE 1
 /* Allow reading from a loose object file of unknown/bogus type */
 #define OBJECT_INFO_ALLOW_UNKNOWN_TYPE 2
 extern int sha1_object_info_extended(const unsigned char *, struct object_info 
*, unsigned flags);
diff --git a/sha1_file.c b/sha1_file.c
index ad04ea8e0..ae44b32f3 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2002,7 +2002,7 @@ int parse_sha1_header(const char *hdr, unsigned long 
*sizep)
struct object_info oi = OBJECT_INFO_INIT;
 
oi.sizep = sizep;
-   return parse_sha1_header_extended(hdr, &oi, LOOKUP_REPLACE_OBJECT);
+   return parse_sha1_header_extended(hdr, &oi, 0);
 }
 
 static void *unpack_sha1_file(void *map, unsigned long mapsize, enum 
object_type *type, unsigned long *size, const unsigned char *sha1)
@@ -2969,7 +2969,9 @@ int sha1_object_info_extended(const unsigned char *sha1, 
struct object_info *oi,
struct cached_object *co;
struct pack_entry e;
int rtype;
-   const unsigned char *real = lookup_replace_object_extended(sha1, flags);
+   const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?

Re: Behavior of 'git fetch' for commit hashes

2017-06-19 Thread Jonathan Tan
On Mon, 19 Jun 2017 10:49:36 -0700
Jonathan Tan <jonathanta...@google.com> wrote:

> On Mon, 19 Jun 2017 12:09:28 +
> <eero.aalto...@vaisala.com> wrote:
> 
> > For version 2.13.3

Firstly, exactly which version of Git doesn't work? I'm assuming 2.13.1
(as written elsewhere in your e-mail), since 2.13.3 doesn't exist.

> > However, the workaround described above for git version 2.7.4 no
> > longer works. The result is always
> > fatal: Couldn't find remote ref
> 
> I'll take a look into this.

I tried to reproduce with this script, but it seems to pass even at
2.13.1:

#!/bin/bash
rm -rf ~/tmp/x &&
make --quiet &&
./git init ~/tmp/x &&
./git -C ~/tmp/x fetch --quiet ~/gitpristine/git master:foo &&
./git -C ~/tmp/x fetch ~/gitpristine/git "$(git -C ~/gitpristine/git rev-parse master^)"
exit $?

Commenting out the first fetch line produces, as expected:

error: Server does not allow request for unadvertised object 

And I have not seen the "fatal: Couldn't find remote ref" error you
describe.

As Stefan alluded to in his earlier e-mail [1], I made a change recently
to how "git fetch" handles literal SHA-1s in commit fdb69d3
("fetch-pack: always allow fetching of literal SHA1s", 2017-05-15). In
this case, I don't think that's the cause of this issue (since that
change, if anything, makes things more permissive, not less) but that
might be a point of interest in an investigation.

[1] 
https://public-inbox.org/git/CAGZ79kbFeptKOfpaZ23yK=zw9mj0_evqpsthkkd1hscap_p...@mail.gmail.com/


Re: Behavior of 'git fetch' for commit hashes

2017-06-19 Thread Jonathan Tan
On Mon, 19 Jun 2017 12:09:28 +
<eero.aalto...@vaisala.com> wrote:

> For version 2.7.4
> =
> Git exits with exit code 1.
> 
> However, if I first do 'git fetch <branch>', then 'git fetch <sha1>' will
> also work
> 
>  * branch-> FETCH_HEAD

I suspect that what is happening is that 'git fetch <branch>' also
downloads the commit referenced by <sha1>, so the subsequent 'git fetch
<sha1>' is a no-op (except setting FETCH_HEAD).

> For version 2.13.3
> ==
> Git exits with exit code 128 and message
> fatal: Couldn't find remote ref
> 
> However, the workaround described above for git version 2.7.4 no
> longer works. The result is always
> fatal: Couldn't find remote ref

I'll take a look into this.

> Desired result
> ==
> Commit is in .git/FETCH_HEAD and can be checked out.
> 
> 
> I want to checkout a specific commit without creating any extra named
> remotes in the local git clone.
> 
> Finally,
> What is the expected behavior for 'git fetch' in this case?
> Is there some other way I can achieve my goals?

I'll take a look into the expected behavior, but if my assumptions are
correct, you should be able to just checkout the commit you want after
fetching the branch:

  git fetch <remote> <branch>
  git checkout <sha1>


Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes

2017-06-15 Thread Jonathan Tan
On Thu, 15 Jun 2017 16:28:24 -0400
Jeff Hostetler  wrote:

> I agree with Peff here.  I've been working on my partial/narrow/sparse
> clone/fetch ideas since my original RFC and have come to the conclusion
> that the server can do the size limiting efficiently, but we should
> leave the pathname filtering to the client.  That is, let the client
> get the commits and trees and then locally apply pattern matching,
> whether that be a sparse-checkout definition or simple ".git*"
> matching and then make a later request for the "blobs of interest".

This means that the client would need to inflate all the trees it
received, but that might not be that bad.

> Whether we "fault-in" the missing blobs or have a "fetch-blobs"
> command (like fetch-pack) to get them is open to debate.
> 
> There are concerns about the size of the requested blob-id list in a
> fetch-blobs approach, but I think there are ways to say I need all
> of the blobs referenced by the directory /foo in commit <x> (and
> optionally, that aren't present in directory /foo in commit <y>
> or in the range <x>..<y>).  (handwave)

Unless the server no longer has the relevant commit (e.g. because a
branch was rewound), but even then, even if we only sent blob hashes,
the list would probably be much smaller than the downloaded pack anyway,
so things might still be OK.


[PATCH v3 2/4] sha1_file: move delta base cache code up

2017-06-15 Thread Jonathan Tan
In a subsequent patch, packed_object_info() will be modified to use the
delta base cache, so move the relevant code to before
packed_object_info().

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 sha1_file.c | 220 ++--
 1 file changed, 110 insertions(+), 110 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index a52b27541..a38319443 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2239,116 +2239,6 @@ static enum object_type packed_to_object_type(struct 
packed_git *p,
goto out;
 }
 
-int packed_object_info(struct packed_git *p, off_t obj_offset,
-  struct object_info *oi)
-{
-   struct pack_window *w_curs = NULL;
-   unsigned long size;
-   off_t curpos = obj_offset;
-   enum object_type type;
-
-   /*
-* We always get the representation type, but only convert it to
-* a "real" type later if the caller is interested.
-*/
-   type = unpack_object_header(p, &w_curs, &curpos, &size);
-
-   if (oi->sizep) {
-   if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
-   off_t tmp_pos = curpos;
-   off_t base_offset = get_delta_base(p, &w_curs, &tmp_pos,
-  type, obj_offset);
-   if (!base_offset) {
-   type = OBJ_BAD;
-   goto out;
-   }
-   *oi->sizep = get_size_from_delta(p, &w_curs, tmp_pos);
-   if (*oi->sizep == 0) {
-   type = OBJ_BAD;
-   goto out;
-   }
-   } else {
-   *oi->sizep = size;
-   }
-   }
-
-   if (oi->disk_sizep) {
-   struct revindex_entry *revidx = find_pack_revindex(p, 
obj_offset);
-   *oi->disk_sizep = revidx[1].offset - obj_offset;
-   }
-
-   if (oi->typep || oi->typename) {
-   enum object_type ptot;
-   ptot = packed_to_object_type(p, obj_offset, type, &w_curs,
-curpos);
-   if (oi->typep)
-   *oi->typep = ptot;
-   if (oi->typename) {
-   const char *tn = typename(ptot);
-   if (tn)
-   strbuf_addstr(oi->typename, tn);
-   }
-   if (ptot < 0) {
-   type = OBJ_BAD;
-   goto out;
-   }
-   }
-
-   if (oi->delta_base_sha1) {
-   if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
-   const unsigned char *base;
-
-   base = get_delta_base_sha1(p, &w_curs, curpos,
-  type, obj_offset);
-   if (!base) {
-   type = OBJ_BAD;
-   goto out;
-   }
-
-   hashcpy(oi->delta_base_sha1, base);
-   } else
-   hashclr(oi->delta_base_sha1);
-   }
-
-out:
-   unuse_pack(&w_curs);
-   return type;
-}
-
-static void *unpack_compressed_entry(struct packed_git *p,
-   struct pack_window **w_curs,
-   off_t curpos,
-   unsigned long size)
-{
-   int st;
-   git_zstream stream;
-   unsigned char *buffer, *in;
-
-   buffer = xmallocz_gently(size);
-   if (!buffer)
-   return NULL;
-   memset(&stream, 0, sizeof(stream));
-   stream.next_out = buffer;
-   stream.avail_out = size + 1;
-
-   git_inflate_init(&stream);
-   do {
-   in = use_pack(p, w_curs, curpos, &stream.avail_in);
-   stream.next_in = in;
-   st = git_inflate(&stream, Z_FINISH);
-   if (!stream.avail_out)
-   break; /* the payload is larger than it should be */
-   curpos += stream.next_in - in;
-   } while (st == Z_OK || st == Z_BUF_ERROR);
-   git_inflate_end(&stream);
-   if ((st != Z_STREAM_END) || stream.total_out != size) {
-   free(buffer);
-   return NULL;
-   }
-
-   return buffer;
-}
-
 static struct hashmap delta_base_cache;
 static size_t delta_base_cached;
 
@@ -2486,6 +2376,116 @@ static void add_delta_base_cache(struct packed_git *p, 
off_t base_offset,
hashmap_add(&delta_base_cache, ent);
 }
 
+int packed_object_info(struct packed_git *p, off_t obj_offset,
+  struct object_info *oi)
+{
+   struct pack_window *w_curs = NULL;
+   unsigned long size;
+   off_t curpos = obj_offset;
+   enum object_type type;
+
+   /*
+* We always get the representa

[PATCH v3 0/4] Improvements to sha1_file

2017-06-15 Thread Jonathan Tan
Thanks - this has been updated following Junio's comments.

Patch 1 is unmodified from the previous version.

Patch 2 has been modified to remove the extraneous code pointed out by
Junio. I previously had an idea of populating those fields in
packed_object_info(), but later changed my mind, but a rebase went
wrong.

Patches 3-4 have been updated as I have described in [1] and [2].

[1] 
https://public-inbox.org/git/20170615111447.1208e...@twelve2.svl.corp.google.com/
[2] 
https://public-inbox.org/git/20170615111447.1208e...@twelve2.svl.corp.google.com/

As before, I would like review on patches 1-3 to go into the tree.
(Patch 4 is a work in progress, and is here just to demonstrate the
effectiveness of the refactoring.)

Jonathan Tan (4):
  sha1_file: teach packed_object_info about typename
  sha1_file: move delta base cache code up
  sha1_file: consolidate storage-agnostic object fns
  sha1_file, fsck: add missing blob support

 Documentation/config.txt |  10 +
 builtin/fsck.c   |   7 +
 cache.h  |   8 +
 sha1_file.c  | 474 ++-
 t/t3907-missing-blob.sh  |  69 +++
 5 files changed, 400 insertions(+), 168 deletions(-)
 create mode 100755 t/t3907-missing-blob.sh

-- 
2.13.1.518.g3df882009-goog



[PATCH v3 1/4] sha1_file: teach packed_object_info about typename

2017-06-15 Thread Jonathan Tan
In commit 46f0344 ("sha1_file: support reading from a loose object of
unknown type", 2015-05-06), "struct object_info" gained a "typename"
field that could represent a type name from a loose object file, whether
valid or invalid, as opposed to the existing "typep" which could only
represent valid types. Some relatively complex manipulations were added
to avoid breaking packed_object_info() without modifying it, but it is
much easier to just teach packed_object_info() about the new field.
Therefore, teach packed_object_info() as described above.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 sha1_file.c | 29 -
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 59a4ed2ed..a52b27541 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2277,9 +2277,18 @@ int packed_object_info(struct packed_git *p, off_t 
obj_offset,
*oi->disk_sizep = revidx[1].offset - obj_offset;
}
 
-   if (oi->typep) {
-   *oi->typep = packed_to_object_type(p, obj_offset, type, &w_curs, curpos);
-   if (*oi->typep < 0) {
+   if (oi->typep || oi->typename) {
+   enum object_type ptot;
+   ptot = packed_to_object_type(p, obj_offset, type, &w_curs,
+curpos);
+   if (oi->typep)
+   *oi->typep = ptot;
+   if (oi->typename) {
+   const char *tn = typename(ptot);
+   if (tn)
+   strbuf_addstr(oi->typename, tn);
+   }
+   if (ptot < 0) {
type = OBJ_BAD;
goto out;
}
@@ -2960,7 +2969,6 @@ int sha1_object_info_extended(const unsigned char *sha1, 
struct object_info *oi,
struct cached_object *co;
struct pack_entry e;
int rtype;
-   enum object_type real_type;
const unsigned char *real = lookup_replace_object_extended(sha1, flags);
 
co = find_cached_object(real);
@@ -2992,18 +3000,9 @@ int sha1_object_info_extended(const unsigned char *sha1, 
struct object_info *oi,
return -1;
}
 
-   /*
-* packed_object_info() does not follow the delta chain to
-* find out the real type, unless it is given oi->typep.
-*/
-   if (oi->typename && !oi->typep)
-   oi->typep = &real_type;
-
rtype = packed_object_info(e.p, e.offset, oi);
if (rtype < 0) {
mark_bad_packed_object(e.p, real);
-   if (oi->typep == &real_type)
-   oi->typep = NULL;
return sha1_object_info_extended(real, oi, 0);
} else if (in_delta_base_cache(e.p, e.offset)) {
oi->whence = OI_DBCACHED;
@@ -3014,10 +3013,6 @@ int sha1_object_info_extended(const unsigned char *sha1, 
struct object_info *oi,
oi->u.packed.is_delta = (rtype == OBJ_REF_DELTA ||
 rtype == OBJ_OFS_DELTA);
}
-   if (oi->typename)
-   strbuf_addstr(oi->typename, typename(*oi->typep));
-   if (oi->typep == &real_type)
-   oi->typep = NULL;
 
return 0;
 }
-- 
2.13.1.518.g3df882009-goog



[PATCH v3 4/4] sha1_file, fsck: add missing blob support

2017-06-15 Thread Jonathan Tan
Currently, Git does not support repos with very large numbers of blobs
or repos that wish to minimize manipulation of certain blobs (for
example, because they are very large) very well, even if the user
operates mostly on part of the repo, because Git is designed on the
assumption that every blob referenced by a tree object is available
somewhere in the repo storage.

As a first step to reducing this problem, add rudimentary support for
missing blobs by teaching sha1_file to invoke a hook whenever a blob is
requested and unavailable but registered to be missing, and by updating
fsck to tolerate such blobs.  The hook is a shell command that can be
configured through "git config"; this hook takes in a list of hashes and
writes (if successful) the corresponding objects to the repo's local
storage.

This commit does not include support for generating such a repo; neither
has any command (other than fsck) been modified to either tolerate
missing blobs (without invoking the hook) or be more efficient in
invoking the missing blob hook. Only a fallback is provided in the form
of sha1_file invoking the missing blob hook when necessary.

In order to determine the code changes in sha1_file.c necessary, I
investigated the following:
 (1) functions in sha1_file that take in a hash, without the user
 regarding how the object is stored (loose or packed)
 (2) functions in sha1_file that operate on packed objects (because I
 need to check callers that know about the loose/packed distinction
 and operate on both differently, and ensure that they can handle
 the concept of objects that are neither loose nor packed)

(1) is handled by the modification to sha1_object_info_extended() and
has_sha1_file_with_flags().

For (2), I looked through the same functions as in (1) and also
for_each_packed_object. The ones that are relevant are:
 - parse_pack_index
   - http - indirectly from http_get_info_packs
 - find_pack_entry_one
   - this searches a single pack that is provided as an argument; the
 caller already knows (through other means) that the sought object
 is in a specific pack
 - find_sha1_pack
   - fast-import - appears to be an optimization to not store a
 file if it is already in a pack
   - http-walker - to search through a struct alt_base
   - http-push - to search through remote packs
 - has_sha1_pack
   - builtin/fsck - fixed in this commit
   - builtin/count-objects - informational purposes only (check if loose
 object is also packed)
   - builtin/prune-packed - check if object to be pruned is packed (if
 not, don't prune it)
   - revision - used to exclude packed objects if requested by user
   - diff - just for optimization
 - for_each_packed_object
   - reachable - only to find recent objects
   - builtin/fsck - fixed in this commit
   - builtin/cat-file - see below

As described in the list above, builtin/fsck has been updated. I have
left builtin/cat-file alone; this means that cat-file
--batch-all-objects will only operate on objects physically in the repo.

An alternative design that I considered but rejected:

 - Adding a hook whenever a packed blob is requested, not on any blob.
   That is, whenever we attempt to search the packfiles for a blob, if
   it is missing (from the packfiles and from the loose object storage),
   to invoke the hook (which must then store it as a packfile), open the
   packfile the hook generated, and report that the blob is found in
   that new packfile. This reduces the amount of analysis needed (in
   that we only need to look at how packed blobs are handled), but
   requires that the hook generate packfiles (or for sha1_file to pack
   whatever loose objects are generated), creating one packfile for each
   missing blob and potentially very many packfiles that must be
   linearly searched. This may be tolerable now for repos that only have
   a few missing blobs (for example, repos that only want to exclude
   large blobs), and might be tolerable in the future if we have
   batching support for the most commonly used commands, but is not
   tolerable now for repos that exclude a large amount of blobs.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 Documentation/config.txt |  10 +++
 builtin/fsck.c   |   7 ++
 cache.h  |   7 ++
 sha1_file.c  | 171 +++
 t/t3907-missing-blob.sh  |  69 +++
 5 files changed, 250 insertions(+), 14 deletions(-)
 create mode 100755 t/t3907-missing-blob.sh

diff --git a/Documentation/config.txt b/Documentation/config.txt
index dd4beec39..10da5fde1 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -390,6 +390,16 @@ The default is false, except linkgit:git-clone[1] or 
linkgit:git-init[1]
 will probe and set core.ignoreCase true if appropriate when the repository
 is created.
 
+core.missingBlobCommand::
+   If set, whenever a blob in the local repo is attempted to

[PATCH v3 3/4] sha1_file: consolidate storage-agnostic object fns

2017-06-15 Thread Jonathan Tan
In sha1_file.c, there are a few functions that provide information on an
object regardless of its storage (cached, loose, or packed). Looking
through all non-static functions in sha1_file.c that take in an unsigned
char * pointer, the relevant ones are:
 - sha1_object_info_extended
 - sha1_object_info (auto-fixed by sha1_object_info_extended)
 - read_sha1_file_extended (uses read_object)
 - read_object_with_reference (auto-fixed by read_sha1_file_extended)
 - has_sha1_file_with_flags
 - assert_sha1_type (auto-fixed by sha1_object_info)

Looking at the 3 primary functions (sha1_object_info_extended,
read_object, has_sha1_file_with_flags), they independently implement
mechanisms such as object replacement, retrying the packed store after
failing to find the object in the packed store then the loose store, and
being able to mark a packed object as bad and then retrying the whole
process. Consolidating these mechanisms would be a great help to
maintainability.

However, has_sha1_file_with_flags() does things that the other 2 don't
(skipping cached storage, allowing a "quick" mode that skips retrying
the packed storage after trying the loose storage, and refreshing any
loose files found).

Therefore, consolidate only the other 2 functions by extending
sha1_object_info_extended() to support the functionality needed, and
then modifying read_object() to use sha1_object_info_extended().

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 cache.h |  1 +
 sha1_file.c | 84 ++---
 2 files changed, 43 insertions(+), 42 deletions(-)

diff --git a/cache.h b/cache.h
index 4d92aae0e..63a73af17 100644
--- a/cache.h
+++ b/cache.h
@@ -1835,6 +1835,7 @@ struct object_info {
off_t *disk_sizep;
unsigned char *delta_base_sha1;
struct strbuf *typename;
+   void **contentp;
 
/* Response */
enum {
diff --git a/sha1_file.c b/sha1_file.c
index a38319443..60b487c70 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2005,19 +2005,6 @@ int parse_sha1_header(const char *hdr, unsigned long 
*sizep)
return parse_sha1_header_extended(hdr, &oi, LOOKUP_REPLACE_OBJECT);
 }
 
-static void *unpack_sha1_file(void *map, unsigned long mapsize, enum 
object_type *type, unsigned long *size, const unsigned char *sha1)
-{
-   int ret;
-   git_zstream stream;
-   char hdr[8192];
-
-   ret = unpack_sha1_header(&stream, map, mapsize, hdr, sizeof(hdr));
-   if (ret < Z_OK || (*type = parse_sha1_header(hdr, size)) < 0)
-   return NULL;
-
-   return unpack_sha1_rest(&stream, hdr, *size, sha1);
-}
-
 unsigned long get_size_from_delta(struct packed_git *p,
  struct pack_window **w_curs,
  off_t curpos)
@@ -2326,8 +2313,10 @@ static void *cache_or_unpack_entry(struct packed_git *p, 
off_t base_offset,
if (!ent)
return unpack_entry(p, base_offset, type, base_size);
 
-   *type = ent->type;
-   *base_size = ent->size;
+   if (type)
+   *type = ent->type;
+   if (base_size)
+   *base_size = ent->size;
return xmemdupz(ent->data, ent->size);
 }
 
@@ -2388,9 +2377,16 @@ int packed_object_info(struct packed_git *p, off_t 
obj_offset,
 * We always get the representation type, but only convert it to
 * a "real" type later if the caller is interested.
 */
-   type = unpack_object_header(p, &w_curs, &curpos, &size);
+   if (oi->contentp) {
+   *oi->contentp = cache_or_unpack_entry(p, obj_offset, oi->sizep,
+ &type);
+   if (!*oi->contentp)
+   type = OBJ_BAD;
+   } else {
+   type = unpack_object_header(p, &w_curs, &curpos, &size);
+   }
 
-   if (oi->sizep) {
+   if (!oi->contentp && oi->sizep) {
if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
off_t tmp_pos = curpos;
off_t base_offset = get_delta_base(p, &w_curs, &tmp_pos,
@@ -2679,8 +2675,10 @@ void *unpack_entry(struct packed_git *p, off_t 
obj_offset,
free(external_base);
}
 
-   *final_type = type;
-   *final_size = size;
+   if (final_type)
+   *final_type = type;
+   if (final_size)
+   *final_size = size;
 
unuse_pack(&w_curs);
 
@@ -2914,6 +2912,7 @@ static int sha1_loose_object_info(const unsigned char 
*sha1,
git_zstream stream;
char hdr[32];
struct strbuf hdrbuf = STRBUF_INIT;
+   unsigned long size_scratch;
 
if (oi->delta_base_sha1)
hashclr(oi->delta_base_sha1);
@@ -2926,7 +2925,7 @@ static int sha1_loose_object_info(const unsigned char 
*sha1,
 * return value implicitly indicates whether the
 * object even exists.
 */
-   if 

Re: [PATCH v2 4/4] sha1_file, fsck: add missing blob support

2017-06-15 Thread Jonathan Tan
A reroll is coming soon, but there is an interesting discussion point
here so I'll reply to this e-mail first.

On Thu, 15 Jun 2017 11:34:45 -0700
Junio C Hamano <gits...@pobox.com> wrote:

> Jonathan Tan <jonathanta...@google.com> writes:
> 
> > +struct missing_blob_manifest {
> > +   struct missing_blob_manifest *next;
> > +   const char *data;
> > +};
> > +struct missing_blob_manifest *missing_blobs;
> > +int missing_blobs_initialized;
> 
> I do not think you meant to make these non-static.  The type of the
> former is not even visible to the outside world, and the latter is
> something that could be made into static to prepare_missing_blobs()
> function (unless and until you start allowing the missing-blobs
> manifest to be re-initialized).  Your ensure_configured() below
> seems to do the "static" right, on the other hand ;-).

Good catch - done.

> Do we expect that we will have only a handful of these missing blob
> manifests?  Each manifest seems to be efficiently looked-up with a
> binary search, but it makes me wonder if it is a good idea to
> consolidate these manifests into a single list of object names to
> eliminate the outer loop in has_missing_blob().  Unlike pack .idx
> files that must stay one-to-one with .pack files, it appears to me
> that there is no reason why we need to keep multiple ones separate
> for extended period of time (e.g. whenever we learn that we receieved
> an incomplete pack from the other side with a list of newly missing
> blobs, we could incorporate that into existing missing blob list).

There is indeed no reason why we need to keep multiple ones separate for
an extended period of time - my thinking was to let fetch/clone be fast
by not needing to scan through the entire existing manifest (in order to
create the new one), letting GC take care of consolidating them (since
it would have to check individual entries to delete those corresponding
to objects that have entered the repo through other means). But this is
at the expense of making the individual object lookups a bit slower.

For now, I'll leave the possibility of multiple files open while I try
to create a set of patches that can implement missing blob support from
fetch to day-to-day usage. But I am not opposed to changing it to a
single-file manifest.
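
For concreteness, the per-manifest lookup I have in mind is roughly the
following (an untested sketch; it assumes, purely for illustration,
sorted 28-byte records - a 20-byte object name followed by a 64-bit
big-endian size - and in_one_manifest() is a made-up name):

  static int in_one_manifest(const char *data, size_t nr,
                             const unsigned char *sha1,
                             unsigned long *size)
  {
          size_t lo = 0, hi = nr;

          while (lo < hi) {
                  size_t mi = lo + (hi - lo) / 2;
                  const char *rec = data + mi * 28;
                  int cmp = hashcmp(sha1, (const unsigned char *)rec);

                  if (!cmp) {
                          if (size) {
                                  uint64_t be;
                                  memcpy(&be, rec + 20, sizeof(be));
                                  *size = ntohll(be);
                          }
                          return 1;
                  }
                  if (cmp > 0)
                          lo = mi + 1;
                  else
                          hi = mi;
          }
          return 0;
  }

The outer loop in in_missing_blob_manifest() would then just call this
for each manifest in turn.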

> > +int has_missing_blob(const unsigned char *sha1, unsigned long *size)
> > +{
> 
> This function that answers "is it expected to be missing?" is
> confusingly named.  Is it missing, or does it exist?

Renamed to in_missing_blob_manifest().

> > @@ -2981,11 +3050,55 @@ static int sha1_loose_object_info(const unsigned 
> > char *sha1,
> > return (status < 0) ? status : 0;
> >  }
> >  
> > +static char *missing_blob_command;
> > +static int sha1_file_config(const char *conf_key, const char *value, void 
> > *cb)
> > +{
> > +   if (!strcmp(conf_key, "core.missingblobcommand")) {
> > +   missing_blob_command = xstrdup(value);
> > +   }
> > +   return 0;
> > +}
> > +
> > +static int configured;
> > +static void ensure_configured(void)
> > +{
> > +   if (configured)
> > +   return;
> 
> Do not be selfish and pretend that this is the _only_ kind of
> configuration that needs to be done inside sha1_file.c.  Call the
> function ensure_<foo>_is_configured() and rename the run-once
> guard to match.

My thinking was that any additional configuration could be added to this
function, but individual configuration for each feature is fine too. I
have renamed things according to your suggestion.

> The run-once guard can be made static to the "ensure" function, and
> if you do so, then its name can stay to be "configured", as at that
> point it is clear what it is guarding.

Done.

> > +pack() {
> 
> Style: "pack () {"

Done.

> 
> > +   perl -e '$/ = undef; $input = <>; print pack("H*", $input)'
> 
> high-nybble first to match ntohll() done in has_missing_blob()?  OK.

Actually it's to match the printf behavior below that prints the high
nybble first (like in English).


Re: [PATCH v2 3/4] sha1_file: consolidate storage-agnostic object fns

2017-06-15 Thread Jonathan Tan
On Thu, 15 Jun 2017 10:50:46 -0700
Junio C Hamano <gits...@pobox.com> wrote:

> Jonathan Tan <jonathanta...@google.com> writes:
> 
> > Looking at the 3 primary functions (sha1_object_info_extended,
> > read_object, has_sha1_file_with_flags), they independently implement
> > mechanisms such as object replacement, retrying the packed store after
> > failing to find the object in the packed store then the loose store, and
> > being able to mark a packed object as bad and then retrying the whole
> > process. Consolidating these mechanisms would be a great help to
> > maintainability.
> >
> > Therefore, consolidate all 3 functions by extending
> > sha1_object_info_extended() to support the functionality needed by all 3
> > functions, and then modifying the other 2 to use
> > sha1_object_info_extended().
> 
> This is a rather "ugly" looking patch ;-) but I followed what
> has_sha1_file_with_flags() and read_object() do before and after
> this change, and I think this patch is a no-op wrt their behaviour
> (which is a good thing).
> 
> But I have a very mixed feelings on one aspect of the resulting
> sha1_object_info_extended().
> 
> >  int sha1_object_info_extended(const unsigned char *sha1, struct 
> > object_info *oi, unsigned flags)
> >  {
> > ...
> > if (!find_pack_entry(real, &e)) {
> > /* Most likely it's a loose object. */
> > -   if (!sha1_loose_object_info(real, oi, flags)) {
> > +   if (oi && !sha1_loose_object_info(real, oi, flags)) {
> > oi->whence = OI_LOOSE;
> > return 0;
> > }
> > +   if (!oi && has_loose_object(real))
> > +   return 0;
> 
> This conversion is not incorrect per-se.  
> 
> We can see that has_sha1_file_with_flags() after this patch still
> calls has_loose_object().  But it bothers me that there is no hint
> to future developers to warn that a rewrite of the above like this
> is incorrect:
> 
> if (!find_pack_entry(real, &e)) {
> /* Most likely it's a loose object. */
>+struct object_info dummy_oi;
>+if (!oi) {
>+memset(&dummy_oi, 0, sizeof(dummy_oi));
>+oi = &dummy_oi;
>+}
>-if (oi && !sha1_loose_object_info(real, oi, flags)) {
>+if (!sha1_loose_object_info(real, oi, flags)) {
> oi->whence = OI_LOOSE;
> return 0;
> }
>-if (!oi && has_loose_object(real))
>-return 0;
> 
> It used to be very easy to see that has_sha1_file_with_flags() will
> call has_loose_object() when it does not find the object in a pack,
> which will result in the loose object file freshened.  In the new
> code, it is very subtle to see that---it will happen when the caller
> passes oi == NULL, and has_sha1_file_with_flags() is such a caller,
> but it is unclear if there are other callers of this "consolidated"
> sha1_object_info_extended() that passes oi == NULL, and if they do
> also want to freshen the loose object file when they do so.

Good point - sorry, I didn't pay much attention to the freshening
behavior. After some thought, I now think it might be better to avoid
modifying has_sha1_file_with_flags(). As it is,
sha1_object_info_extended() already needs special handling (special
flags and handling the possibility of "oi" being NULL) to handle the
functionality required by has_sha1_file_with_flags(); adding yet another
thing to handle (freshen or not) would make it much too complicated.

This means that subsequent patches that modify the handling of
storage-agnostic objects would still need to modify 2 functions, but at
least that is fewer than the current 3.

I'll reroll with these changes so that you (and others) can see what it
looks like.

> > @@ -3480,18 +3491,12 @@ int has_sha1_pack(const unsigned char *sha1)
> >  
> >  int has_sha1_file_with_flags(const unsigned char *sha1, int flags)
> >  {
> > -   struct pack_entry e;
> > +   int f = OBJECT_INFO_SKIP_CACHED |
> > +   ((flags & HAS_SHA1_QUICK) ? OBJECT_INFO_QUICK : 0);
> >  
> > if (!startup_info->have_repository)
> > return 0;
-   if (find_pack_entry(sha1, &e))
> > -   return 1;
> > -   if (has_loose_object(sha1))
> > -   return 1;
> > -   if (flags & HAS_SHA1_QUICK)
> > -   return 0;
> > -   reprepare_packed_git();
-   return find_pack_entry(sha1, &e);
> > +   return !sha1_object_in

Re: [PATCHv5 04/17] diff: introduce more flexible emit function

2017-06-13 Thread Jonathan Tan
On Tue, 13 Jun 2017 16:41:57 -0700
Stefan Beller <sbel...@google.com> wrote:

> On Tue, Jun 13, 2017 at 2:54 PM, Jonathan Tan <jonathanta...@google.com> 
> wrote:
> > - could this be called emit() instead?
> 
> Despite having good IDEs available some (including me)
> very much like working with raw text, and then having a function
> named as a common string doesn't help.
> 
> After this patch
> 
>   $ git grep emit_line |wc -l
>   16
>   # not all are this function, there is
>   emit_line_checked as well. But 16 is not too much.
> 
> But if renamed to emit():
> 
>   $ git grep emit -- diff.c |wc -l
>   60
> 
> You could argue I'd just have to grep
> for "emit (" instead, but that then I would have
> rely on correct whitespacing or use a regex already.
> Complexity which I would not like.
> 
> So I am not sure if this is helping a reader. (Not the casual
> reader, but the one grepping for this function)
> 
> Maybe we can settle on a different name though,
> such as emit_string which is not a prefix of a dozen
> different other functions?

emit_string sounds good to me.


Re: [RFC/PATCH] builtin/blame: darken redundant line information

2017-06-13 Thread Jonathan Tan
On Mon, 12 Jun 2017 19:31:51 -0700
Stefan Beller <sbel...@google.com> wrote:

> When using git-blame lots of lines contain redundant information, for
> example in hunks that consist of multiple lines, the metadata (commit name,
> author, timezone) are repeated. A reader may not be interested in those,
> so darken them. The darkening is not just based on hunk, but actually
> takes the previous lines content for that field to compare to.
> 
> Signed-off-by: Stefan Beller 
> ---
> 
>  Example output (blame of blame): http://i.imgur.com/0Y12p2f.png

Looking at this image, how does blame decide what to dim? As it is, I see many
identical timestamps (and also from the same commit) not being dimmed.
(For example, see the very last line with "2013-01-05 ..." which is
identical to the previous line, and I would expect that to be dimmed.)

Also, my preference is to have all-or-nothing dimming (dim the whole
line up to and including the time zone if nothing has changed, and dim
nothing otherwise) but I know that this is a subjective issue.


Re: [PATCH] diff.c: color moved lines differently

2017-06-13 Thread Jonathan Tan
On Wed, 31 May 2017 17:24:29 -0700
Stefan Beller <sbel...@google.com> wrote:

> When a patch consists mostly of moving blocks of code around, it can
> be quite tedious to ensure that the blocks are moved verbatim, and not
> undesirably modified in the move. To that end, color blocks that are
> moved within the same patch differently. For example (OM, del, add,
> and NM are different colors):

[snip]

Junio asks "are we happy with these changes" [1] and my answer is, in
general, yes - this seems like a very useful feature to have, and I'm OK
with the current design.

I do feel a bit of unease at how the emitted strings are collected
without many guarantees as to their contents (e.g. whether they are full
lines or even whether they originate from the text of a file), but this
is already true for the existing code. The potential danger is that we
are now relying more on the format of these strings, but we don't plan
to do anything other than to color them, so this seems fine.

I would also prefer if there was only one coloring method, to ease
testing, but I can tolerate the current multiplicity of options.

I think there was some discussion at trying to avoid indicating that
small blocks (consisting of few non-whitespace characters) are moved,
but I think that that can be added later.

[1] 
https://public-inbox.org/git/cagz79ky2z-fjyxczbzheu1hchlkkkdjecdmwsp-hkn0tjub...@mail.gmail.com/

> --- a/Documentation/diff-options.txt
> +++ b/Documentation/diff-options.txt
> @@ -231,6 +231,38 @@ ifdef::git-diff[]
>  endif::git-diff[]
>   It is the same as `--color=never`.
>  
> +--color-moved[=<mode>]::
> + Moved lines of code are colored differently.
> +ifdef::git-diff[]
> + It can be changed by the `diff.colorMoved` configuration setting.
> +endif::git-diff[]
> + The <mode> defaults to 'no' if the option is not given
> + and to 'adjacentbounds' if the option with no mode is given.
> + The mode must be one of:
> ++
> +--
> +no::
> + Moved lines are not highlighted.
> +nobounds::
> + Any line that is added in one location and was removed
> + in another location will be colored with 'color.diff.newmoved'.
> + Similarly 'color.diff.oldmoved' will be used for removed lines

Probably best to consistently capitalize newMoved and oldMoved.

> +static unsigned get_line_hash(struct diff_line *line, unsigned ignore_ws)
> +{
> + static struct strbuf sb = STRBUF_INIT;
> +
> + if (ignore_ws) {
> + strbuf_reset(&sb);
> + get_ws_cleaned_string(line, &sb);
> + return memhash(sb.buf, sb.len);
> + } else {
> + return memhash(line->line, line->len);
> + }
> +}

Can be written without the "else".
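
That is, something like this (untested, behavior unchanged):

  static unsigned get_line_hash(struct diff_line *line, unsigned ignore_ws)
  {
          static struct strbuf sb = STRBUF_INIT;

          if (ignore_ws) {
                  strbuf_reset(&sb);
                  get_ws_cleaned_string(line, &sb);
                  return memhash(sb.buf, sb.len);
          }
          return memhash(line->line, line->len);
  }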

> +test_expect_success 'move detection does not mess up colored words' '

Probably name this test "--color-moved has no effect when used with
--word-diff".


Re: [PATCHv5 16/17] diff: buffer all output if asked to

2017-06-13 Thread Jonathan Tan
On Wed, 24 May 2017 14:40:35 -0700
Stefan Beller <sbel...@google.com> wrote:

> diff --git a/diff.h b/diff.h
> index 85948ed65a..fad1258556 100644
> --- a/diff.h
> +++ b/diff.h
> @@ -115,6 +115,42 @@ enum diff_submodule_format {
>   DIFF_SUBMODULE_INLINE_DIFF
>  };
>  
> +/*
> + * This struct is used when we need to buffer the output of the diff output.
> + *
> + * NEEDSWORK: Instead of storing a copy of the line, add an offset pointer
> + * into the pre/post image file. This pointer could be a union with the
> + * line pointer. By storing an offset into the file instead of the literal 
> line,
> + * we can decrease the memory footprint for the buffered output. At first we
> + * may want to only have indirection for the content lines, but we could also
> + * enhance the state for emitting prefabricated lines, e.g. the similarity
> + * score line or hunk/file headers would only need to store a number or path
> + * and then the output can be constructed later on depending on state.
> + */
> +struct diff_line {

Probably should be called diff_emission (or just emission), since these
may not be full lines.

Also, can this definition be in the .c file? Callers should use the
diff_emit_line() below, and not need to know how it is implemented
internally.

> + const char *set;
> + const char *reset;
> + const char *line;
> + int len;
> + int sign;
> + int add_line_prefix;
> + enum {
> + /*
> +  * Emits [lineprefix][set][sign][reset] and then calls
> +  * ws_check_emit which will output "line", marked up
> +  * according to ws_rule.
> +  */
> + DIFF_LINE_WS,
> +
> + /* Emits [lineprefix][set][sign] line [reset] */
> + DIFF_LINE_ASIS,
> +
> + /* Reloads the ws_rule; line contains the file name */
> + DIFF_LINE_RELOAD_WS_RULE
> + } state;
> +};
> +#define diff_line_INIT {NULL, NULL, NULL, 0, 0, 0}

Should be DIFF_LINE_INIT (capitalization), and {NULL} is sufficient, I
think.
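
That is:

  #define DIFF_LINE_INIT { NULL }

since the remaining members are zero-initialized anyway.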


Re: [PATCHv5 04/17] diff: introduce more flexible emit function

2017-06-13 Thread Jonathan Tan
On Wed, 24 May 2017 14:40:23 -0700
Stefan Beller <sbel...@google.com> wrote:

> Currently, diff output is written either through the emit_line_0
> function or through the FILE * in struct diff_options directly. To
> make it easier to teach diff to buffer its output (which will be done
> in a subsequent commit), introduce a more flexible emit_line() function.
> In this commit, direct usages of emit_line_0() are replaced with
> emit_line(); subsequent commits will also replace usages of the
> FILE * with emit().

Check the names of the functions in this paragraph.

> diff --git a/diff.c b/diff.c
> index 2f9722b382..3569857818 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -516,36 +516,30 @@ static void check_blank_at_eof(mmfile_t *mf1, mmfile_t 
> *mf2,
>   ecbdata->blank_at_eof_in_postimage = (at - l2) + 1;
>  }
>  
> -static void emit_line_0(struct diff_options *o, const char *set, const char 
> *reset,
> - int first, const char *line, int len)
> +static void emit_line(struct diff_options *o, const char *set, const char 
> *reset,
> +   int add_line_prefix, int sign, const char *line, int len)

In the future, this function is going to be used even to emit partial
lines - could this be called emit() instead?


[PATCH v2 4/4] sha1_file, fsck: add missing blob support

2017-06-13 Thread Jonathan Tan
Currently, Git does not support repos with very large numbers of blobs
or repos that wish to minimize manipulation of certain blobs (for
example, because they are very large) very well, even if the user
operates mostly on part of the repo, because Git is designed on the
assumption that every blob referenced by a tree object is available
somewhere in the repo storage.

As a first step to reducing this problem, add rudimentary support for
missing blobs by teaching sha1_file to invoke a hook whenever a blob is
requested and unavailable but registered to be missing, and by updating
fsck to tolerate such blobs.  The hook is a shell command that can be
configured through "git config"; this hook takes in a list of hashes and
writes (if successful) the corresponding objects to the repo's local
storage.

This commit does not include support for generating such a repo; neither
has any command (other than fsck) been modified to either tolerate
missing blobs (without invoking the hook) or be more efficient in
invoking the missing blob hook. Only a fallback is provided in the form
of sha1_file invoking the missing blob hook when necessary.

In order to determine the code changes in sha1_file.c necessary, I
investigated the following:
 (1) functions in sha1_file that take in a hash, without the user
 regarding how the object is stored (loose or packed)
 (2) functions in sha1_file that operate on packed objects (because I
 need to check callers that know about the loose/packed distinction
 and operate on both differently, and ensure that they can handle
 the concept of objects that are neither loose nor packed)

(1) is handled by the modification to sha1_object_info_extended().

For (2), I looked through the same functions as in (1) and also
for_each_packed_object. The ones that are relevant are:
 - parse_pack_index
   - http - indirectly from http_get_info_packs
 - find_pack_entry_one
   - this searches a single pack that is provided as an argument; the
 caller already knows (through other means) that the sought object
 is in a specific pack
 - find_sha1_pack
   - fast-import - appears to be an optimization to not store a
 file if it is already in a pack
   - http-walker - to search through a struct alt_base
   - http-push - to search through remote packs
 - has_sha1_pack
   - builtin/fsck - fixed in this commit
   - builtin/count-objects - informational purposes only (check if loose
 object is also packed)
   - builtin/prune-packed - check if object to be pruned is packed (if
 not, don't prune it)
   - revision - used to exclude packed objects if requested by user
   - diff - just for optimization
 - for_each_packed_object
   - reachable - only to find recent objects
   - builtin/fsck - fixed in this commit
   - builtin/cat-file - see below

As described in the list above, builtin/fsck has been updated. I have
left builtin/cat-file alone; this means that cat-file
--batch-all-objects will only operate on objects physically in the repo.

An alternative design that I considered but rejected:

 - Adding a hook whenever a packed blob is requested, not on any blob.
   That is, whenever we attempt to search the packfiles for a blob, if
   it is missing (from the packfiles and from the loose object storage),
   to invoke the hook (which must then store it as a packfile), open the
   packfile the hook generated, and report that the blob is found in
   that new packfile. This reduces the amount of analysis needed (in
   that we only need to look at how packed blobs are handled), but
   requires that the hook generate packfiles (or for sha1_file to pack
   whatever loose objects are generated), creating one packfile for each
   missing blob and potentially very many packfiles that must be
   linearly searched. This may be tolerable now for repos that only have
   a few missing blobs (for example, repos that only want to exclude
   large blobs), and might be tolerable in the future if we have
   batching support for the most commonly used commands, but is not
   tolerable now for repos that exclude a large amount of blobs.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 Documentation/config.txt |  10 +++
 builtin/fsck.c   |   7 +++
 cache.h  |   6 ++
 sha1_file.c  | 156 ++-
 t/t3907-missing-blob.sh  |  69 +
 5 files changed, 233 insertions(+), 15 deletions(-)
 create mode 100755 t/t3907-missing-blob.sh

diff --git a/Documentation/config.txt b/Documentation/config.txt
index dd4beec39..10da5fde1 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -390,6 +390,16 @@ The default is false, except linkgit:git-clone[1] or 
linkgit:git-init[1]
 will probe and set core.ignoreCase true if appropriate when the repository
 is created.
 
+core.missingBlobCommand::
+   If set, whenever a blob in the local repo is attempted to be
+   read but is missing, i

[PATCH v2 0/4] Improvements to sha1_file

2017-06-13 Thread Jonathan Tan
Peff suggested putting in a new field in struct object_info for the
object contents; I tried it and it seems to work out quite well.

Patch 1 is unmodified from the previous version. Patches 2-3 have been
rewritten, and patch 4 is similar except that the missing-lookup change
is made to sha1_object_info_extended() instead of the now gone
get_object().

As before, I would like review on patches 1-3 to go into the tree.
(Patch 4 is a work in progress, and is here just to demonstrate the
effectiveness of the refactoring.)

Jonathan Tan (4):
  sha1_file: teach packed_object_info about typename
  sha1_file: move delta base cache code up
  sha1_file: consolidate storage-agnostic object fns
  sha1_file, fsck: add missing blob support

 Documentation/config.txt |  10 +
 builtin/fsck.c   |   7 +
 cache.h  |  13 ++
 sha1_file.c  | 506 +--
 t/t3907-missing-blob.sh  |  69 +++
 5 files changed, 418 insertions(+), 187 deletions(-)
 create mode 100755 t/t3907-missing-blob.sh

-- 
2.13.1.518.g3df882009-goog



[PATCH v2 3/4] sha1_file: consolidate storage-agnostic object fns

2017-06-13 Thread Jonathan Tan
In sha1_file.c, there are a few functions that provide information on an
object regardless of its storage (cached, loose, or packed). Looking
through all non-static functions in sha1_file.c that take in an unsigned
char * pointer, the relevant ones are:
 - sha1_object_info_extended
 - sha1_object_info (auto-fixed by sha1_object_info_extended)
 - read_sha1_file_extended (uses read_object)
 - read_object_with_reference (auto-fixed by read_sha1_file_extended)
 - has_sha1_file_with_flags
 - assert_sha1_type (auto-fixed by sha1_object_info)

Looking at the 3 primary functions (sha1_object_info_extended,
read_object, has_sha1_file_with_flags), they independently implement
mechanisms such as object replacement, retrying the packed store after
failing to find the object in the packed store then the loose store, and
being able to mark a packed object as bad and then retrying the whole
process. Consolidating these mechanisms would be a great help to
maintainability.

Therefore, consolidate all 3 functions by extending
sha1_object_info_extended() to support the functionality needed by all 3
functions, and then modifying the other 2 to use
sha1_object_info_extended().

Note that has_sha1_file_with_flags() does not try cached storage,
whereas the other 2 functions do - this functionality is preserved.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 cache.h |   7 +++
 sha1_file.c | 143 +++-
 2 files changed, 81 insertions(+), 69 deletions(-)

diff --git a/cache.h b/cache.h
index 4d92aae0e..3c85867c3 100644
--- a/cache.h
+++ b/cache.h
@@ -1835,6 +1835,7 @@ struct object_info {
off_t *disk_sizep;
unsigned char *delta_base_sha1;
struct strbuf *typename;
+   void **contentp;
 
/* Response */
enum {
@@ -1866,6 +1867,12 @@ struct object_info {
  */
 #define OBJECT_INFO_INIT {NULL}
 
+/*
+ * sha1_object_info_extended() supports the LOOKUP_ flags and the OBJECT_INFO_
+ * flags.
+ */
+#define OBJECT_INFO_QUICK 4
+#define OBJECT_INFO_SKIP_CACHED 8
 extern int sha1_object_info_extended(const unsigned char *, struct object_info 
*, unsigned flags);
 extern int packed_object_info(struct packed_git *pack, off_t offset, struct 
object_info *);
 
diff --git a/sha1_file.c b/sha1_file.c
index a158907d1..98086e21e 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2005,19 +2005,6 @@ int parse_sha1_header(const char *hdr, unsigned long 
*sizep)
return parse_sha1_header_extended(hdr, &oi, LOOKUP_REPLACE_OBJECT);
 }
 
-static void *unpack_sha1_file(void *map, unsigned long mapsize, enum 
object_type *type, unsigned long *size, const unsigned char *sha1)
-{
-   int ret;
-   git_zstream stream;
-   char hdr[8192];
-
-   ret = unpack_sha1_header(&stream, map, mapsize, hdr, sizeof(hdr));
-   if (ret < Z_OK || (*type = parse_sha1_header(hdr, size)) < 0)
-   return NULL;
-
-   return unpack_sha1_rest(&stream, hdr, *size, sha1);
-}
-
 unsigned long get_size_from_delta(struct packed_git *p,
  struct pack_window **w_curs,
  off_t curpos)
@@ -2326,8 +2313,10 @@ static void *cache_or_unpack_entry(struct packed_git *p, 
off_t base_offset,
if (!ent)
return unpack_entry(p, base_offset, type, base_size);
 
-   *type = ent->type;
-   *base_size = ent->size;
+   if (type)
+   *type = ent->type;
+   if (base_size)
+   *base_size = ent->size;
return xmemdupz(ent->data, ent->size);
 }
 
@@ -2388,9 +2377,16 @@ int packed_object_info(struct packed_git *p, off_t 
obj_offset,
 * We always get the representation type, but only convert it to
 * a "real" type later if the caller is interested.
 */
-   type = unpack_object_header(p, &w_curs, &curpos, &size);
+   if (oi->contentp) {
+   *oi->contentp = cache_or_unpack_entry(p, obj_offset, oi->sizep,
+ &type);
+   if (!*oi->contentp)
+   type = OBJ_BAD;
+   } else {
+   type = unpack_object_header(p, &w_curs, &curpos, &size);
+   }
 
-   if (oi->sizep) {
+   if (!oi->contentp && oi->sizep) {
if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
off_t tmp_pos = curpos;
off_t base_offset = get_delta_base(p, &w_curs, &tmp_pos,
@@ -2685,8 +2681,10 @@ void *unpack_entry(struct packed_git *p, off_t 
obj_offset,
free(external_base);
}
 
-   *final_type = type;
-   *final_size = size;
+   if (final_type)
+   *final_type = type;
+   if (final_size)
+   *final_size = size;
 
unuse_pack(&w_curs);
 
@@ -2920,6 +2918,7 @@ static int sha1_loose_object_info(const unsigned char 
*sha1,
git_zstream stream;
char hdr[32];
s

[PATCH v2 1/4] sha1_file: teach packed_object_info about typename

2017-06-13 Thread Jonathan Tan
In commit 46f0344 ("sha1_file: support reading from a loose object of
unknown type", 2015-05-06), "struct object_info" gained a "typename"
field that could represent a type name from a loose object file, whether
valid or invalid, as opposed to the existing "typep" which could only
represent valid types. Some relatively complex manipulations were added
to avoid breaking packed_object_info() without modifying it, but it is
much easier to just teach packed_object_info() about the new field.
Therefore, teach packed_object_info() as described above.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 sha1_file.c | 29 -
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 59a4ed2ed..a52b27541 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2277,9 +2277,18 @@ int packed_object_info(struct packed_git *p, off_t 
obj_offset,
*oi->disk_sizep = revidx[1].offset - obj_offset;
}
 
-   if (oi->typep) {
-   *oi->typep = packed_to_object_type(p, obj_offset, type, &w_curs, curpos);
-   if (*oi->typep < 0) {
+   if (oi->typep || oi->typename) {
+   enum object_type ptot;
+   ptot = packed_to_object_type(p, obj_offset, type, &w_curs,
+curpos);
+   if (oi->typep)
+   *oi->typep = ptot;
+   if (oi->typename) {
+   const char *tn = typename(ptot);
+   if (tn)
+   strbuf_addstr(oi->typename, tn);
+   }
+   if (ptot < 0) {
type = OBJ_BAD;
goto out;
}
@@ -2960,7 +2969,6 @@ int sha1_object_info_extended(const unsigned char *sha1, 
struct object_info *oi,
struct cached_object *co;
struct pack_entry e;
int rtype;
-   enum object_type real_type;
const unsigned char *real = lookup_replace_object_extended(sha1, flags);
 
co = find_cached_object(real);
@@ -2992,18 +3000,9 @@ int sha1_object_info_extended(const unsigned char *sha1, 
struct object_info *oi,
return -1;
}
 
-   /*
-* packed_object_info() does not follow the delta chain to
-* find out the real type, unless it is given oi->typep.
-*/
-   if (oi->typename && !oi->typep)
-   oi->typep = &real_type;
-
rtype = packed_object_info(e.p, e.offset, oi);
if (rtype < 0) {
mark_bad_packed_object(e.p, real);
-   if (oi->typep == &real_type)
-   oi->typep = NULL;
return sha1_object_info_extended(real, oi, 0);
} else if (in_delta_base_cache(e.p, e.offset)) {
oi->whence = OI_DBCACHED;
@@ -3014,10 +3013,6 @@ int sha1_object_info_extended(const unsigned char *sha1, 
struct object_info *oi,
oi->u.packed.is_delta = (rtype == OBJ_REF_DELTA ||
 rtype == OBJ_OFS_DELTA);
}
-   if (oi->typename)
-   strbuf_addstr(oi->typename, typename(*oi->typep));
-   if (oi->typep == &real_type)
-   oi->typep = NULL;
 
return 0;
 }
-- 
2.13.1.518.g3df882009-goog



[PATCH v2 2/4] sha1_file: move delta base cache code up

2017-06-13 Thread Jonathan Tan
In a subsequent patch, packed_object_info() will be modified to use the
delta base cache, so move the relevant code to before
packed_object_info().

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 sha1_file.c | 226 +++-
 1 file changed, 116 insertions(+), 110 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index a52b27541..a158907d1 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2239,116 +2239,6 @@ static enum object_type packed_to_object_type(struct 
packed_git *p,
goto out;
 }
 
-int packed_object_info(struct packed_git *p, off_t obj_offset,
-  struct object_info *oi)
-{
-   struct pack_window *w_curs = NULL;
-   unsigned long size;
-   off_t curpos = obj_offset;
-   enum object_type type;
-
-   /*
-* We always get the representation type, but only convert it to
-* a "real" type later if the caller is interested.
-*/
-   type = unpack_object_header(p, &w_curs, &curpos, &size);
-
-   if (oi->sizep) {
-   if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
-   off_t tmp_pos = curpos;
-   off_t base_offset = get_delta_base(p, &w_curs, &tmp_pos,
-  type, obj_offset);
-   if (!base_offset) {
-   type = OBJ_BAD;
-   goto out;
-   }
-   *oi->sizep = get_size_from_delta(p, &w_curs, tmp_pos);
-   if (*oi->sizep == 0) {
-   type = OBJ_BAD;
-   goto out;
-   }
-   } else {
-   *oi->sizep = size;
-   }
-   }
-
-   if (oi->disk_sizep) {
-   struct revindex_entry *revidx = find_pack_revindex(p, 
obj_offset);
-   *oi->disk_sizep = revidx[1].offset - obj_offset;
-   }
-
-   if (oi->typep || oi->typename) {
-   enum object_type ptot;
-   ptot = packed_to_object_type(p, obj_offset, type, &w_curs,
-curpos);
-   if (oi->typep)
-   *oi->typep = ptot;
-   if (oi->typename) {
-   const char *tn = typename(ptot);
-   if (tn)
-   strbuf_addstr(oi->typename, tn);
-   }
-   if (ptot < 0) {
-   type = OBJ_BAD;
-   goto out;
-   }
-   }
-
-   if (oi->delta_base_sha1) {
-   if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
-   const unsigned char *base;
-
-   base = get_delta_base_sha1(p, &w_curs, curpos,
-  type, obj_offset);
-   if (!base) {
-   type = OBJ_BAD;
-   goto out;
-   }
-
-   hashcpy(oi->delta_base_sha1, base);
-   } else
-   hashclr(oi->delta_base_sha1);
-   }
-
-out:
-   unuse_pack(&w_curs);
-   return type;
-}
-
-static void *unpack_compressed_entry(struct packed_git *p,
-   struct pack_window **w_curs,
-   off_t curpos,
-   unsigned long size)
-{
-   int st;
-   git_zstream stream;
-   unsigned char *buffer, *in;
-
-   buffer = xmallocz_gently(size);
-   if (!buffer)
-   return NULL;
-   memset(&stream, 0, sizeof(stream));
-   stream.next_out = buffer;
-   stream.avail_out = size + 1;
-
-   git_inflate_init(&stream);
-   do {
-   in = use_pack(p, w_curs, curpos, &stream.avail_in);
-   stream.next_in = in;
-   st = git_inflate(&stream, Z_FINISH);
-   if (!stream.avail_out)
-   break; /* the payload is larger than it should be */
-   curpos += stream.next_in - in;
-   } while (st == Z_OK || st == Z_BUF_ERROR);
-   git_inflate_end(&stream);
-   if ((st != Z_STREAM_END) || stream.total_out != size) {
-   free(buffer);
-   return NULL;
-   }
-
-   return buffer;
-}
-
 static struct hashmap delta_base_cache;
 static size_t delta_base_cached;
 
@@ -2486,6 +2376,122 @@ static void add_delta_base_cache(struct packed_git *p, 
off_t base_offset,
hashmap_add(&delta_base_cache, ent);
 }
 
+int packed_object_info(struct packed_git *p, off_t obj_offset,
+  struct object_info *oi)
+{
+   struct pack_window *w_curs = NULL;
+   unsigned long size;
+   off_t curpos = obj_offset;
+   enum object_type type;
+
+   /*
+* We always get the representa

Re: [PATCH v2 00/32] repository object

2017-06-12 Thread Jonathan Tan
On Mon, 12 Jun 2017 12:11:21 -0700
Brandon Williams <bmw...@google.com> wrote:

> On 06/12, Jonathan Tan wrote:
> > On Sat, 10 Jun 2017 02:07:12 -0400
> > Jeff King <p...@peff.net> wrote:
> > 
> > > I do agree that "pass just what the sub-function needs" is a good rule
> > > of thumb. But the reason that these are globals in the first place is
> > > that there are a ton of them, and they are used at the lowest levels of
> > > call chains. So I have a feeling that we're always going to need some
> > > big object to hold all that context when doing multi-repo operations in
> > > a single process.
> > 
> > From my experience with the codebase, it seems that most of these config
> > variables are static (file-local). This means that the lowest levels of
> > call chains could probably get away with storing per-repo configs in a
> > static hashmap or associative array keyed by repo (if they cannot just
> > pass the config around).
> > 
> > Having said that, if it did come to the hashmap, I probably would prefer
> > just putting the config in the repo object too. So maybe that is the way
> > to go.
> 
> This is how the config is already handled.  A config_set is just a
> wrapper around a fancy hashmap.  Callers query using a string as a key
> and are returned the value for that config option.  I say fancy because
> it does stuff to handle multiple values, etc.

The hashmap I meant is one wrapping the one you describe:

  repo -> configset

Or equivalently:

  repo -> (string -> value(s))

> I'm not sure I know what you mean by config variables which are static,
> are you referring to the in-memory options which are populated by
> querying the config?  Those I wouldn't want to see placed in a
> 'repository object'.

Yes. I agree that I wouldn't want to see them placed in a repository
object, but after reading Peff's e-mail, I was thinking of what happens
if a file repeatedly invokes a config-sensitive function in another
file. For example:

 a.c
  for (i = 0; i < 100; i++) b_output(repo, x);

 b.c
  void b_output(struct repository *repo, int x)
  {
   /* print the configured "b.prefix" followed by x */
  }

We probably wouldn't want to parse the repo's configset every time
b_output() is invoked, but then, where to store our parsed "b.prefix"?
The only alternative I see is to have a static hashmap in b.c (keyed by
repo, as described above), which would work if such a situation is rare
(or can be made rare), but if it is common, maybe we have no choice but
to put it in struct repository.
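
To make the static-hashmap idea concrete, here is a minimal sketch. It
assumes the series' "struct repository" and a repo_config_get_string()
with the signature declared below (both are assumptions here), and a
plain linked list stands in for the hashmap to keep it short:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  struct repository;
  /* assumed signature; returns nonzero if the key is unset */
  extern int repo_config_get_string(struct repository *repo,
                                    const char *key, char **dest);

  struct b_prefix_entry {
          struct repository *repo;   /* key: which repo's config */
          char *prefix;              /* parsed "b.prefix" value */
          struct b_prefix_entry *next;
  };
  static struct b_prefix_entry *b_prefix_cache;

  static const char *b_prefix(struct repository *repo)
  {
          struct b_prefix_entry *e;

          for (e = b_prefix_cache; e; e = e->next)
                  if (e->repo == repo)
                          return e->prefix; /* parsed on an earlier call */

          e = malloc(sizeof(*e));
          e->repo = repo;
          if (repo_config_get_string(repo, "b.prefix", &e->prefix))
                  e->prefix = strdup(""); /* key unset: empty prefix */
          e->next = b_prefix_cache;
          b_prefix_cache = e;
          return e->prefix;
  }

  void b_output(struct repository *repo, int x)
  {
          printf("%s%d\n", b_prefix(repo), x);
  }

The config is parsed once per (repo, key) and b.c stays self-contained,
at the cost of a cache that struct repository knows nothing about.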


Re: [RFC PATCH 2/4] sha1_file: extract type and size from object_info

2017-06-12 Thread Jonathan Tan
On Sat, 10 Jun 2017 03:01:33 -0400
Jeff King <p...@peff.net> wrote:

> On Fri, Jun 09, 2017 at 12:23:24PM -0700, Jonathan Tan wrote:
> 
> > Looking at the 3 primary functions (sha1_object_info_extended,
> > read_object, has_sha1_file_with_flags), they independently implement
> > mechanisms such as object replacement, retrying the packed store after
> > failing to find the object in the packed store then the loose store, and
> > being able to mark a packed object as bad and then retrying the whole
> > process. Consolidating these mechanisms would be a great help to
> > maintainability.
> > 
> > Such a consolidated function would need to handle the read_object() case
> > (which returns the object data, type, and size) and the
> > sha1_object_info_extended() case (which returns the object type, size,
> > and some additional information, not all of which can be "turned off" by
> > passing NULL in "struct object_info").
> 
> I like the idea of consolidating the logic. But I can't help but feel
> that pulling these fields out of object_info is a step backwards. The
> point of that struct is to let the caller specify which aspects of the
> object they're interested in

My issue was that there are some parts that cannot be turned off (in
particular, the object_info.u.packed part). Having said that, reading
the packed object itself should give us enough information to populate
that, so I'll take a look and see if this is possible.

> Another approach to this whole mess is to have a single function for
> acquiring a "handle" to an object, along with functions to query aspects
> of a handle. That would let callers iteratively ask for the parts they
> care about, and we could lazily fill the handle info (so information we
> pick up while servicing one property of the object gets cached and
> returned for free if the caller asks for it later).
> 
> That's a much bigger change, though it may have other benefits (e.g., we
> could be passing around handles instead of object buffers, which would
> make it more natural to stream object content in many cases).

There are a few safeguards that, I think, only work with the current
get-everything-then-forget-about-it approach (the packed-loose-packed
retry mechanism, and the desperate retry-if-corrupt-packed-object one).
If we have a handle with a cache, then, for example, we would lose the
ability to retry packed after loose if the handle has already declared
that the object is loose.
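
To make that trade-off concrete, a rough sketch of the handle idea (the
names are invented; sha1_object_info() is the existing function, which
returns the type):

  #include "cache.h"

  enum object_whence { OH_UNKNOWN, OH_LOOSE, OH_PACKED };

  struct object_handle {
          unsigned char sha1[20];
          enum object_whence whence; /* fixed at first successful lookup */
          int have_info;             /* is the type/size cache valid? */
          enum object_type type;
          unsigned long size;
  };

  static int oh_get_type(struct object_handle *oh, enum object_type *type)
  {
          if (!oh->have_info) {
                  int t = sha1_object_info(oh->sha1, &oh->size);
                  if (t < 0)
                          return -1;
                  oh->type = t;
                  oh->have_info = 1;
          }
          *type = oh->type;
          return 0;
  }

Once "whence" is fixed by the first lookup, the packed-loose-packed
retry described above is no longer possible for that handle - which is
exactly the safeguard that would be lost.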


Re: [PATCH v2 00/32] repository object

2017-06-12 Thread Jonathan Tan
On Sat, 10 Jun 2017 17:43:29 -0700
Brandon Williams  wrote:

> I disagree with a few points of what jonathan said (mostly about
> removing the config from the repo object, as I like the idea of nothing
> knowing about a 'config_set' object) and I think this problem could be
> solved in a couple ways.

Ah...is the plan to eventually delete the git_configset_.* functions and
only have repo_config_get_.* functions? If yes, I would prefer the
in-between state to be "git_configset_get_int(repo->configset, ...)"
instead of "repo_config_get_int(repo, ...)" to avoid the parallel set of
functions (which will make it more troublesome, for example, to add
support for a new data type) but I can see the benefits of having the
repo_config_get_.* functions too (conciseness and not having to make one
final find-and-replace when we finally complete the migration, at least)
so I don't feel too strongly about this.

> I don't think that the in-memory global variable 'quotepath' (or
> whatever its called) should live in the repository object (I mean it's
> already contained in the config) but rather 'quotepath' is specific to
> how ls-files handles its output.  So really what should happen is you
> pass a pair of objects to the ls-files machinery (or any other command's
> machinery) (1) the repository object being operated on and (2) an
> options struct which can be configured based on the repository.  So when
> recursing you can do something like the following:
> 
>   repo_init(submodule, path_to_submodule);
>   configure_opts(sub_opts, submodule, super_opts)
>   ls_files(submodule, sub_opts);
> 
> This eliminates bloating 'struct repository' and would allow you to have
> things configured differently in submodules (which is crazy if you ask
> me, but people do crazy things).

This does sound good to me.


Re: [PATCH v2 00/32] repository object

2017-06-12 Thread Jonathan Tan
On Sat, 10 Jun 2017 02:07:12 -0400
Jeff King  wrote:

> I do agree that "pass just what the sub-function needs" is a good rule
> of thumb. But the reason that these are globals in the first place is
> that there are a ton of them, and they are used at the lowest levels of
> call chains. So I have a feeling that we're always going to need some
> big object to hold all that context when doing multi-repo operations in
> a single process.

From my experience with the codebase, it seems that most of these config
variables are static (file-local). This means that the lowest levels of
call chains could probably get away with storing per-repo configs in a
static hashmap or associative array keyed by repo (if they cannot just
pass the config around).

Having said that, if it did come to the hashmap, I probably would prefer
just putting the config in the repo object too. So maybe that is the way
to go.


Re: [PATCH v2 00/32] repository object

2017-06-09 Thread Jonathan Tan
On Thu,  8 Jun 2017 16:40:28 -0700
Brandon Williams  wrote:

> When I sent out my RFC series there seemed to be a lot of interest but I
> haven't seen many people jump to review this series.  Despite lack of review I
> wanted to get out another version which includes some changes to fix things
> that were bugging me about the series.  Hopefully this v2 will prod some more
> people to take a look.
> 
> The meat of the series is really from 04-15 and patch 32 which converts
> ls-files to recurse using a repository object.  So if you're pressed for time
> you can focus on those patches.

Thanks - I can see in patch 32 that one immediate benefit is that
ls-files can now recurse into submodules without needing to invoke an
extra process just to change the GIT_DIR.

Before I get into the details, I have some questions:

1. I am concerned that "struct repository" will end up growing without
bounds as we store more and more repo-specific concerns in it. Could it
be restricted to just the fields populated by repo_init()?
repo_read_index() will then return the index itself, instead of using
"struct repository" as a cache. This means that code using
repo_read_index() will need to maintain its own variable holding the
returned index, but that is likely a positive - it's better for code to
just pass around the specific thing needed between functions anyway, as
opposed to passing a giant "struct repository" (which partially defeats
the purpose of eliminating the usage of globals). A sketch of this API
shape follows at the end of this message.

2. If we do the above, another potential benefit is that (I think) the
repo_config_get_.* functions would no longer be necessary, since we
would have a function that generates a "struct config_set" given a repo,
and then we just use the git_configset_get_.* functions on it. This
would also reduce the urgency of extracting the config methods into its
own header file, and reduce the size of this patch set.

(I was also going to suggest that you remove the "convert" patches, but
that is not possible - ls-files uses get_cached_convert_stats_ascii()
from convert.h despite not #include-ing it.)
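
To make question 1 concrete, here is the API shape I have in mind
(hypothetical signatures; the caller, not the repo, owns the returned
index):

  #include "cache.h"

  struct repository;
  extern struct index_state *repo_read_index(struct repository *repo);

  static void show_entry_count(struct repository *repo)
  {
          struct index_state *istate = repo_read_index(repo);

          printf("%u cache entries\n", istate->cache_nr);
          discard_index(istate); /* caller cleans up, not the repo */
          free(istate);
  }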


[RFC PATCH 2/4] sha1_file: extract type and size from object_info

2017-06-09 Thread Jonathan Tan
This is patch 1 of 2 to consolidate all storage-agnostic object
information functions.

In sha1_file.c, there are a few functions that provide information on an
object regardless of its storage (cached, loose, or packed). Looking
through all non-static functions in sha1_file.c that take in an unsigned
char * pointer, the relevant ones are:
 - sha1_object_info_extended
 - sha1_object_info (auto-fixed by sha1_object_info_extended)
 - read_sha1_file_extended (uses read_object)
 - read_object_with_reference (auto-fixed by read_sha1_file_extended)
 - has_sha1_file_with_flags
 - assert_sha1_type (auto-fixed by sha1_object_info)

Looking at the 3 primary functions (sha1_object_info_extended,
read_object, has_sha1_file_with_flags), they independently implement
mechanisms such as object replacement, retrying the packed store after
failing to find the object in the packed store then the loose store, and
being able to mark a packed object as bad and then retrying the whole
process. Consolidating these mechanisms would be a great help to
maintainability.

Such a consolidated function would need to handle the read_object() case
(which returns the object data, type, and size) and the
sha1_object_info_extended() case (which returns the object type, size,
and some additional information, not all of which can be "turned off" by
passing NULL in "struct object_info").

To make it easier to implement and use such a function, remove the type
and size fields from "struct object_info", making them additional
parameters in sha1_object_info_extended (and related functions) instead.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 builtin/cat-file.c | 29 +++-
 builtin/pack-objects.c |  5 ++--
 cache.h|  6 ++---
 sha1_file.c| 72 --
 streaming.c|  4 +--
 5 files changed, 62 insertions(+), 54 deletions(-)
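
The cache.h hunk is truncated in this excerpt; reconstructed from the
call sites below, the post-patch prototype is roughly the following
(either of typep/sizep may be NULL):

  int sha1_object_info_extended(const unsigned char *sha1,
                                enum object_type *typep,
                                unsigned long *sizep,
                                struct object_info *oi,
                                unsigned flags);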

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 4bffd7a2d..5bb16c045 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -75,7 +75,8 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name,
 	switch (opt) {
 	case 't':
 		oi.typename = &sb;
-		if (sha1_object_info_extended(oid.hash, &oi, flags) < 0)
+		if (sha1_object_info_extended(oid.hash, NULL, NULL, &oi,
+					      flags) < 0)
 			die("git cat-file: could not get object info");
 		if (sb.len) {
 			printf("%s\n", sb.buf);
@@ -85,8 +86,8 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name,
 		break;
 
 	case 's':
-		oi.sizep = &size;
-		if (sha1_object_info_extended(oid.hash, &oi, flags) < 0)
+		if (sha1_object_info_extended(oid.hash, NULL, &size, &oi,
+					      flags) < 0)
 			die("git cat-file: could not get object info");
 		printf("%lu\n", size);
 		return 0;
@@ -194,10 +195,12 @@ struct expand_data {
 	int split_on_whitespace;
 
 	/*
-	 * After a mark_query run, this object_info is set up to be
-	 * passed to sha1_object_info_extended. It will point to the data
+	 * After a mark_query run, these fields are set up to be
+	 * passed to sha1_object_info_extended. They will point to the data
 	 * elements above, so you can retrieve the response from there.
 	 */
+	enum object_type *typep;
+	unsigned long *sizep;
 	struct object_info info;
 
 	/*
@@ -224,12 +227,12 @@ static void expand_atom(struct strbuf *sb, const char *atom, int len,
 		strbuf_addstr(sb, oid_to_hex(&data->oid));
 	} else if (is_atom("objecttype", atom, len)) {
 		if (data->mark_query)
-			data->info.typep = &data->type;
+			data->typep = &data->type;
 		else
 			strbuf_addstr(sb, typename(data->type));
 	} else if (is_atom("objectsize", atom, len)) {
 		if (data->mark_query)
-			data->info.sizep = &data->size;
+			data->sizep = &data->size;
 		else
 			strbuf_addf(sb, "%lu", data->size);
 	} else if (is_atom("objectsize:disk", atom, len)) {
@@ -280,7 +283,7 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d
 {
 	const struct object_id *oid = &data->oid;
 
-	assert(data->info.typep);
+	assert(data->typep);
 
 	if (data->type == OBJ_BLOB) {
 		if (opt->buffer_output)
@@ -323,7 +326,7 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d

[RFC PATCH 3/4] sha1_file: consolidate storage-agnostic object fns

2017-06-09 Thread Jonathan Tan
This is patch 2 of 2 to consolidate all storage-agnostic object
information functions.

In sha1_file.c, there are a few functions that provide information on an
object regardless of its storage (cached, loose, or packed). Looking
through all non-static functions in sha1_file.c that take in an unsigned
char * pointer, the relevant ones are:
 - sha1_object_info_extended
 - sha1_object_info (auto-fixed by sha1_object_info_extended)
 - read_sha1_file_extended (uses read_object)
 - read_object_with_reference (auto-fixed by read_sha1_file_extended)
 - has_sha1_file_with_flags
 - assert_sha1_type (auto-fixed by sha1_object_info)

Looking at the 3 primary functions (sha1_object_info_extended,
read_object, has_sha1_file_with_flags), they independently implement
mechanisms such as object replacement, retrying the packed store after
failing to find the object in the packed store then the loose store, and
being able to mark a packed object as bad and then retrying the whole
process. Consolidating these mechanisms would be a great help to
maintainability.

Therefore, consolidate all 3 functions into 1 function.

Note that has_sha1_file_with_flags() does not try cached storage,
whereas the other 2 functions do - this functionality is preserved.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 sha1_file.c | 294 ++--
 1 file changed, 165 insertions(+), 129 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index ac4d77ccc..deb08b0f1 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1959,7 +1959,7 @@ static int parse_sha1_header_extended(const char *hdr, enum object_type *typep,
}
 
type = type_from_string_gently(type_buf, type_len, 1);
-   if (oi->typename)
+   if (oi && oi->typename)
strbuf_add(oi->typename, type_buf, type_len);
/*
 * Set type to 0 if its an unknown object and
@@ -2001,12 +2001,13 @@ static int parse_sha1_header_extended(const char *hdr, enum object_type *typep,
 
 int parse_sha1_header(const char *hdr, unsigned long *sizep)
 {
-	struct object_info oi = OBJECT_INFO_INIT;
-	return parse_sha1_header_extended(hdr, NULL, sizep, &oi,
+	return parse_sha1_header_extended(hdr, NULL, sizep, NULL,
 					  LOOKUP_REPLACE_OBJECT);
 }
 
-static void *unpack_sha1_file(void *map, unsigned long mapsize, enum object_type *type, unsigned long *size, const unsigned char *sha1)
+static void *unpack_sha1_file(void *map, unsigned long mapsize,
+			      enum object_type *type, unsigned long *size,
+			      const unsigned char *sha1)
 {
int ret;
git_zstream stream;
@@ -2274,18 +2275,18 @@ int packed_object_info(struct packed_git *p, off_t obj_offset,
}
}
 
-	if (oi->disk_sizep) {
+	if (oi && oi->disk_sizep) {
 		struct revindex_entry *revidx = find_pack_revindex(p, obj_offset);
 		*oi->disk_sizep = revidx[1].offset - obj_offset;
}
 
-   if (typep || oi->typename) {
+   if (typep || (oi && oi->typename)) {
enum object_type ptot;
 		ptot = packed_to_object_type(p, obj_offset, type, &w_curs,
 curpos);
if (typep)
*typep = ptot;
-   if (oi->typename) {
+   if (oi && oi->typename) {
const char *tn = typename(ptot);
if (tn)
strbuf_addstr(oi->typename, tn);
@@ -2296,7 +2297,7 @@ int packed_object_info(struct packed_git *p, off_t obj_offset,
}
}
 
-   if (oi->delta_base_sha1) {
+   if (oi && oi->delta_base_sha1) {
if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
const unsigned char *base;
 
@@ -2438,8 +2439,10 @@ static void *cache_or_unpack_entry(struct packed_git *p, off_t base_offset,
if (!ent)
return unpack_entry(p, base_offset, type, base_size);
 
-   *type = ent->type;
-   *base_size = ent->size;
+   if (type)
+   *type = ent->type;
+   if (base_size)
+   *base_size = ent->size;
return xmemdupz(ent->data, ent->size);
 }
 
@@ -2907,43 +2910,20 @@ struct packed_git *find_sha1_pack(const unsigned char *sha1,
 }
 
 static int sha1_loose_object_info(const unsigned char *sha1,
+				  void *map, unsigned long mapsize,
 				  enum object_type *typep,
 				  unsigned long *sizep,
 				  struct object_info *oi,
 				  int flags)
 {
 	int status = 0;
-	unsigned long mapsize;
-	void *map;
 	git_zstream stre

[RFC PATCH 4/4] sha1_file, fsck: add missing blob support

2017-06-09 Thread Jonathan Tan
Currently, Git does not deal well with repos that have very large
numbers of blobs, or repos that wish to minimize manipulation of
certain blobs (for example, because they are very large), even if the
user operates mostly on part of the repo, because Git is designed on
the assumption that every blob referenced by a tree object is
available somewhere in the repo storage.

As a first step to reducing this problem, add rudimentary support for
missing blobs by teaching sha1_file to invoke a hook whenever a blob is
requested and unavailable but registered to be missing, and by updating
fsck to tolerate such blobs.  The hook is a shell command that can be
configured through "git config"; this hook takes in a list of hashes and
writes (if successful) the corresponding objects to the repo's local
storage.

This commit does not include support for generating such a repo; neither
has any command (other than fsck) been modified to either tolerate
missing blobs (without invoking the hook) or be more efficient in
invoking the missing blob hook. Only a fallback is provided in the form
of sha1_file invoking the missing blob hook when necessary.

In order to determine the code changes necessary in sha1_file.c, I
investigated the following:
 (1) functions in sha1_file that take in a hash, without the user
 regarding how the object is stored (loose or packed)
 (2) functions in sha1_file that operate on packed objects (because I
 need to check callers that know about the loose/packed distinction
 and operate on both differently, and ensure that they can handle
 the concept of objects that are neither loose nor packed)

(1) is handled by the modification to get_object().

For (2), I looked through the same functions as in (1) and also
for_each_packed_object. The ones that are relevant are:
 - parse_pack_index
   - http - indirectly from http_get_info_packs
 - find_pack_entry_one
   - this searches a single pack that is provided as an argument; the
 caller already knows (through other means) that the sought object
 is in a specific pack
 - find_sha1_pack
   - fast-import - appears to be an optimization to not store a
 file if it is already in a pack
   - http-walker - to search through a struct alt_base
   - http-push - to search through remote packs
 - has_sha1_pack
   - builtin/fsck - fixed in this commit
   - builtin/count-objects - informational purposes only (check if loose
 object is also packed)
   - builtin/prune-packed - check if object to be pruned is packed (if
 not, don't prune it)
   - revision - used to exclude packed objects if requested by user
   - diff - just for optimization
 - for_each_packed_object
   - reachable - only to find recent objects
   - builtin/fsck - fixed in this commit
   - builtin/cat-file - see below

As described in the list above, builtin/fsck has been updated. I have
left builtin/cat-file alone; this means that cat-file
--batch-all-objects will only operate on objects physically in the repo.

An alternative design that I considered but rejected:

 - Adding a hook whenever a packed blob is requested, not on any blob.
   That is, whenever we attempt to search the packfiles for a blob, if
   it is missing (from the packfiles and from the loose object storage),
   to invoke the hook (which must then store it as a packfile), open the
   packfile the hook generated, and report that the blob is found in
   that new packfile. This reduces the amount of analysis needed (in
   that we only need to look at how packed blobs are handled), but
   requires that the hook generate packfiles (or for sha1_file to pack
   whatever loose objects are generated), creating one packfile for each
   missing blob and potentially very many packfiles that must be
   linearly searched. This may be tolerable now for repos that only have
   a few missing blobs (for example, repos that only want to exclude
   large blobs), and might be tolerable in the future if we have
   batching support for the most commonly used commands, but is not
   tolerable now for repos that exclude a large number of blobs.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 Documentation/config.txt |  10 
 builtin/fsck.c   |   7 +++
 cache.h  |   6 ++
 sha1_file.c  | 147 +++
 t/t3907-missing-blob.sh  |  69 ++
 5 files changed, 229 insertions(+), 10 deletions(-)
 create mode 100755 t/t3907-missing-blob.sh

diff --git a/Documentation/config.txt b/Documentation/config.txt
index dd4beec39..10da5fde1 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -390,6 +390,16 @@ The default is false, except linkgit:git-clone[1] or 
linkgit:git-init[1]
 will probe and set core.ignoreCase true if appropriate when the repository
 is created.
 
+core.missingBlobCommand::
+   If set, whenever a blob in the local repo is attempted to be
+   read but is missing, invoke th

[RFC PATCH 0/4] Improvements to sha1_file

2017-06-09 Thread Jonathan Tan
I was investigating how to adapt my existing patch for missing blob
support [1] to consult a manifest of missing blobs, and found it
difficult to modify sha1_file.c further without first doing some
refactoring. So here are some patches to do that.

I think patch 1 is an independently good change - it makes the code
clearer and is also a net reduction in lines. If none of the other
patches here make it, maybe patch 1 should go in independently.

Patches 2-3 are also collectively independent, but more invasive. The
commit messages explain what's going on in more detail, but basically
there are 3 functions doing similar things (getting information for an
object regardless of where it's stored) with duplicated mechanisms, and
for maintainability, it is better to combine them into one function.

Patch 4 is my adaptation of [1] after all the refactoring - notice that
I just needed to edit 1 storage-agnostic object info function instead
of previously needing to edit 3. It is still a work in progress - the
code looks complete, but I would probably need to at least document the
missing blob manifest format. I am providing it here just to show the
effectiveness of the refactoring in patches 2-3.

I am hoping for reviews on patches 1-3 to be included into the tree.

[1] 
https://public-inbox.org/git/20170426221346.25337-1-jonathanta...@google.com/

Jonathan Tan (4):
  sha1_file: teach packed_object_info about typename
  sha1_file: extract type and size from object_info
  sha1_file: consolidate storage-agnostic object fns
  sha1_file, fsck: add missing blob support

 Documentation/config.txt |  10 +
 builtin/cat-file.c   |  29 +--
 builtin/fsck.c   |   7 +
 builtin/pack-objects.c   |   5 +-
 cache.h  |  12 +-
 sha1_file.c  | 484 +++
 streaming.c  |   4 +-
 t/t3907-missing-blob.sh  |  69 +++
 8 files changed, 439 insertions(+), 181 deletions(-)
 create mode 100755 t/t3907-missing-blob.sh

-- 
2.13.1.508.gb3defc5cc-goog



[RFC PATCH 1/4] sha1_file: teach packed_object_info about typename

2017-06-09 Thread Jonathan Tan
In commit 46f0344 ("sha1_file: support reading from a loose object of
unknown type", 2015-05-06), "struct object_info" gained a "typename"
field that could represent a type name from a loose object file, whether
valid or invalid, as opposed to the existing "typep" which could only
represent valid types. Some relatively complex manipulations were added
to avoid breaking packed_object_info() without modifying it, but it is
much easier to just teach packed_object_info() about the new field.
Therefore, teach packed_object_info() as described above.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 sha1_file.c | 29 -
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 59a4ed2ed..a52b27541 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2277,9 +2277,18 @@ int packed_object_info(struct packed_git *p, off_t obj_offset,
*oi->disk_sizep = revidx[1].offset - obj_offset;
}
 
-   if (oi->typep) {
-		*oi->typep = packed_to_object_type(p, obj_offset, type, &w_curs, curpos);
-   if (*oi->typep < 0) {
+   if (oi->typep || oi->typename) {
+   enum object_type ptot;
+		ptot = packed_to_object_type(p, obj_offset, type, &w_curs,
+curpos);
+   if (oi->typep)
+   *oi->typep = ptot;
+   if (oi->typename) {
+   const char *tn = typename(ptot);
+   if (tn)
+   strbuf_addstr(oi->typename, tn);
+   }
+   if (ptot < 0) {
type = OBJ_BAD;
goto out;
}
@@ -2960,7 +2969,6 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
struct cached_object *co;
struct pack_entry e;
int rtype;
-   enum object_type real_type;
const unsigned char *real = lookup_replace_object_extended(sha1, flags);
 
co = find_cached_object(real);
@@ -2992,18 +3000,9 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
return -1;
}
 
-   /*
-* packed_object_info() does not follow the delta chain to
-* find out the real type, unless it is given oi->typep.
-*/
-   if (oi->typename && !oi->typep)
-		oi->typep = &real_type;
-
rtype = packed_object_info(e.p, e.offset, oi);
if (rtype < 0) {
mark_bad_packed_object(e.p, real);
-		if (oi->typep == &real_type)
-   oi->typep = NULL;
return sha1_object_info_extended(real, oi, 0);
} else if (in_delta_base_cache(e.p, e.offset)) {
oi->whence = OI_DBCACHED;
@@ -3014,10 +3013,6 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
oi->u.packed.is_delta = (rtype == OBJ_REF_DELTA ||
 rtype == OBJ_OFS_DELTA);
}
-   if (oi->typename)
-   strbuf_addstr(oi->typename, typename(*oi->typep));
-	if (oi->typep == &real_type)
-   oi->typep = NULL;
 
return 0;
 }
-- 
2.13.1.508.gb3defc5cc-goog



Re: [WIP v2 0/2] Modifying pack objects to support --blob-max-bytes

2017-06-05 Thread Jonathan Tan
Thanks for your comments.

On Fri, 2 Jun 2017 18:16:45 -0400
Jeff King <p...@peff.net> wrote:

> On Fri, Jun 02, 2017 at 12:38:43PM -0700, Jonathan Tan wrote:
> 
> > > Do we need to future-proof the output format so that we can later
> > > use 32-byte hash?  The input to pack-objects (i.e. rev-list --objects)
> > > is hexadecimal text, and it may not be so bad to make this also
> > > text, e.g. " SP  LF".  That way, you do not have to
> > > worry about byte order, hash size, or length limited to uint64.
> > 
> > The reason for using binary is for the convenience of the client to
> > avoid another conversion before storing it to disk (and also network
> > savings). In a large repo, I think this list will be quite large. I
> > realized that I didn't mention anything about this in the commit
> > message, so I have added an explanation.
> > 
> > I think this is sufficiently future-proof in that the format of this
> > hash matches the format of the hashes used in the objects in the
> > packfile. As for object size being limited to uint64, I think the
> > benefits of the space savings (in using binary) outweigh the small risk
> > that our files will get larger than that before we upgrade our protocol
> > :-P
> 
> The rest of the pack code uses a varint encoding which is generally
> much smaller than a uint64 for most files, but can handle arbitrary
> sizes.
> 
> The one thing it loses is that you wouldn't have a fixed-size record, so
> if you were planning to dump this directly to disk and binary-search it,
> that won't work. OTOH, you could make pseudo-pack-entries and just
> index them along with the rest of the objects in the pack .idx.
> 
> The one subtle thing there is that the pseudo-entry would have to say
> "this is my sha1". And then we'd end up repeating that sha1 in the .idx
> file. So it's optimal on the network but wastes 20 bytes on disk (unless
> index-pack throws away the in-pack sha1s as it indexes, which is
> certainly an option).

If we end up going with the varint approach (which seems reasonable),
maybe the client could just expand the varints into uint64s so that it
has a binary-searchable file. I think it's better to keep this list
separate from the pack .idx file (there has been some discussion on this
- [1] and its replies).

[1] 
https://public-inbox.org/git/777ab8f2-c31a-d07b-ffe3-f8333f408...@jeffhostetler.com/
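
As a sketch of that client-side expansion (decode_varint() is the
existing helper from varint.c; the fixed 28-byte record layout, a
20-byte hash plus a big-endian uint64 size, is an assumption):

  #include <stdint.h>
  #include <string.h>

  extern uintmax_t decode_varint(const unsigned char **bufp);

  static void expand_record(const unsigned char **in, unsigned char out[28])
  {
          uint64_t size;
          int i;

          memcpy(out, *in, 20);      /* the hash is already fixed-width */
          *in += 20;
          size = decode_varint(in);  /* variable-width on the wire... */
          for (i = 0; i < 8; i++)    /* ...fixed big-endian on disk */
                  out[20 + i] = (unsigned char)(size >> (8 * (7 - i)));
  }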

> > > Can this multiplication overflow (hint: st_mult)?
> > 
> > Thanks - used st_mult.
> 
> Actually, I think this is a good candidate for ALLOC_ARRAY().

Thanks - I've made this change in my local version.

> > > This sorting is a part of external contract (i.e. "output in hash
> > > order" is documented), but is it necessary?  Just wondering.
> > 
> > It is convenient for the client because the client can then store it
> > directly and binary search when accessing it. (Added a note about
> > convenience to the commit message.)
> 
> In general the Git client doesn't trust the pack data coming from a
> remote, and you can't corrupt a client by sending it bogus data. Either
> the client computes it from scratch (e.g., the sha1s of each object) or
> the client will reject nonsense (missing bases, refs pointing to objects
> that aren't sent, etc).
> 
> I know this feature implies a certain amount of trust (after all, the
> server could claim that it omitted any sha1 it likes), but we should
> probably still be as strict as possible that what the other side is
> sending makes sense. In this case, we should probably hashcmp() each
> entry with the last and make sure they're strictly increasing (no
> out-of-order and no duplicates).

Good point.


[WIP v2 2/2] pack-objects: support --blob-max-bytes

2017-06-02 Thread Jonathan Tan
As part of an effort to improve Git support for very large repositories
in which clients typically have only a subset of all version-controlled
blobs, teach pack-objects to support --blob-max-bytes, packing only
blobs not exceeding that size unless the blob corresponds to a file
whose name starts with ".git". upload-pack will eventually be taught to
use this new parameter if needed to exclude certain blobs during a fetch
or clone, potentially drastically reducing network consumption when
serving these very large repositories.

Since any excluded blob should not act as a delta base, I did this by
avoiding such oversized blobs from ever going into the "to_pack" data
structure, which contains both preferred bases (which do not go into the
final packfile but can act as delta bases) and the objects to be packed
(which go into the final packfile and also can act as delta bases). To
that end, add_object_entry() has been modified to exclude the
appropriate non-preferred-base objects. (Preferred bases, regardless of
size, still enter "to_pack" - they are something that the client
indicates that it has, not something that the server needs to serve, so
no exclusion occurs.)

In addition, any such exclusions are recorded and sent before the
packfile. These are recorded in binary format and in hash order, in a
format convenient for a client to use if stored directly to disk.

The other method in which objects are added to "to_pack",
add_object_entry_from_bitmap(), has not been modified - thus packing in
the presence of bitmaps still packs all blobs regardless of size. See
the documentation update in this commit for the rationale.

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 Documentation/git-pack-objects.txt |  19 ++-
 builtin/pack-objects.c | 105 -
 t/t5300-pack-object.sh |  71 +
 3 files changed, 193 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index 8973510a4..23ff8b5be 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -12,7 +12,8 @@ SYNOPSIS
 'git pack-objects' [-q | --progress | --all-progress] [--all-progress-implied]
[--no-reuse-delta] [--delta-base-offset] [--non-empty]
	[--local] [--incremental] [--window=<n>] [--depth=<n>]
-   [--revs [--unpacked | --all]] [--stdout | base-name]
+   [--revs [--unpacked | --all]]
+	[--stdout [--blob-max-bytes=<n>] | base-name]
[--shallow] [--keep-true-parents] < object-list
 
 
@@ -231,6 +232,22 @@ So does `git bundle` (see linkgit:git-bundle[1]) when it creates a bundle.
With this option, parents that are hidden by grafts are packed
nevertheless.
 
+--blob-max-bytes=<n>::
+   This option can only be used with --stdout. If specified, a blob
+   larger than this will not be packed unless a to-be-packed tree
+   has that blob with a filename beginning with ".git".  The size
+   can be suffixed with "k", "m", or "g", and may be "0".
++
+If specified, after printing the packfile, pack-objects will print the
+count of excluded blobs (8 bytes). Subsequently, for each excluded blob
+in hash order, pack-objects will print its hash (20 bytes) and its size
+(8 bytes). All numbers are in network byte order.
++
+If pack-objects cannot efficiently determine, for an arbitrary blob,
+that no to-be-packed tree has that blob with a filename beginning with
+".git" (for example, if bitmaps are used to calculate the objects to be
+packed), it will pack all blobs regardless of size.
+
 SEE ALSO
 
 linkgit:git-rev-list[1]
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 730fa3d85..46578c05b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -77,6 +77,8 @@ static unsigned long cache_max_small_delta_size = 1000;
 
 static unsigned long window_memory_limit = 0;
 
+static long blob_max_bytes = -1;
+
 /*
  * stats
  */
@@ -1069,17 +1071,73 @@ static const char no_closure_warning[] = N_(
 "disabling bitmap writing, as some objects are not being packed"
 );
 
+struct oversized_blob {
+   struct hashmap_entry entry;
+   struct object_id oid;
+   unsigned long size;
+};
+struct hashmap oversized_blobs;
+
+static int oversized_blob_cmp_fn(const void *b1, const void *b2,
+const void *keydata)
+{
+   const struct object_id *oid2 = keydata ? keydata :
+   &((const struct oversized_blob *) b2)->oid;
+   return oidcmp(&((const struct oversized_blob *) b1)->oid, oid2);
+}
+
+/*
+ * Returns 1 and records this blob in oversized_blobs if the given blob does
+ * not meet any defined blob size limits.  Blobs that appear as a tree entry
+ * whose basename begins with ".git" are never considered oversiz
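
For illustration, here is a sketch of a client consuming the
excluded-blob list documented above (an 8-byte count, then a 20-byte
hash and an 8-byte size per blob, all in network byte order), including
the strictly-increasing check suggested in the review discussion above.
Names and error handling are illustrative only:

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  static int read_be64(FILE *fp, uint64_t *v)
  {
          unsigned char buf[8];
          int i;

          if (fread(buf, 1, 8, fp) != 8)
                  return -1;
          *v = 0;
          for (i = 0; i < 8; i++)
                  *v = (*v << 8) | buf[i];
          return 0;
  }

  static int read_excluded_blobs(FILE *fp)
  {
          unsigned char prev[20], sha1[20];
          uint64_t n, size, i;

          if (read_be64(fp, &n))
                  return -1;
          for (i = 0; i < n; i++) {
                  if (fread(sha1, 1, 20, fp) != 20 || read_be64(fp, &size))
                          return -1;
                  if (i && memcmp(prev, sha1, 20) >= 0)
                          return -1; /* out of order or duplicate */
                  memcpy(prev, sha1, 20);
                  /* record (sha1, size): fixed-width, binary-searchable */
          }
          return 0;
  }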

[WIP v2 1/2] pack-objects: rename want_.* to ignore_.*

2017-06-02 Thread Jonathan Tan
Currently, in pack_objects, add_object_entry() distinguishes between 2
types of non-preferred-base objects:

 (1) objects that should not be in "to_pack" because an option like
 --local or --honor-pack-keep is set
 (2) objects that should be in "to_pack"

A subsequent commit will teach pack-objects to exclude certain objects
(specifically, blobs that are larger than a user-given threshold and are
not potentially a Git special file [1]), but unlike in (1) above, these
exclusions need to be reported. So now we have 3 types of
non-preferred-base objects:

 (a) objects that should not be in "to_pack" and should not be reported
 because an option like --local or --honor-pack-keep is set
 (b) objects that should not be in "to_pack" and should be reported
 because they are blobs that are oversized and non-Git-special
 (c) objects that should be in "to_pack"

add_object_entry() will be taught to distinguish between these 3 types
by the following:

 - renaming want_found_object() and want_object_in_pack() to ignore_.*
   to make it clear that they only check for (a) (this commit)
 - adding a new function to check for (b) and using it within
   add_object_entry() (a subsequent commit)

The alternative would be to retain the names want_found_object() and
want_object_in_pack() and have them handle both the cases (a) and (b),
but that would result in more complicated code.

We also take the opportunity to use the terminology "preferred_base"
instead of "excluded" in these methods. It is true that preferred bases
are not included in the final packfile generation, but at this point in
the code, there is no exclusion taking place - on the contrary, if
something is "excluded", it is in fact guaranteed to be in to_pack.

[1] For the purposes of pack_objects, a blob is a Git special file if it
appears in a to-be-packed tree with a filename beginning with
".git".

Signed-off-by: Jonathan Tan <jonathanta...@google.com>
---
 builtin/pack-objects.c | 56 +-
 1 file changed, 28 insertions(+), 28 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 80439047a..730fa3d85 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -946,12 +946,12 @@ static int have_duplicate_entry(const unsigned char *sha1,
return 1;
 }
 
-static int want_found_object(int exclude, struct packed_git *p)
+static int ignore_found_object(int preferred_base, struct packed_git *p)
 {
-   if (exclude)
-   return 1;
-   if (incremental)
+   if (preferred_base)
return 0;
+   if (incremental)
+   return 1;
 
/*
 * When asked to do --local (do not include an object that appears in a
@@ -969,19 +969,19 @@ static int want_found_object(int exclude, struct packed_git *p)
 */
if (!ignore_packed_keep &&
(!local || !have_non_local_packs))
-   return 1;
+   return 0;
 
if (local && !p->pack_local)
-   return 0;
+   return 1;
if (ignore_packed_keep && p->pack_local && p->pack_keep)
-   return 0;
+   return 1;
 
/* we don't know yet; keep looking for more packs */
return -1;
 }
 
 /*
- * Check whether we want the object in the pack (e.g., we do not want
+ * Check whether we should ignore this object (e.g., we do not want
  * objects found in non-local stores if the "--local" option was used).
  *
  * If the caller already knows an existing pack it wants to take the object
@@ -989,16 +989,16 @@ static int want_found_object(int exclude, struct packed_git *p)
  * function finds if there is any pack that has the object and returns the pack
  * and its offset in these variables.
  */
-static int want_object_in_pack(const unsigned char *sha1,
-			       int exclude,
-			       struct packed_git **found_pack,
-			       off_t *found_offset)
+static int ignore_object(const unsigned char *sha1,
+			 int preferred_base,
+			 struct packed_git **found_pack,
+			 off_t *found_offset)
 {
struct mru_entry *entry;
-   int want;
+   int ignore;
 
-   if (!exclude && local && has_loose_object_nonlocal(sha1))
-   return 0;
+   if (!preferred_base && local && has_loose_object_nonlocal(sha1))
+   return 1;
 
/*
 * If we already know the pack object lives in, start checks from that
@@ -1006,9 +1006,9 @@ static int want_object_in_pack(const unsigned char *sha1,
 * are present we will determine the answer right now.
 */
if (*found_pack) {
-   want = want_found_object(exclude, *found_pack

[WIP v2 0/2] Modifying pack objects to support --blob-max-bytes

2017-06-02 Thread Jonathan Tan
ta if we wanted to?  I have a suspicion that it is easier
> to handle at the receiving end, after the capability exchange
> decides to use this "omit oversized ones" extension, to receive this
> before the packdata begins.  If it is at the end, the receiver needs
> to spawn either unpack-objects or index-objects to cause it write to
> the disk, but when it does so, it has to have a FD open to read from
> their standard output so that the receiver can grab the "remainder"
> that these programs did not process.  "Write to the end of the pack
> stream data to these programs, and process the remainder ourselves"
> is much harder to arrange, I'd suspect.

Yes, we can write at the beginning - done.

> > +/*
> > + * Returns 1 and records this blob in oversized_blobs if the given blob 
> > does
> > + * not meet any defined blob size limits.  Blobs that appear as a tree 
> > entry
> > + * whose basename begins with ".git" are never considered oversized.
> > + *
> > + * If the tree entry corresponding to the given blob is unknown, pass NULL 
> > as
> > + * name. In this case, this function will always return 0 to avoid false
> > + * positives.
> > + */
> > +static int oversized(const unsigned char *sha1, const char *name) {  
>
> Not object_id?

This would require a "(const struct object_id *) sha1" cast in its
caller that potentially violates alignment constraints, as far as I
know, so I used "const unsigned char *" for now. If the object_id
conversion doesn't include this file in the near future, I'll try to
send out a patch for this file.

> When we have two paths to a blob, and the blob is first fed to this
> function with one name that does not begin with ".git", we would
> register it to the hashmap.  And then the other name that begins
> with ".git" is later discovered to point at the same blob, what
> happens?  Would we need to unregister it from the hashmap elsewhere
> in the code?
[snip]
> Ah, is this where the "unregistering" happens?

Yes.

> >  static int add_object_entry(const unsigned char *sha1, enum object_type type,
> > -   const char *name, int exclude)
> > +   const char *name, int preferred_base)  
> 
> This belongs to the previous step that renamed "exclude", no?

Yes - done.

Jonathan Tan (2):
  pack-objects: rename want_.* to ignore_.*
  pack-objects: support --blob-max-bytes

 Documentation/git-pack-objects.txt |  19 -
 builtin/pack-objects.c | 159 ++---
 t/t5300-pack-object.sh |  71 +
 3 files changed, 220 insertions(+), 29 deletions(-)

-- 
2.13.0.506.g27d5fe0cd-goog


