Re: [PATCH 1/3] revision: unify {tree,blob}_objects in rev_info

2017-03-02 Thread Junio C Hamano
Jeff King  writes:

> On Tue, Feb 28, 2017 at 01:42:44PM -0800, Junio C Hamano wrote:
>
>> Jonathan Tan  writes:
>> 
>> > It could be argued that in the future, Git might need to distinguish
>> > tree_objects from blob_objects - in particular, a user might want
>> > rev-list to print the trees but not the blobs. 
>> 
>> That was exactly why these bits were originally made to "appear
>> independent but in practice nobody sets only one and leaves others
>> off".  
>> 
>> And it didn't happen in the past 10 years, which tells us that we
>> should take this patch.
>
> I actually have a patch which uses the distinction. It's for
> upload-archive doing reachability checks (which seems rather familiar to
> what's going on here).

OK.  Thanks for stopping me ;-)


Re: [PATCH 1/3] revision: unify {tree,blob}_objects in rev_info

2017-02-28 Thread Jeff King
On Fri, Feb 24, 2017 at 05:18:36PM -0800, Jonathan Tan wrote:

> Whenever tree_objects is set to 1 in revision.h's struct rev_info,
> blob_objects is likewise set, and vice versa. Combine those two fields
> into one.
> 
> Some of the existing code does not handle tree_objects being different
> from blob_objects properly. For example, "handle_commit" in revision.c
> recurses from an UNINTERESTING tree into its subtree if tree_objects ==
> 1, completely ignoring blob_objects; it probably should still recurse if
> tree_objects == 0 and blob_objects == 1 (to mark the blobs), and should
> behave differently depending on blob_objects (controlling the
> instantiation and marking of blob objects). This commit resolves the
> issue by forbidding tree_objects from being different to blob_objects.

Yeah, I agree that is awkward. I'm OK with the rule "if blob_objects is
set, then tree_objects must also be set". It's the other way around I
care more about.

> It could be argued that in the future, Git might need to distinguish
> tree_objects from blob_objects - in particular, a user might want
> rev-list to print the trees but not the blobs. However, this results in
> a minor performance savings at best in that objects no longer need to be
> instantiated (causing memory allocations and hashtable insertions) - no
> disk reads are being done for objects whether blob_objects is set or
> not.

In a full object-graph traversal, we actually spend a big chunk of our
time in hash lookups. My measurements (admittedly from 2013, which I
haven't repeated lately) show something like a 20-25% speedup for this
case.

My only use for it (and the source of those timings) was to compute
archive reachability, which nobody seems to care too much about. But I
suspect we could speed up your case, too, when we are just computing the
reachability of a non-blob. I.e., you should be able to turn on the
smallest subset of "commits only", "commits and trees", and "commits,
trees, and blobs", based on what the other side has asked for.

-Peff


Re: [PATCH 1/3] revision: unify {tree,blob}_objects in rev_info

2017-02-28 Thread Jeff King
On Tue, Feb 28, 2017 at 01:42:44PM -0800, Junio C Hamano wrote:

> Jonathan Tan  writes:
> 
> > It could be argued that in the future, Git might need to distinguish
> > tree_objects from blob_objects - in particular, a user might want
> > rev-list to print the trees but not the blobs. 
> 
> That was exactly why these bits were originally made to "appear
> independent but in practice nobody sets only one and leaves others
> off".  
> 
> And it didn't happen in the past 10 years, which tells us that we
> should take this patch.

I actually have a patch which uses the distinction. It's for
upload-archive doing reachability checks (which seems rather familiar to
what's going on here).

The whole series (from 2013!) is at:

  git://github.com/peff/git jk/archive-reachability

but the relevant commits are below.

I don't think the same logic holds for this case, though, because
somebody actually can ask for a single blob.

-- >8 --
From: Jeff King 
Date: Wed, 5 Jun 2013 17:57:02 -0400
Subject: [PATCH] list-objects: optimize "revs->blob_objects = 0" case

If we are traversing trees during a "--objects"
traversal, we may skip blobs if the "blob_objects" field of
rev_info is not set. But we do so as the first thing in
process_blob(), only after we have actually created the
"struct blob" object, incurring a hash lookup. We can
optimize out this no-op call completely.

This does not actually affect any current code, as all of
the current traversals always set blob_objects when looking
at objects, anyway.

Signed-off-by: Jeff King 
---
 list-objects.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/list-objects.c b/list-objects.c
index f3ca6aafb..58ad69557 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -117,7 +117,7 @@ static void process_tree(struct rev_info *revs,
process_gitlink(revs, entry.oid->hash,
show, base, entry.path,
cb_data);
-   else
+   else if (revs->blob_objects)
process_blob(revs,
 lookup_blob(entry.oid->hash),
 show, base, entry.path,
-- 
2.12.0.359.gd4c8c42e9

-- >8 --
From: Jeff King 
Date: Wed, 5 Jun 2013 18:02:42 -0400
Subject: [PATCH] archive: ignore blob objects when checking reachability

We cannot create an archive from a blob object, so we would
not expect anyone to provide one to us. And if they do, we
will fail anyway just after the reachability check.  We can
therefore optimize our reachability check to ignore blobs
completely, and not even create a "struct blob" for them.

Depending on the repository size and the exact place we find
the reachable object in the traversal, this can save 20-25%,
a we can avoid many lookups in the object hash.

The downside of this is that a blob provided to a remote
archive process will fail with "no such object" rather than
"object is not a tree" (we could organize the code to retain
the old message, but since we no longer know whether the
blob is reachable or not, we would potentially be leaking
information about the existence of unreachable objects).

Signed-off-by: Jeff King 
---
 archive.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/archive.c b/archive.c
index ef89b2556..489115f9f 100644
--- a/archive.c
+++ b/archive.c
@@ -383,6 +383,7 @@ static int object_is_reachable(const unsigned char *sha1)
save_commit_buffer = 0;
init_revisions(, NULL);
setup_revisions(ARRAY_SIZE(argv) - 1, argv, , NULL);
+   data.revs.blob_objects = 0;
if (prepare_revision_walk())
return 0;
 
-- 
2.12.0.359.gd4c8c42e9



Re: [PATCH 1/3] revision: unify {tree,blob}_objects in rev_info

2017-02-28 Thread Junio C Hamano
Jonathan Tan  writes:

> It could be argued that in the future, Git might need to distinguish
> tree_objects from blob_objects - in particular, a user might want
> rev-list to print the trees but not the blobs. 

That was exactly why these bits were originally made to "appear
independent but in practice nobody sets only one and leaves others
off".  

And it didn't happen in the past 10 years, which tells us that we
should take this patch.

Thanks.



[PATCH 1/3] revision: unify {tree,blob}_objects in rev_info

2017-02-24 Thread Jonathan Tan
Whenever tree_objects is set to 1 in revision.h's struct rev_info,
blob_objects is likewise set, and vice versa. Combine those two fields
into one.

Some of the existing code does not handle tree_objects being different
from blob_objects properly. For example, "handle_commit" in revision.c
recurses from an UNINTERESTING tree into its subtree if tree_objects ==
1, completely ignoring blob_objects; it probably should still recurse if
tree_objects == 0 and blob_objects == 1 (to mark the blobs), and should
behave differently depending on blob_objects (controlling the
instantiation and marking of blob objects). This commit resolves the
issue by forbidding tree_objects from being different to blob_objects.

It could be argued that in the future, Git might need to distinguish
tree_objects from blob_objects - in particular, a user might want
rev-list to print the trees but not the blobs. However, this results in
a minor performance savings at best in that objects no longer need to be
instantiated (causing memory allocations and hashtable insertions) - no
disk reads are being done for objects whether blob_objects is set or
not.

Signed-off-by: Jonathan Tan 
---
 bisect.c|  2 +-
 builtin/rev-list.c  |  6 +++---
 list-objects.c  |  4 ++--
 pack-bitmap-write.c |  3 +--
 pack-bitmap.c   |  3 +--
 reachable.c |  3 +--
 revision.c  | 16 ++--
 revision.h  |  3 +--
 8 files changed, 16 insertions(+), 24 deletions(-)

diff --git a/bisect.c b/bisect.c
index 8e63c40d2..265b32905 100644
--- a/bisect.c
+++ b/bisect.c
@@ -634,7 +634,7 @@ static void bisect_common(struct rev_info *revs)
 {
if (prepare_revision_walk(revs))
die("revision walk setup failed");
-   if (revs->tree_objects)
+   if (revs->tree_and_blob_objects)
mark_edges_uninteresting(revs, NULL);
 }
 
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 0aa93d589..6c2651b31 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -345,7 +345,7 @@ int cmd_rev_list(int argc, const char **argv, const char 
*prefix)
revs.commit_format = CMIT_FMT_RAW;
 
if ((!revs.commits &&
-(!(revs.tag_objects || revs.tree_objects || revs.blob_objects) &&
+(!(revs.tag_objects || revs.tree_and_blob_objects) &&
  !revs.pending.nr)) ||
revs.diff)
usage(rev_list_usage);
@@ -374,7 +374,7 @@ int cmd_rev_list(int argc, const char **argv, const char 
*prefix)
return 0;
}
} else if (revs.max_count < 0 &&
-  revs.tag_objects && revs.tree_objects && 
revs.blob_objects) {
+  revs.tag_objects && revs.tree_and_blob_objects) {
if (!prepare_bitmap_walk()) {
traverse_bitmap_commit_list(_object_fast);
return 0;
@@ -384,7 +384,7 @@ int cmd_rev_list(int argc, const char **argv, const char 
*prefix)
 
if (prepare_revision_walk())
die("revision walk setup failed");
-   if (revs.tree_objects)
+   if (revs.tree_and_blob_objects)
mark_edges_uninteresting(, show_edge);
 
if (bisect_list) {
diff --git a/list-objects.c b/list-objects.c
index f3ca6aafb..796957105 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -18,7 +18,7 @@ static void process_blob(struct rev_info *revs,
struct object *obj = >object;
size_t pathlen;
 
-   if (!revs->blob_objects)
+   if (!revs->tree_and_blob_objects)
return;
if (!obj)
die("bad blob object");
@@ -78,7 +78,7 @@ static void process_tree(struct rev_info *revs,
all_entries_interesting: entry_not_interesting;
int baselen = base->len;
 
-   if (!revs->tree_objects)
+   if (!revs->tree_and_blob_objects)
return;
if (!obj)
die("bad tree object");
diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index 970559601..5080e276b 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -258,8 +258,7 @@ void bitmap_writer_build(struct packing_data *to_pack)
 
init_revisions(, NULL);
revs.tag_objects = 1;
-   revs.tree_objects = 1;
-   revs.blob_objects = 1;
+   revs.tree_and_blob_objects = 1;
revs.no_walk = 0;
 
revs.include_check = should_include;
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 39bcc1684..445a24e0d 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -951,8 +951,7 @@ void test_bitmap_walk(struct rev_info *revs)
die("Commit %s doesn't have an indexed bitmap", 
oid_to_hex(>oid));
 
revs->tag_objects = 1;
-   revs->tree_objects = 1;
-   revs->blob_objects = 1;
+   revs->tree_and_blob_objects = 1;
 
result_popcnt = bitmap_popcount(result);
 
diff --git a/reachable.c