Re: [PATCH 5/5] pack-objects: walk tag chains for --include-tag

2016-09-07 Thread Jeff King
On Wed, Sep 07, 2016 at 11:49:28AM -0700, Junio C Hamano wrote:

> Jeff King writes:
> 
> > As explained further in the commit message, "fetch" is robust to this,
> > because it does a real connectivity check and follow-on fetch before
> > writing anything it thinks it got via include-tag. So perhaps one could
> > argue that pack-objects is correct; include-tag is best-effort, and it
> > is the client's job to make sure it has everything it needs. And that
> > would mean the bug is in git-clone, which should be doing the
> > connectivity check and follow-on fetch.
> 
> I think that is probably a more technically correct interpretation
> of the history.
> 
> I think upgrading "best-effort" to "guarantee" like you did is a
> right approach nevertheless.  I think the "best-effort" we initially
> did was merely us being lazy.

Yeah, after sleeping on it, the conclusion I came to was that it does
not _hurt_ to have include-tag be a bit more careful.

I also wondered about the corner case I noted in the commit message.  If
you have a tag chain of A->B->C, and you already have "C" (a commit),
but are fetching "B" (a tag), then include-tag does not notice "A".

That's OK for git-fetch. It will collect "A" during its backfill phase
(not because of "B" at all, but because it knows that "A" eventually
peels to "C", which it already has). "git-clone" does not have a
backfill, of course. But neither can it "already have" a commit. So
either we get "C" as part of the clone (in which case include-tag will
include "A"), or it does not (in which case we cannot be getting "B"
either, because "C" is reachable from it).

And of course that's only when single-branch is in use. Normally
git-clone just grabs all the tags blindly. :)

So I think everything Just Works after my patch, though we do still rely
on fetch backfill to pick up some obscure cases.

-Peff


Re: [PATCH 5/5] pack-objects: walk tag chains for --include-tag

2016-09-07 Thread Junio C Hamano
Jeff King writes:

> As explained further in the commit message, "fetch" is robust to this,
> because it does a real connectivity check and follow-on fetch before
> writing anything it thinks it got via include-tag. So perhaps one could
> argue that pack-objects is correct; include-tag is best-effort, and it
> is the client's job to make sure it has everything it needs. And that
> would mean the bug is in git-clone, which should be doing the
> connectivity check and follow-on fetch.

I think that is probably a more technically correct interpretation
of the history.

I think upgrading "best-effort" to "guarantee" like you did is a
right approach nevertheless.  I think the "best-effort" we initially
did was merely us being lazy.


Re: [PATCH 5/5] pack-objects: walk tag chains for --include-tag

2016-09-05 Thread Jeff King
On Mon, Sep 05, 2016 at 05:52:26PM -0400, Jeff King wrote:

> When pack-objects is given --include-tag, it peels each tag
> ref down to a non-tag object, and if that non-tag object is
> going to be packed, we include the tag, too. But what
> happens if we have a chain of tags (e.g., tag "A" points to
> tag "B", which points to commit "C")?
> 
> We'll peel down to "C" and realize that we want to include
> tag "A", but we do not ever consider tag "B", leading to a
> broken pack (assuming "B" was not otherwise selected).
> Instead, we have to walk the whole chain, adding any tags we
> find to the pack.

So technically one might argue that this pack isn't "broken", in that it
_is_ a valid pack. It's just that it doesn't contain what the user was
asking for.

As explained further in the commit message, "fetch" is robust to this,
because it does a real connectivity check and follow-on fetch before
writing anything it thinks it got via include-tag. So perhaps one could
argue that pack-objects is correct; include-tag is best-effort, and it
is the client's job to make sure it has everything it needs. And that
would mean the bug is in git-clone, which should be doing the
connectivity check and follow-on fetch.

I dunno. This seems like the most elegant place to fix it, though it
does mean that pack-objects will go to some slight extra work when
auto-packing a tag (it has to parse the tag to find out whether it's a
chain). I doubt it matters much in practice.

-Peff


[PATCH 5/5] pack-objects: walk tag chains for --include-tag

2016-09-05 Thread Jeff King
When pack-objects is given --include-tag, it peels each tag
ref down to a non-tag object, and if that non-tag object is
going to be packed, we include the tag, too. But what
happens if we have a chain of tags (e.g., tag "A" points to
tag "B", which points to commit "C")?

We'll peel down to "C" and realize that we want to include
tag "A", but we do not ever consider tag "B", leading to a
broken pack (assuming "B" was not otherwise selected).
Instead, we have to walk the whole chain, adding any tags we
find to the pack.
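
To make the failure mode concrete, here is a minimal toy model of the
selection logic, not the actual pack-objects code. The struct and every
function name in it (peel, include_tag_old, include_tag_fixed,
pack_is_closed) are hypothetical and exist only for illustration. It
contrasts adding just the outer tag with walking the whole chain.

```c
#include <assert.h>
#include <stddef.h>

/* Toy object model (hypothetical): a tag points at one object, a commit at none. */
enum obj_type { OBJ_COMMIT, OBJ_TAG };

struct obj {
	const char *name;
	enum obj_type type;
	struct obj *tagged;	/* target of a tag, NULL for a commit */
	int packed;		/* 1 if selected for the pack */
};

/* Follow tag links until we reach a non-tag object. */
static struct obj *peel(struct obj *o)
{
	while (o->type == OBJ_TAG) {
		assert(o->tagged != NULL);
		o = o->tagged;
	}
	return o;
}

/* Old behavior: if the peeled object is packed, add only the outer tag. */
static void include_tag_old(struct obj *ref_tip)
{
	if (ref_tip->type == OBJ_TAG && peel(ref_tip)->packed)
		ref_tip->packed = 1;
}

/* Fixed behavior: once we decide to include the tag, add every tag in the chain. */
static void include_tag_fixed(struct obj *ref_tip)
{
	struct obj *o;

	if (ref_tip->type != OBJ_TAG || !peel(ref_tip)->packed)
		return;
	for (o = ref_tip; o->type == OBJ_TAG; o = o->tagged)
		o->packed = 1;
}

/* In this model a pack is usable if every packed tag's target is packed too. */
static int pack_is_closed(struct obj **objs, int n)
{
	for (int i = 0; i < n; i++)
		if (objs[i]->packed && objs[i]->type == OBJ_TAG &&
		    !objs[i]->tagged->packed)
			return 0;
	return 1;
}
```

With A->B->C and only "C" packed, include_tag_old() on "A" leaves "B"
out, so pack_is_closed() reports 0; include_tag_fixed() marks both tags
and the check passes.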

Interestingly, it doesn't seem possible to trigger this
problem with "git fetch", but you can with "git clone
--single-branch". The reason is that we generate the correct
pack when the client explicitly asks for "A" (because we do
a real reachability analysis there), and "fetch" is more
willing to do so. There are basically two cases:

  1. If "C" is already a ref tip, then the client can deduce
     that it needs "A" itself (via find_non_local_tags), and
     will ask for it explicitly rather than relying on the
     include-tag capability. Everything works.

  2. If "C" is not already a ref tip, then we hope for
     include-tag to send us the correct tag. But it doesn't;
     it generates a broken pack. However, the next step is
     to do a follow-up run of find_non_local_tags(),
     followed by fetch_refs() to backfill any tags we
     learned about.

     In the normal case, fetch_refs() calls quickfetch(),
     which does a connectivity check and sees we have no
     new objects to fetch. We just write the refs.

     But for the broken-pack case, the connectivity check
     fails, and quickfetch will follow-up with the remote,
     asking explicitly for each of the ref tips. This picks
     up the missing object in a new pack.
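
The control flow of case 2 can be sketched roughly as below. This is a
simplified model under stated assumptions, not the code in
builtin/fetch.c: is_connected(), fetch_explicitly() and backfill() are
hypothetical stand-ins for the connectivity check, the explicit
follow-up request, and the backfill loop, and each "ref tip" is reduced
to a simple chain of objects.

```c
#include <assert.h>
#include <stddef.h>

/* Toy object (hypothetical): one link in a tag chain. */
struct obj {
	struct obj *tagged;	/* next object in the chain, NULL at the commit */
	int present;		/* 1 if the object exists in the local repository */
};

/* Stand-in connectivity check: everything reachable from the tip must be present. */
static int is_connected(const struct obj *tip)
{
	for (; tip; tip = tip->tagged)
		if (!tip->present)
			return 0;
	return 1;
}

/*
 * Stand-in for asking the remote for the tip explicitly. The assumption
 * is that the server then does a real reachability walk and sends every
 * object we lack.
 */
static void fetch_explicitly(struct obj *tip)
{
	for (; tip; tip = tip->tagged)
		tip->present = 1;
}

/* Backfill each tag tip; returns how many follow-up fetches were needed. */
static int backfill(struct obj **tips, size_t n)
{
	int followups = 0;

	assert(tips != NULL);
	for (size_t i = 0; i < n; i++) {
		if (is_connected(tips[i]))
			continue;	/* normal case: nothing to fetch, just write the ref */
		fetch_explicitly(tips[i]);	/* broken-pack case */
		followups++;
	}
	return followups;
}
```

If include-tag delivered "A" and "C" but not "B", the first backfill()
call sees the gap and performs one follow-up fetch; a second call finds
everything connected and does nothing.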

For a regular "git clone", we are similarly OK, because we
explicitly request all of the tag refs, and get a correct
pack. But with "--single-branch", we kick in tag
auto-following via "include-tag", but do _not_ do a
follow-up backfill. We just take whatever the server sent us
via include-tag and write out tag refs for any tag objects
we were sent. So prior to c6807a4 (clone: open a shortcut
for connectivity check, 2013-05-26), we actually claimed the
clone was a success, but the result was silently
corrupted!  Since c6807a4, index-pack's connectivity
check catches this case, and we correctly complain.

The included test directly checks that pack-objects does not
generate a broken pack, but also confirms that "clone
--single-branch" does not hit the bug.

Note that tag chains introduce another interesting question:
if we are packing the tag "B" but not the commit "C", should
"A" be included?

Both before and after this patch, we do not include "A",
because the initial peel_ref() check only knows about the
bottom-most level, "C". To realize that "B" is involved at
all, we would have to switch to an incremental peel, in
which we examine each tagged object, asking if it is being
packed (and including the outer tag if so).
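
The difference between the two selection strategies can be illustrated
with another small toy model. Both function names are hypothetical;
wants_tag_bottom_only() only imitates the effect of looking at the
fully peeled value, and neither function reflects how peel_ref() is
actually implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Toy object (hypothetical): a tag has a target, a commit has none. */
struct obj {
	struct obj *tagged;	/* NULL for the non-tag object at the bottom */
	int packed;		/* 1 if the object is going into the pack */
};

/* Bottom-only selection: consult only the final, fully peeled object. */
static int wants_tag_bottom_only(const struct obj *tag)
{
	const struct obj *o = tag;

	assert(tag != NULL);
	while (o->tagged)
		o = o->tagged;
	return o->packed;
}

/* Incremental peel: ask about every tagged object along the chain. */
static int wants_tag_incremental(const struct obj *tag)
{
	assert(tag != NULL);
	for (const struct obj *o = tag->tagged; o; o = o->tagged)
		if (o->packed)
			return 1;
	return 0;
}
```

When "B" is packed but "C" is not, the bottom-only check rejects "A"
while the incremental check would accept it. The incremental version
has to look inside each tag object, which is the cost discussed next.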

But that runs contrary to the optimizations in peel_ref(),
which avoid accessing the objects at all, in favor of using
the value we pull from packed-refs. It's OK to walk the
whole chain once we know we're going to include the tag (we
have to access it anyway, so the effort is proportional to
the pack we're generating). But for the initial selection,
we have to look at every ref. If we're only packing a few
objects, we'd still have to parse every single referenced
tag object just to confirm that it isn't part of a tag
chain.

This could be addressed if packed-refs stored the complete
tag chain for each peeled ref (in most cases, this would be
the same cost as now, as each "chain" is only a single
link). But given the size of that project, it's out of scope
for this fix (and probably nobody cares enough anyway, as
it's such an obscure situation). This commit limits itself
to just avoiding the creation of a broken pack.

Signed-off-by: Jeff King 
---
 builtin/pack-objects.c | 31 +-
 t/t5305-include-tag.sh | 52 ++
 2 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 4a63398..0954375 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2123,6 +2123,35 @@ static void ll_find_deltas(struct object_entry **list, unsigned list_size,
 #define ll_find_deltas(l, s, w, d, p)  find_deltas(l, &s, w, d, p)
 #endif
 
+static void add_tag_chain(const struct object_id *oid)
+{
+	struct tag *tag;
+
+	/*
+	 * We catch duplicates already in add_object_entry(), but we'd
+	 * prefer to do this extra check to avoid having to parse the
+	 * tag at all if we already know that it's being packed (e.g., if
+	 * it was included via bitmaps, we would not have parsed it
+	 * previously).
+	 */
+	if (packlist_find(&to_pack, oid->hash, NULL))
+		return;
+