Re: [PATCH v3 1/2] pack-objects: break delta cycles before delta-search phase

2016-08-11 Thread Jeff King
On Thu, Aug 11, 2016 at 01:02:52AM -0400, Jeff King wrote:

> > > + *   2. Updating our size; check_object() will have filled in the size 
> > > of our
> > > + *  delta, but a non-delta object needs it true size.
> > 
> > Excellent point.
> 
> I was not clever enough to think of it; the pack-objects code is filled
> with nice assertions (Thanks, Nico!) that help out when you are stupid. :)
> 
> One thing to be careful of is that there are more things this
> drop_reused_delta() should be doing. But I looked through the rest of
> check_object() and could not find anything else.

Argh, I spoke too soon.

It is true that the size lookup is the only part of check_object()
we skip if we are reusing the delta. But what I didn't notice is that
when we choose to reuse a delta, we overwrite entry->type (the real
type!) with the in_pack_type (OFS_DELTA, etc). We need to undo that so
that later stages see the real type.

I'm not sure how my existing tests worked (I confirmed that they do
indeed break the delta). It may be that only some code paths actually
care about the real type. But when playing with the depth-limit (which
uses the same drop_reused_delta() helper), I managed to make some pretty
broken packs.

So please disregard the v4 patch I just sent; I haven't triggered it,
but I imagine it has the same problem, and I just didn't manage to
trigger it.

I'll clean that up and send out a new series.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 1/2] pack-objects: break delta cycles before delta-search phase

2016-08-10 Thread Jeff King
On Wed, Aug 10, 2016 at 01:17:22PM -0700, Junio C Hamano wrote:

> > Actually, skimming the sha1_file code, I am not 100% sure that we detect
> > cycles in OBJ_REF_DELTA (you cannot have cycles in OBJ_OFS_DELTA since
> > they always point backwards in the pack). But if that is the case, then
> > I think we should fix that, not worry about special-casing it here.
> 
> Yes, but sha1_file.c?  It is the reading side and it is too late if
> we notice a problem, I would think.

We already are covered on the writing side. That is what your code in
write_one() does. The reason to warn is on the reading side ("I fixed
this for you, but by the way, your existing packs were bogus").

But of more concern is whether read_sha1_file() would recurse
infinitely, which would be bad (though I do not think it would be a
feasible attack vector; index-pack already rejects such packs before
they are admitted to the repository).

> > + *   2. Updating our size; check_object() will have filled in the size of 
> > our
> > + *  delta, but a non-delta object needs it true size.
> 
> Excellent point.

I was not clever enough to think of it; the pack-objects code is filled
with nice assertions (Thanks, Nico!) that help out when you are stupid. :)

One thing to be careful of is that there are more things this
drop_reused_delta() should be doing. But I looked through the rest of
check_object() and could not find anything else.

> > +   case DFS_ACTIVE:
> > +   /*
> > +* We found a cycle that needs broken. It would be correct to
> > +* break any link in the chain, but it's convenient to
> > +* break this one.
> > +*/
> > +   drop_reused_delta(entry);
> > +   break;
> > +   }
> > +}
> 
> Do we need to do anything to the DFS state of an entry when
> drop_reused_delta() resets its other fields?

Good catch. It should be marked DONE after we have broken the delta. It
doesn't matter in practice, because...

> If we later find D that is directly based on A, wouldn't we end up
> visiting A and attempt to drop it again?  drop_reused_delta() is
> idempotent so there will be no data structure corruption, I think,
> but we can safely declare that the entry is now DONE after calling
> drop_reused_delta() on it (either in the function or in the caller
> after it calls the function), no?

I think the idempotency of drop_reused_delta() doesn't matter. When we
visit A again later, its "delta" field will be NULL, so we'll hit the
condition at the top of the function: this is a base object, mark DONE
and don't recurse.

So it's correct as-is, but I agree it feels weird that the DFS would end
with some objects potentially marked ACTIVE. Everything should be DONE
at the end.

> > +# Create a pack containing the the tree $1 and blob $1:file, with
> > +# the latter stored as a delta against $2:file.
> > +#
> > +# We convince pack-objects to make the delta in the direction of our 
> > choosing
> > +# by marking $2 as a preferred-base edge. That results in $1:file as a thin
> > +# delta, and index-pack completes it by adding $2:file as a base.
> 
> Tricky but clever and correct ;-)

Thanks, it took a long time to think up. ;)

I actually wish we had better tools for making fake packs. Something
where you could say "add A, then add B as a delta of A, then...".
Because you often have to jump through quite a few hoops to convince
pack-objects to generate the pack you want, and even some things are
impossible (for example, I would like to make a chain of 10 deltas; how
do I convince the delta search to put my objects in the right order?).

I tried briefly yesterday to convince pack-objects to just take a list
of objects and their deltas, but it got ugly very quickly.  I think we'd
be better off writing a new tool that happens to reuse some of the
formatting functions from pack-objects. But even then, we've got to
write an .idx, which means going through index-pack (which will complain
if we are writing bogus packs for testing odd situations), or we have to
keep a valid list of "struct object_entry" to feed to the idx writer.

So even that approach is not quite trivial.

> > +make_pack () {
> > +{
> > +echo "-$(git rev-parse $2)"
> 
> Is everybody's 'echo' happy with dash followed by unknown string?

I'd assume so because it will be "-", and I think echoes which
take options are careful about that.

Still, it would not be hard to tweak.

> > +echo "$(git rev-parse $1:dummy) dummy"
> > +echo "$(git rev-parse $1:file) file"
> > +} |
> > +git pack-objects --stdout |
> > +git index-pack --stdin --fix-thin
> 
> An alternative
> 
>   git pack-objects --stdout <<-EOF |
>   -$(git rev-parse $2)
> $(git rev-parse $1:dummy) dummy
> $(git rev-parse $1:file) file
>   EOF
> git index-pack --stdin --fix-thin
> 
> looks somewhat ugly, though.

Yeah, I think we would be better to just switch to printf if we want to
be 

Re: [PATCH v3 1/2] pack-objects: break delta cycles before delta-search phase

2016-08-10 Thread Junio C Hamano
Jeff King  writes:

> ...
> We could do analysis on any cycles that we find to
> distinguish the two cases (i.e., it is a bogus pack if and
> only if every delta in the cycle is in the same pack), but
> we don't need to. If there is a cycle inside a pack, we'll
> run into problems not only reusing the delta, but accessing
> the object data at all. So when we try to dig up the actual
> size of the object, we'll hit that same cycle and kick in
> our usual complain-and-try-another-source code.

I agree with all of the above reasoning.

> Actually, skimming the sha1_file code, I am not 100% sure that we detect
> cycles in OBJ_REF_DELTA (you cannot have cycles in OBJ_OFS_DELTA since
> they always point backwards in the pack). But if that is the case, then
> I think we should fix that, not worry about special-casing it here.

Yes, but sha1_file.c?  It is the reading side and it is too late if
we notice a problem, I would think.

> +/*
> + * Drop an on-disk delta we were planning to reuse. Naively, this would
> + * just involve blanking out the "delta" field, but we have to deal
> + * with two extra pieces of book-keeping:
> + *
> + *   1. Removing ourselves from the delta_sibling linked list.
> + *
> + *   2. Updating our size; check_object() will have filled in the size of our
> + *  delta, but a non-delta object needs it true size.

Excellent point.

> +/*
> + * Follow the chain of deltas from this entry onward, throwing away any links
> + * that cause us to hit a cycle (as determined by the DFS state flags in
> + * the entries).
> + */
> +static void break_delta_cycles(struct object_entry *entry)
> +{
> + /* If it's not a delta, it can't be part of a cycle. */
> + if (!entry->delta) {
> + entry->dfs_state = DFS_DONE;
> + return;
> + }
> +
> + switch (entry->dfs_state) {
> + case DFS_NONE:
> + /*
> +  * This is the first time we've seen the object. We mark it as
> +  * part of the active potential cycle and recurse.
> +  */
> + entry->dfs_state = DFS_ACTIVE;
> + break_delta_cycles(entry->delta);
> + entry->dfs_state = DFS_DONE;
> + break;
> +
> + case DFS_DONE:
> + /* object already examined, and not part of a cycle */
> + break;
> +
> + case DFS_ACTIVE:
> + /*
> +  * We found a cycle that needs broken. It would be correct to
> +  * break any link in the chain, but it's convenient to
> +  * break this one.
> +  */
> + drop_reused_delta(entry);
> + break;
> + }
> +}

Do we need to do anything to the DFS state of an entry when
drop_reused_delta() resets its other fields?  If we had this and
started from A (read "A--->B" as "A is based on B"):

A--->B--->C--->A

we paint A as ACTIVE, visit B and then C and paint them as active,
and when we visit A for the second time, we drop it (i.e. break the
link between A and B), return and paint C as DONE, return and paint
B as DONE, and leaving A painted as ACTIVE, while the chain is now

 B--->C--->A

If we later find D that is directly based on A, wouldn't we end up
visiting A and attempt to drop it again?  drop_reused_delta() is
idempotent so there will be no data structure corruption, I think,
but we can safely declare that the entry is now DONE after calling
drop_reused_delta() on it (either in the function or in the caller
after it calls the function), no?

> + 2. Picking the next pack to examine based on locality (i.e., where we found
> +something else recently).
> +
> +In this case, we want to make sure that we find the delta versions of A 
> and
> +B and not their base versions. We can do this by putting two blobs in 
> each
> +pack. The first is a "dummy" blob that can only be found in the pack in
> +question.  And then the second is the actual delta we want to find.
> +
> +The two blobs must be present in the same tree, not present in other 
> trees,
> +and the dummy pathname must sort before the delta path.

> +# Create a pack containing the the tree $1 and blob $1:file, with
> +# the latter stored as a delta against $2:file.
> +#
> +# We convince pack-objects to make the delta in the direction of our choosing
> +# by marking $2 as a preferred-base edge. That results in $1:file as a thin
> +# delta, and index-pack completes it by adding $2:file as a base.

Tricky but clever and correct ;-)

> +make_pack () {
> +  {
> +  echo "-$(git rev-parse $2)"

Is everybody's 'echo' happy with dash followed by unknown string?

> +  echo "$(git rev-parse $1:dummy) dummy"
> +  echo "$(git rev-parse $1:file) file"
> +  } |
> +  git pack-objects --stdout |
> +  git index-pack --stdin --fix-thin

An alternative

git pack-objects --stdout <<-EOF |
-$(git rev-parse $2)
$(git rev-parse 

[PATCH v3 1/2] pack-objects: break delta cycles before delta-search phase

2016-08-10 Thread Jeff King
We do not allow cycles in the delta graph of a pack (i.e., A
is a delta of B which is a delta of A) for the obvious
reason that you cannot actually access any of the objects in
such a case.

There's a last-ditch attempt to notice cycles during the
write phase, during which we issue a warning to the user and
write one of the objects out in full. However, this is
"last-ditch" for two reasons:

  1. By this time, it's too late to find another delta for
 the object, so the resulting pack is larger than it
 otherwise could be.

  2. The warning is there because this is something that
 _shouldn't_ ever happen. If it does, then either:

   a. a pack we are reusing deltas from had its own
  cycle

   b. we are reusing deltas from multiple packs, and
  we found a cycle among them (i.e., A is a delta of
  B in one pack, but B is a delta of A in another,
  and we choose to use both deltas).

   c. there is a bug in the delta-search code

 So this code serves as a final check that none of these
 things has happened, warns the user, and prevents us
 from writing a bogus pack.

Right now, (2b) should never happen because of the static
ordering of packs in want_object_in_pack(). If two objects
have a delta relationship, then they must be in the same
pack, and therefore we will find them from that same pack.

However, a future patch would like to change that static
ordering, which will make (2b) a common occurrence. In
preparation, we should be able to handle those kinds of
cycles better. This patch does by introducing a
cycle-breaking step during the get_object_details() phase,
when we are deciding which deltas can be reused. That gives
us the chance to feed the objects into the delta search as
if the cycle did not exist.

We'll leave the detection and warning in the write_object()
phase in place, as it still serves as a check for case (2c).

This does mean we will stop warning for (2a). That case is
caused by bogus input packs, and we ideally would warn the
user about it.  However, since those cycles show up after
picking reusable deltas, they look the same as (2b) to us;
our new code will break the cycles early and the last-ditch
check will never see them.

We could do analysis on any cycles that we find to
distinguish the two cases (i.e., it is a bogus pack if and
only if every delta in the cycle is in the same pack), but
we don't need to. If there is a cycle inside a pack, we'll
run into problems not only reusing the delta, but accessing
the object data at all. So when we try to dig up the actual
size of the object, we'll hit that same cycle and kick in
our usual complain-and-try-another-source code.

Signed-off-by: Jeff King 
---
Actually, skimming the sha1_file code, I am not 100% sure that we detect
cycles in OBJ_REF_DELTA (you cannot have cycles in OBJ_OFS_DELTA since
they always point backwards in the pack). But if that is the case, then
I think we should fix that, not worry about special-casing it here.

 builtin/pack-objects.c  |  83 +
 pack-objects.h  |   9 
 t/t5314-pack-cycle-detection.sh | 113 
 3 files changed, 205 insertions(+)
 create mode 100755 t/t5314-pack-cycle-detection.sh

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 4a63398..3e30eae 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1495,6 +1495,82 @@ static int pack_offset_sort(const void *_a, const void 
*_b)
(a->in_pack_offset > b->in_pack_offset);
 }
 
+/*
+ * Drop an on-disk delta we were planning to reuse. Naively, this would
+ * just involve blanking out the "delta" field, but we have to deal
+ * with two extra pieces of book-keeping:
+ *
+ *   1. Removing ourselves from the delta_sibling linked list.
+ *
+ *   2. Updating our size; check_object() will have filled in the size of our
+ *  delta, but a non-delta object needs it true size.
+ */
+static void drop_reused_delta(struct object_entry *entry)
+{
+   struct object_entry **p = >delta->delta_child;
+   struct pack_window *w_curs = NULL;
+
+   while (*p) {
+   if (*p == entry)
+   *p = (*p)->delta_sibling;
+   else
+   p = &(*p)->delta_sibling;
+   }
+   entry->delta = NULL;
+
+   entry->size = get_size_from_delta(entry->in_pack, _curs,
+ entry->in_pack_offset + entry->in_pack_header_size);
+   unuse_pack(_curs);
+
+   /*
+* If we failed to get the size from this pack for whatever reason,
+* fall back to sha1_object_info, which may find another copy. And if
+* that fails, the error will be recorded in entry->type and dealt
+* with in prepare_pack().
+*/
+   if (entry->size == 0)
+   entry->type = sha1_object_info(entry->idx.sha1, >size);
+}
+
+/*
+ * Follow the chain of