Re: [PATCH v3 1/2] pack-objects: break delta cycles before delta-search phase
On Thu, Aug 11, 2016 at 01:02:52AM -0400, Jeff King wrote: > > > + * 2. Updating our size; check_object() will have filled in the size > > > of our > > > + * delta, but a non-delta object needs it true size. > > > > Excellent point. > > I was not clever enough to think of it; the pack-objects code is filled > with nice assertions (Thanks, Nico!) that help out when you are stupid. :) > > One thing to be careful of is that there are more things this > drop_reused_delta() should be doing. But I looked through the rest of > check_object() and could not find anything else. Argh, I spoke too soon. It is true that the size lookup is the only part of check_object() we skip if we are reusing the delta. But what I didn't notice is that when we choose to reuse a delta, we overwrite entry->type (the real type!) with the in_pack_type (OFS_DELTA, etc). We need to undo that so that later stages see the real type. I'm not sure how my existing tests worked (I confirmed that they do indeed break the delta). It may be that only some code paths actually care about the real type. But when playing with the depth-limit (which uses the same drop_reused_delta() helper), I managed to make some pretty broken packs. So please disregard the v4 patch I just sent; I haven't triggered it, but I imagine it has the same problem, and I just didn't manage to trigger it. I'll clean that up and send out a new series. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 1/2] pack-objects: break delta cycles before delta-search phase
On Wed, Aug 10, 2016 at 01:17:22PM -0700, Junio C Hamano wrote: > > Actually, skimming the sha1_file code, I am not 100% sure that we detect > > cycles in OBJ_REF_DELTA (you cannot have cycles in OBJ_OFS_DELTA since > > they always point backwards in the pack). But if that is the case, then > > I think we should fix that, not worry about special-casing it here. > > Yes, but sha1_file.c? It is the reading side and it is too late if > we notice a problem, I would think. We already are covered on the writing side. That is what your code in write_one() does. The reason to warn is on the reading side ("I fixed this for you, but by the way, your existing packs were bogus"). But of more concern is whether read_sha1_file() would recurse infinitely, which would be bad (though I do not think it would be a feasible attack vector; index-pack already rejects such packs before they are admitted to the repository). > > + * 2. Updating our size; check_object() will have filled in the size of > > our > > + * delta, but a non-delta object needs it true size. > > Excellent point. I was not clever enough to think of it; the pack-objects code is filled with nice assertions (Thanks, Nico!) that help out when you are stupid. :) One thing to be careful of is that there are more things this drop_reused_delta() should be doing. But I looked through the rest of check_object() and could not find anything else. > > + case DFS_ACTIVE: > > + /* > > +* We found a cycle that needs broken. It would be correct to > > +* break any link in the chain, but it's convenient to > > +* break this one. > > +*/ > > + drop_reused_delta(entry); > > + break; > > + } > > +} > > Do we need to do anything to the DFS state of an entry when > drop_reused_delta() resets its other fields? Good catch. It should be marked DONE after we have broken the delta. It doesn't matter in practice, because... > If we later find D that is directly based on A, wouldn't we end up > visiting A and attempt to drop it again? drop_reused_delta() is > idempotent so there will be no data structure corruption, I think, > but we can safely declare that the entry is now DONE after calling > drop_reused_delta() on it (either in the function or in the caller > after it calls the function), no? I think the idempotency of drop_reused_delta() doesn't matter. When we visit A again later, its "delta" field will be NULL, so we'll hit the condition at the top of the function: this is a base object, mark DONE and don't recurse. So it's correct as-is, but I agree it feels weird that the DFS would end with some objects potentially marked ACTIVE. Everything should be DONE at the end. > > +# Create a pack containing the the tree $1 and blob $1:file, with > > +# the latter stored as a delta against $2:file. > > +# > > +# We convince pack-objects to make the delta in the direction of our > > choosing > > +# by marking $2 as a preferred-base edge. That results in $1:file as a thin > > +# delta, and index-pack completes it by adding $2:file as a base. > > Tricky but clever and correct ;-) Thanks, it took a long time to think up. ;) I actually wish we had better tools for making fake packs. Something where you could say "add A, then add B as a delta of A, then...". Because you often have to jump through quite a few hoops to convince pack-objects to generate the pack you want, and even some things are impossible (for example, I would like to make a chain of 10 deltas; how do I convince the delta search to put my objects in the right order?). I tried briefly yesterday to convince pack-objects to just take a list of objects and their deltas, but it got ugly very quickly. I think we'd be better off writing a new tool that happens to reuse some of the formatting functions from pack-objects. But even then, we've got to write an .idx, which means going through index-pack (which will complain if we are writing bogus packs for testing odd situations), or we have to keep a valid list of "struct object_entry" to feed to the idx writer. So even that approach is not quite trivial. > > +make_pack () { > > +{ > > +echo "-$(git rev-parse $2)" > > Is everybody's 'echo' happy with dash followed by unknown string? I'd assume so because it will be "-", and I think echoes which take options are careful about that. Still, it would not be hard to tweak. > > +echo "$(git rev-parse $1:dummy) dummy" > > +echo "$(git rev-parse $1:file) file" > > +} | > > +git pack-objects --stdout | > > +git index-pack --stdin --fix-thin > > An alternative > > git pack-objects --stdout <<-EOF | > -$(git rev-parse $2) > $(git rev-parse $1:dummy) dummy > $(git rev-parse $1:file) file > EOF > git index-pack --stdin --fix-thin > > looks somewhat ugly, though. Yeah, I think we would be better to just switch to printf if we want to be
Re: [PATCH v3 1/2] pack-objects: break delta cycles before delta-search phase
Jeff Kingwrites: > ... > We could do analysis on any cycles that we find to > distinguish the two cases (i.e., it is a bogus pack if and > only if every delta in the cycle is in the same pack), but > we don't need to. If there is a cycle inside a pack, we'll > run into problems not only reusing the delta, but accessing > the object data at all. So when we try to dig up the actual > size of the object, we'll hit that same cycle and kick in > our usual complain-and-try-another-source code. I agree with all of the above reasoning. > Actually, skimming the sha1_file code, I am not 100% sure that we detect > cycles in OBJ_REF_DELTA (you cannot have cycles in OBJ_OFS_DELTA since > they always point backwards in the pack). But if that is the case, then > I think we should fix that, not worry about special-casing it here. Yes, but sha1_file.c? It is the reading side and it is too late if we notice a problem, I would think. > +/* > + * Drop an on-disk delta we were planning to reuse. Naively, this would > + * just involve blanking out the "delta" field, but we have to deal > + * with two extra pieces of book-keeping: > + * > + * 1. Removing ourselves from the delta_sibling linked list. > + * > + * 2. Updating our size; check_object() will have filled in the size of our > + * delta, but a non-delta object needs it true size. Excellent point. > +/* > + * Follow the chain of deltas from this entry onward, throwing away any links > + * that cause us to hit a cycle (as determined by the DFS state flags in > + * the entries). > + */ > +static void break_delta_cycles(struct object_entry *entry) > +{ > + /* If it's not a delta, it can't be part of a cycle. */ > + if (!entry->delta) { > + entry->dfs_state = DFS_DONE; > + return; > + } > + > + switch (entry->dfs_state) { > + case DFS_NONE: > + /* > + * This is the first time we've seen the object. We mark it as > + * part of the active potential cycle and recurse. > + */ > + entry->dfs_state = DFS_ACTIVE; > + break_delta_cycles(entry->delta); > + entry->dfs_state = DFS_DONE; > + break; > + > + case DFS_DONE: > + /* object already examined, and not part of a cycle */ > + break; > + > + case DFS_ACTIVE: > + /* > + * We found a cycle that needs broken. It would be correct to > + * break any link in the chain, but it's convenient to > + * break this one. > + */ > + drop_reused_delta(entry); > + break; > + } > +} Do we need to do anything to the DFS state of an entry when drop_reused_delta() resets its other fields? If we had this and started from A (read "A--->B" as "A is based on B"): A--->B--->C--->A we paint A as ACTIVE, visit B and then C and paint them as active, and when we visit A for the second time, we drop it (i.e. break the link between A and B), return and paint C as DONE, return and paint B as DONE, and leaving A painted as ACTIVE, while the chain is now B--->C--->A If we later find D that is directly based on A, wouldn't we end up visiting A and attempt to drop it again? drop_reused_delta() is idempotent so there will be no data structure corruption, I think, but we can safely declare that the entry is now DONE after calling drop_reused_delta() on it (either in the function or in the caller after it calls the function), no? > + 2. Picking the next pack to examine based on locality (i.e., where we found > +something else recently). > + > +In this case, we want to make sure that we find the delta versions of A > and > +B and not their base versions. We can do this by putting two blobs in > each > +pack. The first is a "dummy" blob that can only be found in the pack in > +question. And then the second is the actual delta we want to find. > + > +The two blobs must be present in the same tree, not present in other > trees, > +and the dummy pathname must sort before the delta path. > +# Create a pack containing the the tree $1 and blob $1:file, with > +# the latter stored as a delta against $2:file. > +# > +# We convince pack-objects to make the delta in the direction of our choosing > +# by marking $2 as a preferred-base edge. That results in $1:file as a thin > +# delta, and index-pack completes it by adding $2:file as a base. Tricky but clever and correct ;-) > +make_pack () { > + { > + echo "-$(git rev-parse $2)" Is everybody's 'echo' happy with dash followed by unknown string? > + echo "$(git rev-parse $1:dummy) dummy" > + echo "$(git rev-parse $1:file) file" > + } | > + git pack-objects --stdout | > + git index-pack --stdin --fix-thin An alternative git pack-objects --stdout <<-EOF | -$(git rev-parse $2) $(git rev-parse
[PATCH v3 1/2] pack-objects: break delta cycles before delta-search phase
We do not allow cycles in the delta graph of a pack (i.e., A is a delta of B which is a delta of A) for the obvious reason that you cannot actually access any of the objects in such a case. There's a last-ditch attempt to notice cycles during the write phase, during which we issue a warning to the user and write one of the objects out in full. However, this is "last-ditch" for two reasons: 1. By this time, it's too late to find another delta for the object, so the resulting pack is larger than it otherwise could be. 2. The warning is there because this is something that _shouldn't_ ever happen. If it does, then either: a. a pack we are reusing deltas from had its own cycle b. we are reusing deltas from multiple packs, and we found a cycle among them (i.e., A is a delta of B in one pack, but B is a delta of A in another, and we choose to use both deltas). c. there is a bug in the delta-search code So this code serves as a final check that none of these things has happened, warns the user, and prevents us from writing a bogus pack. Right now, (2b) should never happen because of the static ordering of packs in want_object_in_pack(). If two objects have a delta relationship, then they must be in the same pack, and therefore we will find them from that same pack. However, a future patch would like to change that static ordering, which will make (2b) a common occurrence. In preparation, we should be able to handle those kinds of cycles better. This patch does by introducing a cycle-breaking step during the get_object_details() phase, when we are deciding which deltas can be reused. That gives us the chance to feed the objects into the delta search as if the cycle did not exist. We'll leave the detection and warning in the write_object() phase in place, as it still serves as a check for case (2c). This does mean we will stop warning for (2a). That case is caused by bogus input packs, and we ideally would warn the user about it. However, since those cycles show up after picking reusable deltas, they look the same as (2b) to us; our new code will break the cycles early and the last-ditch check will never see them. We could do analysis on any cycles that we find to distinguish the two cases (i.e., it is a bogus pack if and only if every delta in the cycle is in the same pack), but we don't need to. If there is a cycle inside a pack, we'll run into problems not only reusing the delta, but accessing the object data at all. So when we try to dig up the actual size of the object, we'll hit that same cycle and kick in our usual complain-and-try-another-source code. Signed-off-by: Jeff King--- Actually, skimming the sha1_file code, I am not 100% sure that we detect cycles in OBJ_REF_DELTA (you cannot have cycles in OBJ_OFS_DELTA since they always point backwards in the pack). But if that is the case, then I think we should fix that, not worry about special-casing it here. builtin/pack-objects.c | 83 + pack-objects.h | 9 t/t5314-pack-cycle-detection.sh | 113 3 files changed, 205 insertions(+) create mode 100755 t/t5314-pack-cycle-detection.sh diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 4a63398..3e30eae 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -1495,6 +1495,82 @@ static int pack_offset_sort(const void *_a, const void *_b) (a->in_pack_offset > b->in_pack_offset); } +/* + * Drop an on-disk delta we were planning to reuse. Naively, this would + * just involve blanking out the "delta" field, but we have to deal + * with two extra pieces of book-keeping: + * + * 1. Removing ourselves from the delta_sibling linked list. + * + * 2. Updating our size; check_object() will have filled in the size of our + * delta, but a non-delta object needs it true size. + */ +static void drop_reused_delta(struct object_entry *entry) +{ + struct object_entry **p = >delta->delta_child; + struct pack_window *w_curs = NULL; + + while (*p) { + if (*p == entry) + *p = (*p)->delta_sibling; + else + p = &(*p)->delta_sibling; + } + entry->delta = NULL; + + entry->size = get_size_from_delta(entry->in_pack, _curs, + entry->in_pack_offset + entry->in_pack_header_size); + unuse_pack(_curs); + + /* +* If we failed to get the size from this pack for whatever reason, +* fall back to sha1_object_info, which may find another copy. And if +* that fails, the error will be recorded in entry->type and dealt +* with in prepare_pack(). +*/ + if (entry->size == 0) + entry->type = sha1_object_info(entry->idx.sha1, >size); +} + +/* + * Follow the chain of