Re: [PATCH v3] fetch-pack: always allow fetching of literal SHA1s
On Thu, May 11, 2017 at 10:00:50AM -0700, Brandon Williams wrote: > > None of this is your problem now either way; the advertisement-limiting > > extension is still vaporware, albeit one we've discussed a lot. I just > > wanted to make sure we weren't painting ourselves into any corners. And > > I think this case could probably be handled. > > I can't remember, is there any reason why it is still vaporware? As in > what is holding us back from doing the advertisement-limiting (or doing > away with the initial ref advertisement)? The tricky part is handling the compatibility issues. The client has no way to speak first in the current protocol, so we have to break protocol to tell the server our refspecs before the advertisement. I actually put together patches last fall for this, got derailed by a small snag while polishing them, and just haven't picked them up again. If you're interested, they're at the jk/early-caps-wip branch of https://github.com/peff/git. I got as far as breaking it all up into patches, but the commit messages still need to be written. Some of them are pretty obvious to me still after 6 months, but a few are a bit inscrutable. -Peff
Re: [PATCH v3] fetch-pack: always allow fetching of literal SHA1s
On Thu, May 11, 2017 at 10:51:37AM -0700, Jonathan Tan wrote: > > This is inside the nr_sought loop. So if I were to do: > > > > git fetch origin $(git ls-remote origin | awk '{print $1}') > > > > we're back to being quadratic. I realize that's probably a silly thing > > to do, but in the general case, you're O(m*n), where "n" is number of > > unmatched remote refs and "m" is the number of SHA-1s you're looking > > for. > > The original patch was quadratic regardless of whether we're fetching names > or SHA-1s, which can be said to be bad since it is a regression in an > existing and common use case (and I agree). This one is O(m*n), as you said > - the "quadratic-ness" only kicks in if you fetch SHA-1s, which before this > patch is impossible. Yeah, no question this is an improvement over the original. I think this still "regresses" a certain request performance-wise, but it's a request that could never have succeeded in the original. Mostly I'd just rather not leave a hidden quadratic loop that will blow up in somebody's face a year or two down the road. > Having a sha1_array or oidset would require allocation (O(n log n) time, I > think, in addition to O(n) space) and this cost would be incurred regardless > of how many SHA-1s were actually fetched (if m is an order of magnitude less > than log n, for example, having a sha1_array might be actually worse). Also, > presumably we don't want to incur this cost if we are fetching zero SHA-1s, > so we would need to do some sort of pre-check. I'm inclined to leave it the > way it is and consider this only if the use case of fetching many SHA1-s > comes up. An oidset should be O(n) in both time and space. Which we're already spending in the earlier loop (and in general, when we allocate O(n) ref structs in the first place). I don't think we need to care too much about micro-optimizing bumps to the constant-factor here; we just need to make sure we don't end up in an algorithmic explosion. And if we initialize the oidset lazily, then we incur that cost only when we would be doing the linear walk in your patch anyway. See the patch below. > > AIUI, you could also avoid creating the unmatched list entirely when the > > server advertises tip/reachable sha1s. That's a small optimization, but > > I think it may actually make the logic clearer. > > If you mean adding an "if" block at the point where we add the unmatched ref > to the unmatched list (that either adds it to the list or immediately frees > it), I think it makes the logic slightly more complicated. Or maybe you had > something else in mind? I was just thinking that it does not leave a case where we create a data structure (the unmatched list) even though we know right then that it will not ever be used. So it's an extra line of logic there, but the overall function may be clearer. I dunno, it's probably not a big deal either way. -Peff --- diff --git a/fetch-pack.c b/fetch-pack.c index 5cace7458..a660331e6 100644 --- a/fetch-pack.c +++ b/fetch-pack.c @@ -15,6 +15,7 @@ #include "version.h" #include "prio-queue.h" #include "sha1-array.h" +#include "oidset.h" static int transfer_unpack_limit = -1; static int fetch_unpack_limit = -1; @@ -592,6 +593,12 @@ static void mark_recent_complete_commits(struct fetch_pack_args *args, } } +static void add_refs_to_oidset(struct oidset *oids, const struct ref *refs) +{ + for (; refs; refs = refs->next) + oidset_insert(oids, >old_oid); +} + static void filter_refs(struct fetch_pack_args *args, struct ref **refs, struct ref **sought, int nr_sought) @@ -600,6 +607,8 @@ static void filter_refs(struct fetch_pack_args *args, struct ref **newtail = struct ref *unmatched = NULL; struct ref *ref, *next; + struct oidset tip_oids = OIDSET_INIT; + int tip_oids_initialized = 0; int i; i = 0; @@ -654,22 +663,15 @@ static void filter_refs(struct fetch_pack_args *args, (ALLOW_TIP_SHA1 | ALLOW_REACHABLE_SHA1))) { can_append = 1; } else { - struct ref *u; - /* Check all refs, including those already matched */ - for (u = unmatched; u; u = u->next) { - if (!oidcmp(>old_oid, >old_oid)) { - can_append = 1; - goto can_append; - } - } - for (u = newlist; u; u = u->next) { - if (!oidcmp(>old_oid, >old_oid)) { - can_append = 1; - break; - } + if (!tip_oids_initialized) { + /* Check all refs, including those already matched */ +
Re: [PATCH v3] fetch-pack: always allow fetching of literal SHA1s
Thanks for your suggestions. I'll hold off on sending out a new patch (following Jonathan Nieder's suggestions [1]) until we decide if further optimizations (for example, as suggested by Peff) need to be done. [1] <20170510232231.gc28...@aiede.svl.corp.google.com> On 05/11/2017 02:46 AM, Jeff King wrote: On Wed, May 10, 2017 at 03:11:57PM -0700, Jonathan Tan wrote: After looking at ways to solve jrnieder's performance concerns, if we're going to need to manage one more item of state within the function, I might as well use my earlier idea of storing unmatched refs in its own list instead of immediately freeing them. This version of the patch should have much better performance characteristics. Hrm. So the problem in your original was that the loop became quadratic in the number of refs when fetching all of them (because the original relies on the sorting to essentially do a list-merge). Are there any quadratic bits left? @@ -649,6 +652,25 @@ static void filter_refs(struct fetch_pack_args *args, if ((allow_unadvertised_object_request & (ALLOW_TIP_SHA1 | ALLOW_REACHABLE_SHA1))) { + can_append = 1; + } else { + struct ref *u; + /* Check all refs, including those already matched */ + for (u = unmatched; u; u = u->next) { + if (!oidcmp(>old_oid, >old_oid)) { + can_append = 1; + goto can_append; + } + } This is inside the nr_sought loop. So if I were to do: git fetch origin $(git ls-remote origin | awk '{print $1}') we're back to being quadratic. I realize that's probably a silly thing to do, but in the general case, you're O(m*n), where "n" is number of unmatched remote refs and "m" is the number of SHA-1s you're looking for. The original patch was quadratic regardless of whether we're fetching names or SHA-1s, which can be said to be bad since it is a regression in an existing and common use case (and I agree). This one is O(m*n), as you said - the "quadratic-ness" only kicks in if you fetch SHA-1s, which before this patch is impossible. Doing better would require either sorting both lists, or storing the oids in something that has better than linear-time lookup. Perhaps a sha1_array or an oidset? We don't actually need to know anything about the unmatched refs after the first loop. We just need the list of oids that let us do can_append. Having a sha1_array or oidset would require allocation (O(n log n) time, I think, in addition to O(n) space) and this cost would be incurred regardless of how many SHA-1s were actually fetched (if m is an order of magnitude less than log n, for example, having a sha1_array might be actually worse). Also, presumably we don't want to incur this cost if we are fetching zero SHA-1s, so we would need to do some sort of pre-check. I'm inclined to leave it the way it is and consider this only if the use case of fetching many SHA1-s comes up. AIUI, you could also avoid creating the unmatched list entirely when the server advertises tip/reachable sha1s. That's a small optimization, but I think it may actually make the logic clearer. If you mean adding an "if" block at the point where we add the unmatched ref to the unmatched list (that either adds it to the list or immediately frees it), I think it makes the logic slightly more complicated. Or maybe you had something else in mind?
Re: [PATCH v3] fetch-pack: always allow fetching of literal SHA1s
On 05/11, Jeff King wrote: > On Wed, May 10, 2017 at 03:11:57PM -0700, Jonathan Tan wrote: > > > fetch-pack, when fetching a literal SHA-1 from a server that is not > > configured with uploadpack.allowtipsha1inwant (or similar), always > > returns an error message of the form "Server does not allow request for > > unadvertised object %s". However, it is sometimes the case that such > > object is advertised. This situation would occur, for example, if a user > > or a script was provided a SHA-1 instead of a branch or tag name for > > fetching, and wanted to invoke "git fetch" or "git fetch-pack" using > > that SHA-1. > > > > Teach fetch-pack to also check the SHA-1s of the refs in the received > > ref advertisement if a literal SHA-1 was given by the user. > > Stepping back a bit, what does this mean for a world where we implement > protocol extensions to let the client specify a set of refspecs to limit > the advertisement? > > If we give the server our usual set of fetch refspecs, then we might > fail to fulfill a request that would have been advertised outside of > that set. And the behavior is confusing and non-transparent to the user. > I don't think that really makes sense, though; the advertisement we ask > for from the server should include only the bits we're interested in for > _this_ fetch. > > If we tell the server "we are interested in abcd1234", then it's not > going to find any matching ref by name, obviously. So should servers > then treat 40-hex names in the incoming refspecs as a request to show > any names which have a matching sha1? That works against any server-side > optimizations to avoid looking at the complete set of refs, but it would > only have to kick in when the user actually specified a single SHA-1 > (and even then only when allowAnySHA1 isn't on). So that's probably > workable. > > None of this is your problem now either way; the advertisement-limiting > extension is still vaporware, albeit one we've discussed a lot. I just > wanted to make sure we weren't painting ourselves into any corners. And > I think this case could probably be handled. I can't remember, is there any reason why it is still vaporware? As in what is holding us back from doing the advertisement-limiting (or doing away with the initial ref advertisement)? -- Brandon Williams
Re: [PATCH v3] fetch-pack: always allow fetching of literal SHA1s
On Wed, May 10, 2017 at 03:11:57PM -0700, Jonathan Tan wrote: > fetch-pack, when fetching a literal SHA-1 from a server that is not > configured with uploadpack.allowtipsha1inwant (or similar), always > returns an error message of the form "Server does not allow request for > unadvertised object %s". However, it is sometimes the case that such > object is advertised. This situation would occur, for example, if a user > or a script was provided a SHA-1 instead of a branch or tag name for > fetching, and wanted to invoke "git fetch" or "git fetch-pack" using > that SHA-1. > > Teach fetch-pack to also check the SHA-1s of the refs in the received > ref advertisement if a literal SHA-1 was given by the user. Stepping back a bit, what does this mean for a world where we implement protocol extensions to let the client specify a set of refspecs to limit the advertisement? If we give the server our usual set of fetch refspecs, then we might fail to fulfill a request that would have been advertised outside of that set. And the behavior is confusing and non-transparent to the user. I don't think that really makes sense, though; the advertisement we ask for from the server should include only the bits we're interested in for _this_ fetch. If we tell the server "we are interested in abcd1234", then it's not going to find any matching ref by name, obviously. So should servers then treat 40-hex names in the incoming refspecs as a request to show any names which have a matching sha1? That works against any server-side optimizations to avoid looking at the complete set of refs, but it would only have to kick in when the user actually specified a single SHA-1 (and even then only when allowAnySHA1 isn't on). So that's probably workable. None of this is your problem now either way; the advertisement-limiting extension is still vaporware, albeit one we've discussed a lot. I just wanted to make sure we weren't painting ourselves into any corners. And I think this case could probably be handled. -Peff
Re: [PATCH v3] fetch-pack: always allow fetching of literal SHA1s
On Wed, May 10, 2017 at 03:11:57PM -0700, Jonathan Tan wrote: > After looking at ways to solve jrnieder's performance concerns, if we're > going to need to manage one more item of state within the function, I > might as well use my earlier idea of storing unmatched refs in its own > list instead of immediately freeing them. This version of the patch > should have much better performance characteristics. Hrm. So the problem in your original was that the loop became quadratic in the number of refs when fetching all of them (because the original relies on the sorting to essentially do a list-merge). Are there any quadratic bits left? > @@ -649,6 +652,25 @@ static void filter_refs(struct fetch_pack_args *args, > > if ((allow_unadvertised_object_request & > (ALLOW_TIP_SHA1 | ALLOW_REACHABLE_SHA1))) { > + can_append = 1; > + } else { > + struct ref *u; > + /* Check all refs, including those already matched */ > + for (u = unmatched; u; u = u->next) { > + if (!oidcmp(>old_oid, >old_oid)) { > + can_append = 1; > + goto can_append; > + } > + } This is inside the nr_sought loop. So if I were to do: git fetch origin $(git ls-remote origin | awk '{print $1}') we're back to being quadratic. I realize that's probably a silly thing to do, but in the general case, you're O(m*n), where "n" is number of unmatched remote refs and "m" is the number of SHA-1s you're looking for. Doing better would require either sorting both lists, or storing the oids in something that has better than linear-time lookup. Perhaps a sha1_array or an oidset? We don't actually need to know anything about the unmatched refs after the first loop. We just need the list of oids that let us do can_append. AIUI, you could also avoid creating the unmatched list entirely when the server advertises tip/reachable sha1s. That's a small optimization, but I think it may actually make the logic clearer. -Peff
Re: [PATCH v3] fetch-pack: always allow fetching of literal SHA1s
Hi, Jonathan Tan wrote: > fetch-pack, when fetching a literal SHA-1 from a server that is not > configured with uploadpack.allowtipsha1inwant (or similar), always > returns an error message of the form "Server does not allow request for > unadvertised object %s". However, it is sometimes the case that such > object is advertised. This situation would occur, for example, if a user > or a script was provided a SHA-1 instead of a branch or tag name for > fetching, and wanted to invoke "git fetch" or "git fetch-pack" using > that SHA-1. Yay, thanks again. [...] > --- a/fetch-pack.c > +++ b/fetch-pack.c > @@ -598,6 +598,7 @@ static void filter_refs(struct fetch_pack_args *args, > { > struct ref *newlist = NULL; > struct ref **newtail = > + struct ref *unmatched = NULL; > struct ref *ref, *next; > int i; > > @@ -631,13 +632,15 @@ static void filter_refs(struct fetch_pack_args *args, > ref->next = NULL; > newtail = >next; > } else { > - free(ref); > + ref->next = unmatched; > + unmatched = ref; Interesting. Makes sense. [...] > /* Append unmatched requests to the list */ > for (i = 0; i < nr_sought; i++) { > unsigned char sha1[20]; > + int can_append = 0; Can this be simplified by factoring out a function? I.e. something like if ((... ALLOW_TIP_SHA1 | ALLOW_REACHABLE_SHA1) || find_oid_among_refs(unmatched, oid) || find_oid_among_refs(newlist, oid)) { ref->match_status = REF_MATCHED; ... } else { ref->match_status = REF_UNADVERTISED_NOT_ALLOWED; } [...] > @@ -657,6 +679,11 @@ static void filter_refs(struct fetch_pack_args *args, > } > } > *refs = newlist; > + > + for (ref = unmatched; ref; ref = next) { > + next = ref->next; > + free(ref); > + } > } optional nit: keeping the "*refs =" line as the last line of the function would make it easier to see the contract at a glance. (That said, a doc comment at the top would make it even clearer.) > --- a/t/t5500-fetch-pack.sh > +++ b/t/t5500-fetch-pack.sh > @@ -547,6 +547,41 @@ test_expect_success 'fetch-pack can fetch a raw sha1' ' Yay, thanks much for these. [...] > +test_expect_success 'fetch-pack can fetch a raw sha1 overlapping a named > ref' ' Ha, you read my mind. :) Except for the search-ref-list-for-oid function needing to be factored out, Reviewed-by: Jonathan NiederThanks.