Re: [PATCH v3] fetch-pack: always allow fetching of literal SHA1s

2017-05-13 Thread Jeff King
On Thu, May 11, 2017 at 10:00:50AM -0700, Brandon Williams wrote:

> > None of this is your problem now either way; the advertisement-limiting
> > extension is still vaporware, albeit one we've discussed a lot. I just
> > wanted to make sure we weren't painting ourselves into any corners. And
> > I think this case could probably be handled.
> 
> I can't remember, is there any reason why it is still vaporware? As in
> what is holding us back from doing the advertisement-limiting (or doing
> away with the initial ref  advertisement)?

The tricky part is handling the compatibility issues. The client has no
way to speak first in the current protocol, so we have to break protocol
to tell the server our refspecs before the advertisement.

I actually put together patches last fall for this, got derailed by a
small snag while polishing them, and just haven't picked them up again.
If you're interested, they're at the jk/early-caps-wip branch of
https://github.com/peff/git.

I got as far as breaking it all up into patches, but the commit messages
still need to be written. Some of them are pretty obvious to me still
after 6 months, but a few are a bit inscrutable.

-Peff


Re: [PATCH v3] fetch-pack: always allow fetching of literal SHA1s

2017-05-11 Thread Jeff King
On Thu, May 11, 2017 at 10:51:37AM -0700, Jonathan Tan wrote:

> > This is inside the nr_sought loop. So if I were to do:
> > 
> >   git fetch origin $(git ls-remote origin | awk '{print $1}')
> > 
> > we're back to being quadratic. I realize that's probably a silly thing
> > to do, but in the general case, you're O(m*n), where "n" is number of
> > unmatched remote refs and "m" is the number of SHA-1s you're looking
> > for.
> 
> The original patch was quadratic regardless of whether we're fetching names
> or SHA-1s, which can be said to be bad since it is a regression in an
> existing and common use case (and I agree). This one is O(m*n), as you said
> - the "quadratic-ness" only kicks in if you fetch SHA-1s, which before this
> patch is impossible.

Yeah, no question this is an improvement over the original. I think this
still "regresses" a certain request performance-wise, but it's a request
that could never have succeeded in the original. Mostly I'd just rather
not leave a hidden quadratic loop that will blow up in somebody's face
a year or two down the road.

> Having a sha1_array or oidset would require allocation (O(n log n) time, I
> think, in addition to O(n) space) and this cost would be incurred regardless
> of how many SHA-1s were actually fetched (if m is an order of magnitude less
> than log n, for example, having a sha1_array might be actually worse). Also,
> presumably we don't want to incur this cost if we are fetching zero SHA-1s,
> so we would need to do some sort of pre-check. I'm inclined to leave it the
> way it is and consider this only if the use case of fetching many SHA1-s
> comes up.

An oidset should be O(n) in both time and space. Which we're already
spending in the earlier loop (and in general, when we allocate O(n) ref
structs in the first place). I don't think we need to care too much
about micro-optimizing bumps to the constant-factor here; we just need
to make sure we don't end up in an algorithmic explosion.

And if we initialize the oidset lazily, then we incur that cost only
when we would be doing the linear walk in your patch anyway. See the
patch below.

> > AIUI, you could also avoid creating the unmatched list entirely when the
> > server advertises tip/reachable sha1s. That's a small optimization, but
> > I think it may actually make the logic clearer.
> 
> If you mean adding an "if" block at the point where we add the unmatched ref
> to the unmatched list (that either adds it to the list or immediately frees
> it), I think it makes the logic slightly more complicated. Or maybe you had
> something else in mind?

I was just thinking that it does not leave a case where we create a
data structure (the unmatched list) even though we know right then that
it will not ever be used. So it's an extra line of logic there, but the
overall function may be clearer. I dunno, it's probably not a big deal
either way.

-Peff

---
diff --git a/fetch-pack.c b/fetch-pack.c
index 5cace7458..a660331e6 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -15,6 +15,7 @@
 #include "version.h"
 #include "prio-queue.h"
 #include "sha1-array.h"
+#include "oidset.h"
 
 static int transfer_unpack_limit = -1;
 static int fetch_unpack_limit = -1;
@@ -592,6 +593,12 @@ static void mark_recent_complete_commits(struct 
fetch_pack_args *args,
}
 }
 
+static void add_refs_to_oidset(struct oidset *oids, const struct ref *refs)
+{
+   for (; refs; refs = refs->next)
+   oidset_insert(oids, >old_oid);
+}
+
 static void filter_refs(struct fetch_pack_args *args,
struct ref **refs,
struct ref **sought, int nr_sought)
@@ -600,6 +607,8 @@ static void filter_refs(struct fetch_pack_args *args,
struct ref **newtail = 
struct ref *unmatched = NULL;
struct ref *ref, *next;
+   struct oidset tip_oids = OIDSET_INIT;
+   int tip_oids_initialized = 0;
int i;
 
i = 0;
@@ -654,22 +663,15 @@ static void filter_refs(struct fetch_pack_args *args,
(ALLOW_TIP_SHA1 | ALLOW_REACHABLE_SHA1))) {
can_append = 1;
} else {
-   struct ref *u;
-   /* Check all refs, including those already matched */
-   for (u = unmatched; u; u = u->next) {
-   if (!oidcmp(>old_oid, >old_oid)) {
-   can_append = 1;
-   goto can_append;
-   }
-   }
-   for (u = newlist; u; u = u->next) {
-   if (!oidcmp(>old_oid, >old_oid)) {
-   can_append = 1;
-   break;
-   }
+   if (!tip_oids_initialized) {
+   /* Check all refs, including those already 
matched */
+  

Re: [PATCH v3] fetch-pack: always allow fetching of literal SHA1s

2017-05-11 Thread Jonathan Tan
Thanks for your suggestions. I'll hold off on sending out a new patch 
(following Jonathan Nieder's suggestions [1]) until we decide if further 
optimizations (for example, as suggested by Peff) need to be done.


[1] <20170510232231.gc28...@aiede.svl.corp.google.com>

On 05/11/2017 02:46 AM, Jeff King wrote:

On Wed, May 10, 2017 at 03:11:57PM -0700, Jonathan Tan wrote:


After looking at ways to solve jrnieder's performance concerns, if we're
going to need to manage one more item of state within the function, I
might as well use my earlier idea of storing unmatched refs in its own
list instead of immediately freeing them. This version of the patch
should have much better performance characteristics.


Hrm. So the problem in your original was that the loop became quadratic
in the number of refs when fetching all of them (because the original
relies on the sorting to essentially do a list-merge). Are there any
quadratic bits left?


@@ -649,6 +652,25 @@ static void filter_refs(struct fetch_pack_args *args,

if ((allow_unadvertised_object_request &
(ALLOW_TIP_SHA1 | ALLOW_REACHABLE_SHA1))) {
+   can_append = 1;
+   } else {
+   struct ref *u;
+   /* Check all refs, including those already matched */
+   for (u = unmatched; u; u = u->next) {
+   if (!oidcmp(>old_oid, >old_oid)) {
+   can_append = 1;
+   goto can_append;
+   }
+   }


This is inside the nr_sought loop. So if I were to do:

  git fetch origin $(git ls-remote origin | awk '{print $1}')

we're back to being quadratic. I realize that's probably a silly thing
to do, but in the general case, you're O(m*n), where "n" is number of
unmatched remote refs and "m" is the number of SHA-1s you're looking
for.


The original patch was quadratic regardless of whether we're fetching 
names or SHA-1s, which can be said to be bad since it is a regression in 
an existing and common use case (and I agree). This one is O(m*n), as 
you said - the "quadratic-ness" only kicks in if you fetch SHA-1s, which 
before this patch is impossible.



Doing better would require either sorting both lists, or storing the
oids in something that has better than linear-time lookup.  Perhaps a
sha1_array or an oidset? We don't actually need to know anything about
the unmatched refs after the first loop. We just need the list of oids
that let us do can_append.


Having a sha1_array or oidset would require allocation (O(n log n) time, 
I think, in addition to O(n) space) and this cost would be incurred 
regardless of how many SHA-1s were actually fetched (if m is an order of 
magnitude less than log n, for example, having a sha1_array might be 
actually worse). Also, presumably we don't want to incur this cost if we 
are fetching zero SHA-1s, so we would need to do some sort of pre-check. 
I'm inclined to leave it the way it is and consider this only if the use 
case of fetching many SHA1-s comes up.



AIUI, you could also avoid creating the unmatched list entirely when the
server advertises tip/reachable sha1s. That's a small optimization, but
I think it may actually make the logic clearer.


If you mean adding an "if" block at the point where we add the unmatched 
ref to the unmatched list (that either adds it to the list or 
immediately frees it), I think it makes the logic slightly more 
complicated. Or maybe you had something else in mind?


Re: [PATCH v3] fetch-pack: always allow fetching of literal SHA1s

2017-05-11 Thread Brandon Williams
On 05/11, Jeff King wrote:
> On Wed, May 10, 2017 at 03:11:57PM -0700, Jonathan Tan wrote:
> 
> > fetch-pack, when fetching a literal SHA-1 from a server that is not
> > configured with uploadpack.allowtipsha1inwant (or similar), always
> > returns an error message of the form "Server does not allow request for
> > unadvertised object %s". However, it is sometimes the case that such
> > object is advertised. This situation would occur, for example, if a user
> > or a script was provided a SHA-1 instead of a branch or tag name for
> > fetching, and wanted to invoke "git fetch" or "git fetch-pack" using
> > that SHA-1.
> > 
> > Teach fetch-pack to also check the SHA-1s of the refs in the received
> > ref advertisement if a literal SHA-1 was given by the user.
> 
> Stepping back a bit, what does this mean for a world where we implement
> protocol extensions to let the client specify a set of refspecs to limit
> the advertisement?
> 
> If we give the server our usual set of fetch refspecs, then we might
> fail to fulfill a request that would have been advertised outside of
> that set. And the behavior is confusing and non-transparent to the user.
> I don't think that really makes sense, though; the advertisement we ask
> for from the server should include only the bits we're interested in for
> _this_ fetch.
> 
> If we tell the server "we are interested in abcd1234", then it's not
> going to find any matching ref by name, obviously. So should servers
> then treat 40-hex names in the incoming refspecs as a request to show
> any names which have a matching sha1? That works against any server-side
> optimizations to avoid looking at the complete set of refs, but it would
> only have to kick in when the user actually specified a single SHA-1
> (and even then only when allowAnySHA1 isn't on). So that's probably
> workable.
> 
> None of this is your problem now either way; the advertisement-limiting
> extension is still vaporware, albeit one we've discussed a lot. I just
> wanted to make sure we weren't painting ourselves into any corners. And
> I think this case could probably be handled.

I can't remember, is there any reason why it is still vaporware? As in
what is holding us back from doing the advertisement-limiting (or doing
away with the initial ref  advertisement)?

-- 
Brandon Williams


Re: [PATCH v3] fetch-pack: always allow fetching of literal SHA1s

2017-05-11 Thread Jeff King
On Wed, May 10, 2017 at 03:11:57PM -0700, Jonathan Tan wrote:

> fetch-pack, when fetching a literal SHA-1 from a server that is not
> configured with uploadpack.allowtipsha1inwant (or similar), always
> returns an error message of the form "Server does not allow request for
> unadvertised object %s". However, it is sometimes the case that such
> object is advertised. This situation would occur, for example, if a user
> or a script was provided a SHA-1 instead of a branch or tag name for
> fetching, and wanted to invoke "git fetch" or "git fetch-pack" using
> that SHA-1.
> 
> Teach fetch-pack to also check the SHA-1s of the refs in the received
> ref advertisement if a literal SHA-1 was given by the user.

Stepping back a bit, what does this mean for a world where we implement
protocol extensions to let the client specify a set of refspecs to limit
the advertisement?

If we give the server our usual set of fetch refspecs, then we might
fail to fulfill a request that would have been advertised outside of
that set. And the behavior is confusing and non-transparent to the user.
I don't think that really makes sense, though; the advertisement we ask
for from the server should include only the bits we're interested in for
_this_ fetch.

If we tell the server "we are interested in abcd1234", then it's not
going to find any matching ref by name, obviously. So should servers
then treat 40-hex names in the incoming refspecs as a request to show
any names which have a matching sha1? That works against any server-side
optimizations to avoid looking at the complete set of refs, but it would
only have to kick in when the user actually specified a single SHA-1
(and even then only when allowAnySHA1 isn't on). So that's probably
workable.

None of this is your problem now either way; the advertisement-limiting
extension is still vaporware, albeit one we've discussed a lot. I just
wanted to make sure we weren't painting ourselves into any corners. And
I think this case could probably be handled.

-Peff


Re: [PATCH v3] fetch-pack: always allow fetching of literal SHA1s

2017-05-11 Thread Jeff King
On Wed, May 10, 2017 at 03:11:57PM -0700, Jonathan Tan wrote:

> After looking at ways to solve jrnieder's performance concerns, if we're
> going to need to manage one more item of state within the function, I
> might as well use my earlier idea of storing unmatched refs in its own
> list instead of immediately freeing them. This version of the patch
> should have much better performance characteristics.

Hrm. So the problem in your original was that the loop became quadratic
in the number of refs when fetching all of them (because the original
relies on the sorting to essentially do a list-merge). Are there any
quadratic bits left?

> @@ -649,6 +652,25 @@ static void filter_refs(struct fetch_pack_args *args,
>  
>   if ((allow_unadvertised_object_request &
>   (ALLOW_TIP_SHA1 | ALLOW_REACHABLE_SHA1))) {
> + can_append = 1;
> + } else {
> + struct ref *u;
> + /* Check all refs, including those already matched */
> + for (u = unmatched; u; u = u->next) {
> + if (!oidcmp(>old_oid, >old_oid)) {
> + can_append = 1;
> + goto can_append;
> + }
> + }

This is inside the nr_sought loop. So if I were to do:

  git fetch origin $(git ls-remote origin | awk '{print $1}')

we're back to being quadratic. I realize that's probably a silly thing
to do, but in the general case, you're O(m*n), where "n" is number of
unmatched remote refs and "m" is the number of SHA-1s you're looking
for.

Doing better would require either sorting both lists, or storing the
oids in something that has better than linear-time lookup.  Perhaps a
sha1_array or an oidset? We don't actually need to know anything about
the unmatched refs after the first loop. We just need the list of oids
that let us do can_append.

AIUI, you could also avoid creating the unmatched list entirely when the
server advertises tip/reachable sha1s. That's a small optimization, but
I think it may actually make the logic clearer.

-Peff


Re: [PATCH v3] fetch-pack: always allow fetching of literal SHA1s

2017-05-10 Thread Jonathan Nieder
Hi,

Jonathan Tan wrote:

> fetch-pack, when fetching a literal SHA-1 from a server that is not
> configured with uploadpack.allowtipsha1inwant (or similar), always
> returns an error message of the form "Server does not allow request for
> unadvertised object %s". However, it is sometimes the case that such
> object is advertised. This situation would occur, for example, if a user
> or a script was provided a SHA-1 instead of a branch or tag name for
> fetching, and wanted to invoke "git fetch" or "git fetch-pack" using
> that SHA-1.

Yay, thanks again.

[...]
> --- a/fetch-pack.c
> +++ b/fetch-pack.c
> @@ -598,6 +598,7 @@ static void filter_refs(struct fetch_pack_args *args,
>  {
>   struct ref *newlist = NULL;
>   struct ref **newtail = 
> + struct ref *unmatched = NULL;
>   struct ref *ref, *next;
>   int i;
>  
> @@ -631,13 +632,15 @@ static void filter_refs(struct fetch_pack_args *args,
>   ref->next = NULL;
>   newtail = >next;
>   } else {
> - free(ref);
> + ref->next = unmatched;
> + unmatched = ref;

Interesting.  Makes sense.

[...]
>   /* Append unmatched requests to the list */
>   for (i = 0; i < nr_sought; i++) {
>   unsigned char sha1[20];
> + int can_append = 0;

Can this be simplified by factoring out a function?  I.e. something
like

if ((... ALLOW_TIP_SHA1 | ALLOW_REACHABLE_SHA1)
|| find_oid_among_refs(unmatched, oid)
|| find_oid_among_refs(newlist, oid)) {
ref->match_status = REF_MATCHED;
...
} else {
ref->match_status = REF_UNADVERTISED_NOT_ALLOWED;
}

[...]
> @@ -657,6 +679,11 @@ static void filter_refs(struct fetch_pack_args *args,
>   }
>   }
>   *refs = newlist;
> +
> + for (ref = unmatched; ref; ref = next) {
> + next = ref->next;
> + free(ref);
> + }
>  }

optional nit: keeping the "*refs =" line as the last line of the
function would make it easier to see the contract at a glance.  (That
said, a doc comment at the top would make it even clearer.)

> --- a/t/t5500-fetch-pack.sh
> +++ b/t/t5500-fetch-pack.sh
> @@ -547,6 +547,41 @@ test_expect_success 'fetch-pack can fetch a raw sha1' '

Yay, thanks much for these.

[...]
> +test_expect_success 'fetch-pack can fetch a raw sha1 overlapping a named 
> ref' '

Ha, you read my mind. :)

Except for the search-ref-list-for-oid function needing to be factored out,
Reviewed-by: Jonathan Nieder 

Thanks.