Re: [RFC PATCH] fetch-pack: space out sent "haves" in negotiation

2018-05-29 Thread Jonathan Tan
On Wed, 23 May 2018 12:42:10 +0900
Junio C Hamano  wrote:

> Somehow this feels more like a WIP than RFC, primarily for two
> reasons.  It was unclear what "edge" computation is trying to do; it
> seems way under-explained, especially the part that takes min-max
> while. merging two candidates.

Agreed that WIP would be a good designation. I'll make sure that the
merging is better explained in the next version.

> It also was unclear if this should be organized as a "take it or
> leave it" patch like this one, or eventually should be split into
> multiple steps when it gets polished enough to be considered for
> application, the early ones introducing a separate negotiator module
> without changing the common ancestor discovery algorithm at all,
> with later steps refining that negotiator and add more efficient
> common ancestor discovery process.

As for the question of one or (at least) two patches, right now I'm
working on a way to simplify what I have (probably to just implementing
the "skip exponentially" part you describe in [1]). I think that it will
be simple enough to put in one patch, but we can decide
once I've completed that patch.


Re: [RFC PATCH] fetch-pack: space out sent "haves" in negotiation

2018-05-22 Thread Junio C Hamano
Jonathan Tan  writes:

>  Makefile   |   1 +
>  fetch-negotiator.c | 309 +
>  fetch-negotiator.h |  40 ++
>  fetch-pack.c   | 174 ++---
>  object.h   |   1 +
>  5 files changed, 392 insertions(+), 133 deletions(-)
>  create mode 100644 fetch-negotiator.c
>  create mode 100644 fetch-negotiator.h

Somehow this feels more like a WIP than RFC, primarily for two
reasons.  It was unclear what "edge" computation is trying to do; it
seems way under-explained, especially the part that takes min-max
while. merging two candidates.

It also was unclear if this should be organized as a "take it or
leave it" patch like this one, or eventually should be split into
multiple steps when it gets polished enough to be considered for
application, the early ones introducing a separate negotiator module
without changing the common ancestor discovery algorithm at all,
with later steps refining that negotiator and add more efficient
common ancestor discovery process.

> diff --git a/Makefile b/Makefile
> index ad880d1fc5..8bbedfa521 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -859,6 +859,7 @@ LIB_OBJS += ewah/ewah_bitmap.o
>  LIB_OBJS += ewah/ewah_io.o
>  LIB_OBJS += ewah/ewah_rlw.o
>  LIB_OBJS += exec-cmd.o
> +LIB_OBJS += fetch-negotiator.o
>  LIB_OBJS += fetch-object.o
>  LIB_OBJS += fetch-pack.o
>  LIB_OBJS += fsck.o
> diff --git a/fetch-negotiator.c b/fetch-negotiator.c
> new file mode 100644
> index 00..58975e1c37
> --- /dev/null
> +++ b/fetch-negotiator.c
> @@ -0,0 +1,309 @@
> +#include "cache.h"
> +#include "commit.h"
> +#include "fetch-negotiator.h"
> +
> +#define NO_THE_INDEX_COMPATIBILITY_MACROS

A totally unrelated tangent, but will we also benefit from
NO_THE_REPO_COMPATIBILITY_MACROS eventually?



Re: [RFC PATCH] fetch-pack: space out sent "haves" in negotiation

2018-05-22 Thread Junio C Hamano
Jonathan Tan  writes:

> I was thinking about fetch negotiation in some non-ideal situations
> (specifically, when the client repo contains two or more independent
> branches that meet only somewhere far in the past) and thought about
> skipping over intermediate commits, using exponentially larger skips as
> we proceed, when generating "have" lines.

I recall a "wouldn't it be wonderful if we had..." discussion on
this exact topic much earlier in the project's life to skip
exponentialy (I think fibonacci was what was suggested in the
thread) and then backfill after overshooting, but I do not recall
anybody following through and coming up with a concrete design
and/or implementation.  Maybe protocol v2 in place we can have
something this time around.

Looking forward to read the remainder of the message and the patch
;-)

Thanks.


Re: [RFC PATCH] fetch-pack: space out sent "haves" in negotiation

2018-05-22 Thread Stefan Beller
Hi Jonathan,
> I wouldn't characterize the errors as "off by one errors".

Yes, I put it in quotes but realized that would not convey it very well.

>  They are
> more like...let me use a diagram:
>
> A
> |\
> B D
> | |
> C E
>
> Suppose we know that the server does not have A, has C, and may or may
> not have E (we sent "have E" but didn't get a response yet). My method
> restarts the walk at all the parents of A (that is, B and D), but D is
> irrelevant to the situation (and should not be walked over - this is the
> error).

D is irrelevant to the A, C situation, but it is still a useful probing point?
So I would not call it an error but an inefficiency?


> In this patch, I wrote the new algorithm and deleted the old one.
...
> You're proposing that if I proceed with this, I split the patch into 2 -
> one to move the negotiation algorithm, and one to update it? If yes,
> normally I would agree, but the current negotiation algorithm is not
> very sophisticated (and does not take up much code), so I think it's not
> worth it.

ok, in that case I'll just dive into the code.

>
>> > +struct fetch_negotiator {
>> > +   struct sent_commit **sent_commits;
>> > +   size_t sent_commit_nr, sent_commit_alloc;
>> > +   struct prio_queue candidates;
>> > +};
>>
>> Maybe we can just declare the struct fetch_negotiator here and not
>> define it, such that nobody outside the actual implementation tries
>> to access its internals?
>
> That's possible - I wanted to allow allocation of this on the stack (to
> save a malloc), but perhaps that's over-optimization.

Ah good call. Please keep it that way then.


>> So even if we do not use the skip commit logic, this would be a benefit for 
>> any
>> http(-v0) and v2 users of the protocol?
>
> It would conserve bandwidth, yes, but storing all the commits sent with
> additional metadata for each would require more memory.

I did not see the memory requirements here as a problem until now.
Are you saying this memory might be too much to keep around?

Stefan


Re: [RFC PATCH] fetch-pack: space out sent "haves" in negotiation

2018-05-22 Thread Jonathan Tan
On Mon, 21 May 2018 15:57:18 -0700
Stefan Beller  wrote:

> In an ideal world, the server and client would both estimate the potential
> reduction of the packfile to send, and base the decision if to continue
> negotiating on the trade off if the packfile reduction savings are greater
> than the cost of negotiation (in terms of bandwidth or time).
> (e.g. the server could keep track of the "potential largest packfile to
> sent" as well as the "potential smallest packfile to sent" given the
> state of negotiation. And as soon as the difference between those
> two packs is smaller than the size of one round of negotiation,
> it is better to stop and just sent the large file).
> 
> You state that you do not want to change the server side, and stick to
> the current protocol, which makes this ideal world scenario moot, but
> shifts the problem to "picking haves more intelligently".

Thanks for thinking about this!

This requires a modification on the server side, as you said, but this
sounds like a good idea that can be combined with the approach in my
patch - once we've found a match, instead of the client restarting the
fine walk, the server could then send the hashes of the commits between
its tip and the match (or some subset thereof) and the client can send
the specific hashes it has.

> > I'm not sure if this is the best way,
> 
> I think it is the best for a short term gain, as the picking algorithm is
> not part of the protocol, so it can be easily extended/reverted/improved
> as we go. So I would continue this way.

That's true in that this can be subsequently modified without backwards
incompatibility - the only issue is the opportunity cost of the author
and reviewer's time and effort.

> > (1) The implementation that I have
> >
> > This patch contains some drop-in code that passes all existing tests,
> > but the new negotiation algorithm is not tested.
> >
> > To mitigate the effect of skipping, I included functionality wherein
> > the client will retry the commits in a skip if the server ACKs the
> > destination of the skip, but this is currently imperfect - in
> > particular, the server might end the negotiation early, and the commits
> > retried in my current implementation are a superset due to the fact that
> > I didn't store the commits in the skip.
> 
> So we start with exponential hops, fall back to linear probing and then
> "make off by one errors" in the linear probes?

I wouldn't characterize the errors as "off by one errors". They are
more like...let me use a diagram:

A
|\
B D
| |
C E

Suppose we know that the server does not have A, has C, and may or may
not have E (we sent "have E" but didn't get a response yet). My method
restarts the walk at all the parents of A (that is, B and D), but D is
irrelevant to the situation (and should not be walked over - this is the
error).

> > (3) Other ways of improving negotiation
> >
> > If we're prepared to commit-walk a significant part of the entire local
> > repo (as we are, in the situation I described in the first paragraph),
> 
> > and if we have access to corresponding remote-tracking information,
> 
> This is a dangerous assumption, as not everyone is having a 1:1 relationship
> with their remote server (for e.g. code review), but there are these triangle
> workflows in the kernel community for example, where you push in one
> remote direction and (re-)obtain the history merged into the bigger picture
> from another remote. And these two remotes are not special to each other
> on the client side.

Precisely for this reason (where the local repo could have obtained a
remote's commits through means other than through the remote - in this
case, written by the local repo's user themself) I wanted to include
both ancestors and descendants of the remote tracking tip.

> This patch is moving the algorithm driving the selection of new
> commits to pick to
> a new file, but there is no new algorithm, yet?
> As hinted at from (1), this is smarter than what we did before by
> picking commits
> non-linearly but with some sort of exponential back off, how does it end the
> exponential phase?

In this patch, I wrote the new algorithm and deleted the old one. I
think the answer to your question is that the exponential phase never
ends. If what you mean is what happens when we reach a parentless commit
- we will emit it (that is, send "have X") regardless of how many
commits have been skipped.

> The way forward out of RFC state, might be to separate the introduction of a 
> new
> improved algorithm and the refactoring. So first move code literally into the
> fetch-negotiator file, and then add improvements in there, or is it
> just not worth
> the refactoring and directly put in the new algorithm?

You're proposing that if I proceed with this, I split the patch into 2 -
one to move the negotiation algorithm, and one to update it? If yes,
normally I would agree, but the current negotiation algorithm is not
very sophisticated (and 

Re: [RFC PATCH] fetch-pack: space out sent "haves" in negotiation

2018-05-21 Thread Stefan Beller
Hi Jonathan,

On Mon, May 21, 2018 at 1:43 PM, Jonathan Tan  wrote:
> I was thinking about fetch negotiation in some non-ideal situations
> (specifically, when the client repo contains two or more independent
> branches that meet only somewhere far in the past) and thought about
> skipping over intermediate commits, using exponentially larger skips as
> we proceed, when generating "have" lines. This is in the hope of
> reducing the bandwidth and roundtrips needed when fetching, and does not
> require a modification to the server.

In an ideal world, the server and client would both estimate the potential
reduction of the packfile to send, and base the decision if to continue
negotiating on the trade off if the packfile reduction savings are greater
than the cost of negotiation (in terms of bandwidth or time).
(e.g. the server could keep track of the "potential largest packfile to
sent" as well as the "potential smallest packfile to sent" given the
state of negotiation. And as soon as the difference between those
two packs is smaller than the size of one round of negotiation,
it is better to stop and just sent the large file).

You state that you do not want to change the server side, and stick to
the current protocol, which makes this ideal world scenario moot, but
shifts the problem to "picking haves more intelligently".

> I'm not sure if this is the best way,

I think it is the best for a short term gain, as the picking algorithm is
not part of the protocol, so it can be easily extended/reverted/improved
as we go. So I would continue this way.

> (1) The implementation that I have
>
> This patch contains some drop-in code that passes all existing tests,
> but the new negotiation algorithm is not tested.
>
> To mitigate the effect of skipping, I included functionality wherein
> the client will retry the commits in a skip if the server ACKs the
> destination of the skip, but this is currently imperfect - in
> particular, the server might end the negotiation early, and the commits
> retried in my current implementation are a superset due to the fact that
> I didn't store the commits in the skip.

So we start with exponential hops, fall back to linear probing and then
"make off by one errors" in the linear probes?

> (2) Possible future work for my implementation
>
> Since each sent commit maintains pointers to sent descendants and sent
> ancestors (strictly speaking, only the "close" ones - to find all of
> them, you need the transitive closure), this can be used for some sort
> of error correction when, during a stateless RPC negotiation, the server
> (which may be a group of eventually consistent servers behind a load
> balancer) reports that it no longer has a commit that it said it had.
> For example, we could in this case mark that commit as "they_have=NO"
> and for all its closest ancestors, set it to "they_have=YES" unless they
> in turn have a descendant with "they_have=YES" or
> "they_have=HAVE_DESCENDANT".
>
> (3) Other ways of improving negotiation
>
> If we're prepared to commit-walk a significant part of the entire local
> repo (as we are, in the situation I described in the first paragraph),

> and if we have access to corresponding remote-tracking information,

This is a dangerous assumption, as not everyone is having a 1:1 relationship
with their remote server (for e.g. code review), but there are these triangle
workflows in the kernel community for example, where you push in one
remote direction and (re-)obtain the history merged into the bigger picture
from another remote. And these two remotes are not special to each other
on the client side.

>  fetch-negotiator.c | 309 +
>  fetch-negotiator.h |  40 ++

This patch is moving the algorithm driving the selection of new
commits to pick to
a new file, but there is no new algorithm, yet?
As hinted at from (1), this is smarter than what we did before by
picking commits
non-linearly but with some sort of exponential back off, how does it end the
exponential phase?

The way forward out of RFC state, might be to separate the introduction of a new
improved algorithm and the refactoring. So first move code literally into the
fetch-negotiator file, and then add improvements in there, or is it
just not worth
the refactoring and directly put in the new algorithm?

Another use case we discussed was "open-ended bisection", where you know
you are in a bad state and "once upon a time it worked", and now you are tasked
to find the offending commit. To find such a commit, you probably
would also start
with such an exponential back off until you run into a "good frontier"
of commits
and then use conventional bisect to narrow down the exact commit.


> +enum they_have {
> +   /*
> +* We do not know if the server has this commit, or we know that the
> +* server does not have this commit.
> +*/
> +   NO,
> +
> +   /*
> +* The server has this