Re: Resumable clone

2016-03-08 Thread Kevin Wern
Junio,

> Yes, that is very close to what I said in the "what remains?"
> section, but with a crucial difference in a detail.  Perhaps reading
> the message you are responding to again more carefully will clear
> the confusion.  This is what we want to allow the server to say
> (from the message you are responding to, but rephrased slightly,
> hoping that it would help unconfuse you):
>
> I prefer not to serve a full clone to you in the usual route if
> I can avoid it.  You can help me by populating your history first
> with something else (which would bring you to a state as if you
> had cloned a potentially slightly older version of me) and then
> coming back to me for an additional fetch to complete the history.
>
> That "something else" does not have to be, and is not expected to
> be, the "full" history of the current state.  As long as it can be
> used to bring the cloner to a reasonably recent state, sufficient to
> make a follow up incremental fetch inexpesive enough, it is
> appropriate.

Sorry, I was thrown off by:

> - the type of resource, if we want this to be extensible.  I
>   think we should initially limit it to "a single full history
>   .pack",

I misinterpreted what you meant by "a single full history .pack," and
used that to limit what you said earlier. A lot of my reasoning from
there misses the point, then, because it involved finding some way to
determine whether a .pack contains the full history, which obviously
is irrelevant.

>> I'm not sure how the server should determine the returned resource. A
>> packfile alone does not guarantee the full repo history, and I'm not
>> positive checking the idx file for HEAD's commit hash ensures every
>> sub-object is in that file (though I feel it should, because it is
>> delta-compressed).
>
> The above reasoning does not make much technical sense.  Delta
> compression does not ensure connectivity in the commit history and
> commit->tree->blob containment.  Again I am not sure where you are
> going with this.

Related to the above, I was trying to find a way to determine whether
the packfile contained the full history, which actually doesn't
matter. My technical reasoning was colored by the way deltas represent
changes in other VCSs, but I realize now that git's delta compression
is based on the similarity of objects, not their historical relation
to each other.
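
(For anyone following along, those similarity-based delta chains can
be inspected directly with existing plumbing; a quick sketch, pack
path illustrative:)

    # List the objects in a pack; delta'd entries also show their
    # chain depth and the SHA-1 of the base object they build on.
    git verify-pack -v .git/objects/pack/pack-*.idx | head -n 20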

>> Which leaves me with questions on how to test the above condition. Is
>> there an expected place, such as config, where the user will specify
>> the type of alternate resource, and should we assume some default if
>> it isn't specified? Can the user optionally specify the exact file to
>> use (I can't see why because it only invites more errors)? Should the
>> specification of this option change git's behavior on update, such as
>> making sure the full history is compressed? Does the existence of the
>> HEAD object in the packfile ensure the repo's entire history is
>> contained in that file?
>
> Those (except for your assumption that no follow-up fetch is
> allowed, which requires you to limit yourself to "full" history,
> which is an unnecessary requirement) are good points one should be
> making design decisions on when building this part of the system.

Awesome, all of this is enough info to get me started. Thanks!

Kevin


Re: Resumable clone

2016-03-08 Thread Junio C Hamano
Duy Nguyen  writes:

> Yeah the determination is tricky, it depends on server setup. Let's
> start with selecting the pack for download first, because there could
> be many of them. A heuristic (*) of choosing the biggest one in
> $GIT_DIR/objects/pack is probably ok for now (we don't need full
> history, "the biggest part of history" is good enough).

You need to choose a pack that is fully connected, though.  I was
envisioning that an updated "git repack" would have an extra option
that helps server operators manage it easily.  You would need to
consider for example:

 - You may not want to rebuild such a base pack too frequently
   (e.g. you may want to repack a busy repository twice a day, but
   you would want to redo the base pack only once a week).  It is
   possible to mark it with .keep for subsequent repacks to leave it
   alone, but there needs to be a mechanism to manage that .keep
   marker.

 - You may not want to have all refs in such a base pack (e.g. you
   may want to exclude refs/changes/ from it).  There needs to be a
   configuration to specify which refs are included in the base
   pack.

while designing such an update.  Then the repack with such an option
would roughly be:

- If it is time to redo a base repack, then

  - unplug the .keep bit the previous base pack has;

  - create a pack that contains full history reachable from the
    specified refs;

  - mark the new base pack as such;

- Pack all objects that are not in the base pack that are
  reachable from any refs (and other anchoring points, such as
  reflogs and the index) into a separate pack.
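
A rough sketch of that cycle with today's commands, for concreteness
(run in the bare repository; pack names are illustrative, and the
imagined "git repack" option would wrap steps like these):

    # Unplug the .keep bit on the previous base pack.
    rm -f objects/pack/pack-$OLD_BASE.keep

    # Create a pack with the full history reachable from the refs we
    # want in the base (here branches and tags only, i.e. excluding
    # refs/changes/); pack-objects prints the name of the pack it wrote.
    new=$(git rev-list --objects --branches --tags |
          git pack-objects objects/pack/pack)

    # Mark the new base pack so subsequent repacks leave it alone.
    touch objects/pack/pack-$new.keep

    # Repack everything else; with -a, packs marked .keep are left out.
    git repack -a -d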

And the prime_clone() advertisement would be just the matter of
knowing how the "base" pack is marked in the above process.


Re: Resumable clone

2016-03-08 Thread Junio C Hamano
Kevin Wern  writes:

> From what I understand, a pattern exists in clone to download a
> packfile when a desired object isn't found as a resource. In this
> case, if no alternative is listed in http-alternatives, the client
> automatically checks the pack index(es) to see which packfile contains
> the object it needs.

You sound as if you are describing how a fetch over the dumb commit
walker http transport works.  That does not have anything to do with
the discussion of resumable clone, though, so I am not sure where
you are going with this.

> What I believe *doesn't* exist is a
> way for the server to say, "I have a resource, in this case a
> full-history packfile, and I *prefer* you get that file instead of
> attempting to traverse the object tree." This should be implemented in
> a way that is extensible to other resource types moving forward.

Yes, that is very close to what I said in the "what remains?"
section, but with a crucial difference in a detail.  Perhaps reading
the message you are responding to again more carefully will clear
the confusion.  This is what we want to allow the server to say
(from the message you are responding to, but rephrased slightly,
hoping that it would help unconfuse you):

I prefer not to serve a full clone to you in the usual route if
I can avoid it.  You can help me by populating your history first
with something else (which would bring you to a state as if you
had cloned a potentially slightly older version of me) and then
coming back to me for an additional fetch to complete the history.

That "something else" does not have to be, and is not expected to
be, the "full" history of the current state.  As long as it can be
used to bring the cloner to a reasonably recent state, sufficient to
make a follow up incremental fetch inexpesive enough, it is
appropriate.

> I'm not sure how the server should determine the returned resource. A
> packfile alone does not guarantee the full repo history, and I'm not
> positive checking the idx file for HEAD's commit hash ensures every
> sub-object is in that file (though I feel it should, because it is
> delta-compressed).

The above reasoning does not make much technical sense.  Delta
compression does not ensure connectivity in the commit history and
commit->tree->blob containment.  Again I am not sure where you are
going with this.

> With that in mind, my best guess at the server
> logic for packfiles is something like:
>
> Do I have a full history packfile, and am I configured to return one?
> - If yes, then return an answer specifying the file url and type (packfile)
> - Otherwise, return some other answer indicating the client must go
> through the original cloning process (or possibly return a different
> kind of file and type, once we expand that capability)

Roughly speaking, yes.

> Which leaves me with questions on how to test the above condition. Is
> there an expected place, such as config, where the user will specify
> the type of alternate resource, and should we assume some default if
> it isn't specified? Can the user optionally specify the exact file to
> use (I can't see why because it only invites more errors)? Should the
> specification of this option change git's behavior on update, such as
> making sure the full history is compressed? Does the existence of the
> HEAD object in the packfile ensure the repo's entire history is
> contained in that file?

Those (except for your assumption that no follow-up fetch is
allowed, which requires you to limit yourself to "full" history,
which is an unnecessary requirement) are good points one should be
making design decisions on when building this part of the system.



Re: Resumable clone

2016-03-08 Thread Duy Nguyen
On Tue, Mar 8, 2016 at 10:33 AM, Kevin Wern  wrote:
> Hey Junio and Duy,
>
> Thank you for your thorough responses! I'm new to git dev, so it's
> extremely helpful.
>
>> - The server side endpoint does not have to be, and I think it
>> should not be, implemented as an extension to the current
>> upload-pack protocol. It is perfectly fine to add a new "git
>> prime-clone" program next to existing "git upload-pack" and
>> "git receive-pack" programs and drive it through the
>> git-daemon, curl remote helper, and direct execution over ssh.
>
> I'd like to work on this, and continue through to implementing the
> prime_clone() client-side function.

Great! Although I think you started with the most configurable part;
something to work out, I guess.

> From what I understand, a pattern exists in clone to download a
> packfile when a desired object isn't found as a resource. In this
> case, if no alternative is listed in http-alternatives, the client
> automatically checks the pack index(es) to see which packfile contains
> the object it needs.

I don't follow this. What is "a resource"?

> However, the above is a fallback. What I believe *doesn't* exist is a
> way for the server to say, "I have a resource, in this case a
> full-history packfile

ah a resource could be the pack file to be downloaded, ok..

> , and I *prefer* you get that file instead of
> attempting to traverse the object tree." This should be implemented in
> a way that is extensible to other resource types moving forward.
>
> I'm not sure how the server should determine the returned resource. A
> packfile alone does not guarantee the full repo history

That's for the later part. At this point, I think the updated "git
clone" will request the new service you're writing and ask "do you
have a resumable pack I can download?", and it can return a URL. Then
prime_clone() proceeds to download and figure out what's in that pack.

Yeah the determination is tricky, it depends on server setup. Let's
start with selecting the pack for download first, because there could
be many of them. A heuristic (*) of choosing the biggest one in
$GIT_DIR/objects/pack is probably ok for now (we don't need full
history, "the biggest part of history" is good enough). Then we get
the pack file name, which can be used as a pack ID.
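
(The heuristic itself is cheap; a minimal sketch, assuming a bare
repository:)

    # Pick the biggest pack on disk and use its file name as the ID.
    biggest=$(ls -S "$GIT_DIR"/objects/pack/pack-*.pack | head -n 1)
    pack_id=$(basename "$biggest" .pack)    # e.g. pack-$SHA1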

For the simplest setup, I suppose the admin would give us a URL
prefix (or multiple prefixes), e.g. http://myserver.com/cache-here/,
and we are supposed to append the pack file name to it, so the full
URL would be http://myserver.com/cache-here/pack-$SHA1.pack. This is
what the new service will return to git-clone.

For a more complex setup, I guess the admin can provide a script that
takes the pack ID as a key and returns the list of URLs for us. They
can give us the path to this script via a config file.
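
Hypothetically that could look like the following (neither config key
existed at the time; the names are purely illustrative):

    # Simple setup: a static prefix the pack file name is appended to.
    git config primeclone.urlPrefix http://myserver.com/cache-here/

    # Complex setup: a script mapping a pack ID to a list of URLs.
    git config primeclone.urlScript /usr/local/bin/pack-id-to-urls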

(*) The source of this cached pack (and maybe of sending it to a CDN)
is git-repack. But when it's done and how it's done is really up to
the admins. So the admin really needs to provide us a script or
something that provides this info back, if we want to avoid
heuristics. Such a script can even choose to ignore the given pack ID
and output URLs based on repository identity only.

> , and I'm not
> positive checking the idx file for HEAD's commit hash ensures every
> sub-object is in that file (though I feel it should, because it is
> delta-compressed). With that in mind, my best guess at the server
> logic for packfiles is something like:
>
> Do I have a full history packfile, and am I configured to return one?
> - If yes, then return an answer specifying the file url and type (packfile)
> - Otherwise, return some other answer indicating the client must go
> through the original cloning process (or possibly return a different
> kind of file and type, once we expand that capability)

Well, the lack of this new service should be enough for git-clone to
fall back to the normal cloning protocol. The admin must enable this
service in git-daemon first if they want to use it. If there's no
suitable URL to show, it's ok to just disconnect. git-clone must be
able to deal with that and fall back.
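
Enabling it would presumably follow the pattern of the existing
optional daemon services (the service and config names here are the
hypothetical ones from this thread):

    # Per-repository opt-in, analogous to daemon.uploadarch:
    git config daemon.clonedownload true

    # Or enabled for the whole daemon on its command line:
    git daemon --enable=clone-download --export-all /srv/git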

> Which leaves me with questions on how to test the above condition. Is
> there an expected place, such as config, where the user will specify
> the type of alternate resource, and should we assume some default if
> it isn't specified? Can the user optionally specify the exact file to
> use (I can't see why because it only invites more errors)? Should the
> specification of this option change git's behavior on update, such as
> making sure the full history is compressed? Does the existence of the
> HEAD object in the packfile ensure the repo's entire history is
> contained in that file?

I think some of these questions are basically "ask the admins"; the
other half we can deal with when implementing prime_clone().

Following up on what I wrote above. Suppose your service's name is
clone-download (or any other name), the config variable
daemon.clonedownload 

Re: Resumable clone

2016-03-07 Thread Kevin Wern
Hey Junio and Duy,

Thank you for your thorough responses! I'm new to git dev, so it's
extremely helpful.

> - The server side endpoint does not have to be, and I think it
> should not be, implemented as an extension to the current
> upload-pack protocol. It is perfectly fine to add a new "git
> prime-clone" program next to existing "git upload-pack" and
> "git receive-pack" programs and drive it through the
> git-daemon, curl remote helper, and direct execution over ssh.

I'd like to work on this, and continue through to implementing the
prime_clone() client-side function.

From what I understand, a pattern exists in clone to download a
packfile when a desired object isn't found as a resource. In this
case, if no alternative is listed in http-alternatives, the client
automatically checks the pack index(es) to see which packfile contains
the object it needs.

However, the above is a fallback. What I believe *doesn't* exist is a
way for the server to say, "I have a resource, in this case a
full-history packfile, and I *prefer* you get that file instead of
attempting to traverse the object tree." This should be implemented in
a way that is extensible to other resource types moving forward.

I'm not sure how the server should determine the returned resource. A
packfile alone does not guarantee the full repo history, and I'm not
positive checking the idx file for HEAD's commit hash ensures every
sub-object is in that file (though I feel it should, because it is
delta-compressed). With that in mind, my best guess at the server
logic for packfiles is something like:

Do I have a full history packfile, and am I configured to return one?
- If yes, then return an answer specifying the file url and type (packfile)
- Otherwise, return some other answer indicating the client must go
through the original cloning process (or possibly return a different
kind of file and type, once we expand that capability)

Which leaves me with questions on how to test the above condition. Is
there an expected place, such as config, where the user will specify
the type of alternate resource, and should we assume some default if
it isn't specified? Can the user optionally specify the exact file to
use (I can't see why because it only invites more errors)? Should the
specification of this option change git's behavior on update, such as
making sure the full history is compressed? Does the existence of the
HEAD object in the packfile ensure the repo's entire history is
contained in that file?

Also, for now I'm assuming the same options should be available for
prime-clone as are available for upload-pack (--strict,
--timeout=). Let me know if any other features are necessary,
and let me know if I'm headed in the completely wrong direction...

Thank you so much for your help!


Re: Resumable clone

2016-03-06 Thread Junio C Hamano
Johannes Schindelin  writes:

> First of all: my main gripe with the discussed approach is that it uses
> bundles. I know, I introduced bundles, but they just seem too clunky and
> too static for the resumable clone feature.

We should make the mechanism extensible so that we can later support
multiple "alternate resource" formats, and "bundle" could be one of
the options, but my current thinking is that the initial version
should use just a bare packfile to bootstrap, not a bundle.

The format being "static" is both a feature and a practical
compromise.  It is a feature because it allows clone traffic, which
is a significant portion of the whole traffic to a busy hosting
site, to be diverted off of the core server network, saving both
networking and CPU cost.  And that benefit will be felt even if the
client has a good enough connection to the server that it does not
have to worry about resuming.  It is a practical compromise in that
the mechanism will not be extensible to helping incremental fetches,
but I hear that server side statistics tell us there aren't many
"duplicate incremental fetch" requests (i.e. many clients having the
same set of "have"s, so that the server side could prepare, serve,
and cache the same incremental pack, served over a resumable
transport supporting partial/range requests, to help resuming
clients); I do not think it is practical to try to use the same
mechanism to help both incremental and clone traffic.  One size
would not fit both here.

I think a better approach to help incremental fetches is along the
lines of what was discussed with Al Viro and others the other day.
You'd need various building blocks implemented anew, including:

 - A protocol extension to allow the client to tell the server a
   list of "not necessarily connected" objects that it has, so that
   the server side can exclude them from the set of objects the
   traditional "have"-"ack" exchange would determine to be sent when
   building a pack.

   - A design for deciding what "list of objects" is worth sending to
     the server side.  The total number of objects on the receiving
     end is an obvious upper bound, and it might be sufficient to
     send the whole thing as-is, but there may be a more efficient
     way to determine this set [*1*].

 - A way to salvage objects from a truncated pack, as there is no
   such tool in core-git.
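
(The closest existing approximation is probably "git unpack-objects
-r", which keeps going past corruption when exploding a pack into
loose objects, though it is not a complete answer:)

    # Explode a (truncated) pack into loose objects, making a best
    # effort to recover objects instead of dying at the first error.
    git unpack-objects -r < truncated.pack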


[Footnote]

*1* Once the traditional "have"-"ack" exchange determines the set of
    objects the sender thinks the receiver may not have, we need to
    figure out the ones that happen to exist on the receiving end
    already, either because they were salvaged from truncated pack
    data received earlier, or perhaps because they already existed
    thanks to fetching from a side branch (e.g. two repositories
    derived from the same upstream, updating from Linus's kernel
    tree by somebody who regularly interacts with the linux-next
    tree), and exclude them from the set of objects the sender
    sends.

    I've long felt that Eppstein's invertible Bloom filter might be
    a good way to determine efficiently, among the sets of objects
    the sending and the receiving ends have, which ones are common,
    but I haven't looked into this deeply myself.


Re: Resumable clone

2016-03-06 Thread Junio C Hamano
Duy Nguyen  writes:

> One thing Junio didn't mention in his summary is the use of pack
> bitmap [1]. Jeff talked about GitHub specific needs,...

Do not take it as the "summary of the whole discussion".

I deliberately tried to limit the list to the absolute minimum needed
to allow building a workable initial version, while leaving the door
open for future extension.  There are other things that I didn't
mention because they would not have to be in the absolute minimum
necessary for such an initial design.




Re: Resumable clone

2016-03-06 Thread Duy Nguyen
On Sun, Mar 6, 2016 at 3:49 PM, Duy Nguyen  wrote:
> One thing Junio didn't mention in his summary is the use of pack
> bitmap [1]. 
>
> [1] http://thread.gmane.org/gmane.comp.version-control.git/288205/focus=288222

Oops, wrong link. Should be this one

http://article.gmane.org/gmane.comp.version-control.git/288258
-- 
Duy


Re: Resumable clone

2016-03-06 Thread Duy Nguyen
On Sun, Mar 6, 2016 at 2:59 PM, Johannes Schindelin
 wrote:
> First of all: my main gripe with the discussed approach is that it uses
> bundles. I know, I introduced bundles, but they just seem too clunky and
> too static for the resumable clone feature.

One thing Junio didn't mention in his summary is the use of pack
bitmaps [1]. Jeff talked about GitHub-specific needs, but I think it
has value even outside of GitHub: if people store some secret refs in
the initial pack (e.g. hidden by a ref namespace, or in the reflog -
imagine someone committed a password and did a reset --hard then
pushed again), they probably do not want to publish the initial pack
as-is. With pack bitmaps, we can sort of recreate the "clean initial
pack" on the fly relatively cheaply, because the object order is
stable in this particular pack. We go a tiny bit less static with
this resume+pack bitmap combination.

[1] http://thread.gmane.org/gmane.comp.version-control.git/288205/focus=288222
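
(For context, producing such a bitmap is already possible at repack
time; a minimal sketch:)

    # Write a .bitmap alongside the pack at the next full repack.
    git config repack.writeBitmaps true
    git repack -a -d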

> So I wonder whether it would be possible to come up with a subset of the
> revs with a stable order, with associated thin packs (using prior revs as
> negative revs in the commit range) such that each thin pack weighs roughly
> 1MB (or whatever granularity you desire). My thinking was that it should
> be possible to follow a strategy similar to bisect to come up with said
> list.
>
> The client could then state that it was interrupted at downloading a given
> rev's pack, with a specific offset, and the (thin) pack could be
> regenerated on the fly (or cached), serving only the desired chunk. The
> server would then also automatically know where in the list of
> stable-ordered revs the clone was interrupted and continue with the next
> one.
>
> Oh, and if regenerating the thin pack instead of caching it, we need to
> ensure a stable packing (i.e. no threads!). That is, given a commit range,
> we need to (re-)generate bytewise-identical thin packs.

The bytewise-identical idea is already shot down. But I like the
splitting into multiple thin packs (and re-downloading a whole thin
pack when a transfer fails). Multiple thin packs allow resume
capability. Chaining thin packs saves bandwidth. And pack-objects
still has the freedom to do anything inside each thin pack. For
gigantic repos and good-enough connections, this could work (even
for fetch/pull).

The biggest problem I see is that it's hard for rev-list to split
thin packs based on pack size, because we do not know that until
pack-objects has consumed all revs and produced the pack.
Approximation based on the number of objects should probably be ok
unless there are very large blobs. But that should probably be
addressed separately by the resurrection of Junio's split-blob
series.
-- 
Duy


Re: Resumable clone

2016-03-05 Thread Johannes Schindelin
Hi Junio & Duy,

On Sat, 5 Mar 2016, Junio C Hamano wrote:

> Duy Nguyen  writes:
> 
> > Resumable clone is happening. See [1] for the basic idea, [2] and [3]
> > for some preparation work. I'm sure you can help. Once you've gone
> > through at least [1], I think you can pick something (e.g. finalizing
> > the protocol, updating the server side, or git-clone)
> >
> > [1] http://thread.gmane.org/gmane.comp.version-control.git/285921
> > [2] 
> > http://thread.gmane.org/gmane.comp.version-control.git/288080/focus=288150
> > [3] 
> > http://thread.gmane.org/gmane.comp.version-control.git/288205/focus=288222
> 
> I think your response needs to be refined with a bit of a higher-level
> overview, though.  Here are some thoughts to summarize the discussion
> and to extend it.
> 
> I think the right way to think about this is that we are adding a
> capability for the server to instruct the clients: I prefer not to
> serve a full clone to you in the usual route if I can avoid it.  You
> can help me by going to an alternate resource and populating your
> history first, and then coming back to me for an additional fetch to
> complete the history if you want to.  Doing so would also help you
> because that alternate resource can be a static file (or two) that
> you can download over a resumable transport (like static files
> served over HTTPS).

For quite some time I considered presenting some alternate/additional
ideas. I feel a little bad for mentioning them here because I *really*
have no time to follow up on them whatsoever. But maybe they will turn
out to contribute something to the final solution.

I tried to follow the discussion as much as possible, sometimes
failing due to time constraints, so I'd like to apologize in advance
if any of these ideas have been mentioned already.

First of all: my main gripe with the discussed approach is that it
uses bundles. I know, I introduced bundles, but they just seem too
clunky and too static for the resumable clone feature.

So I wonder whether it would be possible to come up with a subset of
the revs with a stable order, with associated thin packs (using prior
revs as negative revs in the commit range) such that each thin pack
weighs roughly 1MB (or whatever granularity you desire). My thinking
was that it should be possible to follow a strategy similar to bisect
to come up with said list.

The client could then state that it was interrupted at downloading a given
rev's pack, with a specific offset, and the (thin) pack could be
regenerated on the fly (or cached), serving only the desired chunk. The
server would then also automatically know where in the list of
stable-ordered revs the clone was interrupted and continue with the next
one.

Oh, and if regenerating the thin pack instead of caching it, we need to
ensure a stable packing (i.e. no threads!). That is, given a commit range,
we need to (re-)generate bytewise-identical thin packs.
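
A sketch of producing one such chunk with today's plumbing (REV and
PREV stand for adjacent revs in that stable list; the receiving side
would complete each chunk with "git index-pack --fix-thin"):

    # Build a thin pack for PREV..REV, using PREV as the negative rev;
    # note that --thin only works together with --stdout.
    printf '%s\n^%s\n' "$REV" "$PREV" |
    git pack-objects --revs --thin --stdout >"chunk-$REV.pack"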

Of course this stable-ordered rev list would have to be persisted when the
server serves its first resumable clone and then extended with future
resumable clones whenever new revisions were pushed. (And there would also
have to be some way to evict no-longer-reachable revs, maybe by simply
regenerating the whole shebang.)

For all of this to work, the most crucial idea would be this one: a
clone can *always* start as-is. Only when interrupted, and only when
the server supports the "resumable clone" capability, would the
client, on "resuming" the clone, *actually* ask for a resumable
clone.

Yes, this could potentially waste a bit of bandwidth for a user with
a flaky connection (because whatever was transferred during the
first, non-resumable clone would be thrown out the window), but it
might make it easier for us to provide a non-fragile upgrade path,
because the cloning process would still default to the current one.

Food for thought?

Ciao,
Dscho


Re: Resumable clone

2016-03-05 Thread Junio C Hamano
Junio C Hamano  writes:

> So what remains?  Here is a rough and still slushy outline:
>
>  - A new method, prime_clone(), in "struct transport" for "git
>clone" client to first call to learn the location of the
>"alternate resource" from the server.
>
>...
>- The format of the returned "answer" needs to be designed.  It
>  must be able to express:
>
>  - the location of the resource, i.e. a URL;
>
>  - the type of resource, if we want this to be extensible.  I
>think we should initially limit it to "a single full history
>.pack", so from that point of view this may not be absolutely
>necessary, but we already know that we may want to say "go
>there and you will find an old-style bundle file" to support
>the kernel.org CDN, and we may also want to support Jeff's
>"split bundle" or Shawn's ".info" file.  A resource poor
>(read: personal) machine that hosts a personal of a popular

Sorry for a typo: s/of a/fork &/;

>project might want to name a "git clone" URL for that popular
>project it forked from (e.g. "Clone Linus's repository from
>kernel.org and then come back here for incremental fetch").


Re: Resumable clone

2016-03-05 Thread Junio C Hamano
Duy Nguyen  writes:

> Resumable clone is happening. See [1] for the basic idea, [2] and [3]
> for some preparation work. I'm sure you can help. Once you've gone
> through at least [1], I think you can pick something (e.g. finalizing
> the protocol, updating the server side, or git-clone)
>
> [1] http://thread.gmane.org/gmane.comp.version-control.git/285921
> [2] http://thread.gmane.org/gmane.comp.version-control.git/288080/focus=288150
> [3] http://thread.gmane.org/gmane.comp.version-control.git/288205/focus=288222

I think your response needs to be refined with a bit of a
higher-level overview, though.  Here are some thoughts to summarize
the discussion and to extend it.

I think the right way to think about this is that we are adding a
capability for the server to instruct the clients: I prefer not to
serve a full clone to you in the usual route if I can avoid it.  You
can help me by going to an alternate resource and populating your
history first, and then coming back to me for an additional fetch to
complete the history if you want to.  Doing so would also help you
because that alternate resource can be a static file (or two) that
you can download over a resumable transport (like static files
served over HTTPS).

That alternate resource could be just an old-style bundle file
(e.g. kernel.org prepares such a bundle file for Linus's kernel
repository and makes it available on CDN on a weekly basis;
cf. https://kernel.org/cloning-linux-from-a-bundle.html).
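
(For reference, the client side of that bundle workflow looks roughly
like this; the CDN URL is abbreviated, see the kernel.org page above
for the real one:)

    # Fetch the bundle over a resumable transport (curl -C - resumes).
    curl -C - -O https://cdn.kernel.org/.../clone.bundle
    # Clone from the local bundle, then point origin at the live repo.
    git clone clone.bundle linux
    cd linux
    git remote set-url origin \
        https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    git fetch origin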

One downside of using the old-style bundle is that it would weigh
about the same as the fully repacked bare repository itself, and
would require the same amount of CPU and disk resources to generate
as it would take to repack.  The "split bundle" discussion with Jeff
King is about one possible way to reduce that waste.  The old-style
bundle is just a header tucked in front of a packfile, and by
introducing a new bundle format that stores only the header part in
a file that points at an existing packfile, we can reduce the waste.
A few patches from me on "bundle" and "index-pack --clone-bundle"
sent over the past several days are about that approach.  During the
periodic repacks the server operators make, we can also create the
header part of the new bundle format, pointing at the full packfile
that is produced anyway to serve the regular "fetch/push" traffic.

My response to [3] in the thread further points at a new direction.
The "alternate resource" does not have to be a bundle, but can be
just a full packfile (i.e. pack-$name.pack).  After a full repack,
the server operators can make the packfile available to clients over
a resumable transport.  The client has to run "index-pack" on the
downloaded pack-$name.pack to generate the "pack-$name.idx" file in
order to make it usable, so the logic to implement "--clone-bundle"
introduced initially for the "split bundle" approach can be
repurposed to be run on the client.  With a single pack-$name.pack
file, the client can

 - Place it in .git/objects/pack in an empty repository;

 - Generate the corresponding pack-$name.idx file next to it;

 - Learn where the tips of histories (i.e. "all objects that are
   reachable from these objects are already available in this
   repository") are.

And the above is sufficient to do the "coming back to me for an
additional fetch" efficiently.  The tips of histories can be sent as
extra "have" records during such a fetch with a minor update to the
"fetch" code.

So what remains?  Here is a rough and still slushy outline:

 - A new method, prime_clone(), in "struct transport" for "git
   clone" client to first call to learn the location of the
   "alternate resource" from the server.

   - The server side endpoint does not have to be, and I think it
     should not be, implemented as an extension to the current
     upload-pack protocol.  It is perfectly fine to add a new "git
     prime-clone" program next to existing "git upload-pack" and
     "git receive-pack" programs and drive it through the
     git-daemon, curl remote helper, and direct execution over ssh.

   - The format of the returned "answer" needs to be designed.  It
     must be able to express:

     - the location of the resource, i.e. a URL;

     - the type of resource, if we want this to be extensible.  I
       think we should initially limit it to "a single full history
       .pack", so from that point of view this may not be absolutely
       necessary, but we already know that we may want to say "go
       there and you will find an old-style bundle file" to support
       the kernel.org CDN, and we may also want to support Jeff's
       "split bundle" or Shawn's ".info" file.  A resource poor
       (read: personal) machine that hosts a personal fork of a
       popular project might want to name a "git clone" URL for that
       popular project it forked from (e.g. "Clone Linus's repository
       from kernel.org and then come back here for incremental
       fetch").
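
     As a strawman only (no format had been agreed on at this
     point), the answer could be as small as two records, e.g.:

         url http://cdn.example.com/pack-1234abcd.pack
         type packfile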

 - A new method, 

Re: Resumable clone

2016-03-05 Thread Duy Nguyen
On Sat, Mar 5, 2016 at 4:23 PM, Kevin Wern  wrote:
> Hey, all,
>
> A while ago, I noticed that if a clone is interrupted, it has to be
> restarted completely. When I looked up the issue, I saw that someone
> suggested resumable clones as a feature as early as 2007.  It doesn't
> seem like any progress has been made on this issue, so I began looking
> at relevant code yesterday to start working on it.
>
> Is someone working on this currently?  Are there any things I should
> know moving forward?  Is there a certain way I should break
> down/organize the feature when writing patches?

Resumable clone is happening. See [1] for the basic idea, [2] and [3]
for some preparation work. I'm sure you can help. Once you've gone
through at least [1], I think you can pick something (e.g. finalizing
the protocol, updating the server side, or git-clone)

[1] http://thread.gmane.org/gmane.comp.version-control.git/285921
[2] http://thread.gmane.org/gmane.comp.version-control.git/288080/focus=288150
[3] http://thread.gmane.org/gmane.comp.version-control.git/288205/focus=288222
-- 
Duy