Re: [PATCH 00/11] Resumable clone
Junio C Hamano wrote:
> Eric Wong writes:
>
>>> [primeclone]
>>> url = http://location/pack-$NAME.pack
>>> filetype = pack
>>
>> If unconfigured, I wonder if a primeclone pack can be inferred by
>> the existence of a pack bitmap (or merely being the biggest+oldest
>> pack for dumb HTTP).
>
> That would probably be a nice heuristic, but it is unclear who
> should find that out at runtime. The downloading side would not
> have visibility into the directory listing.

I think making a bunch of HEAD requests based on the contents of
$GIT_DIR/objects/info/packs wouldn't be too expensive on either end,
especially when HTTP/1.1 persistent connections + pipelining may be
used.
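The "biggest+oldest pack" heuristic mentioned above could be sketched client-side: given (name, size, mtime) tuples learned from HEAD responses (Content-Length / Last-Modified) against the packs listed in objects/info/packs, pick the most likely historical pack. The data shape and tie-breaking below are my own assumptions, not anything git implements:

```python
# Sketch: choose a "prime clone" pack candidate from HEAD-request results.
# Each tuple is assumed to be (name, size_bytes, mtime_epoch); nothing
# here reflects real git behavior.

def pick_prime_pack(packs):
    """Return the name of the pack most likely to hold the bulk of
    history: largest first, with oldest as the tie-breaker."""
    if not packs:
        return None
    return max(packs, key=lambda p: (p[1], -p[2]))[0]
```

A caller would then hand the winning pack's URL to the resumable-download machinery instead of requiring a [primeclone] configuration on the server.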
Re: [PATCH 00/11] Resumable clone
Junio C Hamano writes:
> Junio C Hamano writes:
>
> What "git clone" should have been was:
>
>  * Parse command line arguments;
>
>  * Create a new repository and go into it; this step would
>    require us to have parsed the command line for --template,
>    --separate-git-dir, etc.
>
>  * Talk to the remote and do get_remote_heads() aka ls-remote
>    output;
>
>  * Decide what fetch refspec to use, which alternate object store
>    to borrow from; this step would require us to have parsed the
>    command line for --reference, --mirror, --origin, etc.;
>
>  --- we'll insert something new here ---
>
>  * Issue "git fetch" with the refspec determined above; this step
>    would require us to have parsed the command line for --depth,
>    etc.
>
>  * Run "git checkout -b" to create an initial checkout; this step
>    would require us to have parsed the command line for --branch,
>    etc.
>
> Even though the current code conceptually does the above, these
> steps are not cleanly separated as such. I think our update to gain
> the "resumable clone" feature on the client side needs to start by
> refactoring the current code, before learning "resumable clone", to
> look like the above.
>
> Once we do that, we can insert an extra step before the step that
> runs "git fetch" to optionally [*1*] grab the extra piece of
> information Kevin's "prime-clone" service produces [*2*], and store
> it in the "new repository" somewhere [*3*].
>
> And then, as you suggested, an updated "git fetch" can be taught to
> notice the priming information left by the previous step, and use it
> to attempt to download the pack until success, and to index that
> pack to learn the tips that can be used as ".have" entries in the
> request. From the original server's point of view, this fetch
> request would "want" the same set of objects, but would appear as
> an incremental update.
Thinking about this even more, it probably makes even more sense to
move the new "learn prime info and store it in the repository
somewhere, so that a later re-invocation of 'git fetch' can take
advantage of it" step _into_ "git fetch". That would allow "git
fetch" in a freshly created empty repository to take advantage of
this feature for free.

The step in which "git clone" internally drives "git fetch" would
not actually be done by spawning a separate process with
run_command(), because we would want to reuse the connection we
already have with the server from when "git clone" first talked to
it to learn the "ls-remote" equivalent (i.e.
transport_get_remote_refs()). I wonder if we can do without this
early "ls-remote"; that would further simplify things by allowing us
to just spawn "git fetch" internally.
Re: [PATCH 00/11] Resumable clone
Junio C Hamano writes:
>>> git clone --resume <dir>
>>
>> I think calling "git fetch" should resume, actually.
>> It would reduce the learning curve and seems natural to me:
>> "fetch" is about grabbing whatever else appeared since the
>> last clone/fetch happened.
>
> I hate to say this, but it sounds to me like a terrible idea. At
> that point when you need to resume, there is not even a ref for
> "fetch" to base its incremental work off of. It is better to keep
> the knowledge of this "priming" dance inside "clone". Hopefully the
> original "clone" whose connection was disconnected in the middle
> would automatically attempt resuming, and "clone --resume" would
> not be needed as often.

After sleeping on this, I want to take the above back. I think
teaching "git fetch" about the "resume" part makes tons of sense.

What "git clone" should have been was:

 * Parse command line arguments;

 * Create a new repository and go into it; this step would require
   us to have parsed the command line for --template,
   --separate-git-dir, etc.

 * Talk to the remote and do get_remote_heads() aka ls-remote
   output;

 * Decide what fetch refspec to use, which alternate object store to
   borrow from; this step would require us to have parsed the
   command line for --reference, --mirror, --origin, etc.;

 --- we'll insert something new here ---

 * Issue "git fetch" with the refspec determined above; this step
   would require us to have parsed the command line for --depth,
   etc.

 * Run "git checkout -b" to create an initial checkout; this step
   would require us to have parsed the command line for --branch,
   etc.

Even though the current code conceptually does the above, these
steps are not cleanly separated as such. I think our update to gain
the "resumable clone" feature on the client side needs to start by
refactoring the current code, before learning "resumable clone", to
look like the above.
Once we do that, we can insert an extra step before the step that
runs "git fetch" to optionally [*1*] grab the extra piece of
information Kevin's "prime-clone" service produces [*2*], and store
it in the "new repository" somewhere [*3*].

And then, as you suggested, an updated "git fetch" can be taught to
notice the priming information left by the previous step, and use it
to attempt to download the pack until success, and to index that
pack to learn the tips that can be used as ".have" entries in the
request. From the original server's point of view, this fetch
request would "want" the same set of objects, but would appear as an
incremental update.

Of course, the final step that happens in "git clone", i.e. the
initial checkout, needs to be done somehow if your user decides to
resume with "git fetch", as "git fetch" _never_ touches the working
tree. So for that purpose, the primary end-user facing interface may
still have to be "git clone --resume <dir>". That would probably
skip all four steps in the above sequence and the new "download
priming information" step, and go directly to the step that runs
"git fetch".

I do agree that is a much better design, and the crucial design
decision that makes it better is your making "git fetch" aware of
this "ah, we have the instruction left in this repository for how to
prime its object store" information.

Thanks.

[Footnotes]

*1* It is debatable whether it would be an overall win to use the
"first prime by grabbing a large packfile" clone if we are doing a
shallow or single-branch clone, hence "optionally". It is important
to notice that we already have enough information to base the
decision on at this point in the above sequence.

*2* As I said, I do not think it needs to be a separate new service,
and I suspect it may be a better design to carry it over a protocol
extension.
At this point in the above sequence, we have done an equivalent of
ls-remote, and if we designed a protocol extension to carry the
information, we should already have it. If we use a separate new
service, we can of course make a separate connection to ask about
the "prime-clone" information. The way this piece of information is
transmitted is of secondary importance.

*3* In addition to the "prime-clone" information, we may need to
store, around here, some information that is only known to "clone"
(perhaps because it was given from the command line) to help the
final "checkout -b" step know what to check out, in case the next
"fetch" step is interrupted and killed.
Re: [PATCH 00/11] Resumable clone
Eric Wong writes:
>> [primeclone]
>> url = http://location/pack-$NAME.pack
>> filetype = pack
>
> If unconfigured, I wonder if a primeclone pack can be inferred by
> the existence of a pack bitmap (or merely being the biggest+oldest
> pack for dumb HTTP).

That would probably be a nice heuristic, but it is unclear who
should find that out at runtime. The downloading side would not have
visibility into the directory listing.

>> git clone --resume <dir>
>
> I think calling "git fetch" should resume, actually.
> It would reduce the learning curve and seems natural to me:
> "fetch" is about grabbing whatever else appeared since the
> last clone/fetch happened.

I hate to say this, but it sounds to me like a terrible idea. At
that point when you need to resume, there is not even a ref for
"fetch" to base its incremental work off of. It is better to keep
the knowledge of this "priming" dance inside "clone". Hopefully the
original "clone" whose connection was disconnected in the middle
would automatically attempt resuming, and "clone --resume" would not
be needed as often.
Re: [PATCH 00/11] Resumable clone
Kevin Wern wrote:
> Hey, all,
>
> It's been a while (sent a very short patch in May), but I've
> still been working on the resumable clone feature and checking up
> on the mailing list for any updates. After submitting the
> prime-clone service alone, I figured implementing the whole thing
> would be the best way to understand the full scope of the problem
> (this is my first real contribution here, and learning while
> working on such an involved feature has not been easy).

Thank you for working on this. I'm hugely interested in this feature
as both a cheapskate sysadmin and as a client with unreliable
connectivity (and I've barely been connected this month).

> This is a functional implementation handling a direct http/ftp URI
> to a single, fully connected packfile (i.e. the link is a direct
> path to the file, not a prefix or guess). My hope is that this
> acts as a bare minimum cross-section spanning the full
> requirements that can expand in width as more cases are added
> (.info file, split bundle, daemon download service). This is
> certainly not perfect, but I think it at least prototypes each
> component involved in the workflow.
>
> This patch series is based on jc/bundle, because the logic to find
> the tips of a pack's history already exists there (I call
> index-pack --clone-bundle on the downloaded file, and read the
> file to write the references to a temporary directory). If I need
> to re-implement this logic or base it on another branch, let me
> know. For ease of pulling and testing, I included the branch here:
>
> https://github.com/kevinwern/git/tree/feature/prime-clone

Am I correct that this imposes no additional storage burden for
servers? (unlike the current .bundle dance used by kernel.org:
https://www.kernel.org/cloning-linux-from-a-bundle.html )
That would be great!
> Although there are a few changes internally from the last patch,
> the "alternate resource" url to download is configured on the
> server side in exactly the same way:
>
> [primeclone]
> url = http://location/pack-$NAME.pack
> filetype = pack

If unconfigured, I wonder if a primeclone pack can be inferred by
the existence of a pack bitmap (or merely being the biggest+oldest
pack for dumb HTTP).

> The prime-clone service simply outputs the components as:
>
> <url> <filetype>
>
> On the client side, the transport_prime_clone and
> transport_download_primer APIs are built to be more robust (i.e.
> read messages without dying due to protocol errors), so that git
> clone can always try them without being dependent on the
> capability output of git-upload-pack. transport_download_primer is
> dependent on the success of transport_prime_clone, but
> transport_prime_clone is always run on an initial clone. Part of
> achieving this robustness involves adding *_gentle functions to
> pkt_line, so that prime_clone can fail silently without dying.
>
> The transport_download_primer function uses a resumable download,
> which is applicable to both automatic and manual resuming.
> Automatic is programmatically reconnecting to the resource after
> being interrupted (up to a set number of times). Manual is using a
> newly taught --resume option on the command line:
>
> git clone --resume <dir>

I think calling "git fetch" should resume, actually.
It would reduce the learning curve and seems natural to me:
"fetch" is about grabbing whatever else appeared since the
last clone/fetch happened.

> Right now, a manually resumable directory is left behind only if
> the *client* is interrupted while a new junk mode,
> JUNK_LEAVE_RESUMABLE, is set (right before the download). For an
> initial clone, if the connection fails after automatic resuming,
> the client erases the partial resources and falls through to a
> normal clone.
> However, once a resumable directory is left behind by the program,
> it is NEVER deleted/abandoned after it is continued with --resume.

I'm not sure if erasing partial resources should ever be done
automatically. Perhaps a note to the user explaining the situation
and potential ways to correct/resume it.

> I think determining when a resource is "unsalvageable" should be
> more nuanced. Especially in a case where a connection is
> perpetually poor and the user wishes to resume over a long period
> of time. The timeout logic itself *definitely* needs more nuance
> than "repeat 5 times", such as expanding wait times and using
> earlier successes when deciding to try again. Right now, I think
> the most important part of this patch is that these two paths
> (falling through after a failed download, exiting to be manually
> resumed later) exist.
>
> Off the top of my head, outstanding issues/TODOs include:
> - The above issue of determining when to fall through, when to
>   reattempt, and when to write the resumable info and exit
>   in git clone.

My current (initial) reaction is: you're overthinking this. I think
it's less surprising to a user to always write resumable info a
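The "expanding wait times" idea quoted above is essentially exponential backoff. A minimal sketch, with the retry policy and numbers being my own illustration rather than anything in the series:

```python
# Sketch: retry a download step with expanding wait times instead of a
# flat "repeat 5 times". The sleep function is injectable so the policy
# can be tested; real code would pass time.sleep.

def retry_with_backoff(step, attempts=5, base_delay=1.0, sleep=None):
    """Call step() until it returns True or attempts run out.
    Waits base_delay, 2*base_delay, 4*base_delay, ... between tries.
    Returns True on success, False if every attempt failed."""
    delay = base_delay
    for i in range(attempts):
        if step():
            return True
        if i < attempts - 1:
            if sleep is not None:
                sleep(delay)
            delay *= 2
    return False
```

Kevin's other suggestion ("using earlier successes when deciding to try again") could be layered on top, e.g. by resetting the delay whenever a connection makes measurable progress before dropping.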
Re: [PATCH 00/11] Resumable clone
Kevin Wern writes:
> It's been a while (sent a very short patch in May), but I've still
> been working on the resumable clone feature and checking up on the
> mailing list for any updates. After submitting the prime-clone
> service alone, I figured implementing the whole thing would be the
> best way to understand the full scope of the problem (this is my
> first real contribution here, and learning while working on such
> an involved feature has not been easy).

It may not have been easy, but I hope it has been a fun journey for
you ;-)

> On the client side, the transport_prime_clone and
> transport_download_primer APIs are built to be more robust (i.e.
> read messages without dying due to protocol errors), so that git
> clone can always try them without being dependent on the
> capability output of git-upload-pack. transport_download_primer is
> dependent on the success of transport_prime_clone, but
> transport_prime_clone is always run on an initial clone. Part of
> achieving this robustness involves adding *_gentle functions to
> pkt_line, so that prime_clone can fail silently without dying.

OK.

> Right now, a manually resumable directory is left behind only if
> the *client* is interrupted while a new junk mode,
> JUNK_LEAVE_RESUMABLE, is set (right before the download). For an
> initial clone, if the connection fails after automatic resuming,
> the client erases the partial resources and falls through to a
> normal clone. However, once a resumable directory is left behind
> by the program, it is NEVER deleted/abandoned after it is
> continued with --resume.

Sounds like you made a sensible design decision here.

> - When running with ssh and a password, the credentials are
>   prompted for twice. I don't know if there is a way to
>   preserve credentials between executions. I couldn't find any
>   examples in git's source.

We leave credential reuse to keyring services like ssh-agent.
[PATCH 00/11] Resumable clone
Hey, all,

It's been a while (sent a very short patch in May), but I've still
been working on the resumable clone feature and checking up on the
mailing list for any updates. After submitting the prime-clone
service alone, I figured implementing the whole thing would be the
best way to understand the full scope of the problem (this is my
first real contribution here, and learning while working on such an
involved feature has not been easy).

This is a functional implementation handling a direct http/ftp URI
to a single, fully connected packfile (i.e. the link is a direct
path to the file, not a prefix or guess). My hope is that this acts
as a bare minimum cross-section spanning the full requirements that
can expand in width as more cases are added (.info file, split
bundle, daemon download service). This is certainly not perfect, but
I think it at least prototypes each component involved in the
workflow.

This patch series is based on jc/bundle, because the logic to find
the tips of a pack's history already exists there (I call index-pack
--clone-bundle on the downloaded file, and read the file to write
the references to a temporary directory). If I need to re-implement
this logic or base it on another branch, let me know. For ease of
pulling and testing, I included the branch here:

https://github.com/kevinwern/git/tree/feature/prime-clone

Although there are a few changes internally from the last patch, the
"alternate resource" url to download is configured on the server
side in exactly the same way:

[primeclone]
url = http://location/pack-$NAME.pack
filetype = pack

The prime-clone service simply outputs the components as:

<url> <filetype>

On the client side, the transport_prime_clone and
transport_download_primer APIs are built to be more robust (i.e.
read messages without dying due to protocol errors), so that git
clone can always try them without being dependent on the capability
output of git-upload-pack.
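The prime-clone response described above is a single line carrying the resource URL and its filetype. A sketch of how a client might parse and validate it; the accepted filetypes and the fall-back-on-malformed behavior are my assumptions (the series so far only handles "pack"):

```python
# Sketch: parse the one-line prime-clone response ("<url> <filetype>")
# into a validated pair. Accepted filetypes and URL schemes are
# illustrative assumptions, not the patch's actual rules.

KNOWN_FILETYPES = {"pack"}  # could later grow: "bundle", ".info", ...

def parse_prime_clone(line):
    """Return (url, filetype), or None if the line is malformed so the
    caller can fall back to a normal clone instead of dying."""
    parts = line.strip().split()
    if len(parts) != 2:
        return None
    url, filetype = parts
    if filetype not in KNOWN_FILETYPES:
        return None
    if not (url.startswith("http://") or url.startswith("https://")
            or url.startswith("ftp://")):
        return None
    return url, filetype
```

Returning None rather than erroring mirrors the "fail silently without dying" goal of the *_gentle pkt_line additions.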
transport_download_primer is dependent on the success of
transport_prime_clone, but transport_prime_clone is always run on an
initial clone. Part of achieving this robustness involves adding
*_gentle functions to pkt_line, so that prime_clone can fail
silently without dying.

The transport_download_primer function uses a resumable download,
which is applicable to both automatic and manual resuming. Automatic
is programmatically reconnecting to the resource after being
interrupted (up to a set number of times). Manual is using a newly
taught --resume option on the command line:

git clone --resume <dir>

Right now, a manually resumable directory is left behind only if the
*client* is interrupted while a new junk mode, JUNK_LEAVE_RESUMABLE,
is set (right before the download). For an initial clone, if the
connection fails after automatic resuming, the client erases the
partial resources and falls through to a normal clone. However, once
a resumable directory is left behind by the program, it is NEVER
deleted/abandoned after it is continued with --resume.

I think determining when a resource is "unsalvageable" should be
more nuanced, especially in a case where a connection is perpetually
poor and the user wishes to resume over a long period of time. The
timeout logic itself *definitely* needs more nuance than "repeat 5
times", such as expanding wait times and using earlier successes
when deciding to try again. Right now, I think the most important
part of this patch is that these two paths (falling through after a
failed download, exiting to be manually resumed later) exist.

Off the top of my head, outstanding issues/TODOs include:
- The above issue of determining when to fall through, when to
  reattempt, and when to write the resumable info and exit in git
  clone.
- Creating a git-daemon service to download a resumable resource.
  Pretty straightforward, I think, especially if http.getanyfile
  already exists. This falls more under "haven't gotten to yet"
  than dilemma.
- Logic for git clone to determine when a full clone would be
  superior, such as when a clone is local or a reference is given.
- Configuring prime-clone for multiple resources, in two dimensions:
  (a) resources to choose from (e.g. fall back to a second resource
  if the first one doesn't work) and (b) resources to be downloaded
  together or in sequence (e.g. download http://host/this, then
  http://host/that). Maybe prime-clone could also handle client
  preferences in terms of filetype or protocol. For this, I just
  have to re-read a few discussions about the filetypes we use to
  see if there are any outliers that aren't representable in this
  way. I think this is another "haven't gotten to yet".
- Related to the above, seeing if there are any outlying resource
  types whose process can't be modularized into: download t