Hey, all,

It's been a while (sent a very short patch in May), but I've still been working on the resumable clone feature and checking up on the mailing list for any updates. After submitting the prime-clone service alone, I figured implementing the whole thing would be the best way to understand the full scope of the problem (this is my first real contribution here, and learning while working on such an involved feature has not been easy).

This is a functional implementation handling a direct http/ftp URI to a single, fully connected packfile (i.e. the link is a direct path to the file, not a prefix or guess). My hope is that this acts as a bare minimum cross-section spanning the full requirements that can expand in width as more cases are added (.info file, split bundle, daemon download service). This is certainly not perfect, but I think it at least prototypes each component involved in the workflow.

This patch series is based on jc/bundle, because the logic to find the tips of a pack's history already exists there (I call index-pack --clone-bundle on the downloaded file, and read the file to write the references to a temporary directory). If I need to re-implement this logic or base it on another branch, let me know. For ease of pulling and testing, I included the branch here:

https://github.com/kevinwern/git/tree/feature/prime-clone

Although there are a few changes internally from the last patch, the "alternate resource" url to download is configured on the server side in exactly the same way:

[primeclone]
	url = http://location/pack-$NAME.pack
	filetype = pack

The prime-clone service simply outputs the components as:

####url filetype
0000

On the client side, the transport_prime_clone and transport_download_primer APIs are built to be more robust (i.e. read messages without dying due to protocol errors), so that git clone can always try them without being dependent on the capability output of git-upload-pack. transport_download_primer is dependent on the success of transport_prime_clone, but transport_prime_clone is always run on an initial clone. Part of achieving this robustness involves adding *_gentle functions to pkt_line, so that prime_clone can fail silently without dying.

The transport_download_primer function uses a resumable download, which is applicable to both automatic and manual resuming.
Automatic is programmatically reconnecting to the resource after being interrupted (up to a set number of times). Manual is using a newly taught --resume option on the command line:

git clone --resume <resumable_work_or_git_dir>

Right now, a manually resumable directory is left behind only if the *client* is interrupted while a new junk mode, JUNK_LEAVE_RESUMABLE, is set (right before the download). For an initial clone, if the connection fails after automatic resuming, the client erases the partial resources and falls through to a normal clone. However, once a resumable directory is left behind by the program, it is NEVER deleted/abandoned after it is continued with --resume.

I think determining when a resource is "unsalvageable" should be more nuanced, especially in a case where a connection is perpetually poor and the user wishes to resume over a long period of time. The timeout logic itself *definitely* needs more nuance than "repeat 5 times", such as expanding wait times and using earlier successes when deciding to try again.

Right now, I think the most important part of this patch is that these two paths (falling through after a failed download, exiting to be manually resumed later) exist.

Off the top of my head, outstanding issues/TODOs include:

- The above issue of determining when to fall through, when to
  reattempt, and when to write the resumable info and exit in
  git clone.

- Creating git-daemon service to download a resumable resource.
  Pretty straightforward, I think, especially if http.getanyfile
  already exists. This falls more under "haven't gotten to yet"
  than a dilemma.

- Logic for git clone to determine when a full clone would be
  superior, such as when a clone is local or a reference is given.

- Configuring prime-clone for multiple resources, in two
  dimensions: (a) resources to choose from (e.g. fall back to a
  second resource if the first one doesn't work) and (b) resources
  to be downloaded together or in sequence (e.g. download
  http://host/this, then http://host/that). Maybe prime-clone could
  also handle client preferences in terms of filetype or protocol.
  For this, I just have to re-read a few discussions about the
  filetypes we use to see if there are any outliers that aren't
  representable in this way. I think this is another "haven't
  gotten to yet".

- Related to the above, seeing if there are any outlying resource
  types whose process can't be modularized into: download to
  location, use, clean one way if failed, clean another way if
  succeeded. The "split bundle," for example, is retrieved
  (download), read for the pack location (use), and then the
  packfile is retrieved (download). I believe, in this case, all of
  that can be considered the "download," and then indexing/writing
  can be considered "use." But I'm not sure if there are more
  extreme cases.

- Creating the logic to guess a packfile, and append that to a
  prefix specified by the admin. Additionally, allowing the admin
  to use a custom script to use their own logic to output the URL.

- Preventing the retry wait period (currently set by using
  select()) from being interrupted by other system calls. I believe
  there is a setting in libcurl, but I don't want to make any
  potentially large-impact changes without discussing it first.
  Plus, I believe changes to http.c were up for discussion anyway.

- Finding if there's a more elegant way to access the alternate
  resource than invoking remote-helper with a url we don't care
  about (the same url that will be specified later to stdin with
  "download-primer").

- Finding if there is a better way to suppress index-pack's output
  than creating a run-command option specifically to suppress
  stdout.

- When running with ssh and a password, the credentials are
  prompted for twice. I don't know if there is a way to preserve
  credentials between executions. I couldn't find any examples in
  git's source.

Some of these are issues I've been actively working on, but I'm hitting a point where keeping everyone up-to-date trumps completeness. Hopefully, the bulk of the 'learning and re-doing' is done and I can update more frequently in smaller increments. I will probably work on the git-daemon download service, the curl timeout issue, and supporting other filetypes next.

Feedback is appreciated.

Kevin Wern (11):
  Resumable clone: create service git-prime-clone
  Resumable clone: add prime-clone endpoints
  pkt-line: create gentle packet_read_line functions
  Resumable clone: add prime-clone to remote-curl
  Resumable clone: add output parsing to connect.c
  Resumable clone: implement transport_prime_clone
  Resumable clone: add resumable download to http/curl
  Resumable clone: create transport_download_primer
  path: add resumable marker
  run command: add RUN_COMMAND_NO_STDOUT
  Resumable clone: implement primer logic in git-clone

 .gitignore                         |   1 +
 Documentation/git-clone.txt        |  16 +
 Documentation/git-daemon.txt       |   7 +
 Documentation/git-http-backend.txt |   7 +
 Documentation/git-prime-clone.txt  |  39 +++
 Makefile                           |   2 +
 builtin.h                          |   1 +
 builtin/clone.c                    | 590 +++++++++++++++++++++++++++++++------
 builtin/prime-clone.c              |  77 +++++
 cache.h                            |   1 +
 connect.c                          |  47 +++
 connect.h                          |  10 +-
 daemon.c                           |   7 +
 git.c                              |   1 +
 http-backend.c                     |  22 +-
 http.c                             |  86 +++++-
 http.h                             |   7 +-
 path.c                             |   1 +
 pkt-line.c                         |  47 ++-
 pkt-line.h                         |  16 +
 remote-curl.c                      | 192 +++++++++---
 run-command.c                      |   1 +
 run-command.h                      |   1 +
 t/t9904-git-prime-clone.sh         | 181 ++++++++++++
 transport-helper.c                 |  75 ++++-
 transport.c                        |  53 ++++
 transport.h                        |  27 ++
 27 files changed, 1361 insertions(+), 154 deletions(-)
 create mode 100644 Documentation/git-prime-clone.txt
 create mode 100644 builtin/prime-clone.c
 create mode 100755 t/t9904-git-prime-clone.sh

--
2.7.4