Kevin Wern <kevin.m.w...@gmail.com> wrote:
> Hey, all,
> It's been a while (sent a very short patch in May), but I've
> still been working on the resumable clone feature and checking up on
> the mailing list for any updates. After submitting the prime-clone
> service alone, I figured implementing the whole thing would be the best
> way to understand the full scope of the problem (this is my first real
> contribution here, and learning while working on such an involved
> feature has not been easy).
Thank you for working on this. I'm hugely interested in this
feature as both a cheapskate sysadmin and as a client with
unreliable connectivity (and I've barely been connected this month).
> This is a functional implementation handling a direct http/ftp URI to a
> single, fully connected packfile (i.e. the link is a direct path to the
> file, not a prefix or guess). My hope is that this acts as a bare
> minimum cross-section spanning the full requirements that can expand in
> width as more cases are added (.info file, split bundle, daemon
> download service). This is certainly not perfect, but I think it at
> least prototypes each component involved in the workflow.
> This patch series is based on jc/bundle, because the logic to find the
> tips of a pack's history already exists there (I call index-pack
> --clone-bundle on the downloaded file, and read the file to write the
> references to a temporary directory). If I need to re-implement this
> logic or base it on another branch, let me know. For ease of pulling
> and testing, I included the branch here:
Am I correct that this imposes no additional storage burden for servers?
(unlike the current .bundle dance used by kernel.org:
That would be great!
> Although there are a few changes internally from the last patch,
> the "alternate resource" url to download is configured on the
> server side in exactly the same way:
> url = http://location/pack-$NAME.pack
> filetype = pack
If unconfigured, I wonder if a primeclone pack can be inferred by
the existence of a pack bitmap (or merely being the biggest+oldest
pack for dumb HTTP).
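To make the inference idea concrete, here is a rough sketch in Python
(not git code). It assumes the usual objects/pack layout where a
bitmap sits next to its pack as pack-$NAME.bitmap; the function name
and the tie-breaking rule (biggest first, then oldest mtime) are only
my guesses at a reasonable heuristic:

```python
import os


def guess_primer_pack(pack_dir):
    """Return the path of the pack to advertise, or None if there is none.

    Hypothetical heuristic: prefer packs that have a sibling .bitmap file,
    then take the biggest one, using the oldest mtime to break ties.
    """
    packs = [
        os.path.join(pack_dir, name)
        for name in os.listdir(pack_dir)
        if name.endswith(".pack")
    ]
    if not packs:
        return None

    # A pack with a bitmap next to it is assumed to be the main, fully
    # connected one.
    with_bitmap = [
        p for p in packs if os.path.exists(p[: -len(".pack")] + ".bitmap")
    ]
    candidates = with_bitmap or packs

    # Biggest first; among equally sized packs, the oldest wins.
    return min(
        candidates,
        key=lambda p: (-os.path.getsize(p), os.path.getmtime(p)),
    )
```

A dumb HTTP server without bitmaps would simply fall into the
"biggest+oldest" branch.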
> The prime-clone service simply outputs the components as:
> ####url filetype
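For anyone following along, my reading of that line is sketched below
in Python. I am assuming "####" is the usual four-hex-digit pkt-line
length (which counts its own four bytes) and that url and filetype
are separated by a single space; the function name is made up for
illustration and is not from the series:

```python
def parse_prime_clone_line(pkt):
    """Parse one pkt-line of the assumed form b'####url filetype'.

    Returns (url, filetype), or None for a flush-pkt (b'0000').
    Raises ValueError on malformed input instead of dying, in the
    spirit of the *_gentle functions.
    """
    if len(pkt) < 4:
        raise ValueError("short pkt-line header")
    length = int(pkt[:4].decode("ascii"), 16)  # ValueError on bad hex
    if length == 0:
        return None  # flush-pkt carries no payload
    if length < 4 or length > len(pkt):
        raise ValueError("bad pkt-line length")

    # The length includes the 4 header bytes, so the payload ends at `length`.
    payload = pkt[4:length].decode("utf-8").rstrip("\n")

    # Split on the last space so a URL containing spaces is not cut short.
    url, sep, filetype = payload.rpartition(" ")
    if not sep or not url or not filetype:
        raise ValueError("expected 'url filetype'")
    return url, filetype
```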
> On the client side, the transport_prime_clone and
> transport_download_primer APIs are built to be more robust (i.e. read
> messages without dying due to protocol errors), so that git clone can
> always try them without being dependent on the capability output of
> git-upload-pack. transport_download_primer is dependent on the success
> of transport_prime_clone, but transport_prime_clone is always run on an
> initial clone. Part of achieving this robustness involves adding
> *_gentle functions to pkt_line, so that prime_clone can fail silently
> without dying.
> The transport_download_primer function uses a resumable download,
> which is applicable to both automatic and manual resuming. Automatic
> is programmatically reconnecting to the resource after being
> interrupted (up to a set number of times). Manual is using a newly
> taught --resume option on the command line:
> git clone --resume <resumable_work_or_git_dir>
I think calling "git fetch" should resume, actually.
It would reduce the learning curve and seems natural to me:
"fetch" is about grabbing whatever else appeared since the
last clone/fetch happened.
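Whichever command ends up triggering it, the HTTP side of resuming
is small. A minimal Python sketch, assuming the partial pack is kept
as a plain file and the server honors Range requests (the helper
names are hypothetical, not from the series):

```python
import os
from urllib.request import Request


def build_resume_request(url, partial_path):
    """Build a request that continues where the partial file left off.

    Returns (request, offset). With no partial file, or an empty one,
    the request has no Range header and the download starts from zero.
    """
    offset = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    req = Request(url)
    if offset > 0:
        # Open-ended byte range: everything from `offset` to the end.
        req.add_header("Range", "bytes=%d-" % offset)
    return req, offset


def open_mode_for(status):
    """Pick the file mode for writing the response body.

    206 means the server honored the range, so append; a plain 200 means
    it sent the whole file again, so truncate and start over.
    """
    return "ab" if status == 206 else "wb"
```

The 200-vs-206 check matters: appending a full response to a partial
file would silently corrupt the pack.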
> Right now, a manually resumable directory is left behind only if the
> *client* is interrupted while a new junk mode, JUNK_LEAVE_RESUMABLE,
> is set (right before the download). For an initial clone, if the
> connection fails after automatic resuming, the client erases the
> partial resources and falls through to a normal clone. However, once a
> resumable directory is left behind by the program, it is NEVER
> deleted/abandoned after it is continued with --resume.
I'm not sure erasing partial resources should ever be done
automatically. Perhaps print a note to the user explaining the
situation and potential ways to correct/resume it instead.
> I think determining when a resource is "unsalvageable" should be more
> nuanced. Especially in a case where a connection is perpetually poor
> and the user wishes to resume over a long period of time. The timeout
> logic itself *definitely* needs more nuance than "repeat 5 times", such
> as expanding wait times and using earlier successes when deciding to
> try again. Right now, I think the most important part of this patch is
> that these two paths (falling through after a failed download, exiting
> to be manually resumed later) exist.
> Off the top of my head, outstanding issues/TODOs include:
> - The above issue of determining when to fall through, when to
> reattempt, and when to write the resumable info and exit
> in git clone.
My current (initial) reaction is: you're overthinking this.
I think it's less surprising to a user to always write resumable
info and let them know how to resume (or abort); rather than
trying to second-guess their intent.
Going by the zero-one-infinity rule, I'd probably attempt an
auto-retry once on socket errors before saving state and bailing
with instructions on how to resume.
If they hit Ctrl-C manually, then just tell them they can
either resume or "rm -r" the directory.
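Roughly what I have in mind, as a Python sketch rather than a patch.
The download callable, the JSON state file, and the message wording
are all placeholders of mine; the only point is the control flow
(one automatic retry, then always save state and explain):

```python
import json
import sys


def save_state_and_explain(url, state_path, reason):
    """Record what is needed to resume and tell the user their options."""
    with open(state_path, "w") as f:
        json.dump({"url": url}, f)
    sys.stderr.write(
        "%s; partial download kept.\n"
        "Resume later, or 'rm -r' the directory to start over.\n" % reason
    )


def fetch_with_one_retry(download, url, state_path):
    """Call download(url), retrying exactly once on socket errors.

    `download` is any callable that raises OSError on network failure.
    Returns True on success; otherwise saves resumable state and
    returns False. Nothing is ever deleted automatically.
    """
    try:
        for _ in range(2):  # first attempt plus one automatic retry
            try:
                download(url)
                return True
            except OSError:
                pass  # fall through to the retry, or out of the loop
    except KeyboardInterrupt:
        # Ctrl-C: do not guess at intent, just leave things resumable.
        save_state_and_explain(url, state_path, "interrupted")
        return False
    save_state_and_explain(url, state_path, "download failed after one retry")
    return False
```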
> - Creating git-daemon service to download a resumable resource.
> Pretty straightforward, I think, especially if
> http.getanyfile already exists. This falls more under
> "haven't gotten to yet" than dilemma.
I think this could be handled natively by git-daemon, trickling
data to slow clients in the existing event loop (which could be
expanded to use epoll/kqueue), similar to how X-Sendfile works
with (Apache|lighttpd) or X-Accel in nginx.
This would be cheaper than wasting a process (or thread) to
trickle to low-bandwidth clients. But this may be an
optimization we defer until we've ironed out other parts.
> - Logic for git clone to determine when a full clone would
> be superior, such as when a clone is local or a reference is
> - Configuring prime-clone for multiple resources, in two
> dimensions: (a) resources to choose from (e.g. fall back to
> a second resource if the first one doesn't work) and (b)
> resources to be downloaded together or in sequence (e.g.
> download http://host/this, then http://host/that). Maybe
> prime-clone could also handle client preferences in terms of
> filetype or protocol. For this, I just have to re-read a few
> discussions about the filetypes we use to see if there are
> any outliers that aren't representable in this way. I think
> this is another "haven't gotten to yet".
Perhaps this can be done using the existing http-alternates (and
the automatic primeclone pack inference I wrote about above).
> - Creating the logic to guess a packfile, and append that to a
> prefix specified by the admin. Additionally, allowing the
> admin to use a custom script to use their own logic to
> output the URL.
Yes :) Though I'm not sure if the custom script is necessary.