Re: [PATCH 00/11] Resumable clone

2016-09-28 Thread Eric Wong
Junio C Hamano  wrote:
> Eric Wong  writes:
> 
> >> [primeclone]
> >>url = http://location/pack-$NAME.pack
> >>filetype = pack
> >
> > If unconfigured, I wonder if a primeclone pack can be inferred by
> > the existence of a pack bitmap (or merely being the biggest+oldest
> > pack for dumb HTTP).
> 
> That would probably be a nice heuristic, but it is unclear who
> should find that out at runtime.  The downloading side would not
> have visibility into the directory listing.

I think making a bunch of HEAD requests based on the contents of
$GIT_DIR/objects/info/packs wouldn't be too expensive on either
end, especially when HTTP/1.1 persistent connections + pipelining
may be used.
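
A back-of-the-envelope sketch of that probing, assuming dumb HTTP
(the hostname and pack names below are made up for illustration):

  # objects/info/packs, already served for dumb HTTP, lists one pack
  # per line as "P pack-<sha1>.pack".  A client could HEAD each listed
  # pack over one persistent connection and pick the biggest/oldest
  # from the Content-Length and Last-Modified headers:
  curl --head \
    https://example.com/repo.git/objects/pack/pack-1234abcd.pack \
    https://example.com/repo.git/objects/pack/pack-5678ef90.pack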


Re: [PATCH 00/11] Resumable clone

2016-09-28 Thread Junio C Hamano
Junio C Hamano  writes:

> Junio C Hamano  writes:
>
> What "git clone" should have been was:
>
> * Parse command line arguments;
>
> * Create a new repository and go into it; this step would
>   require us to have parsed the command line for --template,
>   , --separate-git-dir, etc.
>
> * Talk to the remote and do get_remote_heads() aka ls-remote
>   output;
>
> * Decide what fetch refspec to use, which alternate object store
>   to borrow from; this step would require us to have parsed the
>   command line for --reference, --mirror, --origin, etc;
>
> --- we'll insert something new here ---
>
> * Issue "git fetch" with the refspec determined above; this step
>   would require us to have parsed the command line for --depth, etc.
>
> * Run "git checkout -b" to create an initial checkout; this step
>   would require us to have parsed the command line for --branch,
>   etc.
>
> Even though the current code conceptually does the above, these
> steps are not cleanly separated as such.  I think our update to gain
> the "resumable clone" feature on the client side needs to start by
> refactoring the current code, before learning "resumable clone", to
> look like the above.
>
> Once we do that, we can insert an extra step before the step that
> runs "git fetch" to optionally [*1*] grab the extra piece of
> information Kevin's "prime-clone" service produces [*2*], and store
> it in the "new repository" somewhere [*3*].
>
> And then, as you suggested, an updated "git fetch" can be taught to
> notice the priming information left by the previous step, and use it
> to attempt to download the pack until success, and to index that
> pack to learn the tips that can be used as ".have" entries in the
> request.  From the original server's point of view, this fetch
> request would "want" the same set of objects, but would appear as
> an incremental update.

Thinking about this even more, it probably makes even more sense to
move the new "learn prime info and store it in the repository somewhere,
so that a later re-invocation of 'git fetch' can take advantage of it"
step _into_ "git fetch".  That would allow "git fetch" in a freshly
created empty repository to take advantage of this feature for free.

The step in which "git clone" internally drives "git fetch" would not
actually be done by spawning a separate process with run_command(),
because we would want to reuse the connection we already have with
the server from when "git clone" first talked to it to learn the
"ls-remote" equivalent (i.e. transport_get_remote_refs()).  I wonder if we can
do without this early "ls-remote"; that would further simplify
things by allowing us to just spawn "git fetch" internally.
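
Purely as an illustration of "store it in the repository somewhere"
(the client-side configuration keys below are invented for this
sketch, not taken from any patch), the end state might look like:

  # a freshly created, still-empty repository...
  git init cold && cd cold
  git remote add origin https://example.com/repo.git
  # ...with priming information recorded where a later fetch can see it
  git config primeclone.url http://example.com/pack-1234abcd.pack
  git config primeclone.filetype pack
  # a plain fetch would then notice the priming info, download and
  # index the pack (resumably), and only then negotiate the rest
  git fetch origin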



Re: [PATCH 00/11] Resumable clone

2016-09-28 Thread Junio C Hamano
Junio C Hamano  writes:

>>> git clone --resume 
>>
>> I think calling "git fetch" should resume, actually.
>> It would reduce the learning curve and seems natural to me:
>> "fetch" is jabout grabbing whatever else appeared since the
>> last clone/fetch happened.
>
> I hate to say this but it sounds to me like a terrible idea.  At that
> point, when you need to resume, there is not even a ref for "fetch" to
> base its incremental work off of.  It is better to keep the knowledge
> of this "priming" dance inside "clone".  Hopefully the original "clone"
> whose connection was disconnected in the middle would automatically
> attempt resuming, and "clone --resume" would not be needed as often.

After sleeping on this, I want to take the above back.

I think teaching "git fetch" about the "resume" part makes tons of
sense.

What "git clone" should have been was:

* Parse command line arguments;

* Create a new repository and go into it; this step would
  require us to have parsed the command line for --template,
  , --separate-git-dir, etc.

* Talk to the remote and do get_remote_heads() aka ls-remote
  output;

* Decide what fetch refspec to use, which alternate object store
  to borrow from; this step would require us to have parsed the
  command line for --reference, --mirror, --origin, etc;

--- we'll insert something new here ---

* Issue "git fetch" with the refspec determined above; this step
  would require us to have parsed the command line for --depth, etc.

* Run "git checkout -b" to create an initial checkout; this step
  would require us to have parsed the command line for --branch,
  etc.
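
As a loose, command-level approximation of those steps (not what
"git clone" literally runs; the URL is a placeholder):

  git init new-repo && cd new-repo       # create a new repository, go into it
  git remote add origin <url>            # also records the default fetch refspec
  git ls-remote origin                   # talk to the remote (ls-remote output)
  # --- the new "download priming information" step would go here ---
  git fetch origin                       # transfer objects, update remote refs
  git checkout -b master origin/master   # create the initial checkout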

Even though the current code conceptually does the above, these
steps are not cleanly separated as such.  I think our update to gain
the "resumable clone" feature on the client side needs to start by
refactoring the current code, before learning "resumable clone", to
look like the above.

Once we do that, we can insert an extra step before the step that
runs "git fetch" to optionally [*1*] grab the extra piece of
information Kevin's "prime-clone" service produces [*2*], and store
it in the "new repository" somewhere [*3*].

And then, as you suggested, an updated "git fetch" can be taught to
notice the priming information left by the previous step, and use it
to attempt to download the pack until success, and to index that
pack to learn the tips that can be used as ".have" entries in the
request.  From the original server's point of view, this fetch
request would "want" the same set of objects, but would appear as
an incremental update.
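
Assuming the "index-pack --clone-bundle" option from jc/bundle that is
mentioned elsewhere in this thread, the resumed side of that could look
roughly like this (paths are illustrative):

  # the pack was downloaded, possibly over several interrupted attempts
  git index-pack --clone-bundle .git/objects/pack/pack-1234abcd.pack
  # --clone-bundle records the tips reachable from the pack; once those
  # are installed as local refs, the follow-up fetch can offer them as
  # ".have" and the request looks like an ordinary incremental update
  git fetch origin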

Of course, the final step that happens in "git clone", i.e. the
initial checkout, needs to be done somehow if your user decides to
resume with "git fetch", as "git fetch" _never_ touches the working
tree.  So for that purpose, the primary end-user facing interface
may still have to be "git clone --resume ".  That would
probably skip the first four steps in the above sequence and the new
"download priming information" step, and go directly to the step that
runs "git fetch".

I do agree that is a much better design, and the crucial design
decision that makes it better is your making "git fetch"
aware of the "ah, we have instructions left in this repository on
how to prime its object store" information.

Thanks.


[Footnotes]

*1* It is debatable if it would be an overall win to use the "first
    prime by grabbing a large packfile" clone if we are doing a
    shallow or single-branch clone, hence "optionally".  It is
    important to notice that we already have enough information to
    base that decision on at this point in the above sequence.

*2* As I said, I do not think it needs to be a separate new service,
    and I suspect it may be a better design to carry it as a
    protocol extension.  At this point in the above sequence, we
    have done an equivalent of ls-remote, and if we designed a
    protocol extension to carry the information, we would already
    have it.  If we use a separate new service, we can of course
    make a separate connection to ask for the "prime-clone"
    information.  The way this piece of information is transmitted
    is of secondary importance.

*3* In addition to the "prime-clone" information, we may need to
    store, around here, some information that is only known to
    "clone" (perhaps because it was given on the command line) to
    help the final "checkout -b" step know what to check out, in
    case the next "fetch" step is interrupted and killed.




Re: [PATCH 00/11] Resumable clone

2016-09-27 Thread Junio C Hamano
Eric Wong  writes:

>> [primeclone]
>>  url = http://location/pack-$NAME.pack
>>  filetype = pack
>
> If unconfigured, I wonder if a primeclone pack can be inferred by
> the existence of a pack bitmap (or merely being the biggest+oldest
> pack for dumb HTTP).

That would probably be a nice heuristic, but it is unclear who
should find that out at runtime.  The downloading side would not
have visibility into the directory listing.

>> git clone --resume 
>
> I think calling "git fetch" should resume, actually.
> It would reduce the learning curve and seems natural to me:
> "fetch" is jabout grabbing whatever else appeared since the
> last clone/fetch happened.

I hate to say this but it sounds to me like a terrible idea.  At that
point, when you need to resume, there is not even a ref for "fetch" to
base its incremental work off of.  It is better to keep the knowledge
of this "priming" dance inside "clone".  Hopefully the original "clone"
whose connection was disconnected in the middle would automatically
attempt resuming, and "clone --resume" would not be needed as often.


Re: [PATCH 00/11] Resumable clone

2016-09-27 Thread Eric Wong
Kevin Wern  wrote:
> Hey, all,
> 
> It's been a while (sent a very short patch in May), but I've
> still been working on the resumable clone feature and checking up on
> the mailing list for any updates. After submitting the prime-clone
> service alone, I figured implementing the whole thing would be the best
> way to understand the full scope of the problem (this is my first real
> contribution here, and learning while working on such an involved
> feature has not been easy). 

Thank you for working on this.  I'm hugely interested in this
feature as both a cheapskate sysadmin and as a client with
unreliable connectivity (and I've barely been connected this month).

> This is a functional implementation handling a direct http/ftp URI to a
> single, fully connected packfile (i.e. the link is a direct path to the
> file, not a prefix or guess). My hope is that this acts as a bare
>> minimum cross-section spanning the full requirements that can expand in
> width as more cases are added (.info file, split bundle, daemon
> download service). This is certainly not perfect, but I think it at
> least prototypes each component involved in the workflow.
> 
> This patch series is based on jc/bundle, because the logic to find the
> tips of a pack's history already exists there (I call index-pack
> --clone-bundle on the downloaded file, and read the file to write the
> references to a temporary directory). If I need to re-implement this
> logic or base it on another branch, let me know. For ease of pulling
> and testing, I included the branch here:
> 
> https://github.com/kevinwern/git/tree/feature/prime-clone

Am I correct that this imposes no additional storage burden on servers?

(unlike the current .bundle dance used by kernel.org:
  https://www.kernel.org/cloning-linux-from-a-bundle.html )

That would be great!

> Although there are a few changes internally from the last patch,
> the "alternate resource" url to download is configured on the
> server side in exactly the same way:
> 
> [primeclone]
>   url = http://location/pack-$NAME.pack
>   filetype = pack

If unconfigured, I wonder if a primeclone pack can be inferred by
the existence of a pack bitmap (or merely being the biggest+oldest
pack for dumb HTTP).

> The prime-clone service simply outputs the components as:
> 
> url filetype
> 
> 
> On the client side, the transport_prime_clone and
> transport_download_primer APIs are built to be more robust (i.e. read
> messages without dying due to protocol errors), so that git clone can
> always try them without being dependent on the capability output of
> git-upload-pack. transport_download_primer is dependent on the success
> of transport_prime_clone, but transport_prime_clone is always run on an
> initial clone. Part of achieving this robustness involves adding
> *_gentle functions to pkt_line, so that prime_clone can fail silently
> without dying.
> 
> The transport_download_primer function uses a resumable download,
> which is applicable to both automatic and manual resuming. Automatic
> is programmatically reconnecting to the resource after being
> interrupted (up to a set number of times). Manual is using a newly
> taught --resume option on the command line:
> 
> git clone --resume 

I think calling "git fetch" should resume, actually.
It would reduce the learning curve and seems natural to me:
"fetch" is jabout grabbing whatever else appeared since the
last clone/fetch happened.

> Right now, a manually resumable directory is left behind only if the
> *client* is interrupted while a new junk mode, JUNK_LEAVE_RESUMABLE,
> is set (right before the download). For an initial clone, if the
> connection fails after automatic resuming, the client erases the
> partial resources and falls through to a normal clone. However, once a
> resumable directory is left behind by the program, it is NEVER
> deleted/abandoned after it is continued with --resume.

I'm not sure if erasing partial resources should ever be done
automatically.  Perhaps a note to the user explaining the
situation and potential ways to correct/resume it.

> I think determining when a resource is "unsalvageable" should be more
> nuanced, especially in a case where a connection is perpetually poor
> and the user wishes to resume over a long period of time. The timeout
> logic itself *definitely* needs more nuance than "repeat 5 times", such
> as expanding wait times and using earlier successes when deciding to
> try again. Right now, I think the most important part of this patch is
> that these two paths (falling through after a failed download, exiting
> to be manually resumed later) exist.
> 
> Off the top of my head, outstanding issues/TODOs include:
>   - The above issue of determining when to fall through, when to
> reattempt, and when to write the resumable info and exit
> in git clone.

My current (initial) reaction is: you're overthinking this.

I think it's less surprising to a user to always write resumable
info a

Re: [PATCH 00/11] Resumable clone

2016-09-16 Thread Junio C Hamano
Kevin Wern  writes:

> It's been a while (sent a very short patch in May), but I've
> still been working on the resumable clone feature and checking up on
> the mailing list for any updates. After submitting the prime-clone
> service alone, I figured implementing the whole thing would be the best
> way to understand the full scope of the problem (this is my first real
> contribution here, and learning while working on such an involved
> feature has not been easy). 

It may not have been easy but I hope it has been a fun journey for
you ;-)

> On the client side, the transport_prime_clone and
> transport_download_primer APIs are built to be more robust (i.e. read
> messages without dying due to protocol errors), so that git clone can
> always try them without being dependent on the capability output of
> git-upload-pack. transport_download_primer is dependent on the success
> of transport_prime_clone, but transport_prime_clone is always run on an
> initial clone. Part of achieving this robustness involves adding
> *_gentle functions to pkt_line, so that prime_clone can fail silently
> without dying.

OK.

> Right now, a manually resumable directory is left behind only if the
> *client* is interrupted while a new junk mode, JUNK_LEAVE_RESUMABLE,
> is set (right before the download). For an initial clone, if the
> connection fails after automatic resuming, the client erases the
> partial resources and falls through to a normal clone. However, once a
> resumable directory is left behind by the program, it is NEVER
> deleted/abandoned after it is continued with --resume.

Sounds like you made a sensible design decision here.

>   - When running with ssh and a password, the credentials are
> prompted for twice. I don't know if there is a way to
> preserve credentials between executions. I couldn't find any
> examples in git's source.

We leave credential reuse to keyring services like ssh-agent.
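
For example, an agent started once per session keeps "git clone" over
ssh from prompting more than once (the key path is just an example):

  eval "$(ssh-agent -s)"        # start an agent for this shell
  ssh-add ~/.ssh/id_ed25519     # unlock the key a single time
  git clone ssh://example.com/repo.git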



[PATCH 00/11] Resumable clone

2016-09-15 Thread Kevin Wern
Hey, all,

It's been a while (sent a very short patch in May), but I've
still been working on the resumable clone feature and checking up on
the mailing list for any updates. After submitting the prime-clone
service alone, I figured implementing the whole thing would be the best
way to understand the full scope of the problem (this is my first real
contribution here, and learning while working on such an involved
feature has not been easy). 

This is a functional implementation handling a direct http/ftp URI to a
single, fully connected packfile (i.e. the link is a direct path to the
file, not a prefix or guess). My hope is that this acts as a bare
minimum cross-section spanning the full requirements that can expand in
width as more cases are added (.info file, split bundle, daemon
download service). This is certainly not perfect, but I think it at
least prototypes each component involved in the workflow.

This patch series is based on jc/bundle, because the logic to find the
tips of a pack's history already exists there (I call index-pack
--clone-bundle on the downloaded file, and read the file to write the
references to a temporary directory). If I need to re-implement this
logic or base it on another branch, let me know. For ease of pulling
and testing, I included the branch here:

https://github.com/kevinwern/git/tree/feature/prime-clone

Although there are a few changes internally from the last patch,
the "alternate resource" url to download is configured on the
server side in exactly the same way:

[primeclone]
url = http://location/pack-$NAME.pack
filetype = pack

The prime-clone service simply outputs the components as:

url filetype
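
For example (hostname and pack name made up), a server configured with

  [primeclone]
      url = http://example.com/repo/pack-1234abcd.pack
      filetype = pack

would answer a prime-clone request with the single line:

  http://example.com/repo/pack-1234abcd.pack pack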


On the client side, the transport_prime_clone and
transport_download_primer APIs are built to be more robust (i.e. read
messages without dying due to protocol errors), so that git clone can
always try them without being dependent on the capability output of
git-upload-pack. transport_download_primer is dependent on the success
of transport_prime_clone, but transport_prime_clone is always run on an
initial clone. Part of achieving this robustness involves adding
*_gentle functions to pkt_line, so that prime_clone can fail silently
without dying.

The transport_download_primer function uses a resumable download,
which is applicable to both automatic and manual resuming. Automatic
is programmatically reconnecting to the resource after being
interrupted (up to a set number of times). Manual is using a newly
taught --resume option on the command line:

git clone --resume 
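
As a usage sketch (assuming, from the description below, that the
elided argument is the partially created clone directory):

  git clone https://example.com/big.git   # interrupted mid-download
  # a resumable directory "big" is left behind; later:
  git clone --resume big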

Right now, a manually resumable directory is left behind only if the
*client* is interrupted while a new junk mode, JUNK_LEAVE_RESUMABLE,
is set (right before the download). For an initial clone, if the
connection fails after automatic resuming, the client erases the
partial resources and falls through to a normal clone. However, once a
resumable directory is left behind by the program, it is NEVER
deleted/abandoned after it is continued with --resume.

I think determining when a resource is "unsalvageable" should be more
nuanced, especially in a case where a connection is perpetually poor
and the user wishes to resume over a long period of time. The timeout
logic itself *definitely* needs more nuance than "repeat 5 times", such
as expanding wait times and using earlier successes when deciding to
try again. Right now, I think the most important part of this patch is
that these two paths (falling through after a failed download, exiting
to be manually resumed later) exist.
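
As one illustration of "expanding wait times" (not what this patch
does; download_pack below is a hypothetical helper):

  delay=1
  for attempt in 1 2 3 4 5
  do
      download_pack && break      # hypothetical resumable download step
      sleep "$delay"
      delay=$((delay * 2))        # back off: 1s, 2s, 4s, 8s, ...
  done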

Off the top of my head, outstanding issues/TODOs include:
- The above issue of determining when to fall through, when to
  reattempt, and when to write the resumable info and exit
  in git clone.
- Creating a git-daemon service to download a resumable resource.
  Pretty straightforward, I think, especially if
  http.getanyfile already exists. This falls more under
  "haven't gotten to yet" than a dilemma.
- Logic for git clone to determine when a full clone would
  be superior, such as when a clone is local or a reference is
  given.
- Configuring prime-clone for multiple resources, in two
  dimensions: (a) resources to choose from (e.g. fall back to
  a second resource if the first one doesn't work) and (b)
  resources to be downloaded together or in sequence (e.g.
  download http://host/this, then http://host/that). Maybe
  prime-clone could also handle client preferences in terms of
  filetype or protocol. For this, I just have to re-read a few
  discussions about the filetypes we use to see if there are
  any outliers that aren't representable in this way. I think
  this is another "haven't gotten to yet".
- Related to the above, seeing if there are any outlying
  resource types whose process can't be modularized into:
  download t