Re: How to resume broke clone ?

2013-12-05 Thread Jeff King
On Thu, Dec 05, 2013 at 10:01:28AM -0800, Junio C Hamano wrote:

> > You could have a "git-advertise-upstream" that generates a mirror blob
> > from your remotes config and pushes it to your publishing point. That
> > may be overkill, but I don't think it's possible with a
> > .git/config-based solution.
> 
> I do not think I follow.  The upload-pack service could be taught to
> pay attention to the uploadpack.advertiseUpstream config at runtime,
> advertise 'mirror' capability, and then respond with the list of
> remote.*.url it uses when asked (if we go with the pkt-line based
> approach).

I was assuming a triangular workflow, where your publishing point (that
other people will fetch from) does not know anything about the upstream.
Like:

  $ git clone git://git.kernel.org/pub/scm/git/git.git
  $ hack hack hack; commit commit commit
  $ git remote add me myserver:/var/git/git.git
  $ git push me
  $ git advertise-upstream origin me

If your publishing point is already fetching from another upstream, then
yeah, I'd agree that dynamically generating it from the config is fine.

> Alternatively, it could also be taught to pay attention
> to the same config at runtime, create a blob to advertise the list
> of remote.*.url it uses and store it in refs/mirror (or do this
> purely in-core without actually writing to the refs/ namespace), and
> emit an entry for refs/mirror using that blob object name in the
> ls-remote part of the response (if we go with the magic blob based
> approach).

Yes. The pkt-line versus refs distinction is purely a protocol issue.
You can do anything you want on the backend with either of them,
including faking the ref (you can also accept fake pushes to
refs/mirror, too, if you really want people to be able to upload that
way).

But it is worth considering what implementation difficulties we would
run across in either case. Producing a fake refs/mirror blob that
responds like a normal ref is more work than just dumping the lines. If
we're always just going to generate it dynamically anyway, then we can
save ourselves some effort.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How to resume broke clone ?

2013-12-05 Thread Junio C Hamano
Jeff King  writes:

> Right, I think that's the most critical one (though you could also just
> use the convention of ".bundle" in the URL). I think we may want to
> leave room for more metadata, though.

Good. I like this line of thinking.

>> Heck, remote.origin.url might already
>> be a good mirror address to advertise, especially if the client isn't
>> on the same /24 as the server and the remote.origin.url is something
>> like "git.kernel.org". :-)
>
> You could have a "git-advertise-upstream" that generates a mirror blob
> from your remotes config and pushes it to your publishing point. That
> may be overkill, but I don't think it's possible with a
> .git/config-based solution.

I do not think I follow.  The upload-pack service could be taught to
pay attention to the uploadpack.advertiseUpstream config at runtime,
advertise 'mirror' capability, and then respond with the list of
remote.*.url it uses when asked (if we go with the pkt-line based
approach).  Alternatively, it could also be taught to pay attention
to the same config at runtime, create a blob to advertise the list
of remote.*.url it uses and store it in refs/mirror (or do this
purely in-core without actually writing to the refs/ namespace), and
emit an entry for refs/mirror using that blob object name in the
ls-remote part of the response (if we go with the magic blob based
approach).

>> Yes. And this is why the packfile name algorithm is horribly flawed. I
>> keep saying we should change it to name the pack using the last 20
>> bytes of the file but ... nobody has written the patch for that?  :-)
>
> Totally agree. I think we could also get rid of the horrible hacks in
> repack where we pack to a tempfile, then have to do another tempfile
> dance (which is not atomic!) to move the same-named packfile out of the
> way. If the name were based on the content, we could just throw away our
> new pack if one of the same name is already there (just like we do for
> loose objects).

Yay.


Re: How to resume broke clone ?

2013-12-05 Thread Jeff King
On Thu, Dec 05, 2013 at 02:21:09PM +0100, Michael Haggerty wrote:

> A better alternative would be to ask users to clone from the central
> server.  In this case, the central server would want to tell the clients
> to grab what they can from their local bootstrap mirror and then come
> back to the central server for any remainders.  The trick is that which
> bootstrap mirror is "local" would vary from client to client.
>
> I suppose that this could be implemented using what you have discussed
> by having the central server direct the client to a URL that resolves
> differently for different clients, CDN-like.  Alternatively, the central
> Git server could itself look where a request is coming from and use some
> intelligence to redirect the client to the closest bootstrap mirror from
> its own list.  Or the server could pass the client a list of known
> mirrors, and the client could try to determine which one is closest (and
> reachable!).

Exactly. I think this will mostly happen via CDN, but I had also
envisioned that the server could add metadata to a list of possible
mirrors, like:

   [mirror "ko-us"]
   url = http://git.us.kernel.org/...
   zone = us

   [mirror "ko-cn"]
   url = http://git.cn.kernel.org/...
   zone = cn

If the "zone" keys follow a micro-format convention, then the client
knows that it prefers "cn" over "us" (either on the command line, or a
local config option in ~/.gitconfig).
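
If the zone keys followed such a micro-format, the client could pick a mirror
with nothing more than stock `git config` parsing. A sketch under that
assumption (the mirror.*.url/mirror.*.zone keys are the proposed convention,
not an existing git feature):

```shell
# Hypothetical mirror list as advertised by the server, in config format.
cat >mirrors.cfg <<'EOF'
[mirror "ko-us"]
	url = http://git.us.kernel.org/pub/scm/git/git.git
	zone = us
[mirror "ko-cn"]
	url = http://git.cn.kernel.org/pub/scm/git/git.git
	zone = cn
EOF

want_zone=cn   # would come from the command line or ~/.gitconfig
for name in $(git config -f mirrors.cfg --name-only --get-regexp '\.zone$'); do
	sub=${name#mirror.}; sub=${sub%.zone}
	if test "$(git config -f mirrors.cfg "mirror.$sub.zone")" = "$want_zone"; then
		git config -f mirrors.cfg "mirror.$sub.url"   # the preferred mirror
	fi
done
```

Everything past the config parsing is client-side policy, which is the point:
the server only has to publish the list.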

The biggest problem with all of this is that the server has to know
about the mirrors. If you want to set up an in-house mirror for
something hosted on GitHub, but it's only available to people in your
company, then GitHub would not want to advertise it. You need some way
to tell your clients about the mirror (and that is the inverse-mirror
"fetch from the mirror, which tells you it is just a bootstrap and to
now switch to the real repo" scheme that I think you were describing
earlier).

-Peff


Re: How to resume broke clone ?

2013-12-05 Thread Jeff King
On Wed, Dec 04, 2013 at 10:50:27PM -0800, Shawn Pearce wrote:

> I wasn't thinking about using a "well known blob" for this.
> 
> Jonathan, Dave, Colby and I were kicking this idea around on Monday
> during lunch. If the initial ref advertisement included a "mirrors"
> capability the client could respond with "want mirrors" instead of the
> usual want/have negotiation. The server could then return the mirror
URLs as pkt-lines, one per pkt. It's one extra RTT, but this is trivial
> compared to the cost to really clone the repository.

I don't think this is any more or less efficient than the blob scheme.
In both cases, the client sends a single "want" line and no "have"
lines, and then the server responds with the output (either pkt-lines,
or a single-blob pack).

What I like about the blob approach is:

  1. It requires zero extra code on the server. This makes
 implementation simple, but also means you can deploy it
 on existing servers (or even on non-pkt-line servers like
 dumb http).

  2. It's very debuggable from the client side. You can fetch the blob,
 look at it, and decide which mirror you want outside of git if you
 want to (true, you can teach the git client to dump the pkt-line
 URLs, too, but that's extra code). You could even do this with an
 existing git client that has not yet learned about the mirror
 redirect.

  3. It removes any size or structure limits that the protocol imposes
 (I was planning to use git-config format for the blob itself). The
 URLs themselves aren't big, but we may want to annotate them with
 metadata.

 You mentioned "this is a bundle" versus "this is a regular http
 server" below. You might also want to provide network location
 information (e.g., "this is a good mirror if you are in Asia"),
 though for the most part I'd expect that to happen magically via
 CDN.

 When we discussed this before, the concept came up of offering not
 just a clone bundle, but "slices" of history (as thin-pack
 bundles), so that a fetch could grab a sequence of resumable
 slices, starting with what they have, and then topping off with a
 true fetch. You would want to provide the start and end points of
 each slice.

  4. You can manage it remotely via the git protocol (more discussion
 below).

  5. A clone done with "--mirror" will actually propagate the mirror
 file automatically.
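
The "slices" idea in (3) can already be approximated with stock bundles, since
a bundle records its prerequisites; a local sketch (repository and slice
points invented here):

```shell
# Slice 1 carries history up to v1.0; slice 2 is the thin remainder,
# with v1.0 recorded as its prerequisite.
git init -q src
git -C src -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m one
git -C src tag v1.0
git -C src -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m two
git -C src tag v2.0
git -C src bundle create ../slice1.bundle v1.0
git -C src bundle create ../slice2.bundle v1.0..v2.0

# A client applies the slices in order, then could top off with a real fetch.
git init -q dst
git -C dst fetch -q ../slice1.bundle 'refs/tags/*:refs/tags/*'
git -C dst fetch -q ../slice2.bundle 'refs/tags/*:refs/tags/*'
```

Each slice is a plain file, so an interrupted download can resume with an
ordinary HTTP range request instead of restarting the whole clone.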

What are the advantages of the pkt-line approach? The biggest one I can
think of is that it does not pollute the refs namespace. While (5) is
convenient in some cases, it would make it more of a pain if you are
trying to keep a clone mirror up to date, but do _not_ want to pass
along upstream's mirror file.

You may want to have a server implementation that offers a dynamic
mirror, rather than a true object we have in the ODB. That is possible
with a mirror blob, but is slightly harder (you have to fake the object
rather than just dumping a line).

> These pkt-lines need to be a bit more than just URL. Or we need a new
> URL like "bundle:http://" to denote a resumable bundle over HTTP
> vs. a normal HTTP URL that might not be a bundle file, and is just a
> better connected server.

Right, I think that's the most critical one (though you could also just
use the convention of ".bundle" in the URL). I think we may want to
leave room for more metadata, though.

> The mirror URLs could be stored in $GIT_DIR/config as a simple
> multi-value variable. Unfortunately that isn't easily remotely
> editable. But I am not sure I care?

For big sites that manage the bundles on behalf of the user, I don't
think it is an issue. For somebody running their own small site, I think
it is a useful way of moving the data to the server.

> For the average home user sharing their working repository over git://
> from their home ADSL or cable connection, editing .git/config is
> easier than a blob in refs/mirrors. They already know how to edit
> .git/config to manage remotes.

Yes, but it's editing .git/config on the server, not on the client,
which may be slightly harder for some people. I do think we'd want
some tool support on the client side. git-config recently learned to
read from a blob. The next step is:

  git config --blob=refs/mirrors --edit

or

  git config --blob=refs/mirrors mirror.ko.url git://git.kernel.org/...
  git config --blob=refs/mirrors mirror.ko.bundle true

We can't add tool support for editing .git/config on the server side,
because the method for doing so isn't standard.
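
The server side of that convention is similarly small; a local sketch of
publishing the blob (the refs/mirrors name and mirror.* keys are the proposal
under discussion, not an existing feature):

```shell
# Write a mirror list in config format, store it as a blob, and point
# refs/mirrors at it. Pushing refs/mirrors would then publish it.
git init -q pub
cat >pub/mirrors.cfg <<'EOF'
[mirror "ko"]
	url = git://git.kernel.org/pub/scm/git/git.git
	bundle = true
EOF
blob=$(git -C pub hash-object -w mirrors.cfg)
git -C pub update-ref refs/mirrors "$blob"

# What a client would see after fetching the ref:
git -C pub config --blob=refs/mirrors --list
```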

> Heck, remote.origin.url might already
> be a good mirror address to advertise, especially if the client isn't
> on the same /24 as the server and the remote.origin.url is something
> like "git.kernel.org". :-)

You could have a "git-advertise-upstream" that generates a mirror blob
from your remotes config and pushes it to your publishing point. That
may be overkill, but I don't think it's possible with a
.git/config-based solution.


Re: How to resume broke clone ?

2013-12-05 Thread Shawn Pearce
On Thu, Dec 5, 2013 at 5:21 AM, Michael Haggerty  wrote:
> This discussion has mostly been about letting small Git servers delegate
> the work of an initial clone to a beefier server.  I haven't seen any
> explicit mention of the inverse:
>
> Suppose a company has a central Git server that is meant to be the
> "single source of truth", but has worldwide offices and wants to locate
> bootstrap mirrors in each office.  The end users would not even want to
> know that there are multiple servers.  Hosters like GitHub might also
> encourage their big customers to set up bootstrap mirror(s) in-house to
> make cloning faster for their users while reducing internet traffic and
> the burden on their own infrastructure.  The goal would be to make the
> system transparent to users and easily reconfigurable as circumstances
> change.

I think there is a different way to do that.

Build a caching Git proxy server. And teach Git clients to use it.


One idea we had at $DAY_JOB a couple of years ago was to build a
daemon that sat in the background and continuously fetched content
from repository upstreams. We made it efficient by modifying the Git
protocol to use a hanging network socket, and the upstream server
would broadcast push pack files down these hanging streams as pushes
were received.

The original intent was for an Android developer to be able to have
his working tree forest of 500 repositories subscribe to our internal
server's broadcast stream. We figured if the server knows exactly
which refs every client has, because they all have the same ones, and
their streams are all still open and active, then the server can make
exactly one incremental thin pack and send the same copy to every
client. It's "just" a socket write problem. Instead of packing the same
stuff 100 times for 100 clients, it's packed once and sent 100 times.

Then we realized remote offices could also install this software on a
local server, and use this as a fan-out distributor within the LAN. We
were originally thinking about some remote offices on small Internet
connections, where delivery of 10 MiB x 20 was a lot but delivery of
10 MiB once and local fan-out on the Ethernet was easy.

The JGit patches for this work are still pending[1].


If clients had a local Git-aware cache server in their office and
~/.gitconfig had the address of it, your problem becomes simple.

Clients clone from the public URL e.g. GitHub, but the local cache
server first gives the client a URL to clone from itself. After that
is complete then the client can fetch from the upstream. The cache
server can be self-maintaining, watching its requests to see what is
accessed often-ish, and keep those repositories current-ish locally by
running git fetch itself in the background.
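
The refresh step such a cache would run on a timer can be sketched with stock
commands (layout invented; a real server would drive this per repository):

```shell
# Seed an "upstream", mirror it into the cache, then run one refresh
# round after upstream moves on.
git init -q upstream
git -C upstream -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m seed
git clone --mirror -q upstream cache.git        # initial population
git -C upstream -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m update
git -C cache.git fetch -q --prune               # the periodic background fetch
git -C cache.git gc --quiet --auto              # keep the cache packed
```

After the fetch the cache is current again; clients in the office clone from
it first and only top off against the real upstream.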

It's easy to do this with bundles on "CDN"-like HTTP. Just use the
office's caching HTTP proxy server. Assuming its cache is big enough
for those large Git bundle payloads, and the viral cat videos. But you
are at the mercy of the upstream bundler rebuilding the bundles. And
refetching them in whole. Neither of which is great.

A simple self-contained server that doesn't accept pushes, but knows
how to clone repositories, fetch them periodically, and run `git gc`,
works well. And the mirror URL extension we have been discussing in
this thread would work fine here. The cache server can return URLs
that point to itself. Or flat out proxy the Git transaction with the
origin server.


[1] 
https://git.eclipse.org/r/#/q/owner:wetherbeei%2540google.com+status:open,n,z


Re: How to resume broke clone ?

2013-12-05 Thread Michael Haggerty
This discussion has mostly been about letting small Git servers delegate
the work of an initial clone to a beefier server.  I haven't seen any
explicit mention of the inverse:

Suppose a company has a central Git server that is meant to be the
"single source of truth", but has worldwide offices and wants to locate
bootstrap mirrors in each office.  The end users would not even want to
know that there are multiple servers.  Hosters like GitHub might also
encourage their big customers to set up bootstrap mirror(s) in-house to
make cloning faster for their users while reducing internet traffic and
the burden on their own infrastructure.  The goal would be to make the
system transparent to users and easily reconfigurable as circumstances
change.

One alternative would be to ask users to clone from their local mirror.
 The local mirror would give them whatever it has, then do the
equivalent of a permanent redirect to tell the client "from now on, use
the central server" to get the rest of the initial clone and for future
fetches.  But this would require users to know which mirror is "local".

A better alternative would be to ask users to clone from the central
server.  In this case, the central server would want to tell the clients
to grab what they can from their local bootstrap mirror and then come
back to the central server for any remainders.  The trick is that which
bootstrap mirror is "local" would vary from client to client.

I suppose that this could be implemented using what you have discussed
by having the central server direct the client to a URL that resolves
differently for different clients, CDN-like.  Alternatively, the central
Git server could itself look where a request is coming from and use some
intelligence to redirect the client to the closest bootstrap mirror from
its own list.  Or the server could pass the client a list of known
mirrors, and the client could try to determine which one is closest (and
reachable!).

I'm not sure that this idea is interesting, but I just wanted to throw
it out there as a related use case that seems a bit different than what
you have been discussing.

Michael

-- 
Michael Haggerty
mhag...@alum.mit.edu
http://softwareswirl.blogspot.com/


Re: How to resume broke clone ?

2013-12-04 Thread Shawn Pearce
On Wed, Dec 4, 2013 at 12:08 PM, Jeff King  wrote:
> On Thu, Nov 28, 2013 at 11:15:27AM -0800, Shawn Pearce wrote:
>
>> >>  - better integration with git bundles, provide a way to seamlessly
>> >> create/fetch/resume the bundles with "git clone" and "git fetch"
>>
>> We have been thinking about formalizing the /clone.bundle hack used by
>> repo on Android. If the server has the bundle, add a capability in the
>> refs advertisement saying its available, and the clone client can
>> first fetch $URL/clone.bundle.
>
> Yes, that was going to be my next step after getting the bundle fetch
> support in.

Yay!

> If we are going to do this, though, I'd really love for it
> to not be "hey, fetch .../clone.bundle from me", but a full-fledged
> "here are full URLs of my mirrors".

Ack. I agree completely.

> Then you can redirect a non-http cloner to http to grab the bundle. Or
> redirect them to a CDN. Or even somebody else's server entirely (e.g.,
> "go fetch from Linus first, my piddly server cannot feed you the whole
> kernel"). Some of the redirects you can do by issuing an http redirect
> to "/clone.bundle", but the cross-protocol ones are tricky.

Ack. My thoughts exactly. Especially the part of "my piddly server
shouldn't have to serve you a clone of Linus' tree when there are many
public hosts mirroring his code available to anyone". It is simply not
fair to clone Linus' tree off some guy's home ADSL connection, his
uplink probably sucks. But it is reasonable to fetch his incremental
delta after cloning from some other well known and well connected
source.

> If we advertise it as a blob in a specialized ref (e.g., "refs/mirrors")
> it does not add much overhead over a simple capability. There are a few
> extra round trips to actually fetch the blob (client sends a want and no
> haves, then server sends the pack), but I think that's negligible when
> we are talking about redirecting a full clone. In either case, we have
> to hang up the original connection, fetch the mirror, and then come
> back.

I wasn't thinking about using a "well known blob" for this.

Jonathan, Dave, Colby and I were kicking this idea around on Monday
during lunch. If the initial ref advertisement included a "mirrors"
capability the client could respond with "want mirrors" instead of the
usual want/have negotiation. The server could then return the mirror
URLs as pkt-lines, one per pkt. It's one extra RTT, but this is trivial
compared to the cost to really clone the repository.

These pkt-lines need to be a bit more than just URL. Or we need a new
URL like "bundle:http://" to denote a resumable bundle over HTTP
vs. a normal HTTP URL that might not be a bundle file, and is just a
better connected server.


The mirror URLs could be stored in $GIT_DIR/config as a simple
multi-value variable. Unfortunately that isn't easily remotely
editable. But I am not sure I care?

GitHub doesn't let you edit $GIT_DIR/config, but it doesn't need to.
For most repositories hosted at GitHub, GitHub is probably the best
connected server for that repository. For repositories that are
incredibly high traffic GitHub might out of its own interest want to
configure mirror URLs on some sort of CDN to distribute the network
traffic closer to the edges. Repository owners just shouldn't have to
worry about these sorts of details. It should be managed by the
hosting service.

In my case for android.googlesource.com we want bundles on the CDN
near the network edges, and our repository owners don't care to know
the details of that. They just want our server software to make it all
happen, and our servers already manage $GIT_DIR/config for them. It
also mostly manages /clone.bundle on the CDN. And /clone.bundle is an
ugly, limited hack.

For the average home user sharing their working repository over git://
from their home ADSL or cable connection, editing .git/config is
easier than a blob in refs/mirrors. They already know how to edit
.git/config to manage remotes. Heck, remote.origin.url might already
be a good mirror address to advertise, especially if the client isn't
on the same /24 as the server and the remote.origin.url is something
like "git.kernel.org". :-)

>> For most Git repositories the bundle can be constructed by saving the
>> bundle reference header into a file, e.g.
>> $GIT_DIR/objects/pack/pack-$NAME.bh at the same time the pack is
>> created. The bundle can be served by combining the .bh and .pack
>> streams onto the network. It is very little additional disk overhead
>> for the origin server,
>
> That's clever. It does not work out of the box if you are using
> alternates, but I think it could be adapted in certain situations. E.g.,
> if you layer the pack so that one "base" repo always has its full pack
> at the start, which is something we're already doing at GitHub.

Yes, well, I was assuming the pack was a fully connected repack.
Alternates always creates a partial pack. But if you have an
alternate, that alternate maybe should be given as a

Re: How to resume broke clone ?

2013-12-04 Thread Jeff King
On Thu, Nov 28, 2013 at 11:15:27AM -0800, Shawn Pearce wrote:

> >>  - better integration with git bundles, provide a way to seamlessly
> >> create/fetch/resume the bundles with "git clone" and "git fetch"
> 
> We have been thinking about formalizing the /clone.bundle hack used by
> repo on Android. If the server has the bundle, add a capability in the
> refs advertisement saying its available, and the clone client can
> first fetch $URL/clone.bundle.

Yes, that was going to be my next step after getting the bundle fetch
support in. If we are going to do this, though, I'd really love for it
to not be "hey, fetch .../clone.bundle from me", but a full-fledged
"here are full URLs of my mirrors".

Then you can redirect a non-http cloner to http to grab the bundle. Or
redirect them to a CDN. Or even somebody else's server entirely (e.g.,
"go fetch from Linus first, my piddly server cannot feed you the whole
kernel"). Some of the redirects you can do by issuing an http redirect
to "/clone.bundle", but the cross-protocol ones are tricky.

If we advertise it as a blob in a specialized ref (e.g., "refs/mirrors")
it does not add much overhead over a simple capability. There are a few
extra round trips to actually fetch the blob (client sends a want and no
haves, then server sends the pack), but I think that's negligible when
we are talking about redirecting a full clone. In either case, we have
to hang up the original connection, fetch the mirror, and then come
back.

> For most Git repositories the bundle can be constructed by saving the
> bundle reference header into a file, e.g.
> $GIT_DIR/objects/pack/pack-$NAME.bh at the same time the pack is
> created. The bundle can be served by combining the .bh and .pack
> streams onto the network. It is very little additional disk overhead
> for the origin server,

That's clever. It does not work out of the box if you are using
alternates, but I think it could be adapted in certain situations. E.g.,
if you layer the pack so that one "base" repo always has its full pack
at the start, which is something we're already doing at GitHub.

> but allows resumable clone, provided the server has not done a GC.

As an aside, the current transfer-resuming code in http.c is
questionable.  It does not use etags or any sort of invalidation
mechanism, but just assumes hitting the same URL will give the same
bytes. That _usually_ works for dumb fetching of objects and packfiles,
though it is possible for a pack to change representation without
changing name.

My bundle patches inherited the same flaw, but it is much worse there,
because your URL may very well just be "clone.bundle" that gets updated
periodically.
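
The failure mode is easy to demonstrate without any network at all (file
names invented):

```shell
# If the URL's content is regenerated between requests, resuming by byte
# offset alone splices two different files together.
printf 'AAAA-BBBB' >server.v1          # what the server sent before the hangup
printf 'CCCC-DDDD' >server.v2          # same URL, regenerated content
head -c 5 server.v1 >partial           # transfer died after 5 bytes
tail -c +6 server.v2 >>partial         # naive resume from offset 5
cmp -s partial server.v2 || echo 'resumed file is corrupt'
```

An ETag (or any content hash) sent along with the resume request would let the
server reject the range request instead of serving mismatched bytes.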

> > I posted patches for this last year. One of the things that I got hung
> > up on was that I spooled the bundle to disk, and then cloned from it.
> > Which meant that you needed twice the disk space for a moment.
> 
> I don't think this is a huge concern. In many cases the checked out
> copy of the repository approaches a sizable fraction of the .pack
> itself. If you don't have 2x .pack disk available at clone time you
> may be in trouble anyway as you try to work with the repository post
> clone.

Yeah, in retrospect I was being stupid to let that hold it up. I'll
revisit the patches (I've rebased them forward over the past year, so it
shouldn't be too bad).

> > I wanted
> > to teach index-pack to "--fix-thin" a pack that was already on disk, so
> > that we could spool to disk, and then finalize it without making another
> > copy.
> 
> Don't you need to separate the bundle header from the pack data before
> you do this?

Yes, though it isn't hard. We have to fetch part of the bundle header
into memory during discover_refs(), since that is when we realize we are
getting a bundle and not just the refs. From there you can spool the
bundle header to disk, and then the packfile separately.

My original implementation did that, though I don't remember if that one
got posted to the list (after realizing that I couldn't just
"--fix-thin" directly, I simplified it to just spool the whole thing to
a single file).

> If the bundle is only used at clone time there is no
> --fix-thin step.

Yes, for the particular use case of a clone-mirror, you wouldn't need to
--fix-thin. But I think "git fetch https://example.com/foo.bundle"
should work in the general case (and it does with my patches).

> See above, I think you can reasonably do the /clone.bundle
> automatically on any HTTP server.

Yeah, the ".bh" trick you mentioned is low enough impact to the server
that we could just unconditionally make it part of the repack.

> > What would need to go into such a hash? It would need to represent the
> > exact bytes that will go into the pack, but without actually generating
> > those bytes. Perhaps a sha1 over the sequence of <sha1, base (if
> > applicable), length> for each object would be enough. We should
> > know that after calling compute_write_order. If the client has a match,
> > we should be 

Re: How to resume broke clone ?

2013-11-28 Thread Jakub Narebski
zhifeng hu  ancientrocklab.com> writes:

> 
> Once you use git clone --depth or git fetch --depth and later want
> to walk further back in history, you may face a problem:
> 
>  git fetch --depth=105
> error: Could not read 483bbf41ca5beb7e38b3b01f21149c56a1154b7a
> error: Could not read aacb82de3ff8ae7b0a9e4cfec16c1807b6c315ef
> error: Could not read 5a1758710d06ce9ddef754a8ee79408277032d8b
> error: Could not read a7d5629fe0580bd3e154206388371f5b8fc832db
> error: Could not read 073291c476b4edb4d10bbada1e64b471ba153b6b

BTW. there was (is?) a bundler service at http://bundler.caurea.org/
but I don't know if it can create Linux-size bundle.

-- 
Jakub Narębski




Re: How to resume broke clone ?

2013-11-28 Thread Shawn Pearce
On Thu, Nov 28, 2013 at 1:29 AM, zhifeng hu  wrote:
> Once you use git clone --depth or git fetch --depth and later want
> to walk further back in history, you may face a problem:
>
>  git fetch --depth=105
> error: Could not read 483bbf41ca5beb7e38b3b01f21149c56a1154b7a
> error: Could not read aacb82de3ff8ae7b0a9e4cfec16c1807b6c315ef
> error: Could not read 5a1758710d06ce9ddef754a8ee79408277032d8b
> error: Could not read a7d5629fe0580bd3e154206388371f5b8fc832db
> error: Could not read 073291c476b4edb4d10bbada1e64b471ba153b6b

We now have a resumable bundle available through our kernel.org
mirror. The bundle is 658M.

  mkdir linux
  cd linux
  git init

  wget https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/clone.bundle

  sha1sum clone.bundle
  96831de0b8171e5ebba94edb31e37e70e1df  clone.bundle

  git fetch -u ./clone.bundle 'refs/*:refs/*'
  git reset --hard

You can also use our mirror as an upstream, as we have servers in Asia
that lag no more than 5 or 6 minutes behind kernel.org:

  git remote add origin https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/


Re: How to resume broke clone ?

2013-11-28 Thread Shawn Pearce
On Thu, Nov 28, 2013 at 1:29 AM, Jeff King  wrote:
> On Thu, Nov 28, 2013 at 04:09:18PM +0700, Duy Nguyen wrote:
>
>> > Git should better support resuming transfers. Right now it does not
>> > seem to be doing that job well. For sharing, managing, and
>> > transferring code, what kind of VCS do we imagine it to be?
>>
>> You're welcome to step up and do it. On top of my head  there are a few 
>> options:
>>
>>  - better integration with git bundles, provide a way to seamlessly
>> create/fetch/resume the bundles with "git clone" and "git fetch"

We have been thinking about formalizing the /clone.bundle hack used by
repo on Android. If the server has the bundle, add a capability in the
refs advertisement saying its available, and the clone client can
first fetch $URL/clone.bundle.

For most Git repositories the bundle can be constructed by saving the
bundle reference header into a file, e.g.
$GIT_DIR/objects/pack/pack-$NAME.bh at the same time the pack is
created. The bundle can be served by combining the .bh and .pack
streams onto the network. It is very little additional disk overhead
for the origin server, but allows resumable clone, provided the server
has not done a GC.
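
Since a v2 bundle is just a short header (signature line, refs, blank line)
followed by a raw pack stream, the .bh idea can be demonstrated locally with
stock tools (file layout invented):

```shell
# Precompute a bundle header next to the pack, then "serve" the bundle
# by concatenating the two files.
git init -q origin
git -C origin -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m init
git -C origin repack -adq                       # everything in one pack
pack=$(echo origin/.git/objects/pack/pack-*.pack)
{
	echo '# v2 git bundle'                  # bundle signature
	git -C origin for-each-ref --format='%(objectname) %(refname)' refs/heads
	echo                                    # blank line ends the header
} >"${pack%.pack}.bh"
cat "${pack%.pack}.bh" "$pack" >clone.bundle    # header + pack = bundle
git -C origin bundle verify ../clone.bundle
```

The header costs only a few bytes per pack; the pack bytes themselves are
streamed verbatim from disk, which is what makes the download resumable.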

> I posted patches for this last year. One of the things that I got hung
> up on was that I spooled the bundle to disk, and then cloned from it.
> Which meant that you needed twice the disk space for a moment.

I don't think this is a huge concern. In many cases the checked out
copy of the repository approaches a sizable fraction of the .pack
itself. If you don't have 2x .pack disk available at clone time you
may be in trouble anyway as you try to work with the repository post
clone.

> I wanted
> to teach index-pack to "--fix-thin" a pack that was already on disk, so
> that we could spool to disk, and then finalize it without making another
> copy.

Don't you need to separate the bundle header from the pack data before
you do this? If the bundle is only used at clone time there is no
--fix-thin step.

> One of the downsides of this approach is that it requires the repo
> provider (or somebody else) to provide the bundle. I think that is
> something that a big site like GitHub would do (and probably push the
> bundles out to a CDN, too, to make getting them faster). But it's not a
> universal solution.

See above, I think you can reasonably do the /clone.bundle
automatically on any HTTP server. Big sites might choose to have
/clone.bundle do a redirect into a caching CDN that fills itself by
going to the application servers to obtain the current data. This is
what we do for Android.

>>  - stabilize pack order so we can resume downloading a pack
>
> I think stabilizing in all cases (e.g., including ones where the content
> has changed) is hard, but I wonder if it would be enough to handle the
> easy cases, where nothing has changed. If the server does not use
> multiple threads for delta computation, it should generate the same pack
> from the same on-disk data deterministically. We just need a way for the
> client to indicate that it has the same partial pack.
>
> I'm thinking that the server would report some opaque hash representing
> the current pack. The client would record that, along with the number of
> pack bytes it received. If the transfer is interrupted, the client comes
> back with the hash/bytes pair. The server starts to generate the pack,
> checks whether the hash matches, and if so, says "here is the same pack,
> resuming at byte X".

An important part of this is the want set must be identical to the
prior request. It is entirely possible the branch tips have advanced
since the prior packing attempt started.

> What would need to go into such a hash? It would need to represent the
> exact bytes that will go into the pack, but without actually generating
> those bytes. Perhaps a sha1 over the sequence of <sha1, type, delta
> base (if applicable), length> for each object would be enough. We should
> know that after calling compute_write_order. If the client has a match,
> we should be able to skip ahead to the correct byte.

I don't think length alone is sufficient.

The repository could have recompressed an object with the same length
but different libz encoding. I wonder if loose object recompression is
reliable enough about libz encoding to resume in the middle of an
object? Is it just based on libz version?

You may need to include information about the source of the object,
e.g. the trailing 20-byte hash in the source pack file.
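The point about encodings is easy to demonstrate with any deflate
implementation; here gzip's compression levels stand in for different
libz settings:

```shell
# Same input compressed at two levels: the streams differ even though
# they decode to identical bytes, so "same length" does not pin down
# the on-the-wire encoding. (gzip stands in for git's libz usage.)
head -c 1000 /dev/zero | tr '\0' 'a' > obj
gzip -1n < obj > obj.z1
gzip -9n < obj > obj.z9
cmp -s obj.z1 obj.z9 || echo encodings-differ
gzip -dc obj.z1 | cmp -s - obj && echo same-content
```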


Re: How to resume broke clone ?

2013-11-28 Thread Duy Nguyen
On Thu, Nov 28, 2013 at 4:29 PM, Jeff King  wrote:
>>  - stabilize pack order so we can resume downloading a pack
>
> I think stabilizing in all cases (e.g., including ones where the content
> has changed) is hard, but I wonder if it would be enough to handle the
> easy cases, where nothing has changed. If the server does not use
> multiple threads for delta computation, it should generate the same pack
> from the same on-disk data deterministically. We just need a way for the
> client to indicate that it has the same partial pack.
>
> I'm thinking that the server would report some opaque hash representing
> the current pack. The client would record that, along with the number of
> pack bytes it received. If the transfer is interrupted, the client comes
> back with the hash/bytes pair. The server starts to generate the pack,
> checks whether the hash matches, and if so, says "here is the same pack,
> resuming at byte X".
>
> What would need to go into such a hash? It would need to represent the
> exact bytes that will go into the pack, but without actually generating
> those bytes. Perhaps a sha1 over the sequence of <sha1, type, delta
> base (if applicable), length> for each object would be enough. We should
> know that after calling compute_write_order. If the client has a match,
> we should be able to skip ahead to the correct byte.

Exactly. The hash would include the list of sha-1 and object source,
the git version (so changes in code or default values are covered),
the list of config keys/values that may impact pack generation
algorithm (like window size..), .git/shallow, refs/replace,
.git/graft, all or most of command line options. If we audit the code
carefully I think we can cover all input that influences pack
generation. From then on it's just a matter of protocol extension. It
also opens an opportunity for optional server side caching, just save
the pack and associate it with the hash. Next time the client asks to
resume, the server has everything ready.
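A back-of-the-envelope sketch of that hash, with every value below an
illustrative stand-in (a real implementation would enumerate these
inputs inside pack-objects itself):

```shell
# Hash everything that influences pack generation; if any input
# changes, the hash changes and the server declines to resume.
{
  echo 'git-version: 1.8.5'                        # packing code version
  echo 'config: pack.window=10 pack.depth=50'      # relevant config keys
  echo 'want: e83c5163316f89bfbde7d9ab23ca2e25604af290'  # the want set
  echo 'shallow: (none)  grafts: (none)'           # .git/shallow, grafts
  echo 'object: e83c5163... from pack-3807b40f...' # sha-1 + object source
} | sha1sum | cut -d' ' -f1
```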
-- 
Duy


Re: How to resume broke clone ?

2013-11-28 Thread zhifeng hu
Once you have used git clone --depth or git fetch --depth and then want
to move further back in history, you may face problems:

 git fetch --depth=105
error: Could not read 483bbf41ca5beb7e38b3b01f21149c56a1154b7a
error: Could not read aacb82de3ff8ae7b0a9e4cfec16c1807b6c315ef
error: Could not read 5a1758710d06ce9ddef754a8ee79408277032d8b
error: Could not read a7d5629fe0580bd3e154206388371f5b8fc832db
error: Could not read 073291c476b4edb4d10bbada1e64b471ba153b6b


zhifeng hu 



On Nov 28, 2013, at 5:20 PM, Tay Ray Chuan  wrote:

> On Thu, Nov 28, 2013 at 4:14 PM, Duy Nguyen  wrote:
>> On Thu, Nov 28, 2013 at 2:41 PM, zhifeng hu  wrote:
>>> Thanks for the reply, but I am a developer; I want to clone the full
>>> repository, since I need to view code from very early history.
>> 
>> If it works with --depth=1, you can incrementally run "fetch
>> --depth=N" with N larger and larger.
> 
> I second Duy Nguyen's and Trần Ngọc Quân's suggestion to 1) initially
> create a "shallow" clone then 2) incrementally deepen your clone.
> 
> Zhifeng, in the course of your research into resumable cloning, you
> might have learnt that while it's a really valuable feature, it's also
> a pretty hard problem at the same time. So it's not that git
> doesn't want to have this feature.
> 
> -- 
> Cheers,
> Ray Chuan



Re: How to resume broke clone ?

2013-11-28 Thread Jeff King
On Thu, Nov 28, 2013 at 04:09:18PM +0700, Duy Nguyen wrote:

> > Git should better support resuming transfers.
> > Right now it does not seem to do that job well.
> > Sharing code, managing code, transferring code: what kind of VCS do
> > we imagine it to be?
> 
> You're welcome to step up and do it. Off the top of my head there are
> a few options:
> 
>  - better integration with git bundles, provide a way to seamlessly
> create/fetch/resume the bundles with "git clone" and "git fetch"

I posted patches for this last year. One of the things that I got hung
up on was that I spooled the bundle to disk, and then cloned from it.
Which meant that you needed twice the disk space for a moment. I wanted
to teach index-pack to "--fix-thin" a pack that was already on disk, so
that we could spool to disk, and then finalize it without making another
copy.

One of the downsides of this approach is that it requires the repo
provider (or somebody else) to provide the bundle. I think that is
something that a big site like GitHub would do (and probably push the
bundles out to a CDN, too, to make getting them faster). But it's not a
universal solution.

>  - stabilize pack order so we can resume downloading a pack

I think stabilizing in all cases (e.g., including ones where the content
has changed) is hard, but I wonder if it would be enough to handle the
easy cases, where nothing has changed. If the server does not use
multiple threads for delta computation, it should generate the same pack
from the same on-disk data deterministically. We just need a way for the
client to indicate that it has the same partial pack.

I'm thinking that the server would report some opaque hash representing
the current pack. The client would record that, along with the number of
pack bytes it received. If the transfer is interrupted, the client comes
back with the hash/bytes pair. The server starts to generate the pack,
checks whether the hash matches, and if so, says "here is the same pack,
resuming at byte X".
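Once the identity question is settled, the byte-offset resume itself is
trivial; a local simulation with stand-in files:

```shell
# full.pack stands in for the pack the server regenerates
# deterministically; partial.pack is what the client already has.
printf 'ABCDEFGHIJ' > full.pack
head -c 4 full.pack > partial.pack      # transfer died after 4 bytes
bytes=$(wc -c < partial.pack | tr -d ' ')
# "Here is the same pack, resuming at byte X":
dd if=full.pack bs=1 skip="$bytes" status=none >> partial.pack
cmp -s full.pack partial.pack && echo resume-ok
```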

What would need to go into such a hash? It would need to represent the
exact bytes that will go into the pack, but without actually generating
those bytes. Perhaps a sha1 over the sequence of <sha1, type, delta
base (if applicable), length> for each object would be enough. We should
know that after calling compute_write_order. If the client has a match,
we should be able to skip ahead to the correct byte.

>  - remote alternates, the repo will ask for more and more objects as
> you need them (so goodbye to distributed model)

This is also something I've been playing with, but just for very large
objects (so to support something like git-media, but below the object
graph layer). I don't think it would apply here, as the kernel has a lot
of small objects, and getting them in the tight delta'd pack format
increases efficiency a lot.

-Peff


Re: How to resume broke clone ?

2013-11-28 Thread Tay Ray Chuan
On Thu, Nov 28, 2013 at 4:14 PM, Duy Nguyen  wrote:
> On Thu, Nov 28, 2013 at 2:41 PM, zhifeng hu  wrote:
>> Thanks for the reply, but I am a developer; I want to clone the full
>> repository, since I need to view code from very early history.
>
> If it works with --depth=1, you can incrementally run "fetch
> --depth=N" with N larger and larger.

I second Duy Nguyen's and Trần Ngọc Quân's suggestion to 1) initially
create a "shallow" clone then 2) incrementally deepen your clone.

Zhifeng, in the course of your research into resumable cloning, you
might have learnt that while it's a really valuable feature, it's also
a pretty hard problem at the same time. So it's not that git
doesn't want to have this feature.

-- 
Cheers,
Ray Chuan


Re: How to resume broke clone ?

2013-11-28 Thread Jeff King
On Thu, Nov 28, 2013 at 01:32:36AM -0700, Max Kirillov wrote:

> > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> > 
> > I am in China; our bandwidth is very limited, less than 50 Kb/s.
> 
> You could manually download the big pack files from some http remote.
> For example http://repo.or.cz/r/linux.git
> 
> * create a new repository, add the remote there.
> 
> * download files with wget or whatever:
>  http://repo.or.cz/r/linux.git/objects/info/packs
> also files mentioned in the file. Currently they are:
>  
> http://repo.or.cz/r/linux.git/objects/pack/pack-3807b40fc5fd7556990ecbfe28a54af68964a5ce.idx
>  
> http://repo.or.cz/r/linux.git/objects/pack/pack-3807b40fc5fd7556990ecbfe28a54af68964a5ce.pack
> 
> and put them to the corresponding places.
> 
> * then run fetch or pull. I believe it should run fast then, though I
> have not tested it.

You would also need to set up local refs so that git knows you have
those objects. The simplest way to do it is to just fetch by dumb-http,
which can resume the pack transfer. I think that clone is also very
eager to clean up the partial transfer if the initial fetch fails. So
you would want to init manually:

  git init linux
  cd linux
  git remote add origin 
http://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  GIT_SMART_HTTP=0 git fetch -vv

and then you can follow that up with regular smart fetches, which should
be much smaller.

It would be even simpler if you could fetch the whole thing as a bundle,
rather than over dumb-http. But that requires the server side (or some
third party who has fast access to the repo) cooperating and making a
bundle available.
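What makes the bundle route attractive is that a bundle is an ordinary
file, so any resumable transport (wget -c, rsync, even BitTorrent) can
carry it, and clone reads it directly. A self-contained local sketch
(repository names and the origin URL repointing are illustrative):

```shell
# Build a toy repository and a bundle from it; in real life the bundle
# would be published by the server or a third party.
git init -q src
git -C src -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m 'initial commit'
git -C src bundle create ../repo.bundle HEAD --all

# The bundle behaves like a read-only remote: clone from it, then
# repoint origin at the real server for future incremental fetches.
git clone -q repo.bundle dst
git -C dst remote set-url origin \
    git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
echo bundle-clone-ok
```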

-Peff


Re: How to resume broke clone ?

2013-11-28 Thread Duy Nguyen
On Thu, Nov 28, 2013 at 3:55 PM, zhifeng hu  wrote:
> The repository is growing fast, and things are getting harder. The
> size has now reached several GB; it may eventually be TB or more.
> When that happens, how do we handle it?
> If the transfer breaks and cannot be resumed, we waste time and
> bandwidth.
>
> Git should better support resuming transfers.
> Right now it does not seem to do that job well.
> Sharing code, managing code, transferring code: what kind of VCS do we
> imagine it to be?

You're welcome to step up and do it. Off the top of my head there are a few options:

 - better integration with git bundles, provide a way to seamlessly
create/fetch/resume the bundles with "git clone" and "git fetch"
 - shallow/narrow clone. The idea is to get a small part of the repo
(one depth, a few paths), then get more and more over many iterations,
so if we fail one iteration we don't lose everything
 - stabilize pack order so we can resume downloading a pack
 - remote alternates, the repo will ask for more and more objects as
you need them (so goodbye to distributed model)
-- 
Duy


Re: How to resume broke clone ?

2013-11-28 Thread zhifeng hu
The repository is growing fast, and things are getting harder. The size
has now reached several GB; it may eventually be TB or more.
When that happens, how do we handle it?
If the transfer breaks and cannot be resumed, we waste time and
bandwidth.

Git should better support resuming transfers.
Right now it does not seem to do that job well.
Sharing code, managing code, transferring code: what kind of VCS do we
imagine it to be?
 
zhifeng hu 



On Nov 28, 2013, at 4:50 PM, Duy Nguyen  wrote:

> On Thu, Nov 28, 2013 at 3:35 PM, Karsten Blees  
> wrote:
>> Or simply download the individual files (via ftp/http) and clone locally:
>> 
>>> wget -r ftp://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
>>> git clone git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>>> cd linux
>>> git remote set-url origin 
>>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> 
> Yeah I didn't realize it is published over dumb http too. You may need
> to be careful with this though because it's not atomic and you may get
> refs that point nowhere because you're already done with "pack"
> directory when you come to fetching "refs" and did not see new packs...
> If the dumb commit walker supports resume (I don't know) then it'll be
> safer to do
> 
> git clone http://git.kernel.org/
> 
> If it does not support resume, I don't think it's hard to do.
> -- 
> Duy



Re: How to resume broke clone ?

2013-11-28 Thread Duy Nguyen
On Thu, Nov 28, 2013 at 3:35 PM, Karsten Blees  wrote:
> Or simply download the individual files (via ftp/http) and clone locally:
>
>> wget -r ftp://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
>> git clone git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>> cd linux
>> git remote set-url origin 
>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Yeah I didn't realize it is published over dumb http too. You may need
to be careful with this though because it's not atomic and you may get
refs that point nowhere because you're already done with "pack"
directory when you come to fetching "refs" and did not see new packs...
If the dumb commit walker supports resume (I don't know) then it'll be
safer to do

git clone http://git.kernel.org/

If it does not support resume, I don't think it's hard to do.
-- 
Duy


RE: How to resume broke clone ?

2013-11-28 Thread Max Kirillov
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> 
> I am in China; our bandwidth is very limited, less than 50 Kb/s.

You could manually download the big pack files from some http remote.
For example http://repo.or.cz/r/linux.git

* create a new repository, add the remote there.

* download files with wget or whatever:
 http://repo.or.cz/r/linux.git/objects/info/packs
also files mentioned in the file. Currently they are:
 
http://repo.or.cz/r/linux.git/objects/pack/pack-3807b40fc5fd7556990ecbfe28a54af68964a5ce.idx
 
http://repo.or.cz/r/linux.git/objects/pack/pack-3807b40fc5fd7556990ecbfe28a54af68964a5ce.pack

and put them to the corresponding places.

* then run fetch or pull. I believe it should run fast then, though I
have not tested it.
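Since these are plain static files over HTTP, wget's -c flag makes each
download resumable, which is the whole point of the exercise; the
commands are only echoed here because they need network access:

```shell
# Emit resumable download commands for the files listed above;
# wget -c continues a partial file instead of starting over.
base=http://repo.or.cz/r/linux.git
name=pack-3807b40fc5fd7556990ecbfe28a54af68964a5ce
for f in objects/info/packs \
         "objects/pack/$name.idx" \
         "objects/pack/$name.pack"; do
  echo "wget -c $base/$f"
done
```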

Br,
-- 
Max


Re: How to resume broke clone ?

2013-11-28 Thread Karsten Blees
On 28.11.2013 09:14, Duy Nguyen wrote:
> On Thu, Nov 28, 2013 at 2:41 PM, zhifeng hu  wrote:
>> Thanks for the reply, but I am a developer; I want to clone the full
>> repository, since I need to view code from very early history.
> 
> If it works with --depth=1, you can incrementally run "fetch
> --depth=N" with N larger and larger.
> 
> But it may be easier to ask a kernel.org admin, or any dev with a public
> web server, to provide you a git bundle you can download via http.
> Then you can fetch on top.
> 

Or simply download the individual files (via ftp/http) and clone locally:

> wget -r ftp://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
> git clone git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> cd linux
> git remote set-url origin 
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git




Re: How to resume broke clone ?

2013-11-28 Thread Duy Nguyen
On Thu, Nov 28, 2013 at 2:41 PM, zhifeng hu  wrote:
> Thanks for the reply, but I am a developer; I want to clone the full
> repository, since I need to view code from very early history.

If it works with --depth=1, you can incrementally run "fetch
--depth=N" with N larger and larger.

But it may be easier to ask a kernel.org admin, or any dev with a public
web server, to provide you a git bundle you can download via http.
Then you can fetch on top.
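The incremental deepening is easy to script; the commands are echoed
rather than executed here since they need network access, and the depth
schedule is only an illustrative guess:

```shell
# Deepen step by step so an interrupted fetch only costs one step;
# the depth progression below is arbitrary.
url=git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
echo "git clone --depth=1 $url"
echo "cd linux"
for depth in 100 1000 10000 100000; do
  echo "git fetch --depth=$depth"
done
echo "git fetch --unshallow"
```

The final fetch with --unshallow grabs whatever history remains beyond
the last depth step.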
-- 
Duy


Re: How to resume broke clone ?

2013-11-27 Thread zhifeng hu
Thanks for the reply, but I am a developer; I want to clone the full
repository, since I need to view code from very early history.

zhifeng hu 



On Nov 28, 2013, at 3:39 PM, Trần Ngọc Quân  wrote:

> On 28/11/2013 10:13, zhifeng hu wrote:
>> Hello all:
>> Today I want to clone the Linux Kernel git repository:
>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>> 
>> I am in China; our bandwidth is very limited, less than 50 Kb/s.
> This repo is really too big.
> You may consider using the --depth option if you don't want full
> history, or clone from somewhere with better bandwidth:
> $ git clone --depth=1
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> You may choose another mirror (github.com, for example);
> see git-clone(1)
> 
> -- 
> Trần Ngọc Quân.
> 



Re: How to resume broke clone ?

2013-11-27 Thread Trần Ngọc Quân
On 28/11/2013 10:13, zhifeng hu wrote:
> Hello all:
> Today I want to clone the Linux Kernel git repository:
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>
> I am in China; our bandwidth is very limited, less than 50 Kb/s.
This repo is really too big.
You may consider using the --depth option if you don't want full
history, or clone from somewhere with better bandwidth:
$ git clone --depth=1
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
You may choose another mirror (github.com, for example);
see git-clone(1)

-- 
Trần Ngọc Quân.



How to resume broke clone ?

2013-11-27 Thread zhifeng hu
Hello all:
Today I want to clone the Linux Kernel git repository:
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

I am in China; our bandwidth is very limited, less than 50 Kb/s.

The clone progress is very slow, and it breaks time and time again.
I am very unhappy, because I cannot easily clone the kernel.

I have done some research on resuming clones, but found no good plan
for resolving this problem.


Would it be possible to add the ability to resume a clone after the
transfer breaks?

Something like BitTorrent downloads, or whatever.

zhifeng hu 


