Re: How to resume broke clone ?
On Thu, Dec 05, 2013 at 10:01:28AM -0800, Junio C Hamano wrote:

> > You could have a "git-advertise-upstream" that generates a mirror blob
> > from your remotes config and pushes it to your publishing point. That
> > may be overkill, but I don't think it's possible with a
> > .git/config-based solution.
>
> I do not think I follow. The upload-pack service could be taught to
> pay attention to the uploadpack.advertiseUpstream config at runtime,
> advertise a 'mirror' capability, and then respond with the list of
> remote.*.url it uses when asked (if we go with the pkt-line based
> approach).

I was assuming a triangular workflow, where your publishing point (that
other people will fetch from) does not know anything about the
upstream. Like:

  $ git clone git://git.kernel.org/pub/scm/git/git.git
  $ hack hack hack; commit commit commit
  $ git remote add me myserver:/var/git/git.git
  $ git push me
  $ git advertise-upstream origin me

If your publishing point is already fetching from another upstream,
then yeah, I'd agree that dynamically generating it from the config is
fine.

> Alternatively, it could also be taught to pay attention to the same
> config at runtime, create a blob to advertise the list of remote.*.url
> it uses and store it in refs/mirror (or do this purely in-core without
> actually writing to the refs/ namespace), and emit an entry for
> refs/mirror using that blob object name in the ls-remote part of the
> response (if we go with the magic blob based approach).

Yes. The pkt-line versus refs distinction is purely a protocol issue.
You can do anything you want on the backend with either of them,
including faking the ref (you can also accept fake pushes to
refs/mirror, too, if you really want people to be able to upload that
way). But it is worth considering what implementation difficulties we
would run across in either case. Producing a fake refs/mirror blob that
responds like a normal ref is more work than just dumping the lines.
If we're always just going to generate it dynamically anyway, then we
can save ourselves some effort.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
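For concreteness, the pkt-line variant of the mirror advertisement could look something like the sketch below. The 4-hex-digit length framing (which counts its own four bytes) and the "0000" flush packet are git's standard pkt-line format; the one-URL-per-pkt payload layout is only the proposal from this thread, not anything implemented.

```python
def pkt_line(payload: bytes) -> bytes:
    """Frame one payload as a pkt-line: 4 hex digits of total length
    (including the 4 length bytes themselves), then the payload."""
    return b"%04x" % (len(payload) + 4) + payload


def encode_mirror_response(urls):
    """Server side: one pkt-line per mirror URL, ended by a flush-pkt."""
    out = b"".join(pkt_line(u.encode() + b"\n") for u in urls)
    return out + b"0000"


def decode_mirror_response(data: bytes):
    """Client side: walk the pkt-lines until the flush-pkt."""
    urls, pos = [], 0
    while pos + 4 <= len(data):
        length = int(data[pos:pos + 4], 16)
        if length == 0:  # "0000" flush-pkt terminates the list
            break
        urls.append(data[pos + 4:pos + length].rstrip(b"\n").decode())
        pos += length
    return urls
```

A client that sent "want mirrors" would read this instead of a packfile; the round trip cost is the one extra RTT mentioned below.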
Re: How to resume broke clone ?
Jeff King writes:

> Right, I think that's the most critical one (though you could also
> just use the convention of ".bundle" in the URL). I think we may want
> to leave room for more metadata, though.

Good. I like this line of thinking.

>> Heck, remote.origin.url might already be a good mirror address to
>> advertise, especially if the client isn't on the same /24 as the
>> server and the remote.origin.url is something like "git.kernel.org".
>> :-)
>
> You could have a "git-advertise-upstream" that generates a mirror blob
> from your remotes config and pushes it to your publishing point. That
> may be overkill, but I don't think it's possible with a
> .git/config-based solution.

I do not think I follow. The upload-pack service could be taught to
pay attention to the uploadpack.advertiseUpstream config at runtime,
advertise a 'mirror' capability, and then respond with the list of
remote.*.url it uses when asked (if we go with the pkt-line based
approach).

Alternatively, it could also be taught to pay attention to the same
config at runtime, create a blob to advertise the list of remote.*.url
it uses and store it in refs/mirror (or do this purely in-core without
actually writing to the refs/ namespace), and emit an entry for
refs/mirror using that blob object name in the ls-remote part of the
response (if we go with the magic blob based approach).

>> Yes. And this is why the packfile name algorithm is horribly flawed.
>> I keep saying we should change it to name the pack using the last 20
>> bytes of the file but ... nobody has written the patch for that? :-)
>
> Totally agree. I think we could also get rid of the horrible hacks in
> repack where we pack to a tempfile, then have to do another tempfile
> dance (which is not atomic!) to move the same-named packfile out of
> the way. If the name were based on the content, we could just throw
> away our new pack if one of the same name is already there (just like
> we do for loose objects).

Yay.
Re: How to resume broke clone ?
On Thu, Dec 05, 2013 at 02:21:09PM +0100, Michael Haggerty wrote:

> A better alternative would be to ask users to clone from the central
> server. In this case, the central server would want to tell the
> clients to grab what they can from their local bootstrap mirror and
> then come back to the central server for any remainders. The trick is
> that which bootstrap mirror is "local" would vary from client to
> client.
>
> I suppose that this could be implemented using what you have discussed
> by having the central server direct the client to a URL that resolves
> differently for different clients, CDN-like. Alternatively, the
> central Git server could itself look where a request is coming from
> and use some intelligence to redirect the client to the closest
> bootstrap mirror from its own list. Or the server could pass the
> client a list of known mirrors, and the client could try to determine
> which one is closest (and reachable!).

Exactly. I think this will mostly happen via CDN, but I had also
envisioned that the server could add metadata to a list of possible
mirrors, like:

  [mirror "ko-us"]
  url = http://git.us.kernel.org/...
  zone = us

  [mirror "ko-cn"]
  url = http://git.cn.kernel.org/...
  zone = cn

If the "zone" keys follow a micro-format convention, then the client
knows that it prefers "cn" over "us" (either on the command line, or a
local config option in ~/.gitconfig).

The biggest problem with all of this is that the server has to know
about the mirrors. If you want to set up an in-house mirror for
something hosted on GitHub, but it's only available to people in your
company, then GitHub would not want to advertise it. You need some way
to tell your clients about the mirror (and that is the inverse-mirror
"fetch from the mirror, which tells you it is just a bootstrap and to
now switch to the real repo" scheme that I think you were describing
earlier).
-Peff
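The client side of the zone micro-format above could be as simple as the sketch below: walk the client's zone preference list and take the first advertised mirror that matches, falling back to the first mirror otherwise. The dict-based entry shape and the field names are illustrative, not a proposed format.

```python
def pick_mirror(mirrors, preferred_zones):
    """Return the URL of the first mirror whose zone matches the
    client's preference list; fall back to the first mirror listed."""
    for zone in preferred_zones:
        for m in mirrors:
            if m.get("zone") == zone:
                return m["url"]
    return mirrors[0]["url"] if mirrors else None


# Entries parsed from the [mirror "..."] stanzas in the message above.
mirrors = [
    {"name": "ko-us", "url": "http://git.us.kernel.org/...", "zone": "us"},
    {"name": "ko-cn", "url": "http://git.cn.kernel.org/...", "zone": "cn"},
]
```

A client in China with `preferred_zones = ["cn", "us"]` would end up at the cn mirror; a client with no matching zone still gets a usable URL.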
Re: How to resume broke clone ?
On Wed, Dec 04, 2013 at 10:50:27PM -0800, Shawn Pearce wrote:

> I wasn't thinking about using a "well known blob" for this.
>
> Jonathan, Dave, Colby and I were kicking this idea around on Monday
> during lunch. If the initial ref advertisement included a "mirrors"
> capability the client could respond with "want mirrors" instead of
> the usual want/have negotiation. The server could then return the
> mirror URLs as pkt-lines, one per pkt. It's one extra RTT, but this
> is trivial compared to the cost to really clone the repository.

I don't think this is any more or less efficient than the blob scheme.
In both cases, the client sends a single "want" line and no "have"
lines, and then the server responds with the output (either pkt-lines,
or a single-blob pack).

What I like about the blob approach is:

  1. It requires zero extra code on the server. This makes
     implementation simple, but also means you can deploy it on
     existing servers (or even on non-pkt-line servers like dumb
     http).

  2. It's very debuggable from the client side. You can fetch the
     blob, look at it, and decide which mirror you want outside of git
     if you want to (true, you can teach the git client to dump the
     pkt-line URLs, too, but that's extra code). You could even do
     this with an existing git client that has not yet learned about
     the mirror redirect.

  3. It removes any size or structure limits that the protocol imposes
     (I was planning to use git-config format for the blob itself).
     The URLs themselves aren't big, but we may want to annotate them
     with metadata. You mentioned "this is a bundle" versus "this is a
     regular http server" below. You might also want to provide
     network location information (e.g., "this is a good mirror if you
     are in Asia"), though for the most part I'd expect that to happen
     magically via CDN.
     When we discussed this before, the concept came up of offering
     not just a clone bundle, but "slices" of history (as thin-pack
     bundles), so that a fetch could grab a sequence of resumable
     slices, starting with what they have, and then topping off with a
     true fetch. You would want to provide the start and end points of
     each slice.

  4. You can manage it remotely via the git protocol (more discussion
     below).

  5. A clone done with "--mirror" will actually propagate the mirror
     file automatically.

What are the advantages of the pkt-line approach? The biggest one I
can think of is that it does not pollute the refs namespace. While (5)
is convenient in some cases, it would make it more of a pain if you
are trying to keep a clone mirror up to date, but do _not_ want to
pass along upstream's mirror file. You may want to have a server
implementation that offers a dynamic mirror, rather than a true object
we have in the ODB. That is possible with a mirror blob, but is
slightly harder (you have to fake the object rather than just dumping
a line).

> These pkt-lines need to be a bit more than just URL. Or we need a new
> URL like "bundle:http://" to denote a resumable bundle over HTTP
> vs. a normal HTTP URL that might not be a bundle file, and is just a
> better connected server.

Right, I think that's the most critical one (though you could also
just use the convention of ".bundle" in the URL). I think we may want
to leave room for more metadata, though.

> The mirror URLs could be stored in $GIT_DIR/config as a simple
> multi-value variable. Unfortunately that isn't easily remotely
> editable. But I am not sure I care?

For big sites that manage the bundles on behalf of the user, I don't
think it is an issue. For somebody running their own small site, I
think it is a useful way of moving the data to the server.
> For the average home user sharing their working repository over
> git:// from their home ADSL or cable connection, editing .git/config
> is easier than a blob in refs/mirrors. They already know how to edit
> .git/config to manage remotes.

Yes, but it's editing .git/config on the server, not on the client,
which may be slightly harder for some people. I do think we'd want
some tool support on the client side. git-config recently learned to
read from a blob. The next step is:

  git config --blob=refs/mirrors --edit

or

  git config --blob=refs/mirrors mirror.ko.url git://git.kernel.org/...
  git config --blob=refs/mirrors mirror.ko.bundle true

We can't add tool support for editing .git/config on the server side,
because the method for doing so isn't standard.

> Heck, remote.origin.url might already be a good mirror address to
> advertise, especially if the client isn't on the same /24 as the
> server and the remote.origin.url is something like "git.kernel.org".
> :-)

You could have a "git-advertise-upstream" that generates a mirror blob
from your remotes config and pushes it to your publishing point. That
may be overkill, but I don't think it's possible with a
.git/config-based solution.
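For illustration, a refs/mirrors blob in the git-config format discussed in this message might look like the fragment below. Only the mirror.<name>.url and mirror.<name>.bundle keys come from the examples above; the mirror names and the cdn.example.com URL are invented.

```ini
[mirror "ko"]
	url = git://git.kernel.org/pub/scm/git/git.git
	bundle = false
[mirror "cdn"]
	url = http://cdn.example.com/git/git.bundle
	bundle = true
```

Such a blob could be created with ordinary plumbing (git hash-object -w, then git update-ref refs/mirrors on the resulting id) and published with a normal push of that ref.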
Re: How to resume broke clone ?
On Thu, Dec 5, 2013 at 5:21 AM, Michael Haggerty wrote:

> This discussion has mostly been about letting small Git servers
> delegate the work of an initial clone to a beefier server. I haven't
> seen any explicit mention of the inverse:
>
> Suppose a company has a central Git server that is meant to be the
> "single source of truth", but has worldwide offices and wants to
> locate bootstrap mirrors in each office. The end users would not even
> want to know that there are multiple servers. Hosters like GitHub
> might also encourage their big customers to set up bootstrap
> mirror(s) in-house to make cloning faster for their users while
> reducing internet traffic and the burden on their own infrastructure.
> The goal would be to make the system transparent to users and easily
> reconfigurable as circumstances change.

I think there is a different way to do that. Build a caching Git proxy
server. And teach Git clients to use it.

One idea we had at $DAY_JOB a couple of years ago was to build a
daemon that sat in the background and continuously fetched content
from repository upstreams. We made it efficient by modifying the Git
protocol to use a hanging network socket, and the upstream server
would broadcast push pack files down these hanging streams as pushes
were received.

The original intent was for an Android developer to be able to have
his working tree forest of 500 repositories subscribe to our internal
server's broadcast stream. We figured if the server knows exactly
which refs every client has, because they all have the same ones, and
their streams are all still open and active, then the server can make
exactly one incremental thin pack and send the same copy to every
client. It's "just" a socket write problem. Instead of packing the
same stuff 100x for 100x clients it's packed once and sent 100x.

Then we realized remote offices could also install this software on a
local server, and use this as a fan-out distributor within the LAN.
We were originally thinking about some remote offices on small
Internet connections, where delivery of 10 MiB x 20 was a lot but
delivery of 10 MiB once and local fan-out on the Ethernet was easy.
The JGit patches for this work are still pending[1].

If clients had a local Git-aware cache server in their office and
~/.gitconfig had the address of it, your problem becomes simple.
Clients clone from the public URL e.g. GitHub, but the local cache
server first gives the client a URL to clone from itself. After that
is complete then the client can fetch from the upstream. The cache
server can be self-maintaining, watching its requests to see what is
accessed often-ish, and keeping those repositories current-ish locally
by running git fetch itself in the background.

It's easy to do this with bundles on "CDN" like HTTP. Just use the
office's caching HTTP proxy server. Assuming its cache is big enough
for those large Git bundle payloads, and the viral cat videos. But you
are at the mercy of the upstream bundler rebuilding the bundles. And
refetching them in whole. Neither of which is great.

A simple self-contained server that doesn't accept pushes, but knows
how to clone repositories, fetch them periodically, and run `git gc`,
works well. And the mirror URL extension we have been discussing in
this thread would work fine here. The cache server can return URLs
that point to itself. Or flat out proxy the Git transaction with the
origin server.

[1] https://git.eclipse.org/r/#/q/owner:wetherbeei%2540google.com+status:open,n,z
Re: How to resume broke clone ?
This discussion has mostly been about letting small Git servers
delegate the work of an initial clone to a beefier server. I haven't
seen any explicit mention of the inverse:

Suppose a company has a central Git server that is meant to be the
"single source of truth", but has worldwide offices and wants to
locate bootstrap mirrors in each office. The end users would not even
want to know that there are multiple servers. Hosters like GitHub
might also encourage their big customers to set up bootstrap mirror(s)
in-house to make cloning faster for their users while reducing
internet traffic and the burden on their own infrastructure. The goal
would be to make the system transparent to users and easily
reconfigurable as circumstances change.

One alternative would be to ask users to clone from their local
mirror. The local mirror would give them whatever it has, then do the
equivalent of a permanent redirect to tell the client "from now on,
use the central server" to get the rest of the initial clone and for
future fetches. But this would require users to know which mirror is
"local".

A better alternative would be to ask users to clone from the central
server. In this case, the central server would want to tell the
clients to grab what they can from their local bootstrap mirror and
then come back to the central server for any remainders. The trick is
that which bootstrap mirror is "local" would vary from client to
client.

I suppose that this could be implemented using what you have discussed
by having the central server direct the client to a URL that resolves
differently for different clients, CDN-like. Alternatively, the
central Git server could itself look where a request is coming from
and use some intelligence to redirect the client to the closest
bootstrap mirror from its own list. Or the server could pass the
client a list of known mirrors, and the client could try to determine
which one is closest (and reachable!).
I'm not sure that this idea is interesting, but I just wanted to throw
it out there as a related use case that seems a bit different than
what you have been discussing.

Michael

--
Michael Haggerty
mhag...@alum.mit.edu
http://softwareswirl.blogspot.com/
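The "central server redirects each client to its closest bootstrap mirror" variant could be sketched as below: the server keeps a table of (office network, mirror URL) pairs and matches the requesting client's address against it. The networks and URLs here are invented for illustration.

```python
import ipaddress

# Hypothetical per-office bootstrap mirrors, keyed by client network.
MIRRORS = [
    (ipaddress.ip_network("10.1.0.0/16"),
     "git://mirror-office-a.example.com/repo.git"),
    (ipaddress.ip_network("10.2.0.0/16"),
     "git://mirror-office-b.example.com/repo.git"),
]
CENTRAL = "git://central.example.com/repo.git"


def bootstrap_url(client_addr: str) -> str:
    """Pick the bootstrap mirror on the client's own network, if any;
    otherwise send the client straight to the central server."""
    addr = ipaddress.ip_address(client_addr)
    for net, url in MIRRORS:
        if addr in net:
            return url
    return CENTRAL
```

The inverse scheme (hand the client the whole MIRRORS list and let it probe reachability itself) needs no server-side lookup at all, at the cost of exposing the mirror list to everyone.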
Re: How to resume broke clone ?
On Wed, Dec 4, 2013 at 12:08 PM, Jeff King wrote:
> On Thu, Nov 28, 2013 at 11:15:27AM -0800, Shawn Pearce wrote:
>
>> >> - better integration with git bundles, provide a way to seamlessly
>> >> create/fetch/resume the bundles with "git clone" and "git fetch"
>>
>> We have been thinking about formalizing the /clone.bundle hack used
>> by repo on Android. If the server has the bundle, add a capability
>> in the refs advertisement saying it's available, and the clone
>> client can first fetch $URL/clone.bundle.
>
> Yes, that was going to be my next step after getting the bundle fetch
> support in.

Yay!

> If we are going to do this, though, I'd really love for it to not be
> "hey, fetch .../clone.bundle from me", but a full-fledged "here are
> full URLs of my mirrors".

Ack. I agree completely.

> Then you can redirect a non-http cloner to http to grab the bundle.
> Or redirect them to a CDN. Or even somebody else's server entirely
> (e.g., "go fetch from Linus first, my piddly server cannot feed you
> the whole kernel"). Some of the redirects you can do by issuing an
> http redirect to "/clone.bundle", but the cross-protocol ones are
> tricky.

Ack. My thoughts exactly. Especially the part of "my piddly server
shouldn't have to serve you a clone of Linus' tree when there are many
public hosts mirroring his code available to anyone". It is simply not
fair to clone Linus' tree off some guy's home ADSL connection, his
uplink probably sucks. But it is reasonable to fetch his incremental
delta after cloning from some other well known and well connected
source.

> If we advertise it as a blob in a specialized ref (e.g.,
> "refs/mirrors") it does not add much overhead over a simple
> capability. There are a few extra round trips to actually fetch the
> blob (client sends a want and no haves, then server sends the pack),
> but I think that's negligible when we are talking about redirecting a
> full clone.
> In either case, we have to hang up the original connection, fetch the
> mirror, and then come back.

I wasn't thinking about using a "well known blob" for this.

Jonathan, Dave, Colby and I were kicking this idea around on Monday
during lunch. If the initial ref advertisement included a "mirrors"
capability the client could respond with "want mirrors" instead of the
usual want/have negotiation. The server could then return the mirror
URLs as pkt-lines, one per pkt. It's one extra RTT, but this is
trivial compared to the cost to really clone the repository.

These pkt-lines need to be a bit more than just URL. Or we need a new
URL like "bundle:http://" to denote a resumable bundle over HTTP
vs. a normal HTTP URL that might not be a bundle file, and is just a
better connected server.

The mirror URLs could be stored in $GIT_DIR/config as a simple
multi-value variable. Unfortunately that isn't easily remotely
editable. But I am not sure I care?

GitHub doesn't let you edit $GIT_DIR/config, but it doesn't need to.
For most repositories hosted at GitHub, GitHub is probably the best
connected server for that repository. For repositories that are
incredibly high traffic GitHub might out of its own interest want to
configure mirror URLs on some sort of CDN to distribute the network
traffic closer to the edges. Repository owners just shouldn't have to
worry about these sorts of details. It should be managed by the
hosting service.

In my case for android.googlesource.com we want bundles on the CDN
near the network edges, and our repository owners don't care to know
the details of that. They just want our server software to make it all
happen, and our servers already manage $GIT_DIR/config for them. It
also mostly manages /clone.bundle on the CDN. And /clone.bundle is an
ugly, limited hack.

For the average home user sharing their working repository over git://
from their home ADSL or cable connection, editing .git/config is
easier than a blob in refs/mirrors.
They already know how to edit .git/config to manage remotes. Heck,
remote.origin.url might already be a good mirror address to advertise,
especially if the client isn't on the same /24 as the server and the
remote.origin.url is something like "git.kernel.org". :-)

>> For most Git repositories the bundle can be constructed by saving
>> the bundle reference header into a file, e.g.
>> $GIT_DIR/objects/pack/pack-$NAME.bh at the same time the pack is
>> created. The bundle can be served by combining the .bh and .pack
>> streams onto the network. It is very little additional disk overhead
>> for the origin server,
>
> That's clever. It does not work out of the box if you are using
> alternates, but I think it could be adapted in certain situations.
> E.g., if you layer the pack so that one "base" repo always has its
> full pack at the start, which is something we're already doing at
> GitHub.

Yes, well, I was assuming the pack was a fully connected repack.
Alternates always creates a partial pack. But if you have an
alternate, that alternate maybe should be given as a
Re: How to resume broke clone ?
On Thu, Nov 28, 2013 at 11:15:27AM -0800, Shawn Pearce wrote:

> >> - better integration with git bundles, provide a way to seamlessly
> >> create/fetch/resume the bundles with "git clone" and "git fetch"
>
> We have been thinking about formalizing the /clone.bundle hack used
> by repo on Android. If the server has the bundle, add a capability in
> the refs advertisement saying it's available, and the clone client
> can first fetch $URL/clone.bundle.

Yes, that was going to be my next step after getting the bundle fetch
support in. If we are going to do this, though, I'd really love for it
to not be "hey, fetch .../clone.bundle from me", but a full-fledged
"here are full URLs of my mirrors".

Then you can redirect a non-http cloner to http to grab the bundle. Or
redirect them to a CDN. Or even somebody else's server entirely (e.g.,
"go fetch from Linus first, my piddly server cannot feed you the whole
kernel"). Some of the redirects you can do by issuing an http redirect
to "/clone.bundle", but the cross-protocol ones are tricky.

If we advertise it as a blob in a specialized ref (e.g.,
"refs/mirrors") it does not add much overhead over a simple
capability. There are a few extra round trips to actually fetch the
blob (client sends a want and no haves, then server sends the pack),
but I think that's negligible when we are talking about redirecting a
full clone. In either case, we have to hang up the original
connection, fetch the mirror, and then come back.

> For most Git repositories the bundle can be constructed by saving the
> bundle reference header into a file, e.g.
> $GIT_DIR/objects/pack/pack-$NAME.bh at the same time the pack is
> created. The bundle can be served by combining the .bh and .pack
> streams onto the network. It is very little additional disk overhead
> for the origin server,

That's clever. It does not work out of the box if you are using
alternates, but I think it could be adapted in certain situations.
E.g., if you layer the pack so that one "base" repo always has its
full pack at the start, which is something we're already doing at
GitHub.

> but allows resumable clone, provided the server has not done a GC.

As an aside, the current transfer-resuming code in http.c is
questionable. It does not use etags or any sort of invalidation
mechanism, but just assumes hitting the same URL will give the same
bytes. That _usually_ works for dumb fetching of objects and
packfiles, though it is possible for a pack to change representation
without changing name. My bundle patches inherited the same flaw, but
it is much worse there, because your URL may very well just be
"clone.bundle" that gets updated periodically.

> > I posted patches for this last year. One of the things that I got
> > hung up on was that I spooled the bundle to disk, and then cloned
> > from it. Which meant that you needed twice the disk space for a
> > moment.
>
> I don't think this is a huge concern. In many cases the checked out
> copy of the repository approaches a sizable fraction of the .pack
> itself. If you don't have 2x .pack disk available at clone time you
> may be in trouble anyway as you try to work with the repository post
> clone.

Yeah, in retrospect I was being stupid to let that hold it up. I'll
revisit the patches (I've rebased them forward over the past year, so
it shouldn't be too bad).

> > I wanted to teach index-pack to "--fix-thin" a pack that was
> > already on disk, so that we could spool to disk, and then finalize
> > it without making another copy.
>
> Don't you need to separate the bundle header from the pack data
> before you do this?

Yes, though it isn't hard. We have to fetch part of the bundle header
into memory during discover_refs(), since that is when we realize we
are getting a bundle and not just the refs. From there you can spool
the bundle header to disk, and then the packfile separately.
My original implementation did that, though I don't remember if that
one got posted to the list (after realizing that I couldn't just
"--fix-thin" directly, I simplified it to just spool the whole thing
to a single file).

> If the bundle is only used at clone time there is no --fix-thin step.

Yes, for the particular use case of a clone-mirror, you wouldn't need
to --fix-thin. But I think "git fetch https://example.com/foo.bundle"
should work in the general case (and it does with my patches).

> See above, I think you can reasonably do the /clone.bundle
> automatically on any HTTP server.

Yeah, the ".bh" trick you mentioned is low enough impact to the server
that we could just unconditionally make it part of the repack.

> > What would need to go into such a hash? It would need to represent
> > the exact bytes that will go into the pack, but without actually
> > generating those bytes. Perhaps a sha1 over the sequence of
> > <sha1, type, base (if applicable), length> for each object would
> > be enough. We should know that after calling compute_write_order.
> > If the client has a match, we should be able to skip ahead to the
> > correct byte.
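The etag concern raised in this message could be addressed by recording a validator next to the byte count and only resuming when the server confirms the entity is unchanged. A minimal sketch, assuming the standard HTTP If-Range/Range mechanism; the helper names and the `saved` record shape are invented, not anything in http.c.

```python
def plan_request(saved):
    """Headers for re-fetching a partially downloaded bundle.
    `saved` is None, or a dict with 'etag' and 'bytes' keys recorded
    from the interrupted transfer."""
    if not saved or not saved.get("etag"):
        return {}  # no validator saved: restart from byte zero
    return {
        "If-Range": saved["etag"],       # server sends full body if changed
        "Range": "bytes=%d-" % saved["bytes"],
    }


def resume_offset(saved, status, etag):
    """Given the response status and ETag, decide where the newly
    received bytes start. Only a 206 with a matching validator lets us
    keep what we already have; anything else means start over."""
    if saved and status == 206 and etag == saved.get("etag"):
        return saved["bytes"]
    return 0
```

With If-Range, a server that regenerated clone.bundle answers 200 with the whole new file, so the stale partial download is discarded instead of being silently corrupted.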
Re: How to resume broke clone ?
zhifeng hu ancientrocklab.com> writes:

> Once using git clone —depth or git fetch —depth,
> while you want to move backward, you may face a problem:
>
> git fetch --depth=105
> error: Could not read 483bbf41ca5beb7e38b3b01f21149c56a1154b7a
> error: Could not read aacb82de3ff8ae7b0a9e4cfec16c1807b6c315ef
> error: Could not read 5a1758710d06ce9ddef754a8ee79408277032d8b
> error: Could not read a7d5629fe0580bd3e154206388371f5b8fc832db
> error: Could not read 073291c476b4edb4d10bbada1e64b471ba153b6b

BTW, there was (is?) a bundler service at http://bundler.caurea.org/
but I don't know if it can create a Linux-size bundle.

--
Jakub Narębski
Re: How to resume broke clone ?
On Thu, Nov 28, 2013 at 1:29 AM, zhifeng hu wrote:

> Once using git clone —depth or git fetch —depth,
> while you want to move backward, you may face a problem:
>
> git fetch --depth=105
> error: Could not read 483bbf41ca5beb7e38b3b01f21149c56a1154b7a
> error: Could not read aacb82de3ff8ae7b0a9e4cfec16c1807b6c315ef
> error: Could not read 5a1758710d06ce9ddef754a8ee79408277032d8b
> error: Could not read a7d5629fe0580bd3e154206388371f5b8fc832db
> error: Could not read 073291c476b4edb4d10bbada1e64b471ba153b6b

We now have a resumable bundle available through our kernel.org
mirror. The bundle is 658M.

  mkdir linux
  cd linux
  git init
  wget https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/clone.bundle
  sha1sum clone.bundle
  96831de0b8171e5ebba94edb31e37e70e1df  clone.bundle
  git fetch -u ./clone.bundle refs/*:refs/*
  git reset --hard

You can also use our mirror as an upstream, as we have servers in Asia
that lag no more than 5 or 6 minutes behind kernel.org:

  git remote add origin https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/
Re: How to resume broke clone ?
On Thu, Nov 28, 2013 at 1:29 AM, Jeff King wrote:
> On Thu, Nov 28, 2013 at 04:09:18PM +0700, Duy Nguyen wrote:
>
>> > Git should better support resuming transfers.
>> > It now seems not to be doing that job well.
>> > Share code, manage code, transfer code: what kind of VCS do we
>> > imagine it to be?
>>
>> You're welcome to step up and do it. Off the top of my head there
>> are a few options:
>>
>> - better integration with git bundles, provide a way to seamlessly
>> create/fetch/resume the bundles with "git clone" and "git fetch"

We have been thinking about formalizing the /clone.bundle hack used by
repo on Android. If the server has the bundle, add a capability in the
refs advertisement saying it's available, and the clone client can
first fetch $URL/clone.bundle.

For most Git repositories the bundle can be constructed by saving the
bundle reference header into a file, e.g.
$GIT_DIR/objects/pack/pack-$NAME.bh at the same time the pack is
created. The bundle can be served by combining the .bh and .pack
streams onto the network. It is very little additional disk overhead
for the origin server, but allows resumable clone, provided the server
has not done a GC.

> I posted patches for this last year. One of the things that I got
> hung up on was that I spooled the bundle to disk, and then cloned
> from it. Which meant that you needed twice the disk space for a
> moment.

I don't think this is a huge concern. In many cases the checked out
copy of the repository approaches a sizable fraction of the .pack
itself. If you don't have 2x .pack disk available at clone time you
may be in trouble anyway as you try to work with the repository post
clone.

> I wanted to teach index-pack to "--fix-thin" a pack that was already
> on disk, so that we could spool to disk, and then finalize it without
> making another copy.

Don't you need to separate the bundle header from the pack data before
you do this? If the bundle is only used at clone time there is no
--fix-thin step.
> One of the downsides of this approach is that it requires the repo
> provider (or somebody else) to provide the bundle. I think that is
> something that a big site like GitHub would do (and probably push the
> bundles out to a CDN, too, to make getting them faster). But it's not
> a universal solution.

See above, I think you can reasonably do the /clone.bundle
automatically on any HTTP server. Big sites might choose to have
/clone.bundle do a redirect into a caching CDN that fills itself by
going to the application servers to obtain the current data. This is
what we do for Android.

>> - stabilize pack order so we can resume downloading a pack
>
> I think stabilizing in all cases (e.g., including ones where the
> content has changed) is hard, but I wonder if it would be enough to
> handle the easy cases, where nothing has changed. If the server does
> not use multiple threads for delta computation, it should generate
> the same pack from the same on-disk data deterministically. We just
> need a way for the client to indicate that it has the same partial
> pack.
>
> I'm thinking that the server would report some opaque hash
> representing the current pack. The client would record that, along
> with the number of pack bytes it received. If the transfer is
> interrupted, the client comes back with the hash/bytes pair. The
> server starts to generate the pack, checks whether the hash matches,
> and if so, says "here is the same pack, resuming at byte X".

An important part of this is the want set must be identical to the
prior request. It is entirely possible the branch tips have advanced
since the prior packing attempt started.

> What would need to go into such a hash? It would need to represent
> the exact bytes that will go into the pack, but without actually
> generating those bytes. Perhaps a sha1 over the sequence of
> <sha1, type, base (if applicable), length> for each object would be
> enough. We should know that after calling compute_write_order.
> If the client has a match, we should be able to skip ahead to the
> correct byte.

I don't think length is sufficient. The repository could have recompressed an object with the same length but a different libz encoding. I wonder if loose object recompression is reliable enough about libz encoding to resume in the middle of an object? Is it just based on libz version? You may need to include information about the source of the object, e.g. the trailing 20-byte hash in the source pack file.

--
To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
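The point about encoding can be illustrated with gzip standing in for the zlib streams inside a pack: identical input yields different compressed bytes under different encoder settings, so an object's length alone cannot identify its on-the-wire bytes:

```shell
cd "$(mktemp -d)"
head -c 65536 /dev/zero > data                 # identical input both times

# -n omits name/timestamp so only the encoder settings differ.
a=$(gzip -n -1 -c data | sha1sum | cut -c1-40)
b=$(gzip -n -9 -c data | sha1sum | cut -c1-40)

[ "$a" != "$b" ] && echo "same input, different compressed streams"
```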
Re: How to resume broke clone ?
On Thu, Nov 28, 2013 at 4:29 PM, Jeff King wrote:
>> - stabilize pack order so we can resume downloading a pack
>
> I think stabilizing in all cases (e.g., including ones where the
> content has changed) is hard, but I wonder if it would be enough to
> handle the easy cases, where nothing has changed. If the server does
> not use multiple threads for delta computation, it should generate
> the same pack from the same on-disk contents deterministically. We
> just need a way for the client to indicate that it has the same
> partial pack.
>
> I'm thinking that the server would report some opaque hash
> representing the current pack. The client would record that, along
> with the number of pack bytes it received. If the transfer is
> interrupted, the client comes back with the hash/bytes pair. The
> server starts to generate the pack, checks whether the hash matches,
> and if so, says "here is the same pack, resuming at byte X".
>
> What would need to go into such a hash? It would need to represent
> the exact bytes that will go into the pack, but without actually
> generating those bytes. Perhaps a sha1 over the sequence of <sha1,
> type, base (if applicable), length> for each object would be enough.
> We should know that after calling compute_write_order. If the client
> has a match, we should be able to skip ahead to the correct byte.

Exactly. The hash would include the list of sha-1s and object sources, the git version (so changes in code or default values are covered), the list of config keys/values that may impact the pack generation algorithm (like window size...), .git/shallow, refs/replace, .git/graft, and all or most of the command-line options. If we audit the code carefully, I think we can cover all input that influences pack generation. From then on it's just a matter of protocol extension.

It also opens an opportunity for optional server-side caching: just save the pack and associate it with the hash. Next time the client asks to resume, the server has everything ready.
--
Duy
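The token-plus-cache idea in the message above can be sketched as follows; everything here (the inputs folded into the hash, the cache layout, the use of rev-list output as the want set) is illustrative, not an existing git feature:

```shell
cd "$(mktemp -d)"
git init -q repo && cd repo
git -c user.email=a@b -c user.name=a commit -q --allow-empty -m seed

# Fold several pack-influencing inputs into one token.
token=$(
  {
    git version                                  # code/default changes
    git config --get pack.window || echo unset   # config affecting deltas
    git rev-list --objects --all                 # the want set
  } | sha1sum | cut -c1-40
)

cache=$(mktemp -d)                               # stand-in for a server cache
if [ -f "$cache/$token.pack" ]; then
  echo "cache hit: resume from stored pack"
else
  echo HEAD | git pack-objects --revs --stdout > "$cache/$token.pack"
  echo "generated and cached pack $token"
fi
```

A later request presenting the same token could be served straight from the cached file, which makes byte-offset resume trivial.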
Re: How to resume broke clone ?
Once you use git clone --depth or git fetch --depth and later want to deepen the history, you may run into problems:

git fetch --depth=105
error: Could not read 483bbf41ca5beb7e38b3b01f21149c56a1154b7a
error: Could not read aacb82de3ff8ae7b0a9e4cfec16c1807b6c315ef
error: Could not read 5a1758710d06ce9ddef754a8ee79408277032d8b
error: Could not read a7d5629fe0580bd3e154206388371f5b8fc832db
error: Could not read 073291c476b4edb4d10bbada1e64b471ba153b6b

zhifeng hu

On Nov 28, 2013, at 5:20 PM, Tay Ray Chuan wrote:
> On Thu, Nov 28, 2013 at 4:14 PM, Duy Nguyen wrote:
>> On Thu, Nov 28, 2013 at 2:41 PM, zhifeng hu wrote:
>>> Thanks for the reply, but I am a developer; I want to clone the full
>>> repository. I need to view code from very early on.
>>
>> If it works with --depth=1, you can incrementally run "fetch
>> --depth=N" with N larger and larger.
>
> I second Duy Nguyen's and Trần Ngọc Quân's suggestion to 1) initially
> create a "shallow" clone, then 2) incrementally deepen your clone.
>
> Zhifeng, in the course of your research into resumable cloning, you
> might have learnt that while it's a really valuable feature, it's
> also a pretty hard problem at the same time. So it's not that git
> doesn't want to have this feature.
>
> --
> Cheers,
> Ray Chuan
Re: How to resume broke clone ?
On Thu, Nov 28, 2013 at 04:09:18PM +0700, Duy Nguyen wrote:

>> Git should better support resuming transfers. Right now it does not
>> seem to handle this well. Sharing code, managing code, transferring
>> code: isn't that what we imagine a VCS to be?
>
> You're welcome to step up and do it. Off the top of my head there are
> a few options:
>
> - better integration with git bundles, provide a way to seamlessly
>   create/fetch/resume the bundles with "git clone" and "git fetch"

I posted patches for this last year. One of the things that I got hung up on was that I spooled the bundle to disk, and then cloned from it. Which meant that you needed twice the disk space for a moment. I wanted to teach index-pack to "--fix-thin" a pack that was already on disk, so that we could spool to disk, and then finalize it without making another copy.

One of the downsides of this approach is that it requires the repo provider (or somebody else) to provide the bundle. I think that is something that a big site like GitHub would do (and probably push the bundles out to a CDN, too, to make getting them faster). But it's not a universal solution.

> - stabilize pack order so we can resume downloading a pack

I think stabilizing in all cases (e.g., including ones where the content has changed) is hard, but I wonder if it would be enough to handle the easy cases, where nothing has changed. If the server does not use multiple threads for delta computation, it should generate the same pack from the same on-disk contents deterministically. We just need a way for the client to indicate that it has the same partial pack.

I'm thinking that the server would report some opaque hash representing the current pack. The client would record that, along with the number of pack bytes it received. If the transfer is interrupted, the client comes back with the hash/bytes pair. The server starts to generate the pack, checks whether the hash matches, and if so, says "here is the same pack, resuming at byte X".
What would need to go into such a hash? It would need to represent the exact bytes that will go into the pack, but without actually generating those bytes. Perhaps a sha1 over the sequence of <sha1, type, base (if applicable), length> for each object would be enough. We should know that after calling compute_write_order. If the client has a match, we should be able to skip ahead to the correct byte.

> - remote alternates, the repo will ask for more and more objects as
>   you need them (so goodbye to distributed model)

This is also something I've been playing with, but just for very large objects (so to support something like git-media, but below the object graph layer). I don't think it would apply here, as the kernel has a lot of small objects, and getting them in the tight delta'd pack format increases efficiency a lot.

-Peff
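The resume step itself is trivial once the byte stream is stable; a toy illustration, with an ordinary file standing in for a deterministic pack:

```shell
cd "$(mktemp -d)"
head -c 100000 /dev/urandom > full.pack       # stands in for a stable pack
head -c 33333 full.pack > partial.pack        # transfer dies mid-stream

got=$(wc -c < partial.pack)                   # client reports bytes received
tail -c +"$((got + 1))" full.pack >> partial.pack   # server resumes at byte X
cmp -s full.pack partial.pack && echo "resume complete"
```

All of the protocol difficulty lies in guaranteeing that "full.pack" is byte-identical across the two requests; the skip-ahead itself is this easy.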
Re: How to resume broke clone ?
On Thu, Nov 28, 2013 at 4:14 PM, Duy Nguyen wrote:
> On Thu, Nov 28, 2013 at 2:41 PM, zhifeng hu wrote:
>> Thanks for the reply, but I am a developer; I want to clone the full
>> repository. I need to view code from very early on.
>
> If it works with --depth=1, you can incrementally run "fetch
> --depth=N" with N larger and larger.

I second Duy Nguyen's and Trần Ngọc Quân's suggestion to 1) initially create a "shallow" clone, then 2) incrementally deepen your clone.

Zhifeng, in the course of your research into resumable cloning, you might have learnt that while it's a really valuable feature, it's also a pretty hard problem at the same time. So it's not that git doesn't want to have this feature.

--
Cheers,
Ray Chuan
Re: How to resume broke clone ?
On Thu, Nov 28, 2013 at 01:32:36AM -0700, Max Kirillov wrote:

> > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> >
> > I am in China. Our bandwidth is very limited: less than 50Kb/s.
>
> You could manually download big packed bundles from some http remote.
> For example http://repo.or.cz/r/linux.git
>
> * create a new repository, add the remote there.
>
> * download files with wget or whatever:
>   http://repo.or.cz/r/linux.git/objects/info/packs
>   also the files mentioned in that file. Currently they are:
>
>   http://repo.or.cz/r/linux.git/objects/pack/pack-3807b40fc5fd7556990ecbfe28a54af68964a5ce.idx
>   http://repo.or.cz/r/linux.git/objects/pack/pack-3807b40fc5fd7556990ecbfe28a54af68964a5ce.pack
>
>   and put them in the corresponding places.
>
> * then run fetch or pull. I believe it should run fast then. Though I
>   have not tested it.

You would also need to set up local refs so that git knows you have those objects. The simplest way to do it is to just fetch by dumb-http, which can resume the pack transfer.

I think that clone is also very eager to clean up the partial transfer if the initial fetch fails. So you would want to init manually:

  git init linux
  cd linux
  git remote add origin http://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  GIT_SMART_HTTP=0 git fetch -vv

and then you can follow that up with regular smart fetches, which should be much smaller.

It would be even simpler if you could fetch the whole thing as a bundle, rather than over dumb-http. But that requires the server side (or some third party who has fast access to the repo) cooperating and making a bundle available.

-Peff
Re: How to resume broke clone ?
On Thu, Nov 28, 2013 at 3:55 PM, zhifeng hu wrote:
> The repository is growing fast, and things get harder. The size has
> now reached several GB; it may eventually be TB. What do we do then?
> If a transfer breaks and cannot be resumed, we waste time and
> bandwidth.
>
> Git should better support resuming transfers. Right now it does not
> seem to handle this well. Sharing code, managing code, transferring
> code: isn't that what we imagine a VCS to be?

You're welcome to step up and do it. Off the top of my head there are a few options:

- better integration with git bundles, provide a way to seamlessly
  create/fetch/resume the bundles with "git clone" and "git fetch"

- shallow/narrow clone. The idea is to get a small part of the repo
  (one depth, a few paths), then get more and more over many
  iterations, so if one iteration fails we don't lose everything

- stabilize pack order so we can resume downloading a pack

- remote alternates, where the repo asks for more and more objects as
  you need them (so goodbye to the distributed model)

--
Duy
Re: How to resume broke clone ?
The repository is growing fast, and things get harder. The size has now reached several GB; it may eventually be TB. What do we do then? If a transfer breaks and cannot be resumed, we waste time and bandwidth.

Git should better support resuming transfers. Right now it does not seem to handle this well. Sharing code, managing code, transferring code: isn't that what we imagine a VCS to be?

zhifeng hu

On Nov 28, 2013, at 4:50 PM, Duy Nguyen wrote:
> On Thu, Nov 28, 2013 at 3:35 PM, Karsten Blees wrote:
>> Or simply download the individual files (via ftp/http) and clone
>> locally:
>>
>>   wget -r ftp://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
>>   git clone git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>>   cd linux
>>   git remote set-url origin git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>
> Yeah, I didn't realize it is published over dumb http too. You may
> need to be careful with this though, because it's not atomic and you
> may get refs that point nowhere: you may already be done with the
> "pack" directory by the time you come to fetching "refs", and so not
> see new packs... If the dumb commit walker supports resume (I don't
> know), then it would be safer to do
>
>   git clone http://git.kernel.org/
>
> If it does not support resume, I don't think it's hard to do.
> --
> Duy
Re: How to resume broke clone ?
On Thu, Nov 28, 2013 at 3:35 PM, Karsten Blees wrote:
> Or simply download the individual files (via ftp/http) and clone
> locally:
>
>   wget -r ftp://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
>   git clone git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>   cd linux
>   git remote set-url origin git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Yeah, I didn't realize it is published over dumb http too. You may need to be careful with this though, because it's not atomic and you may get refs that point nowhere: you may already be done with the "pack" directory by the time you come to fetching "refs", and so not see new packs... If the dumb commit walker supports resume (I don't know), then it would be safer to do

  git clone http://git.kernel.org/

If it does not support resume, I don't think it's hard to do.
--
Duy
RE: How to resume broke clone ?
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>
> I am in China. Our bandwidth is very limited: less than 50Kb/s.

You could manually download big packed bundles from some http remote. For example http://repo.or.cz/r/linux.git

* create a new repository, add the remote there.

* download files with wget or whatever:
  http://repo.or.cz/r/linux.git/objects/info/packs
  also the files mentioned in that file. Currently they are:

  http://repo.or.cz/r/linux.git/objects/pack/pack-3807b40fc5fd7556990ecbfe28a54af68964a5ce.idx
  http://repo.or.cz/r/linux.git/objects/pack/pack-3807b40fc5fd7556990ecbfe28a54af68964a5ce.pack

  and put them in the corresponding places.

* then run fetch or pull. I believe it should run fast then. Though I have not tested it.

Br,
--
Max
Re: How to resume broke clone ?
On 28.11.2013 09:14, Duy Nguyen wrote:
> On Thu, Nov 28, 2013 at 2:41 PM, zhifeng hu wrote:
>> Thanks for the reply, but I am a developer; I want to clone the full
>> repository. I need to view code from very early on.
>
> If it works with --depth=1, you can incrementally run "fetch
> --depth=N" with N larger and larger.
>
> But it may be easier to ask a kernel.org admin, or any dev with a
> public web server, to provide you a git bundle you can download via
> http. Then you can fetch on top.

Or simply download the individual files (via ftp/http) and clone locally:

  wget -r ftp://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
  git clone git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  cd linux
  git remote set-url origin git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Re: How to resume broke clone ?
On Thu, Nov 28, 2013 at 2:41 PM, zhifeng hu wrote:
> Thanks for the reply, but I am a developer; I want to clone the full
> repository. I need to view code from very early on.

If it works with --depth=1, you can incrementally run "fetch --depth=N" with N larger and larger.

But it may be easier to ask a kernel.org admin, or any dev with a public web server, to provide you a git bundle you can download via http. Then you can fetch on top.
--
Duy
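The incremental-deepening approach above can be demonstrated entirely on local disk (file:// forces the normal transport, so --depth behaves as it would against a remote server):

```shell
set -e
cd "$(mktemp -d)"
git init -q src
for i in 1 2 3 4 5; do
  git -C src -c user.email=a@b -c user.name=a \
      commit -q --allow-empty -m "commit $i"
done

git clone -q --depth=1 "file://$PWD/src" shallow
n1=$(git -C shallow rev-list --count HEAD)   # history is 1 commit deep

git -C shallow fetch -q --depth=3 origin     # deepen in a later, cheap step
n2=$(git -C shallow rev-list --count HEAD)   # history deepened to 3 commits
echo "$n1 -> $n2"
```

Each fetch transfers only the extra commits, so an interrupted deepening step loses at most one iteration's worth of data rather than the whole clone.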
Re: How to resume broke clone ?
Thanks for the reply, but I am a developer; I want to clone the full repository. I need to view code from very early on.

zhifeng hu

On Nov 28, 2013, at 3:39 PM, Trần Ngọc Quân wrote:
> On 28/11/2013 10:13, zhifeng hu wrote:
>> Hello all:
>> Today I want to clone the Linux kernel git repository:
>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>>
>> I am in China. Our bandwidth is very limited: less than 50Kb/s.
>
> This repo is really big. You may consider using the --depth option if
> you don't want the full history, or clone from somewhere with better
> bandwidth:
>
>   $ git clone --depth=1 git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>
> You may also choose another mirror (github.com, for example).
> See git-clone(1).
>
> --
> Trần Ngọc Quân.
Re: How to resume broke clone ?
On 28/11/2013 10:13, zhifeng hu wrote:
> Hello all:
> Today I want to clone the Linux kernel git repository:
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>
> I am in China. Our bandwidth is very limited: less than 50Kb/s.

This repo is really big. You may consider using the --depth option if you don't want the full history, or clone from somewhere with better bandwidth:

  $ git clone --depth=1 git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

You may also choose another mirror (github.com, for example). See git-clone(1).

--
Trần Ngọc Quân.
How to resume broke clone ?
Hello all:
Today I want to clone the Linux kernel git repository:
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

I am in China. Our bandwidth is very limited: less than 50Kb/s. The clone progress is very slow, and it breaks time after time. I am very unhappy, because I cannot easily clone the kernel.

I have done some research about resumable clone, but found no good plan for resolving this problem. Would it be possible to add resumable transfers for clone after a transfer breaks, such as with bittorrent downloads or the like?

zhifeng hu