On Thu, Dec 5, 2013 at 5:21 AM, Michael Haggerty <mhag...@alum.mit.edu> wrote:
> This discussion has mostly been about letting small Git servers delegate
> the work of an initial clone to a beefier server. I haven't seen any
> explicit mention of the inverse:
>
> Suppose a company has a central Git server that is meant to be the
> "single source of truth", but has worldwide offices and wants to locate
> bootstrap mirrors in each office. The end users would not even want to
> know that there are multiple servers. Hosters like GitHub might also
> encourage their big customers to set up bootstrap mirror(s) in-house to
> make cloning faster for their users while reducing internet traffic and
> the burden on their own infrastructure. The goal would be to make the
> system transparent to users and easily reconfigurable as circumstances
> change.
I think there is a different way to do that: build a caching Git proxy server, and teach Git clients to use it.

One idea we had at $DAY_JOB a couple of years ago was to build a daemon that sat in the background and continuously fetched content from repository upstreams. We made it efficient by modifying the Git protocol to use a hanging network socket, and the upstream server would broadcast push pack files down these hanging streams as pushes were received.

The original intent was for an Android developer to be able to have his working tree forest of 500 repositories subscribe to our internal server's broadcast stream. We figured if the server knows exactly which refs every client has, because they all have the same ones, and their streams are all still open and active, then the server can make exactly one incremental thin pack and send the same copy to every client. It's "just" a socket write problem. Instead of packing the same stuff 100x for 100 clients, it's packed once and sent 100x.

Then we realized remote offices could also install this software on a local server, and use this as a fan-out distributor within the LAN. We were originally thinking about some remote offices on small Internet connections, where delivery of 10 MiB x 20 was a lot, but delivery of 10 MiB once and local fan-out on the Ethernet was easy. The JGit patches for this work are still pending.

If clients had a local Git-aware cache server in their office and ~/.gitconfig had the address of it, your problem becomes simple. Clients clone from the public URL, e.g. GitHub, but the local cache server first gives the client a URL to clone from itself. After that is complete, the client can fetch from the upstream. The cache server can be self-maintaining, watching its requests to see what is accessed often-ish, and keeping those repositories current-ish locally by running git fetch itself in the background. It's easy to do this with bundles on "CDN"-like HTTP.
Just use the office's caching HTTP proxy server, assuming its cache is big enough for those large Git bundle payloads (and the viral cat videos). But then you are at the mercy of the upstream bundler rebuilding the bundles, and of refetching them in whole. Neither of which is great.

A simple self-contained server that doesn't accept pushes, but knows how to clone repositories, fetch them periodically, and run `git gc`, works well. And the mirror URL extension we have been discussing in this thread would work fine here: the cache server can return URLs that point to itself, or flat-out proxy the Git transaction with the origin server.

https://git.eclipse.org/r/#/q/owner:wetherbeei%2540google.com+status:open,n,z
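A minimal version of such a self-maintaining, fetch-only cache repository can be sketched with stock Git commands; the cache path and upstream URL below are assumptions for illustration, and the script would be driven by cron or a similar scheduler:

```shell
#!/bin/sh
# Sketch of a self-maintaining cache repository that never accepts
# pushes.  CACHE_DIR and UPSTREAM are hypothetical example values.
CACHE_DIR=/var/cache/git/git.git
UPSTREAM=https://github.com/git/git.git

# Bootstrap a bare mirror the first time this repository is wanted.
if [ ! -d "$CACHE_DIR" ]; then
    git clone --mirror "$UPSTREAM" "$CACHE_DIR"
fi

# Invoked periodically: refresh all refs, prune ones deleted upstream,
# and keep the object store packed for efficient serving.
git -C "$CACHE_DIR" fetch --prune
git -C "$CACHE_DIR" gc --auto
```

Exporting these cache directories read-only over git-daemon or HTTP then gives clients a local clone source without the cache ever needing to handle pushes.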