On Mon, Mar 25, 2013 at 09:43:23AM -0400, Jeff Mitchell wrote: > > But I haven't seen exactly what the corruption is, nor exactly what > > commands they used to clone. I've invited the blog author to give more > > details in this thread. > > The syncing was performed via a clone with git clone --mirror (and a > git:// URL) and updates with git remote update.
OK. That should be resilient to corruption, then. > So I should mention that my experiments after the fact were using > local paths, but with --no-hardlinks. Yeah, we will do a direct copy in that case, and there is nothing to prevent corruption propagating. > If you're saying that the transport is where corruption is supposed to > be caught, then it's possible that we shouldn't see corruption > propagate on an initial mirror clone across git://, and that something > else was responsible for the trouble we saw with the repositories that > got cloned after-the-fact. Right. Either it was something else, or there is a bug in git's protections (but I haven't been able to reproduce anything likely). > But then I'd argue that this is non-obvious. In particular, when using > --no-hardlinks, I wouldn't expect that behavior to be different with a > straight path and with file://. There are basically three levels of transport that can be used on a local machine: 1. Hard-linking (very fast, no redundancy). 2. Byte-for-byte copy (medium speed, makes a separate copy of the data, but does not check the integrity of the original). 3. Regular git transport, creating a pack (slowest, but should include redundancy checks). Using --no-hardlinks turns off (1), but leaves (2) as an option. I think the documentation in "git clone" could use some improvement in that area. > Something else: apparently one of my statements prompted joeyh to > think about potential issues with backing up live git repos > (http://joeyh.name/blog/entry/difficulties_in_backing_up_live_git_repositories/). > Looking at that post made me realize that, when we were doing our > initial thinking about the system three years ago, we made an > assumption that, in fact, taking a .tar.gz of a repo as it's in the > process of being written to or garbage collected or repacked could be > problematic. This isn't a totally baseless assumption, as I once had a > git repository that I was in the process of updating when I had a > sudden power outage that suffered corruption. (It could totally have > been the filesystem, of course, although it was a journaled file > system.) Yes, if you take a snapshot of a repository with rsync or tar, it may be in an inconsistent state. Using the git protocol should always be robust, but if you want to do it with other tools, you need to follow a particular order: 1. copy the refs (refs/ and packed-refs) first 2. copy everything else (including object/) That covers the case where somebody is pushing an update simultaneously (you _may_ get extra objects in step 2 that they have not yet referenced, but you will never end up with a case where you are referencing objects that you did not yet transfer). If it's possible that the repository might be repacked during your transfer, I think the issue a bit trickier, as there's a moment where the new packfile is renamed into place, and then the old ones are deleted. Depending on the timing and how your readdir() implementation behaves with respect to new and deleted entries, it might be possible to miss both the new one appearing and the old ones disappearing. This is quite a tight race to catch, I suspect, but if you were to rsync objects/pack twice in a row, that would be sufficient. For pruning, I think you could run into the opposite situation: you grab the refs, somebody updates them with a history rewind (or branch deletion), then somebody prunes and objects go away. Again, the timing on this race is quite tight and it's unlikely in practice. I'm not sure of a simple way to eliminate it completely. Yet another option is to simply rsync the whole thing and then "git fsck" the result. If it's not 100% good, just re-run the rsync. This is simple and should be robust, but is more CPU intensive (you'll end up re-checking all of the data on each update). > So, we decided to use Git's built-in capabilities of consistency > checking to our advantage (with, as it turns out, a flaw in our > implementation). But the question remains: are we wrong about thinking > that rsyncing or tar.gz live repositories in the middle of being > pushed to/gc'd/repacked could result in a bogus backup? No, I think you are right. If you do the refs-then-objects ordering, that saves you from most of it, but I do think there are still some races that exist during repacking or pruning. -Peff  I mentioned that clone-over-git:// is resilient to corruption. I think that is true for bit corruption, but my tests did show that we are not as careful about checking graph connectivity during clone as we are with fetch. The circumstances in which that would matter are quite unlikely, though. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html