On Wed, Mar 03, 2010 at 11:49:37AM -0600, Jonathan Nieder wrote:
> Zygo Blaxell wrote:
> >> I assume you used rebase -f?  Clever.
>[...]
> I guess I was expecting it to be easier because the object data is all
> there; it just has the wrong SHA-1.  That is not the case in other
> corruption scenarios, so maybe it is silly to spend too much time
> thinking about how to deal with it, but I think it's worth trying
> anyway (at least maybe to write a script for contrib/).

The problem with rebase -f is that rebase normally expects to do a diff
between "commit" and "commit~" for each commit to be included in the rebase.
At some point in that process it has to retrieve a corrupt object, and it
stops.

It might be nice to have an option on filter-branch that tries to extract
what it can by brute force, and removes everything else.  If a parent
commit is inaccessible, create a new root commit instead.  One variant
nukes any file that can't be retrieved (so corrupt files end up deleted),
another keeps corrupt data (so corrupt files end up containing whatever
deflate/unpack can salvage).  Ideally this feature would write a log file
explaining exactly what objects were deleted from which commits.
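
To make the "keep whatever deflate can salvage" variant concrete, here is a
minimal Python sketch. It is only an illustration, not an existing
filter-branch feature: the function name salvage_deflate and its chunk_size
parameter are made up, and it assumes the damaged object is a plain zlib
stream, as git loose objects are.

```python
import zlib


def salvage_deflate(compressed, chunk_size=4096):
    """Inflate as much of a possibly damaged zlib stream as possible.

    Returns (data, complete, error): the bytes recovered so far, whether the
    stream ended cleanly, and the zlib error message if one was raised.
    """
    inflater = zlib.decompressobj()
    recovered = bytearray()
    error = None
    for start in range(0, len(compressed), chunk_size):
        try:
            recovered += inflater.decompress(compressed[start:start + chunk_size])
        except zlib.error as exc:
            # Output from the failing chunk is lost; a smaller chunk_size
            # narrows that loss at the cost of more calls.
            error = str(exc)
            break
    complete = error is None and inflater.eof
    return bytes(recovered), complete, error
```

A caller could write the `error` value and the object name to the log file
whenever `complete` is false, and either keep or drop the recovered bytes
depending on which variant was requested.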

> > The problem isn't speed--the problem is tree-filter's requirement to check
> > out the data.  It can't, because the data is corrupt.  filter-branch does
> > check in that case, and it should (otherwise a filesystem on unreliable
> > media could spray undetected junk into your repo).
> 
> It just does checkout-index, clean, and update-index; the only obvious
> difference from a checkout + (munge) + add I can see is the clean.

Exactly--the checkout step notices the bad SHA1 on the corrupt objects,
and causes filter-branch to abort.  Or did I miss something here?
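
For reference, the check that trips here amounts to recomputing the object
ID from the inflated bytes and comparing it with the name the object is
stored under. A minimal Python sketch, assuming a SHA-1 repository and a
loose object (a zlib stream of "<type> <size>\0<payload>"); the function
names are hypothetical and this is not git's own code:

```python
import hashlib
import zlib


def git_object_id(obj_type, payload):
    """SHA-1 object ID for a payload of the given type ('blob', 'tree', ...)."""
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def loose_object_is_intact(expected_id, compressed):
    """True if the compressed loose object inflates and hashes to expected_id."""
    try:
        raw = zlib.decompress(compressed)
    except zlib.error:
        return False  # the stream itself is damaged
    # The stored bytes already include the "<type> <size>\0" header.
    return hashlib.sha1(raw).hexdigest() == expected_id


# The empty blob has a well-known ID in SHA-1 repositories.
assert git_object_id("blob", b"") == "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
```

An object whose data is all there but was hashed in a transient state
inflates cleanly and only fails the final comparison.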

> > It's usually hard when the file was in some transient state during the
> > SHA1 calculation.  ;)
> 
> Ah, I guess this happens with e.g. text editor swapfiles?  Ick.

Anything that uses sqlite or Berkeley DB.  Anything that modifies files
in-place instead of using the Unix create-temporary-write-close-rename
idiom.  gzip does it too, if you're writing from a pipe to a plain file
(it rewinds to fill in the size header at the beginning).  Lots of foreign
software (ported from OSes other than Unix) modifies data in place.
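
For contrast, the create-temporary-write-close-rename idiom looks roughly
like this in Python. It is a sketch only: the helper name atomic_write is
made up, and it assumes the temporary file can be created in the same
directory as the target so that the rename stays on one filesystem.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data`; readers see the old or the new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)  # create temporary
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)                         # write
            tmp.flush()
            os.fsync(tmp.fileno())
        # The with-block has closed the file; now rename over the target.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A snapshot taken at any moment then hashes either the complete old file or
the complete new one, never a half-written mix.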

> >>  - racy add, as you noticed;
> >
> > Only Git seems to have that.  SVN and CVS didn't.  Or maybe they did,
> > but they lacked the internal integrity-checking mechanisms to detect it.
>[...]
> CVS and RCS I have no clue about.

Neither has any integrity checking that I'm aware of.

> >>  - checkout is not atomic or close to atomic;
>[...]
> True enough --- if you can wait to checkout until nothing cares about
> what's happening with those files (e.g. a shutdown), there's no
> problem.

Usually I check them out somewhere else entirely, e.g. onto a replacement
system disk.  A snapshot repo is normally supposed to observe, not modify.

> Sure, as far as version control systems go, git is a good backup
> system, but what about backup systems?
> 
> Sadly, I don't even know enough to say what replicating snapshot-based
> backup system is the standard of care so to speak.

I have a multi-terabyte snapshot-based backup filesystem.  git is a long
way away from managing that.

Git can handle workloads like "the entire contents of a firewall machine's
filesystem" or "everything in ~/.firefox except [Cc]ache/*" just fine
though, as long as it runs on a host with enough RAM.

> > Compression, integrity checking, and replication are the big wins for me.
> > The compression advantage of Git vs. other tools is not trivial.  Git
> > outperforms Subversion by something like 200:1.
> 
> I think any good backup system should have these things.  Your other
> reasons are more compelling.  An unstated reason --- that git, like
> cvs and svn, is a tool developers already often know quite well how to
> use --- is also probably important.

If Linus's Git video at Google is anything to go by, Git's object store
was designed to be the kind of distributed data repository that can
survive events ranging from incompetent IT departments to deliberate
sabotage.  That's the minimum feature set I'd expect from my backup
software.  ;)



