Bug#569505: git-core: 'git add' corrupts repository if the working directory is modified as it runs

Jonathan Nieder Wed, 03 Mar 2010 09:53:06 -0800

clone 569505 -1
tags -1 =
retitle -1 read-tree (?): silently skips some corrupted objects
thanks

Zygo Blaxell wrote:

> git checkout gives you a working tree where corrupt files are missing,
> and an index where corrupt files are marked deleted.

Not good.  Will investigate.

> git filter-branch aborts when it sees the corrupt data if you have a
> tree-filter, but if you only have an index-filter it will ignore
> corrupt objects unless you do something to force it to examine their
> contents.

Seems sensible.  Sometimes getting the right history requires repeated
invocations of filter-branch, so the ideal thing is to find some way
to examine (compare) the whole history before and after, and the next
best thing is to run git fsck explicitly beforehand.

> filter-branch index-filter won't help you if other objects have been
> deltaified based on corrupt objects--at that point, recovery is very hard.
> I've only seen that occur on pack files that were corrupted outside of
> git, though, so it's not a Git problem.

I think conventional wisdom is that in that case the best thing is to
explode the pack with git unpack-objects -r and recover what you can.
If that misses crucial data, one can use git verify-pack -v as a
starting point to examine and repair the corruption.

> git gc will notice the corruption if it's packing corrupt loose objects.
> It fails to notice if it's not packing loose objects, e.g. because
> the loose objects are not old enough.

Right, this could be changed.  I haven’t decided whether I think it’s
worth it (probably it is).

>> I assume you used rebase -f?  Clever.
>
> I reset to one commit before the corruption, then manually extract the
> surviving changes between the commit after the corruption and the next
> commit that modifies the corrupted file.

Oh, sounds more painful.

I guess I was expecting it to be easier because the object data is all
there; it just has the wrong SHA-1.  That is not the case in other
corruption scenarios, so maybe it is silly to spend too much time
thinking about how to deal with it, but I think it’s worth trying
anyway (at least maybe to write a script for contrib/).
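
As a starting point for such a script, here is a minimal Python sketch
(an illustration only, not an existing contrib/ tool).  It assumes a
SHA-1 repository and loose objects whose zlib streams are still intact,
which is the case described above; the function names and the
objects_dir argument are made up for this example.  It decompresses
each loose object, hashes the result, and reports objects stored under
a name that does not match their contents.

```python
import hashlib
import os
import zlib


def actual_object_id(path):
    """Return the SHA-1 of the decompressed contents of a loose object file."""
    with open(path, "rb") as f:
        # A loose object is zlib-compressed "<type> <size>\0<content>";
        # its ID is the SHA-1 of that whole decompressed byte string.
        raw = zlib.decompress(f.read())
    return hashlib.sha1(raw).hexdigest()


def find_misnamed_objects(objects_dir):
    """Yield (claimed_id, actual_id) for loose objects whose name and content disagree."""
    for fanout in sorted(os.listdir(objects_dir)):
        subdir = os.path.join(objects_dir, fanout)
        # Loose objects live in two-character directories; skip pack/ and info/.
        if len(fanout) != 2 or not os.path.isdir(subdir):
            continue
        for name in sorted(os.listdir(subdir)):
            claimed = fanout + name
            actual = actual_object_id(os.path.join(subdir, name))
            if claimed != actual:
                yield claimed, actual
```

Calling find_misnamed_objects(".git/objects") yields one (claimed,
actual) pair per mismatched object.  Packed objects are not examined,
and a truncated or otherwise damaged zlib stream raises zlib.error
rather than being reported.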

> The problem isn't speed--the problem is tree-filter's requirement to check
> out the data.  It can't, because the data is corrupt.  filter-branch does
> check in that case, and it should (otherwise a filesystem on unreliable
> media could spray undetected junk into your repo).

It just does checkout-index, clean, and update-index; the only obvious
difference from a checkout + (munge) + add I can see is the clean.

> It's usually hard when the file was in some transient state during the
> SHA1 calculation.  ;)

Ah, I guess this happens with e.g. text editor swapfiles?  Ick.

>>  - racy add, as you noticed;
>
> Only Git seems to have that.  SVN and CVS didn't.  Or maybe they did,
> but they lacked the internal integrity-checking mechanisms to detect it.

I suspect SVN just uses a CRC32 computed at the same time as the files
are compressed, which indeed would not have the same problem.
http://svnbook.red-bean.com/nightly/en/svn.ref.svnadmin.c.verify.html
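
To illustrate the difference, here is a rough Python sketch.  It is a
simplified model, not git's or Subversion's actual code: ChangingFile,
store_two_pass and store_one_pass are hypothetical names, and the
checksum-while-compressing behaviour is only my guess from above.  The
point is that a name computed from one read of the file can disagree
with data compressed from a later read, while a checksum taken over
the bytes fed to the compressor always matches what was stored.

```python
import hashlib
import zlib


class ChangingFile:
    """Hypothetical stand-in for a file that is rewritten between reads."""

    def __init__(self, *versions):
        self.versions = list(versions)
        self.reads = 0

    def __call__(self):
        # Each call returns the next version; the last version then repeats.
        data = self.versions[min(self.reads, len(self.versions) - 1)]
        self.reads += 1
        return data


def store_two_pass(read_file):
    """Hash one read, compress another: racy if the file changes in between."""
    name = hashlib.sha1(read_file()).hexdigest()
    stored = zlib.compress(read_file())
    return name, stored


def store_one_pass(read_file, chunk_size=4096):
    """Checksum exactly the bytes that are handed to the compressor."""
    data = read_file()
    crc = 0
    compressor = zlib.compressobj()
    out = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        crc = zlib.crc32(chunk, crc)  # running CRC over the same chunk
        out.append(compressor.compress(chunk))
    out.append(compressor.flush())
    return crc, b"".join(out)
```

With ChangingFile(b"draft", b"final"), store_two_pass returns a name
derived from "draft" alongside compressed "final", whereas
store_one_pass returns a checksum and data that agree with each other
(though the snapshot may still be a transient state of the file).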

CVS and RCS I have no clue about.

>>  - checkout is not atomic or close to atomic;
>
> Not a problem in my use cases.  Checkouts are very rare, usually only
> occurring after some disaster or other.

True enough --- if you can wait to check out until nothing cares about
what’s happening with those files (e.g. a shutdown), there’s no
problem.

>>  - large files are not supported well (but there is some work going on
>>    to change this);
>
> "Large" is relative to the size of the system doing the work.  15 years
> ago, 1MB was a "large" file; today, 1MB is on the high end of "small."

I had trouble tracking a small repository of audio files I was
working on because of this.

>>  - uncompressible files are not supported well;
>
> Much better than CVS.
>
>>  - rename detection works poorly with binary files;
>
> Still better than CVS or SVN.

Sure, as far as version control systems go, git is a good backup
system, but how does it compare with dedicated backup systems?

Sadly, I don’t even know enough to say which replicating, snapshot-based
backup system is the standard of care, so to speak.

>>  - no quick way to throw away old history.
>
> I don't intend to throw away old history at all.

I guess if the history gets unmanageably big, one can start a new repo
and graft them together when needed.

> Compression, integrity checking, and replication are the big wins for me.
> The compression advantage of Git vs. other tools is not trivial.  Git
> outperforms Subversion by something like 200:1.

I think any good backup system should have these things.  Your other
reasons are more compelling.  An unstated reason --- that git, like
cvs and svn, is a tool developers already often know quite well how to
use --- is also probably important.
