Re: Collective wisdom about repos on NFS accessed by concurrent clients (== corruption!?)

Kenneth Ölwing Sun, 07 Apr 2013 11:57:30 -0700

Thanks for suggestions,

I don't think there's any internal debugging that helps at this point. Usually errors pointing to corruption are caused by a chain of syscalls failing in some way, and the final error shows only the last one, so strace() output is very interesting.

Right - a problem could be, in my understanding, that it will be quite hard to figure out which of the traces are actually interesting. First, just because of the intense concurrency, there will be a lot of false errors on the way, and as far as I can tell many of those errors are effectively indistinguishable from a real error; i.e. a clone can report wording to the effect of 'possibly the remote repo is broken', when it's just in transition by another process. So, a lot of retries eventually will work. Except when the repo actually is broken, in which case the retries just run until they're exhausted. I do keep all such logs anyway, and adding strace to the output should be fine - it's just a lot to go through. Which is the second thing - I noticed that I can get strace to put timestamps in its output, which will likely be necessary to try to find where two or more processes interfere.
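A minimal sketch of wrapping a git command in strace with timestamps, as mentioned above. The function name `strace_command` and the example paths are hypothetical; the strace flags used (`-f`, `-tt`, `-o`) are standard ones. Correlating traces from several NFS clients also assumes their clocks are reasonably in sync.

```python
def strace_command(git_args, trace_path):
    """Build an argv list that runs a git command under strace.

    git_args   -- arguments after "git", e.g. ["push", "origin", "master"]
    trace_path -- file that strace writes the trace to
    """
    return [
        "strace",
        "-f",              # follow child processes that git spawns
        "-tt",             # prefix each line with time of day, microsecond resolution
        "-o", trace_path,  # write the trace to a file instead of stderr
        "git",
    ] + list(git_args)
```

Passing the returned list to `subprocess.run` would run the traced command; giving each worker its own `trace_path` avoids several processes writing to the same file.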

Oh, BTW - I'm also uncertain whether it is the actual regular ops (e.g. push) or perhaps auto-gc's that sometimes kick in that cause problems. While I can set gc.auto=0 to take those out of the equation, it's obviously not a solution in the long run. Hm, maybe I should go the other way, like doing gc --aggressive very often while doing pushes and see if that more quickly provokes an error. Even Linus in my first link suggests 'avoiding concurrent gc --prune' (I know, not the same as aggressive), which is understandable, but since, again as I understand it, git will occasionally decide to do it on its own, frankly I would expect this to work. Not optimally from any viewpoint of course, but still, I simply shouldn't be able to break a repo as long as I use regular git commands. Or is that an unreasonable expectation? Given that I'm probably way beyond any reasonable normal use, I guess it could be considered chasing ghosts... but then again, if there's even a tiny hole, it would be nice to close it.
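The two experiments described above (taking auto-gc out of the equation, or running gc aggressively to provoke the problem sooner) could be scripted roughly as follows. This is only a sketch: the function name, the mode names and the repository path are hypothetical, while `git config gc.auto 0` and `git gc --aggressive` are regular git commands.

```python
def gc_experiment_commands(repo_path, mode):
    """Return the git command lines for one of the two gc experiments.

    mode "disable-auto-gc" -- switch off automatic gc for this repo
    mode "provoke"         -- force an aggressive gc, to be run often
                              alongside the pushes
    """
    if mode == "disable-auto-gc":
        return [["git", "-C", repo_path, "config", "gc.auto", "0"]]
    if mode == "provoke":
        return [["git", "-C", repo_path, "gc", "--aggressive"]]
    raise ValueError("unknown mode: %r" % (mode,))
```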

Well, I'll just have to try to battle on with it. Is there a hint of docs anywhere that would describe the locking behavior git uses to battle concurrency, and/or some (preferably single) points in the source code that I could look at?

The main issue I see is that I suspect it will generate so much data that it'll overflow my disk ;-).

Well, assuming you have some automated way of detecting when
it fails, you can just overwrite the same strace output file
repeatedly; we're only interested in the last one (or all the
last ones if several gits fail concurrently).
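The suggestion above (reuse one trace file and keep it only when a failure is detected) might look roughly like this. It is a sketch under assumptions: `run_traced`, its parameters and the naming of the saved file are hypothetical, and "failure" is taken to mean a non-zero exit status, which will not catch every kind of corruption.

```python
import os
import shutil
import subprocess
import time


def run_traced(argv, trace_path, keep_dir, runner=subprocess.run):
    """Run argv under strace; preserve the trace only if the command fails.

    trace_path is overwritten on every call, so disk use stays bounded.
    Returns the path of the preserved trace, or None on success.
    """
    # strace exits with the traced command's exit status, so a non-zero
    # return code here means the traced git command itself failed.
    cmd = ["strace", "-f", "-tt", "-o", trace_path] + list(argv)
    result = runner(cmd)
    if result.returncode == 0:
        return None
    # Copy the last trace aside under a unique name before it is overwritten.
    name = "failed-%d-%d.trace" % (os.getpid(), time.time_ns())
    saved = os.path.join(keep_dir, name)
    shutil.copyfile(trace_path, saved)
    return saved
```

With several gits running concurrently, each worker would need its own `trace_path` so that the traces do not overwrite each other.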

We use tmpwatch for this type of issue, especially with oracle traces. Set up a
directory and tell tmpwatch to delete files older than X. This will keep the
files at bay and when you detect a problem stop the clean up script.
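For setups without tmpwatch, the same idea can be sketched in a few lines. This is not tmpwatch itself: the function name, the `stop_file` convention and the use of modification time are assumptions made for illustration.

```python
import os
import time


def prune_old_files(directory, max_age_seconds, stop_file=None, now=None):
    """Delete regular files in directory older than max_age_seconds.

    If stop_file is given and exists, nothing is deleted; creating that
    file is how a failure detector would halt the cleanup.
    Returns the sorted names of the files that were removed.
    """
    if stop_file is not None and os.path.exists(stop_file):
        return []
    now = time.time() if now is None else now
    removed = []
    for entry in os.scandir(directory):
        if not entry.is_file(follow_symlinks=False):
            continue  # leave subdirectories and symlinks alone
        age = now - entry.stat(follow_symlinks=False).st_mtime
        if age > max_age_seconds:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

Run periodically (for example from cron), this keeps only recent traces around until the stop file appears.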

Thanks - as described above I do keep track of all tasks that eventually run out of steam and die from exhaustion (logs and the final copy of their clone), and if I can get the strace in there things should be fine. It'll still be a lot of data since, as described, I haven't yet figured out how to accurately detect the point where the real error actually occurs. I'll look into whether I can have some checkpoints in my tasks where they all calm down so I can test the repo for correctness, thereby limiting the data and more quickly discovering the fail state.
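One way such a checkpoint test could look, assuming all workers have paused and that `git fsck --full` is an acceptable definition of "correct" for this purpose. The function name and the injectable `runner` are hypothetical.

```python
import subprocess


def check_repo(repo_path, runner=subprocess.run):
    """Run 'git fsck --full' on repo_path while no other process touches it.

    Returns (ok, report): ok is True when fsck exits with status 0,
    report is whatever fsck printed on stdout and stderr.
    """
    result = runner(
        ["git", "-C", repo_path, "fsck", "--full"],
        capture_output=True,
        text=True,
    )
    report = (result.stdout or "") + (result.stderr or "")
    return result.returncode == 0, report
```

If the check fails, only the traces written since the previous successful checkpoint need to be examined.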


ken1


