Re: [OpenAFS] Re: odd problem with RW site after a botched replica

Timothy Balcer Tue, 30 Oct 2012 20:09:08 -0700

On Tue, Oct 30, 2012 at 7:33 AM, Kim Kimball wrote:


> If you have access to a recent RO the quickest fix may be to vos dump it
> and restore the RW from it.  NB that if there is only one RO currently
> available dumping it makes it busy and with no alternate the RO will be
> unavailable to all clients.
>
>
Thanks for that tip; however, in my efforts to get the RW site functioning,
I removed the RO replica.

In other news, the latest salvage has been running for 12 hours... I
straced the busiest pid and it is happily verifying all the links and
contents (open(), close(), pread() ad infinitum), so it's not wedged. This
volume has just under 32k directory entries in various places (yes, I made
SURE the limits were observed ;-) ), so I imagine it will take a very long
time to traverse the entire thing... interesting that this is the fourth
salvage and it actually seems to be working at it this time. The last three
times it stopped after a bit over an hour.

I suspect that the resources given to the AFS server were too limited to
actually get the salvage done properly. One thing I did this time was
increase the server's memory to 8GB, and free shows it tooling merrily
along with plenty of buffers and cache now.

I did THAT because I noticed that the kernel killed the salvage operation
the first two times due to out-of-memory conditions, something I had not
checked or expected. So it may be that this is the second "true" salvage,
and it may succeed.

I'll keep you all posted. There wasn't an error in the AFS logs that
indicated that salvager processes had been killed due to OOM. It was only in
the kernel logs.
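
For anyone who wants to check the same thing, here is a rough sketch in
Python of scanning kernel log text for OOM kills of salvager processes. It
assumes the kernel writes a "Killed process <pid> (<name>)" line when the
OOM killer fires; the exact wording can differ between kernel versions. The
function name, the "salvage" name hint, and the sample lines are
illustrative and are not taken from the actual server logs.

```python
import re

# Assumed format of the kernel's OOM-killer line: "Killed process <pid> (<name>)".
# The wording may differ between kernel versions.
OOM_KILL_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines, name_hint="salvage"):
    """Return (pid, name) pairs for OOM-killed processes whose name contains name_hint."""
    hits = []
    for line in log_lines:
        match = OOM_KILL_RE.search(line)
        if match and name_hint in match.group(2):
            hits.append((int(match.group(1)), match.group(2)))
    return hits


# Illustrative sample lines only, not real log output from this server.
sample = [
    "Out of memory: Kill process 4242 (salvager) score 912 or sacrifice child",
    "Killed process 4242 (salvager) total-vm:7340032kB, anon-rss:3904512kB, file-rss:128kB",
    "Killed process 5151 (httpd) total-vm:204800kB, anon-rss:10240kB, file-rss:64kB",
]

print(find_oom_kills(sample))  # [(4242, 'salvager')]
```

The same function can be fed the lines of dmesg output or of whatever file
the system's kernel messages are written to; where that file lives depends
on the distribution.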

-- 
Timothy Balcer
