On Wed, Mar 6, 2013 at 8:54 AM, Petr Bena <[email protected]> wrote:

> Okay, this is the third time we've had the same outage... bastion2 and 3
> were accessible for a short time after bastion1's gluster died, then
> they died as well. Public keys weren't accessible on any of them, so
> basically Labs was inaccessible to everyone.
>
>
Ok. I've tracked this down somewhat. glusterd became unstable on all of the
labstore nodes, crashing and restarting frequently. The glusterfs service
(which runs NFS) crashes along with glusterd. The glusterfsd processes
(which serve the gluster filesystems) are decoupled from glusterd, so they
continued running without issue.
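For reference (my addition, not part of the original report), the three daemon types can be told apart on a node with something like the loop below. The process names match standard GlusterFS packaging; on a host without gluster installed it simply reports that nothing is running:

```shell
# glusterd   - management daemon (the one that kept crashing)
# glusterfs  - NFS server / client processes, spawned by glusterd
# glusterfsd - per-brick filesystem daemons, decoupled from glusterd
for p in glusterd glusterfs glusterfsd; do
    pgrep -x "$p" >/dev/null && echo "$p running" || echo "no $p running"
done
```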

I just restarted all of the glusterd processes. That caused an NFS outage,
which could only be fixed by killing all of the glusterfs processes and
restarting the glusterd processes again. This triggered the issue we're
seeing with bastion1: long NFS timeouts on Lucid appear to make SSH
inaccessible indefinitely. Precise instances recover from this properly.
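The recovery sequence above can be sketched as a dry-run script (my sketch; the service and process names follow standard GlusterFS packaging, and `run` only prints each command so the sequence can be reviewed before anyone executes it for real):

```shell
#!/bin/sh
# Dry-run sketch of the recovery sequence; nothing is actually restarted.
run() { echo "+ $*"; }

# 1. Restart the management daemon (this also takes down the NFS
#    server it spawned, causing the NFS outage).
run service glusterd restart

# 2. NFS stayed wedged, so kill the leftover glusterfs NFS server processes.
run pkill -f 'glusterfs.*nfs'

# 3. Restart glusterd again so it respawns a clean NFS server.
run service glusterd restart
```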

I'm going to rebuild bastion1 as Precise (saving the SSH keys, of course)
to work around this issue.

- Ryan
_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l
