On 05/18/2011 03:19 AM, Felix Frank wrote:
On 05/17/2011 09:24 PM, [email protected] wrote:
I inherited a broken cluster.  With the help of a national vendor I am
worse off and 'the good node' is a tad hosed.  I upgraded to the latest
kernel and got everything back to this point. (Note: the good node was
shut out of the cluster, so the other one is still up and working. That
one hangs on an 'ls' command, which is why I have my doubts about it.)
There was data on this node this morning.  I think.  I'd prefer not to
hose the data on this node in case the other node has a problem.  I'm
hoping I can just bring it up and it syncs and life is good.

[root@julius init.d]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/cciss/c0d0p1     36562540  10644068  24031240  31% /
tmpfs                  6147644         0   6147644   0% /dev/shm
[root@julius init.d]# mount /srv/vmdata/
/sbin/mount.gfs2: can't open /dev/drbd0: Wrong medium type
[root@julius init.d]# service drbd stop
Stopping all DRBD resources.
[root@julius init.d]# /sbin/drbdadm create-md drbd0
md_offset 986671665152
al_offset 986671632384
bm_offset 986641518592

Found some data
  ==>  This might destroy existing data!<==

Do you want to proceed?
[need to type 'yes' to confirm] no
Good choice.
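
As an aside, a "Wrong medium type" from mount.gfs2 usually means the DRBD device is in the Secondary role and refuses to be opened, so before anything destructive like create-md it is worth checking the resource state. A minimal sketch, assuming the resource is named drbd0 as in the config:

```shell
# Check connection state, role, and disk state before any destructive step
cat /proc/drbd
drbdadm state drbd0      # e.g. Secondary/Unknown
drbdadm dstate drbd0     # e.g. UpToDate/DUnknown

# If the local disk state is UpToDate, promoting should make
# the device openable again
drbdadm primary drbd0
```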

Activate the DRBD service again, then examine the contents of the device
using "file -sL /dev/drbd0" (it should recognize the GFS2 filesystem).
I believe that if you want to get data back, you may want to run some
sort of fsck against drbd0.
[root@julius ~]# file -sL /dev/drbd0
/dev/drbd0: GFS2 Filesystem (blocksize 4096, lockproto lock_dlm)
[root@julius ~]#
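
If the on-disk GFS2 is suspect, the fsck suggested above would be run against the DRBD device (not the backing disk), with the resource Primary and the filesystem unmounted on every node. A sketch, assuming resource drbd0:

```shell
# fsck must see the DRBD device, and only while the filesystem
# is unmounted on all nodes
drbdadm primary drbd0
fsck.gfs2 -n /dev/drbd0   # -n: read-only check first, makes no repairs

# If the read-only pass looks sane, a repairing pass:
# fsck.gfs2 -y /dev/drbd0
```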

Speaking of cluster file systems - is this a dual-primary setup? Is the
"good node" Primary?

This is a primary/primary setup.

net {
        timeout 50;
        connect-int 10;
        ping-int 10;
        allow-two-primaries;


In this case, you will most likely face split brain either way, so
syncing back up won't be easy. Your best shot (and simplest solution)
may then be to heal your "good node" (find out what's blocking it) and
use that as the sync source.

It was set up so all the VMs ran on one node and the other was only for failover, so split brain shouldn't really be an issue.

On the "good node", could it be that the dlm is biting you since the
peer node is in trouble?

I get this on the node that is locked out, so I guess you may be right:
[root@julius ~]# mount /srv/vmdata/
/sbin/mount.gfs2: can't connect to gfs_controld: Connection refused
<repeats>
/sbin/mount.gfs2: can't connect to gfs_controld: Connection refused
/sbin/mount.gfs2: gfs_controld not running
/sbin/mount.gfs2: error mounting lockproto lock_dlm
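
"can't connect to gfs_controld: Connection refused" means the cluster daemons aren't running on that node; on a RHEL/CentOS 5-era stack they are brought up by the stock init scripts before the mount. A sketch of the usual order (service names are the stock ones and may differ on your distribution):

```shell
# Typical start order on an RHEL5-era cluster node
service cman start      # membership, fencing, dlm_controld, gfs_controld
service drbd start
service clvmd start     # only if LVM sits on top of DRBD
mount /srv/vmdata/
```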

There is no error recovery at all specified in the config file.
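
For reference, DRBD's automatic split-brain recovery policies would live in that same net section. A hedged example of values commonly seen in dual-primary setups (illustrative only, not a recommendation for this particular cluster):

```
net {
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
}
```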

Would running this from the "good node" be the next step?

drbdadm -- --overwrite-data-of-peer primary <resource>

where <resource> would be drbd0 from the config file.
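
Note that this command forces a full resync from the node you run it on, overwriting the peer, so it must be run only from the node whose data you trust. The equivalent manual split-brain recovery, sketched here assuming resource drbd0:

```shell
# On the node whose data will be DISCARDED (the split-brain victim):
drbdadm secondary drbd0
drbdadm -- --discard-my-data connect drbd0

# On the node whose data survives (the sync source):
drbdadm connect drbd0   # only needed if it sits in StandAlone state
```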

I greatly appreciate your taking the time to help.

Thank You



_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user