Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-23 Thread Laurentiu Gosu

Hi Sunil,
Sorry for my late reply, I only had time today to start from scratch and
test.
I rebuilt my environment (2 nodes connected to a SAN via
iSCSI+multipath). I still have the issue that the heartbeat is active
after I umount my ocfs2 volume.

/etc/init.d/o2cb stop
Stopping O2CB cluster CLUST: Failed
Unable to stop cluster as heartbeat region still active

ocfs2_hb_ctl -I -d /dev/mapper/volgr1-lvol0
0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs

After I manually kill the ref (ocfs2_hb_ctl -K -d
/dev/mapper/volgr1-lvol0 ocfs2) I can stop o2cb successfully. I can
live with that, but why doesn't it stop automatically? As I understand it,
the heartbeat should be started and stopped when the volume gets
mounted/unmounted.
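
The check described above can be scripted. What follows is only a minimal sketch: it assumes the "<UUID>: <N> refs" output format shown in this message, and hb_refs is a hypothetical helper name, not part of ocfs2-tools.

```shell
#!/bin/sh
# Sketch only: extract the reference count from one line of
# `ocfs2_hb_ctl -I` output of the form "<32 hex chars>: <N> refs".
# hb_refs is a hypothetical helper, not an ocfs2-tools command.
hb_refs() {
    # $1 = one line of `ocfs2_hb_ctl -I` output; prints N, or nothing
    # if the line does not look like a ref-count line (e.g. an error).
    printf '%s\n' "$1" | sed -n 's/^[0-9A-Fa-f]\{32\}: \([0-9][0-9]*\) refs$/\1/p'
}

# Demo using the output quoted in this message (no cluster required).
refs=$(hb_refs "0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs")
if [ "${refs:-0}" -gt 0 ]; then
    echo "heartbeat region still has $refs ref(s); o2cb stop is expected to fail"
fi
```

On a real node the argument would presumably come from `ocfs2_hb_ctl -I -d /dev/mapper/volgr1-lvol0`, and a non-zero count would be the cue to run the `-K -d` command above before stopping o2cb.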


br,
Laurentiu.

On 10/19/2011 02:28, Sunil Mushran wrote:

Manual delete will only work if there are no references. In your case
there are references.

You may want to start both nodes from scratch. Do not start/stop
heartbeat manually. Also, do not force-format.

On 10/18/2011 03:54 PM, Laurentiu Gosu wrote:
OK, I rebooted one of the nodes (both had similar issues). But
something is still fishy.

- I mounted the device: mount -t ocfs2 /dev/volgr1/lvol0 /mnt/tmp/
- I unmounted it: umount /mnt/tmp/
- I tried to stop o2cb: /etc/init.d/o2cb stop
Stopping O2CB cluster CLUSTER: Failed
Unable to stop cluster as heartbeat region still active
- ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs
-  ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
- ls -Rl /sys/kernel/config/cluster/CLUSTER/heartbeat/
/sys/kernel/config/cluster/CLUSTER/heartbeat/:
total 0
drwxr-xr-x 2 root root    0 Oct 19 01:50 0C4AB55FE9314FA5A9F81652FDB9B22D
-rw-r--r-- 1 root root 4096 Oct 19 01:40 dead_threshold

/sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D:
total 0
-rw-r--r-- 1 root root 4096 Oct 19 01:50 block_bytes
-rw-r--r-- 1 root root 4096 Oct 19 01:50 blocks
-rw-r--r-- 1 root root 4096 Oct 19 01:50 dev
-r--r--r-- 1 root root 4096 Oct 19 01:50 pid
-rw-r--r-- 1 root root 4096 Oct 19 01:50 start_block

- I cannot manually delete
/sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D/


PS: I'm going to sleep now, I have to be up in a few hours. We can
continue tomorrow if that's OK with you.

Thank you for your help.

Laurentiu.

On 10/19/2011 01:33, Sunil Mushran wrote:

One way this can happen is if one starts the hb manually and then
force-formats that volume. The format will generate a new uuid. Once that
happens, the hb tool cannot map the region to the device and thus fails
to stop it. Right now the easiest option on this box is resetting it.
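
A quick way to spot such a mismatch is to compare the region names under the heartbeat directory with the UUID currently on the volume. This is a rough sketch only: stale_regions is a hypothetical helper, and the demo runs against a scratch directory that mimics the configfs listing from this thread rather than a live cluster.

```shell
#!/bin/sh
# Sketch only: print heartbeat regions whose name differs from the UUID
# currently on the volume. stale_regions is a hypothetical helper.
stale_regions() {
    # $1 = heartbeat directory, e.g. /sys/kernel/config/cluster/CLUSTER/heartbeat
    # $2 = UUID reported for the volume by `mounted.ocfs2 -d`
    for region in "$1"/*/; do
        [ -d "$region" ] || continue          # glob matched nothing
        name=$(basename "$region")
        [ "$name" = "$2" ] || echo "$name"    # region does not match the volume
    done
}

# Demo against a scratch directory: regions are directories,
# dead_threshold is a plain file and is skipped by the */ glob.
demo=$(mktemp -d)
mkdir "$demo/918673F06F8F4ED188DDCE14F39945F6"
: > "$demo/dead_threshold"
stale_regions "$demo" 0C4AB55FE9314FA5A9F81652FDB9B22D
rm -rf "$demo"
```

Any name it prints is a region that the tools would presumably be unable to map back to the reformatted device.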

On 10/18/2011 03:24 PM, Laurentiu Gosu wrote:
Yes, I did reformat it (even more than once, I think, last week).
This is a pre-production system and I'm trying various options
before moving into real life.



On 10/19/2011 01:19, Sunil Mushran wrote:

Did you reformat the volume recently? or, when did you format last?

On 10/18/2011 03:13 PM, Laurentiu Gosu wrote:

Well, this is weird:
ls /sys/kernel/config/cluster/CLUSTER/heartbeat/
918673F06F8F4ED188DDCE14F39945F6  dead_threshold

Looks like we have different UUIDs. Where is this coming from?

ocfs2_hb_ctl -I -u 918673F06F8F4ED188DDCE14F39945F6
918673F06F8F4ED188DDCE14F39945F6: 1 refs


On 10/19/2011 01:04, Sunil Mushran wrote:

Let's do it by hand.
rm -rf /sys/kernel/config/cluster/.../heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D


On 10/18/2011 02:52 PM, Laurentiu Gosu wrote:

ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat


No improvement :(


On 10/19/2011 00:50, Sunil Mushran wrote:

See if this cleans it up.
ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:44 PM, Laurentiu Gosu wrote:

ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
0C4AB55FE9314FA5A9F81652FDB9B22D: 0 refs


On 10/19/2011 00:43, Sunil Mushran wrote:

ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:40 PM, Laurentiu Gosu wrote:

mounted.ocfs2 -d
Device                    FS     Stack  UUID                              Label
/dev/mapper/volgr1-lvol0  ocfs2  o2cb   0C4AB55FE9314FA5A9F81652FDB9B22D  ocfs2


mounted.ocfs2 -f
Device                    FS     Nodes
/dev/mapper/volgr1-lvol0  ocfs2  ro02xsrv001

ro02xsrv001 = the other node in the cluster.

By the way, there is no /dev/dm-2
 ls /dev/dm-*
/dev/dm-0  /dev/dm-1


On 10/19/2011 00:37, Sunil Mushran wrote:

So it is not mounted. But we still have a hb thread because
hb could not be stopped during umount. The reason for that
could be the same that causes ocfs2_hb_ctl to fail.

Do:
mounted.ocfs2 -d

On 10/18/2011 02:32 PM, Laurentiu Gosu wrote:

ls -lR /sys/kernel/debug/ocfs2
/sys/kernel/debug/ocfs2:
total 0

ls -lR /sys/kernel/debug/o2dlm
/sys/kernel/debug/o2dlm:
total 0

ocfs2_hb_ctl -I -d /dev/dm-2
ocfs2_hb_ctl: 

Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-23 Thread Sunil Mushran

I think it stops by uuid. So try doing this the next time.
You are encountering some issue that we have not seen before.
ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D ocfs2


Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-23 Thread Laurentiu Gosu

Hmm...
# ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs
BUT:
# ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D ocfs2
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
I can still kill the ref using the device name (-d).
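
Since stopping by UUID fails here while stopping by device works, the two attempts can be chained. This is only a sketch under assumptions: stop_hb and the HB_CTL variable are hypothetical names introduced for illustration (HB_CTL lets the function be dry-run with a stub), and the only ocfs2_hb_ctl invocations used are the `-K -u` and `-K -d` forms already shown in this thread.

```shell
#!/bin/sh
# Sketch only: try to stop the heartbeat by UUID, then fall back to the
# device path. stop_hb and HB_CTL are hypothetical names; HB_CTL defaults
# to the real tool but can point at a stub for a dry run.
HB_CTL=${HB_CTL:-ocfs2_hb_ctl}

stop_hb() {
    # $1 = volume UUID, $2 = device path, $3 = service name (here: ocfs2)
    if "$HB_CTL" -K -u "$1" "$3" 2>/dev/null; then
        echo "stopped by uuid"
    elif "$HB_CTL" -K -d "$2" "$3"; then
        echo "stopped by device"
    else
        echo "heartbeat still active" >&2
        return 1
    fi
}
```

With the values from this thread the call would be `stop_hb 0C4AB55FE9314FA5A9F81652FDB9B22D /dev/mapper/volgr1-lvol0 ocfs2`. It works around the symptom only and does not explain why the UUID lookup fails.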
