If you can umount all ocfs2 volumes on node 2, you can skip the mounted.ocfs2 step. Just list all active heartbeat regions and kill them one by one.
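A rough sketch of the list subtraction from the quoted steps, in case it helps. stale_regions is a hypothetical helper, not part of ocfs2-tools. It assumes region names are the volume UUIDs in uppercase without dashes, so it normalizes both lists before comparing. The commented usage further assumes that the UUID is the third column of the mounted.ocfs2 -d output and that each region is a subdirectory of the heartbeat directory; check both on your version before relying on it.

```shell
#!/bin/sh
# Hypothetical helper (not part of ocfs2-tools).
# Usage: stale_regions MOUNTED_UUIDS_FILE HB_REGIONS_FILE
# Both files hold one identifier per line. Prints every heartbeat region
# that has no matching mounted UUID.
stale_regions() {
    mounted_norm=$(mktemp) || return 1
    # Normalize the mounted UUIDs: drop dashes, uppercase the hex digits.
    tr -d '-' < "$1" | tr 'a-f' 'A-F' > "$mounted_norm"
    # Keep only the region lines that match no mounted UUID exactly.
    # An empty mounted list (everything umounted) keeps every region.
    tr -d '-' < "$2" | tr 'a-f' 'A-F' | grep -vxF -f "$mounted_norm"
    rm -f "$mounted_norm"
}

# Possible usage on the affected node, as root. <clustername> is a
# placeholder, and the awk column is an assumption about the output format.
#
#   mounted.ocfs2 -d | awk 'NR > 1 { print $3 }' > /tmp/mounted.txt
#   for d in /sys/kernel/config/cluster/<clustername>/heartbeat/*/; do
#       basename "$d"
#   done > /tmp/regions.txt
#   stale_regions /tmp/mounted.txt /tmp/regions.txt
#
# Then, for each region printed, one at a time:
#
#   ocfs2_hb_ctl -I -u <UUID>    # reference count, should be 1
#   ocfs2_hb_ctl -K -u <UUID>    # kill that heartbeat
```

The kill step is left manual on purpose, so you can confirm each o2hb thread is gone before moving on to the next one.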

Sunil Mushran wrote:
> The first step is to find out the current UUIDs on the devices.
> $ mounted.ocfs2 -d
>
> Next get a list of all running heartbeat threads.
> $ ls -l /sys/kernel/config/cluster/<clustername>/heartbeat/
> This will list the heartbeat regions, which are named after the UUIDs.
>
> What you have to do is remove from the second list all UUIDs you get
> from the first list. This process could be made simpler if you umounted
> all the ocfs2 volumes on node 2. What we are trying to do is kill off all
> hb threads that should have been killed during umount.
>
> Once you have that list, do:
> $ ocfs2_hb_ctl -I -u <UUID>
> For example:
> $ ocfs2_hb_ctl -I -u C43CB881C2C84B09BAC14546BF6DCAD9
>
> This will tell you the number of hb references. It should be 1.
> To kill do:
> $ ocfs2_hb_ctl -K -u <UUID>
>
> Do it one by one. Ensure the hb thread is killed. (Note: The o2hb thread
> name has the start of the region name.)
>
> We still don't know why ocfs2_hb_ctl segfaults during umount. But we
> know that its failure to stop the heartbeat is the cause of your problem.
>
> Sunil
>
> Daniel Keisling wrote:
>
>> All nodes except the node that I run snapshots on have the correct
>> number of o2hb threads running. However, node 2, the node that has
>> daily snapshots taken, has _way_ too many threads:
>>
>> [EMAIL PROTECTED] ~]# ps aux | grep o2hb | wc -l
>> 79
>>
>> [EMAIL PROTECTED] ~]# ps aux | grep o2hb | head -n 10
>> root  1166  0.0  0.0  0  0 ?  S<  Nov20  0:47 [o2hb-00EFECD3FF]
>> root  1216  0.0  0.0  0  0 ?  S<  Oct25  4:14 [o2hb-5E0C4AD17C]
>> root  1318  0.0  0.0  0  0 ?  S<  Nov01  3:18 [o2hb-98697EE8BC]
>> root  1784  0.0  0.0  0  0 ?  S<  Nov15  1:25 [o2hb-A7DBDA5C27]
>> root  2293  0.0  0.0  0  0 ?  S<  Nov18  1:05 [o2hb-FBA96061AD]
>> root  2410  0.0  0.0  0  0 ?  S<  Oct23  4:49 [o2hb-289FD53333]
>> root  2977  0.0  0.0  0  0 ?  S<  Nov21  0:00 [o2hb-58CB9EA8F0]
>> root  3038  0.0  0.0  0  0 ?  S<  Nov21  0:00 [o2hb-D33787D93D]
>> root  3150  0.0  0.0  0  0 ?  S<  Oct25  4:38 [o2hb-3CB2E03215]
>> root  3302  0.0  0.0  0  0 ?  S<  Nov09  2:22 [o2hb-F78E8BF89E]
>>
>>
>> What's the best way to proceed?
>>
>> Is this being caused by unpresenting/presenting snapshotted LUNs back to
>> the system? Those steps include:
>> - unmount the snapshot dir
>> - unmap the snapshot lun
>> - take a SAN-based snapshot
>> - present snapshot lun (same SCSI ID/WWNN) back to server
>> - force a uuid reset with tunefs.ocfs2 on the snapshot filesystem
>> - change the label with tunefs.ocfs2 on the snapshot filesystem
>> - fsck the snapshot filesystem
>> - mount the snapshot filesystem
>>
>> I am using tunefs.ocfs2 v1.2.7 because the --force-uuid-reset option is
>> not in the v1.4.1 release.
>>
>> My two-node development cluster, which is exactly the same as above, is
>> exhibiting the same behavior. My single-node cluster, which is exactly
>> the same as above, is NOT exhibiting the same behavior.
>>
>> Another single-node Oracle RAC cluster that is nearly the same (using
>> Qlogic HBA drivers for SCSI devices instead of device-mapper) does not
>> exhibit the o2hb thread issue.
>>
>> Daniel
>>
>
> _______________________________________________
> Ocfs2-users mailing list
> [email protected]
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
