Re: [Ocfs2-users] Another node is heartbeating in our slot! errors with LUN removal/addition

Sunil Mushran Mon, 01 Dec 2008 16:09:25 -0800

The reason it is unable to stop hb by uuid is that none of the devices
have that uuid.


So lookup by uuid fails because it cannot match the uuid to a device.

And shutdown by device name fails because it sees a different uuid
on that device. So ocfs2_hb_ctl -K -d /dev/dm-36 o2cb does nothing.
(Use o2cb as the service.)
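
For illustration, the mismatch described above can be checked with a small
shell sketch. It assumes the configfs layout quoted below (one directory per
region UUID, each holding a dev file) and the "<UUID>: <n> refs" output of
ocfs2_hb_ctl -I -d. The function names and the injectable query argument are
hypothetical and not part of ocfs2-tools.

```shell
#!/bin/sh
# Print "<region-uuid> <device>" for every o2hb region under a configfs
# heartbeat directory, e.g. /sys/kernel/config/cluster/racdbd/heartbeat.
list_hb_regions() {
    hb_root=$1
    for region in "$hb_root"/*/; do
        [ -f "${region}dev" ] || continue
        uuid=$(basename "$region")
        dev=$(cat "${region}dev")
        printf '%s %s\n' "$uuid" "$dev"
    done
}

# Extract the UUID from one line of "ocfs2_hb_ctl -I -d <device>" output,
# which has the form "<32 hex digits>: <n> refs".
hb_ctl_uuid() {
    printf '%s\n' "$1" | sed -n 's/^\([0-9A-Fa-f]\{32\}\): .*$/\1/p'
}

# Default query: ask ocfs2_hb_ctl about /dev/<name>.
query_device() {
    ocfs2_hb_ctl -I -d "/dev/$1"
}

# Report regions whose configfs UUID differs from the UUID now on the device.
# The optional second argument names a replacement query function (assumption:
# added here only so the logic can be exercised without a cluster).
find_stale_regions() {
    hb_root=$1
    query=${2:-query_device}
    list_hb_regions "$hb_root" | while read -r uuid dev; do
        ondisk=$(hb_ctl_uuid "$("$query" "$dev" 2>/dev/null)")
        if [ "$uuid" != "$ondisk" ]; then
            printf 'stale %s dev=%s ondisk=%s\n' "$uuid" "$dev" "${ondisk:-unknown}"
        fi
    done
}
```

On the node in this thread, find_stale_regions pointed at the racdbd
heartbeat directory would be expected to flag the F5F0... region on dm-36.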

The question is: can you reboot this box? If not, I could look into providing
a procedure that involves hand-editing the superblock. Fun! :)

Getting back to how this could have happened: can you provide the commands
for steps 1, 2 and 4? I want to make sure I understand what you are doing.

- unmount the snapshot dir
- unmap the snapshot lun
- take a SAN-based snapshot
- present snapshot lun (same SCSI ID/WWNN) back to server
- force a uuid reset with tunefs.ocfs2 on the snapshot filesystem
- change the label with tunefs.ocfs2 on the snapshot filesystem
- fsck the snapshot filesystem
- mount the snapshot filesystem
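
One possible guard before the uuid reset step, as a sketch only: check
whether any heartbeat region still references the snapshot device. It assumes
the dev file holds the kernel device name (for example dm-36), so a
/dev/mapper alias would need to be resolved first. The function name is
hypothetical.

```shell
#!/bin/sh
# Succeed (exit 0) if any o2hb region under the given configfs heartbeat
# directory still lists the device; fail (exit 1) otherwise.
device_has_hb_region() {
    hb_root=$1
    dev=$(basename "$2")
    for devfile in "$hb_root"/*/dev; do
        [ -f "$devfile" ] || continue
        if [ "$(cat "$devfile")" = "$dev" ]; then
            return 0
        fi
    done
    return 1
}

# Example use (path taken from this thread):
#   if device_has_hb_region /sys/kernel/config/cluster/racdbd/heartbeat /dev/dm-36; then
#       echo "heartbeat region still active on dm-36; do not reset the uuid" >&2
#   fi
```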

Sunil

Daniel Keisling wrote:
> [EMAIL PROTECTED] tmp]# uname -a
> Linux ausracdbd01.austin.ppdi.com 2.6.18-92.1.13.el5 #1 SMP Thu Sep 4 03:51:21 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
>
> [EMAIL PROTECTED] tmp]# rpm -qa | grep ocfs2
> ocfs2console-1.4.1-1.el5
> ocfs2-2.6.18-53.el5-1.2.8-2.el5
> ocfs2-tools-1.4.1-1.el5
> ocfs2-2.6.18-92.1.13.el5-1.4.1-1.el5
>
> [EMAIL PROTECTED] tmp]# rpm -qf `which ocfs2_hb_ctl`
> ocfs2-tools-1.4.1-1.el5
>
> [EMAIL PROTECTED] tmp]# cat /sys/kernel/config/cluster/racdbd/heartbeat/F5F0522D39FC4EB2824C3E68C0B1D589/dev
> dm-36
>
> [EMAIL PROTECTED] tmp]# ocfs2_hb_ctl -I -d /dev/dm-36
> 5C81428158004C66B8AD4011D023E7F9: 1 refs
>
> The kill syntax you gave me for devices needs the service name...I
> assume o2hb?
>
> [EMAIL PROTECTED] tmp]# ocfs2_hb_ctl -K -d /dev/dm-36 o2hb
> [EMAIL PROTECTED] tmp]# ocfs2_hb_ctl -I -d /dev/dm-36
> 5C81428158004C66B8AD4011D023E7F9: 0 refs
>
> However, this did not kill the thread or remove any references out of
> /sys/kernel/config/cluster/racdbd/heartbeat/:
>
> [EMAIL PROTECTED] tmp]# ps -ef | grep F5F0
> root       620   169  0 Nov29 ?        00:00:31 [o2hb-F5F0522D39]
> root     14914 11922  0 15:03 pts/4    00:00:00 grep F5F0
>
> [EMAIL PROTECTED] tmp]# cat /sys/kernel/config/cluster/racdbd/heartbeat/F5F0522D39FC4EB2824C3E68C0B1D589/dev
> dm-36
>
> FWIW, the UUID 5C81428158004C66B8AD4011D023E7F9 does not exist in
> /sys/kernel/config/cluster/racdbd/heartbeat but does appear in the
> output of 'mounted.ocfs2 -d'.
>
>> -----Original Message-----
>> From: Sunil Mushran [mailto:[EMAIL PROTECTED] 
>> Sent: Monday, December 01, 2008 2:41 PM
>> To: Daniel Keisling
>> Cc: [email protected]
>> Subject: Re: [Ocfs2-users] Another node is heartbeating in our slot! errors with LUN removal/addition
>>
>> So the problem you are encountering is killing via uuid. You could kill
>> by device name too.
>>
>> By now you have the list of heartbeat regions. To get the device name
>> for a region, do:
>>
>> $ cat /sys/kernel/config/cluster/CLUSTERNAME/heartbeat/C43CB881C2C84B09BAC14546BF6DCAD9/dev
>> sdf1
>>
>> $ ocfs2_hb_ctl -K -d /dev/sdf1
>>
>> Now make sure that the device is not mounted. It should not be. If it
>> is, then you probably have used force-uuid-reset to change the uuid of
>> an active device. In that case, I see no solution other than a node reset.
>>
>> But before you do this, I would like some more info.
>>
>> 1. strace -o /tmp/hbctl.out ocfs2_hb_ctl -K -u F5F0522D39FC4EB2824C3E68C0B1D589
>> 2. uname -a
>> 3. rpm -qa | grep ocfs2
>> 4. rpm -qf `which ocfs2_hb_ctl`
>> 5. mounted.ocfs2 -d >/tmp/mounted.out
>>
>> Thanks
>> Sunil
>>
>> Daniel Keisling wrote:
>>> I wrote a script to easily get the heartbeats that should have been
>>> killed.  However, I get a segmentation fault every time I try to kill
>>> the "dead" heartbeats:
>>>
>>> [EMAIL PROTECTED] tmp]# mounted.ocfs2 -d | grep -i f5f0 | wc -l
>>> 0
>>>
>>> [EMAIL PROTECTED] tmp]# ocfs2_hb_ctl -K -u F5F0522D39FC4EB2824C3E68C0B1D589
>>> Segmentation fault (core dumped)
>>>
>>> The process is still active:
>>>
>>> [EMAIL PROTECTED] tmp]# ps -ef | grep -i f5f0
>>> root       620   169  0 Nov29 ?        00:00:30 [o2hb-F5F0522D39]
>>> root     22608 18491  0 14:07 pts/4    00:00:00 grep -i f5f0
>>>
>>> Attached is the core.
>>>
>>> While I can create and mount snapshot filesystems on my development
>>> node, a dead heartbeat on one of my production nodes is not letting me
>>> mount the snapshot for a newly presented filesystem (thus causing our
>>> backups to fail).  What else can I do?  I really don't want to open an
>>> SR with Oracle...
>>>
>>> Thanks,
>>>
>>> Daniel


_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users
