Ciao,

On Wed, Apr 15, 2009 at 12:53:41PM +0200, Cristina Bulfon wrote:
> Ciao Dejan,
>
> I am going back and forth on this item :-)
> I moved to the 2.1.4 version and back to V1 style... I don't use DRBD
> anymore, just the mount.
Do you need drbd?

> So the haresources file is as follows
>
> afsitfs3.roma1.infn.it IPaddr::141.108.26.31/24/eth0
> afsitfs3.roma1.infn.it Filesystem::/dev/AFS/sda3::/vicepa::xfs
> afsitfs3.roma1.infn.it Filesystem::/dev/AFS/sda1::/usr/afs::ext3
> afsitfs3.roma1.infn.it 141.108.26.31 afs
>
> when I put the master node in standby, or stop heartbeat, the following
> happens
>
> - it tries to umount the filesystems before stopping "afs"..

Doesn't it stop afs before the filesystems?

> umount: /vicepa: device is busy
> umount: /vicepa: device is busy
> Filesystem[3427]: 2009/04/14_09:16:52 ERROR: Couldn't unmount
> /vicepa; trying cleanup with SIGTERM
> /vicepa:

This may be normal, i.e. there could be processes using the filesystem, though typically only applications which depend on the filesystem (in this case afs) should be doing something there. If this is a concern, you should check which processes have files open there (fuser, lsof).

> With the 2.1.3 version I didn't see any of those messages; everything in
> V1 style was fine.

I suspect that the afs RA is not working correctly, in particular the status operation.

Thanks,

Dejan

> thanks
>
> cristina
>
>
> On Apr 14, 2009, at 2:25 PM, Dejan Muhamedagic wrote:
>
>> Ciao,
>>
>> On Tue, Apr 14, 2009 at 01:51:25PM +0200, Cristina Bulfon wrote:
>>> Ciao,
>>>
>>> I don't think so... in V1 style it works; the behavior changes with
>>> V2 style.
>>> Attached you will find a small ha-log file (zip format).
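[Editor's aside: the status logic Dejan suspects boils down to something like the minimal sketch below. It is a hypothetical illustration, not the actual afs script: the process name "bosserver" (the OpenAFS server supervisor) is an assumption, and the real script must be inspected. The convention is exit 0 when the service is running and OCF rc 7 (not running) otherwise; LSB-style init scripts return 3 for a "status" of a stopped service, which the LRM maps to 7. A status that reports "not running" right after a successful start makes the CRM recover the resource over and over, matching the continuous start/stop reported in this thread.]

```shell
#!/bin/sh
# Hypothetical sketch of the status check a heartbeat-style afs RA needs.
# Convention: exit 0 = running (OCF_SUCCESS), 7 = not running (OCF_NOT_RUNNING).
# "bosserver" is an assumed process name, not taken from the actual afs script.
afs_status() {
    if pgrep -x bosserver >/dev/null 2>&1; then
        return 0    # service is up: monitor must report success after start
    else
        return 7    # cleanly stopped: anything else here confuses the CRM
    fi
}

afs_status
echo "afs status exit code: $?"
```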
>>
>> The monitor operation on afs reports 7 (not started) even though
>> the previous start operation succeeds:
>>
>> crmd[19180]: 2009/04/14_13:38:55 info: process_lrm_event: LRM operation
>> afs_6_start_0 (call=18, rc=0) complete
>> crmd[19180]: 2009/04/14_13:38:56 info: do_lrm_rsc_op: Performing
>> op=afs_6_monitor_120000 key=17:2:0:cc5851a8-04dd-45a6-8700-954bea0f2c78)
>> crmd[19180]: 2009/04/14_13:38:56 info: process_lrm_event: LRM operation
>> afs_6_monitor_120000 (call=19, rc=7) complete
>>
>> You have to take a look at the afs script and see what's going on.
>>
>> Thanks,
>>
>> Dejan
>>
>>> I don't know if the output of "ciblint" could help
>>>
>>> [r...@afsitfs3 crm]# ciblint -L
>>> ERROR: <nvpair name="short-resource-names"...>: [short-resource-names]
>>> is not a legal name for the <crm_config> section
>>> ERROR: <nvpair name="transition-idle-timeout"...>:
>>> [transition-idle-timeout] is not a legal name for the <crm_config> section
>>> WARNING: STONITH disabled <nvpair name="stonith-enabled" value="false">.
>>> STONITH is STRONGLY recommended.
>>> WARNING: No STONITH resources configured. STONITH is not available.
>>> INFO: See http://linux-ha.org/ciblint/stonith for more information on
>>> this topic.
>>> INFO: See http://linux-ha.org/ciblint/crm_config#stonith-enabled for
>>> more information on this topic.
>>> WARNING: resource afs_6 has failcount 2 on node afsitfs3.roma1.infn.it
>>> INFO: Resource Filesystem_4 running on node afsitfs3.roma1.infn.it
>>> INFO: Resource Filesystem_2 running on node afsitfs3.roma1.infn.it
>>> INFO: Resource drbddisk_1 running on node afsitfs3.roma1.infn.it
>>> INFO: Resource drbddisk_3 running on node afsitfs3.roma1.infn.it
>>> WARNING: Resource afs_6 not running anywhere.
>>> INFO: Resource IPaddr_141_108_26_31 running on node
>>> afsitfs3.roma1.infn.it
>>>
>>> Thanks
>>>
>>> cristina
>>>
>>> On Apr 14, 2009, at 1:00 PM, Dejan Muhamedagic wrote:
>>>
>>>> Hi,
>>>>
>>>> On Tue, Apr 14, 2009 at 10:56:23AM +0200, Cristina Bulfon wrote:
>>>>> Ciao,
>>>>>
>>>>> thanks for the answer... Dejan has already pointed this out to me
>>>>> regarding the IP.
>>>>> That IP is the alias IP for the AFS server, and I was also using it
>>>>> with IPaddr2 because at the beginning, while I was configuring AFS,
>>>>> I had a problem with network communication and I thought to redirect
>>>>> the traffic to that IP. I've solved that problem and forgot to delete
>>>>> the entry in the haresources file, because that configuration works
>>>>> fine with V1...
>>>>>
>>>>> Anyway, I corrected the haresources file as follows
>>>>>
>>>>> afsitfs3.roma1.infn.it \
>>>>>     drbddisk::afs_fs Filesystem::/dev/drbd1::/vicepa/::xfs \
>>>>>     drbddisk::afs_sw Filesystem::/dev/drbd2::/usr/afs::ext3 \
>>>>>     141.108.26.31 afs
>>>>>
>>>>> and created the cib.xml. I no longer get the error, but AFS starts
>>>>> and stops continuously
>>>>
>>>> Probably an afs issue. What do you see in the logs?
>>>>
>>>> Dejan
>>>>
>>>>> cristina
>>>>>
>>>>> On Apr 14, 2009, at 10:38 AM, Andrew Beekhof wrote:
>>>>>
>>>>>> On Fri, Apr 10, 2009 at 12:25, Cristina Bulfon
>>>>>> <[email protected]> wrote:
>>>>>>> Dejan,
>>>>>>>
>>>>>>> I've followed your advice and moved to V2; first the software was
>>>>>>> updated to version 2.1.4.
>>>>>>> I just modified the following files
>>>>>>>
>>>>>>> - ha.cf, added the line
>>>>>>>     crm yes
>>>>>>>
>>>>>>> - cib.xml has been produced using the python script and my
>>>>>>> haresources
>>>>>>>
>>>>>>> afsitfs3.roma1.infn.it IPaddr2::141.108.26.31/24/eth0:0
>>>>>>> afsitfs3.roma1.infn.it drbddisk::afs_fs
>>>>>>> Filesystem::/dev/drbd1::/vicepa::xfs
>>>>>>> afsitfs3.roma1.infn.it drbddisk::afs_sw
>>>>>>> Filesystem::/dev/drbd2::/usr/afs::ext3
>>>>>>> afsitfs3.roma1.infn.it 141.108.26.31 afs
>>>>>>>
>>>>>>> With this kind of configuration I get a lot of errors and the AFS
>>>>>>> resource doesn't work
>>>>>>
>>>>>> Looks to me like the IP address is the one that doesn't work. Did you
>>>>>> actually read the output you pasted below?
>>>>>>
>>>>>> You might want to double-check the nic and netmask attributes;
>>>>>> they're probably swapped around.
>>>>>>
>>>>>>> - crm_verify -L -x /var/lib/heartbeat/crm/cib.xml
>>>>>>>
>>>>>>> crm_verify[30489]: 2009/04/10_12:20:01 ERROR: unpack_rsc_op: Hard
>>>>>>> error: IPaddr2_1_monitor_0 failed with rc=2.
>>>>>>> crm_verify[30489]: 2009/04/10_12:20:01 ERROR: unpack_rsc_op:
>>>>>>> Preventing IPaddr2_1 from re-starting on afsitfs4.roma1.infn.it
>>>>>>> crm_verify[30489]: 2009/04/10_12:20:01 ERROR: unpack_rsc_op: Hard
>>>>>>> error: IPaddr2_1_monitor_0 failed with rc=2.
>>>>>>> crm_verify[30489]: 2009/04/10_12:20:01 ERROR: unpack_rsc_op:
>>>>>>> Preventing IPaddr2_1 from re-starting on afsitfs3.roma1.infn.it
>>>>>>>
>>>>>>> I've attached cib.xml, ha-log and ha.cf
>>>>>>>
>>>>>>> Thanks for helping me
>>>>>>>
>>>>>>> cristina
>>>>>>>
>>>>>>> On Apr 8, 2009, at 5:50 PM, Cristina Bulfon wrote:
>>>>>>>
>>>>>>>> Dejan,
>>>>>>>>
>>>>>>>> thanks so much for the explanation :-)
>>>>>>>>
>>>>>>>> c.
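[Editor's aside on the rc=2 above: in OCF terms, exit code 2 is OCF_ERR_ARGS (invalid parameters), which fits Andrew's suspicion that the agent is being fed bad nic/netmask values. One guess, not verified against this cluster: the old IPaddr-style alias suffix `:0` on the interface field may be what trips up the converted IPaddr2 parameters, since IPaddr2 puts the address on the base device itself. A plain interface name would look like:]

```
afsitfs3.roma1.infn.it IPaddr2::141.108.26.31/24/eth0
```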
>>>>>>>> On Apr 8, 2009, at 5:46 PM, Dejan Muhamedagic wrote:
>>>>>>>>
>>>>>>>>> Ciao,
>>>>>>>>>
>>>>>>>>> On Wed, Apr 08, 2009 at 04:17:45PM +0200, Cristina Bulfon wrote:
>>>>>>>>>> Ciao Dejan,
>>>>>>>>>>
>>>>>>>>>> thanks for the answer.
>>>>>>>>>> Do you mean that I have to use heartbeat V2 plus CRM, and is
>>>>>>>>>> there a way to check the HBA without using hbaping?
>>>>>>>>>
>>>>>>>>> Unlike Heartbeat v1, CRM/v2 can monitor resources. I suppose that
>>>>>>>>> in your case a failing HBA would cause the drbd or Filesystem
>>>>>>>>> monitor action to fail, which would result in either a failover
>>>>>>>>> or a restart, depending on the configuration.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Dejan
>>>>>>>>>
>>>>>>>>>> Just to be sure I have understood correctly; I am a newbie with
>>>>>>>>>> heartbeat V2.
>>>>>>>>>>
>>>>>>>>>> thanks
>>>>>>>>>>
>>>>>>>>>> cristina
>>>>>>>>>>
>>>>>>>>>> On Mar 31, 2009, at 2:00 PM, Dejan Muhamedagic wrote:
>>>>>>>>>>
>>>>>>>>>>> Ciao,
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 31, 2009 at 01:48:47PM +0200, Cristina Bulfon wrote:
>>>>>>>>>>>> Ciao,
>>>>>>>>>>>>
>>>>>>>>>>>> in our heartbeat cluster we simulated a broken HBA by
>>>>>>>>>>>> unplugging the fiber from the HBA on the primary node. The
>>>>>>>>>>>> resources didn't switch to the secondary node, and the log
>>>>>>>>>>>> file on the primary node reported the following messages:
>>>>>>>>>>>>
>>>>>>>>>>>> Feb 19 14:33:33 afsitfs3 kernel: qla2xxx 0000:0a:01.0: LOOP DOWN
>>>>>>>>>>>> detected (2 e678 16ed).
>>>>>>>>>>>> Feb 19 14:33:38 afsitfs3 kernel: qla2xxx 0000:0a:01.1: LOOP DOWN
>>>>>>>>>>>> detected (2 8633 16fc).
>>>>>>>>>>>> Feb 19 14:33:46 afsitfs3 kernel: qla2x00: FAILOVER device 2 from
>>>>>>>>>>>> 200500a0b832d169 -> 200400a0b832d16a - LUN 10, reason=0x2
>>>>>>>>>>>> Feb 19 14:33:46 afsitfs3 kernel: qla2x00: FROM HBA 0 to HBA 1
>>>>>>>>>>>> Feb 19 14:33:52 afsitfs3 kernel: qla2x00: FAILOVER device 2 from
>>>>>>>>>>>> 200400a0b832d16a -> 200500a0b832d16a - LUN 10, reason=0x2
>>>>>>>>>>>> Feb 19 14:33:52 afsitfs3 kernel: qla2x00: FROM HBA 1 to HBA 1
>>>>>>>>>>>> Feb 19 14:33:55 afsitfs3 kernel: qla2x00: FAILOVER device 2 from
>>>>>>>>>>>> 200500a0b832d16a -> 200400a0b832d169 - LUN 10, reason=0x2
>>>>>>>>>>>> Feb 19 14:33:55 afsitfs3 kernel: qla2x00: FROM HBA 1 to HBA 0
>>>>>>>>>>>> Feb 19 14:33:58 afsitfs3 kernel: qla2x00: FAILOVER device 2 from
>>>>>>>>>>>> 200400a0b832d169 -> 200500a0b832d169 - LUN 10, reason=0x2
>>>>>>>>>>>> Feb 19 14:33:58 afsitfs3 kernel: qla2x00: FROM HBA 0 to HBA 0
>>>>>>>>>>>> Feb 19 14:34:01 afsitfs3 kernel: qla2x00: FAILOVER device 2 from
>>>>>>>>>>>> 200500a0b832d169 -> 200400a0b832d16a - LUN 10, reason=0x2
>>>>>>>>>>>>
>>>>>>>>>>>> I expected these messages in some way, but I do not understand
>>>>>>>>>>>> why the secondary node doesn't take control of the resources.
>>>>>>>>>>>>
>>>>>>>>>>>> In ha.cf there is nothing related to the HBA, and the
>>>>>>>>>>>> haresources file is
>>>>>>>>>>>>
>>>>>>>>>>>> afsitfs3.roma1.infn.it IPaddr2::Y.Y.Y.Y/24/eth0:0
>>>>>>>>>>>> afsitfs3.roma1.infn.it drbddisk::r0
>>>>>>>>>>>> Filesystem::/dev/drbd1::/vicepa::xfs
>>>>>>>>>>>> afsitfs3.roma1.infn.it drbddisk::r1
>>>>>>>>>>>> Filesystem::/dev/drbd2::/usr/afs::ext3
>>>>>>>>>>>> afsitfs3.roma1.infn.it Y.Y.Y.Y afs
>>>>>>>>>>>
>>>>>>>>>>> There's no resource monitoring with v1. For that you have to go
>>>>>>>>>>> with v2/Pacemaker (aka CRM).
>>>>>>>>>>>
>>>>>>>>>>>> I also tried to use hbaping, compiling hbaapi_src_2.2, but
>>>>>>>>>>>> without success.. I got problems during compilation and didn't
>>>>>>>>>>>> understand whether I have to use libHBAAPI.so from hbaapi or
>>>>>>>>>>>> from the HBA vendor.
>>>>>>>>>>>
>>>>>>>>>>> That could work with ipfail, perhaps.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Dejan
>>>>>>>>>>>
>>>>>>>>>>>> Our FC controller is a QLogic PCI to Fibre Channel Host Adapter
>>>>>>>>>>>> for QLA2342:
>>>>>>>>>>>> Firmware version 3.03.25 IPX, Driver version 8.02.14.01-fo
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>>
>>>>>>>>>>>> cristina

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
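[Editor's aside on the stop-ordering question near the top of the thread: within one haresources group (one logical line, possibly continued with backslashes), V1-style heartbeat starts resources left-to-right and stops them right-to-left. Keeping afs on the same line as the filesystems, after them, therefore guarantees afs is stopped before the umounts are attempted; splitting the resources onto separate lines makes them separate groups and forfeits that guarantee within a group. A sketch using the devices and mountpoints quoted in the thread (paths taken from the messages above, not verified):]

```
afsitfs3.roma1.infn.it \
    IPaddr::141.108.26.31/24/eth0 \
    Filesystem::/dev/AFS/sda3::/vicepa::xfs \
    Filesystem::/dev/AFS/sda1::/usr/afs::ext3 \
    afs
```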
