On Apr 15, 2009, at 1:18 PM, Dejan Muhamedagic wrote:
Ciao,
On Wed, Apr 15, 2009 at 12:53:41PM +0200, Cristina Bulfon wrote:
Ciao Dejan,
I am going back and forth on this item :-)
I moved to version 2.1.4 and back to V1 style... I no longer use
DRBD, just the mount.
Do you need drbd?
No... when I first started using heartbeat I couldn't manage the
filesystem mount with heartbeat, so I used DRBD as a workaround.
I don't need it, since my devices are visible through the SAN.
So the haresources file is as follows:
afsitfs3.roma1.infn.it IPaddr::141.108.26.31/24/eth0
afsitfs3.roma1.infn.it Filesystem::/dev/AFS/sda3::/vicepa::xfs
afsitfs3.roma1.infn.it Filesystem::/dev/AFS/sda1::/usr/afs::ext3
afsitfs3.roma1.infn.it 141.108.26.31 afs
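For reference: in V1, the resources on a haresources line start left to right and stop right to left, so afs should normally be stopped before the filesystems are unmounted. Under CRM/v2 that ordering has to come from constraints instead; a sketch, with resource ids and the heartbeat 2.1.x rsc_order attribute names assumed rather than taken from the actual cib.xml:

```xml
<!-- hypothetical: make afs start after (and therefore stop before)
     each Filesystem resource; all ids here are illustrative -->
<rsc_order id="afs_after_vicepa" from="afs_6" to="Filesystem_4" type="after"/>
<rsc_order id="afs_after_usrafs" from="afs_6" to="Filesystem_2" type="after"/>
```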
When I put the master node in standby or stop heartbeat, the
following happens:
- it tries to umount the filesystems before stopping "afs"...
Shouldn't afs be stopped before the filesystems?
That is the problem, and I don't understand why... it seems that
the stop is performed in the same order as the "start".
umount: /vicepa: device is busy
umount: /vicepa: device is busy
Filesystem[3427]: 2009/04/14_09:16:52 ERROR: Couldn't unmount
/vicepa; trying cleanup with SIGTERM
/vicepa:
This may be normal, i.e. there could be processes using the
filesystem, though typically only applications which depend on
the filesystem (in this case afs) should be doing anything there.
If this is a concern, you should check which processes have files
open there (fuser, lsof).
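That check can be sketched as follows (a scratch directory stands in for /vicepa; fuser and lsof are assumed to be installed from the usual psmisc and lsof packages, with a pure-/proc fallback when they are not):

```shell
#!/bin/sh
# Reproduce the "device is busy" situation on a scratch directory
# (a stand-in for /vicepa) and find the process holding it open.
DIR=$(mktemp -d)
sleep 30 > "$DIR/busy.log" &     # simulated process keeping a file open
HOLDER=$!
sleep 1
# Preferred tools, if installed:
fuser -v "$DIR/busy.log" 2>/dev/null
lsof -- "$DIR/busy.log" 2>/dev/null
# Pure-/proc fallback: scan open file descriptors for the file
for p in /proc/[0-9]*; do
    if ls -l "$p"/fd 2>/dev/null | grep -q "$DIR/busy.log"; then
        echo "holder pid: ${p#/proc/}"
    fi
done
kill "$HOLDER" 2>/dev/null
rm -rf "$DIR"
```

Whatever PIDs show up here are what keep umount failing until they exit or are killed.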
With version 2.1.3 I didn't see any messages of that kind;
everything in V1 style was fine.
I suspect that the afs RA is not working correctly, in particular
the status operation.
I will take a look
thanks cristina
Thanks,
Dejan
thanks
cristina
On Apr 14, 2009, at 2:25 PM, Dejan Muhamedagic wrote:
Ciao,
On Tue, Apr 14, 2009 at 01:51:25PM +0200, Cristina Bulfon wrote:
Ciao,
I don't think so... in V1 style it works; the behavior changes
with V2 style.
In attachment you will find a small ha-log file (zip format).
The monitor operation on afs reports 7 (not started) even though
the previous start operation succeeds:
crmd[19180]: 2009/04/14_13:38:55 info: process_lrm_event: LRM operation afs_6_start_0 (call=18, rc=0) complete
crmd[19180]: 2009/04/14_13:38:56 info: do_lrm_rsc_op: Performing op=afs_6_monitor_120000 key=17:2:0:cc5851a8-04dd-45a6-8700-954bea0f2c78)
crmd[19180]: 2009/04/14_13:38:56 info: process_lrm_event: LRM operation afs_6_monitor_120000 (call=19, rc=7) complete
You have to take a look at the afs script and see what's going
on.
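One way to see what's going on is to hand-exercise the script the way the LRM does. The sketch below uses a throwaway stand-in script (hypothetical; the real afs script lives wherever heartbeat finds its resource scripts): heartbeat maps an LSB "status" exit code of 3 (stopped) to OCF rc=7, so a status action that misreports right after a successful start produces exactly the pattern in the log above.

```shell
#!/bin/sh
# Toy resource script standing in for the real afs script.
RA=$(mktemp)
cat > "$RA" <<'EOF'
#!/bin/sh
case "$1" in
    start)  exit 0 ;;   # start reports success...
    status) exit 3 ;;   # ...but status claims "stopped" (LSB rc 3)
esac
EOF
chmod +x "$RA"
"$RA" start  && echo "start  rc=0"
"$RA" status || echo "status rc=$?"   # LSB 3 becomes monitor rc=7 in the LRM
rm -f "$RA"
```

If the real script's status action behaves like this toy one, the cluster will keep restarting afs exactly as observed.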
Thanks,
Dejan
I don't know if the output of "ciblint" could help:
[r...@afsitfs3 crm]# ciblint -L
ERROR: <nvpair name="short-resource-names"...>: [short-resource-names] is not a legal name for the <crm_config> section
ERROR: <nvpair name="transition-idle-timeout"...>: [transition-idle-timeout] is not a legal name for the <crm_config> section
WARNING: STONITH disabled <nvpair name="stonith-enabled" value="false">. STONITH is STRONGLY recommended.
WARNING: No STONITH resources configured. STONITH is not available.
INFO: See http://linux-ha.org/ciblint/stonith for more information on this topic.
INFO: See http://linux-ha.org/ciblint/crm_config#stonith-enabled for more information on this topic.
WARNING: resource afs_6 has failcount 2 on node afsitfs3.roma1.infn.it
INFO: Resource Filesystem_4 running on node afsitfs3.roma1.infn.it
INFO: Resource Filesystem_2 running on node afsitfs3.roma1.infn.it
INFO: Resource drbddisk_1 running on node afsitfs3.roma1.infn.it
INFO: Resource drbddisk_3 running on node afsitfs3.roma1.infn.it
WARNING: Resource afs_6 not running anywhere.
INFO: Resource IPaddr_141_108_26_31 running on node afsitfs3.roma1.infn.it
Thanks
cristina
On Apr 14, 2009, at 1:00 PM, Dejan Muhamedagic wrote:
Hi,
On Tue, Apr 14, 2009 at 10:56:23AM +0200, Cristina Bulfon wrote:
Ciao,
thanks for the answer... Dejan has already pointed this out to me
regarding the IP. That IP is the alias IP for the AFS server, and
I was also using it with IPaddr2 because at the beginning, while
I was configuring AFS, I had a problem with network communication
and I thought to redirect the traffic to that IP. I've solved
that problem and forgot to delete the entry in the haresources
file, because that configuration worked fine with V1...
Anyway, I corrected the haresources file as follows
afsitfs3.roma1.infn.it \
drbddisk::afs_fs Filesystem::/dev/drbd1::/vicepa/::xfs \
drbddisk::afs_sw Filesystem::/dev/drbd2::/usr/afs::ext3 \
141.108.26.31 afs
and created the cib.xml. I don't get the error anymore, but AFS
starts and stops continuously.
Probably an afs issue. What do you see in the logs?
Dejan
cristina
On Apr 14, 2009, at 10:38 AM, Andrew Beekhof wrote:
On Fri, Apr 10, 2009 at 12:25, Cristina Bulfon
<[email protected]> wrote:
Dejan,
I've followed your advice and moved to V2; first, the software
was updated to version 2.1.4.
I just modified the following files:
- ha.cf: added the line
crm yes
- cib.xml: produced using the python script from my haresources
afsitfs3.roma1.infn.it IPaddr2::141.108.26.31/24/eth0:0
afsitfs3.roma1.infn.it drbddisk::afs_fs
Filesystem::/dev/drbd1::/vicepa::xfs
afsitfs3.roma1.infn.it drbddisk::afs_sw
Filesystem::/dev/drbd2::/usr/afs::ext3
afsitfs3.roma1.infn.it 141.108.26.31 afs
With this kind of configuration I get a lot of errors and the AFS
resource doesn't work.
Looks to me like the ip address is the one that doesn't work.
Did you
actually read the output you pasted below?
You might want to double check the nic and netmask attributes,
they're
probably swapped around.
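For comparison, here is a minimal sketch of how the IPaddr2 primitive might look in cib.xml (nvpair ids are illustrative, not taken from the attached file). IPaddr2 takes the address, prefix length and interface as separate parameters, commonly named ip, cidr_netmask and nic, and a monitor_0 failure with rc=2 usually means the agent rejected one of its parameters:

```xml
<primitive id="IPaddr2_1" class="ocf" provider="heartbeat" type="IPaddr2">
  <instance_attributes id="IPaddr2_1_inst_attr">
    <attributes>
      <nvpair id="IPaddr2_1_ip"      name="ip"           value="141.108.26.31"/>
      <nvpair id="IPaddr2_1_netmask" name="cidr_netmask" value="24"/>
      <nvpair id="IPaddr2_1_nic"     name="nic"          value="eth0"/>
    </attributes>
  </instance_attributes>
</primitive>
```

Note that IPaddr2 does not need the ":0" alias suffix on the interface name; that convention belongs to the older IPaddr agent.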
- crm_verify -L -x /var/lib/heartbeat/crm/cib.xml
crm_verify[30489]: 2009/04/10_12:20:01 ERROR: unpack_rsc_op: Hard error: IPaddr2_1_monitor_0 failed with rc=2.
crm_verify[30489]: 2009/04/10_12:20:01 ERROR: unpack_rsc_op: Preventing IPaddr2_1 from re-starting on afsitfs4.roma1.infn.it
crm_verify[30489]: 2009/04/10_12:20:01 ERROR: unpack_rsc_op: Hard error: IPaddr2_1_monitor_0 failed with rc=2.
crm_verify[30489]: 2009/04/10_12:20:01 ERROR: unpack_rsc_op: Preventing IPaddr2_1 from re-starting on afsitfs3.roma1.infn.it
I've attached cib.xml, ha-log and ha.cf.
Thanks for helping me
cristina
On Apr 8, 2009, at 5:50 PM, Cristina Bulfon wrote:
Dejan,
thanks so much for the explanation :-)
c.
On Apr 8, 2009, at 5:46 PM, Dejan Muhamedagic wrote:
Ciao,
On Wed, Apr 08, 2009 at 04:17:45PM +0200, Cristina Bulfon
wrote:
Ciao Dejan,
thanks for the answer.
Do you mean that I have to use heartbeat V2 plus CRM, and that
there is a way to check the HBA without using hbaping?
Unlike Heartbeat v1, CRM/v2 can monitor resources. I
suppose that
in your case, a failing HBA would cause drbd or Filesystem
monitor action to fail, which would result in either a
failover
or restart, depending on the configuration.
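Concretely, the monitoring described here comes from a recurring monitor operation declared on each resource in cib.xml; a sketch (ids, interval and timeout values are illustrative):

```xml
<primitive id="Filesystem_1" class="ocf" provider="heartbeat" type="Filesystem">
  <operations>
    <!-- run the agent's monitor action every 120s; a failure
         triggers recovery (restart or failover, per the config) -->
    <op id="Filesystem_1_monitor" name="monitor" interval="120s" timeout="60s"/>
  </operations>
</primitive>
```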
Thanks,
Dejan
Just to be sure I have understood correctly. I am a newbie on
heartbeat V2.
thanks
cristina
On Mar 31, 2009, at 2:00 PM, Dejan Muhamedagic wrote:
Ciao,
On Tue, Mar 31, 2009 at 01:48:47PM +0200, Cristina Bulfon
wrote:
Ciao,
in our heartbeat cluster we simulated a broken HBA by unplugging
the fiber from the HBA on the primary node. The resources didn't
switch to the secondary node, and the log file on the primary
node reported the following messages:
Feb 19 14:33:33 afsitfs3 kernel: qla2xxx 0000:0a:01.0: LOOP DOWN detected (2 e678 16ed).
Feb 19 14:33:38 afsitfs3 kernel: qla2xxx 0000:0a:01.1: LOOP DOWN detected (2 8633 16fc).
Feb 19 14:33:46 afsitfs3 kernel: qla2x00: FAILOVER device 2 from 200500a0b832d169 -> 200400a0b832d16a - LUN 10, reason=0x2
Feb 19 14:33:46 afsitfs3 kernel: qla2x00: FROM HBA 0 to HBA 1
Feb 19 14:33:52 afsitfs3 kernel: qla2x00: FAILOVER device 2 from 200400a0b832d16a -> 200500a0b832d16a - LUN 10, reason=0x2
Feb 19 14:33:52 afsitfs3 kernel: qla2x00: FROM HBA 1 to HBA 1
Feb 19 14:33:55 afsitfs3 kernel: qla2x00: FAILOVER device 2 from 200500a0b832d16a -> 200400a0b832d169 - LUN 10, reason=0x2
Feb 19 14:33:55 afsitfs3 kernel: qla2x00: FROM HBA 1 to HBA 0
Feb 19 14:33:58 afsitfs3 kernel: qla2x00: FAILOVER device 2 from 200400a0b832d169 -> 200500a0b832d169 - LUN 10, reason=0x2
Feb 19 14:33:58 afsitfs3 kernel: qla2x00: FROM HBA 0 to HBA 0
Feb 19 14:34:01 afsitfs3 kernel: qla2x00: FAILOVER device 2 from 200500a0b832d169 -> 200400a0b832d16a - LUN 10, reason=0x2
In some way I expected this kind of message, but I do not
understand why the secondary node doesn't take control of the
resources. In ha.cf there is nothing related to the HBA, and the
haresources file is
afsitfs3.roma1.infn.it IPaddr2::Y.Y.Y.Y/24/eth0:0
afsitfs3.roma1.infn.it drbddisk::r0
Filesystem::/dev/drbd1::/vicepa::xfs
afsitfs3.roma1.infn.it drbddisk::r1
Filesystem::/dev/drbd2::/usr/afs::ext3
afsitfs3.roma1.infn.it Y.Y.Y.Y afs
There's no resource monitoring with v1. For that you have
to go
with v2/Pacemaker (aka CRM).
I also tried to use hbaping, compiling hbaapi_src_2.2, but
without success... I got problems during compilation and I didn't
understand whether I have to use libHBAAPI.so from hbaapi or from
the HBA vendor.
That could work with ipfail, perhaps.
Thanks,
Dejan
Our FC controller is a QLogic PCI to Fibre Channel Host Adapter (QLA2342):
Firmware version 3.03.25 IPX, Driver version 8.02.14.01-fo
Thanks in advance
cristina
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems