Hi Rob,

On Tue, Nov 25, 2008 at 9:54 AM, Robert Dunkley <[EMAIL PROTECTED]> wrote:
> Hi Hal,
>
> Thank you for your help.
>
> Ibstat on MachineB:
> CA 'mthca0'
>         CA type: MT25204
>         Number of ports: 1
>         Firmware version: 1.2.0
>         Hardware version: a0
>         Node GUID: 0x0002c9020022d428
>         System image GUID: 0x0002c9020022d42b
>         Port 1:
>                 State: Down

Is Machine A on? Is mthca loaded there? If so, the port should at least
be in Init, but the driver errors below may preclude this from occurring.

>                 Physical state: Polling
>                 Rate: 10
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x02510a6a
>                 Port GUID: 0x0002c9020022d429
>
> Machine A is operating normally with the exception of Infiniband which
> broke after powering down Machine B and did not recover once Machine B
> was powered on again. An extract from the log of Machine A:
> Nov 25 14:30:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> Nov 25 14:30:31 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
> Nov 25 14:30:41 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> Nov 25 14:30:51 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
> Nov 25 14:31:01 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> Nov 25 14:31:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_SRQ failed (-11)
> Nov 25 14:31:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> Nov 25 14:32:01 mrhappy last message repeated 3 times
> Nov 25 14:32:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)

-11 is EAGAIN. Not sure what this is used for in the mthca driver.

Can you unload and reload the IB stack, especially the mthca driver?

-- Hal

> Thanks again,
>
> Rob
>
> -----Original Message-----
> From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
> Sent: 25 November 2008 14:49
> To: Robert Dunkley
> Cc: Baur, Eric; [email protected]
> Subject: Re: [ofa-general] Mellanox Gen3, Linux and ibpanic - "Resource Temporarily unavailable"
>
> On Tue, Nov 25, 2008 at 9:39 AM, Robert Dunkley <[EMAIL PROTECTED]> wrote:
>> Hi Eric,
>>
>> Thanks for the response. OpenSM is running and set to start on bootup on
>> MachineB:
>> ps aux | grep open
>> root 5616 0.0 0.1 142004 1396 ? Sl 13:39 0:00 /usr/sbin/opensm -t 200 -f /var/log/opensm.log -g 0
>>
>> The log on Machine B just logs this every 10 seconds:
>> Nov 25 14:34:21 148541 [477A7940] 0x01 -> __osm_sm_state_mgr_signal_error: ERR 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state IB_SMINFO_STATE_DISCOVERING
>> Nov 25 14:34:31 153173 [477A7940] 0x80 -> SM port is down
>>
>> Ibstat confirms port is in polling state on MachineB.
>
> Is the port in init or down?
>
>> MachineA however is in a bad state,
>
> Any additional details on this?
>
> Can you kill/unload all the ib stuff and reload it? That would be
> gentler than rebooting.
>
> -- Hal
>
>> I tried the openibd restart command, it accepted the
>> command but after 5 minutes shows no progress of doing anything and is
>> just at the cursor. Is some sort of forced restart of openibd possible?
>>
>> Thanks,
>>
>> Rob
>>
>>
>> -----Original Message-----
>> From: Baur, Eric [mailto:[EMAIL PROTECTED]
>> Sent: 25 November 2008 14:31
>> To: Robert Dunkley
>> Subject: RE: [ofa-general] Mellanox Gen3,Linux and ibpanic - "Resource Temporarily unavailable"
>>
>> Robert-
>>
>> Is OpenSM set to start on boot?
>> chkconfig --list | grep opensmd
>>
>> If not: chkconfig opensmd on
>> and: /etc/init.d/opensmd start
>>
>> You can also restart openib without rebooting the machines.
>> /etc/init.d/openibd restart
>>
>> -Eric
>>
>> -----Original Message-----
>> From: [EMAIL PROTECTED]
>> [mailto:[EMAIL PROTECTED] On Behalf Of Robert Dunkley
>> Sent: Tuesday, November 25, 2008 9:21 AM
>> To: [email protected]
>> Subject: [ofa-general] Mellanox Gen3,Linux and ibpanic - "Resource Temporarily unavailable"
>>
>> Hi everyone,
>>
>> I'm using a setup of two machines (Let's call them A and B) directly
>> connected by 1 cable. Each machine has a Mellanox MT25204 (Gen3 Mellanox
>> PCI-E Infiniband card) and uses IPOIB, they run Centos 5.2 with OFED 1.3
>> installed, Machine B runs OpenSM.
>>
>> All was working fine.
>> I shut down Machine A, did some maintenance and then
>> powered it on again; everything was OK again. I then shut down Machine B
>> (the one running OpenSM); this seemed to really upset Machine A. After
>> booting Machine B again, Machine B looks OK with the port down and in
>> polling state. Machine A however gives the following error if I run
>> ibstat:
>> ibpanic: [11406] main: stat of IB device 'mthca0' failed: (Resource temporarily unavailable)
>>
>> I don't want to reboot Machine A as it must synch data with Machine B
>> over the Infiniband link first. Does anyone have any idea how to fix
>> Machine A?
>>
>> Thanks,
>>
>> Rob
>>
>> The SAQ Group
>>
>> Registered Office: 18 Chapel Street, Petersfield, Hampshire GU32 3DZ
>> SEMTEC Limited Trading as SAQ is Registered in England & Wales
>> Company Number: 06481952
>>
>> http://www.saqnet.co.uk AS29219
>>
>> SAQ Group Delivers high quality, honestly priced communication and I.T.
>> services to UK Business.
>>
>> DSL : Domains : Email : Hosting : CoLo : Servers : Racks : Transit :
>> Backups : Managed Networks : Remote Support.
>>
>> Find us in http://www.thebestof.co.uk/petersfield
>>
>> _______________________________________________
>> general mailing list
>> [email protected]
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>
>> To unsubscribe, please visit
>> http://openib.org/mailman/listinfo/openib-general
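
For reference, the -11 in the ib_mthca log lines and the "Resource
temporarily unavailable" text that ibstat printed on Machine A look like
the same errno (EAGAIN is 11 on Linux). The following is only a rough
sketch for tallying those log lines, not anything taken from the driver
or from OFED: the sample input is copied from the log quoted in this
thread (with the wrapped "(-11)" rejoined), and the helper name
summarize_mthca_failures and the regular expression are my own
assumptions about the message format.

```python
import errno
import os
import re
from collections import Counter

# Sample lines copied from the Machine A log quoted in this thread.
SAMPLE_LOG = """\
Nov 25 14:30:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
Nov 25 14:30:31 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
Nov 25 14:31:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_SRQ failed (-11)
Nov 25 14:31:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
"""

# Assumed message shape: "ib_mthca <pci address>: <NAME> failed (<code>)".
FAILURE_RE = re.compile(r"ib_mthca \S+: (\w+) failed \((-?\d+)\)")


def summarize_mthca_failures(log_text):
    """Count failure messages, keyed by (name, positive errno value)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILURE_RE.search(line)
        if match is None:
            continue  # not an ib_mthca failure line
        name = match.group(1)
        code = abs(int(match.group(2)))  # the kernel logs errno negated
        counts[(name, code)] += 1
    return counts


if __name__ == "__main__":
    for (name, code), n in sorted(summarize_mthca_failures(SAMPLE_LOG).items()):
        symbol = errno.errorcode.get(code, "unknown")
        print("%-10s errno %d (%s): %s, seen %d time(s)"
              % (name, code, symbol, os.strerror(code), n))
```

Run against the full /var/log/messages extract, this would show whether
anything other than -11 ever appears, which might help whoever knows the
mthca driver narrow down why the port never leaves Down.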
