On Tue, 25 Nov 2008 10:00:22 -0500 "Hal Rosenstock" <[EMAIL PROTECTED]> wrote:
> Hi Rob,
>
> On Tue, Nov 25, 2008 at 9:54 AM, Robert Dunkley <[EMAIL PROTECTED]> wrote:
> > Hi Hal,
> >
> > Thank you for your help.
> >
> > Ibstat on MachineB:
> > CA 'mthca0'
> >     CA type: MT25204
> >     Number of ports: 1
> >     Firmware version: 1.2.0
> >     Hardware version: a0
> >     Node GUID: 0x0002c9020022d428
> >     System image GUID: 0x0002c9020022d42b
> >     Port 1:
> >         State: Down
>
> Is machine A on ? Is mthca loaded there ? If so, this should at least
> be init but the driver errors below may preclude this from occurring.
>
> >         Physical state: Polling
> >         Rate: 10
> >         Base lid: 0
> >         LMC: 0
> >         SM lid: 0
> >         Capability mask: 0x02510a6a
> >         Port GUID: 0x0002c9020022d429
> >
> > Machine A is operating normally with the exception of Infiniband which
> > broke after powering down Machine B and did not recover once Machine B
> > was powered on again. An extract from the log of Machine A:
> > Nov 25 14:30:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> > Nov 25 14:30:31 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
> > Nov 25 14:30:41 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> > Nov 25 14:30:51 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
> > Nov 25 14:31:01 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> > Nov 25 14:31:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_SRQ failed (-11)
> > Nov 25 14:31:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> > Nov 25 14:32:01 mrhappy last message repeated 3 times
> > Nov 25 14:32:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>
> -11 is EAGAIN. Not sure what this is used for in the mthca driver.

When we have seen these errors, it has meant the firmware is in a bad state and is not responsive. Unfortunately for you, in this situation we have been forced to reboot to correct the problem. (If rebooting is problematic for you perhaps Mellanox has a way around this.)

For the future speak with Mellanox to ensure you have the latest firmware as that has fixed a number of items for us.

Ira

>
> Can you unload and reload the IB stack especially mthca driver ?
>
> -- Hal
>
> > Thanks again,
> >
> > Rob
> >
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
> > Sent: 25 November 2008 14:49
> > To: Robert Dunkley
> > Cc: Baur, Eric; [email protected]
> > Subject: Re: [ofa-general] Mellanox Gen3, Linux and ibpanic - "Resource Temporarily unavailable"
> >
> > On Tue, Nov 25, 2008 at 9:39 AM, Robert Dunkley <[EMAIL PROTECTED]> wrote:
> >> Hi Eric,
> >>
> >> Thanks for the response. OpenSM is running and set to start on bootup on
> >> MachineB:
> >> ps aux | grep open
> >> root 5616 0.0 0.1 142004 1396 ? Sl 13:39 0:00 /usr/sbin/opensm -t 200 -f /var/log/opensm.log -g 0
> >>
> >> The log on Machine B just logs this every 10 seconds:
> >> Nov 25 14:34:21 148541 [477A7940] 0x01 -> __osm_sm_state_mgr_signal_error: ERR 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state IB_SMINFO_STATE_DISCOVERING
> >> Nov 25 14:34:31 153173 [477A7940] 0x80 -> SM port is down
> >>
> >> Ibstat confirms port is in polling state on MachineB.
> >
> > Is the port in init or down ?
> >
> >> MachineA however is in a bad state,
> >
> > Any additional details on this ?
> >
> > Can you kill/unload all the ib stuff and reload it ? That would be
> > gentler than rebooting.
> >
> > -- Hal
> >
> >> I tried the openibd restart command, it accepted the
> >> command but after 5 minutes shows no progress of doing anything and is
> >> just at the cursor. Is some sort of forced restart of openibd possible?
> >>
> >> Thanks,
> >>
> >> Rob
> >>
> >>
> >> -----Original Message-----
> >> From: Baur, Eric [mailto:[EMAIL PROTECTED]
> >> Sent: 25 November 2008 14:31
> >> To: Robert Dunkley
> >> Subject: RE: [ofa-general] Mellanox Gen3,Linux and ibpanic - "Resource Temporarily unavailable"
> >>
> >> Robert-
> >>
> >> Is OpenSM set to start on boot?
> >> chkconfig --list | grep opensmd
> >>
> >> If not: chkconfig opensmd on
> >> and: /etc/init.d/opensmd start
> >>
> >> You can also restart openib without rebooting the machines.
> >> /etc/init.d/openibd restart
> >>
> >> -Eric
> >>
> >> -----Original Message-----
> >> From: [EMAIL PROTECTED]
> >> [mailto:[EMAIL PROTECTED] On Behalf Of Robert Dunkley
> >> Sent: Tuesday, November 25, 2008 9:21 AM
> >> To: [email protected]
> >> Subject: [ofa-general] Mellanox Gen3,Linux and ibpanic - "Resource Temporarily unavailable"
> >>
> >> Hi everyone,
> >>
> >> I'm using a setup of two machines (Lets call them A and B) directly
> >> connected by 1 cable. Each machine has a Mellanox MT25204 (Gen3 Mellanox
> >> PCI-E Infiniband card) and uses IPOIB, they run Centos 5.2 with OFED 1.3
> >> installed, Machine B runs OpenSM.
> >>
> >> All was working fine. I shutdown Machine A did some maintenance and then
> >> powered it on again, everything is OK again. I then shutdown Machine B
> >> (The one running OpenSM), this seemed to really upset Machine A. After
> >> booting Machine B again, Machine B looks OK with the port down and in
> >> polling state. Machine A however gives the following error if I run
> >> ibstat: ibpanic: [11406] main: stat of IB device 'mthca0' failed:
> >> (Resource temporarily unavailable)
> >>
> >> I don't want to reboot Machine A as it must synch data with Machine B
> >> over the Infiniband link first. Does anyone have any idea how to fix
> >> machine A?
> >>
> >> Thanks,
> >>
> >> Rob
> >>
> >> The SAQ Group
> >>
> >> Registered Office: 18 Chapel Street, Petersfield, Hampshire GU32 3DZ
> >> SEMTEC Limited Trading as SAQ is Registered in England & Wales
> >> Company Number: 06481952
> >>
> >> http://www.saqnet.co.uk AS29219
> >>
> >> SAQ Group Delivers high quality, honestly priced communication and I.T.
> >> services to UK Business.
> >>
> >> DSL : Domains : Email : Hosting : CoLo : Servers : Racks : Transit :
> >> Backups : Managed Networks : Remote Support.
> >>
> >> Find us in http://www.thebestof.co.uk/petersfield
> >>
> >> _______________________________________________
> >> general mailing list
> >> [email protected]
> >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >>
> >> To unsubscribe, please visit
> >> http://openib.org/mailman/listinfo/openib-general
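
For anyone hitting the same symptoms: the thread identifies the repeated HW2SW_* failures with status -11 (EAGAIN on Linux) as the sign that the HCA firmware has stopped responding. Below is a minimal sketch for tallying those failures from a kernel log. The function names and the regular expression are assumptions based only on the log lines quoted in this thread; they are not part of OFED or the mthca driver, and syslog lines such as "last message repeated 3 times" are not expanded.

```python
import errno
import re
from collections import Counter

# Matches lines shaped like the ones quoted in this thread, e.g.
#   ... kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
# The pattern is an assumption derived from those lines only.
FAILURE_RE = re.compile(r"ib_mthca \S+ (HW2SW_\w+) failed\s+\((-?\d+)\)")


def tally_mthca_failures(log_text):
    """Count (command, status) pairs for ib_mthca 'HW2SW_* failed' lines."""
    counts = Counter()
    for match in FAILURE_RE.finditer(log_text):
        command = match.group(1)
        status = int(match.group(2))
        counts[(command, status)] += 1
    return counts


def is_eagain(status):
    """True if a negative kernel-style status equals -EAGAIN on this platform."""
    return -status == errno.EAGAIN


if __name__ == "__main__":
    # Sample lines copied from the log extract earlier in the thread.
    sample = (
        "Nov 25 14:30:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)\n"
        "Nov 25 14:30:31 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)\n"
        "Nov 25 14:31:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_SRQ failed (-11)\n"
    )
    for (command, status), count in sorted(tally_mthca_failures(sample).items()):
        note = "EAGAIN" if is_eagain(status) else "other"
        print(command, status, count, note)
```

A steady stream of EAGAIN results across several different commands (MPT, CQ, SRQ), as in the extract above, matches the "firmware not responsive" case Ira describes, where a reboot was the only fix that worked.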
