Hi Rob,

On Tue, Nov 25, 2008 at 10:21 AM, Robert Dunkley <[EMAIL PROTECTED]> wrote:
> Hi Hal,
>
> Thanks again, I will try this in a minute. I think I have found the
> moment it went bad on Machine A using Dmesg:
> ib_mthca 0000:87:00.0: Catastrophic error detected: unknown error

Definitely need to reset mthca after this.

> ib_mthca 0000:87:00.0: buf[00]: ffffffff
> ib_mthca 0000:87:00.0: buf[01]: ffffffff
> ib_mthca 0000:87:00.0: buf[02]: ffffffff
> ib_mthca 0000:87:00.0: buf[03]: ffffffff
> ib_mthca 0000:87:00.0: buf[04]: ffffffff
> ib_mthca 0000:87:00.0: buf[05]: ffffffff
> ib_mthca 0000:87:00.0: buf[06]: ffffffff
> ib_mthca 0000:87:00.0: buf[07]: ffffffff
> ib_mthca 0000:87:00.0: buf[08]: ffffffff
> ib_mthca 0000:87:00.0: buf[09]: ffffffff
> ib_mthca 0000:87:00.0: buf[0a]: ffffffff
> ib_mthca 0000:87:00.0: buf[0b]: ffffffff
> ib_mthca 0000:87:00.0: buf[0c]: ffffffff
> ib_mthca 0000:87:00.0: buf[0d]: ffffffff
> ib_mthca 0000:87:00.0: buf[0e]: ffffffff
> ib_mthca 0000:87:00.0: buf[0f]: ffffffff
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib0: ib_query_gid() failed
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib0: ib_query_port failed
> ib0: Failed to modify QP to ERROR state
> ib0: timing out; 1 sends 250 receives not completed
> ib0: Failed to modify QP to RESET state
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_SRQ failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>
> Does this help to pinpoint what might have caused this?

Maybe Mellanox can comment. What firmware version are you using ?

-- Hal

> Thanks,
>
> Rob
>
>
> -----Original Message-----
> From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
> Sent: 25 November 2008 15:19
> To: Robert Dunkley
> Subject: Re: [ofa-general] Mellanox Gen3, Linux and ibpanic - "Resource
> Temporarily unavailable"
>
> Hi Rob,
>
> On Tue, Nov 25, 2008 at 10:01 AM, Robert Dunkley <[EMAIL PROTECTED]>
> wrote:
>> Hi Hal,
>>
>> Machine A is definitely on and I have had the cable connection checked.
>> I'm afraid I'm not much of a techy, how do I unload and reload the IB
>> stack?
>
> It depends on what you have running... Is it just OpenSM and IPoIB ?
>
> Kill off opensm
>
> Use modprobe -r to remove all the ib_ modules. You can find them via
> lsmod | grep ib_. There is a dependency order.
>
> If you can get them all unloaded, reload them in the reverse order and
> hopefully things will be better...
>
> -- Hal
>
>> Thanks,
>>
>> Rob
>>
>>
>> -----Original Message-----
>> From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
>> Sent: 25 November 2008 15:00
>> To: Robert Dunkley
>> Cc: Baur, Eric; [email protected]
>> Subject: Re: [ofa-general] Mellanox Gen3, Linux and ibpanic - "Resource
>> Temporarily unavailable"
>>
>> Hi Rob,
>>
>> On Tue, Nov 25, 2008 at 9:54 AM, Robert Dunkley <[EMAIL PROTECTED]>
>> wrote:
>>> Hi Hal,
>>>
>>> Thank you for your help.
>>>
>>> Ibstat on MachineB:
>>> CA 'mthca0'
>>> CA type: MT25204
>>> Number of ports: 1
>>> Firmware version: 1.2.0
>>> Hardware version: a0
>>> Node GUID: 0x0002c9020022d428
>>> System image GUID: 0x0002c9020022d42b
>>> Port 1:
>>> State: Down
>>
>> Is machine A on ? Is mthca loaded there ? If so, this should at least
>> be init but the driver errors below may preclude this from occurring.
>>
>>> Physical state: Polling
>>> Rate: 10
>>> Base lid: 0
>>> LMC: 0
>>> SM lid: 0
>>> Capability mask: 0x02510a6a
>>> Port GUID: 0x0002c9020022d429
>>>
>>> Machine A is operating normally with the exception of Infiniband which
>>> broke after powering down Machine B and did not recover once Machine B
>>> was powered on again. An extract from the log of Machine A:
>>> Nov 25 14:30:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>>> Nov 25 14:30:31 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
>>> Nov 25 14:30:41 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>>> Nov 25 14:30:51 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
>>> Nov 25 14:31:01 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>>> Nov 25 14:31:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_SRQ failed (-11)
>>> Nov 25 14:31:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>>> Nov 25 14:32:01 mrhappy last message repeated 3 times
>>> Nov 25 14:32:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>>
>> -11 is EAGAIN. Not sure what this is used for in the mthca driver.
>>
>> Can you unload and reload the IB stack especially mthca driver ?
>>
>> -- Hal
>>
>>> Thanks again,
>>>
>>> Rob
>>>
>>> -----Original Message-----
>>> From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
>>> Sent: 25 November 2008 14:49
>>> To: Robert Dunkley
>>> Cc: Baur, Eric; [email protected]
>>> Subject: Re: [ofa-general] Mellanox Gen3, Linux and ibpanic - "Resource
>>> Temporarily unavailable"
>>>
>>> On Tue, Nov 25, 2008 at 9:39 AM, Robert Dunkley <[EMAIL PROTECTED]>
>>> wrote:
>>>> Hi Eric,
>>>>
>>>> Thanks for the response. OpenSM is running and set to start on bootup on
>>>> MachineB:
>>>> ps aux | grep open
>>>> root 5616 0.0 0.1 142004 1396 ? Sl 13:39 0:00 /usr/sbin/opensm -t 200 -f /var/log/opensm.log -g 0
>>>>
>>>> The log on Machine B just logs this every 10 seconds:
>>>> Nov 25 14:34:21 148541 [477A7940] 0x01 ->
>>>> __osm_sm_state_mgr_signal_error: ERR 3207: Invalid signal
>>>> OSM_SM_SIGNAL_DISCOVER in state IB_SMINFO_STATE_DISCOVERING
>>>> Nov 25 14:34:31 153173 [477A7940] 0x80 -> SM port is down
>>>>
>>>> Ibstat confirms port is in polling state on MachineB.
>>>
>>> Is the port in init or down ?
>>>
>>>> MachineA however is in a bad state,
>>>
>>> Any additional details on this ?
>>>
>>> Can you kill/unload all the ib stuff and reload it ? That would be
>>> gentler than rebooting.
>>>
>>> -- Hal
>>>
>>>> I tried the openibd restart command, it accepted the
>>>> command but after 5 minutes shows no progress of doing anything and is
>>>> just at the cursor. Is some sort of forced restart of openibd possible?
>>>>
>>>> Thanks,
>>>>
>>>> Rob
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Baur, Eric [mailto:[EMAIL PROTECTED]
>>>> Sent: 25 November 2008 14:31
>>>> To: Robert Dunkley
>>>> Subject: RE: [ofa-general] Mellanox Gen3,Linux and ibpanic - "Resource
>>>> Temporarily unavailable"
>>>>
>>>> Robert-
>>>>
>>>> Is OpenSM set to start on boot?
>>>> chkconfig --list | grep opensmd
>>>>
>>>> If not: chkconfig opensmd on
>>>> and: /etc/init.d/opensmd start
>>>>
>>>> You can also restart openib without rebooting the machines.
>>>> /etc/init.d/openibd restart
>>>>
>>>> -Eric
>>>>
>>>> -----Original Message-----
>>>> From: [EMAIL PROTECTED]
>>>> [mailto:[EMAIL PROTECTED] On Behalf Of Robert
>>>> Dunkley
>>>> Sent: Tuesday, November 25, 2008 9:21 AM
>>>> To: [email protected]
>>>> Subject: [ofa-general] Mellanox Gen3,Linux and ibpanic - "Resource
>>>> Temporarily unavailable"
>>>>
>>>> Hi everyone,
>>>>
>>>> I'm using a setup of two machines (Lets call them A and B) directly
>>>> connected by 1 cable. Each machine has a Mellanox MT25204 (Gen3 Mellanox
>>>> PCI-E Infiniband card) and uses IPOIB, they run Centos 5.2 with OFED 1.3
>>>> installed, Machine B runs OpenSM.
>>>>
>>>> All was working fine. I shutdown Machine A did some maintenance and then
>>>> powered it on again, everything is OK again. I then shutdown Machine B
>>>> (The one running OpenSM), this seemed to really upset Machine A. After
>>>> booting Machine B again, Machine B looks OK with the port down and in
>>>> polling state. Machine A however gives the following error if I run
>>>> ibstat: ibpanic: [11406] main: stat of IB device 'mthca0' failed:
>>>> (Resource temporarily unavailable)
>>>>
>>>> I don't want to reboot Machine A as it must synch data with Machine B
>>>> over the Infiniband link first. Does anyone have any idea how to fix
>>>> machine A?
>>>>
>>>> Thanks,
>>>>
>>>> Rob
>>>>
>>>> The SAQ Group
>>>>
>>>> Registered Office: 18 Chapel Street, Petersfield, Hampshire GU32 3DZ
>>>> SEMTEC Limited Trading as SAQ is Registered in England & Wales
>>>> Company Number: 06481952
>>>>
>>>>
>>>>
>>>> http://www.saqnet.co.uk AS29219
>>>>
>>>> SAQ Group Delivers high quality, honestly priced communication and I.T.
>>>> services to UK Business.
>>>>
>>>> DSL : Domains : Email : Hosting : CoLo : Servers : Racks : Transit :
>>>> Backups : Managed Networks : Remote Support.
>>>>
>>>> Find us in http://www.thebestof.co.uk/petersfield
>>>>
>>>> _______________________________________________
>>>> general mailing list
>>>> [email protected]
>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>>
>>>> To unsubscribe, please visit
>>>> http://openib.org/mailman/listinfo/openib-general
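
The unload/reload step discussed in the thread (remove the ib_ modules with modprobe -r in dependency order, then reload them in reverse) can be sketched roughly as below. This is only an illustration, not something posted to the list: the helper name unload_order and the SAMPLE_LSMOD listing (module sizes and "Used by" relationships) are invented for the example, and the real output of lsmod on Machine A may well differ. The idea is simply that a module can be removed once nothing that is still loaded uses it.

```python
# Hypothetical lsmod output; sizes and "Used by" columns are made up
# for illustration and are NOT taken from the machines in this thread.
SAMPLE_LSMOD = """\
Module                  Size  Used by
ib_ipoib               90000  0
ib_sa                  70000  1 ib_ipoib
ib_mthca              150000  0
ib_mad                 60000  2 ib_sa,ib_mthca
ib_core                90000  4 ib_ipoib,ib_sa,ib_mthca,ib_mad
"""


def unload_order(lsmod_text, prefix="ib_"):
    """Return module names in an order where each one is removed only
    after every module that uses it has been removed."""
    users = {}
    for line in lsmod_text.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] == "Module":
            continue  # skip blank lines and the header row
        used_by = fields[3].split(",") if len(fields) > 3 else []
        # Ignore empty entries and bracketed flags such as [permanent].
        users[fields[0]] = {u for u in used_by if u and not u.startswith("[")}

    # Start from the prefixed modules and pull in anything that uses them,
    # since those users have to go first.
    pending = {m for m in users if m.startswith(prefix)}
    stack = list(pending)
    while stack:
        for user in users.get(stack.pop(), ()):
            if user not in pending:
                pending.add(user)
                stack.append(user)

    order = []
    while pending:
        # A module is free to remove when none of its users is still pending.
        free = sorted(m for m in pending if not (users.get(m, set()) & pending))
        if not free:
            raise ValueError("circular or unresolved module usage")
        order.extend(free)
        pending.difference_update(free)
    return order
```

With the sample listing, the modules would be removed starting from ib_ipoib and ib_mthca and ending with ib_core; reloading would walk the same list with reversed(). The sketch only computes the order and deliberately runs nothing: the actual modprobe -r and modprobe calls still need root and should be checked against the real lsmod output first.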
