Hi,

On Wed, Sep 17, 2008 at 4:25 AM, Wen Hao Wang <[EMAIL PROTECTED]> wrote:
> Hi all:
>
> I had one IB cluster with eight IBM HS21 blades, mixed with RHEL5.2 Server
> and SLES10 SP2. All of them connected to one IB switch. opensm was running
> as subnet manager on one blade. Command ibcheckerrors finished smoothly.
> Last week I got another eight IBM LS21 blades connected to another IB
> switch. But after I connected the two switches and turned on all the IB
> adapters on the new blades, ibcheckerrors gave this error message:
>
> [EMAIL PROTECTED] ~]# ibcheckerrors
> #warn: counter RcvErrors = 5691 (threshold 10) lid 3 port 1
> Error check on lid 3 (gaia-07 HCA-1) port 1: FAILED
>
> ## Summary: 19 nodes checked, 0 bad nodes found
> ## 46 ports checked, 1 ports have errors beyond threshold
> [EMAIL PROTECTED] ~]# ibv_devinfo
> hca_id: mlx4_0
>     fw_ver: 2.3.000
>     node_guid: 0002:c903:0001:3370
>     sys_image_guid: 0002:c903:0001:3373
>     vendor_id: 0x02c9
>     vendor_part_id: 25418
>     hw_ver: 0xA0
>     board_id: IBM08A0000001
>     phys_port_cnt: 2
>         port: 1
>             state: PORT_ACTIVE (4)
>             max_mtu: 2048 (4)
>             active_mtu: 2048 (4)
>             sm_lid: 15
>             port_lid: 3
>             port_lmc: 0x00
>
>         port: 2
>             state: PORT_DOWN (1)
>             max_mtu: 2048 (4)
>             active_mtu: 2048 (4)
>             sm_lid: 0
>             port_lid: 0
>             port_lmc: 0x00
> [EMAIL PROTECTED] ~]# ibcheckport 3 1
> [EMAIL PROTECTED] ~]# echo $?
> 0
>
> I had closed the embedded subnet manager on the two IB switches. The issue
> always exists, even after I change the subnet manager location to another
> machine. ib0 of machine gaia-07 can communicate with the other machines. All
> installed IB adapters are ConnectX 4xSDR. Both switches are Topspin
> Switches. Will anyone give some advice about this issue? Thanks in advance!

counter RcvErrors = 5691 is the value of PortCounters:RcvErrors. Per IBA
section 16.1.3.5, it includes:
• Local physical errors (ICRC, VCRC, LPCRC, and all physical errors that
  cause entry into the BAD PACKET or BAD PACKET DISCARD states of the
  packet receiver state machine)
• Malformed data packet errors (LVer, length, VL)
• Malformed link packet errors (operand, length, VL)
• Packets discarded due to buffer overrun

Those errors may have occurred when you plugged in the additional nodes.
You might want to clear the errors first and then see whether they keep
increasing or stay stable.

-- Hal

>
> Wen Hao Wang
> Email: [EMAIL PROTECTED]
>
> _______________________________________________
> general mailing list
> [email protected]
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
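
For the "clear, then watch" check suggested in the reply, a minimal sketch in Python follows. It assumes you save the output of two ibcheckerrors runs (one right after clearing the counters, one some time later) and that the "#warn:" lines look like the one quoted in this thread. The function names parse_warnings and compare_snapshots are hypothetical helpers, not part of any IB tool, and clearing the counters itself is left to your fabric tools and is not shown here.

```python
import re

# Pattern for "#warn:" lines in the format quoted in this thread, e.g.
#   #warn: counter RcvErrors = 5691 (threshold 10) lid 3 port 1
# Assumption: other versions of the tool may format this line differently.
WARN_RE = re.compile(
    r"#warn:\s+counter\s+(?P<counter>\w+)\s*=\s*(?P<value>\d+)\s+"
    r"\(threshold\s+(?P<threshold>\d+)\)\s+lid\s+(?P<lid>\d+)\s+port\s+(?P<port>\d+)"
)


def parse_warnings(text):
    """Return {(lid, port, counter): value} for every #warn line in text."""
    counters = {}
    for m in WARN_RE.finditer(text):
        key = (int(m["lid"]), int(m["port"]), m["counter"])
        counters[key] = int(m["value"])
    return counters


def compare_snapshots(before, after):
    """Label each counter warned about in `after` as increasing or stable.

    A counter that is missing from `before` was below the warning threshold
    at that time; it is treated as 0 here, which is only an approximation.
    """
    old = parse_warnings(before)
    new = parse_warnings(after)
    report = {}
    for key, value in new.items():
        delta = value - old.get(key, 0)
        report[key] = ("increasing" if delta > 0 else "stable", delta)
    return report


if __name__ == "__main__":
    # First line is the one from this thread; the second value is made up
    # purely to exercise the comparison.
    first = "#warn: counter RcvErrors = 5691 (threshold 10) lid 3 port 1\n"
    second = "#warn: counter RcvErrors = 5700 (threshold 10) lid 3 port 1\n"
    for (lid, port, name), (status, delta) in compare_snapshots(first, second).items():
        print(f"lid {lid} port {port} {name}: {status} (delta {delta})")
```

If the counter stays flat after clearing, the errors most likely date from the recabling. If it keeps climbing, the link on that port deserves a closer look.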
