On Fri, 11 Jun 2010 11:29:24 -0700 "Hefty, Sean" <[email protected]> wrote:
> The following opensm log is from a windows port of opensm 3.3.6. Can anyone > familiar with the error messages explain whether the errors that are > reported should ever happen? Error 9 is a timeout, so I'm not concerned > about that. I am concerned about the 'failed to obtain request madw' > errors. > > - Sean > > [Jun-10-2010 15:45:37:486][0C2C] 0x03 -> OpenSM 3.3.6 UMAD > [Jun-10-2010 15:45:37:486][0C2C] 0x80 -> OpenSM 3.3.6 UMAD > [Jun-10-2010 15:45:37:501][0C2C] 0x02 -> osm_vendor_init: 1000 pending umads > specified > [Jun-10-2010 15:45:37:501][0C2C] 0x80 -> Entering DISCOVERING state > [Jun-10-2010 15:45:37:501][0C2C] 0x02 -> osm_vendor_bind: Binding to port > 0x2c9020000409d > [Jun-10-2010 15:45:37:548][0C2C] 0x02 -> osm_vendor_bind: Binding to port > 0x2c9020000409d > [Jun-10-2010 15:45:37:548][0C2C] 0x02 -> osm_opensm_bind: Setting IS_SM on > port 0x0002c9020000409d > [Jun-10-2010 15:45:38:375][08E0] 0x01 -> umad_receiver: ERR 5411: DR SMP Send > completed with error(9) -- dropping > Method 0x1, Attr 0x20, TID 0x12a5, Hop Ptr: 0x0 > [Jun-10-2010 15:45:38:375][08E0] 0x01 -> Received SMP on a 2 hop path: > Initial path = 0,1,10, Return path = 0,0,0 > [Jun-10-2010 15:45:38:375][08E0] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: > MAD completed in error (IB_TIMEOUT): SubnGet(SMInfo), attr_mod 0x0, TID 0x12a5 > [Jun-10-2010 15:45:38:375][08E0] 0x01 -> umad_receiver: ERR 5411: DR SMP Send > completed with error(9) -- dropping > Method 0x1, Attr 0x20, TID 0x12a8, Hop Ptr: 0x0 > [Jun-10-2010 15:45:38:375][08E0] 0x01 -> Received SMP on a 2 hop path: > Initial path = 0,1,14, Return path = 0,0,0 > [Jun-10-2010 15:45:38:375][08E0] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: > MAD completed in error (IB_TIMEOUT): SubnGet(SMInfo), attr_mod 0x0, TID 0x12a8 > [Jun-10-2010 15:45:38:375][0CCC] 0x80 -> Entering MASTER state > [Jun-10-2010 15:45:38:375][0CCC] 0x02 -> osm_ucast_mgr_process: minhop tables > configured on all switches > [Jun-10-2010 15:45:38:375][0A1C] 0x01 -> log_trap_info: Received Generic > Notice type:4 num:144 (CapabilityMask, NodeDescription, Link [Width|Speed] > Enabled, SM priority changed) Producer:1 (Channel Adapter) from LID:7 > TID:0x0000000000000010 > [Jun-10-2010 15:45:38:375][0A1C] 0x01 -> osm_get_port_by_mad_addr: ERR 7504: > Lid is out of range: 0 > [Jun-10-2010 15:45:38:375][0A1C] 0x01 -> trap_rcv_process_request: ERR 3809: > Failed to find source physical port for trap > [Jun-10-2010 15:45:38:375][0A1C] 0x02 -> log_notice: Reporting Generic Notice > type:4 num:144 (CapabilityMask, NodeDescription, Link [Width|Speed] Enabled, > SM priority changed) from LID:7 GID:fe80::2:c902:0:409d > > [Jun-10-2010 15:45:38:375][0CCC] 0x80 -> SUBNET UP > > [Jun-10-2010 15:45:38:609][08E0] 0x01 -> umad_receiver: ERR 5410: Send > completed with error(9) -- dropping > Class 0x3, Method 0x92, Attr 0x11, TID 0x10001b90b, > LID 17 > [Jun-10-2010 15:45:38:609][08E0] 0x01 -> umad_receiver: ERR 5412: Failed to > obtain request madw for timed out MAD (method=0x92 attr=0x11 tid=0x10001b90b) > -- dropping I think this might be an issue with RMPP. SubnAdmGetTableResp uses RMPP and umad_receiver does a second umad_recv for RMPP. My guess is that this second umad_recv is not being done. The timeout non-RMPP packet causes the request madw to be "free'ed". The next time umad_recv is called the same TID is returned? This causes the request mad to not be found because it was "free'ed" by the previous umad_recv. What does a read from the umad device on Windows return when an RMPP packet is being read from the kernel? I guess there could be an incompatibility in libibumad when a RMPP packet times out. Ira > [Jun-10-2010 15:45:38:609][08E0] 0x01 -> umad_receiver: ERR 5410: Send > completed with error(9) -- dropping > Class 0x3, Method 0x92, Attr 0x35, TID 0x10001b90c, > LID 17 > [Jun-10-2010 15:45:38:609][08E0] 0x01 -> umad_receiver: ERR 5412: Failed to > obtain request madw for timed out MAD (method=0x92 attr=0x35 tid=0x10001b90c) > -- dropping > [Jun-10-2010 15:45:38:609][08E0] 0x01 -> umad_receiver: ERR 5410: Send > completed with error(9) -- dropping > Class 0x3, Method 0x92, Attr 0x11, TID 0x10000012a, > LID 1 > [Jun-10-2010 15:45:38:609][08E0] 0x01 -> umad_receiver: ERR 5412: Failed to > obtain request madw for timed out MAD (method=0x92 attr=0x11 tid=0x10000012a) > -- dropping > [Jun-10-2010 15:45:38:625][08E0] 0x01 -> umad_receiver: ERR 5410: Send > completed with error(9) -- dropping > Class 0x3, Method 0x92, Attr 0x35, TID 0x100016375, > LID 4 > [Jun-10-2010 15:45:38:625][08E0] 0x01 -> umad_receiver: ERR 5412: Failed to > obtain request madw for timed out MAD (method=0x92 attr=0x35 tid=0x100016375) > -- dropping > [Jun-10-2010 15:45:38:625][08E0] 0x01 -> umad_receiver: ERR 5410: Send > completed with error(9) -- dropping > Class 0x3, Method 0x92, Attr 0x11, TID 0x100016374, > LID 4 > [Jun-10-2010 15:45:38:625][08E0] 0x01 -> umad_receiver: ERR 5412: Failed to > obtain request madw for timed out MAD (method=0x92 attr=0x11 tid=0x100016374) > -- dropping > [Jun-10-2010 15:45:38:625][08E0] 0x01 -> umad_receiver: ERR 5410: Send > completed with error(9) -- dropping > Class 0x3, Method 0x92, Attr 0x35, TID 0x100000d38, > LID 6 > [Jun-10-2010 15:45:38:625][08E0] 0x01 -> umad_receiver: ERR 5412: Failed to > obtain request madw for timed out MAD (method=0x92 attr=0x35 tid=0x100000d38) > -- dropping > [Jun-10-2010 15:45:38:625][08E0] 0x01 -> umad_receiver: ERR 5410: Send > completed with error(9) -- dropping > Class 0x3, Method 0x92, Attr 0x11, TID 0x100000d37, > LID 6 > [Jun-10-2010 15:45:38:625][08E0] 0x01 -> umad_receiver: ERR 5412: Failed to > obtain request madw for timed out MAD (method=0x92 attr=0x11 tid=0x100000d37) > -- dropping > [Jun-10-2010 15:45:38:874][0D5C] 0x02 -> log_notice: Reporting Generic Notice > type:3 num:66 (New mcast group created) from LID:7 > GID:ff12:401b:ffff::ffff:ffff > [Jun-10-2010 15:45:38:874][0D5C] 0x02 -> log_notice: Reporting Generic Notice > type:3 num:66 (New mcast group created) from LID:7 GID:ff12:401b:ffff::1 > [Jun-10-2010 15:45:38:874][0630] 0x02 -> log_notice: Reporting Generic Notice > type:3 num:66 (New mcast group created) from LID:7 GID:ff12:401b:ffff::7f:fffa > [Jun-10-2010 15:45:38:874][0EF0] 0x02 -> log_notice: Reporting Generic Notice > type:3 num:66 (New mcast group created) from LID:7 > GID:ff12:1405:ffff::3333:0:1 > [Jun-10-2010 15:45:38:874][0630] 0x02 -> log_notice: Reporting Generic Notice > type:3 num:66 (New mcast group created) from LID:7 > GID:ff12:1405:ffff::3333:1:3 > [Jun-10-2010 15:45:38:874][0A1C] 0x02 -> log_notice: Reporting Generic Notice > type:3 num:66 (New mcast group created) from LID:7 > GID:ff12:1405:ffff::3333:ffb7:99a9 > [Jun-10-2010 15:45:38:874][0EF0] 0x02 -> log_notice: Reporting Generic Notice > type:3 num:66 (New mcast group created) from LID:7 GID:ff12:401b:ffff::fc > [Jun-10-2010 15:45:38:874][0A1C] 0x02 -> log_notice: Reporting Generic Notice > type:3 num:66 (New mcast group created) from LID:7 > GID:ff12:1405:ffff::3333:ffb8:a9ed > [Jun-10-2010 15:45:38:874][0D5C] 0x02 -> log_notice: Reporting Generic Notice > type:3 num:66 (New mcast group created) from LID:7 > GID:ff12:1405:ffff::3333:0:c > [Jun-10-2010 15:45:38:874][0630] 0x02 -> log_notice: Reporting Generic Notice > type:3 num:66 (New mcast group created) from LID:7 > GID:ff12:1405:ffff::3333:ffe9:b65e > [Jun-10-2010 15:45:38:890][08E0] 0x01 -> umad_receiver: ERR 5410: Send > completed with error(9) -- dropping > Class 0x3, Method 0x92, Attr 0x38, TID 0x1000003cd, > LID 7 > [Jun-10-2010 15:45:38:890][08E0] 0x01 -> umad_receiver: ERR 5412: Failed to > obtain request madw for timed out MAD (method=0x92 attr=0x38 tid=0x1000003cd) > -- dropping > [Jun-10-2010 15:45:38:890][0D5C] 0x02 -> log_notice: Reporting Generic Notice > type:3 num:66 (New mcast group created) from LID:7 > GID:ff12:1405:ffff::3333:ffac:bad6 > [Jun-10-2010 15:45:38:890][08E0] 0x01 -> umad_receiver: ERR 5410: Send > completed with error(9) -- dropping > Class 0x3, Method 0x92, Attr 0x38, TID 0x100000ddb, > LID 15 > [Jun-10-2010 15:45:38:890][08E0] 0x01 -> umad_receiver: ERR 5412: Failed to > obtain request madw for timed out MAD (method=0x92 attr=0x38 tid=0x100000ddb) > -- dropping > [Jun-10-2010 15:45:38:890][0EF0] 0x02 -> log_notice: Reporting Generic Notice > type:3 num:66 (New mcast group created) from LID:7 > GID:ff12:1405:ffff::3333:ff4a:4b54 > [Jun-10-2010 15:45:38:890][0630] 0x02 -> log_notice: Reporting Generic Notice > type:3 num:66 (New mcast group created) from LID:7 > GID:ff12:1405:ffff::180:c200:3 > [Jun-10-2010 15:45:38:890][0A1C] 0x02 -> log_notice: Reporting Generic Notice > type:3 num:66 (New mcast group created) from LID:7 > GID:ff12:1405:ffff::3333:ffc0:77dc > [Jun-10-2010 15:45:38:890][0A1C] 0x02 -> log_notice: Reporting Generic Notice > type:3 num:66 (New mcast group created) from LID:7 > GID:ff12:1405:ffff::3333:ff90:fd62 > [Jun-10-2010 15:45:38:890][0A1C] 0x01 -> mcmr_rcv_leave_mgrp: ERR 1B25: > Received an invalid delete request for MGID: ff12:401b:ffff::1 for PortGID: > fe80::2:c902:0:25a5 > [Jun-10-2010 15:45:38:890][0A1C] 0x01 -> mcmr_rcv_leave_mgrp: ERR 1B25: > Received an invalid delete request for MGID: ff12:1405:ffff::180:c200:3 for > PortGID: fe80::2:c902:0:25a5 > [Jun-10-2010 15:45:38:890][0EF0] 0x01 -> mcmr_rcv_leave_mgrp: ERR 1B25: > Received an invalid delete request for MGID: ff12:401b:ffff::ffff:ffff for > PortGID: fe80::2:c902:0:25a5 > [Jun-10-2010 15:45:38:890][0EF0] 0x02 -> log_notice: Reporting Generic Notice > type:3 num:66 (New mcast group created) from LID:7 GID:ff12:401b:ffff::16 > [Jun-10-2010 15:45:38:890][0D5C] 0x02 -> log_notice: Reporting Generic Notice > type:3 num:66 (New mcast group created) from LID:7 > GID:ff12:1405:ffff::3333:0:16 > [Jun-10-2010 15:45:38:890][0D5C] 0x02 -> log_notice: Reporting Generic Notice > type:3 num:66 (New mcast group created) from LID:7 > GID:ff12:1405:ffff::3333:ffeb:5867 > [Jun-10-2010 15:45:38:890][0EF0] 0x02 -> log_notice: Reporting Generic Notice > type:3 num:66 (New mcast group created) from LID:7 > GID:ff12:1405:ffff::3333:ff70:431e > [Jun-10-2010 15:45:38:937][0630] 0x02 -> log_notice: Reporting Generic Notice > type:3 num:66 (New mcast group created) from LID:7 > GID:ff12:1405:ffff::3333:1:2 > [Jun-10-2010 15:45:39:202][08E0] 0x01 -> umad_receiver: ERR 5411: DR SMP Send > completed with error(9) -- dropping > Method 0x1, Attr 0x20, TID 0x130a, Hop Ptr: 0x0 > [Jun-10-2010 15:45:39:202][08E0] 0x01 -> Received SMP on a 2 hop path: > Initial path = 0,1,14, Return path = 0,0,0 > [Jun-10-2010 15:45:39:202][08E0] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: > MAD completed in error (IB_TIMEOUT): SubnGet(SMInfo), attr_mod 0x0, TID 0x130a > [Jun-10-2010 15:45:39:202][08E0] 0x01 -> umad_receiver: ERR 5411: DR SMP Send > completed with error(9) -- dropping > Method 0x1, Attr 0x20, TID 0x130b, Hop Ptr: 0x0 > [Jun-10-2010 15:45:39:202][08E0] 0x01 -> Received SMP on a 2 hop path: > Initial path = 0,1,10, Return path = 0,0,0 > [Jun-10-2010 15:45:39:202][08E0] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: > MAD completed in error (IB_TIMEOUT): SubnGet(SMInfo), attr_mod 0x0, TID 0x130b > [Jun-10-2010 15:45:39:202][0CCC] 0x02 -> osm_ucast_mgr_process: minhop tables > configured on all switches > [Jun-10-2010 15:45:39:217][0CCC] 0x02 -> SUBNET UP > [Jun-10-2010 15:45:39:342][0630] 0x02 -> log_notice: Reporting Generic Notice > type:3 num:66 (New mcast group created) from LID:7 > GID:ff12:1405:ffff::3333:0:2 > [Jun-10-2010 15:45:40:372][0A1C] 0x02 -> log_notice: Reporting Generic Notice > type:3 num:66 (New mcast group created) from LID:7 GID:ff12:401b:ffff::fb > [Jun-10-2010 15:45:45:364][0EF0] 0x02 -> log_notice: Reporting Generic Notice > type:3 num:66 (New mcast group created) from LID:7 GID:ff12:601b:ffff::fb > [Jun-10-2010 15:45:45:379][0A1C] 0x02 -> log_notice: Reporting Generic Notice > type:3 num:66 (New mcast group created) from LID:7 > GID:ff12:601b:ffff::1:ff00:525d > [Jun-10-2010 15:45:49:482][08E0] 0x01 -> umad_receiver: ERR 5410: Send > completed with error(9) -- dropping > Class 0x3, Method 0x92, Attr 0x11, TID 0x400000002, > LID 3 > [Jun-10-2010 15:45:49:482][08E0] 0x01 -> umad_receiver: ERR 5412: Failed to > obtain request madw for timed out MAD (method=0x92 attr=0x11 tid=0x400000002) > -- dropping > [Jun-10-2010 15:45:50:481][08E0] 0x01 -> umad_receiver: ERR 5410: Send > completed with error(9) -- dropping > Class 0x3, Method 0x92, Attr 0x11, TID 0x400000002, > LID 3 > [Jun-10-2010 15:45:50:481][08E0] 0x01 -> umad_receiver: ERR 5412: Failed to > obtain request madw for timed out MAD (method=0x92 attr=0x11 tid=0x400000002) > -- dropping > [Jun-10-2010 15:45:51:479][08E0] 0x01 -> umad_receiver: ERR 5410: Send > completed with error(9) -- dropping > Class 0x3, Method 0x92, Attr 0x11, TID 0x400000002, > LID 3 > [Jun-10-2010 15:45:51:479][08E0] 0x01 -> umad_receiver: ERR 5412: Failed to > obtain request madw for timed out MAD (method=0x92 attr=0x11 tid=0x400000002) > -- dropping > [Jun-10-2010 15:45:52:477][08E0] 0x01 -> umad_receiver: ERR 5410: Send > completed with error(9) -- dropping > Class 0x3, Method 0x92, Attr 0x11, TID 0x400000002, > LID 3 > [Jun-10-2010 15:45:52:477][08E0] 0x01 -> umad_receiver: ERR 5412: Failed to > obtain request madw for timed out MAD (method=0x92 attr=0x11 tid=0x400000002) > -- dropping > [Jun-10-2010 15:45:59:856][0C2C] 0x80 -> Exiting SM > > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to [email protected] > More majordomo info at http://*vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to [email protected] More majordomo info at http://vger.kernel.org/majordomo-info.html
