From: Uri Habusha [mailto:[email protected]]
Sent: Wednesday, March 16, 2011 5:10 AM
To: Smith, Stan; [email protected]; Gilad Margalit
Cc: Tziporet Koren
Subject: OpenSm issues

Hi Stan,

Recently we went back to running the regression in debug mode. Each night we encounter many issues with OpenSm. See below 3 different issues that are related to OpenSm. Who is responsible for OpenSm? What testing is done? Please advise how to proceed with investigating and fixing these failures.

Hello,

True, I was likely the last person to touch OpenSM, although at this time I do not have any cycles to address winOFED issues. Unfortunately you are on your own debug path. Perhaps discussions with the new OFED for Linux OpenSM maintainer, Alex Netes [[email protected]], might shed some light on the failures? Tzachi and Leo maintained OpenSM long before I became involved.

As always, a stack traceback without any operational/environmental context is difficult at best to make any sense of.

W.r.t. OpenSM testing:
1) All osmtest flavors passed.
2) A single OpenSM (multiple Mellanox switches) configuring a 53-node HPC cluster.
3) Multiple Windows OpenSMs tested for master/slave and failover operation.
4) Multiple Windows and Linux OpenSMs tested for master/slave and failover operation.

Microsoft HPC validation has used the current OpenSM on larger HPC clusters?

Stan.

Thanks,
Uri

0: kd> kb
RetAddr : Args to Child : Call Site
00000000`ff3f2c36 : 00000000`000a6f00 00000000`00000000 00000000`00000000 00000000`ff368e60 : ntdll!DbgBreakPoint
00000000`ff3ecfbc : 00000000`00602ba0 00000000`006fdde0 00000000`00000001 00000000`74da554c : opensm!osm_vendor_send+0x106 [s:\builds\7523\trunk\ulp\opensm\user\libvendor\osm_vendor_ibumad.c @ 1057]
00000000`ff3ed26f : 00000000`000cf7a0 00000000`006fdde0 00000000`00000001 00000000`ff367eb8 : opensm!vl15_send_mad+0x8c [s:\builds\7523\trunk\ulp\opensm\user\opensm\osm_vl15intf.c @ 81]
00000000`74db2d3a : 00000000`000cf7a0 00000000`00000000 00000000`00000000 00000000`00000000 : opensm!vl15_poller+0x16f [s:\builds\7523\trunk\ulp\opensm\user\opensm\osm_vl15intf.c @ 151]
00000000`76c2be3d : 00000000`000cf7b8 00000000`00000000 00000000`00000000 00000000`00000000 : complibd!cl_thread_callback+0x1a [s:\builds\7523\trunk\core\complib\user\cl_thread.c @ 49]
00000000`76d66611 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0xd
00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x1d

3: kd> kb
RetAddr : Args to Child : Call Site
00000000`74fd3c88 : 00000000`0016f748 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!DbgBreakPoint
00000000`ff91d1ae : 00000000`0016f748 00000000`001afe10 00000000`00000001 00000000`ff897eb8 : complibd!cl_qlist_remove_head+0x98 [s:\builds\7523\trunk\inc\complib\cl_qlist.h @ 1220]
00000000`74fe2d3a : 00000000`0016f700 00000000`00000000 00000000`00000000 00000000`00000000 : opensm!vl15_poller+0xae [s:\builds\7523\trunk\ulp\opensm\user\opensm\osm_vl15intf.c @ 138]
00000000`76e6466d : 00000000`0016f718 00000000`00000000 00000000`00000000 00000000`00000000 : complibd!cl_thread_callback+0x1a [s:\builds\7523\trunk\core\complib\user\cl_thread.c @ 49]
00000000`76f98791 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0xd
00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x1d

3: kd> kb
RetAddr : Args to Child : Call Site
00000000`779e7396 : 00000000`00000002 00000001`00000023 00000000`005bd360 00000000`00000003 : ntdll!RtlReportCriticalFailure+0x2f
00000000`779e86c2 : 00000000`00000000 000601d8`02b138dc 00000000`00000000 00000000`00000000 : ntdll!RtlpReportHeapFailure+0x26
00000000`779ea0c4 : 00000000`005b0000 00000000`00000000 00000000`005bd200 00000000`005bd360 : ntdll!RtlpHeapHandleError+0x12
00000000`7797ea00 : 00000000`005b0000 00000000`001b2d30 00000000`005bd270 00000000`0000029c : ntdll!RtlpLogHeapFailure+0xa4
00000000`779729ac : 00000000`005b0000 00000001`00000002 00000000`000000e0 00000000`000000f0 : ntdll!RtlpAllocateHeap+0x2105
000007fe`ffad1332 : 00000000`00000003 00000000`000000e0 00000000`2821b917 00000000`00000000 : ntdll!RtlAllocateHeap+0x16c
00000000`ff6514cc : 00000000`00000000 00000000`00000000 00000000`005bd370 00000000`000000b0 : msvcrt!malloc+0x70
00000000`ff6ad144 : 00000000`000ff630 00000000`001b2990 00000000`00000100 00000000`00d4f900 : opensm!osm_mad_pool_get+0x7c [s:\builds\7523\trunk\ulp\opensm\user\opensm\osm_mad_pool.c @ 86]
000007fe`f9542a1a : 00000000`001b2940 00000000`00000000 00000000`00000000 00000000`00000000 : opensm!umad_receiver+0x3b4 [s:\builds\7523\trunk\ulp\opensm\user\libvendor\osm_vendor_ibumad.c @ 314]
00000000`7771f56d : 00000000`001b2940 00000000`00000000 00000000`00000000 00000000`00000000 : complibd!cl_thread_callback+0x1a [s:\builds\7523\trunk\core\complib\user\cl_thread.c @ 49]
00000000`77953281 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0xd
00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x1d

Uri Habusha
Windows SW Development Lead
Mellanox Technologies
P.O. Box 586, Yokneam 20692 Israel
_______________________________________________
ofw mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
