On Fri, 2005-05-27 at 17:15 -0400, Hal Rosenstock wrote:
> On Fri, 2005-05-27 at 14:31, Tom Duffy wrote:
> > On Fri, 2005-05-27 at 11:27 -0700, Tom Duffy wrote:
> > > I just noticed that my opensm had segv'ed and dumped core.
> >
> > BTW, here was the tail of the osm.log:
> >
> > May 27 01:44:09 [43005960] -> osm_vendor_get: [
> > May 27 01:44:09 [43806960] -> __osm_vl15_poller: Servicing p_madw =
> > 0x5678f0 (mad 0x5f33f0 req 1)
> > May 27 01:44:09 [43005960] -> osm_vendor_get: Acquiring UMAD for p_madw =
> > 0x567908, size = 256.
> > May 27 01:44:09 [43005960] -> osm_vendor_get: Acquired UMAD 0x5f3640, size
> > = 256.
> > May 27 01:44:09 [43005960] -> osm_vendor_get: ]
> > May 27 01:44:09 [43005960] -> osm_mad_pool_get: Acquired p_madw = 0x5678f0,
> > p_mad = 0x5f3670, size = 256.
> > May 27 01:44:09 [43005960] -> osm_mad_pool_get: ]
> > May 27 01:44:09 [43005960] -> osm_req_get: Getting P_KeyTable (0x16),
> > modifier = 0x10001, TID = 0x1c149.
> > May 27 01:44:09 [43005960] -> osm_vl15_post: [
> > May 27 01:44:09 [43005960] -> osm_vl15_post: Servicing p_madw = 0x5678f0
> > (mad 0x5f3670 req 1)
> > May 27 01:44:09 [43005960] -> osm_vl15_post: 4294967295 MADs on wire, 2
> > MADs outstanding.
>                                                ^^^^^^^^^^
> This looks weird.
>
> > May 27 01:44:09 [43005960] -> osm_vl15_poll: [
> > May 27 01:44:09 [43005960] -> osm_vl15_poll: Signalling poller thread.
> > May 27 01:44:09 [43005960] -> osm_vl15_poll: ]
> > May 27 01:44:09 [43005960] -> osm_vl15_post: ]
> > May 27 01:44:09 [43005960] -> osm_req_get: ]
> > May 27 01:44:09 [43005960] -> osm_physp_has_pkey: ]
> > May 27 01:44:09 [43005960] -> __osm_pi_rcv_get_pkey_slvl_vla_tables: ]
> > May 27 01:44:09 [43005960] -> osm_pi_rcv_process: ]
> > May 27 01:44:09 [43005960] -> __osm_sm_mad_ctrl_disp_done_callback: [
>
> Wonder why __osm_sm_mad_ctrl_disp_done_callback wasn't on the stack
> shown in the previous email as this makes it look like it should be.
>
> Could you go back a little further in the log ? I'd like to see what is
> before the start of __osm_pi_rcv_get_pkey_slvl_vla_tables and
> osm_pi_rcv_process.
The log had grown to almost 1G, so I actually deleted it. Shit, sorry.
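
On the "4294967295 MADs on wire" figure flagged in the quoted log: that value is exactly 2^32 - 1, which is what a 32-bit unsigned counter reads after being decremented once below zero. Whether OpenSM's on-wire counter is really declared that way is an assumption here. The sketch below uses a hypothetical counter (not the real OpenSM field) purely to show the arithmetic:

```c
#include <stdint.h>

/* Hypothetical stand-in for an "MADs on wire" counter. The real
 * OpenSM field name and type are NOT taken from the source. */
static uint32_t dec_counter(uint32_t on_wire)
{
    /* Unsigned arithmetic is modulo 2^32, so 0 - 1 wraps around
     * to UINT32_MAX (4294967295) instead of going negative. */
    return on_wire - 1u;
}
```

If the counter behaves like this, `dec_counter(0)` gives 4294967295, which would be consistent with it having been decremented once more than it was incremented. The log alone doesn't prove that, though.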
> It also seems weird to me that there is no other
> log message between these two.
>
> From the stack trace:
> #3 osm_dump_dr_smp (p_log=0x552498, p_smp=0x0, log_level=32 ' ')
> at osm_helper.c:1446
> #4 0x000000000042eed1 in __osm_vl15_poller (p_ptr=0x552498) at
> osm_madw.h:575
>
> It looks like OpenSM was in osm_vl15intf.c::__osm_vl15_poller
>
> if( p_madw != (osm_madw_t*)cl_qlist_end( p_fifo ) )
> {
>   if( osm_log_is_active( p_vl->p_log, OSM_LOG_DEBUG ) )
>   {
>     osm_log( p_vl->p_log, OSM_LOG_DEBUG,
>              "__osm_vl15_poller: "
>              "Servicing p_madw = %p (mad %p req %d)\n",
>              p_madw, p_madw->p_mad, p_madw->resp_expected);
>   }
>
>   if( osm_log_is_active( p_vl->p_log, OSM_LOG_FRAMES ) )
>   {
>     osm_dump_dr_smp( p_vl->p_log,
>                      osm_madw_get_smp_ptr( p_madw ), OSM_LOG_FRAMES );
>     <=== here
>   }
>
> when it died but I didn't see the previous log message in the code
> "osm_vl15_poller: Servicing p_madw" which I also would have expected.
> [This would have been telling as p_madw->p_mad would have been logged].
> I didn't see the __osm_vl15_poller entry message either.
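
Frame #3 of the quoted trace shows osm_dump_dr_smp being called with p_smp=0x0, which suggests the wrapper's MAD pointer was NULL when the frame dump ran. A minimal sketch of a defensive check before dumping follows. The types and the function name are hypothetical, simplified stand-ins, not the real osm_madw_t, ib_smp_t, or any OpenSM API:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for the MAD and its wrapper. */
typedef struct { unsigned char method; } smp_t;
typedef struct { smp_t *p_mad; } madw_t;

/* Returns 1 if the frame was dumped, 0 if it was skipped because the
 * wrapper (or its MAD pointer) was NULL. */
static int dump_smp_checked(FILE *log, const madw_t *p_madw)
{
    if (p_madw == NULL || p_madw->p_mad == NULL) {
        fprintf(log, "dump_smp_checked: NULL MAD in wrapper %p, skipping\n",
                (const void *)p_madw);
        return 0;
    }
    fprintf(log, "dump_smp_checked: method 0x%x\n",
            (unsigned)p_madw->p_mad->method);
    return 1;
}
```

A check like this would only avoid the crash in the dump path. Why p_mad was NULL in the first place is a separate question.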
well, if it segv'ed maybe it never finished writing out to the file...
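
To illustrate that guess: with plain buffered stdio, a line that has been fprintf'ed but not flushed can be lost when the process dies from a signal. Whether osm_log flushes after each message is not something this thread establishes, so the sketch below assumes an ordinary fully buffered FILE on a POSIX system. The function name and the log path are made up for the example:

```c
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* A child process writes one log line to `path`, optionally flushes,
 * then dies from SIGSEGV. Returns the number of bytes that actually
 * reached the file, or -1 on error. */
static long bytes_logged_before_crash(const char *path, int do_flush)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        FILE *log = fopen(path, "w");
        if (log == NULL)
            _exit(1);
        fprintf(log, "__osm_vl15_poller: Servicing p_madw\n");
        if (do_flush)
            fflush(log);
        raise(SIGSEGV); /* default action kills us; stdio is not flushed */
        _exit(1);       /* fallback if SIGSEGV is ignored; also no flush */
    }

    if (waitpid(pid, NULL, 0) < 0)
        return -1;

    FILE *in = fopen(path, "r");
    if (in == NULL)
        return -1;
    long n = 0;
    while (fgetc(in) != EOF)
        n++;
    fclose(in);
    return n;
}
```

Without the flush, the message stays in the child's stdio buffer and the file ends up empty. With the flush, the line reaches the file before the crash.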
-tduffy
_______________________________________________
openib-general mailing list
http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
