On Mon, 2005-11-07 at 09:42, Eitan Zahavi wrote:
> Hi Hal,
>
> I will answer for Yael as she already left the office.
>
> The way to reproduce the "stuck" case is to run in bash:
> % while test $? = 0; do opensm -V -o; done
>
> The symptom we see is that OpenSM sort of exits but the process stays
> active (not even defunct). No way to kill it. It seems like one of the
> threads gets caught in the middle of ioctl or something. To be able to
> run OpenSM after this we need to reboot the machine.
>
> We avoid it by not issuing umad_unregister and umad_close_port

I saw the change to not call umad_unregister in the patch. Where is the
change for umad_close_port?

-- Hal

> Eitan Zahavi
> Design Technology Director
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
>
>
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:[EMAIL PROTECTED]]
> > Sent: Monday, November 07, 2005 4:21 PM
> > To: [EMAIL PROTECTED]
> > Cc: [email protected]; [EMAIL PROTECTED]
> > Subject: Re: [PATCH] Opensm - exiting issues
> >
> > Hi Yael,
> >
> > On Mon, 2005-11-07 at 08:25, Yael Kalka wrote:
> > > Hi Hal,
> > >
> > > There was a problem when running opensm with the -o option that
> > > caused opensm to always exit with a segfault, due to object
> > > destruction ordering. Also - there is the known issue of exiting
> > > opensm. We've done some cleanup of the exiting code. The following
> > > patch fixes most of it.
> >
> > I applied this part of the patch with some cosmetic changes in
> > osm_vendor_ibumad.c.
> >
> > > In the current code we saw that sometimes opensm gets "stuck" on
> > > exit, and causes the machine to get stuck too - resulting in the
> > > need for a reboot. The following patch fixes most of it.
> > > We did run (in the patch) into rare cases where opensm exits with an
> > > error, but at least it exits without getting the machine stuck...
> >
> > Is there a reliable way to recreate the machine "stuck" case? What
> > exactly do you mean by this?
> >
> > All umad_unregister does is some validation, a table lookup, and issue
> > the ioctl to unregister the MAD agent. Not explicitly unregistering the
> > agent(s) does not cause any harm as when the fd is closed, this will
> > occur as part of the cleanup.
> >
> > -- Hal
>
_______________________________________________
openib-general mailing list
[email protected]
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
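
For anyone trying to reproduce the hang, the one-line bash loop quoted above runs until opensm returns non-zero but gives no indication of which pass got stuck. A minimal sketch of a bounded variant follows. The helper name repeat_until_failure and its max_runs argument are hypothetical and not part of OpenSM; only the opensm -V -o invocation comes from the report. It assumes a POSIX-compatible shell and does not try to kill a hung process, since the report says the stuck process cannot be killed.

```shell
# Hypothetical helper, not part of OpenSM: run a command up to max_runs times,
# stopping at the first non-zero exit status. Progress goes to stderr, so if
# the command hangs, the last "starting" line shows which run never returned.
repeat_until_failure() {
    max_runs=$1
    shift
    run=1
    while [ "$run" -le "$max_runs" ]; do
        echo "run $run: starting $*" >&2
        "$@"
        status=$?
        if [ "$status" -ne 0 ]; then
            echo "run $run: exited with status $status" >&2
            return "$status"
        fi
        run=$((run + 1))
    done
    echo "completed $max_runs runs without failure" >&2
    return 0
}

# Example (needs a machine with an InfiniBand HCA; flags as in the report):
#   repeat_until_failure 1000 opensm -V -o
```

Capping the number of runs is a design choice for unattended testing. The original loop is equivalent to an unbounded max_runs.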
