On Tue, 2005-11-08 at 07:02, Yael Kalka wrote:
> Hi Hal,
>
> The filesystem is not full, since I am using opensm with -e and with no
> verbosity.
>
> swlab53:~ # df -k /var/log/
> Filesystem           1K-blocks      Used Available Use% Mounted on
> /dev/sda3              8262068   4705692   3136680  61% /

How large is the osm.log file (ls -lasg) when this occurs ?

-- Hal

>
> Yael
>
> -----Original Message-----
> From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, November 08, 2005 1:53 PM
> To: [EMAIL PROTECTED]
> Cc: [email protected]; [EMAIL PROTECTED]
> Subject: RE: [PATCH] Opensm - exiting issues
>
>
> Hi Yael,
>
> On Tue, 2005-11-08 at 05:12, Yael Kalka wrote:
> > Hi Hal,
> >
> > It seems that there is still another race somewhere.
> > The situation is much better. I had to run the testing for
> > ~45 minutes in order to see the problem.
>
> Is your filesystem full ? What is the file size of the log when you hit
> this ? Is this a max file size issue ?
>
> -- Hal
>
> > I ran on a loopback machine the following:
> > a) from port #2
> > % while test $? = 0; do opensm -o -e; done
> > b) from port #1
> > % while test 1 = 1; do osmtest -f f; done
> >
> > The process is hung. When getting the process with ps -efww I get:
> > root 27939 27938 0 11:40 pts/0 00:00:00 [opensm] <defunct>
> > root 27938 8001 0 11:40 pts/0 00:00:00 usr/bin/opensm -o -e -g 0x2c902000017a2
> >
> > Machine description: SuSE Linux 9.3 (i586) 2.6.11.4-20a-smp
> >
> > lsmod reports the following:
> > Module                 Size  Used by
> > subfs                 12416  1
> > nvram                 13576  0
> > usbserial             34024  0
> > autofs4               23556  2
> > speedstep_lib          8324  0
> > freq_table             8832  0
> > thermal               18184  0
> > processor             28648  1 thermal
> > ipv6                 273920  20
> > fan                    8836  0
> > button                11024  0
> > battery               14084  0
> > ac                     9220  0
> > edd                   14560  0
> > evdev                 12928  0
> > joydev                13888  0
> > st                    43676  0
> > sr_mod                21284  0
> > ib_ipoib              44804  0
> > ib_sa                 16652  1 ib_ipoib
> > ib_uverbs             37416  0
> > ib_umad               19376  2
> > af_packet             26760  4
> > sg                    42912  0
> > ib_mthca             119452  0
> > ib_mad                41620  3 ib_sa,ib_umad,ib_mthca
> > ib_core               48000  6 ib_ipoib,ib_sa,ib_uverbs,ib_umad,ib_mthca,ib_mad
> > e1000                 91316  0
> > e100                  43392  0
> > mii                    9088  1 e100
> > i2c_i801              12556  0
> > i2c_core              26624  1 i2c_i801
> > uhci_hcd              37008  0
> > usbcore              121688  3 usbserial,uhci_hcd
> > parport_pc            44356  0
> > lp                    15396  0
> > parport               40392  2 parport_pc,lp
> > video1394             22860  0
> > ohci1394              37508  1 video1394
> > raw1394               34540  0
> > ieee1394             108472  3 video1394,ohci1394,raw1394
> > capability             7224  0
> > nls_iso8859_1          8064  1
> > nls_cp437              9728  1
> > vfat                  17792  1
> > fat                   43804  1 vfat
> > dm_mod                64768  0
> > ext3                 145032  2
> > jbd                   73764  1 ext3
> > ide_cd                44036  0
> > cdrom                 42784  2 sr_mod,ide_cd
> > ide_disk              22400  0
> > aic7xxx              200632  4
> > piix                  14468  0 [permanent]
> > ide_core             131904  3 ide_cd,ide_disk,piix
> > sd_mod                23168  5
> > scsi_mod             136008  5 st,sr_mod,sg,aic7xxx,sd_mod
> >
> > Thanks,
> > Yael
> >
> >
> >
> > -----Original Message-----
> > From: Yael Kalka
> > Sent: Tuesday, November 08, 2005 8:38 AM
> > To: 'Hal Rosenstock'; Eitan Zahavi
> > Cc: Yael Kalka; [email protected]
> > Subject: RE: [PATCH] Opensm - exiting issues
> >
> >
> > Hi Hal,
> >
> > Just another comment - when running:
> > % while test $? = 0; do opensm -V -o; done
> > Try to run from a different port:
> > % osmtest -f f
> > This causes flooding of MADs to the opensm, and that usually is
> > the cause for the exiting problem.
> >
> > Yael
> >
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
> > Sent: Monday, November 07, 2005 10:07 PM
> > To: Eitan Zahavi
> > Cc: Yael Kalka; [email protected]
> > Subject: RE: [PATCH] Opensm - exiting issues
> >
> >
> > On Mon, 2005-11-07 at 09:42, Eitan Zahavi wrote:
> > > Hi Hal,
> > >
> > > I will answer for Yael as she already left the office.
> > >
> > > The way to reproduce the "stuck" case is to run in bash:
> > > % while test $? = 0; do opensm -V -o; done
> > >
> > > The symptom we see is that OpenSM sort of exits but the process stays
> > > active (not even defunct). No way to kill it. It seems like one of the
> > > threads gets caught in the middle of ioctl or something. To be able to
> > > run OpenSM after this we need to reboot the machine.
> > >
> > > We avoid it by not issuing umad_unregister and umad_close_port
> >
> > This part of the patch is not needed with the fix to user_mad put in by
> > Roland based on the issue (and patch) from Michael on user_mad deadlock.
> >
> > I've been running your test for over 30 minutes now without a hiccup.
> > It used to fail pretty quickly.
> >
> > -- Hal
> >
> > >
> > > Eitan Zahavi
> > > Design Technology Director
> > > Mellanox Technologies LTD
> > > Tel:+972-4-9097208
> > > Fax:+972-4-9593245
> > > P.O. Box 586 Yokneam 20692 ISRAEL
> > >
> > >
> > > > -----Original Message-----
> > > > From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
> > > > Sent: Monday, November 07, 2005 4:21 PM
> > > > To: [EMAIL PROTECTED]
> > > > Cc: [email protected]; [EMAIL PROTECTED]
> > > > Subject: Re: [PATCH] Opensm - exiting issues
> > > >
> > > > Hi Yael,
> > > >
> > > > On Mon, 2005-11-07 at 08:25, Yael Kalka wrote:
> > > > > Hi Hal,
> > > > >
> > > > > There was a problem when running opensm with -o option, that caused
> > > > > the opensm to always exit with segfault, due to object destruction
> > > > > ordering. Also - there is the known issue of exiting opensm. We've
> > > > > done some clearing to the exiting code. The following patch fixes most
> > > > > of it.
> > > >
> > > > I applied this part of the patch with some cosmetic changes in
> > > > osm_vendor_ibumad.c.
> > > >
> > > > > In the current code we saw that sometimes opensm gets "stuck" on exit,
> > > > > and causes the machine to get stuck too - resulting in need for
> > > > > rebooting. The following patch fixes most of it.
> > > > > We did run (in the patch) into rare cases where opensm exits with an
> > > > > error, but at least it exits without stucking the machine...
> > > >
> > > > Is there a reliable way to recreate machine "stuck" ? What exactly do
> > > > you mean by this ?
> > > >
> > > > All umad_unregister does is some validation, a table lookup, and issue
> > > > the ioctl to unregister the MAD agent. Not explicitly unregistering the
> > > > agent(s) does not cause any harm as when the fd is closed, this will
> > > > occur as part of the cleanup.
> > > >
> > > > -- Hal
> > > >
_______________________________________________
openib-general mailing list
[email protected]
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
