Re: Is it just me or has -current suddenly got massively unstable?
On Tue, 23 Jul 2002, Peter Wemm wrote:

>
>         thread_zone = uma_zcreate("THREAD", sizeof (struct thread),
>             thread_ctor, thread_dtor, thread_init, thread_fini,
> -           UMA_ALIGN_CACHE, 0);
> +           UMA_ALIGN_CACHE, UMA_ZONE_NOFREE);
>  }
>
>  /*
>
> I haven't panicked yet with that change. :-)  For some unknown reason,
> selwakeup() is dereferencing pointers to threads that have long gone and
> the backing store has been freed.  The patch above is a bandaid, not a
> solution.  It basically prevents threads ever being freed back to the
> general pool, even though everything here supposedly does not need that.
> (unlike struct proc and socket, for example).

Peter, this comment in selrecord() scared the heck out of me:

---
        /*
         * If the thread is NULL then take ownership of selinfo
         * however if the thread is not NULL and the thread points to
         * someone else, then we have a collision, otherwise leave it alone
         * as we've owned it in a previous selrecord on this selinfo.
         */
---

It suggests that select still doesn't clean up after itself.
Looking in select(), however, I see:

        if (timo > 0)
                error = cv_timedwait_sig(&selwait, &sellock, timo);
        else
                error = cv_wait_sig(&selwait, &sellock);

        if (error == 0)
                goto retry;

done:
        clear_selinfo_list(td);

This suggests that there is no way to exit this function without clearing
the thread pointers, but your trace suggests otherwise.

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-current" in the body of the message
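The ownership rule in the quoted selrecord() comment can be sketched in userland as follows. This is a toy model under stated assumptions, not the kernel code: every name here (`toy_thread`, `toy_selinfo`, `toy_selrecord`, `toy_clear_selinfo`, the `collided` field) is hypothetical, and the real code also takes sellock and keeps a per-thread list, both omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct thread and struct selinfo. */
struct toy_thread {
	int id;
};

struct toy_selinfo {
	struct toy_thread *si_thread;	/* current owner, or NULL */
	int collided;			/* hypothetical: a second thread recorded */
};

enum record_result {
	REC_TOOK_OWNERSHIP,	/* selinfo had no owner */
	REC_COLLISION,		/* selinfo is owned by someone else */
	REC_ALREADY_OWNED	/* we recorded on it earlier */
};

/* The three cases described by the comment, in the same order. */
enum record_result
toy_selrecord(struct toy_thread *td, struct toy_selinfo *sip)
{
	if (sip->si_thread == NULL) {
		sip->si_thread = td;
		return REC_TOOK_OWNERSHIP;
	}
	if (sip->si_thread != td) {
		sip->collided = 1;
		return REC_COLLISION;
	}
	return REC_ALREADY_OWNED;
}

/*
 * What the cleanup on the way out of select() has to guarantee:
 * no selinfo keeps pointing at a thread that has stopped waiting.
 */
void
toy_clear_selinfo(struct toy_thread *td, struct toy_selinfo *sip)
{
	if (sip->si_thread == td)
		sip->si_thread = NULL;
}
```

If any exit path skips the cleanup step, `si_thread` is left pointing at a thread that may later be freed, which is the kind of stale pointer the trace appears to show.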
Re: Is it just me or has -current suddenly got massively unstable?
Sheldon Hearn wrote:
> On (2002/07/23 12:08), Yann Berthier wrote:
> >    Thanks a lot, patch applied, and all is going fine. Peter: I knew you
> >    would come up with a solution :)
> >    (well, feel free to call it bandaid, but it solves the problem BTW)
>
> To quote Terry Lambert on what he calls Occam's Corollary:
>
>       Anything that works is better than anything that doesn't.
>
> :-)

Be really, really careful here.

It works because it changes the memory to be type stable. If the
structure has not been reused, selwakeup() sees the previous values and
signals a wakeup where there is no one waiting. If the structure *has*
been reused, then it issues a selwakeup() to a potentially unrelated
thread.

In most cases, this is a harmless event that's not even being checked
for; in other cases, it's being checked for, and it looks like a bogus
return. Most code that sits in a select loop will only trigger if a bit
is set. However, it's a perfectly valid thing to think that you won't
get spurious returns -- and write code that *depends* on not getting
spurious returns.

Since I've only been following this vs. -current by reading, rather
than running, source code, and reading, rather than applying, patches,
this is just my initial reaction to the patch. So take the following
with a grain of salt...

On the other hand: there is a *real* problem here; again, from just
reading the code, it looks like a pretty deep one having to do with
events being things which happen *on* descriptors, rather than *to*
processes (or threads).

I expect that the problem is that a thread has been terminated, and it
is the thread which opened a socket and then did the listen on it, but
it isn't around to do the accept or to receive the connection event.

It's a deep problem because descriptors belong to processes, not
threads, and events belong to the descriptors, not to the callers;
before KSEs, it was OK to treat it as a commutative property.

I rather expect that there is a similar panic that will show up during
stress testing, which will occur at NETISR on incoming connections, in
the bottom half of the "accept" code, which has a similar looking
selwakeup() call.

Probably, the only way to fix this is to make it a process event rather
than a thread event, which would avoid the list removal and subsequent
dereference. Kind of an ugly kludge. 8-(.

It would not surprise me if the kevent() resulting from signals is near
the heart of the signal problem, as well, and has a parallel basis.

-- Terry
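To illustrate the point about code that depends on not getting spurious returns, here is a minimal userland sketch of a select() loop that tolerates them. It assumes the descriptor is in non-blocking mode and is below FD_SETSIZE; the function names `read_when_ready` and `set_nonblocking` are made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/* Put a descriptor into non-blocking mode; returns 0 on success. */
int
set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL, 0);

	if (flags == -1)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Wait until fd is reported readable, then try to read from it.
 * Returns the number of bytes read, 0 on EOF, or -1 on a real error.
 * A wakeup with nothing to read is treated as spurious and retried.
 */
ssize_t
read_when_ready(int fd, void *buf, size_t len)
{
	for (;;) {
		fd_set rfds;
		ssize_t got;
		int n;

		FD_ZERO(&rfds);
		FD_SET(fd, &rfds);

		n = select(fd + 1, &rfds, NULL, NULL, NULL);
		if (n == -1) {
			if (errno == EINTR)
				continue;	/* interrupted by a signal */
			return -1;
		}
		if (!FD_ISSET(fd, &rfds))
			continue;	/* woke up, but not for our descriptor */

		got = read(fd, buf, len);
		if (got == -1 && (errno == EAGAIN || errno == EWOULDBLOCK ||
		    errno == EINTR))
			continue;	/* readiness was stale; wait again */
		return got;
	}
}
```

Code written this way only acts when the read actually returns data, so a stray wakeup costs one extra trip around the loop. Code that does a blocking read as soon as select() returns has no such protection.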
Re: Is it just me or has -current suddenly got massively unstable?
On (2002/07/23 12:08), Yann Berthier wrote:

>    Thanks a lot, patch applied, and all is going fine. Peter: I knew you
>    would come up with a solution :)
>    (well, feel free to call it bandaid, but it solves the problem BTW)

To quote Terry Lambert on what he calls Occam's Corollary:

        Anything that works is better than anything that doesn't.

:-)

Ciao,
Sheldon.
Re: Is it just me or has -current suddenly got massively unstable?
On Tue, 23 Jul 2002, Peter Wemm wrote:

[snip]

> Thanks for the independent confirmation.  Here's a workaround patch
> that you might like to try:
>
> --- kern_thread.c       17 Jul 2002 23:43:55 -  1.8
> +++ kern_thread.c       22 Jul 2002 23:31:06 -
> @@ -198,7 +198,7 @@
>
>         thread_zone = uma_zcreate("THREAD", sizeof (struct thread),
>             thread_ctor, thread_dtor, thread_init, thread_fini,
> -           UMA_ALIGN_CACHE, 0);
> +           UMA_ALIGN_CACHE, UMA_ZONE_NOFREE);
>  }
>
>  /*
>
> I haven't panicked yet with that change. :-)  For some unknown reason,
> selwakeup() is dereferencing pointers to threads that have long gone and
> the backing store has been freed.  The patch above is a bandaid, not a
> solution.  It basically prevents threads ever being freed back to the
> general pool, even though everything here supposedly does not need that.
> (unlike struct proc and socket, for example).

   Thanks a lot, patch applied, and all is going fine. Peter: I knew you
   would come up with a solution :)
   (well, feel free to call it bandaid, but it solves the problem BTW)

   - yann
Re: Is it just me or has -current suddenly got massively unstable?
On (2002/07/22 18:48), Szilveszter Adam wrote:

> I have a kernel and world from Saturday, it seems reasonably ok in
> console mode (does not panic although it is used as an ADSL router) but
> in X, it locks up very easily. I tried it with Mozilla on Sunday, it
> froze twice within as many hours, [...]

As a datapoint, yesterday's -CURRENT was sufficiently stable on a UP
system to perform a full local ports build and install (117 ports
including X and mozilla), while also serving as an X workstation with
pretty heavily-used mozilla and CVS clients.

Ciao,
Sheldon.
Re: Is it just me or has -current suddenly got massively unstable?
Yann Berthier wrote:
> On Mon, 22 Jul 2002, Peter Wemm wrote:
>
> > It might be just me because I swapped an ISA 'si' card for a PCI version, but
> > the problems I've been seeing are pretty spectacular.  I'm regularly seeing
> > the following panics:
> >
> > - selwakeup() taking fatal traps (always while running postfix/smtpd,
> >   presumably this is happening during the traditional 'select collision'
> >   window - the locking looks rather suspect there too).  This killed my box
> >   3 times today alone.
> >
> > eg:
> > Fatal trap 12: page fault while in kernel mode
> > fault virtual address   = 0xc44a01b4
> > fault code              = supervisor write, page not present
> > instruction pointer     = 0x8:0xc027f945
> > current process         = 4078 (smtpd)
> > trap number             = 12
>
>    Same here: 2 panics with a kernel from today while running
>    postfix/smtpd.
>
>    Sorry, I have no more info to give for now though

Thanks for the independent confirmation.  Here's a workaround patch
that you might like to try:

--- kern_thread.c       17 Jul 2002 23:43:55 -  1.8
+++ kern_thread.c       22 Jul 2002 23:31:06 -
@@ -198,7 +198,7 @@

        thread_zone = uma_zcreate("THREAD", sizeof (struct thread),
            thread_ctor, thread_dtor, thread_init, thread_fini,
-           UMA_ALIGN_CACHE, 0);
+           UMA_ALIGN_CACHE, UMA_ZONE_NOFREE);
 }

 /*

I haven't panicked yet with that change. :-)  For some unknown reason,
selwakeup() is dereferencing pointers to threads that have long gone and
the backing store has been freed.  The patch above is a bandaid, not a
solution.  It basically prevents threads ever being freed back to the
general pool, even though everything here supposedly does not need that.
(unlike struct proc and socket, for example).

peter@overcee[11:57pm]/home/crash-105# gdb -k kernel.12 vmcore.12
...
Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0xc29b0634
fault code              = supervisor write, page not present
instruction pointer     = 0x8:0xc0257755
current process         = 1411 (smtpd)
...
(kgdb) l *0xc0257755
0xc0257755 is in selwakeup (../../../kern/sys_generic.c:1186).
1181            }
1182            if (td == NULL) {
1183                    mtx_unlock(&sellock);
1184                    return;
1185            }
1186            TAILQ_REMOVE(&td->td_selq, sip, si_thrlist);
1187            sip->si_thread = NULL;
1188            mtx_lock_spin(&sched_lock);
1189            if (td->td_wchan == (caddr_t)&selwait) {
1190                    if (td->td_state == TDS_SLP)

#5  0xc034c68d in trap (frame=
      {tf_fs = -1069613032, tf_es = 16, tf_ds = -1070006256, tf_edi = 0,
       tf_esi = -1034848204, tf_ebp = -630072692, tf_isp = -630072736,
       tf_ebx = -1030027776, tf_edx = -1030911744, tf_ecx = 1,
       tf_eax = -1030027728, tf_trapno = 12, tf_err = 2,
       tf_eip = -1071286443, tf_cs = 8, tf_eflags = 66118,
       tf_esp = -1069571036, tf_ss = 0})
    at ../../../i386/i386/trap.c:445
#6  0xc0257755 in selwakeup (sip=0xc2517834)
    at ../../../kern/sys_generic.c:1186
#7  0xc026d249 in sowakeup (so=0xc25177d0, sb=0xc251781c)
    at ../../../kern/uipc_socket2.c:300
#8  0xc026cdb0 in soisconnected (so=0xc2750bb8)
    at ../../../kern/uipc_socket2.c:132
#9  0xc02726fd in unp_connect2 (so=0xc30a3190, so2=0xc2750bb8)
    at ../../../kern/uipc_usrreq.c:769
#10 0xc0272653 in unp_connect (so=0xc30a3190, nam=0xc4359d00, td=0xc30a3190)
    at ../../../kern/uipc_usrreq.c:737
#11 0xc027173e in uipc_connect (so=0x0, nam=0x0, td=0xc28d8900)
    at ../../../kern/uipc_usrreq.c:161
#12 0xc026abda in soconnect (so=0xc263c630, nam=0x0, td=0x0)
    at ../../../kern/uipc_socket.c:429
#13 0xc026eade in connect (td=0xc30a3190, uap=0xc2750bb8)
    at ../../../kern/uipc_syscalls.c:441
#14 0xc034d1c1 in syscall (frame=
      {tf_fs = 47, tf_es = 47, tf_ds = 47, tf_edi = 11, tf_esi = 0,
       tf_ebp = -1077938236, tf_isp = -630071948, tf_ebx = 134708840,
       tf_edx = -1077938342, tf_ecx = 0, tf_eax = 98, tf_trapno = 22,
       tf_err = 2, tf_eip = 671906955, tf_cs = 31, tf_eflags = 663,
       tf_esp = -1077938408, tf_ss = 47})
    at ../../../i386/i386/trap.c:1049

I've checked the page tables, it is indeed unmapped.  Also note that this
is in the guts of the unix domain socket code. :-]
(kgdb)

peter@overcee[11:58pm]/home/crash-110# gdb -k kernel.10 vmcore.10
Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0xc44a01b4
fault code              = supervisor write, page not present
instruction pointer     = 0x8:0xc027f945
current process         = 4078 (smtpd)
[..]
#13 0xc03750dd in trap ()
#14 0xc027f945 in selwakeup ()
#15 0xc02953f9 in sowakeup ()
#16 0xc0294f60 in soisconnected ()
#17 0xc029a8ad in unp_connect2 ()
#18 0xc029a803 in unp_connect ()
#19 0xc02998ee in uipc_connect ()
#20 0xc0292d8a in soconnect ()
#21 0xc0296c8e in connect ()
#22 0xc0375c11 in syscall ()

Interestingly, the stack trace is identical on both of these that I
managed to capture.

Cheers,
-Peter
-- Pe
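Why a no-free zone hides the fault rather than fixing it can be shown with a small userland model. This is only a sketch under stated assumptions: a fixed static array stands in for a zone whose memory is never returned to the system, and `toy_thread`, `toy_thread_alloc` and `toy_thread_free` are hypothetical names, not the UMA API.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SLOTS 4

/* Hypothetical stand-in for struct thread. */
struct toy_thread {
	int tid;			/* 0 means the slot is not in use */
	struct toy_thread *next_free;	/* free-list link */
};

/* The pool's memory is never unmapped, like a zone that never frees. */
static struct toy_thread pool[POOL_SLOTS];
static struct toy_thread *free_head;
static int pool_ready;

static void
pool_init(void)
{
	size_t i;

	for (i = 0; i < POOL_SLOTS; i++) {
		pool[i].tid = 0;
		pool[i].next_free = free_head;
		free_head = &pool[i];
	}
	pool_ready = 1;
}

/* Take a slot off the free list; returns NULL when the pool is empty. */
struct toy_thread *
toy_thread_alloc(int tid)
{
	struct toy_thread *td;

	if (!pool_ready)
		pool_init();
	td = free_head;
	if (td == NULL)
		return NULL;
	free_head = td->next_free;
	td->tid = tid;
	td->next_free = NULL;
	return td;
}

/* Return a slot to the free list; the memory itself stays valid. */
void
toy_thread_free(struct toy_thread *td)
{
	td->tid = 0;
	td->next_free = free_head;
	free_head = td;
}
```

A stale pointer into this pool can always be dereferenced without a fault, but once the slot is handed out again the pointer refers to a different thread. That matches the description of the patch as a bandaid: the page fault goes away while the stale reference remains.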
Re: Is it just me or has -current suddenly got massively unstable?
Hello,

I have a kernel and world from Saturday, it seems reasonably ok in
console mode (does not panic although it is used as an ADSL router) but
in X, it locks up very easily. I tried it with Mozilla on Sunday, it
froze twice within as many hours, in a seemingly nondeterministic
manner. Unfortunately, the lockups leave the machine unable to give any
useful info, and it does not reboot by itself either. As said, I can do
even demanding things on the console (eg building Mozilla, generating
the PR stats page during the website build etc).

Additional detail: My previous kernel was from the 18th, with the
"bandaid" fix to pmap.c, and that one did not lock up under X, at least
not during my testing.

So, yes, I am seeing instability, but only under X.

--
Regards:

Szilveszter ADAM
Szombathely Hungary
Re: Is it just me or has -current suddenly got massively unstable?
On Mon, 22 Jul 2002, Peter Wemm wrote:

> It might be just me because I swapped an ISA 'si' card for a PCI version, but
> the problems I've been seeing are pretty spectacular.  I'm regularly seeing
> the following panics:
>
> - selwakeup() taking fatal traps (always while running postfix/smtpd,
>   presumably this is happening during the traditional 'select collision'
>   window - the locking looks rather suspect there too).  This killed my box
>   3 times today alone.
>
> eg:
> Fatal trap 12: page fault while in kernel mode
> fault virtual address   = 0xc44a01b4
> fault code              = supervisor write, page not present
> instruction pointer     = 0x8:0xc027f945
> current process         = 4078 (smtpd)
> trap number             = 12

   Same here: 2 panics with a kernel from today while running
   postfix/smtpd.

   Sorry, I have no more info to give for now though

   - yann
Re: Is it just me or has -current suddenly got massively unstable?
* Peter Wemm <[EMAIL PROTECTED]> [020722 00:16] wrote:
> It might be just me because I swapped an ISA 'si' card for a PCI version, but
> the problems I've been seeing are pretty spectacular.  I'm regularly seeing
> the following panics:
>
> - selwakeup() taking fatal traps (always while running postfix/smtpd,
>   presumably this is happening during the traditional 'select collision'
>   window - the locking looks rather suspect there too).  This killed my box
>   3 times today alone.

What's suspect about the locking?

> This is happening on this line:
>
> 1182            if (td == NULL) {
> 1183                    mtx_unlock(&sellock);
> 1184                    return;
> 1185            }
> 1186 >>>HERE>>> TAILQ_REMOVE(&td->td_selq, sip, si_thrlist);
> 1187            sip->si_thread = NULL;
> 1188            mtx_lock_spin(&sched_lock);
> 1189            if (td->td_wchan == (caddr_t)&selwait) {
> 1190                    if (td->td_state == TDS_SLP)
>
> All of these panics have been at this identical location - it isn't random.
> I briefly went looking and I'm wondering if the locking is adequate here.

I was hoping it was, what problem do you see here?  All the
td->td_selq accesses should be protected by the select mutex.

Perhaps selwakeup() is being called on an uninitialized selinfo
structure?  I guess adding some sort of diagnostic checks for
initialized selinfos might help.

Yes, I've been seeing some instability, but not of the magnitude
you're seeing. :)

--
-Alfred Perlstein [[EMAIL PROTECTED]] [#bsdcode/efnet/irc.prison.net]
'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'
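One way such a diagnostic check for initialized selinfos could look is a tag field that is set at initialization and tested before use. This is a userland sketch only; the `si_magic` field, the tag value and the `toy_` names are assumptions made for illustration and do not come from the kernel source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Arbitrary tag value chosen for this sketch (ASCII "SELI"). */
#define TOY_SI_MAGIC 0x53454c49u

/* Hypothetical selinfo with an extra diagnostic field. */
struct toy_selinfo {
	uint32_t si_magic;	/* TOY_SI_MAGIC once initialized */
	void *si_thread;	/* owner, or NULL */
};

/* Initialize a selinfo and stamp it as valid. */
void
toy_selinfo_init(struct toy_selinfo *sip)
{
	sip->si_magic = TOY_SI_MAGIC;
	sip->si_thread = NULL;
}

/* Nonzero if the structure carries the tag written by the init routine. */
int
toy_selinfo_is_initialized(const struct toy_selinfo *sip)
{
	return sip->si_magic == TOY_SI_MAGIC;
}
```

A wakeup routine could test the tag on entry and panic with a clear message instead of faulting later. The check catches zeroed or never-initialized structures; it cannot catch garbage that happens to contain the tag, and it says nothing about whether the owner pointer is still valid.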
Is it just me or has -current suddenly got massively unstable?
It might be just me because I swapped an ISA 'si' card for a PCI version, but
the problems I've been seeing are pretty spectacular.  I'm regularly seeing
the following panics:

- selwakeup() taking fatal traps (always while running postfix/smtpd,
  presumably this is happening during the traditional 'select collision'
  window - the locking looks rather suspect there too).  This killed my box
  3 times today alone.

  eg:
  Fatal trap 12: page fault while in kernel mode
  fault virtual address   = 0xc44a01b4
  fault code              = supervisor write, page not present
  instruction pointer     = 0x8:0xc027f945
  current process         = 4078 (smtpd)
  trap number             = 12
  #10 0xc025ed8b in panic ()
  #11 0xc03758d3 in trap_fatal ()
  #12 0xc03755b2 in trap_pfault ()
  #13 0xc03750dd in trap ()
  #14 0xc027f945 in selwakeup ()
  #15 0xc02953f9 in sowakeup ()
  #16 0xc0294f60 in soisconnected ()
  #17 0xc029a8ad in unp_connect2 ()
  #18 0xc029a803 in unp_connect ()
  #19 0xc02998ee in uipc_connect ()
  #20 0xc0292d8a in soconnect ()
  #21 0xc0296c8e in connect ()
  #22 0xc0375c11 in syscall ()

  This is happening on this line:

  1182            if (td == NULL) {
  1183                    mtx_unlock(&sellock);
  1184                    return;
  1185            }
  1186 >>>HERE>>> TAILQ_REMOVE(&td->td_selq, sip, si_thrlist);
  1187            sip->si_thread = NULL;
  1188            mtx_lock_spin(&sched_lock);
  1189            if (td->td_wchan == (caddr_t)&selwait) {
  1190                    if (td->td_state == TDS_SLP)

  All of these panics have been at this identical location - it isn't random.
  I briefly went looking and I'm wondering if the locking is adequate here.

- random compiler segfaults

- vdrop/vrele panics

  eg:
  panic: vdrop: holdcnt
  #2  0xc026190b in panic () at ../../../kern/kern_shutdown.c:493
  #3  0xc02ae4bb in vdrop (vp=0x0) at ../../../kern/vfs_subr.c:1986
  #4  0xc02a33d9 in cache_zap (ncp=0xc03ce03b)
      at ../../../kern/vfs_cache.c:241
  #5  0xc02a393a in cache_enter (dvp=0xc4196e70, vp=0x0, cnp=0xc5c8c540)
      at ../../../kern/vfs_cache.c:452
  #6  0xc03225e9 in ufs_lookup (ap=0xda6d2ac0)
      at ../../../ufs/ufs/ufs_lookup.c:457
  #7  0xc0328e58 in ufs_vnoperate (ap=0x0)
      at ../../../ufs/ufs/ufs_vnops.c:2739
  #8  0xc02a3d6c in vfs_cache_lookup (ap=0x0) at vnode_if.h:73
  #9  0xc0328e58 in ufs_vnoperate (ap=0x0)
      at ../../../ufs/ufs/ufs_vnops.c:2739
  #10 0xc02a801b in lookup (ndp=0xda6d2c24) at vnode_if.h:48
  #11 0xc02a7a2e in namei (ndp=0xda6d2c24) at ../../../kern/vfs_lookup.c:175
  #12 0xc02b30d2 in lstat (td=0xc5c8c540, uap=0xda6d2d10)
      at ../../../kern/vfs_syscalls.c:1536
  #13 0xc0378be1 in syscall (frame=
        {tf_fs = 47, tf_es = 47, tf_ds = 47, tf_edi = 0,
         tf_esi = -1077943328, tf_ebp = -1077943384, tf_isp = -630379148,
         tf_ebx = -1077943328, tf_edx = -1077943320, tf_ecx = 47,
         tf_eax = 190, tf_trapno = 12, tf_err = 2, tf_eip = 134629535,
         tf_cs = 31, tf_eflags = 518, tf_esp = -1077944580, tf_ss = 47})
      at ../../../i386/i386/trap.c:1049

  I do not have a -g kernel for this one, sorry.  The vdrop(vp=0x0) traceback
  is clearly wrong there though, I'm pretty sure that it is because of the
  missing -g info (gdb knows where the temporary copies are with -g and dwarf2)

- All sorts of other very strange things today.  I missed a few crashdumps
  due to full disk.  I'm getting panics just trying to extract tarballs or
  compiling largish programs.

Has anybody else been running into this?  I've had most of it happen today,
except for two or three selwakeup() panics over the last few days.  The
really bad stuff seemed to start today.  It might be coincidence that today
I also moved that card around.

ie this:
si0 at iomem 0xd8000-0xd irq 12 on isa0
si0: SIHOST2 - no ports found

became this:
si0: port 0x9400-0x947f mem 0xfc10-0xfc10,0xfc112000-0xfc11207f irq 9 at device 9.0 on pci0
si0: card: SXPCI, ports: 8, modules: 1, type: 8

Hmm.  Anyway, has anybody else seen this sort of thing today?

Cheers,
-Peter
--
Peter Wemm - [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
"All of this is for nothing if we don't go to the stars" - JMS/B5