Re: Is it just me or has -current suddenly got massively unstable?

2002-07-23 Thread Julian Elischer


On Tue, 23 Jul 2002, Peter Wemm wrote:

>  
>      thread_zone = uma_zcreate("THREAD", sizeof (struct thread),
>          thread_ctor, thread_dtor, thread_init, thread_fini,
> -        UMA_ALIGN_CACHE, 0);
> +        UMA_ALIGN_CACHE, UMA_ZONE_NOFREE);
>  }
>  
>  /*
> 
> I haven't panicked yet with that change. :-) For some unknown reason,
> selwakeup() is dereferencing pointers to threads that have long gone and
> the backing store has been freed.  The patch above is a bandaid, not a
> solution.  It basically prevents threads ever being freed back to the
> general pool, even though everything here supposedly does not need that.
> (unlike struct proc and socket, for example).


Peter.. this comment in selrecord scared the heck out of me..
---
 /*
  * If the thread is NULL then take ownership of selinfo
  * however if the thread is not NULL and the thread points to
  * someone else, then we have a collision, otherwise leave it alone
  * as we've owned it in a previous selrecord on this selinfo.
  */

---

it suggests that select still doesn't clean up after itself.

looking in select() however I see:
    if (timo > 0)
        error = cv_timedwait_sig(&selwait, &sellock, timo);
    else
        error = cv_wait_sig(&selwait, &sellock);

    if (error == 0)
        goto retry;

done:
    clear_selinfo_list(td);

This suggests that there is no way to exit this function without
clearing the thread pointers but your trace suggests otherwise..
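
For reference, the ownership rule in that comment and the cleanup on the
done: path can be modelled in userland roughly as below. This is only a
sketch under loud assumptions: every toy_* name is made up for
illustration, there is no locking (the real code runs under sellock), and
the collision handling is reduced to a return value. It is not the code
in sys_generic.c.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

struct toy_thread;

/* Toy stand-in for struct selinfo: remembers which thread recorded it. */
struct toy_selinfo {
	TAILQ_ENTRY(toy_selinfo) si_thrlist;	/* linkage on the owner's list */
	struct toy_thread *si_thread;		/* owner, or NULL */
};

/* Toy stand-in for struct thread: head of the selinfos it has recorded. */
struct toy_thread {
	TAILQ_HEAD(, toy_selinfo) td_selq;
};

static void
toy_thread_init(struct toy_thread *td)
{
	TAILQ_INIT(&td->td_selq);
}

/*
 * Record interest in sip on behalf of td.
 * Returns 0 if td owns sip afterwards, 1 if another thread already does.
 */
static int
toy_selrecord(struct toy_thread *td, struct toy_selinfo *sip)
{
	if (sip->si_thread == NULL) {
		/* Nobody owns it yet: take ownership. */
		sip->si_thread = td;
		TAILQ_INSERT_TAIL(&td->td_selq, sip, si_thrlist);
		return (0);
	}
	/* Already ours from an earlier call, or a collision. */
	return (sip->si_thread != td);
}

/* Drop every back-pointer to td, as the done: path is meant to. */
static void
toy_clear_selinfo_list(struct toy_thread *td)
{
	struct toy_selinfo *sip;

	while ((sip = TAILQ_FIRST(&td->td_selq)) != NULL) {
		assert(sip->si_thread == td);
		TAILQ_REMOVE(&td->td_selq, sip, si_thrlist);
		sip->si_thread = NULL;
	}
}
```

If every exit from select() really goes through the equivalent of
toy_clear_selinfo_list(), no selinfo should be left pointing at a thread
that has gone away, which is why the trace is surprising.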


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-current" in the body of the message



Re: Is it just me or has -current suddenly got massively unstable?

2002-07-23 Thread Terry Lambert

Sheldon Hearn wrote:
> On (2002/07/23 12:08), Yann Berthier wrote:
> >Thanks a lot, patch applied, and all is going fine. Peter: I knew you
> >would come up with a solution :)
> >(well, feel free to call it bandaid, but it solves the problem BTW)
> 
> To quote Terry Lambert on what he calls Occam's Corollary:
> 
> Anything that works is better than anything that doesn't.
> 
> :-)

Be really, really careful here.

The reason it works is because it changes the memory to be type
stable, so it gets the previous values, if the structure has
not been reused, and signals a selwakeup() where there is no
one waiting.  If the structure *has* been reused, then it issues
a selwakeup() to a potentially unrelated thread.  In most cases,
this is a harmless event, that's not even being checked for; in
other cases, it's being checked for, and it looks like a bogus
return.
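
A userland sketch of what "type stable" means here, assuming a made-up
fixed pool (struct pool, struct obj and the pool_* functions are
hypothetical and have nothing to do with the UMA API). Slots are recycled
but never returned to the system, so a stale pointer never faults; it
either sees the old contents or whatever the next user put there.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4

/* Hypothetical object; "owner" stands in for whatever identifies the user. */
struct obj {
	int owner;
	struct obj *next_free;
};

/* A fixed pool: slots are recycled but never handed back to the system. */
struct pool {
	struct obj slots[POOL_SIZE];
	struct obj *free_list;
};

static void
pool_init(struct pool *p)
{
	p->free_list = NULL;
	/* Push slots in reverse so that slots[0] is handed out first. */
	for (int i = POOL_SIZE - 1; i >= 0; i--) {
		p->slots[i].owner = -1;
		p->slots[i].next_free = p->free_list;
		p->free_list = &p->slots[i];
	}
}

static struct obj *
pool_alloc(struct pool *p, int owner)
{
	struct obj *o = p->free_list;

	if (o == NULL)
		return (NULL);
	p->free_list = o->next_free;
	o->owner = owner;
	return (o);
}

/* Freeing leaves the old contents in place; the memory stays valid. */
static void
pool_free(struct pool *p, struct obj *o)
{
	assert(o >= p->slots && o < p->slots + POOL_SIZE);
	o->next_free = p->free_list;
	p->free_list = o;
}
```

A pointer kept across pool_free() still reads the old owner until the
slot is reallocated, and reads the new owner afterwards. Neither case
traps, which is the sense in which the patch hides the bug rather than
fixing it.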

Most code that sits in a select loop will only trigger if a bit
is set.  However, it's a perfectly valid thing to think that you
won't get spurious returns -- and write code that *depends* on not
getting spurious returns.
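
For code that would rather not depend on that, a defensive pattern looks
roughly like the sketch below. The function name read_when_ready() is
made up for illustration; the calls it uses are the standard
select()/fcntl()/read() interfaces. The descriptor is switched to
non-blocking mode so that a wakeup with nothing behind it shows up as
EAGAIN rather than as a blocked read().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Wait until fd is readable, then read from it.
 * Returns the number of bytes read (0 on EOF), or -1 with errno set.
 */
static ssize_t
read_when_ready(int fd, void *buf, size_t len)
{
	int flags;

	/* fd_set can only describe descriptors below FD_SETSIZE. */
	assert(fd >= 0 && fd < FD_SETSIZE);

	flags = fcntl(fd, F_GETFL, 0);
	if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
		return (-1);

	for (;;) {
		fd_set rfds;
		ssize_t n;

		FD_ZERO(&rfds);
		FD_SET(fd, &rfds);
		if (select(fd + 1, &rfds, NULL, NULL, NULL) == -1) {
			if (errno == EINTR)
				continue;
			return (-1);
		}
		if (!FD_ISSET(fd, &rfds))
			continue;	/* woke up, but not for this descriptor */

		n = read(fd, buf, len);
		if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK ||
		    errno == EINTR))
			continue;	/* nothing there after all: wait again */
		return (n);
	}
}
```

Code written this way treats a bogus wakeup as a no-op. Code that does a
blocking read() straight after select() returns is the kind that can be
hurt by one.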

Since I've only been following this vs. -current by reading,
rather than running, source code, and reading, rather than
applying patches, this is just my initial reaction to the patch.

So take the following with a grain of salt...

On the other hand: there is a *real* problem here; again, from
just reading the code, it looks like a pretty deep one having to
do with events being things which happen *on* descriptors, rather
than *to* processes (or threads).

I expect that the problem is that a thread has been terminated,
and it is the thread which opened a socket, and then did the
listen on it, but isn't around to do the accept, or receive the
connection event.

It's a deep problem because descriptors belong to processes, not
threads, and events belong to the descriptors, not to the callers;
before KSEs, it was OK to treat it as a commutative property.

I rather expect that there is a similar panic that will show up
during stress testing, which will occur at NETISR on incoming
connections, in the bottom half of the "accept" code, which has
a similar looking selwakeup() call.

Probably, the only way to fix this is to make it a process event
rather than a thread event, which would avoid the list removal
and subsequent dereference.  Kind of an ugly kludge.  8-(.

It would not surprise me if the kevent() resulting from signals
is near the heart of the signal problem, as well, and has a
parallel basis.

-- Terry




Re: Is it just me or has -current suddenly got massively unstable?

2002-07-23 Thread Sheldon Hearn

On (2002/07/23 12:08), Yann Berthier wrote:

>Thanks a lot, patch applied, and all is going fine. Peter: I knew you
>would come up with a solution :) 
>(well, feel free to call it bandaid, but it solves the problem BTW)

To quote Terry Lambert on what he calls Occam's Corollary:

Anything that works is better than anything that doesn't.

:-)

Ciao,
Sheldon.




Re: Is it just me or has -current suddenly got massively unstable?

2002-07-23 Thread Yann Berthier

On Tue, 23 Jul 2002, Peter Wemm wrote:


   [snip]

> Thanks for the independent confirmation.  Here's a workaround patch
> that you might like to try:
> 
> --- kern_thread.c   17 Jul 2002 23:43:55 -  1.8
> +++ kern_thread.c   22 Jul 2002 23:31:06 -
> @@ -198,7 +198,7 @@
>  
>      thread_zone = uma_zcreate("THREAD", sizeof (struct thread),
>          thread_ctor, thread_dtor, thread_init, thread_fini,
> -        UMA_ALIGN_CACHE, 0);
> +        UMA_ALIGN_CACHE, UMA_ZONE_NOFREE);
>  }
>  
>  /*
> 
> I haven't panicked yet with that change. :-) For some unknown reason,
> selwakeup() is dereferencing pointers to threads that have long gone and
> the backing store has been freed.  The patch above is a bandaid, not a
> solution.  It basically prevents threads ever being freed back to the
> general pool, even though everything here supposedly does not need that.
> (unlike struct proc and socket, for example).

   Thanks a lot, patch applied, and all is going fine. Peter: I knew you
   would come up with a solution :) 
   (well, feel free to call it bandaid, but it solves the problem BTW)

   - yann




Re: Is it just me or has -current suddenly got massively unstable?

2002-07-23 Thread Sheldon Hearn

On (2002/07/22 18:48), Szilveszter Adam wrote:

> I have a kernel and world from Saturday, it seems reasonably ok in
> console mode (does not panic although it is used as an ADSL router) but
> in X, it locks up very easily. I tried it with Mozilla on Sunday, it
> froze twice within as many hours, [...]

As a datapoint, yesterday's -CURRENT was sufficiently stable on a UP
system to perform a full local ports build and install (117 ports
including X and mozilla), while also serving as an X workstation with
pretty heavily-used mozilla and CVS clients.

Ciao,
Sheldon.




Re: Is it just me or has -current suddenly got massively unstable?

2002-07-22 Thread Peter Wemm

Yann Berthier wrote:
> On Mon, 22 Jul 2002, Peter Wemm wrote:
> 
> > It might be just me because I swapped an ISA 'si' card for a PCI version, but
> > the problems I've been seeing are pretty spectacular.  I'm regularly seeing
> > the following panics:
> > 
> > - selwakeup() taking fatal traps (always while running postfix/smtpd,
> > presumably this is happening during the traditional 'select collision'
> > window - the locking looks rather suspect there too).  This killed my box
> > 3 times today alone.
> > 
> > eg:
> > Fatal trap 12: page fault while in kernel mode
> > fault virtual address   = 0xc44a01b4
> > fault code  = supervisor write, page not present
> > instruction pointer = 0x8:0xc027f945
> > current process = 4078 (smtpd)
> > trap number = 12
> 
>Same here: 2 panics with a kernel from today while running
>postfix/smtpd.
> 
>Sorry, I have no more info to give for now though

Thanks for the independent confirmation.  Here's a workaround patch
that you might like to try:

--- kern_thread.c   17 Jul 2002 23:43:55 -  1.8
+++ kern_thread.c   22 Jul 2002 23:31:06 -
@@ -198,7 +198,7 @@
 
     thread_zone = uma_zcreate("THREAD", sizeof (struct thread),
         thread_ctor, thread_dtor, thread_init, thread_fini,
-        UMA_ALIGN_CACHE, 0);
+        UMA_ALIGN_CACHE, UMA_ZONE_NOFREE);
 }
 
 /*

I haven't panicked yet with that change. :-) For some unknown reason,
selwakeup() is dereferencing pointers to threads that have long gone and
the backing store has been freed.  The patch above is a bandaid, not a
solution.  It basically prevents threads ever being freed back to the
general pool, even though everything here supposedly does not need that.
(unlike struct proc and socket, for example).

peter@overcee[11:57pm]/home/crash-105# gdb -k kernel.12 vmcore.12
...
Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0xc29b0634
fault code  = supervisor write, page not present
instruction pointer = 0x8:0xc0257755
current process = 1411 (smtpd)
...
(kgdb) l *0xc0257755
0xc0257755 is in selwakeup (../../../kern/sys_generic.c:1186).
1181    }
1182    if (td == NULL) {
1183        mtx_unlock(&sellock);
1184        return;
1185    }
1186    TAILQ_REMOVE(&td->td_selq, sip, si_thrlist);
1187    sip->si_thread = NULL;
1188    mtx_lock_spin(&sched_lock);
1189    if (td->td_wchan == (caddr_t)&selwait) {
1190        if (td->td_state == TDS_SLP)

#5  0xc034c68d in trap (frame=
      {tf_fs = -1069613032, tf_es = 16, tf_ds = -1070006256, tf_edi = 0,
       tf_esi = -1034848204, tf_ebp = -630072692, tf_isp = -630072736,
       tf_ebx = -1030027776, tf_edx = -1030911744, tf_ecx = 1,
       tf_eax = -1030027728, tf_trapno = 12, tf_err = 2,
       tf_eip = -1071286443, tf_cs = 8, tf_eflags = 66118,
       tf_esp = -1069571036, tf_ss = 0})
    at ../../../i386/i386/trap.c:445
#6  0xc0257755 in selwakeup (sip=0xc2517834)
    at ../../../kern/sys_generic.c:1186
#7  0xc026d249 in sowakeup (so=0xc25177d0, sb=0xc251781c)
    at ../../../kern/uipc_socket2.c:300
#8  0xc026cdb0 in soisconnected (so=0xc2750bb8)
    at ../../../kern/uipc_socket2.c:132
#9  0xc02726fd in unp_connect2 (so=0xc30a3190, so2=0xc2750bb8)
    at ../../../kern/uipc_usrreq.c:769
#10 0xc0272653 in unp_connect (so=0xc30a3190, nam=0xc4359d00, td=0xc30a3190)
    at ../../../kern/uipc_usrreq.c:737
#11 0xc027173e in uipc_connect (so=0x0, nam=0x0, td=0xc28d8900)
    at ../../../kern/uipc_usrreq.c:161
#12 0xc026abda in soconnect (so=0xc263c630, nam=0x0, td=0x0)
    at ../../../kern/uipc_socket.c:429
#13 0xc026eade in connect (td=0xc30a3190, uap=0xc2750bb8)
    at ../../../kern/uipc_syscalls.c:441
#14 0xc034d1c1 in syscall (frame=
      {tf_fs = 47, tf_es = 47, tf_ds = 47, tf_edi = 11, tf_esi = 0,
       tf_ebp = -1077938236, tf_isp = -630071948, tf_ebx = 134708840,
       tf_edx = -1077938342, tf_ecx = 0, tf_eax = 98, tf_trapno = 22,
       tf_err = 2, tf_eip = 671906955, tf_cs = 31, tf_eflags = 663,
       tf_esp = -1077938408, tf_ss = 47})
    at ../../../i386/i386/trap.c:1049

I've checked the page tables, it is indeed unmapped.

Also note that this is in the guts of the unix domain socket code. :-]

(kgdb) peter@overcee[11:58pm]/home/crash-110# gdb -k kernel.10 vmcore.10
Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0xc44a01b4
fault code  = supervisor write, page not present
instruction pointer = 0x8:0xc027f945
current process = 4078 (smtpd)
[..]
#13 0xc03750dd in trap ()
#14 0xc027f945 in selwakeup ()
#15 0xc02953f9 in sowakeup ()
#16 0xc0294f60 in soisconnected ()
#17 0xc029a8ad in unp_connect2 ()
#18 0xc029a803 in unp_connect ()
#19 0xc02998ee in uipc_connect ()
#20 0xc0292d8a in soconnect ()
#21 0xc0296c8e in connect ()
#22 0xc0375c11 in syscall ()

Interestingly, the stack trace is identical on both of these that I managed
to capture.

Cheers,
-Peter
--
Pe

Re: Is it just me or has -current suddenly got massively unstable?

2002-07-22 Thread Szilveszter Adam

Hello,

I have a kernel and world from Saturday, it seems reasonably ok in
console mode (does not panic although it is used as an ADSL router) but
in X, it locks up very easily. I tried it with Mozilla on Sunday, it
froze twice within as many hours, in a seemingly nondeterministic manner.
Unfortunately, the lockups leave the machine unable to give any useful
info, and it does not reboot by itself either. As said, I can do even
demanding things on the console (eg building Mozilla, generating the PR
stats page during the website build etc).

Additional detail: My previous kernel was from the 18th, with the
"bandaid" fix to pmap.c, and that one did not lock up under X, at least
not during my testing.

So, yes, I am seeing instability, but only under X.
-- 
Regards:

Szilveszter ADAM
Szombathely Hungary




Re: Is it just me or has -current suddenly got massively unstable?

2002-07-22 Thread Yann Berthier

On Mon, 22 Jul 2002, Peter Wemm wrote:

> It might be just me because I swapped an ISA 'si' card for a PCI version, but
> the problems I've been seeing are pretty spectacular.  I'm regularly seeing
> the following panics:
> 
> - selwakeup() taking fatal traps (always while running postfix/smtpd,
> presumably this is happening during the traditional 'select collision'
> window - the locking looks rather suspect there too).  This killed my box
> 3 times today alone.
> 
> eg:
> Fatal trap 12: page fault while in kernel mode
> fault virtual address   = 0xc44a01b4
> fault code  = supervisor write, page not present
> instruction pointer = 0x8:0xc027f945
> current process = 4078 (smtpd)
> trap number = 12

   Same here: 2 panics with a kernel from today while running
   postfix/smtpd.

   Sorry, I have no more info to give for now though
   
   - yann 




Re: Is it just me or has -current suddenly got massively unstable?

2002-07-22 Thread Alfred Perlstein

* Peter Wemm <[EMAIL PROTECTED]> [020722 00:16] wrote:
> It might be just me because I swapped an ISA 'si' card for a PCI version, but
> the problems I've been seeing are pretty spectacular.  I'm regularly seeing
> the following panics:
> 
> - selwakeup() taking fatal traps (always while running postfix/smtpd,
> presumably this is happening during the traditional 'select collision'
> window - the locking looks rather suspect there too).  This killed my box
> 3 times today alone.

What's suspect about the locking?

> This is happening on this line:
> 
> 1182            if (td == NULL) {
> 1183                mtx_unlock(&sellock);
> 1184                return;
> 1185            }
> 1186 >>>HERE>>> TAILQ_REMOVE(&td->td_selq, sip, si_thrlist);
> 1187            sip->si_thread = NULL;
> 1188            mtx_lock_spin(&sched_lock);
> 1189            if (td->td_wchan == (caddr_t)&selwait) {
> 1190                if (td->td_state == TDS_SLP)
> 
> All of these panics have been at this identical location - it isn't random.
> I briefly went looking and I'm wondering if the locking is adequate here.

I was hoping it was, what problem do you see here?  All the td->td_selq
accesses should be protected by the select mutex.  Perhaps selwakeup()
is being called on an uninitialized selinfo structure?  I guess adding
some sort of diagnostic checks for initialized selinfos might help.
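
One cheap form such a check could take is a magic number set at init time
and tested on the wakeup path. The sketch below is userland only and
entirely hypothetical: struct diag_selinfo, DIAG_SI_MAGIC and the diag_*
functions are made-up names, and the real struct selinfo has no such
field.

```c
#include <assert.h>
#include <stddef.h>

#define DIAG_SI_MAGIC 0x53454c49u	/* arbitrary tag, "SELI" in ASCII */

/* Hypothetical selinfo with a tag that only the init routine sets. */
struct diag_selinfo {
	unsigned int si_magic;
	void *si_thread;	/* owning thread, or NULL */
};

static void
diag_selinfo_init(struct diag_selinfo *sip)
{
	assert(sip != NULL);
	sip->si_magic = DIAG_SI_MAGIC;
	sip->si_thread = NULL;
}

/* Returns non-zero if sip went through diag_selinfo_init(). */
static int
diag_selinfo_initialized(const struct diag_selinfo *sip)
{
	return (sip != NULL && sip->si_magic == DIAG_SI_MAGIC);
}

/*
 * Wakeup path: refuse to touch a selinfo that was never set up.
 * Returns 0 on success, -1 if the check fails (where a kernel would
 * more likely assert or panic).
 */
static int
diag_selwakeup(struct diag_selinfo *sip)
{
	if (!diag_selinfo_initialized(sip))
		return (-1);
	sip->si_thread = NULL;
	return (0);
}
```

This catches zeroed or never-initialized structures. It would not catch a
stale si_thread pointer in a properly initialized selinfo, and random
garbage could in principle match the tag by chance.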

Yes, I've been seeing some instability, but not of the magnitude you're
seeing. :)

-- 
-Alfred Perlstein [[EMAIL PROTECTED]] [#bsdcode/efnet/irc.prison.net]
'Instead of asking why a piece of software is using "1970s technology,"
 start asking why software is ignoring 30 years of accumulated wisdom.'




Is it just me or has -current suddenly got massively unstable?

2002-07-21 Thread Peter Wemm

It might be just me because I swapped an ISA 'si' card for a PCI version, but
the problems I've been seeing are pretty spectacular.  I'm regularly seeing
the following panics:

- selwakeup() taking fatal traps (always while running postfix/smtpd,
presumably this is happening during the traditional 'select collision'
window - the locking looks rather suspect there too).  This killed my box
3 times today alone.

eg:
Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0xc44a01b4
fault code  = supervisor write, page not present
instruction pointer = 0x8:0xc027f945
current process = 4078 (smtpd)
trap number = 12

#10 0xc025ed8b in panic ()
#11 0xc03758d3 in trap_fatal ()
#12 0xc03755b2 in trap_pfault ()
#13 0xc03750dd in trap ()
#14 0xc027f945 in selwakeup ()
#15 0xc02953f9 in sowakeup ()
#16 0xc0294f60 in soisconnected ()
#17 0xc029a8ad in unp_connect2 ()
#18 0xc029a803 in unp_connect ()
#19 0xc02998ee in uipc_connect ()
#20 0xc0292d8a in soconnect ()
#21 0xc0296c8e in connect ()
#22 0xc0375c11 in syscall ()

This is happening on this line:

1182            if (td == NULL) {
1183                mtx_unlock(&sellock);
1184                return;
1185            }
1186 >>>HERE>>> TAILQ_REMOVE(&td->td_selq, sip, si_thrlist);
1187            sip->si_thread = NULL;
1188            mtx_lock_spin(&sched_lock);
1189            if (td->td_wchan == (caddr_t)&selwait) {
1190                if (td->td_state == TDS_SLP)

All of these panics have been at this identical location - it isn't random.
I briefly went looking and I'm wondering if the locking is adequate here.

- random compiler segfaults

- vdrop/vrele panics

eg:

panic: vdrop: holdcnt

#2  0xc026190b in panic () at ../../../kern/kern_shutdown.c:493
#3  0xc02ae4bb in vdrop (vp=0x0) at ../../../kern/vfs_subr.c:1986
#4  0xc02a33d9 in cache_zap (ncp=0xc03ce03b) at ../../../kern/vfs_cache.c:241
#5  0xc02a393a in cache_enter (dvp=0xc4196e70, vp=0x0, cnp=0xc5c8c540)
    at ../../../kern/vfs_cache.c:452
#6  0xc03225e9 in ufs_lookup (ap=0xda6d2ac0)
    at ../../../ufs/ufs/ufs_lookup.c:457
#7  0xc0328e58 in ufs_vnoperate (ap=0x0) at ../../../ufs/ufs/ufs_vnops.c:2739
#8  0xc02a3d6c in vfs_cache_lookup (ap=0x0) at vnode_if.h:73
#9  0xc0328e58 in ufs_vnoperate (ap=0x0) at ../../../ufs/ufs/ufs_vnops.c:2739
#10 0xc02a801b in lookup (ndp=0xda6d2c24) at vnode_if.h:48
#11 0xc02a7a2e in namei (ndp=0xda6d2c24) at ../../../kern/vfs_lookup.c:175
#12 0xc02b30d2 in lstat (td=0xc5c8c540, uap=0xda6d2d10)
    at ../../../kern/vfs_syscalls.c:1536
#13 0xc0378be1 in syscall (frame=
      {tf_fs = 47, tf_es = 47, tf_ds = 47, tf_edi = 0, tf_esi = -1077943328,
       tf_ebp = -1077943384, tf_isp = -630379148, tf_ebx = -1077943328,
       tf_edx = -1077943320, tf_ecx = 47, tf_eax = 190, tf_trapno = 12,
       tf_err = 2, tf_eip = 134629535, tf_cs = 31, tf_eflags = 518,
       tf_esp = -1077944580, tf_ss = 47})
    at ../../../i386/i386/trap.c:1049

I do not have a -g kernel for this one, sorry.  The vdrop(vp=0x0) traceback
is clearly wrong there, though; I'm pretty sure that is because
of the missing -g info (gdb knows where the temporary copies are with
-g and dwarf2).

- All sorts of other very strange things today.  I missed a few crashdumps
due to full disk.  I'm getting panics just trying to extract tarballs or
compiling largish programs.

Has anybody else been running into this?  I've had most of it happen today,
except for two or three selwakeup() panics over the last few days. The
really bad stuff seemed to start today.  It might be coincidence that today
I also moved that card around.

ie this:
si0 at iomem 0xd8000-0xd irq 12 on isa0
si0: SIHOST2 - no ports found

became this:
si0:  port 0x9400-0x947f mem 0xfc10-0xfc10,0xfc112000-0xfc11207f irq 9 at device 9.0 on pci0
si0: card: SXPCI, ports: 8, modules: 1, type: 8

Hmm.

Anyway, has anybody else seen this sort of thing today?

Cheers,
-Peter
--
Peter Wemm - [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
"All of this is for nothing if we don't go to the stars" - JMS/B5

