Re: RELENG_6 panic under heavy load

2006-12-12 Thread Dmitriy Kirhlarov
On Thu, Dec 07, 2006 at 11:18:52AM +0800, David Xu wrote:
 On Thursday 16 November 2006 19:15, Gleb Smirnoff wrote:
  On Thu, Nov 16, 2006 at 01:24:36PM +0300, Gleb Smirnoff wrote:
  T   I wonder why UMA was suspected to be the problem. Dima gave
  T me access to the core. Here are more details from the trace:
 
  It looks like a race between two threads in one process. Look here:
 
 Can you try the patch ?
 http://people.freebsd.org/~davidxu/patch/ksegrp_preempt.patch 

I've tested it. This patch also works, but with slightly different
behaviour. With the patch from jhb@ I got a load average (LA) of 7-8;
with this patch I get LA 5-6, the same as on the unpatched system.
However, the system seems less interactive than with the jhb@ patch.

WBR
Dmitriy
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: RELENG_6 panic under heavy load

2006-12-12 Thread David Xu
On Tuesday 12 December 2006 20:22, Dmitriy Kirhlarov wrote:
 On Thu, Dec 07, 2006 at 11:18:52AM +0800, David Xu wrote:
  On Thursday 16 November 2006 19:15, Gleb Smirnoff wrote:
   On Thu, Nov 16, 2006 at 01:24:36PM +0300, Gleb Smirnoff wrote:
   T   I wonder why UMA was suspected to be the problem. Dima gave
   T me access to the core. Here are more details from the trace:
  
   It looks like a race between two threads in one process. Look here:
 
  Can you try the patch ?
  http://people.freebsd.org/~davidxu/patch/ksegrp_preempt.patch

 I've tested it. This patch works also, but with a little bit different
 behaviour. With patch from jhb@ I got LA 7-8, with this patch I have
 LA 5-6, same as on unpatched system. But it seems to me, that system
 is less interactive, compared to jhb@ patch.

 WBR
 Dmitriy

jhb's patch is incomplete: it implies that every place where a thread
makes a state transition and wakes another thread up should be patched.
There is other unpatched code in kern_sig.c, and there may be other
places I don't know about. Instead, maybe_preempt_in_ksegrp should be
synced with maybe_preempt, which should fix all of these problems. The
lower LA you have seen compared with jhb's patch might be in the nature
of KSEGRP, but I am not sure. If your program forces all threads to be
system-scope, it might fix the problem.

David Xu


Re: RELENG_6 panic under heavy load

2006-12-12 Thread Dmitriy Kirhlarov
On Tue, Dec 12, 2006 at 08:49:21PM +0800, David Xu wrote:

  I've tested it. This patch works also, but with a little bit different
  behaviour. With patch from jhb@ I got LA 7-8, with this patch I have
  LA 5-6, same as on unpatched system. But it seems to me, that system
  is less interactive, compared to jhb@ patch.
 
 jhb patch is incomplete, it implies that every place a thread is doing state
 transition and waking another thread up should be patched, there is
 other code in kern_sig.c unpatched, though I don't know other places,
 but the code maybe_preempt_in_ksegrp should be synced with
 maybe_preempt, it should fix all problems. the LA you have seen is lower
 than jhb might be a nature of KSEGRP, but I am not sure, if you program
 forces all threads to be system-scope, it might fix the problem. 

I think kern/105464 can be closed once your patch is committed.

WBR.
Dmitriy


Re: RELENG_6 panic under heavy load

2006-12-08 Thread Dmitriy Kirhlarov
On Wed, Dec 06, 2006 at 12:09:39PM -0500, John Baldwin wrote:

  ...) and here is something difficult to understand, when $poll tries to
  make $fork runnable, while $fork is trying to put itself in the turnstile
  that is owned by $poll
 
 Hmm.  I'm guessing the problem is the $poll thread is suspended (not exited) 
 while holding the proc lock?  That would appear to be the problem.  That 
 thread can't run again to release the lock.  Ah, yes, I see the bug.  
 Something like this should fix it:
 
 Index: kern_thread.c
 ===
 RCS file: /usr/cvs/src/sys/kern/kern_thread.c,v
 retrieving revision 1.216.2.6
 diff -u -r1.216.2.6 kern_thread.c
 --- kern_thread.c   2 Sep 2006 17:29:57 -   1.216.2.6
 +++ kern_thread.c   6 Dec 2006 17:06:26 -
 @@ -969,7 +969,9 @@
     TAILQ_REMOVE(&p->p_suspended, td, td_runq);
     TD_CLR_SUSPENDED(td);
     p->p_suspcount--;
 +   critical_enter();
     setrunnable(td);
 +   critical_exit();
  }
 
  /*
 
 What this does is force setrunnable() to be in a nested critical section so 
 we 
 won't preempt during setrunnable() until either the caller of 
 thread_unsuspend_one() eventually releases sched_lock, or, in the case you 
 ran into, the thread does a PROC_UNLOCK() and calls mi_switch().

lbsd02# uptime
9:46AM  up 22:45, 2 users, load averages: 7.50, 6.59, 6.32

It works. Thank you.
Without your patch the maximum uptime was 9 hours.

I'm planning to test David's patch at the weekend.

WBR
Dmitriy


Re: RELENG_6 panic under heavy load

2006-12-06 Thread John Baldwin
On Thursday 16 November 2006 11:09, Gleb Smirnoff wrote:
 On Thu, Nov 16, 2006 at 02:15:25PM +0300, Gleb Smirnoff wrote:
 T On Thu, Nov 16, 2006 at 01:24:36PM +0300, Gleb Smirnoff wrote:
 T T   I wonder why UMA was suspected to be the problem. Dima gave
 T T me access to the core. Here are more details from the trace:
 
 And even more:
 
 (kgdb) thread 133
 [Switching to thread 133 (Thread 100147)]#0  sched_switch (td=0xd745c900, 
newtd=0xd51f7a80, flags=2) at /usr/src/sys/kern/sched_4bsd.c:980
 980 sched_lock.mtx_lock = (uintptr_t)td;
 (kgdb) frame 9
 #9  0xd07a6e16 in syscall (frame=
   {tf_fs = 134938683, tf_es = 59, tf_ds = -809566149, tf_edi = 
134997504, tf_esi = 134998528, tf_ebp = -813707944, tf_isp = -170046108, 
tf_ebx = 672261300, tf_edx = 0, tf_ecx = 134969072, tf_eax = 1, tf_trapno = 
0, tf_err = 2, tf_eip = 672832335, tf_cs = 51, tf_eflags = 646, tf_esp 
= -813707972, tf_ss = 59})
 at /usr/src/sys/i386/i386/trap.c:1034
 1034            userret(td, frame, sticks);
 (kgdb) p *callp
 $92 = {sy_narg = 65539, sy_call = 0xd0630550 <poll>, sy_auevent = 43012}
 
 (kgdb) set $poll = (struct thread *)0xd745c900
 (kgdb) set $fork = (struct thread *)0xd59aad80
 
 (kgdb) p $poll->td_state
 $93 = TDS_INHIBITED
 (kgdb) p $poll->td_inhibitors
 $94 = 1 == TDI_SUSPENDED
 (kgdb) p/x $poll->td_flags
 $96 = 0x1010c01 == TDF_BORROWING | TDF_BOUNDARY | TDF_ASTPENDING | 
 TDF_NEEDRESCHED | TDF_SCHED0
 (kgdb) p $fork->td_state
 $97 = TDS_INHIBITED
 (kgdb) p $fork->td_inhibitors
 $98 = 8 == TDI_LOCK
 (kgdb) p/x $fork->td_flags 
 $99 = 0x100 == TDF_SCHED0
 
 Not everything clear yet, but looks like:
 
 1) $fork thread obtains proc lock
 2) $poll thread blocks on proc lock
 3) $fork thread has suspended the $poll thread in thread_single()
 4) $fork thread temporarily unlocks proc lock (line 821) and is
preempted by $poll thread
 5) $poll thread obtains proc lock, and starts doing its poll job
 6) $fork thread blocks on proc lock, and is added to its turnstile
 7) $poll thread drops the proc lock, but isn't preempted by $fork
 8) $poll thread exits and is preempted by $fork
 
 ...) and here is something difficult to understand, when $poll tries to
 make $fork runnable, while $fork is trying to put itself in the turnstile
 that is owned by $poll

Hmm.  I'm guessing the problem is the $poll thread is suspended (not exited) 
while holding the proc lock?  That would appear to be the problem.  That 
thread can't run again to release the lock.  Ah, yes, I see the bug.  
Something like this should fix it:

Index: kern_thread.c
===
RCS file: /usr/cvs/src/sys/kern/kern_thread.c,v
retrieving revision 1.216.2.6
diff -u -r1.216.2.6 kern_thread.c
--- kern_thread.c   2 Sep 2006 17:29:57 -   1.216.2.6
+++ kern_thread.c   6 Dec 2006 17:06:26 -
@@ -969,7 +969,9 @@
    TAILQ_REMOVE(&p->p_suspended, td, td_runq);
    TD_CLR_SUSPENDED(td);
    p->p_suspcount--;
+   critical_enter();
    setrunnable(td);
+   critical_exit();
 }

 /*

What this does is force setrunnable() to be in a nested critical section so we 
won't preempt during setrunnable() until either the caller of 
thread_unsuspend_one() eventually releases sched_lock, or, in the case you 
ran into, the thread does a PROC_UNLOCK() and calls mi_switch().
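
The deferral described above can be modelled in userspace. The following is
only a minimal sketch under stated assumptions, not the kernel code: the
struct and the model_* functions are hypothetical names, holding sched_lock is
counted as one level of nesting, and a "switch" is just a counter increment.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for the thread fields used here. */
struct model_thread {
	int  critnest;    /* critical-section nesting depth */
	bool owepreempt;  /* a preemption was requested but deferred */
	int  switches;    /* how many times this thread was switched out */
};

void
model_critical_enter(struct model_thread *td)
{
	td->critnest++;
}

/* Leaving the outermost section performs any deferred preemption. */
void
model_critical_exit(struct model_thread *td)
{
	if (td->critnest == 1 && td->owepreempt) {
		td->owepreempt = false;
		td->switches++;
	}
	td->critnest--;
}

/*
 * Assumed to be called with the scheduler lock held, which this model
 * counts as nesting level 1. Any deeper nesting defers the switch.
 */
bool
model_maybe_preempt(struct model_thread *cur)
{
	if (cur->critnest > 1) {
		cur->owepreempt = true;
		return (false);
	}
	cur->switches++;
	return (true);
}

/* Model of the patched wake-up: returns true if it switched at once. */
bool
model_unsuspend_patched(struct model_thread *cur)
{
	bool switched;

	model_critical_enter(cur);
	switched = model_maybe_preempt(cur);
	model_critical_exit(cur);
	return (switched);
}
```

Starting from a thread with critnest = 1, model_unsuspend_patched() returns
false and leaves owepreempt set; the switch is only counted when a later
model_critical_exit() drops the nesting to zero, which stands in for
releasing sched_lock. It does not model the PROC_UNLOCK()/mi_switch() case.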

-- 
John Baldwin


Re: RELENG_6 panic under heavy load

2006-12-06 Thread David Xu
On Thursday 16 November 2006 19:15, Gleb Smirnoff wrote:
 On Thu, Nov 16, 2006 at 01:24:36PM +0300, Gleb Smirnoff wrote:
 T   I wonder why UMA was suspected to be the problem. Dima gave
 T me access to the core. Here are more details from the trace:

 It looks like a race between two threads in one process. Look here:

 (kgdb) frame 12
 #12 0xd05f4fc1 in _mtx_lock_sleep (m=0xd5dd5498, tid=3583683968, opts=0,
 file=0x12 <Address 0x12 out of bounds>, line=18) at
 /usr/src/sys/kern/kern_mutex.c:579
 579 turnstile_wait(&m->mtx_object, mtx_owner(m));
 (kgdb) p *m
 $10 = {mtx_object = {lo_class = 0xd084e224, lo_name = 0xd080508c "process
 lock", lo_type = 0xd080508c "process lock", lo_flags = 4390912, lo_list = {
 tqe_next = 0xd5dd56b0, tqe_prev = 0xd5dd5290}, lo_witness = 0xd088a100},
 mtx_lock = 3611674882, mtx_recurse = 0}
 (kgdb) p ((struct thread *)tid)
 $15 = (struct thread *) 0xd59aad80
 (kgdb) p ((struct thread *)(m->mtx_lock & ~(0x1 | 0x2)))
 $17 = (struct thread *) 0xd745c900
 (kgdb) p ((struct thread *)(m->mtx_lock & ~(0x1 | 0x2)))->td_proc
 $18 = (struct proc *) 0xd5dd5430
 (kgdb) p ((struct thread *)tid)->td_proc
 $19 = (struct proc *) 0xd5dd5430

 So, we see that one thread blocks on the lock that is held by an
 other thread of the same process. Here they are:

 * 134 Thread 100198 (PID=47872: nagios)  doadump () at pcpu.h:165
   133 Thread 100147 (PID=47872: nagios)  sched_switch (td=0xd745c900,
 newtd=0xd51f7a80, flags=2) at /usr/src/sys/kern/sched_4bsd.c:980

 Let's look at the second one:

 (kgdb) thread 133
 [Switching to thread 133 (Thread 100147)]#0  sched_switch (td=0xd745c900,
 newtd=0xd51f7a80, flags=2) at /usr/src/sys/kern/sched_4bsd.c:980
 980 sched_lock.mtx_lock = (uintptr_t)td;
 (kgdb) bt
 #0  sched_switch (td=0xd745c900, newtd=0xd51f7a80, flags=2) at
 /usr/src/sys/kern/sched_4bsd.c:980
 #1  0xd0607f46 in mi_switch (flags=2, newtd=0x0) at
 /usr/src/sys/kern/kern_synch.c:420
 #2  0xd0615ecf in maybe_preempt_in_ksegrp (td=0xd59aad80) at kern_switch.c:467
 #3  0xd06160c8 in setrunqueue (td=0xd59aad80, flags=0) at kern_switch.c:585
 #4  0xd06151e7 in sched_wakeup (td=0xd59aad80) at
 /usr/src/sys/kern/sched_4bsd.c:996
 #5  0xd0608025 in setrunnable (td=0xd59aad80) at
 /usr/src/sys/kern/kern_synch.c:483
 #6  0xd060d78e in thread_unsuspend_one (td=0xd59aad80) at
 /usr/src/sys/kern/kern_thread.c:972
 #7  0xd060d584 in thread_suspend_check (return_instead=0) at
 /usr/src/sys/kern/kern_thread.c:935
 #8  0xd0628a88 in userret (td=0xd745c900, frame=0xf5dd4d38, oticks=1) at
 /usr/src/sys/kern/subr_trap.c:116
 #9  0xd07a6e16 in syscall (frame=
   {tf_fs = 134938683, tf_es = 59, tf_ds = -809566149, tf_edi =
 134997504, tf_esi = 134998528, tf_ebp = -813707944, tf_isp = -170046108,
 tf_ebx = 672261300, tf_edx = 0, tf_ecx = 134969072, tf_eax = 1, tf_trapno =
 0, tf_err = 2, tf_eip = 672832335, tf_cs = 51, tf_eflags = 646, tf_esp =
 -813707972, tf_ss = 59}) at /usr/src/sys/i386/i386/trap.c:1034
 #10 0xd078f38f in Xint0x80_syscall () at
 /usr/src/sys/i386/i386/exception.s:200

maybe_preempt_in_ksegrp is broken: it should do the same kind of checks
that maybe_preempt() does, since some special cases should prevent
preemption. I believe this will not be a problem on -CURRENT after
Julian's ksegrp removal yesterday. The problem is not in the thread
suspension code.
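
The kind of guard meant here can be illustrated with a small userspace
predicate. This is a hedged sketch only: the message does not list the checks,
so the three conditions below (priority, inhibited current thread, nested
critical section) are assumptions chosen for illustration, and the struct and
function names are hypothetical.

```c
#include <stdbool.h>

/* Inhibitor bits with the values printed in the kgdb session (1 and 8). */
#define TDI_SUSPENDED 0x1
#define TDI_LOCK      0x8

/* Hypothetical, reduced view of a thread for this sketch. */
struct sketch_thread {
	int priority;    /* lower value means higher priority */
	int inhibitors;  /* non-zero: the thread is not allowed to run */
	int critnest;    /* critical-section nesting depth */
};

/*
 * Decide whether the current thread should be switched out in favour of
 * a thread that has just been made runnable.
 */
bool
sketch_should_preempt(const struct sketch_thread *cur,
    const struct sketch_thread *woken)
{
	/* No point preempting for a thread of equal or lower priority. */
	if (woken->priority >= cur->priority)
		return (false);
	/*
	 * Assumed special case: a current thread that is already
	 * suspended or blocked must not be switched out here, because
	 * nothing would make it runnable again.
	 */
	if (cur->inhibitors != 0)
		return (false);
	/* Assumed special case: nested critical section, defer instead. */
	if (cur->critnest > 1)
		return (false);
	return (true);
}
```

With cur.inhibitors = TDI_SUSPENDED, as for the poll thread in the core dump,
the predicate refuses to preempt even though the woken thread has the better
priority.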

Regards,
David Xu


Re: RELENG_6 panic under heavy load

2006-12-06 Thread David Xu
On Thursday 16 November 2006 19:15, Gleb Smirnoff wrote:
 On Thu, Nov 16, 2006 at 01:24:36PM +0300, Gleb Smirnoff wrote:
 T   I wonder why UMA was suspected to be the problem. Dima gave
 T me access to the core. Here are more details from the trace:

 It looks like a race between two threads in one process. Look here:

 (kgdb) frame 12
 #12 0xd05f4fc1 in _mtx_lock_sleep (m=0xd5dd5498, tid=3583683968, opts=0,
 file=0x12 <Address 0x12 out of bounds>, line=18) at
 /usr/src/sys/kern/kern_mutex.c:579
 579 turnstile_wait(&m->mtx_object, mtx_owner(m));
 (kgdb) p *m
 $10 = {mtx_object = {lo_class = 0xd084e224, lo_name = 0xd080508c "process
 lock", lo_type = 0xd080508c "process lock", lo_flags = 4390912, lo_list = {
 tqe_next = 0xd5dd56b0, tqe_prev = 0xd5dd5290}, lo_witness = 0xd088a100},
 mtx_lock = 3611674882, mtx_recurse = 0}
 (kgdb) p ((struct thread *)tid)
 $15 = (struct thread *) 0xd59aad80
 (kgdb) p ((struct thread *)(m->mtx_lock & ~(0x1 | 0x2)))
 $17 = (struct thread *) 0xd745c900
 (kgdb) p ((struct thread *)(m->mtx_lock & ~(0x1 | 0x2)))->td_proc
 $18 = (struct proc *) 0xd5dd5430
 (kgdb) p ((struct thread *)tid)->td_proc
 $19 = (struct proc *) 0xd5dd5430

 So, we see that one thread blocks on the lock that is held by an
 other thread of the same process. Here they are:

 * 134 Thread 100198 (PID=47872: nagios)  doadump () at pcpu.h:165
   133 Thread 100147 (PID=47872: nagios)  sched_switch (td=0xd745c900,
 newtd=0xd51f7a80, flags=2) at /usr/src/sys/kern/sched_4bsd.c:980

 Let's look at the second one:

 (kgdb) thread 133
 [Switching to thread 133 (Thread 100147)]#0  sched_switch (td=0xd745c900,
 newtd=0xd51f7a80, flags=2) at /usr/src/sys/kern/sched_4bsd.c:980
 980 sched_lock.mtx_lock = (uintptr_t)td;
 (kgdb) bt
 #0  sched_switch (td=0xd745c900, newtd=0xd51f7a80, flags=2) at
 /usr/src/sys/kern/sched_4bsd.c:980
 #1  0xd0607f46 in mi_switch (flags=2, newtd=0x0) at
 /usr/src/sys/kern/kern_synch.c:420
 #2  0xd0615ecf in maybe_preempt_in_ksegrp (td=0xd59aad80) at kern_switch.c:467
 #3  0xd06160c8
Can you try the patch ?
http://people.freebsd.org/~davidxu/patch/ksegrp_preempt.patch 



Re: RELENG_6 panic under heavy load

2006-11-16 Thread Dmitriy Kirhlarov
On Wed, Nov 15, 2006 at 03:37:40PM -0500, Kris Kennaway wrote:
 On Wed, Nov 15, 2006 at 09:24:21PM +0300, Dmitriy Kirhlarov wrote:
  On Tue, Nov 14, 2006 at 01:53:45PM -0500, Kris Kennaway wrote:
  
   From alc@:
   
   ---
   I've never seen anything like this before.  UMA is failing to allocate
   the zone structure.  This is unrelated to the large-swap scenario that
   you ran into.  Ask him to uncomment all of the UMA debugging #define's
   at the start of uma_core.c.
  
  It was very painfull for me and I don't get result...
  
  #define UMA_DEBUG 1
  #define UMA_DEBUG_ALLOC 1
  #define UMA_DEBUG_ALLOC_1 1
  
  in uma_core.c kill my machine.
  I get tons of crap to serial console.
 
 The tons of crap is what was necessary to proceed.

Not sure.
I can't find a way to collect the UMA_DEBUG* output with the current
console server, and it can't be replaced before January.

But I think UMA* is the wrong track.
You suspect bad hardware (am I right?). That's possible. But the problem
only shows up under heavy load, when I enable nagios on this server.
With nagios disabled, the machine works perfectly.
With the UMA_DEBUG* options enabled the server is not operable and no
services start, so we don't get that load and can't reproduce the behaviour.

WBR
Dmitriy


Re: RELENG_6 panic under heavy load

2006-11-16 Thread Gleb Smirnoff
  I wonder why UMA was suspected to be the problem. Dima gave
me access to the core. Here are more details from the trace:

Unread portion of the kernel message buffer:
panic: thread 100147(nagios):1 holds process lock but isn't blocked on a lock

#9  0xd060038e in panic (fmt=0xd08094d9 "thread %d(%s):%d holds %s but isn't 
blocked on a lock\n") at /usr/src/sys/kern/kern_shutdown.c:549
#10 0xd0629228 in propagate_priority (td=0xd745c900) at 
/usr/src/sys/kern/subr_turnstile.c:239
#11 0xd0629f32 in turnstile_wait (lock=0xd5dd5498, owner=0xd745c900) at 
/usr/src/sys/kern/subr_turnstile.c:643
#12 0xd05f4fc1 in _mtx_lock_sleep (m=0xd5dd5498, tid=3583683968, opts=0, 
file=0x12 <Address 0x12 out of bounds>, line=18) at 
/usr/src/sys/kern/kern_mutex.c:579
#13 0xd05f4992 in _mtx_lock_flags (m=0xd5dd5498, opts=0, file=0xd0806c3d 
"/usr/src/sys/kern/kern_thread.c", line=824) at 
/usr/src/sys/kern/kern_mutex.c:288
#14 0xd060d340 in thread_single (mode=0) at /usr/src/sys/kern/kern_thread.c:824
#15 0xd05e38b9 in fork1 (td=0xd59aad80, flags=20, pages=0, procp=0xf5ca) at 
/usr/src/sys/kern/kern_fork.c:274
#16 0xd05e3509 in fork (td=0xd59aad80, uap=0xf5cacd04) at 
/usr/src/sys/kern/kern_fork.c:98
#17 0xd07a6d10 in syscall (frame=
  {tf_fs = 134938683, tf_es = 59, tf_ds = -809566149, tf_edi = 134953856, 
tf_esi = 673312612, tf_ebp = -809526568, tf_isp = -171258524, tf_ebx = 
672261300, tf_edx = 0, tf_ecx = 134963456, tf_eax = 2, tf_trapno = 12, tf_err = 
2, tf_eip = 672684403, tf_cs = 51, tf_eflags = 642, tf_esp = -809526660, tf_ss 
= 59})
at /usr/src/sys/i386/i386/trap.c:983
#18 0xd078f38f in Xint0x80_syscall () at /usr/src/sys/i386/i386/exception.s:200
#19 0x0033 in ?? ()
Previous frame inner to this frame (corrupt stack?)
(kgdb) frame 10
#10 0xd0629228 in propagate_priority (td=0xd745c900) at 
/usr/src/sys/kern/subr_turnstile.c:239
239             KASSERT(TD_ON_LOCK(td), (
(kgdb) list
234     #endif
235
236             /*
237              * If we aren't blocked on a lock, we should be.
238              */
239             KASSERT(TD_ON_LOCK(td), (
240                 "thread %d(%s):%d holds %s but isn't blocked on a lock\n",
241                 td->td_tid, td->td_proc->p_comm, td->td_state,
242                 ts->ts_lockobj->lo_name));
243
(kgdb) frame 14
#14 0xd060d340 in thread_single (mode=0) at /usr/src/sys/kern/kern_thread.c:824
824 PROC_LOCK(p);
(kgdb) list
819             thread_stopped(p);
820             thread_suspend_one(td);
821             PROC_UNLOCK(p);
822             mi_switch(SW_VOL, NULL);
823             mtx_unlock_spin(&sched_lock);
824             PROC_LOCK(p);
825             mtx_lock_spin(&sched_lock);
826             if (mode == SINGLE_EXIT)
827                     remaining = p->p_numthreads;
828             else if (mode == SINGLE_BOUNDARY)

-- 
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE


Re: RELENG_6 panic under heavy load

2006-11-16 Thread Gleb Smirnoff
On Thu, Nov 16, 2006 at 01:24:36PM +0300, Gleb Smirnoff wrote:
T   I wonder why UMA was suspected to be the problem. Dima gave
T me access to the core. Here are more details from the trace:

It looks like a race between two threads in one process. Look here:

(kgdb) frame 12
#12 0xd05f4fc1 in _mtx_lock_sleep (m=0xd5dd5498, tid=3583683968, opts=0, 
file=0x12 <Address 0x12 out of bounds>, line=18) at 
/usr/src/sys/kern/kern_mutex.c:579
579             turnstile_wait(&m->mtx_object, mtx_owner(m));
(kgdb) p *m
$10 = {mtx_object = {lo_class = 0xd084e224, lo_name = 0xd080508c "process 
lock", lo_type = 0xd080508c "process lock", lo_flags = 4390912, lo_list = {
  tqe_next = 0xd5dd56b0, tqe_prev = 0xd5dd5290}, lo_witness = 0xd088a100}, 
mtx_lock = 3611674882, mtx_recurse = 0}
(kgdb) p ((struct thread *)tid)
$15 = (struct thread *) 0xd59aad80
(kgdb) p ((struct thread *)(m->mtx_lock & ~(0x1 | 0x2)))
$17 = (struct thread *) 0xd745c900
(kgdb) p ((struct thread *)(m->mtx_lock & ~(0x1 | 0x2)))->td_proc
$18 = (struct proc *) 0xd5dd5430
(kgdb) p ((struct thread *)tid)->td_proc
$19 = (struct proc *) 0xd5dd5430
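
The owner lookup done by hand in kgdb above can be written as a small helper.
This is a sketch for a 32-bit lock word as on this i386 machine; the mask
values 0x1 and 0x2 come from the session, while the macro names (recalled from
sys/mutex.h) and the function names are assumptions made for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Low bits of the lock word carry flags; names are an assumption. */
#define MTX_RECURSED  0x1u
#define MTX_CONTESTED 0x2u
#define MTX_FLAGMASK  (MTX_RECURSED | MTX_CONTESTED)

/* Return the owning thread pointer encoded in a 32-bit lock word. */
uint32_t
lock_owner(uint32_t mtx_lock)
{
	return (mtx_lock & ~MTX_FLAGMASK);
}

/* Return only the flag bits of the lock word. */
uint32_t
lock_flags(uint32_t mtx_lock)
{
	return (mtx_lock & MTX_FLAGMASK);
}

/* Print owner and flag state for one lock word (hypothetical helper). */
void
describe_lock(uint32_t mtx_lock)
{
	printf("owner=0x%08x contested=%d recursed=%d\n",
	    (unsigned)lock_owner(mtx_lock),
	    (lock_flags(mtx_lock) & MTX_CONTESTED) != 0,
	    (lock_flags(mtx_lock) & MTX_RECURSED) != 0);
}
```

For mtx_lock = 3611674882 (0xd745c902) lock_owner() gives 0xd745c900, the
thread shown as $17, with only the 0x2 bit set. The tid argument 3583683968
is 0xd59aad80, the thread shown as $15.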

So, we see that one thread blocks on the lock that is held by an
other thread of the same process. Here they are:

* 134 Thread 100198 (PID=47872: nagios)  doadump () at pcpu.h:165
  133 Thread 100147 (PID=47872: nagios)  sched_switch (td=0xd745c900, 
newtd=0xd51f7a80, flags=2) at /usr/src/sys/kern/sched_4bsd.c:980

Let's look at the second one:

(kgdb) thread 133
[Switching to thread 133 (Thread 100147)]#0  sched_switch (td=0xd745c900, 
newtd=0xd51f7a80, flags=2) at /usr/src/sys/kern/sched_4bsd.c:980
980 sched_lock.mtx_lock = (uintptr_t)td;
(kgdb) bt
#0  sched_switch (td=0xd745c900, newtd=0xd51f7a80, flags=2) at 
/usr/src/sys/kern/sched_4bsd.c:980
#1  0xd0607f46 in mi_switch (flags=2, newtd=0x0) at 
/usr/src/sys/kern/kern_synch.c:420
#2  0xd0615ecf in maybe_preempt_in_ksegrp (td=0xd59aad80) at kern_switch.c:467
#3  0xd06160c8 in setrunqueue (td=0xd59aad80, flags=0) at kern_switch.c:585
#4  0xd06151e7 in sched_wakeup (td=0xd59aad80) at 
/usr/src/sys/kern/sched_4bsd.c:996
#5  0xd0608025 in setrunnable (td=0xd59aad80) at 
/usr/src/sys/kern/kern_synch.c:483
#6  0xd060d78e in thread_unsuspend_one (td=0xd59aad80) at 
/usr/src/sys/kern/kern_thread.c:972
#7  0xd060d584 in thread_suspend_check (return_instead=0) at 
/usr/src/sys/kern/kern_thread.c:935
#8  0xd0628a88 in userret (td=0xd745c900, frame=0xf5dd4d38, oticks=1) at 
/usr/src/sys/kern/subr_trap.c:116
#9  0xd07a6e16 in syscall (frame=
  {tf_fs = 134938683, tf_es = 59, tf_ds = -809566149, tf_edi = 134997504, 
tf_esi = 134998528, tf_ebp = -813707944, tf_isp = -170046108, tf_ebx = 
672261300, tf_edx = 0, tf_ecx = 134969072, tf_eax = 1, tf_trapno = 0, tf_err = 
2, tf_eip = 672832335, tf_cs = 51, tf_eflags = 646, tf_esp = -813707972, tf_ss 
= 59})
at /usr/src/sys/i386/i386/trap.c:1034
#10 0xd078f38f in Xint0x80_syscall () at /usr/src/sys/i386/i386/exception.s:200

-- 
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE


Re: RELENG_6 panic under heavy load

2006-11-16 Thread Gavin Atkinson
On Wed, 2006-11-15 at 15:37 -0500, Kris Kennaway wrote:
 On Wed, Nov 15, 2006 at 09:24:21PM +0300, Dmitriy Kirhlarov wrote:
  On Tue, Nov 14, 2006 at 01:53:45PM -0500, Kris Kennaway wrote:
  
   From alc@:
   
   ---
   I've never seen anything like this before.  UMA is failing to allocate
   the zone structure.  This is unrelated to the large-swap scenario that
   you ran into.  Ask him to uncomment all of the UMA debugging #define's
   at the start of uma_core.c.
  
  It was very painfull for me and I don't get result...
  
  #define UMA_DEBUG 1
  #define UMA_DEBUG_ALLOC 1
  #define UMA_DEBUG_ALLOC_1 1
  
  in uma_core.c kill my machine.
  I get tons of crap to serial console.
 
 The tons of crap is what was necessary to proceed.

I actually suspect that the confusion comes from replying to the wrong
mail.

I believe Kris Kennaway's response from alc@ in
http://docs.freebsd.org/cgi/getmsg.cgi?fetch=473581+0
+current/freebsd-stable
was actually supposed to be a reply to
http://docs.freebsd.org/cgi/getmsg.cgi?fetch=434333+0
+current/freebsd-stable

... and actually had nothing to do with this thread.

Gavin


Re: RELENG_6 panic under heavy load

2006-11-16 Thread Gleb Smirnoff
On Thu, Nov 16, 2006 at 02:15:25PM +0300, Gleb Smirnoff wrote:
T On Thu, Nov 16, 2006 at 01:24:36PM +0300, Gleb Smirnoff wrote:
T T   I wonder why UMA was suspected to be the problem. Dima gave
T T me access to the core. Here are more details from the trace:

And even more:

(kgdb) thread 133
[Switching to thread 133 (Thread 100147)]#0  sched_switch (td=0xd745c900, 
newtd=0xd51f7a80, flags=2) at /usr/src/sys/kern/sched_4bsd.c:980
980 sched_lock.mtx_lock = (uintptr_t)td;
(kgdb) frame 9
#9  0xd07a6e16 in syscall (frame=
  {tf_fs = 134938683, tf_es = 59, tf_ds = -809566149, tf_edi = 134997504, 
tf_esi = 134998528, tf_ebp = -813707944, tf_isp = -170046108, tf_ebx = 
672261300, tf_edx = 0, tf_ecx = 134969072, tf_eax = 1, tf_trapno = 0, tf_err = 
2, tf_eip = 672832335, tf_cs = 51, tf_eflags = 646, tf_esp = -813707972, tf_ss 
= 59})
at /usr/src/sys/i386/i386/trap.c:1034
1034            userret(td, frame, sticks);
(kgdb) p *callp
$92 = {sy_narg = 65539, sy_call = 0xd0630550 <poll>, sy_auevent = 43012}

(kgdb) set $poll = (struct thread *)0xd745c900
(kgdb) set $fork = (struct thread *)0xd59aad80

(kgdb) p $poll->td_state
$93 = TDS_INHIBITED
(kgdb) p $poll->td_inhibitors
$94 = 1 == TDI_SUSPENDED
(kgdb) p/x $poll->td_flags
$96 = 0x1010c01 == TDF_BORROWING | TDF_BOUNDARY | TDF_ASTPENDING | 
TDF_NEEDRESCHED | TDF_SCHED0
(kgdb) p $fork->td_state
$97 = TDS_INHIBITED
(kgdb) p $fork->td_inhibitors
$98 = 8 == TDI_LOCK
(kgdb) p/x $fork->td_flags 
$99 = 0x100 == TDF_SCHED0
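
The manual decoding of td_flags above can be reproduced with a small table.
This is a sketch only: the bit values are inferred from the decomposition of
0x1010c01 shown in the session and from memory of sys/proc.h, so treat the
name-to-bit mapping as an assumption. The struct and function names are
hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

struct flag_name {
	uint32_t    bit;
	const char *name;
};

/* Assumed name-to-bit mapping; together these bits sum to 0x1010c01. */
static const struct flag_name td_flag_names[] = {
	{ 0x00000001, "TDF_BORROWING" },
	{ 0x00000400, "TDF_BOUNDARY" },
	{ 0x00000800, "TDF_ASTPENDING" },
	{ 0x00010000, "TDF_NEEDRESCHED" },
	{ 0x01000000, "TDF_SCHED0" },
};

/*
 * Write the names of the known flags set in 'flags' into buf, separated
 * by " | ", and return the bits that the table does not recognise.
 */
uint32_t
decode_td_flags(uint32_t flags, char *buf, size_t len)
{
	size_t i, used = 0;
	uint32_t rest = flags;

	if (len > 0)
		buf[0] = '\0';
	for (i = 0; i < sizeof(td_flag_names) / sizeof(td_flag_names[0]); i++) {
		if ((flags & td_flag_names[i].bit) == 0)
			continue;
		rest &= ~td_flag_names[i].bit;
		if (used < len) {
			/* Append the name; stop appending once buf is full. */
			int n = snprintf(buf + used, len - used, "%s%s",
			    used > 0 ? " | " : "", td_flag_names[i].name);
			if (n > 0)
				used += (size_t)n;
		}
	}
	return (rest);
}
```

Calling decode_td_flags(0x1010c01, buf, sizeof(buf)) with a 128-byte buffer
returns 0, meaning every set bit was matched by one of the five names.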

Not everything is clear yet, but it looks like:

1) $fork thread obtains proc lock
2) $poll thread blocks on proc lock
3) $fork thread has suspended the $poll thread in thread_single()
4) $fork thread temporarily unlocks proc lock (line 821) and is
   preempted by $poll thread
5) $poll thread obtains proc lock, and starts doing its poll job
6) $fork thread blocks on proc lock, and is added to its turnstile
7) $poll thread drops the proc lock, but isn't preempted by $fork
8) $poll thread exits and is preempted by $fork

...) and here is something difficult to understand, when $poll tries to
make $fork runnable, while $fork is trying to put itself in the turnstile
that is owned by $poll

-- 
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE


Re: RELENG_6 panic under heavy load

2006-11-16 Thread Kris Kennaway
On Thu, Nov 16, 2006 at 02:31:32PM +, Gavin Atkinson wrote:
 On Wed, 2006-11-15 at 15:37 -0500, Kris Kennaway wrote:
  On Wed, Nov 15, 2006 at 09:24:21PM +0300, Dmitriy Kirhlarov wrote:
   On Tue, Nov 14, 2006 at 01:53:45PM -0500, Kris Kennaway wrote:
   
From alc@:

---
I've never seen anything like this before.  UMA is failing to allocate
the zone structure.  This is unrelated to the large-swap scenario that
you ran into.  Ask him to uncomment all of the UMA debugging #define's
at the start of uma_core.c.
   
   It was very painfull for me and I don't get result...
   
   #define UMA_DEBUG 1
   #define UMA_DEBUG_ALLOC 1
   #define UMA_DEBUG_ALLOC_1 1
   
   in uma_core.c kill my machine.
   I get tons of crap to serial console.
  
  The tons of crap is what was necessary to proceed.
 
 I actually suspect that the confusin comes from replying to the wrong
 mail.
 
 I believe Kris Kennaway's response from alc@ in
 http://docs.freebsd.org/cgi/getmsg.cgi?fetch=473581+0
 +current/freebsd-stable
 was actually supposed to be a reply to
 http://docs.freebsd.org/cgi/getmsg.cgi?fetch=434333+0
 +current/freebsd-stable
 
 ... and actually had nothing to do with this thread.

Sorry, yes, you're right!

Kris




Re: RELENG_6 panic under heavy load

2006-11-15 Thread Dmitriy Kirhlarov
On Tue, Nov 14, 2006 at 01:53:45PM -0500, Kris Kennaway wrote:

 From alc@:
 
 ---
 I've never seen anything like this before.  UMA is failing to allocate
 the zone structure.  This is unrelated to the large-swap scenario that
 you ran into.  Ask him to uncomment all of the UMA debugging #define's
 at the start of uma_core.c.

It was very painful for me and I didn't get a result...

#define UMA_DEBUG 1
#define UMA_DEBUG_ALLOC 1
#define UMA_DEBUG_ALLOC_1 1

in uma_core.c kills my machine.
I get tons of crap on the serial console.
The server is inaccessible over the network too.
I had to ask colocation support to reboot the server manually so that I
could get console access and boot the old kernel.

I have now updated world and kernel to RELENG_6 as of 11/15/2006 00:00:00 UTC.

Any other ideas?

PS. unionfs is switched off now. UMA debug is switched off too.

WBR
Dmitriy


Re: RELENG_6 panic under heavy load

2006-11-15 Thread Kris Kennaway
On Wed, Nov 15, 2006 at 09:24:21PM +0300, Dmitriy Kirhlarov wrote:
 On Tue, Nov 14, 2006 at 01:53:45PM -0500, Kris Kennaway wrote:
 
  From alc@:
  
  ---
  I've never seen anything like this before.  UMA is failing to allocate
  the zone structure.  This is unrelated to the large-swap scenario that
  you ran into.  Ask him to uncomment all of the UMA debugging #define's
  at the start of uma_core.c.
 
 It was very painfull for me and I don't get result...
 
 #define UMA_DEBUG 1
 #define UMA_DEBUG_ALLOC 1
 #define UMA_DEBUG_ALLOC_1 1
 
 in uma_core.c kill my machine.
 I get tons of crap to serial console.

The tons of crap is what was necessary to proceed.

Kris




Re: RELENG_6 panic under heavy load

2006-11-14 Thread Kris Kennaway
On Tue, Nov 14, 2006 at 10:50:21AM +0300, Dmitriy Kirhlarov wrote:
 On Mon, Nov 13, 2006 at 01:45:05PM -0500, Kris Kennaway wrote:
  On Mon, Nov 13, 2006 at 11:44:31AM +0300, Dmitriy Kirhlarov wrote:
   Hi, list.
   
   One from my monitoring servers running with
   FreeBSD 6.2-PRERELEASE #0: Fri Nov 10 11:03:10 UTC 2006 i386
   
   Under heavy load it panic several times per day. Backtrace accesable:
   http://clh.higis.ru/~dimma/btfull.0 
   Can somebody take a look?
   I send Problem Report, but not get feed back now from gnats.
 
  Please provide your kernel config.
 
 http://clh.higis.ru/~dimma/OILSPACE1DEB
 
 I open PR for this problem -- kern/105464

Are you using unionfs?  It's too broken to use, so you might as well
take it out of your kernel config.

Kris




Re: RELENG_6 panic under heavy load

2006-11-14 Thread Kris Kennaway
On Tue, Nov 14, 2006 at 10:50:21AM +0300, Dmitriy Kirhlarov wrote:
 On Mon, Nov 13, 2006 at 01:45:05PM -0500, Kris Kennaway wrote:
  On Mon, Nov 13, 2006 at 11:44:31AM +0300, Dmitriy Kirhlarov wrote:
   Hi, list.
   
   One from my monitoring servers running with
   FreeBSD 6.2-PRERELEASE #0: Fri Nov 10 11:03:10 UTC 2006 i386
   
   Under heavy load it panic several times per day. Backtrace accesable:
   http://clh.higis.ru/~dimma/btfull.0 
   Can somebody take a look?
   I send Problem Report, but not get feed back now from gnats.
 
  Please provide your kernel config.
 
 http://clh.higis.ru/~dimma/OILSPACE1DEB
 
 I open PR for this problem -- kern/105464

From alc@:

---
I've never seen anything like this before.  UMA is failing to allocate
the zone structure.  This is unrelated to the large-swap scenario that
you ran into.  Ask him to uncomment all of the UMA debugging #define's
at the start of uma_core.c.

Alan
---

Kris




RELENG_6 panic under heavy load

2006-11-13 Thread Dmitriy Kirhlarov
Hi, list.

One of my monitoring servers is running
FreeBSD 6.2-PRERELEASE #0: Fri Nov 10 11:03:10 UTC 2006 i386

Under heavy load it panics several times per day. A backtrace is available at:
http://clh.higis.ru/~dimma/btfull.0 
Can somebody take a look?
I sent a Problem Report, but have not received any feedback from gnats yet.

Also, after replacing the kernel, I have another easily reproducible panic.
It happens with the line
if_vlan_load=YES
in /boot/loader.conf and
device  vlan
in the kernel config.
I use fxp network cards, if that matters.
I don't have a backtrace for this situation yet, but if it is needed, I
will get one.

-- 
Dmitriy Kirhlarov
OILspace, 26 Leninskaya sloboda, bld. 2, 2nd floor, 115280 Moscow, Russia
P:+7 495 105 7247 ext.208 F:+7 495 105 7246 E:[EMAIL PROTECTED]
OILspace - The resource enriched - www.oilspace.com


Re: RELENG_6 panic under heavy load

2006-11-13 Thread Kris Kennaway
On Mon, Nov 13, 2006 at 11:44:31AM +0300, Dmitriy Kirhlarov wrote:
 Hi, list.
 
 One from my monitoring servers running with
 FreeBSD 6.2-PRERELEASE #0: Fri Nov 10 11:03:10 UTC 2006 i386
 
 Under heavy load it panic several times per day. Backtrace accesable:
 http://clh.higis.ru/~dimma/btfull.0 
 Can somebody take a look?
 I send Problem Report, but not get feed back now from gnats.

Please provide your kernel config.

 Also, after replacing kernel, I has other easy reproduceble panic.
 String
 if_vlan_load=YES
 in /boot/loader.conf and
 device  vlan
 in kernel.
 I use fxp network cards, if it important.
 I don't have backtrace for this situation now, but, if it needed, I
 will make it.

Yes, you'll need to provide a backtrace for this too.  Please post it
separately to avoid confusion between the two problems.

Kris




Re: RELENG_6 panic under heavy load

2006-11-13 Thread Dmitriy Kirhlarov
On Mon, Nov 13, 2006 at 01:45:05PM -0500, Kris Kennaway wrote:
 On Mon, Nov 13, 2006 at 11:44:31AM +0300, Dmitriy Kirhlarov wrote:
  Hi, list.
  
  One from my monitoring servers running with
  FreeBSD 6.2-PRERELEASE #0: Fri Nov 10 11:03:10 UTC 2006 i386
  
  Under heavy load it panic several times per day. Backtrace accesable:
  http://clh.higis.ru/~dimma/btfull.0 
  Can somebody take a look?
  I send Problem Report, but not get feed back now from gnats.

 Please provide your kernel config.

http://clh.higis.ru/~dimma/OILSPACE1DEB

I have opened a PR for this problem: kern/105464

WBR
Dmitriy