Re: RELENG_6 panic under heavy load
On Thu, Dec 07, 2006 at 11:18:52AM +0800, David Xu wrote:
> On Thursday 16 November 2006 19:15, Gleb Smirnoff wrote:
> > On Thu, Nov 16, 2006 at 01:24:36PM +0300, Gleb Smirnoff wrote:
> > T> I wonder why UMA was suspected to be the problem. Dima gave
> > T> me access to the core. Here are more details from the trace:
> >
> > It looks like a race between two threads in one process. Look here:
>
> Can you try the patch?
> http://people.freebsd.org/~davidxu/patch/ksegrp_preempt.patch

I've tested it. This patch also works, but with slightly different behaviour. With the patch from jhb@ I got a load average (LA) of 7-8; with this patch I have LA 5-6, the same as on the unpatched system. However, the system seems less interactive to me than with the jhb@ patch.

WBR
Dmitriy
_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: RELENG_6 panic under heavy load
On Tuesday 12 December 2006 20:22, Dmitriy Kirhlarov wrote:
> On Thu, Dec 07, 2006 at 11:18:52AM +0800, David Xu wrote:
> > Can you try the patch?
> > http://people.freebsd.org/~davidxu/patch/ksegrp_preempt.patch
>
> I've tested it. This patch also works, but with slightly different behaviour. With the patch from jhb@ I got a load average (LA) of 7-8; with this patch I have LA 5-6, the same as on the unpatched system. However, the system seems less interactive to me than with the jhb@ patch.
>
> WBR
> Dmitriy

The jhb patch is incomplete: it implies that every place where a thread makes a state transition and wakes another thread up would have to be patched. There is other, unpatched code in kern_sig.c, and there may be more places that I don't know of. Instead, the code in maybe_preempt_in_ksegrp should be synced with maybe_preempt; that should fix all of these problems.

The LA being lower than with jhb's patch might be in the nature of KSEGRP, but I am not sure. If your program forces all threads to be system-scope, it might fix the problem.

David Xu
Re: RELENG_6 panic under heavy load
On Tue, Dec 12, 2006 at 08:49:21PM +0800, David Xu wrote:
> > I've tested it. This patch also works, but with slightly different behaviour. With the patch from jhb@ I got a load average (LA) of 7-8; with this patch I have LA 5-6, the same as on the unpatched system. However, the system seems less interactive to me than with the jhb@ patch.
>
> The jhb patch is incomplete: it implies that every place where a thread makes a state transition and wakes another thread up would have to be patched. There is other, unpatched code in kern_sig.c, and there may be more places that I don't know of. Instead, the code in maybe_preempt_in_ksegrp should be synced with maybe_preempt; that should fix all of these problems.
>
> The LA being lower than with jhb's patch might be in the nature of KSEGRP, but I am not sure. If your program forces all threads to be system-scope, it might fix the problem.

I think kern/105464 can be closed once your patch is committed.

WBR
Dmitriy
Re: RELENG_6 panic under heavy load
On Wed, Dec 06, 2006 at 12:09:39PM -0500, John Baldwin wrote:
> > ...) and here is something difficult to understand, when $poll tries to
> > make $fork runnable, while $fork is trying to put itself in the
> > turnstile that is owned by $poll
>
> Hmm. I'm guessing the problem is the $poll thread is suspended (not exited) while holding the proc lock? That would appear to be the problem. That thread can't run again to release the lock. Ah, yes, I see the bug. Something like this should fix it:
>
> Index: kern_thread.c
> ===================================================================
> RCS file: /usr/cvs/src/sys/kern/kern_thread.c,v
> retrieving revision 1.216.2.6
> diff -u -r1.216.2.6 kern_thread.c
> --- kern_thread.c 2 Sep 2006 17:29:57 - 1.216.2.6
> +++ kern_thread.c 6 Dec 2006 17:06:26 -
> @@ -969,7 +969,9 @@
>          TAILQ_REMOVE(&p->p_suspended, td, td_runq);
>          TD_CLR_SUSPENDED(td);
>          p->p_suspcount--;
> +        critical_enter();
>          setrunnable(td);
> +        critical_exit();
>  }
>
>  /*
>
> What this does is force setrunnable() to be in a nested critical section so we won't preempt during setrunnable() until either the caller of thread_unsuspend_one() eventually releases sched_lock, or, in the case you ran into, the thread does a PROC_UNLOCK() and calls mi_switch().

lbsd02# uptime
 9:46AM  up 22:45, 2 users, load averages: 7.50, 6.59, 6.32

It works. Thank you. Without your patch the maximum uptime was 9 hours.

I'm planning to test David's patch at the weekend.

WBR
Dmitriy
Re: RELENG_6 panic under heavy load
On Thursday 16 November 2006 11:09, Gleb Smirnoff wrote:
> On Thu, Nov 16, 2006 at 02:15:25PM +0300, Gleb Smirnoff wrote:
> T> On Thu, Nov 16, 2006 at 01:24:36PM +0300, Gleb Smirnoff wrote:
> T> T> I wonder why UMA was suspected to be the problem. Dima gave
> T> T> me access to the core. Here are more details from the trace:
>
> And even more:
>
> (kgdb) thread 133
> [Switching to thread 133 (Thread 100147)]#0 sched_switch (td=0xd745c900, newtd=0xd51f7a80, flags=2)
>     at /usr/src/sys/kern/sched_4bsd.c:980
> 980             sched_lock.mtx_lock = (uintptr_t)td;
> (kgdb) frame 9
> #9 0xd07a6e16 in syscall (frame=
>       {tf_fs = 134938683, tf_es = 59, tf_ds = -809566149, tf_edi = 134997504, tf_esi = 134998528,
>        tf_ebp = -813707944, tf_isp = -170046108, tf_ebx = 672261300, tf_edx = 0, tf_ecx = 134969072,
>        tf_eax = 1, tf_trapno = 0, tf_err = 2, tf_eip = 672832335, tf_cs = 51, tf_eflags = 646,
>        tf_esp = -813707972, tf_ss = 59})
>     at /usr/src/sys/i386/i386/trap.c:1034
> 1034            userret(td, frame, sticks);
> (kgdb) p *callp
> $92 = {sy_narg = 65539, sy_call = 0xd0630550 <poll>, sy_auevent = 43012}
> (kgdb) set $poll = (struct thread *)0xd745c900
> (kgdb) set $fork = (struct thread *)0xd59aad80
> (kgdb) p $poll->td_state
> $93 = TDS_INHIBITED
> (kgdb) p $poll->td_inhibitors
> $94 = 1 == TDI_SUSPENDED
> (kgdb) p/x $poll->td_flags
> $96 = 0x1010c01 == TDF_BORROWING | TDF_BOUNDARY | TDF_ASTPENDING | TDF_NEEDRESCHED | TDF_SCHED0
> (kgdb) p $fork->td_state
> $97 = TDS_INHIBITED
> (kgdb) p $fork->td_inhibitors
> $98 = 8 == TDI_LOCK
> (kgdb) p/x $fork->td_flags
> $99 = 0x100 == TDF_SCHED0
>
> Not everything clear yet, but looks like:
>
> 1) $fork thread obtains proc lock
> 2) $poll thread blocks on proc lock
> 3) $fork thread has suspended the $poll thread in thread_single()
> 4) $fork thread temporarily unlocks proc lock (line 821) and is preempted by $poll thread
> 5) $poll thread obtains proc lock, and starts doing its poll job
> 6) $fork thread blocks on proc lock, and is added to its turnstile
> 7) $poll thread drops the proc lock, but isn't preempted by $fork
> 8) $poll thread exits and is preempted by $fork
> ...) and here is something difficult to understand, when $poll tries to
>      make $fork runnable, while $fork is trying to put itself in the
>      turnstile that is owned by $poll

Hmm. I'm guessing the problem is the $poll thread is suspended (not exited) while holding the proc lock? That would appear to be the problem. That thread can't run again to release the lock. Ah, yes, I see the bug. Something like this should fix it:

Index: kern_thread.c
===================================================================
RCS file: /usr/cvs/src/sys/kern/kern_thread.c,v
retrieving revision 1.216.2.6
diff -u -r1.216.2.6 kern_thread.c
--- kern_thread.c 2 Sep 2006 17:29:57 - 1.216.2.6
+++ kern_thread.c 6 Dec 2006 17:06:26 -
@@ -969,7 +969,9 @@
         TAILQ_REMOVE(&p->p_suspended, td, td_runq);
         TD_CLR_SUSPENDED(td);
         p->p_suspcount--;
+        critical_enter();
         setrunnable(td);
+        critical_exit();
 }

 /*

What this does is force setrunnable() to be in a nested critical section so we won't preempt during setrunnable() until either the caller of thread_unsuspend_one() eventually releases sched_lock, or, in the case you ran into, the thread does a PROC_UNLOCK() and calls mi_switch().

-- 
John Baldwin
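The effect John describes can be modelled in userland. The sketch below is not kernel code: the struct and every model_ name are hypothetical, and it assumes (as the explanation above implies) that a wakeup switches threads immediately when only the scheduler lock's own critical section is active, but merely records a pending preemption when it runs one nesting level deeper.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-thread preemption bookkeeping. */
struct model_thread {
	int  critnest;    /* critical-section nesting depth */
	bool owepreempt;  /* a preemption was requested while nested */
	int  switches;    /* context switches taken so far */
};

static void
model_critical_enter(struct model_thread *td)
{
	td->critnest++;
}

static void
model_critical_exit(struct model_thread *td)
{
	assert(td->critnest > 0);
	td->critnest--;
	/* A deferred preemption is only taken once the last level is left. */
	if (td->critnest == 0 && td->owepreempt) {
		td->owepreempt = false;
		td->switches++;
	}
}

/* Wake another thread; the caller already holds one level (sched_lock). */
static void
model_setrunnable(struct model_thread *td)
{
	if (td->critnest > 1)
		td->owepreempt = true;  /* nested: defer the switch */
	else
		td->switches++;         /* not nested: switch right now */
}

/*
 * Run the unsuspend path with or without the extra critical section and
 * report the switch count right after the wakeup and at the very end.
 */
static void
model_wakeup(bool patched, int *during, int *after)
{
	struct model_thread td = { 0, false, 0 };

	model_critical_enter(&td);          /* stands in for taking sched_lock */
	if (patched)
		model_critical_enter(&td);  /* the added critical_enter() */
	model_setrunnable(&td);
	if (patched)
		model_critical_exit(&td);   /* the added critical_exit() */
	*during = td.switches;
	model_critical_exit(&td);           /* stands in for dropping sched_lock */
	*after = td.switches;
}
```

In this model the unpatched path switches in the middle of the wakeup, while the patched path switches exactly once, after the outermost level has been left. That is the ordering the patch is meant to enforce.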
Re: RELENG_6 panic under heavy load
On Thursday 16 November 2006 19:15, Gleb Smirnoff wrote:
> On Thu, Nov 16, 2006 at 01:24:36PM +0300, Gleb Smirnoff wrote:
> T> I wonder why UMA was suspected to be the problem. Dima gave
> T> me access to the core. Here are more details from the trace:
>
> It looks like a race between two threads in one process. Look here:
>
> (kgdb) frame 12
> #12 0xd05f4fc1 in _mtx_lock_sleep (m=0xd5dd5498, tid=3583683968, opts=0, file=0x12 <Address 0x12 out of bounds>, line=18)
>     at /usr/src/sys/kern/kern_mutex.c:579
> 579             turnstile_wait(&m->mtx_object, mtx_owner(m));
> (kgdb) p *m
> $10 = {mtx_object = {lo_class = 0xd084e224, lo_name = 0xd080508c "process lock",
>     lo_type = 0xd080508c "process lock", lo_flags = 4390912, lo_list = {
>       tqe_next = 0xd5dd56b0, tqe_prev = 0xd5dd5290}, lo_witness = 0xd088a100},
>   mtx_lock = 3611674882, mtx_recurse = 0}
> (kgdb) p ((struct thread *)tid)
> $15 = (struct thread *) 0xd59aad80
> (kgdb) p ((struct thread *)(m->mtx_lock & ~(0x1 | 0x2)))
> $17 = (struct thread *) 0xd745c900
> (kgdb) p ((struct thread *)(m->mtx_lock & ~(0x1 | 0x2)))->td_proc
> $18 = (struct proc *) 0xd5dd5430
> (kgdb) p ((struct thread *)tid)->td_proc
> $19 = (struct proc *) 0xd5dd5430
>
> So, we see that one thread blocks on the lock that is held by another thread of the same process. Here they are:
>
> * 134 Thread 100198 (PID=47872: nagios) doadump () at pcpu.h:165
>   133 Thread 100147 (PID=47872: nagios) sched_switch (td=0xd745c900, newtd=0xd51f7a80, flags=2) at /usr/src/sys/kern/sched_4bsd.c:980
>
> Let's look at the second one:
>
> (kgdb) thread 133
> [Switching to thread 133 (Thread 100147)]#0 sched_switch (td=0xd745c900, newtd=0xd51f7a80, flags=2)
>     at /usr/src/sys/kern/sched_4bsd.c:980
> 980             sched_lock.mtx_lock = (uintptr_t)td;
> (kgdb) bt
> #0 sched_switch (td=0xd745c900, newtd=0xd51f7a80, flags=2) at /usr/src/sys/kern/sched_4bsd.c:980
> #1 0xd0607f46 in mi_switch (flags=2, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:420
> #2 0xd0615ecf in maybe_preempt_in_ksegrp (td=0xd59aad80) at kern_switch.c:467
> #3 0xd06160c8 in setrunqueue (td=0xd59aad80, flags=0) at kern_switch.c:585
> #4 0xd06151e7 in sched_wakeup (td=0xd59aad80) at /usr/src/sys/kern/sched_4bsd.c:996
> #5 0xd0608025 in setrunnable (td=0xd59aad80) at /usr/src/sys/kern/kern_synch.c:483
> #6 0xd060d78e in thread_unsuspend_one (td=0xd59aad80) at /usr/src/sys/kern/kern_thread.c:972
> #7 0xd060d584 in thread_suspend_check (return_instead=0) at /usr/src/sys/kern/kern_thread.c:935
> #8 0xd0628a88 in userret (td=0xd745c900, frame=0xf5dd4d38, oticks=1) at /usr/src/sys/kern/subr_trap.c:116
> #9 0xd07a6e16 in syscall (frame=
>       {tf_fs = 134938683, tf_es = 59, tf_ds = -809566149, tf_edi = 134997504, tf_esi = 134998528,
>        tf_ebp = -813707944, tf_isp = -170046108, tf_ebx = 672261300, tf_edx = 0, tf_ecx = 134969072,
>        tf_eax = 1, tf_trapno = 0, tf_err = 2, tf_eip = 672832335, tf_cs = 51, tf_eflags = 646,
>        tf_esp = -813707972, tf_ss = 59})
>     at /usr/src/sys/i386/i386/trap.c:1034
> #10 0xd078f38f in Xint0x80_syscall () at /usr/src/sys/i386/i386/exception.s:200

maybe_preempt_in_ksegrp is broken: it should do some checks like maybe_preempt() does, because some special cases should prevent preemption. I believe this will not be a problem on -CURRENT after Julian's ksegrp removal yesterday. The problem is not in the thread suspension code.

Regards,
David Xu
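As an illustration of the kind of "special cases should prevent preemption" check David refers to, here is a minimal userland sketch. It is not his patch (the patch contents are not shown in this thread) and not the actual maybe_preempt() source. The struct, the model_ names and the particular set of conditions are assumptions chosen for illustration; the one relevant to this panic is refusing to switch away from a current thread that is already inhibited (suspended or blocked).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of a thread for the purpose of this check. */
struct model_td {
	int  priority;   /* numerically lower value = more urgent */
	bool inhibited;  /* suspended, sleeping, blocked on a lock, ... */
};

/*
 * Decide whether waking `woken` may preempt the running thread `cur`.
 * `cold` models "the system is still booting".
 */
static bool
model_may_preempt(const struct model_td *cur, const struct model_td *woken,
    bool cold)
{
	assert(cur != NULL && woken != NULL);

	if (woken->priority >= cur->priority)
		return false;   /* woken thread is not more urgent */
	if (cold)
		return false;   /* no preemption during early boot */
	if (cur->inhibited)
		return false;   /* cur is mid state transition: do not switch */
	return true;
}
```

With a guard of this shape, a thread that has just marked itself suspended would not be switched out from inside the wakeup path, which is the situation visible in frames #2 to #7 of the backtrace.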
Re: RELENG_6 panic under heavy load
On Thursday 16 November 2006 19:15, Gleb Smirnoff wrote:
> On Thu, Nov 16, 2006 at 01:24:36PM +0300, Gleb Smirnoff wrote:
> T> I wonder why UMA was suspected to be the problem. Dima gave
> T> me access to the core. Here are more details from the trace:
>
> It looks like a race between two threads in one process. Look here:
>
> (kgdb) frame 12
> #12 0xd05f4fc1 in _mtx_lock_sleep (m=0xd5dd5498, tid=3583683968, opts=0, file=0x12 <Address 0x12 out of bounds>, line=18)
>     at /usr/src/sys/kern/kern_mutex.c:579
> 579             turnstile_wait(&m->mtx_object, mtx_owner(m));
> (kgdb) p *m
> $10 = {mtx_object = {lo_class = 0xd084e224, lo_name = 0xd080508c "process lock",
>     lo_type = 0xd080508c "process lock", lo_flags = 4390912, lo_list = {
>       tqe_next = 0xd5dd56b0, tqe_prev = 0xd5dd5290}, lo_witness = 0xd088a100},
>   mtx_lock = 3611674882, mtx_recurse = 0}
> (kgdb) p ((struct thread *)tid)
> $15 = (struct thread *) 0xd59aad80
> (kgdb) p ((struct thread *)(m->mtx_lock & ~(0x1 | 0x2)))
> $17 = (struct thread *) 0xd745c900
> (kgdb) p ((struct thread *)(m->mtx_lock & ~(0x1 | 0x2)))->td_proc
> $18 = (struct proc *) 0xd5dd5430
> (kgdb) p ((struct thread *)tid)->td_proc
> $19 = (struct proc *) 0xd5dd5430
>
> So, we see that one thread blocks on the lock that is held by an other thread of the same process. Here they are:
>
> * 134 Thread 100198 (PID=47872: nagios) doadump () at pcpu.h:165
>   133 Thread 100147 (PID=47872: nagios) sched_switch (td=0xd745c900, newtd=0xd51f7a80, flags=2) at /usr/src/sys/kern/sched_4bsd.c:980
>
> Let's look at the second one:
>
> (kgdb) thread 133
> [Switching to thread 133 (Thread 100147)]#0 sched_switch (td=0xd745c900, newtd=0xd51f7a80, flags=2)
>     at /usr/src/sys/kern/sched_4bsd.c:980
> 980             sched_lock.mtx_lock = (uintptr_t)td;
> (kgdb) bt
> #0 sched_switch (td=0xd745c900, newtd=0xd51f7a80, flags=2) at /usr/src/sys/kern/sched_4bsd.c:980
> #1 0xd0607f46 in mi_switch (flags=2, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:420
> #2 0xd0615ecf in maybe_preempt_in_ksegrp (td=0xd59aad80) at kern_switch.c:467
> #3 0xd06160c8

Can you try the patch?

http://people.freebsd.org/~davidxu/patch/ksegrp_preempt.patch
Re: RELENG_6 panic under heavy load
On Wed, Nov 15, 2006 at 03:37:40PM -0500, Kris Kennaway wrote:
> On Wed, Nov 15, 2006 at 09:24:21PM +0300, Dmitriy Kirhlarov wrote:
> > On Tue, Nov 14, 2006 at 01:53:45PM -0500, Kris Kennaway wrote:
> > > From alc@:
> > > ---
> > > I've never seen anything like this before. UMA is failing to allocate the zone structure. This is unrelated to the large-swap scenario that you ran into. Ask him to uncomment all of the UMA debugging #define's at the start of uma_core.c.
> >
> > It was very painful for me, and I didn't get a result...
> >
> > #define UMA_DEBUG 1
> > #define UMA_DEBUG_ALLOC 1
> > #define UMA_DEBUG_ALLOC_1 1
> >
> > in uma_core.c kills my machine. I get tons of crap on the serial console.
>
> The tons of crap is what was necessary to proceed.

Not sure. I can't find a way to collect the output of UMA_DEBUG* with the current console server, and it can't be replaced before January.

But I think UMA* is the wrong idea. You suspect bad hardware (am I right?). Possibly. But the problem shows up only under heavy load, when I enable nagios on this server. When nagios is disabled, the machine works perfectly.

With the UMA_DEBUG* options enabled the server is not operable: no services start, so we don't get this load and can't reproduce the behaviour.

WBR
Dmitriy
Re: RELENG_6 panic under heavy load
I wonder why UMA was suspected to be the problem. Dima gave
me access to the core. Here are more details from the trace:

Unread portion of the kernel message buffer:
panic: thread 100147(nagios):1 holds process lock but isn't blocked on a lock

#9 0xd060038e in panic (fmt=0xd08094d9 "thread %d(%s):%d holds %s but isn't blocked on a lock\n")
    at /usr/src/sys/kern/kern_shutdown.c:549
#10 0xd0629228 in propagate_priority (td=0xd745c900) at /usr/src/sys/kern/subr_turnstile.c:239
#11 0xd0629f32 in turnstile_wait (lock=0xd5dd5498, owner=0xd745c900) at /usr/src/sys/kern/subr_turnstile.c:643
#12 0xd05f4fc1 in _mtx_lock_sleep (m=0xd5dd5498, tid=3583683968, opts=0, file=0x12 <Address 0x12 out of bounds>, line=18)
    at /usr/src/sys/kern/kern_mutex.c:579
#13 0xd05f4992 in _mtx_lock_flags (m=0xd5dd5498, opts=0, file=0xd0806c3d "/usr/src/sys/kern/kern_thread.c", line=824)
    at /usr/src/sys/kern/kern_mutex.c:288
#14 0xd060d340 in thread_single (mode=0) at /usr/src/sys/kern/kern_thread.c:824
#15 0xd05e38b9 in fork1 (td=0xd59aad80, flags=20, pages=0, procp=0xf5ca) at /usr/src/sys/kern/kern_fork.c:274
#16 0xd05e3509 in fork (td=0xd59aad80, uap=0xf5cacd04) at /usr/src/sys/kern/kern_fork.c:98
#17 0xd07a6d10 in syscall (frame=
      {tf_fs = 134938683, tf_es = 59, tf_ds = -809566149, tf_edi = 134953856, tf_esi = 673312612,
       tf_ebp = -809526568, tf_isp = -171258524, tf_ebx = 672261300, tf_edx = 0, tf_ecx = 134963456,
       tf_eax = 2, tf_trapno = 12, tf_err = 2, tf_eip = 672684403, tf_cs = 51, tf_eflags = 642,
       tf_esp = -809526660, tf_ss = 59})
    at /usr/src/sys/i386/i386/trap.c:983
#18 0xd078f38f in Xint0x80_syscall () at /usr/src/sys/i386/i386/exception.s:200
#19 0x0033 in ?? ()
Previous frame inner to this frame (corrupt stack?)

(kgdb) frame 10
#10 0xd0629228 in propagate_priority (td=0xd745c900) at /usr/src/sys/kern/subr_turnstile.c:239
239             KASSERT(TD_ON_LOCK(td), (
(kgdb) list
234     #endif
235
236             /*
237              * If we aren't blocked on a lock, we should be.
238              */
239             KASSERT(TD_ON_LOCK(td), (
240                 "thread %d(%s):%d holds %s but isn't blocked on a lock\n",
241                 td->td_tid, td->td_proc->p_comm, td->td_state,
242                 ts->ts_lockobj->lo_name));
243
(kgdb) frame 14
#14 0xd060d340 in thread_single (mode=0) at /usr/src/sys/kern/kern_thread.c:824
824             PROC_LOCK(p);
(kgdb) list
819             thread_stopped(p);
820             thread_suspend_one(td);
821             PROC_UNLOCK(p);
822             mi_switch(SW_VOL, NULL);
823             mtx_unlock_spin(&sched_lock);
824             PROC_LOCK(p);
825             mtx_lock_spin(&sched_lock);
826             if (mode == SINGLE_EXIT)
827                     remaining = p->p_numthreads;
828             else if (mode == SINGLE_BOUNDARY)

-- 
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE
Re: RELENG_6 panic under heavy load
On Thu, Nov 16, 2006 at 01:24:36PM +0300, Gleb Smirnoff wrote:
T> I wonder why UMA was suspected to be the problem. Dima gave
T> me access to the core. Here are more details from the trace:

It looks like a race between two threads in one process. Look here:

(kgdb) frame 12
#12 0xd05f4fc1 in _mtx_lock_sleep (m=0xd5dd5498, tid=3583683968, opts=0, file=0x12 <Address 0x12 out of bounds>, line=18)
    at /usr/src/sys/kern/kern_mutex.c:579
579             turnstile_wait(&m->mtx_object, mtx_owner(m));
(kgdb) p *m
$10 = {mtx_object = {lo_class = 0xd084e224, lo_name = 0xd080508c "process lock",
    lo_type = 0xd080508c "process lock", lo_flags = 4390912, lo_list = {
      tqe_next = 0xd5dd56b0, tqe_prev = 0xd5dd5290}, lo_witness = 0xd088a100},
  mtx_lock = 3611674882, mtx_recurse = 0}
(kgdb) p ((struct thread *)tid)
$15 = (struct thread *) 0xd59aad80
(kgdb) p ((struct thread *)(m->mtx_lock & ~(0x1 | 0x2)))
$17 = (struct thread *) 0xd745c900
(kgdb) p ((struct thread *)(m->mtx_lock & ~(0x1 | 0x2)))->td_proc
$18 = (struct proc *) 0xd5dd5430
(kgdb) p ((struct thread *)tid)->td_proc
$19 = (struct proc *) 0xd5dd5430

So, we see that one thread blocks on the lock that is held by another thread of the same process. Here they are:

* 134 Thread 100198 (PID=47872: nagios) doadump () at pcpu.h:165
  133 Thread 100147 (PID=47872: nagios) sched_switch (td=0xd745c900, newtd=0xd51f7a80, flags=2) at /usr/src/sys/kern/sched_4bsd.c:980

Let's look at the second one:

(kgdb) thread 133
[Switching to thread 133 (Thread 100147)]#0 sched_switch (td=0xd745c900, newtd=0xd51f7a80, flags=2)
    at /usr/src/sys/kern/sched_4bsd.c:980
980             sched_lock.mtx_lock = (uintptr_t)td;
(kgdb) bt
#0 sched_switch (td=0xd745c900, newtd=0xd51f7a80, flags=2) at /usr/src/sys/kern/sched_4bsd.c:980
#1 0xd0607f46 in mi_switch (flags=2, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:420
#2 0xd0615ecf in maybe_preempt_in_ksegrp (td=0xd59aad80) at kern_switch.c:467
#3 0xd06160c8 in setrunqueue (td=0xd59aad80, flags=0) at kern_switch.c:585
#4 0xd06151e7 in sched_wakeup (td=0xd59aad80) at /usr/src/sys/kern/sched_4bsd.c:996
#5 0xd0608025 in setrunnable (td=0xd59aad80) at /usr/src/sys/kern/kern_synch.c:483
#6 0xd060d78e in thread_unsuspend_one (td=0xd59aad80) at /usr/src/sys/kern/kern_thread.c:972
#7 0xd060d584 in thread_suspend_check (return_instead=0) at /usr/src/sys/kern/kern_thread.c:935
#8 0xd0628a88 in userret (td=0xd745c900, frame=0xf5dd4d38, oticks=1) at /usr/src/sys/kern/subr_trap.c:116
#9 0xd07a6e16 in syscall (frame=
      {tf_fs = 134938683, tf_es = 59, tf_ds = -809566149, tf_edi = 134997504, tf_esi = 134998528,
       tf_ebp = -813707944, tf_isp = -170046108, tf_ebx = 672261300, tf_edx = 0, tf_ecx = 134969072,
       tf_eax = 1, tf_trapno = 0, tf_err = 2, tf_eip = 672832335, tf_cs = 51, tf_eflags = 646,
       tf_esp = -813707972, tf_ss = 59})
    at /usr/src/sys/i386/i386/trap.c:1034
#10 0xd078f38f in Xint0x80_syscall () at /usr/src/sys/i386/i386/exception.s:200

-- 
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE
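The owner lookup in the kgdb session above is plain bit masking, and the arithmetic can be checked outside the debugger. The sketch below is a userland illustration for this 32-bit (i386) dump. The mask 0x1 | 0x2 is taken from the kgdb expressions in the message; the flag names attached to those two bits and all model_ identifiers are assumptions added here for readability.

```c
#include <assert.h>
#include <stdint.h>

/* Low bits of the lock word carry state; names are assumed, values are
 * the 0x1 and 0x2 used in the kgdb expressions. */
#define MODEL_MTX_RECURSED   0x1u
#define MODEL_MTX_CONTESTED  0x2u
#define MODEL_MTX_FLAGMASK   (~(uint32_t)(MODEL_MTX_RECURSED | MODEL_MTX_CONTESTED))

/* Owning thread pointer encoded in the lock word (32-bit kernel). */
static uint32_t
model_mtx_owner(uint32_t mtx_lock)
{
	return mtx_lock & MODEL_MTX_FLAGMASK;
}

/* State bits encoded in the lock word. */
static uint32_t
model_mtx_state(uint32_t mtx_lock)
{
	return mtx_lock & ~MODEL_MTX_FLAGMASK;
}

/* Re-derive the values printed in the trace. */
static void
model_check_trace(void)
{
	/* mtx_lock = 3611674882 decodes to owner 0xd745c900 ($17) ... */
	assert(model_mtx_owner(3611674882u) == 0xd745c900u);
	/* ... with only the 0x2 bit set. */
	assert(model_mtx_state(3611674882u) == MODEL_MTX_CONTESTED);
	/* tid = 3583683968 is the waiting thread 0xd59aad80 ($15). */
	assert(3583683968u == 0xd59aad80u);
}
```

The decoded owner, 0xd745c900, is the td argument of sched_switch in thread 133, which is how the two threads of PID 47872 are tied together in the analysis.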
Re: RELENG_6 panic under heavy load
On Wed, 2006-11-15 at 15:37 -0500, Kris Kennaway wrote:
> On Wed, Nov 15, 2006 at 09:24:21PM +0300, Dmitriy Kirhlarov wrote:
> > On Tue, Nov 14, 2006 at 01:53:45PM -0500, Kris Kennaway wrote:
> > > From alc@:
> > > ---
> > > I've never seen anything like this before. UMA is failing to allocate the zone structure. This is unrelated to the large-swap scenario that you ran into. Ask him to uncomment all of the UMA debugging #define's at the start of uma_core.c.
> >
> > It was very painful for me, and I didn't get a result...
> >
> > #define UMA_DEBUG 1
> > #define UMA_DEBUG_ALLOC 1
> > #define UMA_DEBUG_ALLOC_1 1
> >
> > in uma_core.c kills my machine. I get tons of crap on the serial console.
>
> The tons of crap is what was necessary to proceed.

I actually suspect that the confusion comes from replying to the wrong mail. I believe Kris Kennaway's response from alc@ in
http://docs.freebsd.org/cgi/getmsg.cgi?fetch=473581+0+current/freebsd-stable
was actually supposed to be a reply to
http://docs.freebsd.org/cgi/getmsg.cgi?fetch=434333+0+current/freebsd-stable
... and actually had nothing to do with this thread.

Gavin
Re: RELENG_6 panic under heavy load
On Thu, Nov 16, 2006 at 02:15:25PM +0300, Gleb Smirnoff wrote:
T> On Thu, Nov 16, 2006 at 01:24:36PM +0300, Gleb Smirnoff wrote:
T> T> I wonder why UMA was suspected to be the problem. Dima gave
T> T> me access to the core. Here are more details from the trace:

And even more:

(kgdb) thread 133
[Switching to thread 133 (Thread 100147)]#0 sched_switch (td=0xd745c900, newtd=0xd51f7a80, flags=2)
    at /usr/src/sys/kern/sched_4bsd.c:980
980             sched_lock.mtx_lock = (uintptr_t)td;
(kgdb) frame 9
#9 0xd07a6e16 in syscall (frame=
      {tf_fs = 134938683, tf_es = 59, tf_ds = -809566149, tf_edi = 134997504, tf_esi = 134998528,
       tf_ebp = -813707944, tf_isp = -170046108, tf_ebx = 672261300, tf_edx = 0, tf_ecx = 134969072,
       tf_eax = 1, tf_trapno = 0, tf_err = 2, tf_eip = 672832335, tf_cs = 51, tf_eflags = 646,
       tf_esp = -813707972, tf_ss = 59})
    at /usr/src/sys/i386/i386/trap.c:1034
1034            userret(td, frame, sticks);
(kgdb) p *callp
$92 = {sy_narg = 65539, sy_call = 0xd0630550 <poll>, sy_auevent = 43012}
(kgdb) set $poll = (struct thread *)0xd745c900
(kgdb) set $fork = (struct thread *)0xd59aad80
(kgdb) p $poll->td_state
$93 = TDS_INHIBITED
(kgdb) p $poll->td_inhibitors
$94 = 1 == TDI_SUSPENDED
(kgdb) p/x $poll->td_flags
$96 = 0x1010c01 == TDF_BORROWING | TDF_BOUNDARY | TDF_ASTPENDING | TDF_NEEDRESCHED | TDF_SCHED0
(kgdb) p $fork->td_state
$97 = TDS_INHIBITED
(kgdb) p $fork->td_inhibitors
$98 = 8 == TDI_LOCK
(kgdb) p/x $fork->td_flags
$99 = 0x100 == TDF_SCHED0

Not everything clear yet, but looks like:

1) $fork thread obtains proc lock
2) $poll thread blocks on proc lock
3) $fork thread has suspended the $poll thread in thread_single()
4) $fork thread temporarily unlocks proc lock (line 821) and is preempted by $poll thread
5) $poll thread obtains proc lock, and starts doing its poll job
6) $fork thread blocks on proc lock, and is added to its turnstile
7) $poll thread drops the proc lock, but isn't preempted by $fork
8) $poll thread exits and is preempted by $fork
...) and here is something difficult to understand, when $poll tries to
     make $fork runnable, while $fork is trying to put itself in the
     turnstile that is owned by $poll

-- 
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE
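The hand decoding of $poll->td_flags above can be reproduced mechanically. The sketch below is a userland illustration, not an excerpt from sys/proc.h. The five flag names come from the kgdb annotation, but which bit each name is given in the table is an assumption. It is at least consistent with the trace, since 0x1010c01 has exactly five bits set. All model_ identifiers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct model_flag {
	uint32_t    bit;
	const char *name;
};

/* Assumed bit assignments; see the note above. */
static const struct model_flag model_td_flags[] = {
	{ 0x00000001u, "TDF_BORROWING" },
	{ 0x00000400u, "TDF_BOUNDARY" },
	{ 0x00000800u, "TDF_ASTPENDING" },
	{ 0x00010000u, "TDF_NEEDRESCHED" },
	{ 0x01000000u, "TDF_SCHED0" },
};

/*
 * Write the names of all known bits set in `flags` into `buf`, joined by
 * " | ". Returns the bits that were not named (unknown bits, or bits
 * skipped because the buffer ran out of room).
 */
static uint32_t
model_decode_flags(uint32_t flags, char *buf, size_t len)
{
	size_t used = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	for (size_t i = 0; i < sizeof(model_td_flags) / sizeof(model_td_flags[0]); i++) {
		if ((flags & model_td_flags[i].bit) == 0)
			continue;
		int n = snprintf(buf + used, len - used, "%s%s",
		    used > 0 ? " | " : "", model_td_flags[i].name);
		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0';  /* drop the partial name and stop */
			break;
		}
		used += (size_t)n;
		flags &= ~model_td_flags[i].bit;
	}
	return flags;
}
```

Decoding 0x1010c01 with this table yields the same five names, in the same order, as the annotation on $96, with no bits left over.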
Re: RELENG_6 panic under heavy load
On Thu, Nov 16, 2006 at 02:31:32PM +, Gavin Atkinson wrote:
> On Wed, 2006-11-15 at 15:37 -0500, Kris Kennaway wrote:
> > On Wed, Nov 15, 2006 at 09:24:21PM +0300, Dmitriy Kirhlarov wrote:
> > > On Tue, Nov 14, 2006 at 01:53:45PM -0500, Kris Kennaway wrote:
> > > > From alc@:
> > > > ---
> > > > I've never seen anything like this before. UMA is failing to allocate the zone structure. This is unrelated to the large-swap scenario that you ran into. Ask him to uncomment all of the UMA debugging #define's at the start of uma_core.c.
> > >
> > > It was very painful for me, and I didn't get a result...
> > >
> > > #define UMA_DEBUG 1
> > > #define UMA_DEBUG_ALLOC 1
> > > #define UMA_DEBUG_ALLOC_1 1
> > >
> > > in uma_core.c kills my machine. I get tons of crap on the serial console.
> >
> > The tons of crap is what was necessary to proceed.
>
> I actually suspect that the confusion comes from replying to the wrong mail. I believe Kris Kennaway's response from alc@ in
> http://docs.freebsd.org/cgi/getmsg.cgi?fetch=473581+0+current/freebsd-stable
> was actually supposed to be a reply to
> http://docs.freebsd.org/cgi/getmsg.cgi?fetch=434333+0+current/freebsd-stable
> ... and actually had nothing to do with this thread.

Sorry, yes, you're right!

Kris
Re: RELENG_6 panic under heavy load
On Tue, Nov 14, 2006 at 01:53:45PM -0500, Kris Kennaway wrote:
> From alc@:
> ---
> I've never seen anything like this before. UMA is failing to allocate the zone structure. This is unrelated to the large-swap scenario that you ran into. Ask him to uncomment all of the UMA debugging #define's at the start of uma_core.c.

It was very painful for me, and I didn't get a result...

#define UMA_DEBUG 1
#define UMA_DEBUG_ALLOC 1
#define UMA_DEBUG_ALLOC_1 1

in uma_core.c kills my machine. I get tons of crap on the serial console. The server is inaccessible over the network too. I had to ask colocation support to reboot the server manually so that I could access the console and boot the old kernel.

I have now updated world and kernel to RELENG_6 as of 11/15/2006 00:00:00 UTC.

Any other ideas?

PS. unionfs is switched off now. UMA debugging is switched off too.

WBR
Dmitriy
Re: RELENG_6 panic under heavy load
On Wed, Nov 15, 2006 at 09:24:21PM +0300, Dmitriy Kirhlarov wrote:
> On Tue, Nov 14, 2006 at 01:53:45PM -0500, Kris Kennaway wrote:
> > From alc@:
> > ---
> > I've never seen anything like this before. UMA is failing to allocate the zone structure. This is unrelated to the large-swap scenario that you ran into. Ask him to uncomment all of the UMA debugging #define's at the start of uma_core.c.
>
> It was very painful for me, and I didn't get a result...
>
> #define UMA_DEBUG 1
> #define UMA_DEBUG_ALLOC 1
> #define UMA_DEBUG_ALLOC_1 1
>
> in uma_core.c kills my machine. I get tons of crap on the serial console.

The tons of crap is what was necessary to proceed.

Kris
Re: RELENG_6 panic under heavy load
On Tue, Nov 14, 2006 at 10:50:21AM +0300, Dmitriy Kirhlarov wrote:
> On Mon, Nov 13, 2006 at 01:45:05PM -0500, Kris Kennaway wrote:
> > On Mon, Nov 13, 2006 at 11:44:31AM +0300, Dmitriy Kirhlarov wrote:
> > > Hi, list.
> > >
> > > One of my monitoring servers is running FreeBSD 6.2-PRERELEASE #0: Fri Nov 10 11:03:10 UTC 2006 i386.
> > > Under heavy load it panics several times per day.
> > > The backtrace is accessible at: http://clh.higis.ru/~dimma/btfull.0
> > > Can somebody take a look? I sent a problem report, but have not yet received feedback from GNATS.
> >
> > Please provide your kernel config.
>
> http://clh.higis.ru/~dimma/OILSPACE1DEB
>
> I opened a PR for this problem -- kern/105464

Are you using unionfs? It's too broken to use, so you might as well take it out of your kernel config.

Kris
Re: RELENG_6 panic under heavy load
On Tue, Nov 14, 2006 at 10:50:21AM +0300, Dmitriy Kirhlarov wrote:
> On Mon, Nov 13, 2006 at 01:45:05PM -0500, Kris Kennaway wrote:
> > On Mon, Nov 13, 2006 at 11:44:31AM +0300, Dmitriy Kirhlarov wrote:
> > > Hi, list.
> > >
> > > One of my monitoring servers is running FreeBSD 6.2-PRERELEASE #0: Fri Nov 10 11:03:10 UTC 2006 i386.
> > > Under heavy load it panics several times per day.
> > > The backtrace is accessible at: http://clh.higis.ru/~dimma/btfull.0
> > > Can somebody take a look? I sent a problem report, but have not yet received feedback from GNATS.
> >
> > Please provide your kernel config.
>
> http://clh.higis.ru/~dimma/OILSPACE1DEB
>
> I opened a PR for this problem -- kern/105464

From alc@:

---
I've never seen anything like this before. UMA is failing to allocate the zone structure. This is unrelated to the large-swap scenario that you ran into. Ask him to uncomment all of the UMA debugging #define's at the start of uma_core.c.

Alan
---

Kris
RELENG_6 panic under heavy load
Hi, list.

One of my monitoring servers is running FreeBSD 6.2-PRERELEASE #0: Fri Nov 10 11:03:10 UTC 2006 i386.
Under heavy load it panics several times per day.
The backtrace is accessible at: http://clh.higis.ru/~dimma/btfull.0
Can somebody take a look? I sent a problem report, but have not yet received feedback from GNATS.

Also, after replacing the kernel, I have another, easily reproducible panic. It occurs with the string if_vlan_load="YES" in /boot/loader.conf and device vlan in the kernel. I use fxp network cards, if that is important. I don't have a backtrace for this situation yet, but I will make one if needed.

-- 
Dmitriy Kirhlarov
OILspace, 26 Leninskaya sloboda, bld. 2, 2nd floor, 115280 Moscow, Russia
P:+7 495 105 7247 ext.208 F:+7 495 105 7246 E:[EMAIL PROTECTED]
OILspace - The resource enriched - www.oilspace.com
Re: RELENG_6 panic under heavy load
On Mon, Nov 13, 2006 at 11:44:31AM +0300, Dmitriy Kirhlarov wrote:
> Hi, list.
>
> One of my monitoring servers is running FreeBSD 6.2-PRERELEASE #0: Fri Nov 10 11:03:10 UTC 2006 i386.
> Under heavy load it panics several times per day.
> The backtrace is accessible at: http://clh.higis.ru/~dimma/btfull.0
> Can somebody take a look? I sent a problem report, but have not yet received feedback from GNATS.

Please provide your kernel config.

> Also, after replacing the kernel, I have another, easily reproducible panic. It occurs with the string if_vlan_load="YES" in /boot/loader.conf and device vlan in the kernel. I use fxp network cards, if that is important. I don't have a backtrace for this situation yet, but I will make one if needed.

Yes, you'll need to provide a backtrace for this too. Please post it separately to avoid confusion between the two problems.

Kris
Re: RELENG_6 panic under heavy load
On Mon, Nov 13, 2006 at 01:45:05PM -0500, Kris Kennaway wrote:
> On Mon, Nov 13, 2006 at 11:44:31AM +0300, Dmitriy Kirhlarov wrote:
> > Hi, list.
> >
> > One of my monitoring servers is running FreeBSD 6.2-PRERELEASE #0: Fri Nov 10 11:03:10 UTC 2006 i386.
> > Under heavy load it panics several times per day.
> > The backtrace is accessible at: http://clh.higis.ru/~dimma/btfull.0
> > Can somebody take a look? I sent a problem report, but have not yet received feedback from GNATS.
>
> Please provide your kernel config.

http://clh.higis.ru/~dimma/OILSPACE1DEB

I opened a PR for this problem -- kern/105464

WBR
Dmitriy