ZFS panic on a RELENG_8 NFS server (Was: panic: spin lock held too long (RELENG_8 from today))

2011-09-09 Thread Hiroki Sato
Hiroki Sato h...@freebsd.org wrote
  in 20110907.094717.2272609566853905102@allbsd.org:

hr  During this investigation a disk had to be replaced, and
hr  resilvering it is now in progress.  A deadlock and a forced reboot
hr  after it make recovery of the ZFS datasets take a long time
hr  (committing logs, I think), so I will try to reproduce the deadlock
hr  and get a core dump after the resilver finishes.

 I think I have reproduced the symptoms.  I cannot tell whether these
 are exactly the same as what occurred on my box before, because the
 kernel was replaced with one built with some debugging options, but
 they are at least reproducible.

 There are two symptoms.  One is a panic; the DDB output from when the
 panic occurred is the following:


Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x10040
fault code  = supervisor read data, page not present
instruction pointer = 0x20:0x8065b926
stack pointer   = 0x28:0xff8257b94d70
frame pointer   = 0x28:0xff8257b94e10
code segment= base 0x0, limit 0xf, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags= interrupt enabled, resume, IOPL = 0
current process = 992 (nfsd: service)
[thread pid 992 tid 100586 ]
Stopped at  witness_checkorder+0x246:   movl    0x40(%r13),%ebx

db> bt
Tracing pid 992 tid 100586 td 0xff00595d9000
witness_checkorder() at witness_checkorder+0x246
_sx_slock() at _sx_slock+0x35
dmu_bonus_hold() at dmu_bonus_hold+0x57
zfs_zget() at zfs_zget+0x237
zfs_dirent_lock() at zfs_dirent_lock+0x488
zfs_dirlook() at zfs_dirlook+0x69
zfs_lookup() at zfs_lookup+0x26b
zfs_freebsd_lookup() at zfs_freebsd_lookup+0x81
vfs_cache_lookup() at vfs_cache_lookup+0xf0
VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0x40
lookup() at lookup+0x384
nfsvno_namei() at nfsvno_namei+0x268
nfsrvd_lookup() at nfsrvd_lookup+0xd6
nfsrvd_dorpc() at nfsrvd_dorpc+0x745
nfssvc_program() at nfssvc_program+0x447
svc_run_internal() at svc_run_internal+0x51b
svc_thread_start() at svc_thread_start+0xb
fork_exit() at fork_exit+0x11d
fork_trampoline() at fork_trampoline+0xe
--- trap 0xc, rip = 0x8006a031c, rsp = 0x7fffe6c8, rbp = 0x6 ---


 The complete output can be found at:

  http://people.allbsd.org/~hrs/zfs_panic_20110909_1/pool-zfs-20110909-1.txt

 The other is getting stuck on ZFS access.  The kernel keeps running
 without a panic, but any access to a ZFS dataset leaves the accessing
 program unresponsive.  The DDB output can be found at:

  http://people.allbsd.org/~hrs/zfs_panic_20110909_2/pool-zfs-20110909-2.txt

 The trigger for both was some access to a ZFS dataset from the NFS
 clients.  Because the access pattern was complex I could not narrow
 down the culprit, but it seems timing-dependent, and simply running
 "rm -rf" locally on the server can sometimes trigger them.

 The crash dump and the kernel can be found at the following URLs:

  panic:
http://people.allbsd.org/~hrs/zfs_panic_20110909_1/

  no panic but unresponsive:
http://people.allbsd.org/~hrs/zfs_panic_20110909_2/

  kernel:
http://people.allbsd.org/~hrs/zfs_panic_20110909_kernel/

-- Hiroki
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: panic: spin lock held too long (RELENG_8 from today)

2011-09-06 Thread Hiroki Sato
Attilio Rao atti...@freebsd.org wrote
  in CAJ-FndAChGndC=lkzni7i6mot+spw3-ofto9rh0+5wnnvwz...@mail.gmail.com:

at This should be enough for someone NFS-aware to look into it.
at
at Were you also able to get a core?

 Yes.  But as kib@ pointed out, it seems to be a deadlock in ZFS.
 Some experiments I did showed that this deadlock can be triggered at
 least by running "rm -rf" against a local directory that has a large
 number of files/sub-directories.

 Then I updated the kernel to the latest 8-STABLE plus the WITNESS
 option, because a fix for the LOR on the spa_config lock had been
 committed and tracking locks without WITNESS was hard.  The deadlock
 can still be triggered after that.

 During this investigation a disk had to be replaced, and resilvering
 it is now in progress.  A deadlock and a forced reboot after it make
 recovery of the ZFS datasets take a long time (committing logs, I
 think), so I will try to reproduce the deadlock and get a core dump
 after the resilver finishes.

 If the old kernel and core of the deadlock I reported on Saturday
 are still useful for debugging, I can put them somewhere you can
 access them.

-- Hiroki




Re: panic: spin lock held too long (RELENG_8 from today)

2011-09-03 Thread Kostik Belousov
On Sat, Sep 03, 2011 at 12:05:47PM +0200, Attilio Rao wrote:
 This should be enough for someone NFS-aware to look into it.
 
 Were you also able to get a core?
 
 I'll try to look into it in the next days, in particular about the
 softclock state.
 
I am absolutely sure that this is a ZFS deadlock.




Re: panic: spin lock held too long (RELENG_8 from today)

2011-09-01 Thread Trent Nelson

On Aug 19, 2011, at 7:53 PM, Attilio Rao wrote:

 If nobody complains about it earlier, I'll propose the patch to re@ in 8 hours.

Just a friendly 'me too', for the record.  22 hours of heavy network/disk I/O
and no panic yet -- prior to the patch it was a panic orgy.

Any response from re@ on the patch?  It didn't appear to be in stable/8 as of 
yesterday:

[root@flanker/ttypts/0(../src/sys/kern)#] svn diff
Index: kern_timeout.c
===================================================================
--- kern_timeout.c  (revision 225280)
+++ kern_timeout.c  (working copy)
@@ -268,9 +268,11 @@
 	CC_LOCK_ASSERT(cc);
 
 	c->c_cpu = CPUBLOCK;
+	spinlock_enter();
 	CC_UNLOCK(cc);
 	new_cc = CC_CPU(new_cpu);
 	CC_LOCK(new_cc);
+	spinlock_exit();
 	c->c_cpu = new_cpu;
 	return (new_cc);
 }


Regards,

Trent.


Re: panic: spin lock held too long (RELENG_8 from today)

2011-09-01 Thread Attilio Rao
2011/9/1 Trent Nelson tr...@snakebite.org:

 On Aug 19, 2011, at 7:53 PM, Attilio Rao wrote:

 If nobody complains about it earlier, I'll propose the patch to re@ in 8 hours.

 Just a friendly 'me too', for the record.  22 hours of heavy network/disk
 I/O and no panic yet -- prior to the patch it was a panic orgy.

 Any response from re@ on the patch?  It didn't appear to be in stable/8 as of 
 yesterday:

It has been committed to STABLE_8 as r225288.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-19 Thread Mike Tancsa
On 8/18/2011 8:37 PM, Chip Camden wrote:

 st Thanks, Attilio.  I've applied the patch and removed the extra debug
 st options I had added (though keeping debug symbols).  I'll let you know if
 st I experience any more panics.

  No panic for 20 hours at this moment, FYI.  For my NFS server, I
  think another 24 hours would be sufficient to confirm the stability.
  I will see how it works...

 -- Hiroki
 
 Likewise:
 
 $ uptime
  5:37PM  up 21:45, 5 users, load averages: 0.68, 0.45, 0.63
 
 So far, so good (knocks on head).
 


0(ns4)% uptime
 8:55AM  up 22:39, 3 users, load averages: 0.01, 0.00, 0.00
0(ns4)%


So far so good for me too

---Mike

-- 
---
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, m...@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada   http://www.tancsa.com/


Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-19 Thread Chip Camden
Quoth Mike Tancsa on Friday, 19 August 2011:
 On 8/18/2011 8:37 PM, Chip Camden wrote:
 
  st Thanks, Attilio.  I've applied the patch and removed the extra debug
  st options I had added (though keeping debug symbols).  I'll let you know if
  st I experience any more panics.
 
   No panic for 20 hours at this moment, FYI.  For my NFS server, I
   think another 24 hours would be sufficient to confirm the stability.
   I will see how it works...
 
  -- Hiroki
  
  Likewise:
  
  $ uptime
   5:37PM  up 21:45, 5 users, load averages: 0.68, 0.45, 0.63
  
  So far, so good (knocks on head).
  
 
 
 0(ns4)% uptime
  8:55AM  up 22:39, 3 users, load averages: 0.01, 0.00, 0.00
 0(ns4)%
 
 
 So far so good for me too
 
   ---Mike
 
 -- 
 ---
 Mike Tancsa, tel +1 519 651 3400
 Sentex Communications, m...@sentex.net
 Providing Internet services since 1994 www.sentex.net
 Cambridge, Ontario Canada   http://www.tancsa.com/

Still up and running here.

 8:02AM  up 1 day, 12:10, 4 users, load averages: 0.08, 0.26, 0.52

After the panics began, I never went more than 12 hours without one before
applying this patch.  I think you nailed it, Attilio.  Or at least, you
moved it.

-- 
.O. | Sterling (Chip) Camden  | http://camdensoftware.com
..O | sterl...@camdensoftware.com | http://chipsquips.com
OOO | 2048R/D6DBAF91  | http://chipstips.com




Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-19 Thread Attilio Rao
If nobody complains about it earlier, I'll propose the patch to re@ in 8 hours.

Attilio

2011/8/19 Mike Tancsa m...@sentex.net:
 On 8/18/2011 8:37 PM, Chip Camden wrote:

 st Thanks, Attilio.  I've applied the patch and removed the extra debug
 st options I had added (though keeping debug symbols).  I'll let you know if
 st I experience any more panics.

  No panic for 20 hours at this moment, FYI.  For my NFS server, I
  think another 24 hours would be sufficient to confirm the stability.
  I will see how it works...

 -- Hiroki

 Likewise:

 $ uptime
  5:37PM  up 21:45, 5 users, load averages: 0.68, 0.45, 0.63

 So far, so good (knocks on head).



 0(ns4)% uptime
  8:55AM  up 22:39, 3 users, load averages: 0.01, 0.00, 0.00
 0(ns4)%


 So far so good for me too

        ---Mike

 --
 ---
 Mike Tancsa, tel +1 519 651 3400
 Sentex Communications, m...@sentex.net
 Providing Internet services since 1994 www.sentex.net
 Cambridge, Ontario Canada   http://www.tancsa.com/




-- 
Peace can only be achieved by understanding - A. Einstein


Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-19 Thread Hiroki Sato
Attilio Rao atti...@freebsd.org wrote
  in CAJ-FndDHmwa+=lnggu+5mk2xmtj8kwhb10jsoytkmgetvgn...@mail.gmail.com:

at If nobody complains about it earlier, I'll propose the patch to re@ in 8 hours.

 Running fine for 45 hours so far.  Please go ahead!

-- Hiroki




Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-18 Thread Hiroki Sato
Chip Camden sterl...@camdensoftware.com wrote
  in 20110818025550.ga1...@libertas.local.camdensoftware.com:

st Quoth Attilio Rao on Thursday, 18 August 2011:
st  In callout_cpu_switch() if a low priority thread is migrating the
st  callout and gets preempted after the outgoing CPU queue lock is left
st  (and scheduled much later) we get this problem.
st 
st  In order to fix this bug it could be enough to use a critical section,
st  but I think this should be really interrupt safe, thus I'd wrap them
st  up with spinlock_enter()/spinlock_exit(). Fortunately
st  callout_cpu_switch() should be called rarely and also we already do
st  expensive locking operations in callout, thus we should not have
st  problem performance-wise.
st 
st  Can the guys I also CC'ed here try the following patch, with all the
st  initial kernel options that were leading you to the deadlock? (thus
st  revert any debugging patch/option you added for the moment):
st  http://www.freebsd.org/~attilio/callout-fixup.diff
st 
st  Please note that this patch is for STABLE_8, if you can confirm the
st  good result I'll commit to -CURRENT and then backmerge as soon as
st  possible.
st 
st  Thanks,
st  Attilio
st 
st
st Thanks, Attilio.  I've applied the patch and removed the extra debug
st options I had added (though keeping debug symbols).  I'll let you know if
st I experience any more panics.

 No panic for 20 hours at this moment, FYI.  For my NFS server, I
 think another 24 hours would be sufficient to confirm the stability.
 I will see how it works...

-- Hiroki




Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-18 Thread Chip Camden
Quoth Hiroki Sato on Friday, 19 August 2011:
 Chip Camden sterl...@camdensoftware.com wrote
   in 20110818025550.ga1...@libertas.local.camdensoftware.com:
 
 st Quoth Attilio Rao on Thursday, 18 August 2011:
 st  In callout_cpu_switch() if a low priority thread is migrating the
 st  callout and gets preempted after the outgoing CPU queue lock is left
 st  (and scheduled much later) we get this problem.
 st 
 st  In order to fix this bug it could be enough to use a critical section,
 st  but I think this should be really interrupt safe, thus I'd wrap them
 st  up with spinlock_enter()/spinlock_exit(). Fortunately
 st  callout_cpu_switch() should be called rarely and also we already do
 st  expensive locking operations in callout, thus we should not have
 st  problem performance-wise.
 st 
 st  Can the guys I also CC'ed here try the following patch, with all the
 st  initial kernel options that were leading you to the deadlock? (thus
 st  revert any debugging patch/option you added for the moment):
 st  http://www.freebsd.org/~attilio/callout-fixup.diff
 st 
 st  Please note that this patch is for STABLE_8, if you can confirm the
 st  good result I'll commit to -CURRENT and then backmerge as soon as
 st  possible.
 st 
 st  Thanks,
 st  Attilio
 st 
 st
 st Thanks, Attilio.  I've applied the patch and removed the extra debug
 st options I had added (though keeping debug symbols).  I'll let you know if
 st I experience any more panics.
 
  No panic for 20 hours at this moment, FYI.  For my NFS server, I
  think another 24 hours would be sufficient to confirm the stability.
  I will see how it works...
 
 -- Hiroki

Likewise:

$ uptime
 5:37PM  up 21:45, 5 users, load averages: 0.68, 0.45, 0.63

So far, so good (knocks on head).

-- 
.O. | Sterling (Chip) Camden  | http://camdensoftware.com
..O | sterl...@camdensoftware.com | http://chipsquips.com
OOO | 2048R/D6DBAF91  | http://chipstips.com




Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-17 Thread Hiroki Sato
Hi,

Mike Tancsa m...@sentex.net wrote
  in 4e15a08c.6090...@sentex.net:

mi On 7/7/2011 7:32 AM, Mike Tancsa wrote:
mi  On 7/7/2011 4:20 AM, Kostik Belousov wrote:
mi 
mi  BTW, we had a similar panic, spinlock held too long, the spinlock
mi  is the sched lock N, on busy 8-core box recently upgraded to the
mi  stable/8. Unfortunately, machine hung dumping core, so the stack trace
mi  for the owner thread was not available.
mi 
mi  I was unable to make any conclusion from the data that was present.
mi  If the situation is reproducible, you could try to revert r221937.  This
mi  is pure speculation, though.
mi 
mi  Another crash just now after 5hrs uptime. I will try and revert r221937
mi  unless there is any extra debugging you want me to add to the kernel
mi  instead  ?

 I am also suffering from a reproducible panic on an 8-STABLE box, an
 NFS server with heavy I/O load.  I could not get a kernel dump
 because this panic locked up the machine just after it occurred, but
 according to the stack trace it was the same as the posted one.
 Switching to an 8.2R kernel can prevent this panic.

 Any progress on the investigation?

--
spin lock 0x80cb46c0 (sched lock 0) held by 0xff01900458c0 (tid 100489) too long
panic: spin lock held too long
cpuid = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
kdb_backtrace() at kdb_backtrace+0x37
panic() at panic+0x187
_mtx_lock_spin_failed() at _mtx_lock_spin_failed+0x39
_mtx_lock_spin() at _mtx_lock_spin+0x9e
sched_add() at sched_add+0x117
setrunnable() at setrunnable+0x78
sleepq_signal() at sleepq_signal+0x7a
cv_signal() at cv_signal+0x3b
xprt_active() at xprt_active+0xe3
svc_vc_soupcall() at svc_vc_soupcall+0xc
sowakeup() at sowakeup+0x69
tcp_do_segment() at tcp_do_segment+0x25e7
tcp_input() at tcp_input+0xcdd
ip_input() at ip_input+0xac
netisr_dispatch_src() at netisr_dispatch_src+0x7e
ether_demux() at ether_demux+0x14d
ether_input() at ether_input+0x17d
em_rxeof() at em_rxeof+0x1ca
em_handle_que() at em_handle_que+0x5b
taskqueue_run_locked() at taskqueue_run_locked+0x85
taskqueue_thread_loop() at taskqueue_thread_loop+0x4e
fork_exit() at fork_exit+0x11f
fork_trampoline() at fork_trampoline+0xe
--

-- Hiroki




Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-17 Thread Chip Camden
Quoth Hiroki Sato on Thursday, 18 August 2011:
 Hi,
 
 Mike Tancsa m...@sentex.net wrote
   in 4e15a08c.6090...@sentex.net:
 
 mi On 7/7/2011 7:32 AM, Mike Tancsa wrote:
 mi  On 7/7/2011 4:20 AM, Kostik Belousov wrote:
 mi 
 mi  BTW, we had a similar panic, spinlock held too long, the spinlock
 mi  is the sched lock N, on busy 8-core box recently upgraded to the
 mi  stable/8. Unfortunately, machine hung dumping core, so the stack trace
 mi  for the owner thread was not available.
 mi 
 mi  I was unable to make any conclusion from the data that was present.
 mi  If the situation is reproducible, you could try to revert r221937.  This
 mi  is pure speculation, though.
 mi 
 mi  Another crash just now after 5hrs uptime. I will try and revert r221937
 mi  unless there is any extra debugging you want me to add to the kernel
 mi  instead  ?
 
  I am also suffering from a reproducible panic on an 8-STABLE box, an
  NFS server with heavy I/O load.  I could not get a kernel dump
  because this panic locked up the machine just after it occurred, but
  according to the stack trace it was the same as posted one.
  Switching to an 8.2R kernel can prevent this panic.
 
  Any progress on the investigation?
 
 --
 spin lock 0x80cb46c0 (sched lock 0) held by 0xff01900458c0 (tid 
 100489) too long
 panic: spin lock held too long
 cpuid = 1
 KDB: stack backtrace:
 db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
 kdb_backtrace() at kdb_backtrace+0x37
 panic() at panic+0x187
 _mtx_lock_spin_failed() at _mtx_lock_spin_failed+0x39
 _mtx_lock_spin() at _mtx_lock_spin+0x9e
 sched_add() at sched_add+0x117
 setrunnable() at setrunnable+0x78
 sleepq_signal() at sleepq_signal+0x7a
 cv_signal() at cv_signal+0x3b
 xprt_active() at xprt_active+0xe3
 svc_vc_soupcall() at svc_vc_soupcall+0xc
 sowakeup() at sowakeup+0x69
 tcp_do_segment() at tcp_do_segment+0x25e7
 tcp_input() at tcp_input+0xcdd
 ip_input() at ip_input+0xac
 netisr_dispatch_src() at netisr_dispatch_src+0x7e
 ether_demux() at ether_demux+0x14d
 ether_input() at ether_input+0x17d
 em_rxeof() at em_rxeof+0x1ca
 em_handle_que() at em_handle_que+0x5b
 taskqueue_run_locked() at taskqueue_run_locked+0x85
 taskqueue_thread_loop() at taskqueue_thread_loop+0x4e
 fork_exit() at fork_exit+0x11f
 fork_trampoline() at fork_trampoline+0xe
 --
 
 -- Hiroki


I'm also getting similar panics on 8.2-STABLE.  Locks up everything and I
have to power off.  Once, I happened to be looking at the console when it
happened and copied down the following:

Sleeping thread (tif 100037, pid 0) owns a non-sleepable lock
panic: sleeping thread
cpuid=1

Another time I got:

lock order reversal:
1st 0xff000593e330 snaplk (snaplk) @ /usr/src/sys/kern/vfr_vnops.c:296
2nd 0xff0005e5d578 ufs (ufs) @ /usr/src/sys/ufs/ffs/ffs_snapshot.c:1587

I didn't copy down the traceback.

These panics seem to hit when I'm doing heavy WAN I/O.  I can go for
about a day without one as long as I stay away from the web or even chat.
Last night this system copied a backup of 35GB over the local network
without failing, but as soon as I hopped onto Firefox this morning, down
she went.  I don't know if that's coincidence or useful data.

I didn't get to say Thanks to Eitan Adler for attempting to help me
with this on Monday night.  Thanks, Eitan!

-- 
.O. | Sterling (Chip) Camden  | http://camdensoftware.com
..O | sterl...@camdensoftware.com | http://chipsquips.com
OOO | 2048R/D6DBAF91  | http://chipstips.com




Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-17 Thread Mike Tancsa
On 8/17/2011 1:38 PM, Hiroki Sato wrote:
  Any progress on the investigation?

Unfortunately, I cannot reproduce it yet with a debugging kernel :(


---Mike

 
 --
 spin lock 0x80cb46c0 (sched lock 0) held by 0xff01900458c0 (tid 
 100489) too long
 panic: spin lock held too long
 cpuid = 1
 KDB: stack backtrace:
 db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
 kdb_backtrace() at kdb_backtrace+0x37
 panic() at panic+0x187
 _mtx_lock_spin_failed() at _mtx_lock_spin_failed+0x39
 _mtx_lock_spin() at _mtx_lock_spin+0x9e
 sched_add() at sched_add+0x117
 setrunnable() at setrunnable+0x78
 sleepq_signal() at sleepq_signal+0x7a
 cv_signal() at cv_signal+0x3b
 xprt_active() at xprt_active+0xe3
 svc_vc_soupcall() at svc_vc_soupcall+0xc
 sowakeup() at sowakeup+0x69
 tcp_do_segment() at tcp_do_segment+0x25e7
 tcp_input() at tcp_input+0xcdd
 ip_input() at ip_input+0xac
 netisr_dispatch_src() at netisr_dispatch_src+0x7e
 ether_demux() at ether_demux+0x14d
 ether_input() at ether_input+0x17d
 em_rxeof() at em_rxeof+0x1ca
 em_handle_que() at em_handle_que+0x5b
 taskqueue_run_locked() at taskqueue_run_locked+0x85
 taskqueue_thread_loop() at taskqueue_thread_loop+0x4e
 fork_exit() at fork_exit+0x11f
 fork_trampoline() at fork_trampoline+0xe
 --
 
 -- Hiroki


-- 
---
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, m...@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada   http://www.tancsa.com/


Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-17 Thread Attilio Rao
2011/8/17 Hiroki Sato h...@freebsd.org:
 Hi,

 Mike Tancsa m...@sentex.net wrote
  in 4e15a08c.6090...@sentex.net:

 mi On 7/7/2011 7:32 AM, Mike Tancsa wrote:
 mi  On 7/7/2011 4:20 AM, Kostik Belousov wrote:
 mi 
 mi  BTW, we had a similar panic, spinlock held too long, the spinlock
 mi  is the sched lock N, on busy 8-core box recently upgraded to the
 mi  stable/8. Unfortunately, machine hung dumping core, so the stack trace
 mi  for the owner thread was not available.
 mi 
 mi  I was unable to make any conclusion from the data that was present.
 mi  If the situation is reproducible, you could try to revert r221937.  This
 mi  is pure speculation, though.
 mi 
 mi  Another crash just now after 5hrs uptime. I will try and revert r221937
 mi  unless there is any extra debugging you want me to add to the kernel
 mi  instead  ?

  I am also suffering from a reproducible panic on an 8-STABLE box, an
  NFS server with heavy I/O load.  I could not get a kernel dump
  because this panic locked up the machine just after it occurred, but
  according to the stack trace it was the same as posted one.
  Switching to an 8.2R kernel can prevent this panic.

  Any progress on the investigation?

Hiroki,
how easily can you reproduce it?

It would be important to have a DDB textdump with this information:
- bt
- ps
- show allpcpu
- alltrace

Alternatively, a coredump from a kernel with the stop-CPU patch, which Andriy can provide.
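As an aside, the output requested here can be captured automatically at panic time with ddb(8) scripting plus textdump(4); the following is a sketch only, to be checked against the ddb(8) and textdump(4) manual pages for the exact syntax on a given 8.x system:

```
# /etc/ddb.conf -- commands loaded into the kernel debugger at boot
# (enabled with ddb_enable="YES" in /etc/rc.conf).  On panic, write a
# textdump containing the requested information, then dump and reboot.
script kdb.enter.panic=textdump set; capture on; bt; ps; show allpcpu; alltrace; capture off; call doadump; reset
```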

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-17 Thread Hiroki Sato
Attilio Rao atti...@freebsd.org wrote
  in caj-fndcdow0_b2mv0lzeo-tpea9+7oanj7ihvkqsm4j4b0d...@mail.gmail.com:

at 2011/8/17 Hiroki Sato h...@freebsd.org:
at  Hi,
at 
at  Mike Tancsa m...@sentex.net wrote
at   in 4e15a08c.6090...@sentex.net:
at 
at  mi On 7/7/2011 7:32 AM, Mike Tancsa wrote:
at  mi  On 7/7/2011 4:20 AM, Kostik Belousov wrote:
at  mi 
at  mi  BTW, we had a similar panic, spinlock held too long, the spinlock
at  mi  is the sched lock N, on busy 8-core box recently upgraded to the
at  mi  stable/8. Unfortunately, machine hung dumping core, so the stack 
trace
at  mi  for the owner thread was not available.
at  mi 
at  mi  I was unable to make any conclusion from the data that was present.
at  mi  If the situation is reproducible, you could try to revert r221937.  This
at  mi  is pure speculation, though.
at  mi 
at  mi  Another crash just now after 5hrs uptime. I will try and revert 
r221937
at  mi  unless there is any extra debugging you want me to add to the kernel
at  mi  instead  ?
at 
at   I am also suffering from a reproducible panic on an 8-STABLE box, an
at   NFS server with heavy I/O load.  I could not get a kernel dump
at   because this panic locked up the machine just after it occurred, but
at   according to the stack trace it was the same as posted one.
at   Switching to an 8.2R kernel can prevent this panic.
at 
at   Any progress on the investigation?
at 
at Hiroki,
at how easily can you reproduce it?

 It takes 5-10 hours.  I installed another kernel for debugging just
 now, so I think I will be able to collect more detailed information in
 a couple of days.

at It would be important to have a DDB textdump with this information:
at - bt
at - ps
at - show allpcpu
at - alltrace
at 
at Alternatively, a coredump from a kernel with the stop-CPU patch, which Andriy can provide.

 Okay, I will post them once I can get another panic.  Thanks!

-- Hiroki




Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-17 Thread Jeremy Chadwick
On Wed, Aug 17, 2011 at 10:52:01AM -0700, Chip Camden wrote:
 Quoth Hiroki Sato on Thursday, 18 August 2011:
  Hi,
  
  Mike Tancsa m...@sentex.net wrote
in 4e15a08c.6090...@sentex.net:
  
  mi On 7/7/2011 7:32 AM, Mike Tancsa wrote:
  mi  On 7/7/2011 4:20 AM, Kostik Belousov wrote:
  mi 
  mi  BTW, we had a similar panic, spinlock held too long, the spinlock
  mi  is the sched lock N, on busy 8-core box recently upgraded to the
  mi  stable/8. Unfortunately, machine hung dumping core, so the stack 
  trace
  mi  for the owner thread was not available.
  mi 
  mi  I was unable to make any conclusion from the data that was present.
  mi  If the situation is reproducible, you could try to revert r221937.  This
  mi  is pure speculation, though.
  mi 
  mi  Another crash just now after 5hrs uptime. I will try and revert 
  r221937
  mi  unless there is any extra debugging you want me to add to the kernel
  mi  instead  ?
  
   I am also suffering from a reproducible panic on an 8-STABLE box, an
   NFS server with heavy I/O load.  I could not get a kernel dump
   because this panic locked up the machine just after it occurred, but
   according to the stack trace it was the same as posted one.
   Switching to an 8.2R kernel can prevent this panic.
  
   Any progress on the investigation?
  
  --
  spin lock 0x80cb46c0 (sched lock 0) held by 0xff01900458c0 (tid 
  100489) too long
  panic: spin lock held too long
  cpuid = 1
  KDB: stack backtrace:
  db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
  kdb_backtrace() at kdb_backtrace+0x37
  panic() at panic+0x187
  _mtx_lock_spin_failed() at _mtx_lock_spin_failed+0x39
  _mtx_lock_spin() at _mtx_lock_spin+0x9e
  sched_add() at sched_add+0x117
  setrunnable() at setrunnable+0x78
  sleepq_signal() at sleepq_signal+0x7a
  cv_signal() at cv_signal+0x3b
  xprt_active() at xprt_active+0xe3
  svc_vc_soupcall() at svc_vc_soupcall+0xc
  sowakeup() at sowakeup+0x69
  tcp_do_segment() at tcp_do_segment+0x25e7
  tcp_input() at tcp_input+0xcdd
  ip_input() at ip_input+0xac
  netisr_dispatch_src() at netisr_dispatch_src+0x7e
  ether_demux() at ether_demux+0x14d
  ether_input() at ether_input+0x17d
  em_rxeof() at em_rxeof+0x1ca
  em_handle_que() at em_handle_que+0x5b
  taskqueue_run_locked() at taskqueue_run_locked+0x85
  taskqueue_thread_loop() at taskqueue_thread_loop+0x4e
  fork_exit() at fork_exit+0x11f
  fork_trampoline() at fork_trampoline+0xe
  --
  
  -- Hiroki
 
 
 I'm also getting similar panics on 8.2-STABLE.  Locks up everything and I
 have to power off.  Once, I happened to be looking at the console when it
 happened and copied down the following:
 
 Sleeping thread (tif 100037, pid 0) owns a non-sleepable lock
 panic: sleeping thread
 cpuid=1

No idea, might be relevant to the thread.

 Another time I got:
 
 lock order reversal:
 1st 0xff000593e330 snaplk (snaplk) @ /usr/src/sys/kern/vfr_vnops.c:296
 2nd 0xff0005e5d578 ufs (ufs) @ /usr/src/sys/ufs/ffs/ffs_snapshot.c:1587
 
 I didn't copy down the traceback.

snaplk refers to UFS snapshots.  The above must have been typed in
manually, judging by the typos in the filenames.

Either this is a different problem, or, if everyone in this thread is
using UFS snapshots (dump -L, mksnap_ffs, etc.) and hitting it, I
recommend they stop using UFS snapshots.  I've ranted about their
unreliability in the past (years upon years ago -- still seems valid)
and about just how badly they can wedge a system.  This is one of the
many (MANY!) reasons why we use rsnapshot/rsync instead.  The atime
clobbering issue is the only downside.

I don't see what this has to do with heavy WAN I/O unless you're doing
something like dump-over-ssh, in which case see the above paragraph.

 These panics seem to hit when I'm doing heavy WAN I/O.  I can go for
 about a day without one as long as I stay away from the web or even chat.
 Last night this system copied a backup of 35GB over the local network
 without failing, but as soon as I hopped onto Firefox this morning, down
 she went.  I don't know if that's coincidence or useful data.
 
 I didn't get to say Thanks to Eitan Adler for attempting to help me
 with this on Monday night.  Thanks, Eitan!

-- 
| Jeremy Chadwickjdc at parodius.com |
| Parodius Networking   http://www.parodius.com/ |
| UNIX Systems Administrator   Mountain View, CA, US |
| Making life hard for others since 1977.   PGP 4BD6C0CB |



Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-17 Thread Chip Camden
Quoth Jeremy Chadwick on Wednesday, 17 August 2011:
  
  I'm also getting similar panics on 8.2-STABLE.  Locks up everything and I
  have to power off.  Once, I happened to be looking at the console when it
  happened and copied down the following:
  
  Sleeping thread (tif 100037, pid 0) owns a non-sleepable lock
  panic: sleeping thread
  cpuid=1
 
 No idea, might be relevant to the thread.
 
  Another time I got:
  
  lock order reversal:
  1st 0xff000593e330 snaplk (snaplk) @ /usr/src/sys/kern/vfr_vnops.c:296
  2nd 0xff0005e5d578 ufs (ufs) @ /usr/src/sys/ufs/ffs/ffs_snapshot.c:1587
  
  I didn't copy down the traceback.
 
 snaplk refers to UFS snapshots.  The above must have been typed in
 manually as well, judging by the typos in the filenames.
 
 Either this is a different problem, or if everyone in this thread is
 doing UFS snapshots (dump -L, mksnap_ffs, etc.) and having this problem
 happen then I recommend people stop using UFS snapshots.  I've ranted
 about their unreliability in the past (years upon years ago -- still
 seems valid) and just how badly they can wedge a system.  This is one
 of the many (MANY!) reasons why we use rsnapshot/rsync instead.  The
 atime clobbering issue is the only downside.
 

If I'm doing UFS snapshots, I didn't know it.  Yes, everything was copied
manually because it only displays on the console and the keyboard does
not respond after that point.  So I copied first to paper, then had to
decode my lousy handwriting to put it in an email.  Sorry for the scribal
errors.

-- 
.O. | Sterling (Chip) Camden  | http://camdensoftware.com
..O | sterl...@camdensoftware.com | http://chipsquips.com
OOO | 2048R/D6DBAF91  | http://chipstips.com




Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-17 Thread Hiroki Sato
Hiroki Sato h...@freebsd.org wrote
  in 20110818.043332.27079545013461535@allbsd.org:

hr Attilio Rao atti...@freebsd.org wrote
hr   in caj-fndcdow0_b2mv0lzeo-tpea9+7oanj7ihvkqsm4j4b0d...@mail.gmail.com:
hr 
hr at 2011/8/17 Hiroki Sato h...@freebsd.org:
hr at  Hi,
hr at 
hr at  Mike Tancsa m...@sentex.net wrote
hr at   in 4e15a08c.6090...@sentex.net:
hr at 
hr at  mi On 7/7/2011 7:32 AM, Mike Tancsa wrote:
hr at  mi  On 7/7/2011 4:20 AM, Kostik Belousov wrote:
hr at  mi 
hr at  mi  BTW, we had a similar panic, spinlock held too long, the spinlock
hr at  mi  is the sched lock N, on a busy 8-core box recently upgraded to
hr at  mi  stable/8. Unfortunately, the machine hung dumping core, so the
hr at  mi  stack trace for the owner thread was not available.
hr at  mi 
hr at  mi  I was unable to make any conclusion from the data that was present.
hr at  mi  If the situation is reproducible, you could try to revert r221937.
hr at  mi  This is pure speculation, though.
hr at  mi 
hr at  mi  Another crash just now after 5hrs uptime. I will try and revert
hr at  mi  r221937 unless there is any extra debugging you want me to add to
hr at  mi  the kernel instead?
hr at 
hr at   I am also suffering from a reproducible panic on an 8-STABLE box, an
hr at   NFS server with heavy I/O load.  I could not get a kernel dump
hr at   because this panic locked up the machine just after it occurred, but
hr at   according to the stack trace it was the same as the one posted.
hr at   Switching to an 8.2R kernel can prevent this panic.
hr at 
hr at   Any progress on the investigation?
hr at 
hr at Hiroki,
hr at how easily can you reproduce it?
hr 
hr  It takes 5-10 hours.  I installed another kernel for debugging just
hr  now, so I think I will be able to collect more detailed information in
hr  a couple of days.
hr 
hr at It would be important to have a DDB textdump with this information:
hr at - bt
hr at - ps
hr at - show allpcpu
hr at - alltrace
hr at 
hr at Alternatively, a coredump which has the stop-CPU patch, which Andriy can provide.
hr 
hr  Okay, I will post them once I can get another panic.  Thanks!
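
 For what it's worth, this can be automated: ddb(8) supports panic-time
 scripts, so a textdump with exactly that output can be captured with
 nobody at the console.  A sketch of an /etc/ddb.conf entry (loaded by
 the ddb rc script with ddb_enable="YES"; check ddb(8) and textdump(4)
 on your release for the exact names):

```
# Sketch only -- verify option names against ddb(8)/textdump(4).
# On panic: switch to textdump mode, capture the requested output,
# then dump and reboot.
script kdb.enter.panic=textdump set; capture on; bt; ps; show allpcpu; alltrace; capture off; call doadump; reset
```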

 I got the panic with a crash dump this time.  The result of bt, ps,
 allpcpu, and traces can be found at the following URL:

  http://people.allbsd.org/~hrs/FreeBSD/pool-panic_20110818-1.txt

-- Hiroki


Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-17 Thread Attilio Rao
2011/8/18 Hiroki Sato h...@freebsd.org:
 Hiroki Sato h...@freebsd.org wrote
  in 20110818.043332.27079545013461535@allbsd.org:

 hr Attilio Rao atti...@freebsd.org wrote
 hr   in caj-fndcdow0_b2mv0lzeo-tpea9+7oanj7ihvkqsm4j4b0d...@mail.gmail.com:
 hr
 hr at 2011/8/17 Hiroki Sato h...@freebsd.org:
 hr at  Hi,
 hr at 
 hr at  Mike Tancsa m...@sentex.net wrote
 hr at   in 4e15a08c.6090...@sentex.net:
 hr at 
 hr at  mi On 7/7/2011 7:32 AM, Mike Tancsa wrote:
 hr at  mi  On 7/7/2011 4:20 AM, Kostik Belousov wrote:
 hr at  mi 
 hr at  mi  BTW, we had a similar panic, spinlock held too long, the spinlock
 hr at  mi  is the sched lock N, on a busy 8-core box recently upgraded to
 hr at  mi  stable/8. Unfortunately, the machine hung dumping core, so the
 hr at  mi  stack trace for the owner thread was not available.
 hr at  mi 
 hr at  mi  I was unable to make any conclusion from the data that was present.
 hr at  mi  If the situation is reproducible, you could try to revert r221937.
 hr at  mi  This is pure speculation, though.
 hr at  mi 
 hr at  mi  Another crash just now after 5hrs uptime. I will try and revert
 hr at  mi  r221937 unless there is any extra debugging you want me to add to
 hr at  mi  the kernel instead?
 hr at 
 hr at   I am also suffering from a reproducible panic on an 8-STABLE box, an
 hr at   NFS server with heavy I/O load.  I could not get a kernel dump
 hr at   because this panic locked up the machine just after it occurred, but
 hr at   according to the stack trace it was the same as the one posted.
 hr at   Switching to an 8.2R kernel can prevent this panic.
 hr at 
 hr at   Any progress on the investigation?
 hr at
 hr at Hiroki,
 hr at how easily can you reproduce it?
 hr
 hr  It takes 5-10 hours.  I installed another kernel for debugging just
 hr  now, so I think I will be able to collect more detailed information in
 hr  a couple of days.
 hr
 hr at It would be important to have a DDB textdump with this information:
 hr at - bt
 hr at - ps
 hr at - show allpcpu
 hr at - alltrace
 hr at
 hr at Alternatively, a coredump which has the stop-CPU patch, which Andriy can provide.
 hr
 hr  Okay, I will post them once I can get another panic.  Thanks!

  I got the panic with a crash dump this time.  The result of bt, ps,
  allpcpu, and traces can be found at the following URL:

  http://people.allbsd.org/~hrs/FreeBSD/pool-panic_20110818-1.txt

I'm not sure I understand -- is a corefile also available?
If yes, where could I get it (with the relevant sources and kernel.debug)?

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-17 Thread Attilio Rao
2011/8/18 Hiroki Sato h...@freebsd.org:
 Hiroki Sato h...@freebsd.org wrote
  in 20110818.043332.27079545013461535@allbsd.org:

 hr Attilio Rao atti...@freebsd.org wrote
 hr   in caj-fndcdow0_b2mv0lzeo-tpea9+7oanj7ihvkqsm4j4b0d...@mail.gmail.com:
 hr
 hr at 2011/8/17 Hiroki Sato h...@freebsd.org:
 hr at  Hi,
 hr at 
 hr at  Mike Tancsa m...@sentex.net wrote
 hr at   in 4e15a08c.6090...@sentex.net:
 hr at 
 hr at  mi On 7/7/2011 7:32 AM, Mike Tancsa wrote:
 hr at  mi  On 7/7/2011 4:20 AM, Kostik Belousov wrote:
 hr at  mi 
 hr at  mi  BTW, we had a similar panic, spinlock held too long, the spinlock
 hr at  mi  is the sched lock N, on a busy 8-core box recently upgraded to
 hr at  mi  stable/8. Unfortunately, the machine hung dumping core, so the
 hr at  mi  stack trace for the owner thread was not available.
 hr at  mi 
 hr at  mi  I was unable to make any conclusion from the data that was present.
 hr at  mi  If the situation is reproducible, you could try to revert r221937.
 hr at  mi  This is pure speculation, though.
 hr at  mi 
 hr at  mi  Another crash just now after 5hrs uptime. I will try and revert
 hr at  mi  r221937 unless there is any extra debugging you want me to add to
 hr at  mi  the kernel instead?
 hr at 
 hr at   I am also suffering from a reproducible panic on an 8-STABLE box, an
 hr at   NFS server with heavy I/O load.  I could not get a kernel dump
 hr at   because this panic locked up the machine just after it occurred, but
 hr at   according to the stack trace it was the same as the one posted.
 hr at   Switching to an 8.2R kernel can prevent this panic.
 hr at 
 hr at   Any progress on the investigation?
 hr at
 hr at Hiroki,
 hr at how easily can you reproduce it?
 hr
 hr  It takes 5-10 hours.  I installed another kernel for debugging just
 hr  now, so I think I will be able to collect more detailed information in
 hr  a couple of days.
 hr
 hr at It would be important to have a DDB textdump with this information:
 hr at - bt
 hr at - ps
 hr at - show allpcpu
 hr at - alltrace
 hr at
 hr at Alternatively, a coredump which has the stop-CPU patch, which Andriy can provide.
 hr
 hr  Okay, I will post them once I can get another panic.  Thanks!

  I got the panic with a crash dump this time.  The result of bt, ps,
  allpcpu, and traces can be found at the following URL:

  http://people.allbsd.org/~hrs/FreeBSD/pool-panic_20110818-1.txt

Actually, I think I see the bug here.

In callout_cpu_switch(), if a low-priority thread migrating the
callout gets preempted after the outgoing CPU's queue lock is dropped
(and is scheduled much later), we get this problem.

In order to fix this bug it could be enough to use a critical section,
but I think this should really be interrupt safe, thus I'd wrap it
up with spinlock_enter()/spinlock_exit().  Fortunately
callout_cpu_switch() should be called rarely, and we already do
expensive locking operations in callout, thus we should not have
problems performance-wise.
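
Roughly, the shape of the fix (a pseudocode sketch only; the macro and
function names are from memory, the actual change is in the patch
linked below):

```c
/*
 * Pseudocode sketch -- not the actual patch.  The point: keep the
 * migrating thread from being preempted (or interrupted) in the
 * window between dropping the outgoing CPU's callout lock and
 * taking the incoming CPU's one.
 */
static struct callout_cpu *
callout_cpu_switch(struct callout *c, struct callout_cpu *cc, int new_cpu)
{
	struct callout_cpu *new_cc;

	spinlock_enter();        /* interrupts off, no preemption */
	CC_UNLOCK(cc);           /* drop outgoing CPU's queue lock */
	new_cc = CC_CPU(new_cpu);
	CC_LOCK(new_cc);         /* take incoming CPU's queue lock */
	spinlock_exit();         /* window closed */
	return (new_cc);
}
```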

Can the guys I also CC'ed here try the following patch, with all the
initial kernel options that were leading you to the deadlock? (thus
revert any debugging patch/option you added for the moment):
http://www.freebsd.org/~attilio/callout-fixup.diff

Please note that this patch is for STABLE_8; if you can confirm the
good result I'll commit it to -CURRENT and then backmerge as soon as
possible.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-17 Thread Jeremy Chadwick
On Wed, Aug 17, 2011 at 05:01:05PM -0700, Chip Camden wrote:
 Quoth Jeremy Chadwick on Wednesday, 17 August 2011:
   
   I'm also getting similar panics on 8.2-STABLE.  Locks up everything and I
   have to power off.  Once, I happened to be looking at the console when it
   happened and copied down the following:
   
   Sleeping thread (tid 100037, pid 0) owns a non-sleepable lock
   panic: sleeping thread
   cpuid=1
  
  No idea, might be relevant to the thread.
  
   Another time I got:
   
   lock order reversal:
   1st 0xff000593e330 snaplk (snaplk) @ /usr/src/sys/kern/vfr_vnops.c:296
   2nd 0xff0005e5d578 ufs (ufs) @ /usr/src/sys/ufs/ffs/ffs_snapshot.c:1587
   
   I didn't copy down the traceback.
  
  snaplk refers to UFS snapshots.  The above must have been typed in
  manually as well, judging by the typos in the filenames.
  
  Either this is a different problem, or if everyone in this thread is
  doing UFS snapshots (dump -L, mksnap_ffs, etc.) and having this problem
  happen then I recommend people stop using UFS snapshots.  I've ranted
  about their unreliability in the past (years upon years ago -- still
  seems valid) and just how badly they can wedge a system.  This is one
  of the many (MANY!) reasons why we use rsnapshot/rsync instead.  The
  atime clobbering issue is the only downside.
  
 
 If I'm doing UFS snapshots, I didn't know it.

The backtrace indicates that a UFS snapshot is being made -- the
snapshot code names its lock snaplk, which is then referenced in
vfs_vnops.c.

You can see for yourself: grep -r snaplk /usr/src/sys.

So yes, I'm inclined to believe something on your system is doing UFS
snapshot generation.  Whether or not other people are doing it as well
is a different story.

 Yes, everything was copied manually because it only displays on the
 console and the keyboard does not respond after that point.  So I
 copied first to paper, then had to decode my lousy handwriting to put
 it in an email.  Sorry for the scribal errors.

That sounds more or less like what I saw with UFS snapshots: the system
would go catatonic in one way or another.  It wouldn't hard-lock (as
if you had powered it off), it would live-lock (the kernel wedged or
held up/spinning doing something).

I never saw a panic as a result of UFS snapshots, only what I described
here.

TL;DR -- Your system appears to be making UFS snapshots, and that
situation is possibly (likely?) unrelated to the sleeping thread issue
you see that causes a panic.

-- 
| Jeremy Chadwickjdc at parodius.com |
| Parodius Networking   http://www.parodius.com/ |
| UNIX Systems Administrator   Mountain View, CA, US |
| Making life hard for others since 1977.   PGP 4BD6C0CB |



Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-17 Thread Chip Camden
Quoth Attilio Rao on Thursday, 18 August 2011:
 2011/8/18 Hiroki Sato h...@freebsd.org:
  Hiroki Sato h...@freebsd.org wrote
   in 20110818.043332.27079545013461535@allbsd.org:
 
  hr Attilio Rao atti...@freebsd.org wrote
  hr   in caj-fndcdow0_b2mv0lzeo-tpea9+7oanj7ihvkqsm4j4b0d...@mail.gmail.com:
  hr
  hr at 2011/8/17 Hiroki Sato h...@freebsd.org:
  hr at  Hi,
  hr at 
  hr at  Mike Tancsa m...@sentex.net wrote
  hr at   in 4e15a08c.6090...@sentex.net:
  hr at 
  hr at  mi On 7/7/2011 7:32 AM, Mike Tancsa wrote:
  hr at  mi  On 7/7/2011 4:20 AM, Kostik Belousov wrote:
  hr at  mi 
  hr at  mi  BTW, we had a similar panic, spinlock held too long, the spinlock
  hr at  mi  is the sched lock N, on a busy 8-core box recently upgraded to
  hr at  mi  stable/8. Unfortunately, the machine hung dumping core, so the
  hr at  mi  stack trace for the owner thread was not available.
  hr at  mi 
  hr at  mi  I was unable to make any conclusion from the data that was present.
  hr at  mi  If the situation is reproducible, you could try to revert r221937.
  hr at  mi  This is pure speculation, though.
  hr at  mi 
  hr at  mi  Another crash just now after 5hrs uptime. I will try and revert
  hr at  mi  r221937 unless there is any extra debugging you want me to add to
  hr at  mi  the kernel instead?
  hr at 
  hr at   I am also suffering from a reproducible panic on an 8-STABLE box, an
  hr at   NFS server with heavy I/O load.  I could not get a kernel dump
  hr at   because this panic locked up the machine just after it occurred, but
  hr at   according to the stack trace it was the same as the one posted.
  hr at   Switching to an 8.2R kernel can prevent this panic.
  hr at 
  hr at   Any progress on the investigation?
  hr at
  hr at Hiroki,
  hr at how easily can you reproduce it?
  hr
  hr  It takes 5-10 hours.  I installed another kernel for debugging just
  hr  now, so I think I will be able to collect more detailed information in
  hr  a couple of days.
  hr
  hr at It would be important to have a DDB textdump with this information:
  hr at - bt
  hr at - ps
  hr at - show allpcpu
  hr at - alltrace
  hr at
  hr at Alternatively, a coredump which has the stop-CPU patch, which Andriy can provide.
  hr
  hr  Okay, I will post them once I can get another panic.  Thanks!
 
   I got the panic with a crash dump this time.  The result of bt, ps,
   allpcpu, and traces can be found at the following URL:
 
   http://people.allbsd.org/~hrs/FreeBSD/pool-panic_20110818-1.txt
 
 Actually, I think I see the bug here.
 
 In callout_cpu_switch(), if a low-priority thread migrating the
 callout gets preempted after the outgoing CPU's queue lock is dropped
 (and is scheduled much later), we get this problem.
 
 In order to fix this bug it could be enough to use a critical section,
 but I think this should really be interrupt safe, thus I'd wrap it
 up with spinlock_enter()/spinlock_exit().  Fortunately
 callout_cpu_switch() should be called rarely, and we already do
 expensive locking operations in callout, thus we should not have
 problems performance-wise.
 
 Can the guys I also CC'ed here try the following patch, with all the
 initial kernel options that were leading you to the deadlock? (thus
 revert any debugging patch/option you added for the moment):
 http://www.freebsd.org/~attilio/callout-fixup.diff
 
 Please note that this patch is for STABLE_8; if you can confirm the
 good result I'll commit it to -CURRENT and then backmerge as soon as
 possible.
 
 Thanks,
 Attilio
 

Thanks, Attilio.  I've applied the patch and removed the extra debug
options I had added (though keeping debug symbols).  I'll let you know if
I experience any more panics.

Regards,

-- 
.O. | Sterling (Chip) Camden  | http://camdensoftware.com
..O | sterl...@camdensoftware.com | http://chipsquips.com
OOO | 2048R/D6DBAF91  | http://chipstips.com




Re: panic: spin lock held too long (RELENG_8 from today)

2011-07-07 Thread Andriy Gapon
on 07/07/2011 08:55 Mike Tancsa said the following:
 I did a buildworld on this box to bring it up to RELENG_8 for the BIND
 fixes.  Unfortunately, the formerly solid box (April 13th kernel)
 panic'd tonight with
 
 Unread portion of the kernel message buffer:
 spin lock 0xc0b1d200 (sched lock 1) held by 0xc5dac8a0 (tid 100107) too long
 panic: spin lock held too long
 cpuid = 0
 Uptime: 13h30m4s
 Physical memory: 2035 MB
 
 
 Its a somewhat busy box taking in mail as well as backups for a few
 servers over nfs.  At the time, it would have been getting about 250Mb/s
 inbound on its gigabit interface.  Full core.txt file at
 
 http://www.tancsa.com/core-jul8-2011.txt

I thought that this was supposed to contain the output of 'thread apply all
bt' in kgdb.  Anyway, I think that the stack trace for tid 100107 may have
some useful information.

 #0  doadump () at pcpu.h:231
 231 pcpu.h: No such file or directory.
 in pcpu.h
 (kgdb) #0  doadump () at pcpu.h:231
 #1  0xc06fd6d3 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:429
 #2  0xc06fd937 in panic (fmt=Variable fmt is not available.
 ) at /usr/src/sys/kern/kern_shutdown.c:602
 #3  0xc06ed95f in _mtx_lock_spin_failed (m=0x0)
 at /usr/src/sys/kern/kern_mutex.c:490
 #4  0xc06ed9e5 in _mtx_lock_spin (m=0xc0b1d200, tid=3312388992, opts=0,
 file=0x0, line=0) at /usr/src/sys/kern/kern_mutex.c:526
 #5  0xc0720254 in sched_add (td=0xc5dac5c0, flags=0)
 at /usr/src/sys/kern/sched_ule.c:1119
 #6  0xc07203f9 in sched_wakeup (td=0xc5dac5c0)
 at /usr/src/sys/kern/sched_ule.c:1950
 #7  0xc07061f8 in setrunnable (td=0xc5dac5c0)
 at /usr/src/sys/kern/kern_synch.c:499
 #8  0xc07362af in sleepq_resume_thread (sq=0xca0da300, td=0xc5dac5c0,
 pri=Variable pri is not available.
 )
 at /usr/src/sys/kern/subr_sleepqueue.c:751
 #9  0xc0736e18 in sleepq_signal (wchan=0xc5fafe50, flags=1, pri=0, queue=0)
 at /usr/src/sys/kern/subr_sleepqueue.c:825
 #10 0xc06b6764 in cv_signal (cvp=0xc5fafe50)
 at /usr/src/sys/kern/kern_condvar.c:422
 #11 0xc08eaa0d in xprt_assignthread (xprt=Variable xprt is not available.
 ) at /usr/src/sys/rpc/svc.c:342
 #12 0xc08ec502 in xprt_active (xprt=0xc95d9600) at
 /usr/src/sys/rpc/svc.c:378
 #13 0xc08ee051 in svc_vc_soupcall (so=0xc6372ce0, arg=0xc95d9600,
 waitflag=1)
 at /usr/src/sys/rpc/svc_vc.c:747
 #14 0xc075bbb1 in sowakeup (so=0xc6372ce0, sb=0xc6372d34)
 at /usr/src/sys/kern/uipc_sockbuf.c:191
 #15 0xc08447bc in tcp_do_segment (m=0xcaa8d200, th=0xca6aa824,
 so=0xc6372ce0,
 tp=0xc63b4d20, drop_hdrlen=52, tlen=1448, iptos=0 '\0', ti_locked=2)
 at /usr/src/sys/netinet/tcp_input.c:1775
 #16 0xc0847930 in tcp_input (m=0xcaa8d200, off0=20)
 at /usr/src/sys/netinet/tcp_input.c:1329
 #17 0xc07ddaf7 in ip_input (m=0xcaa8d200)
 at /usr/src/sys/netinet/ip_input.c:787
 #18 0xc07b8859 in netisr_dispatch_src (proto=1, source=0, m=0xcaa8d200)
 at /usr/src/sys/net/netisr.c:859
 #19 0xc07b8af0 in netisr_dispatch (proto=1, m=0xcaa8d200)
 at /usr/src/sys/net/netisr.c:946
 #20 0xc07ae5e1 in ether_demux (ifp=0xc56ed800, m=0xcaa8d200)
 at /usr/src/sys/net/if_ethersubr.c:894
 #21 0xc07aeb5f in ether_input (ifp=0xc56ed800, m=0xcaa8d200)
 at /usr/src/sys/net/if_ethersubr.c:753
 #22 0xc09977b2 in nfe_int_task (arg=0xc56ff000, pending=1)
 at /usr/src/sys/dev/nfe/if_nfe.c:2187
 #23 0xc07387ca in taskqueue_run_locked (queue=0xc5702440)
 at /usr/src/sys/kern/subr_taskqueue.c:248
 #24 0xc073895c in taskqueue_thread_loop (arg=0xc56ff130)
 at /usr/src/sys/kern/subr_taskqueue.c:385
 #25 0xc06d1027 in fork_exit (callout=0xc07388a0 taskqueue_thread_loop,
 arg=0xc56ff130, frame=0xc538ed28) at /usr/src/sys/kern/kern_fork.c:861
 #26 0xc09a5c24 in fork_trampoline () at
 /usr/src/sys/i386/i386/exception.s:275
 (kgdb)
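
For reference, that output can be produced from a crash dump
non-interactively (the kernel.debug path and dump number below are the
ones from this thread -- substitute your own):

```
# Example invocation only; needs a matching debug kernel and vmcore.
echo 'thread apply all bt' | \
    kgdb /usr/obj/usr/src/sys/recycle/kernel.debug /var/crash/vmcore.13 \
    > alltrace.txt
```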
 


-- 
Andriy Gapon


Re: panic: spin lock held too long (RELENG_8 from today)

2011-07-07 Thread Kostik Belousov
On Thu, Jul 07, 2011 at 10:36:42AM +0300, Andriy Gapon wrote:
 on 07/07/2011 08:55 Mike Tancsa said the following:
  I did a buildworld on this box to bring it up to RELENG_8 for the BIND
  fixes.  Unfortunately, the formerly solid box (April 13th kernel)
  panic'd tonight with
  
  Unread portion of the kernel message buffer:
  spin lock 0xc0b1d200 (sched lock 1) held by 0xc5dac8a0 (tid 100107) too long
  panic: spin lock held too long
  cpuid = 0
  Uptime: 13h30m4s
  Physical memory: 2035 MB
  
  
  Its a somewhat busy box taking in mail as well as backups for a few
  servers over nfs.  At the time, it would have been getting about 250Mb/s
  inbound on its gigabit interface.  Full core.txt file at
  
  http://www.tancsa.com/core-jul8-2011.txt
 
 I thought that this was supposed to contain output of 'thread apply all bt' in
 kgdb.  Anyway, I think that stacktrace for tid 100107 may have some useful
 information.
 
  #0  doadump () at pcpu.h:231
  231 pcpu.h: No such file or directory.
  in pcpu.h
  (kgdb) #0  doadump () at pcpu.h:231
  #1  0xc06fd6d3 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:429
  #2  0xc06fd937 in panic (fmt=Variable fmt is not available.
  ) at /usr/src/sys/kern/kern_shutdown.c:602
  #3  0xc06ed95f in _mtx_lock_spin_failed (m=0x0)
  at /usr/src/sys/kern/kern_mutex.c:490
  #4  0xc06ed9e5 in _mtx_lock_spin (m=0xc0b1d200, tid=3312388992, opts=0,
  file=0x0, line=0) at /usr/src/sys/kern/kern_mutex.c:526
  #5  0xc0720254 in sched_add (td=0xc5dac5c0, flags=0)
  at /usr/src/sys/kern/sched_ule.c:1119
  #6  0xc07203f9 in sched_wakeup (td=0xc5dac5c0)
  at /usr/src/sys/kern/sched_ule.c:1950
  #7  0xc07061f8 in setrunnable (td=0xc5dac5c0)
  at /usr/src/sys/kern/kern_synch.c:499
  #8  0xc07362af in sleepq_resume_thread (sq=0xca0da300, td=0xc5dac5c0,
  pri=Variable pri is not available.
  )
  at /usr/src/sys/kern/subr_sleepqueue.c:751
  #9  0xc0736e18 in sleepq_signal (wchan=0xc5fafe50, flags=1, pri=0, queue=0)
  at /usr/src/sys/kern/subr_sleepqueue.c:825
  #10 0xc06b6764 in cv_signal (cvp=0xc5fafe50)
  at /usr/src/sys/kern/kern_condvar.c:422
  #11 0xc08eaa0d in xprt_assignthread (xprt=Variable xprt is not available.
  ) at /usr/src/sys/rpc/svc.c:342
  #12 0xc08ec502 in xprt_active (xprt=0xc95d9600) at
  /usr/src/sys/rpc/svc.c:378
  #13 0xc08ee051 in svc_vc_soupcall (so=0xc6372ce0, arg=0xc95d9600,
  waitflag=1)
  at /usr/src/sys/rpc/svc_vc.c:747
  #14 0xc075bbb1 in sowakeup (so=0xc6372ce0, sb=0xc6372d34)
  at /usr/src/sys/kern/uipc_sockbuf.c:191
  #15 0xc08447bc in tcp_do_segment (m=0xcaa8d200, th=0xca6aa824,
  so=0xc6372ce0,
  tp=0xc63b4d20, drop_hdrlen=52, tlen=1448, iptos=0 '\0', ti_locked=2)
  at /usr/src/sys/netinet/tcp_input.c:1775
  #16 0xc0847930 in tcp_input (m=0xcaa8d200, off0=20)
  at /usr/src/sys/netinet/tcp_input.c:1329
  #17 0xc07ddaf7 in ip_input (m=0xcaa8d200)
  at /usr/src/sys/netinet/ip_input.c:787
  #18 0xc07b8859 in netisr_dispatch_src (proto=1, source=0, m=0xcaa8d200)
  at /usr/src/sys/net/netisr.c:859
  #19 0xc07b8af0 in netisr_dispatch (proto=1, m=0xcaa8d200)
  at /usr/src/sys/net/netisr.c:946
  #20 0xc07ae5e1 in ether_demux (ifp=0xc56ed800, m=0xcaa8d200)
  at /usr/src/sys/net/if_ethersubr.c:894
  #21 0xc07aeb5f in ether_input (ifp=0xc56ed800, m=0xcaa8d200)
  at /usr/src/sys/net/if_ethersubr.c:753
  #22 0xc09977b2 in nfe_int_task (arg=0xc56ff000, pending=1)
  at /usr/src/sys/dev/nfe/if_nfe.c:2187
  #23 0xc07387ca in taskqueue_run_locked (queue=0xc5702440)
  at /usr/src/sys/kern/subr_taskqueue.c:248
  #24 0xc073895c in taskqueue_thread_loop (arg=0xc56ff130)
  at /usr/src/sys/kern/subr_taskqueue.c:385
  #25 0xc06d1027 in fork_exit (callout=0xc07388a0 taskqueue_thread_loop,
  arg=0xc56ff130, frame=0xc538ed28) at /usr/src/sys/kern/kern_fork.c:861
  #26 0xc09a5c24 in fork_trampoline () at
  /usr/src/sys/i386/i386/exception.s:275
  (kgdb)
  

BTW, we had a similar panic, spinlock held too long, the spinlock
is the sched lock N, on a busy 8-core box recently upgraded to
stable/8. Unfortunately, the machine hung dumping core, so the stack trace
for the owner thread was not available.

I was unable to make any conclusion from the data that was present.
If the situation is reproducible, you could try to revert r221937. This
is pure speculation, though.




Re: panic: spin lock held too long (RELENG_8 from today)

2011-07-07 Thread Mike Tancsa
On 7/7/2011 4:20 AM, Kostik Belousov wrote:
 
 BTW, we had a similar panic, spinlock held too long, the spinlock
 is the sched lock N, on a busy 8-core box recently upgraded to
 stable/8. Unfortunately, the machine hung dumping core, so the stack trace
 for the owner thread was not available.
 
 I was unable to make any conclusion from the data that was present.
 If the situation is reproducible, you could try to revert r221937. This
 is pure speculation, though.

Another crash just now after 5hrs uptime. I will try and revert r221937
unless there is any extra debugging you want me to add to the kernel
instead?

This is an inbound mail server, so a little disruption is possible.

 kgdb /usr/obj/usr/src/sys/recycle/kernel.debug vmcore.13
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type show copying to see the conditions.
There is absolutely no warranty for GDB.  Type show warranty for details.
This GDB was configured as i386-marcel-freebsd...

Unread portion of the kernel message buffer:
spin lock 0xc0b1d200 (sched lock 1) held by 0xc5dac2e0 (tid 100109) too long
panic: spin lock held too long
cpuid = 0
Uptime: 5h37m43s
Physical memory: 2035 MB
Dumping 260 MB: 245 229 213 197 181 165 149 133 117 101 85 69 53 37 21 5

Reading symbols from /boot/kernel/amdsbwd.ko...Reading symbols from
/boot/kernel/amdsbwd.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/amdsbwd.ko
#0  doadump () at pcpu.h:231
231 pcpu.h: No such file or directory.
in pcpu.h
(kgdb) bt
#0  doadump () at pcpu.h:231
#1  0xc06fd6d3 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:429
#2  0xc06fd937 in panic (fmt=Variable fmt is not available.
) at /usr/src/sys/kern/kern_shutdown.c:602
#3  0xc06ed95f in _mtx_lock_spin_failed (m=0x0) at
/usr/src/sys/kern/kern_mutex.c:490
#4  0xc06ed9e5 in _mtx_lock_spin (m=0xc0b1d200, tid=3312388992, opts=0,
file=0x0, line=0)
at /usr/src/sys/kern/kern_mutex.c:526
#5  0xc0720254 in sched_add (td=0xc61892e0, flags=0) at
/usr/src/sys/kern/sched_ule.c:1119
#6  0xc07203f9 in sched_wakeup (td=0xc61892e0) at
/usr/src/sys/kern/sched_ule.c:1950
#7  0xc07061f8 in setrunnable (td=0xc61892e0) at
/usr/src/sys/kern/kern_synch.c:499
#8  0xc07362af in sleepq_resume_thread (sq=0xc55311c0, td=0xc61892e0,
pri=Variable pri is not available.
)
at /usr/src/sys/kern/subr_sleepqueue.c:751
#9  0xc0736e18 in sleepq_signal (wchan=0xc60386d0, flags=1, pri=0, queue=0)
at /usr/src/sys/kern/subr_sleepqueue.c:825
#10 0xc06b6764 in cv_signal (cvp=0xc60386d0) at
/usr/src/sys/kern/kern_condvar.c:422
#11 0xc08eaa0d in xprt_assignthread (xprt=Variable xprt is not available.
) at /usr/src/sys/rpc/svc.c:342
#12 0xc08ec502 in xprt_active (xprt=0xc5db8a00) at
/usr/src/sys/rpc/svc.c:378
#13 0xc08ee051 in svc_vc_soupcall (so=0xc618a19c, arg=0xc5db8a00,
waitflag=1) at /usr/src/sys/rpc/svc_vc.c:747
#14 0xc075bbb1 in sowakeup (so=0xc618a19c, sb=0xc618a1f0) at
/usr/src/sys/kern/uipc_sockbuf.c:191
#15 0xc08447bc in tcp_do_segment (m=0xc6567a00, th=0xc6785824,
so=0xc618a19c, tp=0xc617e000, drop_hdrlen=52,
tlen=1448, iptos=0 '\0', ti_locked=2) at
/usr/src/sys/netinet/tcp_input.c:1775
#16 0xc0847930 in tcp_input (m=0xc6567a00, off0=20) at
/usr/src/sys/netinet/tcp_input.c:1329
#17 0xc07ddaf7 in ip_input (m=0xc6567a00) at
/usr/src/sys/netinet/ip_input.c:787
#18 0xc07b8859 in netisr_dispatch_src (proto=1, source=0, m=0xc6567a00)
at /usr/src/sys/net/netisr.c:859
#19 0xc07b8af0 in netisr_dispatch (proto=1, m=0xc6567a00) at
/usr/src/sys/net/netisr.c:946
#20 0xc07ae5e1 in ether_demux (ifp=0xc56ed800, m=0xc6567a00) at
/usr/src/sys/net/if_ethersubr.c:894
#21 0xc07aeb5f in ether_input (ifp=0xc56ed800, m=0xc6567a00) at
/usr/src/sys/net/if_ethersubr.c:753
#22 0xc09977b2 in nfe_int_task (arg=0xc56ff000, pending=1) at
/usr/src/sys/dev/nfe/if_nfe.c:2187
#23 0xc07387ca in taskqueue_run_locked (queue=0xc5702440) at
/usr/src/sys/kern/subr_taskqueue.c:248
#24 0xc073895c in taskqueue_thread_loop (arg=0xc56ff130) at
/usr/src/sys/kern/subr_taskqueue.c:385
#25 0xc06d1027 in fork_exit (callout=0xc07388a0 taskqueue_thread_loop,
arg=0xc56ff130, frame=0xc538ed28)
at /usr/src/sys/kern/kern_fork.c:861
#26 0xc09a5c24 in fork_trampoline () at
/usr/src/sys/i386/i386/exception.s:275


-- 
---
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, m...@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada   http://www.tancsa.com/


Re: panic: spin lock held too long (RELENG_8 from today)

2011-07-07 Thread Jeremy Chadwick
On Thu, Jul 07, 2011 at 07:32:41AM -0400, Mike Tancsa wrote:
 On 7/7/2011 4:20 AM, Kostik Belousov wrote:
  
  BTW, we had a similar panic, spinlock held too long, the spinlock
  is the sched lock N, on a busy 8-core box recently upgraded to
  stable/8. Unfortunately, the machine hung dumping core, so the stack trace
  for the owner thread was not available.
  
  I was unable to make any conclusion from the data that was present.
  If the situation is reproducible, you could try to revert r221937. This
  is pure speculation, though.
 
 Another crash just now after 5hrs uptime. I will try and revert r221937
 unless there is any extra debugging you want me to add to the kernel
 instead?
 
 This is an inbound mail server, so a little disruption is possible.
 
  kgdb /usr/obj/usr/src/sys/recycle/kernel.debug vmcore.13
 GNU gdb 6.1.1 [FreeBSD]
 Copyright 2004 Free Software Foundation, Inc.
 GDB is free software, covered by the GNU General Public License, and you are
 welcome to change it and/or distribute copies of it under certain
 conditions.
 Type show copying to see the conditions.
 There is absolutely no warranty for GDB.  Type show warranty for details.
 This GDB was configured as i386-marcel-freebsd...
 
 Unread portion of the kernel message buffer:
 spin lock 0xc0b1d200 (sched lock 1) held by 0xc5dac2e0 (tid 100109) too long
 panic: spin lock held too long
 cpuid = 0
 Uptime: 5h37m43s
 Physical memory: 2035 MB
 Dumping 260 MB: 245 229 213 197 181 165 149 133 117 101 85 69 53 37 21 5
 
 Reading symbols from /boot/kernel/amdsbwd.ko...Reading symbols from
 /boot/kernel/amdsbwd.ko.symbols...done.
 done.
 Loaded symbols for /boot/kernel/amdsbwd.ko
 #0  doadump () at pcpu.h:231
 231 pcpu.h: No such file or directory.
 in pcpu.h
 (kgdb) bt
 #0  doadump () at pcpu.h:231
 #1  0xc06fd6d3 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:429
 #2  0xc06fd937 in panic (fmt=Variable fmt is not available.
 ) at /usr/src/sys/kern/kern_shutdown.c:602
 #3  0xc06ed95f in _mtx_lock_spin_failed (m=0x0) at
 /usr/src/sys/kern/kern_mutex.c:490
 #4  0xc06ed9e5 in _mtx_lock_spin (m=0xc0b1d200, tid=3312388992, opts=0,
 file=0x0, line=0)
 at /usr/src/sys/kern/kern_mutex.c:526
 #5  0xc0720254 in sched_add (td=0xc61892e0, flags=0) at
 /usr/src/sys/kern/sched_ule.c:1119
 #6  0xc07203f9 in sched_wakeup (td=0xc61892e0) at
 /usr/src/sys/kern/sched_ule.c:1950
 #7  0xc07061f8 in setrunnable (td=0xc61892e0) at
 /usr/src/sys/kern/kern_synch.c:499
 #8  0xc07362af in sleepq_resume_thread (sq=0xc55311c0, td=0xc61892e0,
 pri=Variable pri is not available.
 )
 at /usr/src/sys/kern/subr_sleepqueue.c:751
 #9  0xc0736e18 in sleepq_signal (wchan=0xc60386d0, flags=1, pri=0, queue=0)
 at /usr/src/sys/kern/subr_sleepqueue.c:825
 #10 0xc06b6764 in cv_signal (cvp=0xc60386d0) at
 /usr/src/sys/kern/kern_condvar.c:422
 #11 0xc08eaa0d in xprt_assignthread (xprt=Variable xprt is not available.
 ) at /usr/src/sys/rpc/svc.c:342
 #12 0xc08ec502 in xprt_active (xprt=0xc5db8a00) at
 /usr/src/sys/rpc/svc.c:378
 #13 0xc08ee051 in svc_vc_soupcall (so=0xc618a19c, arg=0xc5db8a00,
 waitflag=1) at /usr/src/sys/rpc/svc_vc.c:747
 #14 0xc075bbb1 in sowakeup (so=0xc618a19c, sb=0xc618a1f0) at
 /usr/src/sys/kern/uipc_sockbuf.c:191
 #15 0xc08447bc in tcp_do_segment (m=0xc6567a00, th=0xc6785824,
 so=0xc618a19c, tp=0xc617e000, drop_hdrlen=52,
 tlen=1448, iptos=0 '\0', ti_locked=2) at
 /usr/src/sys/netinet/tcp_input.c:1775
 #16 0xc0847930 in tcp_input (m=0xc6567a00, off0=20) at
 /usr/src/sys/netinet/tcp_input.c:1329
 #17 0xc07ddaf7 in ip_input (m=0xc6567a00) at
 /usr/src/sys/netinet/ip_input.c:787
 #18 0xc07b8859 in netisr_dispatch_src (proto=1, source=0, m=0xc6567a00)
 at /usr/src/sys/net/netisr.c:859
 #19 0xc07b8af0 in netisr_dispatch (proto=1, m=0xc6567a00) at
 /usr/src/sys/net/netisr.c:946
 #20 0xc07ae5e1 in ether_demux (ifp=0xc56ed800, m=0xc6567a00) at
 /usr/src/sys/net/if_ethersubr.c:894
 #21 0xc07aeb5f in ether_input (ifp=0xc56ed800, m=0xc6567a00) at
 /usr/src/sys/net/if_ethersubr.c:753
 #22 0xc09977b2 in nfe_int_task (arg=0xc56ff000, pending=1) at
 /usr/src/sys/dev/nfe/if_nfe.c:2187
 #23 0xc07387ca in taskqueue_run_locked (queue=0xc5702440) at
 /usr/src/sys/kern/subr_taskqueue.c:248
 #24 0xc073895c in taskqueue_thread_loop (arg=0xc56ff130) at
 /usr/src/sys/kern/subr_taskqueue.c:385
 #25 0xc06d1027 in fork_exit (callout=0xc07388a0 <taskqueue_thread_loop>,
 arg=0xc56ff130, frame=0xc538ed28)
 at /usr/src/sys/kern/kern_fork.c:861
 #26 0xc09a5c24 in fork_trampoline () at
 /usr/src/sys/i386/i386/exception.s:275

1. info threads
2. Find the index value that matches the tid in question (in the above
   spin lock panic, that'd be tid 100109).  The index value will be
   the first number shown on the left.
3. thread {index}
4. bt

If this doesn't work, you can alternatively try (from the beginning)
"thread apply all bt" and provide the output from that.  (It will be
quite lengthy, and at this point I think tid 100109 is the one of
interest in this crash, based on what Andriy said earlier.)
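
As an aside, the tid named in the panic line can be pulled out
mechanically before starting kgdb; a minimal sh sketch (the message text
below is just the panic line quoted earlier in this thread, and the sed
pattern is illustrative, not part of any FreeBSD tool):

```shell
# Extract the tid of the lock-holding thread from the panic message,
# so it can be fed to kgdb's "info threads"/"thread" lookup.
msg='spin lock 0xc0b1d200 (sched lock 1) held by 0xc5dac2e0 (tid 100109) too long'
tid=$(printf '%s\n' "$msg" | sed -n 's/.*(tid \([0-9]*\)).*/\1/p')
echo "tid=$tid"
```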

Re: panic: spin lock held too long (RELENG_8 from today)

2011-07-07 Thread Andriy Gapon
on 07/07/2011 14:41 Jeremy Chadwick said the following:
 1. info threads
 2. Find the index value that matches the tid in question (in the above
spin lock panic, that'd be tid 100109).  The index value will be
the first number shown on the left
 3. thread {index}

Just in case, in kgdb there is a command 'tid' that does all of the above
steps in one go.

 4. bt
 
 If this doesn't work, alternatively you can try (from the beginning)
 thread apply all bt and provide the output from that.  (It will be
 quite lengthy, and at this point I think tid 100109 is the one of
 interest in this crash, based on what Andriy said earlier)


-- 
Andriy Gapon


Re: panic: spin lock held too long (RELENG_8 from today)

2011-07-07 Thread Mike Tancsa
On 7/7/2011 7:32 AM, Mike Tancsa wrote:
 On 7/7/2011 4:20 AM, Kostik Belousov wrote:

 BTW, we had a similar panic, spinlock held too long, the spinlock
 is the sched lock N, on busy 8-core box recently upgraded to the
 stable/8. Unfortunately, machine hung dumping core, so the stack trace
 for the owner thread was not available.

 I was unable to make any conclusion from the data that was present.
  If the situation is reproducible, you could try to revert r221937. This
 is pure speculation, though.
 
 Another crash just now after 5hrs uptime. I will try and revert r221937
 unless there is any extra debugging you want me to add to the kernel
 instead?
 
 This is an inbound mail server, so a little disruption is possible.
 
  kgdb /usr/obj/usr/src/sys/recycle/kernel.debug vmcore.13
 GNU gdb 6.1.1 [FreeBSD]
 
 Unread portion of the kernel message buffer:
 spin lock 0xc0b1d200 (sched lock 1) held by 0xc5dac2e0 (tid 100109) too long
 panic: spin lock held too long
 cpuid = 0
 Uptime: 5h37m43s
 Physical memory: 2035 MB
 Dumping 260 MB: 245 229 213 197 181 165 149 133 117 101 85 69 53 37 21 5

And the second crash from today

 kgdb /usr/obj/usr/src/sys/recycle/kernel.debug vmcore.13
GNU gdb 6.1.1 [FreeBSD]

Unread portion of the kernel message buffer:
spin lock 0xc0b1d200 (sched lock 1) held by 0xc5dac2e0 (tid 100109) too long
panic: spin lock held too long
cpuid = 0
Uptime: 5h37m43s
Physical memory: 2035 MB
Dumping 260 MB: 245 229 213 197 181 165 149 133 117 101 85 69 53 37 21 5

Reading symbols from /boot/kernel/amdsbwd.ko...Reading symbols from
/boot/kernel/amdsbwd.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/amdsbwd.ko
#0  doadump () at pcpu.h:231
231 pcpu.h: No such file or directory.
in pcpu.h
(kgdb) tid 100109
[Switching to thread 82 (Thread 100109)]#0  sched_switch (td=0xc5dac2e0,
newtd=0xc553c5c0, flags=260)
at /usr/src/sys/kern/sched_ule.c:1866
1866            cpuid = PCPU_GET(cpuid);
(kgdb) list
1861            /*
1862             * We may return from cpu_switch on a different cpu.  However,
1863             * we always return with td_lock pointing to the current cpu's
1864             * run queue lock.
1865             */
1866            cpuid = PCPU_GET(cpuid);
1867            tdq = TDQ_CPU(cpuid);
1868            lock_profile_obtain_lock_success(
1869                TDQ_LOCKPTR(tdq)->lock_object, 0, 0, __FILE__, __LINE__);
1870    #ifdef  HWPMC_HOOKS
(kgdb) p *td
$1 = {td_lock = 0xc0b1d200, td_proc = 0xc5db4000, td_plist = {tqe_next =
0xc5dac5c0, tqe_prev = 0xc5dac008},
  td_runq = {tqe_next = 0x0, tqe_prev = 0xc0b1d334}, td_slpq = {tqe_next
= 0x0, tqe_prev = 0xc65d3b00},
  td_lockq = {tqe_next = 0x0, tqe_prev = 0xc51f6b38}, td_cpuset =
0xc5533e38, td_sel = 0x0,
  td_sleepqueue = 0xc65d3b00, td_turnstile = 0xc63ceb80, td_umtxq =
0xc5d229c0, td_tid = 100109, td_sigqueue = {
sq_signals = {__bits = {0, 0, 0, 0}}, sq_kill = {__bits = {0, 0, 0,
0}}, sq_list = {tqh_first = 0x0,
  tqh_last = 0xc5dac340}, sq_proc = 0xc5db4000, sq_flags = 1},
td_flags = 4, td_inhibitors = 0,
  td_pflags = 2097152, td_dupfd = 0, td_sqqueue = 0, td_wchan = 0x0,
td_wmesg = 0x0, td_lastcpu = 0 '\0',
  td_oncpu = 1 '\001', td_owepreempt = 0 '\0', td_tsqueue = 0 '\0',
td_locks = -291, td_rw_rlocks = 0,
  td_lk_slocks = 0, td_blocked = 0x0, td_lockname = 0x0, td_contested =
{lh_first = 0x0}, td_sleeplocks = 0x0,
  td_intr_nesting_level = 0, td_pinned = 0, td_ucred = 0xc5538100,
td_estcpu = 0, td_slptick = 0,
  td_blktick = 0, td_ru = {ru_utime = {tv_sec = 0, tv_usec = 0},
ru_stime = {tv_sec = 0, tv_usec = 0},
ru_maxrss = 1048, ru_ixrss = 85216, ru_idrss = 3834720, ru_isrss =
681728, ru_minflt = 0, ru_majflt = 0,
ru_nswap = 0, ru_inblock = 82, ru_oublock = 271222, ru_msgsnd =
135625, ru_msgrcv = 2427350,
ru_nsignals = 0, ru_nvcsw = 2076938, ru_nivcsw = 731134},
td_incruntime = 852332612,
  td_runtime = 88202475877, td_pticks = 5326, td_sticks = 48, td_iticks
= 0, td_uticks = 0, td_intrval = 0,
  td_oldsigmask = {__bits = {0, 0, 0, 0}}, td_sigmask = {__bits = {0, 0,
0, 0}}, td_generation = 2808072,
  td_sigstk = {ss_sp = 0x0, ss_size = 0, ss_flags = 0}, td_xsig = 0,
td_profil_addr 

panic: spin lock held too long (RELENG_8 from today)

2011-07-06 Thread Mike Tancsa
I did a buildworld on this box to bring it up to RELENG_8 for the BIND
fixes.  Unfortunately, the formerly solid box (April 13th kernel)
panic'd tonight with

Unread portion of the kernel message buffer:
spin lock 0xc0b1d200 (sched lock 1) held by 0xc5dac8a0 (tid 100107) too long
panic: spin lock held too long
cpuid = 0
Uptime: 13h30m4s
Physical memory: 2035 MB


It's a somewhat busy box taking in mail as well as backups for a few
servers over nfs.  At the time, it would have been getting about 250Mb/s
inbound on its gigabit interface.  Full core.txt file at

http://www.tancsa.com/core-jul8-2011.txt




#0  doadump () at pcpu.h:231
231 pcpu.h: No such file or directory.
in pcpu.h
(kgdb) #0  doadump () at pcpu.h:231
#1  0xc06fd6d3 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:429
#2  0xc06fd937 in panic (fmt=Variable fmt is not available.
) at /usr/src/sys/kern/kern_shutdown.c:602
#3  0xc06ed95f in _mtx_lock_spin_failed (m=0x0)
at /usr/src/sys/kern/kern_mutex.c:490
#4  0xc06ed9e5 in _mtx_lock_spin (m=0xc0b1d200, tid=3312388992, opts=0,
file=0x0, line=0) at /usr/src/sys/kern/kern_mutex.c:526
#5  0xc0720254 in sched_add (td=0xc5dac5c0, flags=0)
at /usr/src/sys/kern/sched_ule.c:1119
#6  0xc07203f9 in sched_wakeup (td=0xc5dac5c0)
at /usr/src/sys/kern/sched_ule.c:1950
#7  0xc07061f8 in setrunnable (td=0xc5dac5c0)
at /usr/src/sys/kern/kern_synch.c:499
#8  0xc07362af in sleepq_resume_thread (sq=0xca0da300, td=0xc5dac5c0,
pri=Variable pri is not available.
)
at /usr/src/sys/kern/subr_sleepqueue.c:751
#9  0xc0736e18 in sleepq_signal (wchan=0xc5fafe50, flags=1, pri=0, queue=0)
at /usr/src/sys/kern/subr_sleepqueue.c:825
#10 0xc06b6764 in cv_signal (cvp=0xc5fafe50)
at /usr/src/sys/kern/kern_condvar.c:422
#11 0xc08eaa0d in xprt_assignthread (xprt=Variable xprt is not available.
) at /usr/src/sys/rpc/svc.c:342
#12 0xc08ec502 in xprt_active (xprt=0xc95d9600) at
/usr/src/sys/rpc/svc.c:378
#13 0xc08ee051 in svc_vc_soupcall (so=0xc6372ce0, arg=0xc95d9600,
waitflag=1)
at /usr/src/sys/rpc/svc_vc.c:747
#14 0xc075bbb1 in sowakeup (so=0xc6372ce0, sb=0xc6372d34)
at /usr/src/sys/kern/uipc_sockbuf.c:191
#15 0xc08447bc in tcp_do_segment (m=0xcaa8d200, th=0xca6aa824,
so=0xc6372ce0,
tp=0xc63b4d20, drop_hdrlen=52, tlen=1448, iptos=0 '\0', ti_locked=2)
at /usr/src/sys/netinet/tcp_input.c:1775
#16 0xc0847930 in tcp_input (m=0xcaa8d200, off0=20)
at /usr/src/sys/netinet/tcp_input.c:1329
#17 0xc07ddaf7 in ip_input (m=0xcaa8d200)
at /usr/src/sys/netinet/ip_input.c:787
#18 0xc07b8859 in netisr_dispatch_src (proto=1, source=0, m=0xcaa8d200)
at /usr/src/sys/net/netisr.c:859
#19 0xc07b8af0 in netisr_dispatch (proto=1, m=0xcaa8d200)
at /usr/src/sys/net/netisr.c:946
#20 0xc07ae5e1 in ether_demux (ifp=0xc56ed800, m=0xcaa8d200)
at /usr/src/sys/net/if_ethersubr.c:894
#21 0xc07aeb5f in ether_input (ifp=0xc56ed800, m=0xcaa8d200)
at /usr/src/sys/net/if_ethersubr.c:753
#22 0xc09977b2 in nfe_int_task (arg=0xc56ff000, pending=1)
at /usr/src/sys/dev/nfe/if_nfe.c:2187
#23 0xc07387ca in taskqueue_run_locked (queue=0xc5702440)
at /usr/src/sys/kern/subr_taskqueue.c:248
#24 0xc073895c in taskqueue_thread_loop (arg=0xc56ff130)
at /usr/src/sys/kern/subr_taskqueue.c:385
#25 0xc06d1027 in fork_exit (callout=0xc07388a0 <taskqueue_thread_loop>,
arg=0xc56ff130, frame=0xc538ed28) at /usr/src/sys/kern/kern_fork.c:861
#26 0xc09a5c24 in fork_trampoline () at
/usr/src/sys/i386/i386/exception.s:275
(kgdb)
