Re: locks under printf(9) and WITNESS = panic?

2013-07-11 Thread Attilio Rao
On Thu, Jul 11, 2013 at 1:21 PM, John Baldwin j...@freebsd.org wrote:
 On Saturday, June 29, 2013 9:19:24 pm Steven Hartland wrote:
 when booting stable/9 under a debug kernel with WITNESS
 enabled and verbose I get the following panic..

 It seems very much like the discussion from a year back on
 current: http://lists.freebsd.org/pipermail/freebsd-current/2012-January/031375.html

 Any ideas?

 Yeah, that lock needs to be MTX_RECURSE (the cnputs_mtx).  However, it
 only recurses under witness.  *sigh*

I have a patch that makes mtx_lock_flags() accept MTX_RECURSE. I will
commit it as soon as all the consumer code has been reviewed, which
should be any day now.
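
For readers unfamiliar with the recursion issue, here is a minimal
single-threaded sketch of the idea. It is a toy model, not the kernel's
mtx(9) implementation: toy_mtx, toy_mtx_lock(), toy_mtx_unlock() and
toy_nested_print() are hypothetical names, and the recursable flag only
stands in for the MTX_RECURSE property. A print path that re-enters
itself fails on a non-recursive lock and succeeds on a recursive one.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy, single-threaded model of a mutex; NOT the kernel's struct mtx. */
struct toy_mtx {
	int	owner;		/* 0 when unowned, otherwise a thread id */
	int	recurse;	/* extra acquisitions by the current owner */
	bool	recursable;	/* stands in for the MTX_RECURSE property */
};

/* Returns true on success, false where a real kernel would panic or block. */
bool
toy_mtx_lock(struct toy_mtx *m, int tid)
{
	assert(tid != 0);
	if (m->owner == 0) {
		m->owner = tid;
		return (true);
	}
	if (m->owner == tid && m->recursable) {
		m->recurse++;
		return (true);
	}
	return (false);
}

void
toy_mtx_unlock(struct toy_mtx *m, int tid)
{
	assert(m->owner == tid);
	if (m->recurse > 0)
		m->recurse--;
	else
		m->owner = 0;
}

/*
 * Models a print path that re-enters itself once while already holding
 * the lock. Returns whether the nested acquisition succeeded.
 */
bool
toy_nested_print(struct toy_mtx *m, int tid)
{
	bool inner;

	if (!toy_mtx_lock(m, tid))
		return (false);
	inner = toy_mtx_lock(m, tid);
	if (inner)
		toy_mtx_unlock(m, tid);
	toy_mtx_unlock(m, tid);
	return (inner);
}
```

With recursable set to false the nested acquisition is refused, which
is the situation the panic above corresponds to in this model; with it
set to true the inner lock/unlock pair just adjusts the recursion count.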

Attilio


--
Peace can only be achieved by understanding - A. Einstein
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: 9.1 coredump

2013-01-23 Thread Attilio Rao
On Wed, Jan 23, 2013 at 1:32 PM, Alexander Nikiforenko
a...@rambler-co.ru wrote:
Hi, I was running ssh-keygen with output to a 32 GB USB 3.0 flash drive, and got this core

 Sorry, I forgot to mention:
 I mounted that flash drive via fusefs-exfat-0.9.8

Is this on stable/9?
If so, I will send you patches for the new fuse approach in a while.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: lock violation in unionfs (9.0-STABLE r230270)

2012-11-02 Thread Attilio Rao
On Wed, Oct 31, 2012 at 11:11 AM, Harald Schmalzbauer
h.schmalzba...@omnilan.de wrote:
  Attilio Rao wrote on 29.10.2012 23:02 (localtime):
 On Mon, Oct 29, 2012 at 7:37 PM, Harald Schmalzbauer
 h.schmalzba...@omnilan.de wrote:
  Attilio Rao wrote on 27.10.2012 23:07 (localtime):
 On Sat, Oct 27, 2012 at 9:46 PM, Attilio Rao atti...@freebsd.org wrote:
 On Sat, Sep 8, 2012 at 12:48 AM, Attilio Rao atti...@freebsd.org wrote:
 On Thu, Sep 6, 2012 at 4:52 PM, Harald Schmalzbauer
 h.schmalzba...@omnilan.de wrote:
  Attilio Rao wrote on 09.08.2012 20:26 (localtime):
 On 8/8/12, Harald Schmalzbauer h.schmalzba...@omnilan.de wrote:
  Pavel Polyakov wrote on 06.03.2012 11:20 (localtime):
 mount -t unionfs -o noatime /usr /mnt

 insmntque: mp-safe fs and non-locked vp: 0xfe01d96704f0 is not
 exclusive locked but should be
 KDB: enter: lock violation
 Pavel,
 can you give a spin to this patch?:
 http://www.freebsd.org/~attilio/unionfs_missing_insmntque_lock.patch

 I think that the unlocking is due at that point as the vnode lock 
 can
 be switch later on.

 Let me know what you think about it and what the test does.
 Thanks!
 This patch fixes the problem with lock violation. Sorry I've tested 
 it so
 late.
 Hello,

 this patch still applies cleanly to RELENG_9_1. Was there another fix
 for the issue or has it just not been PR-sent and thus forgotten?
 Can you and Pavel try the attached patch? Unfortunately I had no time
 to test it, I just made in 5 free mins from a non-FreeBSD workstation,
 Sorry, couldn't test earlier, but now I did:
 With this patch applied the machine hangs without debug kernel and the
 latter gives the following panic:
 System call nmount returning with the following locks held:
 exclusive lockmgr ufs (ufs) r = 0 (0xc5438278) locked @
 src/sys/fs/unionfs/union_vnops.c:1938
 panic: witness_warn
 cpuid = 0
 KDB: stack backtrace:
 db_trace_self_wrapper(c0a04f7f,c0c112c4,d1de3bb4,c097aa8c,fc,...) at
 db_trace_self_wrapper+0x26
 kdb_backtrace(c0a4965f,0,c09c2ede3c1c,0,...) at kdb_backtrace+0x2a
 witness_warn(2,0,c0a4ac34,c0a0990a,286,...) at witness_warn+0x1e4
 syscall(d1de3d08) at syscall+0x415
 Xint0x80_syscall() at Xint0x80_syscall+0x21
 --- syscall (0, FreeBSD ELF32, nosys), eip = 0x280b883f,esp =
 0xbfbfe46c, ebp = 0xbfbfede8 ---
 KDB: enter: panic
 [ thread pid 86 tid 100054 ]
 Stopped at kdb_enter+0x3a: movl $0,kdb_why
 db> bt
 Tracing pid 86 tid 100054 td 0xc541b000
 kdb_enter(c0a00d16,c0a09130,0,0,0,...) at panic+0x190
 witness_warn(2,0,c0a4ac34,c0a0990a,286,...) at witness_warn+0x1e4
 syscall(d1de3d08) at syscall+0x415
 Xint0x80_syscall() at Xint0x80_syscall+0x21

 Hmm, I guess I forgot to install kernel debug symbols...
 Coming back if I have more
 Unfortunately unionfs does very wrong things with the insmntque() 
 locking.
 It basically expects the vnode to return locked in the same way
 requested by the precedent namei() (when that happens) but when you do
 insmntque() you can only have an LK_EXCLUSIVE lock on the vnode.
 Hello,
 the following patch should workout the issues around unionfs_nodeget() a 
 bit:
 http://www.freebsd.org/~attilio/unionfs_nodeget2.patch

 Unfortunately unionfs code is rather messy in the lookup path about
 locking requirements so follow what it needs to be done there is a bit
 difficult.
 I have no way to test this patch, so it is just test-compiled at the
 moment, but I would need that you also test lookup path (so directory
 ls, find(1) on the whole unionfs volume, etc.) to validate it
 someway.
 On a second thought, I think that locking in lookup (and also other
 operations) is so fragile and difficult to follow that it makes all
 vnops real locking landmines.
 I think that the following patch fixes the insmntque insertion and
 follows the old approach well enough to be committed separately:
 http://www.freebsd.org/~attilio/unionfs_nodeget3.patch

 Unfortunately I have no idea about all those locking strategies and
 implementations.
 Applying unionfs_nodeget3.patch results in:
 sys/fs/unionfs/union_subr.c: In function 'unionfs_nodeget':
 sys/fs/unionfs/union_subr.c:332: error: expected statement
 before ')' token
 *** [union_subr.o] Error code 1

 I guess there is a typo in this chunk:
 @@ -317,11 +328,11 @@ unionfs_nodeget(struct mount *mp, struct vnode *up

 vref(vp);
 } else
 *vpp = vp;
 -
 -unionfs_nodeget_out:
 -   if (lkflags & LK_TYPE_MASK)
 -   vn_lock(vp, lkflags | LK_RETRY);
 -
 +   if (lkflags & LK_TYPE_MASK) {
 +   if (lkflags == LK_SHARED))
  ^
 +   vn_lock(vp, LK_DOWNGRADE | LK_RETRY);
 +   } else
 +   VOP_UNLOCK(vp, LK_RELEASE);
 return (0);
  }

 After removing the second right parenthesis kernel compiles.
 But it still crashes:
 panic: Lock (lockmgr) ufs not locked @ sys/kern/vfs_default.c:512
 cpuid = 1
 KDB: stack backtrace:
 ...
 If you can use

Re: lock violation in unionfs (9.0-STABLE r230270)

2012-10-27 Thread Attilio Rao
On Sat, Sep 8, 2012 at 12:48 AM, Attilio Rao atti...@freebsd.org wrote:
 On Thu, Sep 6, 2012 at 4:52 PM, Harald Schmalzbauer
 h.schmalzba...@omnilan.de wrote:
  Attilio Rao wrote on 09.08.2012 20:26 (localtime):
 On 8/8/12, Harald Schmalzbauer h.schmalzba...@omnilan.de wrote:
  Pavel Polyakov wrote on 06.03.2012 11:20 (localtime):
 mount -t unionfs -o noatime /usr /mnt

 insmntque: mp-safe fs and non-locked vp: 0xfe01d96704f0 is not
 exclusive locked but should be
 KDB: enter: lock violation
 Pavel,
 can you give a spin to this patch?:
 http://www.freebsd.org/~attilio/unionfs_missing_insmntque_lock.patch

 I think that the unlocking is due at that point as the vnode lock can
 be switch later on.

 Let me know what you think about it and what the test does.
 Thanks!
 This patch fixes the problem with lock violation. Sorry I've tested it so
 late.
 Hello,

 this patch still applies cleanly to RELENG_9_1. Was there another fix
 for the issue or has it just not been PR-sent and thus forgotten?
 Can you and Pavel try the attached patch? Unfortunately I had no time
 to test it, I just made in 5 free mins from a non-FreeBSD workstation,

 Sorry, couldn't test earlier, but now I did:
 With this patch applied the machine hangs without debug kernel and the
 latter gives the following panic:
 System call nmount returning with the following locks held:
 exclusive lockmgr ufs (ufs) r = 0 (0xc5438278) locked @
 src/sys/fs/unionfs/union_vnops.c:1938
 panic: witness_warn
 cpuid = 0
 KDB: stack backtrace:
 db_trace_self_wrapper(c0a04f7f,c0c112c4,d1de3bb4,c097aa8c,fc,...) at
 db_trace_self_wrapper+0x26
 kdb_backtrace(c0a4965f,0,c09c2ede3c1c,0,...) at kdb_backtrace+0x2a
 witness_warn(2,0,c0a4ac34,c0a0990a,286,...) at witness_warn+0x1e4
 syscall(d1de3d08) at syscall+0x415
 Xint0x80_syscall() at Xint0x80_syscall+0x21
 --- syscall (0, FreeBSD ELF32, nosys), eip = 0x280b883f,esp =
 0xbfbfe46c, ebp = 0xbfbfede8 ---
 KDB: enter: panic
 [ thread pid 86 tid 100054 ]
 Stopped at kdb_enter+0x3a: movl $0,kdb_why
 db> bt
 Tracing pid 86 tid 100054 td 0xc541b000
 kdb_enter(c0a00d16,c0a09130,0,0,0,...) at panic+0x190
 witness_warn(2,0,c0a4ac34,c0a0990a,286,...) at witness_warn+0x1e4
 syscall(d1de3d08) at syscall+0x415
 Xint0x80_syscall() at Xint0x80_syscall+0x21

 Hmm, I guess I forgot to install kernel debug symbols...
 Coming back if I have more

 Unfortunately unionfs does very wrong things with the insmntque() locking.
 It basically expects the vnode to return locked in the same way
 requested by the precedent namei() (when that happens) but when you do
 insmntque() you can only have an LK_EXCLUSIVE lock on the vnode.

Hello,
the following patch should work out the issues around unionfs_nodeget() a bit:
http://www.freebsd.org/~attilio/unionfs_nodeget2.patch

Unfortunately the unionfs code is rather messy about locking
requirements in the lookup path, so following what needs to be done
there is a bit difficult.
I have no way to test this patch, so it is only compile-tested at the
moment. I would need you to also exercise the lookup path (directory
ls, find(1) on the whole unionfs volume, etc.) to validate it.

If it panics again, please provide the kernel.debug and the vmcore.X file.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: lock violation in unionfs (9.0-STABLE r230270)

2012-10-27 Thread Attilio Rao
On Sat, Oct 27, 2012 at 9:46 PM, Attilio Rao atti...@freebsd.org wrote:
 On Sat, Sep 8, 2012 at 12:48 AM, Attilio Rao atti...@freebsd.org wrote:
 On Thu, Sep 6, 2012 at 4:52 PM, Harald Schmalzbauer
 h.schmalzba...@omnilan.de wrote:
  Attilio Rao wrote on 09.08.2012 20:26 (localtime):
 On 8/8/12, Harald Schmalzbauer h.schmalzba...@omnilan.de wrote:
  Pavel Polyakov wrote on 06.03.2012 11:20 (localtime):
 mount -t unionfs -o noatime /usr /mnt

 insmntque: mp-safe fs and non-locked vp: 0xfe01d96704f0 is not
 exclusive locked but should be
 KDB: enter: lock violation
 Pavel,
 can you give a spin to this patch?:
 http://www.freebsd.org/~attilio/unionfs_missing_insmntque_lock.patch

 I think that the unlocking is due at that point as the vnode lock can
 be switch later on.

 Let me know what you think about it and what the test does.
 Thanks!
 This patch fixes the problem with lock violation. Sorry I've tested it so
 late.
 Hello,

 this patch still applies cleanly to RELENG_9_1. Was there another fix
 for the issue or has it just not been PR-sent and thus forgotten?
 Can you and Pavel try the attached patch? Unfortunately I had no time
 to test it, I just made in 5 free mins from a non-FreeBSD workstation,

 Sorry, couldn't test earlier, but now I did:
 With this patch applied the machine hangs without debug kernel and the
 latter gives the following panic:
 System call nmount returning with the following locks held:
 exclusive lockmgr ufs (ufs) r = 0 (0xc5438278) locked @
 src/sys/fs/unionfs/union_vnops.c:1938
 panic: witness_warn
 cpuid = 0
 KDB: stack backtrace:
 db_trace_self_wrapper(c0a04f7f,c0c112c4,d1de3bb4,c097aa8c,fc,...) at
 db_trace_self_wrapper+0x26
 kdb_backtrace(c0a4965f,0,c09c2ede3c1c,0,...) at kdb_backtrace+0x2a
 witness_warn(2,0,c0a4ac34,c0a0990a,286,...) at witness_warn+0x1e4
 syscall(d1de3d08) at syscall+0x415
 Xint0x80_syscall() at Xint0x80_syscall+0x21
 --- syscall (0, FreeBSD ELF32, nosys), eip = 0x280b883f,esp =
 0xbfbfe46c, ebp = 0xbfbfede8 ---
 KDB: enter: panic
 [ thread pid 86 tid 100054 ]
 Stopped at kdb_enter+0x3a: movl $0,kdb_why
 db> bt
 Tracing pid 86 tid 100054 td 0xc541b000
 kdb_enter(c0a00d16,c0a09130,0,0,0,...) at panic+0x190
 witness_warn(2,0,c0a4ac34,c0a0990a,286,...) at witness_warn+0x1e4
 syscall(d1de3d08) at syscall+0x415
 Xint0x80_syscall() at Xint0x80_syscall+0x21

 Hmm, I guess I forgot to install kernel debug symbols...
 Coming back if I have more

 Unfortunately unionfs does very wrong things with the insmntque() locking.
 It basically expects the vnode to return locked in the same way
 requested by the precedent namei() (when that happens) but when you do
 insmntque() you can only have an LK_EXCLUSIVE lock on the vnode.

 Hello,
 the following patch should workout the issues around unionfs_nodeget() a bit:
 http://www.freebsd.org/~attilio/unionfs_nodeget2.patch

 Unfortunately unionfs code is rather messy in the lookup path about
 locking requirements so follow what it needs to be done there is a bit
 difficult.
 I have no way to test this patch, so it is just test-compiled at the
 moment, but I would need that you also test lookup path (so directory
 ls, find(1) on the whole unionfs volume, etc.) to validate it
 someway.

On second thought, I think that locking in lookup (and also in other
operations) is so fragile and difficult to follow that it turns all
the vnops into real locking landmines.
I think that the following patch fixes the insmntque insertion and
follows the old approach well enough to be committed separately:
http://www.freebsd.org/~attilio/unionfs_nodeget3.patch

However, I strongly suggest that someone review and sweep all the
locking in nodeget and related functions: removing the tedious
lkflags conditional, reinforcing the locking rules and making them
explicit within functions, checking for races (of which I'm sure there
are quite a few, given that the vnode lock gets dropped
indiscriminately at many points) and possibly reviewing the prolific
usage of LK_RETRY, which I'm sure is not always safe.

All these steps should really be carried out separately.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: Panic with fusefs-ntfs on FreeBSD 9 RC1 amd64

2012-10-08 Thread Attilio Rao
On Fri, Sep 28, 2012 at 11:32 PM, Kevin Oberman kob6...@gmail.com wrote:
 On Fri, Sep 28, 2012 at 7:20 AM, Attilio Rao atti...@freebsd.org wrote:
 On Mon, Sep 24, 2012 at 6:25 PM, Kevin Oberman kob6...@gmail.com wrote:
 On Tue, Sep 18, 2012 at 7:55 PM, Attilio Rao atti...@freebsd.org wrote:
 On Tue, Sep 18, 2012 at 5:14 PM, Marcelo Gondim gon...@bsdinfo.com.br 
 wrote:
 I installed the package ntfs-fusefs on two different servers and both 
 causes
 kernel panic when trying to copy anything.
 A server using FreeBSD 9.0 STABLE amd64 and the other using FreeBSD 9 RC1
 amd64.
 Someone is having the same problem?

 Hello Marcelo,
 Do you think you can try fuse import explained here:
 http://lists.freebsd.org/pipermail/freebsd-current/2012-September/036677.html

 The proposed patch is for HEAD@240684 but I'm sure it should apply
 cleanly to RELENG_9_1 too.

 Please let me know if you have further questions.

 I tried patching 9-Stable with fuse_240684.patch. It applied cleanly,
 but the kernel build failed:
 cc -O2 -pipe -fno-strict-aliasing -Werror -D_KERNEL -DKLD_MODULE
 -nostdinc   -DHAVE_KERNEL_OPTION_HEADERS -include
 /usr/obj/usr/src/sys/GENERIC/opt_global.h -I. -I@ -I@/contrib/altq
 -finline-limit=8000 --param inline-unit-growth=100 --param
 large-function-growth=1000 -fno-common -g -fno-omit-frame-pointer
 -I/usr/obj/usr/src/sys/GENERIC  -mcmodel=kernel -mno-red-zone -mno-mmx
 -mno-sse -msoft-float  -fno-asynchronous-unwind-tables -ffreestanding
 -fstack-protector -std=iso9899:1999 -fstack-protector -Wall
 -Wredundant-decls -Wnested-externs -Wstrict-prototypes
 -Wmissing-prototypes -Wpointer-arith -Winline -Wcast-qual  -Wundef
 -Wno-pointer-sign -fformat-extensions  -Wmissing-include-dirs
 -fdiagnostics-show-option   -c
 /usr/src/sys/modules/fuse/../../fs/fuse/fuse_ipc.c
 cc1: warnings being treated as errors
 /usr/src/sys/modules/fuse/../../fs/fuse/fuse_node.c: In function
 'fuse_vnode_setsize':
 /usr/src/sys/modules/fuse/../../fs/fuse/fuse_node.c:378: warning:
 passing argument 3 of 'vtruncbuf' makes pointer from integer without a
 cast
 /usr/src/sys/modules/fuse/../../fs/fuse/fuse_node.c:378: error: too
 few arguments to function 'vtruncbuf'
 *** [fuse_node.o] Error code 1

 Looks like something has changed between stable and current that won't
 work. Any suggestions for a quick fix?

 Please check this out:
 http://lists.freebsd.org/pipermail/freebsd-current/2012-September/036862.html

 Attilio,

 stable/9 (r239879) patched and lightly tested. Seems to be working
 fine at this time. I still need to further study mount_fusefs and I am
 still using the old mount-fuse script until I can look at how HAL and
 Gnome will handle things.

 I'll try my rsync test, which reliably crashed the system with the old
 fusefs stuff, this weekend.

So, did you try this? Any news?

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: Panic with fusefs-ntfs on FreeBSD 9 RC1 amd64

2012-09-28 Thread Attilio Rao
On Mon, Sep 24, 2012 at 6:25 PM, Kevin Oberman kob6...@gmail.com wrote:
 On Tue, Sep 18, 2012 at 7:55 PM, Attilio Rao atti...@freebsd.org wrote:
 On Tue, Sep 18, 2012 at 5:14 PM, Marcelo Gondim gon...@bsdinfo.com.br 
 wrote:
 I installed the package ntfs-fusefs on two different servers and both causes
 kernel panic when trying to copy anything.
 A server using FreeBSD 9.0 STABLE amd64 and the other using FreeBSD 9 RC1
 amd64.
 Someone is having the same problem?

 Hello Marcelo,
 Do you think you can try fuse import explained here:
 http://lists.freebsd.org/pipermail/freebsd-current/2012-September/036677.html

 The proposed patch is for HEAD@240684 but I'm sure it should apply
 cleanly to RELENG_9_1 too.

 Please let me know if you have further questions.

 I tried patching 9-Stable with fuse_240684.patch. It applied cleanly,
 but the kernel build failed:
 cc -O2 -pipe -fno-strict-aliasing -Werror -D_KERNEL -DKLD_MODULE
 -nostdinc   -DHAVE_KERNEL_OPTION_HEADERS -include
 /usr/obj/usr/src/sys/GENERIC/opt_global.h -I. -I@ -I@/contrib/altq
 -finline-limit=8000 --param inline-unit-growth=100 --param
 large-function-growth=1000 -fno-common -g -fno-omit-frame-pointer
 -I/usr/obj/usr/src/sys/GENERIC  -mcmodel=kernel -mno-red-zone -mno-mmx
 -mno-sse -msoft-float  -fno-asynchronous-unwind-tables -ffreestanding
 -fstack-protector -std=iso9899:1999 -fstack-protector -Wall
 -Wredundant-decls -Wnested-externs -Wstrict-prototypes
 -Wmissing-prototypes -Wpointer-arith -Winline -Wcast-qual  -Wundef
 -Wno-pointer-sign -fformat-extensions  -Wmissing-include-dirs
 -fdiagnostics-show-option   -c
 /usr/src/sys/modules/fuse/../../fs/fuse/fuse_ipc.c
 cc1: warnings being treated as errors
 /usr/src/sys/modules/fuse/../../fs/fuse/fuse_node.c: In function
 'fuse_vnode_setsize':
 /usr/src/sys/modules/fuse/../../fs/fuse/fuse_node.c:378: warning:
 passing argument 3 of 'vtruncbuf' makes pointer from integer without a
 cast
 /usr/src/sys/modules/fuse/../../fs/fuse/fuse_node.c:378: error: too
 few arguments to function 'vtruncbuf'
 *** [fuse_node.o] Error code 1

 Looks like something has changed between stable and current that won't
 work. Any suggestions for a quick fix?

Please check this out:
http://lists.freebsd.org/pipermail/freebsd-current/2012-September/036862.html

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: Panic with fusefs-ntfs on FreeBSD 9 RC1 amd64

2012-09-18 Thread Attilio Rao
On Tue, Sep 18, 2012 at 5:14 PM, Marcelo Gondim gon...@bsdinfo.com.br wrote:
 I installed the package ntfs-fusefs on two different servers and both causes
 kernel panic when trying to copy anything.
 A server using FreeBSD 9.0 STABLE amd64 and the other using FreeBSD 9 RC1
 amd64.
 Someone is having the same problem?

Hello Marcelo,
Could you try the fuse import explained here:
http://lists.freebsd.org/pipermail/freebsd-current/2012-September/036677.html

The proposed patch is for HEAD@240684 but I'm sure it should apply
cleanly to RELENG_9_1 too.

Please let me know if you have further questions.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: lock violation in unionfs (9.0-STABLE r230270)

2012-09-07 Thread Attilio Rao
On Thu, Sep 6, 2012 at 4:52 PM, Harald Schmalzbauer
h.schmalzba...@omnilan.de wrote:
  Attilio Rao wrote on 09.08.2012 20:26 (localtime):
 On 8/8/12, Harald Schmalzbauer h.schmalzba...@omnilan.de wrote:
  Pavel Polyakov wrote on 06.03.2012 11:20 (localtime):
 mount -t unionfs -o noatime /usr /mnt

 insmntque: mp-safe fs and non-locked vp: 0xfe01d96704f0 is not
 exclusive locked but should be
 KDB: enter: lock violation
 Pavel,
 can you give a spin to this patch?:
 http://www.freebsd.org/~attilio/unionfs_missing_insmntque_lock.patch

 I think that the unlocking is due at that point as the vnode lock can
 be switch later on.

 Let me know what you think about it and what the test does.
 Thanks!
 This patch fixes the problem with lock violation. Sorry I've tested it so
 late.
 Hello,

 this patch still applies cleanly to RELENG_9_1. Was there another fix
 for the issue or has it just not been PR-sent and thus forgotten?
 Can you and Pavel try the attached patch? Unfortunately I had no time
 to test it, I just made in 5 free mins from a non-FreeBSD workstation,

 Sorry, couldn't test earlier, but now I did:
 With this patch applied the machine hangs without debug kernel and the
 latter gives the following panic:
 System call nmount returning with the following locks held:
 exclusive lockmgr ufs (ufs) r = 0 (0xc5438278) locked @
 src/sys/fs/unionfs/union_vnops.c:1938
 panic: witness_warn
 cpuid = 0
 KDB: stack backtrace:
 db_trace_self_wrapper(c0a04f7f,c0c112c4,d1de3bb4,c097aa8c,fc,...) at
 db_trace_self_wrapper+0x26
 kdb_backtrace(c0a4965f,0,c09c2ede3c1c,0,...) at kdb_backtrace+0x2a
 witness_warn(2,0,c0a4ac34,c0a0990a,286,...) at witness_warn+0x1e4
 syscall(d1de3d08) at syscall+0x415
 Xint0x80_syscall() at Xint0x80_syscall+0x21
 --- syscall (0, FreeBSD ELF32, nosys), eip = 0x280b883f,esp =
 0xbfbfe46c, ebp = 0xbfbfede8 ---
 KDB: enter: panic
 [ thread pid 86 tid 100054 ]
 Stopped at kdb_enter+0x3a: movl $0,kdb_why
 db> bt
 Tracing pid 86 tid 100054 td 0xc541b000
 kdb_enter(c0a00d16,c0a09130,0,0,0,...) at panic+0x190
 witness_warn(2,0,c0a4ac34,c0a0990a,286,...) at witness_warn+0x1e4
 syscall(d1de3d08) at syscall+0x415
 Xint0x80_syscall() at Xint0x80_syscall+0x21

 Hmm, I guess I forgot to install kernel debug symbols...
 Coming back if I have more

Unfortunately unionfs does very wrong things with the insmntque() locking.
It basically expects the vnode to come back locked in the same way
requested by the preceding namei() (when that happens), but when you call
insmntque() you can only hold an LK_EXCLUSIVE lock on the vnode.
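
To make the constraint concrete, the following is a toy userland model,
not unionfs code. toy_vnode, toy_insmntque() and toy_nodeget() are
hypothetical stand-ins; the only property taken from the discussion is
that insertion on the mount list requires the exclusive lock, so a
caller that wants the vnode shared-locked or unlocked has to get there
by downgrading or unlocking after the insertion.

```c
#include <assert.h>

/* Toy lock states; they only mirror the idea of unlocked/shared/exclusive. */
enum toy_lock { TOY_UNLOCKED, TOY_SHARED, TOY_EXCLUSIVE };

/* Hypothetical stand-in for a vnode: a lock state and a membership flag. */
struct toy_vnode {
	enum toy_lock	lock;
	int		on_mount_list;
};

/*
 * Models the precondition discussed above: the vnode must be
 * exclusively locked when it is inserted. Returns 0 or -1.
 */
int
toy_insmntque(struct toy_vnode *vp)
{
	if (vp->lock != TOY_EXCLUSIVE)
		return (-1);
	vp->on_mount_list = 1;
	return (0);
}

/*
 * Take the exclusive lock, insert, and only then move to the lock state
 * the caller asked for (downgrade for shared, release for unlocked).
 */
int
toy_nodeget(struct toy_vnode *vp, enum toy_lock wanted)
{
	int error;

	vp->lock = TOY_EXCLUSIVE;
	error = toy_insmntque(vp);
	if (error != 0) {
		vp->lock = TOY_UNLOCKED;
		return (error);
	}
	if (wanted == TOY_SHARED)
		vp->lock = TOY_SHARED;		/* downgrade */
	else if (wanted == TOY_UNLOCKED)
		vp->lock = TOY_UNLOCKED;	/* release */
	return (0);
}
```

The point of the model is the ordering: the requested lock type is
honoured only after the insertion, never instead of the exclusive lock.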

I still need some time to fix this, but my bandwidth is basically 0 at
the moment. I'll try to get back to you with a patch as soon as
possible.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: lock violation in unionfs (9.0-STABLE r230270)

2012-08-09 Thread Attilio Rao
On 8/8/12, Harald Schmalzbauer h.schmalzba...@omnilan.de wrote:
  Pavel Polyakov wrote on 06.03.2012 11:20 (localtime):
 mount -t unionfs -o noatime /usr /mnt

 insmntque: mp-safe fs and non-locked vp: 0xfe01d96704f0 is not
 exclusive locked but should be
 KDB: enter: lock violation

 Pavel,
 can you give a spin to this patch?:
 http://www.freebsd.org/~attilio/unionfs_missing_insmntque_lock.patch

 I think that the unlocking is due at that point as the vnode lock can
 be switch later on.

 Let me know what you think about it and what the test does.

 Thanks!
 This patch fixes the problem with lock violation. Sorry I've tested it so
 late.

 Hello,

 this patch still applies cleanly to RELENG_9_1. Was there another fix
 for the issue or has it just not been PR-sent and thus forgotten?

Can you and Pavel try the attached patch? Unfortunately I had no time
to test it; I put it together in 5 free minutes on a non-FreeBSD
workstation, so you should be able to tell me whether it works or not,
or even whether it compiles on RELENG_9_1.
Please try with the INVARIANTS option on.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein
Index: sys/fs/unionfs/union_subr.c
===
--- sys/fs/unionfs/union_subr.c	(revision 239152)
+++ sys/fs/unionfs/union_subr.c	(working copy)
@@ -237,7 +237,8 @@ unionfs_nodeget(struct mount *mp, struct vnode *up
 		if (vp != NULLVP) {
 			vref(vp);
 			*vpp = vp;
-			goto unionfs_nodeget_out;
+			lockmgr(vp->v_vnlock, LK_EXCLUSIVE, NULL);
+			return (0);
 		}
 	}
 
@@ -255,17 +256,19 @@ unionfs_nodeget(struct mount *mp, struct vnode *up
 	 */
 	unp = malloc(sizeof(struct unionfs_node),
 	M_UNIONFSNODE, M_WAITOK | M_ZERO);
+	if (path != NULL) {
+		unp->un_path = (char *)
+		malloc(cnp->cn_namelen +1, M_UNIONFSPATH, M_WAITOK|M_ZERO);
+		bcopy(cnp->cn_nameptr, unp->un_path, cnp->cn_namelen);
+		unp->un_path[cnp->cn_namelen] = '\0';
+	}
 
 	error = getnewvnode("unionfs", mp, &unionfs_vnodeops, &vp);
 	if (error != 0) {
+		free(unp->un_path, M_UNIONFSNODE);
 		free(unp, M_UNIONFSNODE);
 		return (error);
 	}
-	error = insmntque(vp, mp);	/* XXX: Too early for mpsafe fs */
-	if (error != 0) {
-		free(unp, M_UNIONFSNODE);
-		return (error);
-	}
 	if (dvp != NULLVP)
 		vref(dvp);
 	if (uppervp != NULLVP)
@@ -286,15 +289,22 @@ unionfs_nodeget(struct mount *mp, struct vnode *up
 	else
 		vp->v_vnlock = lowervp->v_vnlock;
 
-	if (path != NULL) {
-		unp->un_path = (char *)
-		malloc(cnp->cn_namelen +1, M_UNIONFSPATH, M_WAITOK|M_ZERO);
-		bcopy(cnp->cn_nameptr, unp->un_path, cnp->cn_namelen);
-		unp->un_path[cnp->cn_namelen] = '\0';
-	}
 	vp->v_type = vt;
 	vp->v_data = unp;
 
+	lockmgr(vp->v_vnlock, LK_EXCLUSIVE, NULL);
+	error = insmntque(vp, mp);
+	if (error != 0) {
+		if (dvp != NULLVP)
+			vrele(dvp);
+		if (uppervp != NULLVP)
+			vrele(uppervp);
+		if (lowervp != NULLVP)
+			vrele(lowervp);
+		free(unp->un_path, M_UNIONFSNODE);
+		free(unp, M_UNIONFSNODE);
+		return (error);
+	}
 	if ((uppervp != NULLVP && ump->um_uppervp == uppervp) &&
 	(lowervp != NULLVP && ump->um_lowervp == lowervp))
 		vp->v_vflag |= VV_ROOT;
@@ -317,11 +327,6 @@ unionfs_nodeget(struct mount *mp, struct vnode *up
 		vref(vp);
 	} else
 		*vpp = vp;
-
-unionfs_nodeget_out:
-	if (lkflags & LK_TYPE_MASK)
-		vn_lock(vp, lkflags | LK_RETRY);
-
 	return (0);
 }
 

Re: lock violation in unionfs (9.0-STABLE r230270)

2012-08-08 Thread Attilio Rao
On 8/8/12, Harald Schmalzbauer h.schmalzba...@omnilan.de wrote:
  Pavel Polyakov wrote on 06.03.2012 11:20 (localtime):
 mount -t unionfs -o noatime /usr /mnt

 insmntque: mp-safe fs and non-locked vp: 0xfe01d96704f0 is not
 exclusive locked but should be
 KDB: enter: lock violation

 Pavel,
 can you give a spin to this patch?:
 http://www.freebsd.org/~attilio/unionfs_missing_insmntque_lock.patch

 I think that the unlocking is due at that point as the vnode lock can
 be switch later on.

 Let me know what you think about it and what the test does.

 Thanks!
 This patch fixes the problem with lock violation. Sorry I've tested it so
 late.

 Hello,

 this patch still applies cleanly to RELENG_9_1. Was there another fix
 for the issue or has it just not been PR-sent and thus forgotten?

There are more things to fix in inode instantiation for unionfs. I
hope to make a comprehensive patch for tests in a couple of days.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: [stable 9] panic on reboot: ipmi_wd_event()

2012-08-02 Thread Attilio Rao
On 8/2/12, John Baldwin j...@freebsd.org wrote:
 On Wednesday, August 01, 2012 6:48:48 pm Sean Bruno wrote:
 On Wed, 2012-08-01 at 05:53 -0700, John Baldwin wrote:
  Index: vfs_subr.c
  ===
  --- vfs_subr.c  (revision 238969)
  +++ vfs_subr.c  (working copy)
  @@ -1868,8 +1868,11 @@ sched_sync(void)
  continue;
  }
 
  -   if (first_printf == 0)
  +   if (first_printf == 0) {
  +   mtx_unlock(&sync_mtx);
  wdog_kern_pat(WD_LASTVAL);
  +   mtx_lock(&sync_mtx);
  +   }
 
  }
  if (!LIST_EMPTY(gslp)) {
 
 
  --
  John Baldwin

 This definitely makes the panic go away on reboot.

 Attilio, does this change seem ok to you?

Thanks for asking me to review.

I think it is safe because we are going to check LIST_EMPTY() on the
global list as the next step anyway.
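
As an illustration only, here is a toy single-threaded model of the
pattern in the patch: drop the lock around a call that may sleep, take
it again, and re-read the guarded state afterwards. All names
(toy_sync_held, toy_wdog_pat(), toy_sync_pass()) are hypothetical and
the "lock" is just a flag, so this shows the ordering and nothing about
the real sync_mtx or wdog_kern_pat().

```c
#include <assert.h>
#include <stdbool.h>

static bool toy_sync_held;	/* models "the sync mutex is owned" */
static int toy_pats;		/* number of successful pats */

static void
toy_sync_lock(void)
{
	assert(!toy_sync_held);
	toy_sync_held = true;
}

static void
toy_sync_unlock(void)
{
	assert(toy_sync_held);
	toy_sync_held = false;
}

/*
 * Models a watchdog pat that may sleep: it refuses to run (returns
 * false) when called with the non-sleepable lock held.
 */
static bool
toy_wdog_pat(void)
{
	if (toy_sync_held)
		return (false);
	toy_pats++;
	return (true);
}

/*
 * One pass of the loop body. With drop_lock set, the lock is released
 * around the pat and reacquired afterwards. Returns whether the pat
 * was allowed to run.
 */
bool
toy_sync_pass(bool drop_lock)
{
	bool ok;

	toy_sync_lock();
	if (drop_lock)
		toy_sync_unlock();
	ok = toy_wdog_pat();
	if (drop_lock)
		toy_sync_lock();
	/*
	 * Anything guarded by the lock has to be re-read from here on,
	 * since it may have changed while the lock was dropped.
	 */
	toy_sync_unlock();
	return (ok);
}
```

Dropping a lock in the middle of a loop is only sound when the state it
protects is examined again after relocking, which is the role the
LIST_EMPTY() check plays in the reasoning above.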

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: [stable 9] panic on reboot: ipmi_wd_event()

2012-08-01 Thread Attilio Rao
On 8/1/12, John Baldwin j...@freebsd.org wrote:
 On Tuesday, July 31, 2012 4:51:19 pm Attilio Rao wrote:
 On 7/31/12, John Baldwin j...@freebsd.org wrote:
  On Thursday, July 19, 2012 7:58:14 pm Sean Bruno wrote:
  Working on the Dell R420 today, got most of it working, even the
  broadcom ethernet cards!  However, I get the following when I reboot
  the
  system:
 
  Syncing disks, vnodes remaining...4 Sleeping thread (tid 100107, pid
  9)
  owns a non-sleepable lock
  KDB: stack backtrace of thread 100107:
  sched_switch() at sched_switch+0x19f
  mi_switch() at mi_switch+0x208
  sleepq_switch() at sleepq_switch+0xfc
  sleepq_wait() at sleepq_wait+0x4d
  _sleep() at _sleep+0x3f6
  ipmi_submit_driver_request() at ipmi_submit_driver_request+0x97
  ipmi_set_watchdog() at ipmi_set_watchdog+0xb1
  ipmi_wd_event() at ipmi_wd_event+0x8f
  kern_do_pat() at kern_do_pat+0x10f
  sched_sync() at sched_sync+0x1ea
  fork_exit() at fork_exit+0x135
  fork_trampoline() at fork_trampoline+0xe
 
  Hmmm, the watchdog pat should probably happen without holding locks if
  possible.  This is related to the IPMI watchdog being special and
  wanting
  to schedule a thread to work.

 The watchdog pat without the locks is not easy to do because we
 register the watchdog callbacks in eventhandlers, which are indeed
 locked (and you may also end up racing against watchdog detach, if you
 don't use any lock at all).

 No, eventhandlers go through several hoops to not hold any locks while
 the eventhandler functions are running.  It seems in this case that a
 lock is held in a higher layer (sched_sync()) and that is what I was
 talking about.  Yes, it is the 'sync_mtx' that is held.  Something like this

No, EVENTHANDLER_INVOKE() acquires eventhandler internal locks.
Look at eventhandler_find_list() for details.
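
For reference, a common pattern for callback lists of this kind is
sketched below as a toy model. It is not copied from
sys/eventhandler.h, and whether EVENTHANDLER_INVOKE() behaves exactly
this way should be checked against the source: the list lock is taken
for the lookup and the bookkeeping, a run count pins the entries, and
the lock is dropped around each callback. toy_handler_list, toy_probe,
toy_probe_handler() and toy_invoke() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_HANDLERS 4

typedef void (*toy_handler_fn)(void *arg);

/* Toy handler list; "locked" is a flag standing in for the list mutex. */
struct toy_handler_list {
	bool		locked;
	int		runcount;	/* pins entries while handlers run */
	size_t		count;
	toy_handler_fn	fn[TOY_MAX_HANDLERS];
	void		*arg[TOY_MAX_HANDLERS];
};

/* Records whether the list lock was held each time a handler ran. */
struct toy_probe {
	const struct toy_handler_list *list;
	int	calls;
	int	calls_with_lock;
};

void
toy_probe_handler(void *arg)
{
	struct toy_probe *p = arg;

	p->calls++;
	if (p->list->locked)
		p->calls_with_lock++;
}

/* Invoke every handler, holding the list lock only between callbacks. */
void
toy_invoke(struct toy_handler_list *l)
{
	size_t i;

	assert(!l->locked);
	l->locked = true;		/* lookup happens under the lock */
	l->runcount++;
	for (i = 0; i < l->count; i++) {
		l->locked = false;	/* drop the lock around the callback */
		l->fn[i](l->arg[i]);
		l->locked = true;
	}
	l->runcount--;
	l->locked = false;
}
```

In this model the internal lock is indeed acquired during the invoke,
yet no handler ever observes it held; any lock owned by the caller of
toy_invoke() would of course still be held across the callbacks.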

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: [stable 9] panic on reboot: ipmi_wd_event()

2012-08-01 Thread Attilio Rao
On 8/1/12, Attilio Rao atti...@freebsd.org wrote:
 On 8/1/12, John Baldwin j...@freebsd.org wrote:
 On Tuesday, July 31, 2012 4:51:19 pm Attilio Rao wrote:
 On 7/31/12, John Baldwin j...@freebsd.org wrote:
  On Thursday, July 19, 2012 7:58:14 pm Sean Bruno wrote:
  Working on the Dell R420 today, got most of it working, even the
  broadcom ethernet cards!  However, I get the following when I reboot
  the
  system:
 
  Syncing disks, vnodes remaining...4 Sleeping thread (tid 100107, pid
  9)
  owns a non-sleepable lock
  KDB: stack backtrace of thread 100107:
  sched_switch() at sched_switch+0x19f
  mi_switch() at mi_switch+0x208
  sleepq_switch() at sleepq_switch+0xfc
  sleepq_wait() at sleepq_wait+0x4d
  _sleep() at _sleep+0x3f6
  ipmi_submit_driver_request() at ipmi_submit_driver_request+0x97
  ipmi_set_watchdog() at ipmi_set_watchdog+0xb1
  ipmi_wd_event() at ipmi_wd_event+0x8f
  kern_do_pat() at kern_do_pat+0x10f
  sched_sync() at sched_sync+0x1ea
  fork_exit() at fork_exit+0x135
  fork_trampoline() at fork_trampoline+0xe
 
  Hmmm, the watchdog pat should probably happen without holding locks if
  possible.  This is related to the IPMI watchdog being special and
  wanting
  to schedule a thread to work.

 The watchdog pat without the locks is not easy to do because we
 register the watchdog callbacks in eventhandlers, which are indeed
 locked (and you may also end up racing against watchdog detach, if you
 don't use any lock at all).

 No, eventhandlers go through several hoops to not hold any locks while
 the eventhandler functions are running.  It seems in this case that a
 lock is held in a higher layer (sched_sync()) and that is what I was
 talking about.  Yes, it is the 'sync_mtx' that is held.  Something like
 this

 No, EVENTHANDLER_INVOKE() acquires eventhandler internal locks.
 Look at eventhandler_find_list() for details.

Oh, but I guess you misunderstood me -- I didn't mean to say that
eventhandler callbacks run with the eventhandler lock held, I meant to
say that it would be nice if EVENTHANDLER_INVOKE() could run
lockless. This would have avoided some issues in special contexts (I
recall I had some issues at work years ago, but they may have
predated the STOP_SCHEDULER() patch and happened in DDB).
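
The two meanings being untangled here can be illustrated with a toy model.
The Python below is only a sketch under stated assumptions, not the FreeBSD
code: ToyEventHandlerList, register() and invoke() are hypothetical names.
It takes the list lock to walk the list (the internal locking mentioned
above) but drops it around each callback (what John described), so a
callback never runs with the list lock held.

```python
import threading


class ToyEventHandlerList:
    """Toy model of an eventhandler list; NOT the kernel implementation."""

    def __init__(self):
        self.lock = threading.Lock()   # stands in for the list's internal lock
        self.handlers = []
        self.runcount = 0              # number of invocations in progress

    def register(self, fn):
        with self.lock:
            self.handlers.append(fn)

    def invoke(self, *args):
        # The list lock is taken to walk the list ...
        self.lock.acquire()
        self.runcount += 1
        try:
            for fn in list(self.handlers):
                # ... but dropped around each callback.
                self.lock.release()
                try:
                    fn(*args)
                finally:
                    self.lock.acquire()
        finally:
            self.runcount -= 1
            self.lock.release()
```

In this model invoke() is not lockless (it still acquires the list lock),
yet the callbacks themselves run unlocked, which is the distinction drawn
in the paragraph above.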

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: [stable 9] panic on reboot: ipmi_wd_event()

2012-07-31 Thread Attilio Rao
On 7/31/12, John Baldwin j...@freebsd.org wrote:
 On Thursday, July 19, 2012 7:58:14 pm Sean Bruno wrote:
 Working on the Dell R420 today, got most of it working, even the
 broadcom ethernet cards!  However, I get the following when I reboot the
 system:

 Syncing disks, vnodes remaining...4 Sleeping thread (tid 100107, pid 9)
 owns a non-sleepable lock
 KDB: stack backtrace of thread 100107:
 sched_switch() at sched_switch+0x19f
 mi_switch() at mi_switch+0x208
 sleepq_switch() at sleepq_switch+0xfc
 sleepq_wait() at sleepq_wait+0x4d
 _sleep() at _sleep+0x3f6
 ipmi_submit_driver_request() at ipmi_submit_driver_request+0x97
 ipmi_set_watchdog() at ipmi_set_watchdog+0xb1
 ipmi_wd_event() at ipmi_wd_event+0x8f
 kern_do_pat() at kern_do_pat+0x10f
 sched_sync() at sched_sync+0x1ea
 fork_exit() at fork_exit+0x135
 fork_trampoline() at fork_trampoline+0xe

 Hmmm, the watchdog pat should probably happen without holding locks if
 possible.  This is related to the IPMI watchdog being special and wanting
 to schedule a thread to work.

The watchdog pat without the locks is not easy to do because we
register the watchdog callbacks in eventhandlers, which are indeed
locked (and you may also end up racing against watchdog detach, if you
don't use any lock at all).

There is a similar issue when you enter DDB or take a coredump, for example,
but this is somewhat collateral, given the after-panic nature of the
situation. We should seriously look into the requirements for watchdog
patting and possibly for entering DDB, outline the correct
semantics to follow and refactor the code to follow them.
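
As a rough illustration of the rule that the backtrace above trips over,
here is a toy Python model. Every name in it is hypothetical (nothing here
is kernel API); "sync_mtx" is borrowed from elsewhere in this thread and
"ipmireq" is an invented wait message. The point it shows: a thread that
owns a non-sleepable lock such as a mutex must not sleep, so a watchdog
callback that sleeps is only safe to call once the caller has dropped its
mutex.

```python
class SleepWithMutexHeld(RuntimeError):
    """Raised by the toy model when a thread sleeps while owning a mutex."""


class ToyThread:
    def __init__(self, tid):
        self.tid = tid
        self.held = []          # names of non-sleepable locks currently owned

    def mtx_lock(self, name):
        self.held.append(name)

    def mtx_unlock(self, name):
        self.held.remove(name)

    def sleep(self, wmesg):
        if self.held:
            raise SleepWithMutexHeld(
                "tid %d sleeps on %r while owning %s"
                % (self.tid, wmesg, self.held))
        return wmesg


def toy_watchdog_pat(td):
    # Stand-in for a watchdog callback that has to wait for the hardware.
    return td.sleep("ipmireq")


def pat_with_lock_held(td):
    td.mtx_lock("sync_mtx")
    try:
        return toy_watchdog_pat(td)      # violates the rule
    finally:
        td.mtx_unlock("sync_mtx")


def pat_with_lock_dropped(td):
    td.mtx_lock("sync_mtx")
    # ... work that needs the mutex would go here ...
    td.mtx_unlock("sync_mtx")
    result = toy_watchdog_pat(td)        # sleeping is fine without the mutex
    td.mtx_lock("sync_mtx")              # retake it to continue the loop
    td.mtx_unlock("sync_mtx")
    return result
```

Dropping the lock is only half of the story: as noted above, whatever
replaces it still has to cope with a watchdog being detached concurrently.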

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: IPMI hardware watchdogs Re: dell r420/r320 stable/9

2012-07-27 Thread Attilio Rao
On Fri, Jul 27, 2012 at 3:33 PM, Andrew Boyer abo...@averesystems.com wrote:

 On Jul 26, 2012, at 8:50 PM, Sean Bruno wrote:

 For the time being I had to revert the following from my stable/9 tree.
 Otherwise I would get a kernel panic on shutdown from ipmi(4).

 http://svnweb.freebsd.org/base?view=revision&revision=237839
 http://svnweb.freebsd.org/base?view=revision&revision=221121



 On a somewhat related note: We noticed recently that you can't pet or disable 
 the IPMI hardware watchdog once SCHEDULER_STOPPED() is true.  This means it 
 can fire unexpectedly while you're dumping core or rebooting, depending on 
 how long the timeout was on the pet before the panic.  The ipmi driver will 
 need to process the command differently if the scheduler is stopped.  I 
 haven't had time to look at a fix yet.

I recall I fixed that internally for SV, but the key here is that we
need to find a unified (or a default) policy.
More specifically, do we want the watchdog to also cover the kernel dump
part (because of possible deadlocks when dumping)? If the answer is
yes, we likely need to pat the watchdog from within the dumping cycle
itself. If the answer is no, then we can just disable it when entering
the panic path. But anyway, we need to identify a default policy that
makes sense first.
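
The trade-off between these choices can be sketched with a small
simulation. This is a toy model in Python with hypothetical policy names
("do_nothing", "pat_in_dump", "disable_on_panic"), not driver code: it
only tracks when a watchdog armed at panic time would expire while a dump
is written chunk by chunk.

```python
def simulate_dump(policy, timeout, chunk_times):
    """Return the time at which the watchdog fires, or None if it never does.

    policy      -- "do_nothing", "pat_in_dump" or "disable_on_panic"
    timeout     -- watchdog timeout in seconds, last patted at panic (t = 0)
    chunk_times -- seconds needed to write each chunk of the dump;
                   float("inf") models a dump that hangs on that chunk
    """
    if policy == "disable_on_panic":
        return None                      # nothing left to fire, even on a hang
    if policy == "do_nothing":
        # Nobody pats after the panic: fires if the dump outlasts the timeout.
        return timeout if sum(chunk_times) > timeout else None
    if policy != "pat_in_dump":
        raise ValueError("unknown policy: %r" % (policy,))
    now = 0.0
    deadline = timeout
    for t in chunk_times:
        if now + t > deadline:
            return deadline              # expires while this chunk is stuck
        now += t
        deadline = now + timeout         # pat after each completed chunk
    return None
```

With a 5 second timeout and ten healthy 2 second chunks, "do_nothing"
resets the box in the middle of a good dump, "pat_in_dump" lets the dump
finish but still fires if a chunk hangs, and "disable_on_panic" never
fires, not even on a hang.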

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: IPMI hardware watchdogs Re: dell r420/r320 stable/9

2012-07-27 Thread Attilio Rao
On Fri, Jul 27, 2012 at 3:55 PM, Andrew Boyer abo...@averesystems.com wrote:

 On Jul 27, 2012, at 10:42 AM, Attilio Rao wrote:

 On Fri, Jul 27, 2012 at 3:33 PM, Andrew Boyer abo...@averesystems.com 
 wrote:

 On Jul 26, 2012, at 8:50 PM, Sean Bruno wrote:

 For the time being I had to revert the following from my stable/9 tree.
 Otherwise I would get a kernel panic on shutdown from ipmi(4).

 http://svnweb.freebsd.org/base?view=revision&revision=237839
 http://svnweb.freebsd.org/base?view=revision&revision=221121


 On a somewhat related note: We noticed recently that you can't pet or 
 disable the IPMI hardware watchdog once SCHEDULER_STOPPED() is true.  This 
 means it can fire unexpectedly while you're dumping core or rebooting, 
 depending on how long the timeout was on the pet before the panic.  The 
 ipmi driver will need to process the command differently if the scheduler 
 is stopped.  I haven't had time to look at a fix yet.

 I recall I fixed that internally for SV, but the key here is that we
 need to find an unified (or a default policy).
 More specifically, do we want the watchdog also covers the kernel dump
 part (because of possible deadlocks when dumping). If the answer is
 yes, we likely need pat the watchdog from within the dumping cycle
 itself. If the answer is no, then we can just disable it when entering
 the panic path. But anyway, we need to identify a default policy that
 makes sense first.

 Attilio


 For our use case, we need the system to reset if the dump hangs.

This means we will likely have to control the watchdog patting by hand
in the panic path, and more specifically I guess this reduces to
patting the watchdog from within the dumping cycle (there could be
other expensive points to consider, but nothing comes to
mind right now). Maybe Ryan can tell us whether SV can contribute the
code for that specific part back.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: stable/9 sandybridge reboot panic

2012-05-25 Thread Attilio Rao
2012/5/25, Sean Bruno sean...@yahoo-inc.com:
 Dell R620, getting pretty reliable panics here everytime I reboot.

 http://people.freebsd.org/~sbruno/sandybridge_reboot_panic.txt

I'm sure that if you drop hwpmc you will get rid of it.
However, it would be great if you could get something that Davide and
George can investigate and fix, in particular
because it has already been ported to STABLE_9.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: Complete hang on 9.0-RELEASE

2012-03-05 Thread Attilio Rao
2012/3/5, Arnaud Lacombe lacom...@gmail.com:
 Hi,

 On Wed, Feb 29, 2012 at 2:31 PM, Arnaud Lacombe lacom...@gmail.com wrote:
 Hi,

 On Wed, Feb 29, 2012 at 2:22 PM, Attilio Rao atti...@freebsd.org wrote:
 2012/2/29, Arnaud Lacombe lacom...@gmail.com:
 Hi,

 On Wed, Feb 29, 2012 at 1:44 PM, Attilio Rao atti...@freebsd.org
 wrote:
 2012/2/29, Arnaud Lacombe lacom...@gmail.com:
 Hi,

 On Wed, Feb 29, 2012 at 12:59 PM, Arnaud Lacombe lacom...@gmail.com
 wrote:
 Hi,

 On Mon, Feb 27, 2012 at 12:48 PM, Arnaud Lacombe lacom...@gmail.com
 wrote:
 Hi,

 On Mon, Feb 27, 2012 at 10:36 AM, Attilio Rao atti...@freebsd.org
 wrote:
 2012/2/27, Arnaud Lacombe lacom...@gmail.com:
 Hi,

 On Tue, Feb 14, 2012 at 11:41 AM, Arnaud Lacombe
 lacom...@gmail.com
 wrote:
 Hi folks,

 For the records, I was running some tests yesterday on top of a
 9.0-RELEASE, amd64, kernel when the box hanged. At the time of
 the
 hang, the box was running a process with about 2800 threads with
 heavy
 IPC between 1400 writers and 1400 readers. The box was in single
 user
 mode (/bin/sh coming from FreeBSD 7.4-STABLE). Here is the
 beginning
 of the dmesg:

 This happened a second time, now with FreeBSD 8.2-RELEASE.
 Complete
 machine hang. The machine was running about 4000 threads in a
 single
 process, all the other condition are the same.

 Arnaud,
 can you please break in your kernel via KDB, collect the following
 informations from the DDB prompt:
 - ps
 - alltrace
 - show allpcpu
 - possibly get a coredump with 'call doadump'

 Will do, but I'll need to rebuild a kernel to include DDB.

 and in the end provide all those along with kernel binary and
 possibly
 sources somewhere?

 I'll be testing a bare `release/8.2.0' with the following patch:

 diff --git a/sys/amd64/conf/GENERIC b/sys/amd64/conf/GENERIC
 index c3e0095..7bd997f 100644
 --- a/sys/amd64/conf/GENERIC
 +++ b/sys/amd64/conf/GENERIC
 @@ -79,6 +79,10 @@ options  INCLUDE_CONFIG_FILE # Include this file in kernel

  options         KDB                     # Kernel debugger related code
  options         KDB_TRACE               # Print a stack trace for a panic
 +options         DDB
 +options         BREAK_TO_DEBUGGER
 +options         ALT_BREAK_TO_DEBUGGER

  # Make an SMP-capable kernel by default
  options         SMP                     # Symmetric MultiProcessor Kernel

 ok, it happened again after 2 days, the process was running about
 3200
 threads. I'm trying to break into DDB and let you know, I'm not that
 successful for now...

 No luck. None of BREAK or ALT_BREAK are responding. I will not touch
 the system in the next few hours if you want me to test something on
 it. In the event of 8.2-RELEASE or 9.0-RELEASE are  not meant to work
 reliably on top of a 7.4-RELEASE userland, I will re-setup the test to
 occurs on a clean 9.0-RELEASE system and re-try.

 We allow to break KBI when new releases happens, thus this may cause a
 breakage for you, even if a deadlock is really not something you want.

 Can you try enabling SW_WATCHDOG, DEADLKRES and possibly arm your
 ichwd?
 if the breakage involves clocks or interrupt sources there are still
 chances they will be able to catch it though.

 However, it doesn't seem you are setup with a proper serial console?
 The serial console is working definitively fine. I can break into DDB
 at will when the test is running. I did not test with ALT_BREAK
 per-se, but BREAK does work.

 So if you try to break in DDB via serial break it doesn't work?
 That is definitively very bad...

 just to be sure, I rebooted the system and I could break into DDB at
 the first attempt with ALT_BREAK, BREAK was a bit more reluctant but
 worked too. So yes, this does not taste good :/

 Can you try with the options I mentioned earlier and see if something
 changes?

 will do, but I will first attempt to reproduce this on 9.0-RELEASE.

 9.0-RELEASE (kernel + userland) hanged today while running 2000
 threads. Next step is to reproduce it with a watchdog+textdump enabled
 kernel.

And you were still unable to break in DDB, right?

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: Complete hang on 9.0-RELEASE

2012-02-29 Thread Attilio Rao
2012/2/29, Arnaud Lacombe lacom...@gmail.com:
 Hi,

 On Wed, Feb 29, 2012 at 12:59 PM, Arnaud Lacombe lacom...@gmail.com wrote:
 Hi,

 On Mon, Feb 27, 2012 at 12:48 PM, Arnaud Lacombe lacom...@gmail.com
 wrote:
 Hi,

 On Mon, Feb 27, 2012 at 10:36 AM, Attilio Rao atti...@freebsd.org
 wrote:
 2012/2/27, Arnaud Lacombe lacom...@gmail.com:
 Hi,

 On Tue, Feb 14, 2012 at 11:41 AM, Arnaud Lacombe lacom...@gmail.com
 wrote:
 Hi folks,

 For the records, I was running some tests yesterday on top of a
 9.0-RELEASE, amd64, kernel when the box hanged. At the time of the
 hang, the box was running a process with about 2800 threads with heavy
 IPC between 1400 writers and 1400 readers. The box was in single user
 mode (/bin/sh coming from FreeBSD 7.4-STABLE). Here is the beginning
 of the dmesg:

 This happened a second time, now with FreeBSD 8.2-RELEASE. Complete
 machine hang. The machine was running about 4000 threads in a single
 process, all the other condition are the same.

 Arnaud,
 can you please break in your kernel via KDB, collect the following
 informations from the DDB prompt:
 - ps
 - alltrace
 - show allpcpu
 - possibly get a coredump with 'call doadump'

 Will do, but I'll need to rebuild a kernel to include DDB.

 and in the end provide all those along with kernel binary and possibly
 sources somewhere?

 I'll be testing a bare `release/8.2.0' with the following patch:

 diff --git a/sys/amd64/conf/GENERIC b/sys/amd64/conf/GENERIC
 index c3e0095..7bd997f 100644
 --- a/sys/amd64/conf/GENERIC
 +++ b/sys/amd64/conf/GENERIC
 @@ -79,6 +79,10 @@ options  INCLUDE_CONFIG_FILE # Include this file in kernel

  options         KDB                     # Kernel debugger related code
  options         KDB_TRACE               # Print a stack trace for a panic
 +options         DDB
 +options         BREAK_TO_DEBUGGER
 +options         ALT_BREAK_TO_DEBUGGER

  # Make an SMP-capable kernel by default
  options         SMP                     # Symmetric MultiProcessor Kernel

 ok, it happened again after 2 days, the process was running about 3200
 threads. I'm trying to break into DDB and let you know, I'm not that
 successful for now...

 No luck. None of BREAK or ALT_BREAK are responding. I will not touch
 the system in the next few hours if you want me to test something on
 it. In the event of 8.2-RELEASE or 9.0-RELEASE are  not meant to work
 reliably on top of a 7.4-RELEASE userland, I will re-setup the test to
 occurs on a clean 9.0-RELEASE system and re-try.

We allow KBI breakage when new releases happen, thus this may cause
breakage for you, even if a deadlock is really not something you want.

Can you try enabling SW_WATCHDOG, DEADLKRES and possibly arm your ichwd?
Even if the breakage involves clocks or interrupt sources, there is still
a chance they will be able to catch it.

However, it doesn't seem you are set up with a proper serial console?
If this is the case, you need to go with a textdump in order to
collect DDB output.
Or, if you have one, you might try sending a serial break and the kernel
should break into DDB.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: Complete hang on 9.0-RELEASE

2012-02-29 Thread Attilio Rao
2012/2/29, Arnaud Lacombe lacom...@gmail.com:
 Hi,

 On Wed, Feb 29, 2012 at 1:44 PM, Attilio Rao atti...@freebsd.org wrote:
 2012/2/29, Arnaud Lacombe lacom...@gmail.com:
 Hi,

 On Wed, Feb 29, 2012 at 12:59 PM, Arnaud Lacombe lacom...@gmail.com
 wrote:
 Hi,

 On Mon, Feb 27, 2012 at 12:48 PM, Arnaud Lacombe lacom...@gmail.com
 wrote:
 Hi,

 On Mon, Feb 27, 2012 at 10:36 AM, Attilio Rao atti...@freebsd.org
 wrote:
 2012/2/27, Arnaud Lacombe lacom...@gmail.com:
 Hi,

 On Tue, Feb 14, 2012 at 11:41 AM, Arnaud Lacombe lacom...@gmail.com
 wrote:
 Hi folks,

 For the records, I was running some tests yesterday on top of a
 9.0-RELEASE, amd64, kernel when the box hanged. At the time of the
 hang, the box was running a process with about 2800 threads with
 heavy
 IPC between 1400 writers and 1400 readers. The box was in single
 user
 mode (/bin/sh coming from FreeBSD 7.4-STABLE). Here is the beginning
 of the dmesg:

 This happened a second time, now with FreeBSD 8.2-RELEASE. Complete
 machine hang. The machine was running about 4000 threads in a single
 process, all the other condition are the same.

 Arnaud,
 can you please break in your kernel via KDB, collect the following
 informations from the DDB prompt:
 - ps
 - alltrace
 - show allpcpu
 - possibly get a coredump with 'call doadump'

 Will do, but I'll need to rebuild a kernel to include DDB.

 and in the end provide all those along with kernel binary and possibly
 sources somewhere?

 I'll be testing a bare `release/8.2.0' with the following patch:

 diff --git a/sys/amd64/conf/GENERIC b/sys/amd64/conf/GENERIC
 index c3e0095..7bd997f 100644
 --- a/sys/amd64/conf/GENERIC
 +++ b/sys/amd64/conf/GENERIC
 @@ -79,6 +79,10 @@ options  INCLUDE_CONFIG_FILE # Include this file in kernel

  options         KDB                     # Kernel debugger related code
  options         KDB_TRACE               # Print a stack trace for a panic
 +options         DDB
 +options         BREAK_TO_DEBUGGER
 +options         ALT_BREAK_TO_DEBUGGER

  # Make an SMP-capable kernel by default
  options         SMP                     # Symmetric MultiProcessor Kernel

 ok, it happened again after 2 days, the process was running about 3200
 threads. I'm trying to break into DDB and let you know, I'm not that
 successful for now...

 No luck. None of BREAK or ALT_BREAK are responding. I will not touch
 the system in the next few hours if you want me to test something on
 it. In the event of 8.2-RELEASE or 9.0-RELEASE are  not meant to work
 reliably on top of a 7.4-RELEASE userland, I will re-setup the test to
 occurs on a clean 9.0-RELEASE system and re-try.

 We allow to break KBI when new releases happens, thus this may cause a
 breakage for you, even if a deadlock is really not something you want.

 Can you try enabling SW_WATCHDOG, DEADLKRES and possibly arm your ichwd?
 if the breakage involves clocks or interrupt sources there are still
 chances they will be able to catch it though.

 However, it doesn't seem you are setup with a proper serial console?
 The serial console is working definitively fine. I can break into DDB
 at will when the test is running. I did not test with ALT_BREAK
 per-se, but BREAK does work.

So if you try to break into DDB via serial break it doesn't work?
That is definitely very bad...

Can you try with the options I mentioned earlier and see if something changes?

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: Complete hang on 9.0-RELEASE

2012-02-27 Thread Attilio Rao
2012/2/27, Arnaud Lacombe lacom...@gmail.com:
 Hi,

 On Tue, Feb 14, 2012 at 11:41 AM, Arnaud Lacombe lacom...@gmail.com wrote:
 Hi folks,

 For the records, I was running some tests yesterday on top of a
 9.0-RELEASE, amd64, kernel when the box hanged. At the time of the
 hang, the box was running a process with about 2800 threads with heavy
 IPC between 1400 writers and 1400 readers. The box was in single user
 mode (/bin/sh coming from FreeBSD 7.4-STABLE). Here is the beginning
 of the dmesg:

 This happened a second time, now with FreeBSD 8.2-RELEASE. Complete
 machine hang. The machine was running about 4000 threads in a single
 process, all the other condition are the same.

Arnaud,
can you please break into your kernel via KDB and collect the following
information from the DDB prompt:
- ps
- alltrace
- show allpcpu
- possibly get a coredump with 'call doadump'

and finally make all of that available somewhere, along with the kernel
binary and possibly the sources?

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: lock violation in unionfs (9.0-STABLE r230270)

2012-02-15 Thread Attilio Rao
2012/2/13, Pavel Polyakov b...@kobyla.org:
 http://www.freebsd.org/cgi/query-pr.cgi?pr=165087

 Occurs simply trying to use unionfs:
 mount -t unionfs -o noatime /usr /mnt

 insmntque: mp-safe fs and non-locked vp: 0xfe01d96704f0 is not
 exclusive locked but should be
 KDB: enter: lock violation

Pavel,
can you give this patch a spin?
http://www.freebsd.org/~attilio/unionfs_missing_insmntque_lock.patch

I think that the unlocking is due at that point as the vnode lock can
be switched later on.

Let me know what you think about it and what the test does.
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: Custom kernel poll summary (was: Re: Reducing the need to compile a custom kernel)

2012-02-14 Thread Attilio Rao
2012/2/14, Alexander Leidinger alexan...@leidinger.net:
 Quoting Alexander Leidinger alexan...@leidinger.net (from Fri, 10
 Feb 2012 14:56:04 +0100):

 Such a kernel would cover situations where people compile their own
 kernel because they want to get rid of some unused kernel code (and
 maybe even need the memory this frees up).

 The question is, is this enough? Or asked differently, why are you
 compiling a custom kernel in a production environment (so I rule out
 debug options which are not enabled in GENERIC)? Are there options
 which you add which you can not add as a module (SW_WATCHDOG comes
 to my mind)? If yes, which ones and how important are they for you?

 Here is what I got, the first column is the number of requests, the
 second what is requested, and the 3rd my comments (basically it means,
 if there is a comment, it is not needed/possible to include in a
 modular kernel):

...
 2 SW_WATCHDOG

This can become a module with very little effort I guess.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: Benchmark (Phoronix): FreeBSD 9.0-RC2 vs. Oracle Linux 6.1 Server

2011-12-16 Thread Attilio Rao
2011/12/16 Arnaud Lacombe lacom...@gmail.com:
 Hi,

 On Thu, Dec 15, 2011 at 2:32 AM, O. Hartmann
 ohart...@zedat.fu-berlin.de wrote:
 Just saw this shot benchmark on Phoronix dot com today:

 http://www.phoronix.com/scan.php?page=news_item&px=MTAyNzA

 it might be worth highlighting that despite Oracle Linux 6.1 Server is
 using a kernel + compiler almost 2 years old, it still manages to
 out-perform the bleeding edge FreeBSD :-)

 Now, from what I've read so far in this thread, it seems that a lot of
 people are still in abnegation...

 my 0.2c,
  - Arnaud

That is really hilarious, however, coming from someone who really thinks
that passing __FILE__ and __LINE__ to a kernel function is going to give a
measurable performance penalty :)

It is crystal clear you really don't understand how to make reliable
benchmarks (and likely you don't really have a grasp of the contention
points of today's machines), so why do you keep talking about it? It would
be more valuable for you and whatever project you follow if you spent
your time coding and doing real benchmarking.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: SCHED_ULE should not be the default

2011-12-16 Thread Attilio Rao
2011/12/15 Steve Kargl s...@troutmask.apl.washington.edu:
 On Thu, Dec 15, 2011 at 05:25:51PM +0100, Attilio Rao wrote:

 I basically went through all the e-mail you just sent and identified 4
 real report on which we could work on and summarizied in the attached
 Excel file.
 I'd like that George, Steve, Doug, Andrey and Mike possibly review the
 few datas there and add more, if they want, or make more important
 clarifications in particular about the Xorg presence (or rather not)
 in their workload.

 Your summary of my observations appears correct.

 I have grabbed an up-to-date /usr/src, built and
 installed world, and built and installed a new
 kernel on one of the nodes in my cluster.  It
 has

 CPU: Dual Core AMD Opteron(tm) Processor 280 (2392.65-MHz K8-class CPU)
  Origin = AuthenticAMD  Id = 0x20f12  Family = f  Model = 21  Stepping = 2
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,
  MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x1<SSE3>
  AMD Features=0xe2500800<SYSCALL,NX,MMX+,FFXSR,LM,3DNow!+,3DNow!>
  AMD Features2=0x3<LAHF,CMP>
 real memory  = 17179869184 (16384 MB)
 avail memory = 16269832192 (15516 MB)
 FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
 FreeBSD/SMP: 2 package(s) x 2 core(s)

 I can perform new tests with both ULE and 4BSD, but you'll
 need to be precise in the information you want collected
 (and how to collect the data) due to the rather limited
 amount of time I currently have.

It seems a perfect environment, just please make sure you have a
debug-free userland (basically by setting MALLOC_PRODUCTION for jemalloc).

The first thing is, can you try reproducing your case? As far as I
understood, for you it was enough to run N + small_amount of CPU-bound
threads to show the performance penalty, so I'd ask you to start by using
dnetc or just your preferred CPU-bound workload and verify you can
reproduce the issue.
While it happens, please monitor thread bouncing and CPU utilization
via 'top' (you don't need to be 100% precise, just get an idea, and
keep an eye on things like excessive thread migration, excessive thread
binding, low CPU throughput).
One note: if your workload needs to do I/O, please use tmpfs or
memory-backed storage for it, in order to minimize I/O effects.
Also, verify this doesn't happen with the 4BSD scheduler, just in case.
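
As a back-of-the-envelope aid for reading the top(1) numbers, the sketch
below (Python, hypothetical helper names) computes the CPU share each
thread would get if N + k identical CPU-bound threads were spread
perfectly evenly over N CPUs, and flags the threads whose observed share
is far from that. The 25% tolerance is an arbitrary assumption, not a
recommendation.

```python
def ideal_share(ncpus, nthreads):
    """Fraction of one CPU each thread gets under a perfectly even split."""
    if nthreads <= ncpus:
        return 1.0
    return ncpus / nthreads


def outliers(observed, ncpus, tolerance=0.25):
    """Return the indexes of threads whose observed share (0.0 to 1.0)
    deviates from the ideal share by more than `tolerance`, relative to
    the ideal share."""
    ideal = ideal_share(ncpus, len(observed))
    return [i for i, share in enumerate(observed)
            if abs(share - ideal) > tolerance * ideal]
```

For example, with 4 CPUs and 5 CPU-bound threads the ideal share is 0.8
per thread; a thread that sits near 0.5 for the whole run would be
flagged, one near 0.9 would not.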

Finally, if the problem is still in place, please recompile your
kernel by adding:
options KTR
options KTR_ENTRIES=262144
options KTR_COMPILE=(KTR_SCHED)
options KTR_MASK=(KTR_SCHED)

And reproduce the issue.
When you are in the middle of the scheduling issue go with:
# ktrdump -ctf > ktr-ule-problem-YOURNAME.out

and send it to the mailing list along with your dmesg and the
information on CPU utilization you gathered with top(1).

That should cover it all, but if you have further questions, please
just go ahead.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: SCHED_ULE should not be the default

2011-12-15 Thread Attilio Rao
2011/12/9 George Mitchell george+free...@m5p.com:
 dnetc is an open-source program from http://www.distributed.net/.  It
 tries a brute-force approach to cracking RC4 puzzles and also computes
 optimal Golomb rulers.  It starts up one process per CPU and runs at
 nice 20 and is, for all intents and purposes, 100% compute bound.

[Posting on the first message of the thread]

I basically went through all the e-mails you sent and identified 4
real reports that we could work on, and summarized them in the attached
Excel file.
I'd like George, Steve, Doug, Andrey and Mike to review the
small amount of data there and add more, if they want, or make important
clarifications, in particular about the presence (or absence) of Xorg
in their workload.

I've read a couple of messages in the thread pointing the finger at
Xorg for being excessively CPU-intensive and I think they are right; we
might try to find a solution for that at some point, but it is really
a very edge case.
George's and Steve's cases, instead, look very different from this and
I want to analyze them in detail.
George already provided schedgraph traces; for the others, if they
cannot provide them directly, I'd really appreciate it if they would at
least describe the workload in detail so that I get a chance to
reproduce it.

If someone else thinks he has a specific problem that is not
characterized by one of the cases above please let me know and I will
put this in the chart.

Thanks for the hard work you guys put into pointing out ULE's problems. I
think we will get to the bottom of this if we keep sharing thoughts
and reports.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Re: SCHED_ULE should not be the default

2011-12-15 Thread Attilio Rao
2011/12/14 Mike Tancsa m...@sentex.net:
 On 12/13/2011 7:01 PM, m...@freebsd.org wrote:

 Has anyone experiencing problems tried to set sysctl 
 kern.sched.steal_thresh=1 ?

 I don't remember what our specific problem at $WORK was, perhaps it
 was just interrupt threads not getting serviced fast enough, but we've
 hard-coded this to 1 and removed the code that sets it in
 sched_initticks().  The same effect should be had by setting the
 sysctl after a box is up.

 FWIW, this does impact the performance of pbzip2 on an i7. Using a 1.1G file

 pbzip2 -v -c big > /dev/null

 with burnP6 running in the background,

 sysctl kern.sched.steal_thresh=1
 vs
 sysctl kern.sched.steal_thresh=3



    N           Min           Max        Median           Avg        Stddev
 x  10     38.005022      38.42238     38.194648     38.165052    0.15546188
 +   9     38.695417     40.595544     39.392127     39.435384    0.59814114
 Difference at 95.0% confidence
        1.27033 +/- 0.412636
        3.32852% +/- 1.08119%
        (Student's t, pooled s = 0.425627)

 a value of 1 is *slightly* faster.

Hi Mike,
was that just the same codebase with the switch SCHED_4BSD/SCHED_ULE?

Also, the results here should be in the 3% interval for the avg case,
which is not yet at the 'alarm level' but could still be an
indication.
I still suspect I/O plays a big role here, however, thus it could be
determined by other factors.
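
The "3%" figure can be rechecked from the summary rows of the quoted
ministat output alone. The Python below is a minimal sketch of that
arithmetic: a pooled-variance Student's t interval for the difference of
the two averages. The function name is hypothetical, and the critical
value 2.110 (two-sided 95%, 17 degrees of freedom) is hard-coded from a
standard t table rather than computed.

```python
import math


def pooled_t_difference(n1, mean1, sd1, n2, mean2, sd2, t_crit):
    """Difference of two means with a pooled-variance (Student's t) interval.

    Returns (diff, margin, pooled_sd, diff_pct, margin_pct), where the
    percentages are relative to the first mean.
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    pooled_sd = math.sqrt(pooled_var)
    diff = mean2 - mean1
    margin = t_crit * pooled_sd * math.sqrt(1.0 / n1 + 1.0 / n2)
    return (diff, margin, pooled_sd,
            100.0 * diff / mean1, 100.0 * margin / mean1)


# (N, Avg, Stddev) taken from the "x" and "+" rows quoted above.
X = (10, 38.165052, 0.15546188)
PLUS = (9, 39.435384, 0.59814114)
T_CRIT_95_DF17 = 2.110   # assumption: value read from a t table

diff, margin, pooled_sd, diff_pct, margin_pct = pooled_t_difference(
    *X, *PLUS, T_CRIT_95_DF17)
print("%.5f +/- %.6f (pooled s = %.6f)" % (diff, margin, pooled_sd))
print("%.5f%% +/- %.5f%%" % (diff_pct, margin_pct))
```

The interval excludes zero, so the difference is statistically real at
95% confidence, but it says nothing about whether the scheduler or the
I/O path is responsible for it.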

Could you rerun the benchmark while checking CPU usage and possible
thread migration for both cases?

Thanks,
Attilio




Re: SCHED_ULE should not be the default

2011-12-15 Thread Attilio Rao
2011/12/13 Daniel Kalchev dan...@digsys.bg:


 On 13.12.11 09:36, Jeremy Chadwick wrote:

 I personally would find it interesting if someone with a higher-end system
 (e.g. 2 physical CPUs, with 6 or 8 cores per CPU) was to do the same test
 (changing -jX to -j{numofcores} of course).


 Is 4 way 8 core Opteron ok? That is 32 cores, 64GB RAM.

 Testing with buildworld in my opinion is not adequate, as it involves way
 too much I/O. Any advice on proper testing methodology?

I'm sure that I/O and pmap subsystem contention (because of
buildworld) and TLB shootdown overhead (because of 32 CPUs) will be so
overwhelming that you are not really going to benchmark the scheduler
activity at all.

However, I still don't get what exactly you want to verify.

Thanks,
Attilio




Re: SCHED_ULE should not be the default

2011-12-15 Thread Attilio Rao
2011/12/13 Jeremy Chadwick free...@jdc.parodius.com:
 On Mon, Dec 12, 2011 at 02:47:57PM +0100, O. Hartmann wrote:
  Not fully right, boinc defaults to run on idprio 31 so this isn't an
  issue. And yes, there are cases where SCHED_ULE shows much better
  performance than SCHED_4BSD.  [...]

 Do we have any proof at hand for such cases where SCHED_ULE performs
 much better than SCHED_4BSD? Whenever the subject comes up, it is
 mentioned, that SCHED_ULE has better performance on boxes with a ncpu >
 2. But in the end I see here contradictory statements. People
 complain about poor performance (especially in scientific environments),
 and other give contra not being the case.

 Within our department, we developed a highly scalable code for planetary
 science purposes on imagery. It utilizes present GPUs via OpenCL if
 present. Otherwise it grabs as many cores as it can.
 By the end of this year I'll get a new desktop box based on Intels new
 Sandy Bridge-E architecture with plenty of memory. If the colleague who
 developed the code is willing performing some benchmarks on the same
 hardware platform, we'll benchmark both FreeBSD 9.0/10.0 and the most
 recent Suse. For FreeBSD I intend also to look for performance with both
 different schedulers available.

 This is in no way shape or form the same kind of benchmark as what
 you're planning to do, but I thought I'd throw it out there for folks to
 take in as they see fit.

 I know folks were focused mainly on buildworld.

 I personally would find it interesting if someone with a higher-end
 system (e.g. 2 physical CPUs, with 6 or 8 cores per CPU) was to do the
 same test (changing -jX to -j{numofcores} of course).

 --
 | Jeremy Chadwick                                jdc at parodius.com |
 | Parodius Networking                       http://www.parodius.com/ |
 | UNIX Systems Administrator                   Mountain View, CA, US |
 | Making life hard for others since 1977.               PGP 4BD6C0CB |


 sched_ule
 ===
 - time make -j2 buildworld
  1689.831u 229.328s 18:46.20 170.4% 6566+2051k 432+4264io 4565pf+0w
 - time make -j2 buildkernel
  640.542u 87.737s 9:01.38 134.5% 6490+1920k 134+5968io 0pf+0w


 sched_4bsd
 
 - time make -j2 buildworld
  1662.793u 206.908s 17:12.02 181.1% 6578+2054k 23750+4271io 6451pf+0w
 - time make -j2 buildkernel
  638.717u 76.146s 8:34.90 138.8% 6530+1927k 6415+5903io 0pf+0w


 software
 ==
 * sched_ule test:  FreeBSD 8.2-STABLE, Thu Dec  1 04:37:29 PST 2011
 * sched_4bsd test: FreeBSD 8.2-STABLE, Mon Dec 12 22:42:54 PST 2011

Hi Jeremy,
thanks for the time you spent on this.

However, I wanted to ask about / point out 3 things:
1) Did you use 2 different code bases for the test? (one updated on
December 1 and another one on December 12)
2) Please note that you should have repeated this test several times
(basically until you get a standard deviation which is
acceptable with ministat) and reported the ministat output
3) The difference is less than 2%, which I suspect is
statistically insignificant, i.e. effectively the same

I'm not really even surprised ULE is not faster than 4BSD in this case
because buildworld/buildkernel tests are usually dominated
by I/O overhead rather than scheduler capacity. It would be
more interesting to analyze how buildworld does while another type of
workload is going on.
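
To put numbers on the quoted time(1) lines, a small sketch that parses the csh-style summary and prints the relative difference per field. The helper names are made up for illustration, and the thread does not say which field the "less than 2%" figure refers to; these are also single runs, so the percentages say nothing about statistical significance.

```python
def parse_csh_time(line):
    """Parse the leading fields of a csh-style time(1) summary line."""
    fields = line.split()
    user = float(fields[0].rstrip("u"))
    system = float(fields[1].rstrip("s"))
    # Elapsed time is printed as [hours:]minutes:seconds.
    elapsed = 0.0
    for part in fields[2].split(":"):
        elapsed = elapsed * 60 + float(part)
    return {"user": user, "system": system, "elapsed": elapsed}

def percent_diff(a, b):
    """Relative difference of a with respect to b, in percent."""
    return 100.0 * (a - b) / b

# buildworld lines copied from the quoted message.
ule = parse_csh_time(
    "1689.831u 229.328s 18:46.20 170.4% 6566+2051k 432+4264io 4565pf+0w")
bsd = parse_csh_time(
    "1662.793u 206.908s 17:12.02 181.1% 6578+2054k 23750+4271io 6451pf+0w")

for key in ("user", "system", "elapsed"):
    delta = percent_diff(ule[key], bsd[key])
    print(f"{key:8s} ULE {ule[key]:9.3f}  4BSD {bsd[key]:9.3f}  diff {delta:+.2f}%")
```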

Thanks,
Attilio




Re: SCHED_ULE should not be the default

2011-12-15 Thread Attilio Rao
2011/12/15 Mike Tancsa m...@sentex.net:
 On 12/15/2011 11:26 AM, Attilio Rao wrote:

 Hi Mike,
 was that just the same codebase with the switch SCHED_4BSD/SCHED_ULE?

 Hi Attilio,
        It was the same codebase.


 Could you retry the bench checking CPU usage and possible thread
 migration around for both cases?

 I can, but how do I do that ?

I'm now thinking of a better test case for this: can you try that on a
tmpfs volume?

Also, what filesystem were you using? How many CPUs were in place?
Did you reboot before changing the steal_thresh value?

Attilio




Re: SCHED_ULE should not be the default

2011-12-15 Thread Attilio Rao
2011/12/15 Mike Tancsa m...@sentex.net:
 On 12/15/2011 11:42 AM, Attilio Rao wrote:

 I'm thinking now to a better test-case for this: can you try that on a
 tmpfs volume?

 There is enough RAM in the box so that it should not touch the disk, and
 I was sending the output to /dev/null, so it was not writing to the disk.


 Also what filesystem you were using?

 UFS

 How many CPUs were in place?

 4

 Did you reboot before to move the steal_thresh value?

 No.

So, as a very first step, can you try the following:
- Same codebase, etc. etc.
- Run the test 4 times, discard the first run and ministat the other 3
- Reboot
- Change the steal_thresh value
- Run the test 4 times, discard the first run and ministat the other 3

Then report the discarded values and the ministat output, and we will have
more information, I guess
(also, I don't think devfs contention should play a role here, so
never mind about it for now).
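
The "discard the first run, summarize the rest" step can be sketched like this. It is only an illustration of the bookkeeping, not a replacement for ministat; the function name is made up and the sample times are hypothetical, not measurements from this thread.

```python
import statistics

def summarize_runs(times):
    """Drop the first (warm-up) run and summarize the remaining ones."""
    if len(times) < 3:
        raise ValueError("need at least three runs: one discarded, two kept")
    discarded, kept = times[0], times[1:]
    return {
        "discarded": discarded,
        "n": len(kept),
        "mean": statistics.mean(kept),
        "stddev": statistics.stdev(kept),  # sample standard deviation
    }

# Hypothetical wall-clock times in seconds, NOT results from this thread.
example = summarize_runs([41.0, 38.0, 39.0, 40.0])
print(example)
```

Keeping the discarded value around matters because a first run that is far off from the others is itself a hint about cache or I/O effects.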

Thanks,
Attilio




Re: SCHED_ULE should not be the default

2011-12-15 Thread Attilio Rao
2011/12/15 Jeremy Chadwick free...@jdc.parodius.com:
 On Thu, Dec 15, 2011 at 05:26:27PM +0100, Attilio Rao wrote:
 2011/12/13 Jeremy Chadwick free...@jdc.parodius.com:
  On Mon, Dec 12, 2011 at 02:47:57PM +0100, O. Hartmann wrote:
   Not fully right, boinc defaults to run on idprio 31 so this isn't an
   issue. And yes, there are cases where SCHED_ULE shows much better
   performance than SCHED_4BSD.  [...]
 
  Do we have any proof at hand for such cases where SCHED_ULE performs
  much better than SCHED_4BSD? Whenever the subject comes up, it is
  mentioned, that SCHED_ULE has better performance on boxes with a ncpu >
  2. But in the end I see here contradictory statements. People
  complain about poor performance (especially in scientific environments),
  and other give contra not being the case.
 
  Within our department, we developed a highly scalable code for planetary
  science purposes on imagery. It utilizes present GPUs via OpenCL if
  present. Otherwise it grabs as many cores as it can.
  By the end of this year I'll get a new desktop box based on Intels new
  Sandy Bridge-E architecture with plenty of memory. If the colleague who
  developed the code is willing performing some benchmarks on the same
  hardware platform, we'll benchmark both FreeBSD 9.0/10.0 and the most
  recent Suse. For FreeBSD I intend also to look for performance with both
  different schedulers available.
 
  This is in no way shape or form the same kind of benchmark as what
  you're planning to do, but I thought I'd throw it out there for folks to
  take in as they see fit.
 
  I know folks were focused mainly on buildworld.
 
  I personally would find it interesting if someone with a higher-end
  system (e.g. 2 physical CPUs, with 6 or 8 cores per CPU) was to do the
  same test (changing -jX to -j{numofcores} of course).
 
  --
  | Jeremy Chadwick                                jdc at parodius.com |
  | Parodius Networking                       http://www.parodius.com/ |
  | UNIX Systems Administrator                   Mountain View, CA, US |
  | Making life hard for others since 1977.               PGP 4BD6C0CB |
 
 
  sched_ule
  ===
  - time make -j2 buildworld
    1689.831u 229.328s 18:46.20 170.4% 6566+2051k 432+4264io 4565pf+0w
  - time make -j2 buildkernel
    640.542u 87.737s 9:01.38 134.5% 6490+1920k 134+5968io 0pf+0w
 
 
  sched_4bsd
  
  - time make -j2 buildworld
    1662.793u 206.908s 17:12.02 181.1% 6578+2054k 23750+4271io 6451pf+0w
  - time make -j2 buildkernel
    638.717u 76.146s 8:34.90 138.8% 6530+1927k 6415+5903io 0pf+0w
 
 
  software
  ==
  * sched_ule test:  FreeBSD 8.2-STABLE, Thu Dec  1 04:37:29 PST 2011
  * sched_4bsd test: FreeBSD 8.2-STABLE, Mon Dec 12 22:42:54 PST 2011

 Hi Jeremy,
 thanks for the time you spent on this.

 However, I wanted to ask/let you note 3 things:
 1) Did you use 2 different code base for the test? (one updated on
 December 1 and another one on December 12)

 No; src-all (/usr/src on this system) was not updated between December
 1st and December 12th PST.  I do believe I updated it today (15th PST).
 I can/will obviously hold off so that we have a consistent code base for
 comparing numbers between schedulers during buildworld and/or
 buildkernel.

 2) Please note that you should have repeated this test several times
 (basically until you don't get a standard deviation which is
 acceptable with ministat) and report the ministat output

 This is the first time I have heard of ministat(1).  I'm pretty sure I
 see what it's for and how it applies to this situation, but boy that man
 page could use some clarification (I have 3 people looking at this thing
 right now trying to figure out what means what in the graph :-) ).
 Anyway, graph or not, I see the point.

 Regarding multiple tests: yup, you're absolutely right, the only way to
 do it would be to run a sequence of tests repeatedly (probably 10 per
 scheduler).  Reboots and rm -fr /usr/obj/* would be required after each
 test too, to guarantee empty kernel caches (of all types) consistently
 every time.

 What I posted was supposed to give people just a general idea if there
 was any gigantic difference between the two, and there really isn't.
 But, as others have stated (and you below), buildworld may not be an
 effective way to benchmark what we're trying to test.

 Hence me wondering exactly what would make for a good test.  Example:

 1. Run + background some program that beats on things (I really don't
 know what; creation/deletion of threads?  CPU benchmark?  bonnie++?),
 with output going to /dev/null.
 2. Run + background time make -j2 buildworld with output going to /dev/null
 3. Record/save output from time.
 4. rm -fr /usr/obj && shutdown -r now
 5. Repeat all steps ~10 times
 6. Adjust kernel configuration file to use other scheduler
 7. Repeat steps 1-5.

 What I'm trying to figure out is what #1 and #2 should

Re: SCHED_ULE should not be the default

2011-12-15 Thread Attilio Rao
2011/12/15 Mike Tancsa m...@sentex.net:
 On 12/15/2011 11:56 AM, Attilio Rao wrote:
 So, as very first thing, can you try the following:
 - Same codebase, etc. etc.
 - Make the test 4 times, discard the first and ministat for the other 3
 - Reboot
 - Change the steal_thresh value
 - Make the test 4 times, discard the first and ministat for the other 3

 Then report discarded values and the ministated one and we will have
 more informations I guess
 (also, I don't think devfs contention should play a role here, thus
 nevermind about it for now).


 Results and data at

 http://www.tancsa.com/ule-bsd.html

I'm not totally sure: what does burnP6 do? Is it a CPU-bound workload?
Also, how many threads are spawned in your case for parallel bzip2?

Also, it would be very good if you could arrange these tests against
a newer -CURRENT (with userland and kernel debugging off).

Thanks a lot for your hard work,
Attilio




Re: SCHED_ULE should not be the default

2011-12-09 Thread Attilio Rao
2011/12/9 George Mitchell george+free...@m5p.com:
 dnetc is an open-source program from http://www.distributed.net/.  It
 tries a brute-force approach to cracking RC4 puzzles and also computes
 optimal Golomb rulers.  It starts up one process per CPU and runs at
 nice 20 and is, for all intents and purposes, 100% compute bound.

 Here is what happens on my system, running 9.0-PRERELEASE, with and
 without dnetc running, with SCHED_ULE and SCHED_4BSD, when I run the
 command:

 time make buildkernel KERNCONF=WONDERLAND

 (I get similar results on 8.x as well.)

 SCHED_4BSD, dnetc not running:
 1329.715u 123.739s 24:47.95 97.6%       6310+1987k 11233+11098io 419pf+0w

 SCHED_4BSD, dnetc running:
 1329.364u 115.158s 26:14.83 91.7%       6325+1987k 10912+11060io 393pf+0w

 SCHED_ULE, dnetc not running:
 1357.457u 121.526s 25:20.64 97.2%       6326+1990k 11234+11149io 419pf+0w

 SCHED_ULE, dnetc running:
 Still going after seven and a half hours of clock time, up to
 compiling netgraph/bluetooth.  (Completed in another five minutes
 after stopping dnetc so I could write this message in a reasonable
 amount of time.)

 Not everybody runs this sort of program, but there are plenty of
 similar projects out there, and people who try to participate in
 them will be mightily displeased with their FreeBSD systems when
 they do.  Is there some case where SCHED_ULE exhibits significantly
 better performance than SCHED_4BSD?  If not, I think SCHED_4BSD
 should remain the default GENERIC configuration until this is fixed.

Hi George,
are you interested in exploring the SCHED_ULE and dnetc case further?

More precisely I'd be interested in KTR traces.
To be even more precise:
With a completely stable GENERIC configuration (or otherwise please
post your kernel config) please add the following:
options KTR
options KTR_ENTRIES=262144
options KTR_COMPILE=(KTR_SCHED)
options KTR_MASK=(KTR_SCHED)

While you are in the middle of the slow-down (so once it is well
established) please do:
# sysctl debug.ktr.cpumask=

Finally, run:
# ktrdump -ctf > ktr-ule-problem.out

and send the file to this mailing list.

Thanks,
Attilio




Re: SCHED_ULE should not be the default

2011-12-09 Thread Attilio Rao
2011/12/10 George Mitchell george+free...@m5p.com:
 On 12/09/11 10:17, Attilio Rao wrote:

 [...]

 More precisely I'd be interested in KTR traces.
 To be even more precise:
 With a completely stable GENERIC configuration (or otherwise please
 post your kernel config) please add the following:
 options KTR
 options KTR_ENTRIES=262144
 options KTR_COMPILE=(KTR_SCHED)
 options KTR_MASK=(KTR_SCHED)

 While you are in the middle of the slow-down (so once it is well
 established) please do:
 # sysclt debug.ktr.cpumask=


 wonderland# sysctl debug.ktr.cpumask=
 debug.ktr.cpumask: 
 sysctl: debug.ktr.cpumask: Invalid argument



 In the end go with:
 # ktrdump -ctf > ktr-ule-problem.out


 It's 44MB, so it's at http://www.m5p.com/~george/ktr-ule-problem.out

What svn revision did you use for it?
What are the CPU frequencies of the machine that generated this?

Attilio




Re: SCHED_ULE should not be the default

2011-12-09 Thread Attilio Rao
2011/12/10 Eitan Adler li...@eitanadler.com:
 On Fri, Dec 9, 2011 at 8:15 PM, George Mitchell geo...@m5p.com wrote:
 Hope the attached helps.                         -- George Mitchell

 You attached dmesg, not a patch.

This is what is needed for a schedgraph analysis, along with KTR
points collection.

Attilio




Re: FreeBSD 9-Beta3 on X300 2 problems

2011-09-28 Thread Attilio Rao
2011/9/27 crsnet.pl crs...@crsnet.pl:
 Hi,

 Hello, thanks for reply.

 Please try to do this without wlan loaded at all (not just down, but
 build your wifi support as a module.)
 Then try without X, see whether it's related to that or not.

 First I ran kldunload if_iwn.
 When I try to suspend from X, Xorg closes, I see the console, and the laptop
 suspends. When I resume it, I get the console (no key works); when I try
 ALT+F9 I get a black screen and a beep ;/

 But when I try to suspend from the console, I get:
 pci0: failed to set ACPI power state D2 \_SB_.PCI0_EXP0: AE_BAD_PARAMETER
 pci0: failed to set ACPI power state D2 \_SB_.PCI0_EXP1: AE_BAD_PARAMETER
 pci0: failed to set ACPI power state D2 \_SB_.PCI0_EXP2: AE_BAD_PARAMETER
 and the laptop suspends. When I resume it, it hangs; pressing any button
 does nothing. Then I see this on the console:
 ugen0.2: Broadcom Corp ... disconnected
 ugen4.2: Sierra Wireless ... disconnected
 ubt0: at uhub0 ... disconnected
 then I see the letters I pressed,
 and
 acpi0: suspend request ignored (not ready yet), and the laptop hangs and beeps ;/

 (And you haven't told us what your hardware is.)

 #dmesg (+WITNESS)
 Copyright (c) 1992-2011 The FreeBSD Project.
 Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
 FreeBSD is a registered trademark of The FreeBSD Foundation.
 FreeBSD 9.0-BETA3 #3: Tue Sep 27 10:47:57 CEST 2011
    cr4sh@x300:/sys/amd64/compile/GENERIC amd64
 WARNING: WITNESS option enabled, expect reduced performance.
 CPU: Intel(R) Core(TM)2 Duo CPU     L7100  @ 1.20GHz (1197.03-MHz K8-class CPU)
  Origin = GenuineIntel  Id = 0x6fb  Family = 6  Model = f  Stepping = 11
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0xe3bd<SSE3,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM>
  AMD Features=0x20100800<SYSCALL,NX,LM>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant, performance statistics
 real memory  = 2147483648 (2048 MB)
 avail memory = 2019139584 (1925 MB)
 Event timer LAPIC quality 400
 ACPI APIC Table: LENOVO TP-7T   
 FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
 FreeBSD/SMP: 1 package(s) x 2 core(s)
  cpu0 (BSP): APIC ID:  0
  cpu1 (AP): APIC ID:  1
 ACPI Warning: 32/64X length mismatch in Gpe1Block: 0/32
 (20110527/tbfadt-556)
 ACPI Warning: Optional field Gpe1Block has zero address or length:
 0x102C/0x0 (20110527/tbfadt-586)
 ioapic0: Changing APIC ID to 1
 ioapic0 Version 2.0 irqs 0-23 on motherboard
 kbd1 at kbdmux0
 acpi0: LENOVO TP-7T on motherboard
 CPU0: local APIC error 0x40
 acpi_ec0: Embedded Controller: GPE 0x12, ECDT port 0x62,0x66 on acpi0
 acpi0: Power Button (fixed)
 acpi0: reservation of 0, a (3) failed
 acpi0: reservation of 10, 7ef0 (3) failed
 Timecounter ACPI-fast frequency 3579545 Hz quality 900
 acpi_timer0: 24-bit timer at 3.579545MHz port 0x1008-0x100b on acpi0
 cpu0: ACPI CPU on acpi0
 cpu1: ACPI CPU on acpi0
 acpi_lid0: Control Method Lid Switch on acpi0
 acpi_button0: Sleep Button on acpi0
 pcib0: ACPI Host-PCI bridge port 0xcf8-0xcff on acpi0
 pci0: ACPI PCI bus on pcib0
 vgapci0: VGA-compatible display port 0x1800-0x1807 mem
 0xfa00-0xfa0f,0xe000-0xefff irq 16 at device 2.0 on pci0
 agp0: Intel GM965 SVGA controller on vgapci0
 agp0: aperture size is 256M, detected 7676k stolen memory
 vgapci1: VGA-compatible display mem 0xfa10-0xfa1f at device 2.1 on
 pci0
 pci0: simple comms at device 3.0 (no driver attached)
 atapci0: Intel ATA controller port
 0x1828-0x182f,0x180c-0x180f,0x1820-0x1827,0x1808-0x180b,0x1810-0x181f irq 18
 at device 3.2 on pci0
 ata2: ATA channel 0 on atapci0
 ata3: ATA channel 1 on atapci0
 pci0: simple comms, UART at device 3.3 (no driver attached)
 em0: Intel(R) PRO/1000 Network Connection 7.2.3 port 0x1840-0x185f mem
 0xfa20-0xfa21,0xfa225000-0xfa225fff irq 20 at device 25.0 on pci0
 em0: Using an MSI interrupt
 acquiring duplicate lock of same type: network driver
  1st dev_spec->swflag_mutex @ dev/e1000/e1000_ich8lan.c:785
  2nd dev_spec->nvm_mutex @ dev/e1000/e1000_ich8lan.c:751

I think that MTX_NETWORK_LOCK is not suitable for this case as you
will have 2 different locks with the same name in softc.

I think that this patch should be good to go (and fixes the WITNESS warning):
http://www.freebsd.org/~attilio/e1000_mutex_init.patch
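
To illustrate why two locks sharing one type trigger the warning, here is a toy model in Python. It is not FreeBSD's WITNESS code, only a sketch of the idea; the class, its methods and the "distinct" type strings are all hypothetical. What the linked patch actually changes is not shown in this thread, so the second half is just one conceivable way to avoid the warning.

```python
class ToyWitness:
    """Toy model of a duplicate-lock-type check; not FreeBSD's WITNESS."""

    def __init__(self):
        self.held = []      # (name, lock_type) pairs currently held
        self.warnings = []

    def acquire(self, name, lock_type, dup_ok=False):
        # Warn when a lock of the same type is already held.
        for held_name, held_type in self.held:
            if held_type == lock_type and not dup_ok:
                self.warnings.append(
                    f"duplicate lock of type {lock_type!r}: "
                    f"{held_name} then {name}"
                )
        self.held.append((name, lock_type))

    def release(self, name):
        self.held = [entry for entry in self.held if entry[0] != name]

# Both mutexes share one type, as in the report above: one warning.
shared = ToyWitness()
shared.acquire("swflag_mutex", "network driver")
shared.acquire("nvm_mutex", "network driver")

# Distinct types per mutex (hypothetical type strings): no warning.
distinct = ToyWitness()
distinct.acquire("swflag_mutex", "e1000 swflag")
distinct.acquire("nvm_mutex", "e1000 nvm")

print(len(shared.warnings), len(distinct.warnings))
```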

Thanks,
Attilio




Re: panic: spin lock held too long (RELENG_8 from today)

2011-09-01 Thread Attilio Rao
2011/9/1 Trent Nelson tr...@snakebite.org:

 On Aug 19, 2011, at 7:53 PM, Attilio Rao wrote:

 If nobody complains about it earlier, I'll propose the patch to re@ in 8 
 hours.

 Just a friendly 'me too', for the records.  22 hours of heavy network/disk 
 I/O and no panic yet -- prior to the patch it was a panic orgy.

 Any response from re@ on the patch?  It didn't appear to be in stable/8 as of 
 yesterday:

It has been committed to STABLE_8 as r225288.

Thanks,
Attilio




Re: USB/coredump hangs in 8 and 9

2011-08-19 Thread Attilio Rao
2011/8/12 Andrew Boyer abo...@averesystems.com:
 Re: panic: bufwrite: buffer is not busy??? (originally on freebsd-net)
 Re: debugging frequent kernel panics on 8.2-RELEASE (originally on 
 freebsd-stable)
 Re: System hang in USB umass module while processing panic  (originally on 
 freebsd-usb)

 Hello Andriy and Hans,

 Sorry for tying in so many discussions on this topic, but I think I have an 
 explanation for the problems we have been reporting* with hanging coredumps 
 on multicore systems on 8.2-RELEASE, and it has implications for Andriy's 
 proposed scheduler patch** and for USB.

 In today's 8.X and 9.X branches, nothing that I can find stops the other CPUs 
 when the kernel panics, but many parts of the locking code get disabled (grep 
 on 'panicstr').  The 'bufwrite: buffer is not busy???' panic is caused by the 
 syncer encountering an error.  If that happens when it's on the dumping CPU 
 everything hangs.  If it's running on a different CPU, it will be blocked and 
 hidden by the panic_cpu spinlock in panic(), and the dump continues, polling 
 every attached keyboard for a Ctl-C.

 But, the new 8.X USB stack relies on multithreading.  (The new stack is the 
 variable that broke coredumps for us in the 7.1-8.2 transition, I think.)  
 SVN 224223 fixes a hang that would happen when dumpsys() polls the USB 
 keyboard (IPMI KVM, in our case).  That helps, but it only gets as far as 
 usb_process(), where it hangs in a loop around a cv_wait() call.  This is 
 easy to reproduce by adding code to the watchdog to break into the debugger 
 if panicstr is set.

 I am experimenting with Andriy's patch** to stop the scheduler and it seems 
 to be most of the way there, stopping the CPUs and disabling the rest of 
 locking.  There are a few places that still reference panicstr, but that's 
 minor.  These are the changes I made to the patch:
  * Changed ukbd_do_poll() to return immediately if SCHEDULER_STOPPED() is 
 true, so that we don't hang up in USB.  ukbd_yield()  locks up in 
 DROP_GIANT(), and if you skip ukbd_yield(), usbd_transfer_poll() locks up 
 trying to drop mutexes.
  * Changed the call to spinlock_enter() back to critical_enter(), so that 
 interrupts stay enabled and the hardclock still functions.

Which spinlock_enter() are you referring to here?
I think that having fast interrupt handlers running during
panic/shutdown is something we should avoid like hell.

Attilio




Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-19 Thread Attilio Rao
If nobody complains about it earlier, I'll propose the patch to re@ in 8 hours.

Attilio

2011/8/19 Mike Tancsa m...@sentex.net:
 On 8/18/2011 8:37 PM, Chip Camden wrote:

 st Thanks, Attilio.  I've applied the patch and removed the extra debug
 st options I had added (though keeping debug symbols).  I'll let you know 
 if
 st I experience any more panics.

  No panic for 20 hours at this moment, FYI.  For my NFS server, I
  think another 24 hours would be sufficient to confirm the stability.
  I will see how it works...

 -- Hiroki

 Likewise:

 $ uptime
  5:37PM  up 21:45, 5 users, load averages: 0.68, 0.45, 0.63

 So far, so good (knocks on head).



 0(ns4)% uptime
  8:55AM  up 22:39, 3 users, load averages: 0.01, 0.00, 0.00
 0(ns4)%


 So far so good for me too

        ---Mike

 --
 ---
 Mike Tancsa, tel +1 519 651 3400
 Sentex Communications, m...@sentex.net
 Providing Internet services since 1994 www.sentex.net
 Cambridge, Ontario Canada   http://www.tancsa.com/






Re: debugging frequent kernel panics on 8.2-RELEASE

2011-08-18 Thread Attilio Rao
2011/8/18 Andriy Gapon a...@freebsd.org:
 on 17/08/2011 23:21 Andriy Gapon said the following:

 It seems like everything starts with some kind of a race between terminating
 processes in a jail and termination of the jail itself.  This is where the
 details are very thin so far.  What we see is that a process (http) is in
 exit(2) syscall, in exit1() function actually, and past the place where
 P_WEXIT flag is set and even past the place where p_limit is freed and reset
 to NULL.  At that place the thread calls prison_proc_free(), which calls
 prison_deref().  Then, we see that in prison_deref() the thread gets a page
 fault because of what seems like a NULL pointer dereference.  That's just the
 start of the problem and its root cause.

 Then, trap_pfault() gets invoked and, because addresses close to NULL look
 like userspace addresses, vm_fault/vm_fault_hold gets called, which in its
 turn goes on to call vm_map_growstack.  First thing that vm_map_growstack
 does is a call to lim_cur(), but because p_limit is already NULL, that call
 results in a NULL pointer dereference and a page fault.  Goto the beginning
 of this paragraph.

 So we get this recursion of sorts, which only ends when a stack is exhausted
 and a CPU generates a double-fault.

 BTW, does anyone has an idea why the thread in question would disappear from
 the kgdb's point of view?

 (kgdb) p cpuid_to_pcpu[2]->pc_curthread->td_tid
 $3 = 102057
 (kgdb) tid 102057
 invalid tid

 info threads also doesn't list the thread.

 Is it because the panic happened while the thread was somewhere in exit1()?
 is there an easy way to examine its stack in this case?

Yes, that is likely the reason.

The 'tid' command should look the thread up via tid_to_thread() (or a
similarly named function), which returns NULL here; that means the thread
has already passed the point where it was still in the lookup table.

Attilio




Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-17 Thread Attilio Rao
2011/8/17 Hiroki Sato h...@freebsd.org:
 Hi,

 Mike Tancsa m...@sentex.net wrote
  in 4e15a08c.6090...@sentex.net:

 mi On 7/7/2011 7:32 AM, Mike Tancsa wrote:
 mi  On 7/7/2011 4:20 AM, Kostik Belousov wrote:
 mi 
 mi  BTW, we had a similar panic, spinlock held too long, the spinlock
 mi  is the sched lock N, on busy 8-core box recently upgraded to the
 mi  stable/8. Unfortunately, machine hung dumping core, so the stack trace
 mi  for the owner thread was not available.
 mi 
 mi  I was unable to make any conclusion from the data that was present.
 mi  If the situation is reproducible, you could try to revert r221937. This
 mi  is pure speculation, though.
 mi 
 mi  Another crash just now after 5hrs uptime. I will try and revert r221937
 mi  unless there is any extra debugging you want me to add to the kernel
 mi  instead  ?

  I am also suffering from a reproducible panic on an 8-STABLE box, an
  NFS server with heavy I/O load.  I could not get a kernel dump
  because this panic locked up the machine just after it occurred, but
  according to the stack trace it was the same as posted one.
  Switching to an 8.2R kernel can prevent this panic.

  Any progress on the investigation?

Hiroki,
how easily can you reproduce it?

It would be important to have a DDB textdump with this information:
- bt
- ps
- show allpcpu
- alltrace

Alternatively, a coredump taken with the stop-CPU patch, which Andriy can provide.

Thanks,
Attilio




Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-17 Thread Attilio Rao
2011/8/18 Hiroki Sato h...@freebsd.org:
 Hiroki Sato h...@freebsd.org wrote
  in 20110818.043332.27079545013461535@allbsd.org:

 hr Attilio Rao atti...@freebsd.org wrote
 hr   in caj-fndcdow0_b2mv0lzeo-tpea9+7oanj7ihvkqsm4j4b0d...@mail.gmail.com:
 hr
 hr at 2011/8/17 Hiroki Sato h...@freebsd.org:
 hr at  Hi,
 hr at 
 hr at  Mike Tancsa m...@sentex.net wrote
 hr at   in 4e15a08c.6090...@sentex.net:
 hr at 
 hr at  mi On 7/7/2011 7:32 AM, Mike Tancsa wrote:
 hr at  mi  On 7/7/2011 4:20 AM, Kostik Belousov wrote:
 hr at  mi 
 hr at  mi  BTW, we had a similar panic, spinlock held too long, the spinlock
 hr at  mi  is the sched lock N, on busy 8-core box recently upgraded to the
 hr at  mi  stable/8. Unfortunately, machine hung dumping core, so the stack trace
 hr at  mi  for the owner thread was not available.
 hr at  mi 
 hr at  mi  I was unable to make any conclusion from the data that was present.
 hr at  mi  If the situation is reproducible, you could try to revert r221937. This
 hr at  mi  is pure speculation, though.
 hr at  mi 
 hr at  mi  Another crash just now after 5hrs uptime. I will try and revert r221937
 hr at  mi  unless there is any extra debugging you want me to add to the kernel
 hr at  mi  instead  ?
 hr at 
 hr at   I am also suffering from a reproducible panic on an 8-STABLE box, an
 hr at   NFS server with heavy I/O load.  I could not get a kernel dump
 hr at   because this panic locked up the machine just after it occurred, but
 hr at   according to the stack trace it was the same as posted one.
 hr at   Switching to an 8.2R kernel can prevent this panic.
 hr at 
 hr at   Any progress on the investigation?
 hr at
 hr at Hiroki,
 hr at how easily can you reproduce it?
 hr
 hr  It takes 5-10 hours.  I installed another kernel for debugging just
 hr  now, so I think I will be able to collect more detailed information in
 hr  a couple of days.
 hr
 hr at It would be important to have a DDB textdump with this information:
 hr at - bt
 hr at - ps
 hr at - show allpcpu
 hr at - alltrace
 hr at
 hr at Alternatively, a coredump which has the stop cpu patch which Andriy
 hr at can provide.
 hr
 hr  Okay, I will post them once I can get another panic.  Thanks!

  I got the panic with a crash dump this time.  The result of bt, ps,
  allpcpu, and traces can be found at the following URL:

  http://people.allbsd.org/~hrs/FreeBSD/pool-panic_20110818-1.txt

I'm not sure I understand it. Is a corefile also available?
If so, where can I get it (along with the relevant sources and kernel.debug)?

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: panic: spin lock held too long (RELENG_8 from today)

2011-08-17 Thread Attilio Rao
2011/8/18 Hiroki Sato h...@freebsd.org:
 Hiroki Sato h...@freebsd.org wrote
  in 20110818.043332.27079545013461535@allbsd.org:

 hr Attilio Rao atti...@freebsd.org wrote
 hr   in caj-fndcdow0_b2mv0lzeo-tpea9+7oanj7ihvkqsm4j4b0d...@mail.gmail.com:
 hr
 hr at 2011/8/17 Hiroki Sato h...@freebsd.org:
 hr at  Hi,
 hr at 
 hr at  Mike Tancsa m...@sentex.net wrote
 hr at   in 4e15a08c.6090...@sentex.net:
 hr at 
 hr at  mi On 7/7/2011 7:32 AM, Mike Tancsa wrote:
 hr at  mi  On 7/7/2011 4:20 AM, Kostik Belousov wrote:
 hr at  mi 
 hr at  mi  BTW, we had a similar panic, spinlock held too long, the spinlock
 hr at  mi  is the sched lock N, on busy 8-core box recently upgraded to the
 hr at  mi  stable/8. Unfortunately, machine hung dumping core, so the stack trace
 hr at  mi  for the owner thread was not available.
 hr at  mi 
 hr at  mi  I was unable to make any conclusion from the data that was present.
 hr at  mi  If the situation is reproducible, you could try to revert r221937. This
 hr at  mi  is pure speculation, though.
 hr at  mi 
 hr at  mi  Another crash just now after 5hrs uptime. I will try and revert r221937
 hr at  mi  unless there is any extra debugging you want me to add to the kernel
 hr at  mi  instead  ?
 hr at 
 hr at   I am also suffering from a reproducible panic on an 8-STABLE box, an
 hr at   NFS server with heavy I/O load.  I could not get a kernel dump
 hr at   because this panic locked up the machine just after it occurred, but
 hr at   according to the stack trace it was the same as posted one.
 hr at   Switching to an 8.2R kernel can prevent this panic.
 hr at 
 hr at   Any progress on the investigation?
 hr at
 hr at Hiroki,
 hr at how easily can you reproduce it?
 hr
 hr  It takes 5-10 hours.  I installed another kernel for debugging just
 hr  now, so I think I will be able to collect more detailed information in
 hr  a couple of days.
 hr
 hr at It would be important to have a DDB textdump with this information:
 hr at - bt
 hr at - ps
 hr at - show allpcpu
 hr at - alltrace
 hr at
 hr at Alternatively, a coredump which has the stop cpu patch which Andriy
 hr at can provide.
 hr
 hr  Okay, I will post them once I can get another panic.  Thanks!

  I got the panic with a crash dump this time.  The result of bt, ps,
  allpcpu, and traces can be found at the following URL:

  http://people.allbsd.org/~hrs/FreeBSD/pool-panic_20110818-1.txt

Actually, I think I see the bug here.

In callout_cpu_switch(), if a low-priority thread is migrating the
callout and gets preempted after it has released the outgoing CPU's
queue lock (and is scheduled again much later), we get this problem.

In order to fix this bug it could be enough to use a critical section,
but I think this should really be interrupt safe, so I'd wrap the
operations with spinlock_enter()/spinlock_exit(). Fortunately
callout_cpu_switch() should be called rarely, and we already do
expensive locking operations in callout, so this should not be a
problem performance-wise.
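
For illustration only, the shape of that fix can be sketched in userspace
C. This is not the patch linked below: spinlock_enter()/spinlock_exit()
are modeled here by a simple nesting counter, and struct callout_cpu,
cc_lock(), cc_unlock(), can_be_preempted() and
callout_cpu_switch_sketch() are hypothetical stand-ins invented for the
example.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace stand-ins (assumptions for illustration only): in the kernel,
 * spinlock_enter()/spinlock_exit() disable and restore interrupts on the
 * current CPU.  Here they only track a nesting depth.
 */
static int spin_depth;

static void
spinlock_enter(void)
{
	spin_depth++;
}

static void
spinlock_exit(void)
{
	assert(spin_depth > 0);
	spin_depth--;
}

/* True when the current "thread" could be preempted in this model. */
static bool
can_be_preempted(void)
{
	return (spin_depth == 0);
}

/* Hypothetical, heavily simplified per-CPU callout queue. */
struct callout_cpu {
	bool locked;
};

static void
cc_lock(struct callout_cpu *cc)
{
	assert(!cc->locked);
	cc->locked = true;
}

static void
cc_unlock(struct callout_cpu *cc)
{
	assert(cc->locked);
	cc->locked = false;
}

/*
 * Sketch of the migration step: drop the old CPU's queue lock and take
 * the new one inside a spinlock section, so the window in which neither
 * queue lock is held cannot be preempted.  *window_preemptible records
 * whether that window was exposed to preemption (it should not be).
 */
static struct callout_cpu *
callout_cpu_switch_sketch(struct callout_cpu *old_cc,
    struct callout_cpu *new_cc, bool *window_preemptible)
{
	spinlock_enter();
	cc_unlock(old_cc);
	*window_preemptible = can_be_preempted();	/* no queue lock held */
	cc_lock(new_cc);
	spinlock_exit();
	return (new_cc);
}
```

The only point of the sketch is that the unlock/lock pair sits entirely
between spinlock_enter() and spinlock_exit(), so the thread cannot be
preempted while it holds neither queue lock.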

Can the guys I also CC'ed here try the following patch, with all the
initial kernel options that were leading you to the deadlock? (thus
revert any debugging patch/option you added for the moment):
http://www.freebsd.org/~attilio/callout-fixup.diff

Please note that this patch is for STABLE_8; if you can confirm the
good result I'll commit to -CURRENT and then backmerge as soon as
possible.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: debugging frequent kernel panics on 8.2-RELEASE

2011-08-11 Thread Attilio Rao
I'd really point the finger to faulty hw.

Please run all the necessary diagnostic tools for catching it.

Attilio

2011/8/11 Andriy Gapon a...@freebsd.org:
 on 10/08/2011 18:35 Steven Hartland said the following:
 Fatal double fault
 rip = 0x8052f6f1
 rsp = 0xff86ce600fb0
 rbp = 0xff86ce601210
 cpuid = 0; apic id = 00
 panic: double fault
 cpuid = 0
 KDB: stack backtrace:
 #0 0x803af91e at kdb_backtrace+0x5e
 #1 0x8037d817 at panic+0x187
 #2 0x80574316 at dblfault_handler+0x96
 #3 0x8055d06d at Xdblfault+0xad
 [snip]
 #0  sched_switch (td=0x80830bc0, newtd=0xff000a73f8c0, flags=Variable flags is not available.)
    at /usr/src/sys/kern/sched_ule.c:1858
 1858                    cpuid = PCPU_GET(cpuid);
 (kgdb)
 #0  sched_switch (td=0x80830bc0, newtd=0xff000a73f8c0, flags=Variable flags is not available.)
    at /usr/src/sys/kern/sched_ule.c:1858
 #1  0x80385c86 in mi_switch (flags=260, newtd=0x0)
    at /usr/src/sys/kern/kern_synch.c:449
 #2  0x803b92d2 in sleepq_timedwait (wchan=0x80830760, pri=68)
    at /usr/src/sys/kern/subr_sleepqueue.c:644
 #3  0x803861e1 in _sleep (ident=0x80830760, lock=0x0,
    priority=Variable priority is not available.
 ) at /usr/src/sys/kern/kern_synch.c:230
 #4  0x80532c29 in scheduler (dummy=Variable dummy is not available.
 ) at /usr/src/sys/vm/vm_glue.c:807
 #5  0x80335d67 in mi_startup () at /usr/src/sys/kern/init_main.c:254
 #6  0x8016efac in btext () at /usr/src/sys/amd64/amd64/locore.S:81
 #7  0x808556e0 in sleepq_chains ()
 #8  0x8083b1e0 in cpu_top ()
 #9  0x in ?? ()
 #10 0x80830bc0 in proc0 ()
 #11 0x80ba4b90 in ?? ()
 #12 0x80ba4b38 in ?? ()
 #13 0xff000a73f8c0 in ?? ()
 #14 0x803a2cc9 in sched_switch (td=0x0, newtd=0x0, flags=Variable flags is not available.
 )
    at /usr/src/sys/kern/sched_ule.c:1852
 Previous frame inner to this frame (corrupt stack?)
 (kgdb)

 Looks like this is just the first thread in the kernel.
 Perhaps 'thread apply all bt' could help to find the culprit.

 --
 Andriy Gapon




-- 
Peace can only be achieved by understanding - A. Einstein


Re: debugging frequent kernel panics on 8.2-RELEASE

2011-08-11 Thread Attilio Rao
2011/8/11 Jeremy Chadwick free...@jdc.parodius.com:
 On Thu, Aug 11, 2011 at 09:59:36AM +0100, Steven Hartland wrote:
 That's not the issue as its happening across board over 130 machines :(

 Agreed, bad hardware sounds unlikely here.  I could believe some strange
 incompatibility (e.g. BIOS quirk or the like[1]) that might cause problems
 en masse across many servers, but hardware issues are unlikely in this
 situation.

 [1]: I mention this because we had something similar happen at my
 workplace.  For months we used a specific model of system from our
 vendor which worked reliably, zero issues.  Then we got a new shipment
 of boxes (same model as prior) which started acting very odd (often AHCI
 timeout issues or MCEs which when decoded would usually turn out to be
 nonsensical).  It took weeks to determine the cause given how slow the
 vendor was to respond: root cause turned out to be that the vendor
 decided, on a whim, to start shipping a newer BIOS version which wasn't
 as compatible with Solaris as previous BIOSes.  Downgrading all the
 systems to the older BIOS fixed the problem.

That falls in the hw problem category for me.

Anyway, we would really need much more information in order to take
any proactive action.

Would it be possible to get access to one of the panicking machines? Is
it always the same panic, or does it vary (e.g. once a page fault, once
a fatal double fault, once a fatal trap, etc.)?

Whatever information you can provide may be valuable here.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: [poll / rfc] kdb_stop_cpus

2011-06-04 Thread Attilio Rao
2011/6/3 Nathan Whitehorn nwhiteh...@freebsd.org:
 On 06/03/11 10:13, Andriy Gapon wrote:

 I wonder if anybody uses kdb_stop_cpus with non-default value.
 If, yes, I am very interested to learn about your usecase for it.

 I think that the default kdb behavior is the correct one, so it doesn't
 make sense to have a knob to turn on incorrect behavior.
 But I may be missing something obvious.

 The comment in the code doesn't really satisfy me:
 /*
  * Flag indicating whether or not to IPI the other CPUs to stop them on
  * entering the debugger.  Sometimes, this will result in a deadlock as
  * stop_cpus() waits for the other cpus to stop, so we allow it to be
  * disabled.  In order to maximize the chances of success, use a hard
  * stop for that.
  */

 The hard stop should be sufficiently mighty.
 Yes, I am aware of supposedly extremely rare situations where a deadlock
 could happen even when using hard stop.  But I'd rather fix that than have
 this switch.

 Oh, the commit message (from 2004) explains it:

 Add a new sysctl, debug.kdb.stop_cpus, which controls whether or not we
 attempt to IPI other cpus when entering the debugger in order to stop
 them while in the debugger.  The default remains to issue the stop;
 however, that can result in a hang if another cpu has interrupts disabled
 and is spinning, since the IPI won't be received and the KDB will wait
 indefinitely.  We probably need to add a timeout, but this is a useful
 stopgap in the mean time.

 But that was before we started using hard stop in this context (in 2009).

 Some non-x86 platforms (e.g. PPC) don't support real NMIs, and so this still
 applies.

Well, if I get Andriy's proposal right, he just wants to remove the
option of not stopping the CPUs on entering KDB. I'm not entirely
sure why there is a sysctl for disabling that, and I really don't want
it.

Note that the lack of an NMI/privileged interrupt is not going to
be a factor for this request, unless you are very worried about the easy
deadlock that a normal stop operation may lead to.
If that is the case, I think that the upcoming work on skipping
locking while entering KDB/panic is going to help a lot. At that
point, removing the possibility to turn off CPU stopping
will be a good idea, IMHO.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: [poll / rfc] kdb_stop_cpus

2011-06-04 Thread Attilio Rao
2011/6/4 Andriy Gapon a...@freebsd.org:
 on 03/06/2011 20:57 Robert N. M. Watson said the following:

 On 3 Jun 2011, at 16:13, Andriy Gapon wrote:

 I wonder if anybody uses kdb_stop_cpus with non-default value. If, yes, I
 am very interested to learn about your usecase for it.

 The issue that prompted the sysctl was non-NMI IPIs being used to enter the
 debugger or reboot following a core hanging with interrupts disabled. With
 the switch to NMI IPIs in some of those circumstances, life is better -- at
 least, on hardware that supports non-maskable IPIs. I seem to recall sparc64
 doesn't, however?

 Seems to be so as Nathan has also pointed out for PPC.
 For this I also plan the following change:

 commit 458ebd9aca7e91fc6e0825c727c7220ab9f61016

    generic_stop_cpus: move timeout detection code from under DIAGNOSTIC

    ... and also increase it a bit.
    IMO it's better to detect and report the (rather serious) condition and
    allow a system to proceed somehow rather than be stuck in an endless
    loop.

 diff --git a/sys/kern/subr_smp.c b/sys/kern/subr_smp.c
 index ae52f4b..4bd766b 100644
 --- a/sys/kern/subr_smp.c
 +++ b/sys/kern/subr_smp.c
 @@ -232,12 +232,10 @@ generic_stop_cpus(cpumask_t map, u_int type)
                /* spin */
                cpu_spinwait();
                i++;
 -#ifdef DIAGNOSTIC
 -               if (i == 10) {
 +               if (i == 1) {
                         printf("timeout stopping cpus\n");
                        break;
                }
 -#endif
        }

        stopping_cpu = NOCPU;

I'd also add the ability, once the deadlock is detected, to break into
KDB, and put that under DIAGNOSTIC.
I had such a patch and I used it to debug some deadlocks in the shutdown
code, but now it seems I can't find it anymore.
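
A minimal userspace sketch of that kind of bounded wait, assuming a
hypothetical iteration limit and a polling callback in place of the real
CPU masks (STOP_SPIN_LIMIT, cpumask_model_t, wait_for_stopped_cpus() and
the poll_* helpers are all made up for the example and are not the
kernel code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical bound; the value used by the real patch is not assumed. */
#define STOP_SPIN_LIMIT 1000000u

/* Model of a CPU mask: one bit per CPU. */
typedef unsigned int cpumask_model_t;

/*
 * Spin until every CPU in "wanted" has acknowledged the stop request, or
 * give up after STOP_SPIN_LIMIT iterations.  poll_stopped() is a
 * hypothetical callback returning the mask of CPUs stopped so far; in the
 * kernel that mask is updated by the other CPUs themselves.
 * Returns true if all CPUs stopped, false on timeout.
 */
static bool
wait_for_stopped_cpus(cpumask_model_t wanted,
    cpumask_model_t (*poll_stopped)(void *), void *arg)
{
	unsigned int i = 0;

	while ((poll_stopped(arg) & wanted) != wanted) {
		/* spin */
		i++;
		if (i == STOP_SPIN_LIMIT) {
			printf("timeout stopping cpus\n");
			/* A DIAGNOSTIC build could break into the debugger here. */
			return (false);
		}
	}
	return (true);
}

/* Example callback: no CPU ever acknowledges the stop request. */
static cpumask_model_t
poll_never(void *arg)
{
	(void)arg;
	return (0);
}

/* Example callback: CPU 1 acknowledges only on the third poll. */
static cpumask_model_t
poll_after(void *arg)
{
	unsigned int *calls = arg;

	return ((++*calls >= 3) ? 0x3u : 0x1u);
}
```

Reporting the timeout and returning, instead of spinning forever, is
what lets the caller decide how to proceed when a CPU never answers.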

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: 8.2-PRERELEASE freezing on reboot (-current OK)

2010-12-14 Thread Attilio Rao
2010/12/10 Arno J. Klaassen a...@heho.snv.jussieu.fr:

 Hello,

 just FYI that on an 8-way Tyan S3992-E based box, a reboot under
 8.2-PRERELEASE (in fact, 8-stable since quite a while) makes the box
 freeze, whilst the same thing under -current works OK.

 For info the end of console output in both cases as well as dmesg.boot
 for -current.

 Feel free to contact me for more info or test patches.

Hello Arno,
I'd need you to do the following things:
- Compile a new kernel including this patch:
http://www.freebsd.org/~attilio/diagno-stable8.diff

and including the kernel config options KDB, DDB, DIAGNOSTIC and WITNESS.
Please make sure to leave out the options WITNESS_SKIPSPIN and
KDB_UNATTENDED, if they are present in your config file.

These options could make the deadlock less visible, to some extent.
You may need to repeat the reboot many times in order to get
something, but if you can't reproduce it just let me know.

- When the kernel deadlocks this time, after a while it should be
able to resolve the deadlock on its own.
If that happens you will see the DDB prompt. At the DDB prompt type
the following commands:
db> ps
db> show allpcpu
db> allt
db> show alllocks

Note that this is quite a big output and you'd need a serial console to log it.
If you can't arrange a serial connection, I'll tell you what
information I specifically need to check, and you can note it down
and reply to me (the full logs would be more valuable, but it is better
than nothing).

Are the instructions clear?

Let me know if you have any question.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: stable GENERIC kernel build fails?

2010-10-26 Thread Attilio Rao
Sorry for the breakage, it should be fixed now.

Thanks,
Attilio

2010/10/26 Chip Camden sterl...@camdensoftware.com:
 After a csup, building the GENERIC kernel on amd64 fails with:

 make -V CFILES -V SYSTEM_CFILES -V GEN_CFILES |  MKDEP_CPP=cc -E
 CC=cc xargs mkdep -a -f .newdep -O2 -frename-registers -pipe
 -fno-strict-aliasing  -std=c99 -g -Wall -Wredundant-decls
 -Wnested-externs -Wstrict-prototypes  -Wmissing-prototypes
 -Wpointer-arith -Winline -Wcast-qual  -Wundef -Wno-pointer-sign
 -fformat-extensions -nostdinc  -I. -I/usr/src/sys
 -I/usr/src/sys/contrib/altq -I/usr/src/sys/contrib/ipfilter
 -I/usr/src/sys/contrib/pf -I/usr/src/sys/dev/ath
 -I/usr/src/sys/dev/ath/ath_hal -I/usr/src/sys/contrib/ngatm
 -I/usr/src/sys/dev/twa -I/usr/src/sys/gnu/fs/xfs/FreeBSD
 -I/usr/src/sys/gnu/fs/xfs/FreeBSD/support -I/usr/src/sys/gnu/fs/xfs
 -I/usr/src/sys/contrib/opensolaris/compat -I/usr/src/sys/dev/cxgb
 -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h -fno-common
 -finline-limit=8000 --param inline-unit-growth=100 --param
 large-function-growth=1000  -fno-omit-frame-pointer -mcmodel=kernel
 -mno-red-zone  -mfpmath=387 -mno-sse -mno-sse2 -mno-sse3 -mno-mmx
 -mno-3dnow  -msoft-float -fno-asynchronous-unwind-tables -ffreestanding
 -fstack-protector
 cc: /usr/src/sys/libkern/inet_ntop.c: No such file or directory
 cc: /usr/src/sys/libkern/inet_pton.c: No such file or directory
 mkdep: compile failed
 *** Error code 1

 Stop in /usr/obj/usr/src/sys/GENERIC.
 *** Error code 1

 Stop in /usr/src.
 *** Error code 1

 Stop in /usr/src.
 libertas/usr/src# uname -a
 FreeBSD libertas.local.camdensoftware.com 8.1-STABLE FreeBSD 8.1-STABLE #81: Sun Oct 24 11:46:14 PDT 2010 sterl...@libertas.local.camdensoftware.com:/usr/obj/usr/src/sys/LIBERTAS amd64

 --
 Sterling (Chip) Camden    | sterl...@camdensoftware.com | 2048D/3A978E4F
 http://camdensoftware.com | http://chipstips.com        | http://chipsquips.com




-- 
Peace can only be achieved by understanding - A. Einstein


Re: [releng_8 tinderbox] failure on amd64/amd64

2010-10-26 Thread Attilio Rao
This issue should be resolved by r214370 already; can someone else
validate this?

Thanks,
Attilio

2010/10/26 FreeBSD Tinderbox tinder...@freebsd.org:
 TB --- 2010-10-26 06:20:40 - tinderbox 2.6 running on freebsd-current.sentex.ca
 TB --- 2010-10-26 06:20:40 - starting RELENG_8 tinderbox run for amd64/amd64
 TB --- 2010-10-26 06:20:40 - cleaning the object tree
 TB --- 2010-10-26 06:23:51 - cvsupping the source tree
 TB --- 2010-10-26 06:23:51 - /usr/bin/csup -z -r 3 -g -L 1 -h cvsup.sentex.ca /tinderbox/RELENG_8/amd64/amd64/supfile
 TB --- 2010-10-26 06:28:44 - building world
 TB --- 2010-10-26 06:28:44 - MAKEOBJDIRPREFIX=/obj
 TB --- 2010-10-26 06:28:44 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
 TB --- 2010-10-26 06:28:44 - TARGET=amd64
 TB --- 2010-10-26 06:28:44 - TARGET_ARCH=amd64
 TB --- 2010-10-26 06:28:44 - TZ=UTC
 TB --- 2010-10-26 06:28:44 - __MAKE_CONF=/dev/null
 TB --- 2010-10-26 06:28:44 - cd /src
 TB --- 2010-10-26 06:28:44 - /usr/bin/make -B buildworld
 World build started on Tue Oct 26 06:28:46 UTC 2010
 Rebuilding the temporary build tree
 stage 1.1: legacy release compatibility shims
 stage 1.2: bootstrap tools
 stage 2.1: cleaning up the object tree
 stage 2.2: rebuilding the object tree
 stage 2.3: build tools
 stage 3: cross tools
 stage 4.1: building includes
 stage 4.2: building libraries
 stage 4.3: make dependencies
 stage 4.4: building everything
 stage 5.1: building 32 bit shim libraries
 World build completed on Tue Oct 26 14:25:11 UTC 2010
 TB --- 2010-10-26 14:25:11 - generating LINT kernel config
 TB --- 2010-10-26 14:25:11 - cd /src/sys/amd64/conf
 TB --- 2010-10-26 14:25:11 - /usr/bin/make -B LINT
 TB --- 2010-10-26 14:25:12 - building LINT kernel
 TB --- 2010-10-26 14:25:12 - MAKEOBJDIRPREFIX=/obj
 TB --- 2010-10-26 14:25:12 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
 TB --- 2010-10-26 14:25:12 - TARGET=amd64
 TB --- 2010-10-26 14:25:12 - TARGET_ARCH=amd64
 TB --- 2010-10-26 14:25:12 - TZ=UTC
 TB --- 2010-10-26 14:25:12 - __MAKE_CONF=/dev/null
 TB --- 2010-10-26 14:25:12 - cd /src
 TB --- 2010-10-26 14:25:12 - /usr/bin/make -B buildkernel KERNCONF=LINT
 Kernel build for LINT started on Tue Oct 26 14:25:12 UTC 2010
 stage 1: configuring the kernel
 stage 2.1: cleaning up the object tree
 stage 2.2: rebuilding the object tree
 stage 2.3: build tools
 stage 3.1: making dependencies
 [...]
 awk -f /src/sys/tools/makeobjops.awk /src/sys/opencrypto/cryptodev_if.m -h
 awk -f /src/sys/tools/makeobjops.awk /src/sys/dev/acpica/acpi_if.m -h
 awk -f /src/sys/tools/makeobjops.awk /src/sys/dev/acpi_support/acpi_wmi_if.m -h
 rm -f .newdep
 /usr/bin/make -V CFILES -V SYSTEM_CFILES -V GEN_CFILES |  MKDEP_CPP=cc -E 
 CC=cc xargs mkdep -a -f .newdep -O2 -frename-registers -pipe 
 -fno-strict-aliasing  -std=c99  -Wall -Wredundant-decls -Wnested-externs 
 -Wstrict-prototypes  -Wmissing-prototypes -Wpointer-arith -Winline 
 -Wcast-qual  -Wundef -Wno-pointer-sign -fformat-extensions -nostdinc  -I. 
 -I/src/sys -I/src/sys/contrib/altq -I/src/sys/contrib/ipfilter 
 -I/src/sys/contrib/pf -I/src/sys/dev/ath -I/src/sys/dev/ath/ath_hal 
 -I/src/sys/contrib/ngatm -I/src/sys/dev/twa -I/src/sys/gnu/fs/xfs/FreeBSD 
 -I/src/sys/gnu/fs/xfs/FreeBSD/support -I/src/sys/gnu/fs/xfs 
 -I/src/sys/contrib/opensolaris/compat -I/src/sys/dev/cxgb -D_KERNEL 
 -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h -fno-common 
 -finline-limit=8000 --param inline-unit-growth=100 --param 
 large-function-growth=1000 -DGPROF -falign-functions=16 -DGPROF4 -DGUPROF 
 -fno-builtin -fno-omit-frame-pointer -mcmodel=kernel -mno-red-zone  
 -mfpmath=387 -mno-sse -mno-sse2 -mno-sse3 -mno-mmx -mno-3dnow  -msoft-float
 -fno-asynchronous-unwind-tables
 -ffreestanding -fstack-protector
 cc: /src/sys/libkern/inet_ntop.c: No such file or directory
 cc: /src/sys/libkern/inet_pton.c: No such file or directory
 mkdep: compile failed
 *** Error code 1

 Stop in /obj/src/sys/LINT.
 *** Error code 1

 Stop in /src.
 *** Error code 1

 Stop in /src.
 TB --- 2010-10-26 14:50:51 - WARNING: /usr/bin/make returned exit code  1
 TB --- 2010-10-26 14:50:51 - ERROR: failed to build lint kernel
 TB --- 2010-10-26 14:50:51 - 4255.74 user 16233.66 system 30610.94 real


 http://tinderbox.freebsd.org/tinderbox-releng_8-RELENG_8-amd64-amd64.full




-- 
Peace can only be achieved by understanding - A. Einstein


Re: Kernel panic when unpluggin AC adaptor

2010-05-13 Thread Attilio Rao
2010/5/14 Giovanni Trematerra giovanni.tremate...@gmail.com:
 On Thu, May 13, 2010 at 1:09 AM, Brandon Gooch
 jamesbrandongo...@gmail.com wrote:
 On Wed, May 12, 2010 at 9:41 AM, Attilio Rao atti...@freebsd.org wrote:
 2010/5/12 David DEMELIER demelier.da...@gmail.com:
 I remove the patch, and built the kernel (I updated the src this
 morning) and it does not panic now. It's really odd. If it reappears
 soon I will tell you.

 I looked at the code with Giovanni and I have the feeling that the
 race with the idle thread may still be fatal.
 We need to fix that.

 Attilio


 That seems to be the case, as my laptop shows about an 80-85 % chance
 of experiencing a panic if left idle for long-ish periods of time (2
 to 4 hours). I usually rebuild world or big ports overnight, and more
 often than not I wake up to a panicked machine, same situation every
 time:

 ...
 rman_get_bushandle() at rman_get_bushandle+0x1
 sched_idletd() at sched_idletd+0x123
 fork_exit() at fork_exit+0x12a
 fork_trampoline() at fork_trampoline+0xe
 ...

 The kernel/userland is rebuilt, the ports are finished compiling --
 it's in the time AFTER the completion of all tasks that the machine
 gets bored and tries to kill itself :)

 I have seen the AC adapter plug/unplug hang in the past on this
 laptop, but I never made the connection between the events, as
 nowadays my laptop usually stays plugged in :(

 Attilio, I hope you can track this one down, let me know if I can do
 anything to help or test...


 Attilio and I came up with this patch. It seems ready for stress
 testing and review
 Please test and report back.

I still have to review it completely; I hope to do that asap.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: Kernel panic when unpluggin AC adaptor

2010-05-12 Thread Attilio Rao
2010/5/12 David DEMELIER demelier.da...@gmail.com:
 I remove the patch, and built the kernel (I updated the src this
 morning) and it does not panic now. It's really odd. If it reappears
 soon I will tell you.

I looked at the code with Giovanni and I have the feeling that the
race with the idle thread may still be fatal.
We need to fix that.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: CPU problems after 8.0-STABLE update

2010-04-09 Thread Attilio Rao
2010/4/9 Jakub Lach jakub_l...@mailplus.pl:



 Andriy Gapon wrote:


 Really shooting in the dark here: are there any BIOS options about HPET
 and RTC on this system?  Can you try playing with them?



 Hello. I have similar problem. Once in few boots performance would be
 sluggish and top would be at 0%. It started on 4th April I think. After
 today's update, problem is persistent.
 Currently, as I type letters are appearing with considerable delay.

 I'm using HPET, 8-STABLE amd64 r206412

Ok, r206421 switches the default of the machdep.lapic_allclocks tunable,
so that atrtc is used only if the tunable is explicitly turned off.
I will MFC it in a week.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: CPU problems after 8.0-STABLE update

2010-04-08 Thread Attilio Rao
2010/4/8 Andriy Gapon a...@icyb.net.ua:
 on 08/04/2010 04:29 Akephalos said the following:
 Attilio, I csup-dated several hours ago and rebuilt and installed the kernel
 (and world, in case it matters).

 %uname -a FreeBSD free.bsd369441.org 8.0-STABLE FreeBSD 8.0-STABLE #0: Thu
 Apr  8 03:01:13 EEST 2010
 r...@free.bsd369441.org:/usr/obj/usr/src/sys/GENERIC  amd64

 The problem persists without the machdep trick, I see only one processor in
 top with 0.0% CPU load.


 Interesting, I couldn't see anything obviously wrong about your hardware.
 Could you please post a verbose dmesg from a problematic boot somewhere?
 Also, output of 'vmstat -i' and Interrupt request lines portion of
 'devinfo -u' output.
 Thanks!

I looked again at the patch I committed to STABLE_8 and I can't find
anything wrong with it.
Also, the fact that setting machdep.lapic_allclocks=1 fixes this means
that it may be a problem with how atrtc works.
Maybe new Atom machines expose a problem with it?
I'm wondering whether we should switch this to an opt-in rather than an
opt-out feature.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: CPU problems after 8.0-STABLE update

2010-04-06 Thread Attilio Rao
2010/4/6 Akephalos Akephalos akephalos.akepha...@gmail.com:
 On Sun, Apr 4, 2010 at 7:28 PM, Attilio Rao atti...@freebsd.org wrote:

 What architecture is it?
 May you try setting machdep.lapic_allclocks to 1 in /boot/loader.conf?
 May you report #dmesg | grep atrtc


 Thanks,
 Attilio


 --
 Peace can only be achieved by understanding - A. Einstein

 # dmesg | grep -B 5 -A 5 -i rtc
 acpi_button0: Sleep Button on acpi0
 acpi_button1: Power Button on acpi0
 acpi_tz0: Thermal Zone on acpi0
 battery0: ACPI Control Method Battery on acpi0
 acpi_acad0: AC Adapter on acpi0
 atrtc0: AT realtime clock port 0x70-0x71 irq 8 on acpi0
 atkbdc0: Keyboard controller (i8042) port 0x60,0x64 irq 1 on acpi0
 atkbd0: AT Keyboard irq 1 on atkbdc0
 kbd0 at atkbd0
 atkbd0: [GIANT-LOCKED]
 atkbd0: [ITHREAD]
 ---

 I set machdep.lapic_allclocks to 1 at startup - top works now!! I can see
 both processors with top -P, btw, everything looks fine, although I get core
 dump for xfce4-taskmanager and can't test it (it's probably related to
 something else).
 ---

 I started powerd - it scales the frequencies correctly now.

 This seems to be the solution, is this a bug should I report or leave things
 like this?

Uhm, can you tell me which revision you updated to? Can you update
to the latest now, recompile your kernel, remove the
machdep.lapic_allclocks setting and report whether it works or not?

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: CPU problems after 8.0-STABLE update

2010-04-04 Thread Attilio Rao
2010/4/4 Akephalos Akephalos akephalos.akepha...@gmail.com:
 Hey,

 I installed 8.0 release and used it very briefly until updating through
 cvsup to the latest stable source. I had no problems with the release (DVD)
 version, except that my wireless card wasn't detected, so updating was the
 natural thing to do. My hardware is an ASUS dual Turion laptop (K50AB), and
 my working setup was like this:

 - /boot/loader.conf:
 cpufreq_load=YES
 hint.acpi_throttle.0.disabled=1
 - /etc/rc.conf:
 powerd_enable=YES

 It was working fine, the CPU frequency was scaling as expected, I checked it
 numerous times while working and idle with 'sysctl dev.cpu.0.freq'. Also,
 the load was displayed correctly in the taskmanager (I don't remember what
 was displayed in 'top', but I suppose it was ok).

 Now, after updating through buildworld, powerd doesn't scale the frequency
 anymore. My observations pointed out that the problem is that the CPU load
 is not detected correctly anymore:
 - I got three frequency steps: 575, 1150 and 2300 (correctly detected by
 dev.cpu.0.freq_levels while the cpufreq module is loaded), but powerd scales
 the frequency down to the minimum, 575, then keeps it there regardless of
 the load - dev.cpu.0.freq shows 575 and I get long build times because of
 it. To use the CPU fully, I have to kill powerd and set the frequency
 manually, or disable powerd at startup.
 - 'top -P' displays only one CPU and its load is 0% for everything all the time,
 regardless of the actual load
 - I can't see anything in a task manager; the last time I tried was with xfce and
 CURRENT (CURRENT had the same issue)
 - dev.cpu.0.cx_usage shows 100%.
 ---

 I'd like to find out the problem, why the CPU level is not detected
 correctly and how to fix this/report.

What architecture is it?
Can you try setting machdep.lapic_allclocks to 1 in /boot/loader.conf?
Can you also report the output of: # dmesg | grep atrtc


Thanks,
Attilio




Re: ZFS and sh(1) panic: spin lock [lock addr] (smp rendezvous) held by [sh(1) proc tid] too long

2010-02-20 Thread Attilio Rao
2010/1/27 Brandon Gooch jamesbrandongo...@gmail.com:
 The machine, a Dell Optiplex 755, has been locking up recently. The
 situation usually occurs while using VirtualBox (running a 64-bit
 Windows 7 instance) and doing anything else in another xterm (such as
 rebuilding a port).  I've been unable to reliably reproduce it (I'm in
 an X session and the machine will not panic properly).

 However, while rebuilding Xorg today at ttyv0 and running
 VBoxHeadless on ttyv1, I managed to trigger what I believe is the
 lockup.

 I've attached a textdump in hopes that someone may be able to take a
 look and provide clues or instruction on debugging this.

I think that jhb@ saw a similar problem while working on the nVidia driver
or the like.
I'm not sure whether he made any progress debugging it.

Attilio




Re: run_interrupt_driven_hooks: still waiting... for xpt_config

2010-01-21 Thread Attilio Rao
2010/1/21 Willem Jan Withagen w...@digiware.nl:
 Willem Jan Withagen wrote:

 I'm trying to revive an old dual Opteron Tyan Tomcat S2875 board. I even
 upgraded it to the most recent BIOS, but still no go,
 with both 8.0 and 7.2 RELEASE.

 I've also disabled P1394 and all USB in the BIOS, that did not work
 either.
 The only extra thing in the box is an Areca 1120 controller.

 I moved the bootable disk to a default SATA port on the MB and removed the
 Areca controller.
 That gets rid of the problem, but it also creates a new one, since I'd
 like to use the controller to handle a bunch of backup disks.

 Suggestions on how to get the Areca controller past the xpt_config test
 are welcome.

It is probably linked to sbp(4). Do you have it in your kernel?
If so, can you recompile the kernel without it?
It would also be interesting to try without ACPI and, if it still
hangs, to see which IRQ is involved (and which other sources share it).

Attilio




Re: [PATCH] Lockmgr deadlock on STABLE_8

2010-01-19 Thread Attilio Rao
2010/1/19 Pete French petefre...@ticketswitch.com:
 Can you post your kernel config?

 sure...

        include         GENERIC
        ident           DEBUG
        options         KDB
        options         DDB
        options         WITNESS
        options         INVARIANT_SUPPORT
        options         INVARIANTS

Ok then, remove the debugging (WITNESS, INVARIANT*), leave in place
KDB and DDB, add GDB and try at least to get a coredump when it
deadlocks.

Attilio



Re: [PATCH] Lockmgr deadlock on STABLE_8

2010-01-18 Thread Attilio Rao
2010/1/18 Pete French petefre...@ticketswitch.com:
 You never know; try without WITNESS but otherwise the same setup.

 Well, I have been running like this for three days with no lockups,
 disappointingly. I just saw that you committed the lock patches, so
 am going to update to the latest STABLE and go back to GENERIC to see if
 that still locks up (as I can see a couple of other fixes in there).
 Will let you know what happens - at the moment it's frustrating
 as it won't lock up if I have anything diagnostic in the kernel, it
 seems!

Can you post your kernel config?

Attilio




Re: [PATCH] Lockmgr deadlock on STABLE_8

2010-01-15 Thread Attilio Rao
2010/1/15 Pete French petefre...@ticketswitch.com:
 Well, the machine has been running the WITNESS + INVARIANTS kernel
 for 20 hours now without locking up. This looks like what I
 saw before - compiling in WITNESS stops it locking up -(

 Is there any use in my running a kernel with just INVARIANTS to see if
 that will lock? I know it locks with KDB and DDB on their own, but
 am not sure how useful that is.

You never know; try without WITNESS but otherwise the same setup.

Attilio




Re: [PATCH] Lockmgr deadlock on STABLE_8

2010-01-14 Thread Attilio Rao
2010/1/14 Pete French petefre...@ticketswitch.com:
 http://www.freebsd.org/~attilio/lockmgr_fix8.diff

 I'm looking for testers here.
 Any report would be very much appreciated.

 I tested the patch on my machine which locks up, and I am afraid that it
 still locks, even with the patch applied. The last things on the console
 before the lock are:

 1) A whole load of sshd errors for one of those flood attacks which try
 multiple usernames. This is not unusual; all my systems with an external
 ssh port see this.

 2) Four 'Watchdog timeout occurred, resetting!' messages from if_bce.c.
 These are new - without your patch I did not get these.

 I have tried running this machine with WITNESS in the kernel, but it
 will not deadlock then. Without WITNESS it will lock up in about
 twelve hours. I am going to try with just KDB and DDB to see if I can get
 it into a state where we can get some useful information out of it.

Also enable INVARIANTS.
While you are at it (with my patch applied), please set up textdump so
that you can report the output of the following DDB commands (once it
deadlocks, break into DDB):
bt, show allpcpu, ps, alltrace, show alllocks

Also try to get a coredump (if you can't, report to us immediately
and try not to turn off the machine, so that you can apply further
instructions).

Attilio




Re: [PATCH] Lockmgr deadlock on STABLE_8

2010-01-14 Thread Attilio Rao
2010/1/14 Pete French petefre...@ticketswitch.com:
 INVARIANTS requires INVARIANT_SUPPORT [sic] in the kernel config (see 
 comments in GENERIC).

 Ah, right, that would explain it. Thanks!

INVARIANT_SUPPORT is a separate, mandatory option so that a kernel built
without INVARIANTS can still handle modules compiled with INVARIANTS.
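
As a rough illustration of why the two options are split, here is a
userspace sketch, not kernel source. The names model_lock,
model_lock_assert_impl() and model_lock_assert() are hypothetical. The
checking routine is compiled whenever INVARIANT_SUPPORT is defined, while
the macro that calls it is active only in code built with INVARIANTS, so a
module built with INVARIANTS can find the routine in a kernel that has only
INVARIANT_SUPPORT:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption for this sketch: the "kernel" is built with support only. */
#define INVARIANT_SUPPORT

/* Hypothetical lock type used only for this illustration. */
struct model_lock {
	bool held;
};

#ifdef INVARIANT_SUPPORT
/* "Kernel" side: the checking routine exists even without INVARIANTS. */
static int
model_lock_assert_impl(const struct model_lock *lk)
{
	assert(lk != NULL);
	return (lk->held ? 0 : -1);	/* 0 = invariant holds, -1 = violated */
}
#endif

#ifdef INVARIANTS
/* Code built with INVARIANTS really calls the checking routine. */
#define model_lock_assert(lk)	model_lock_assert_impl(lk)
#else
/* Code built without INVARIANTS compiles the check away. */
#define model_lock_assert(lk)	0
#endif
```

Building a file with -DINVARIANTS turns model_lock_assert() into a real
call, which links only because the routine was compiled in on the other
side.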

Attilio




[PATCH] Lockmgr deadlock on STABLE_8

2010-01-13 Thread Attilio Rao
As people following HEAD may have seen, about a month ago a fix to
lockmgr(9) was committed that should prevent a deadlock in that
primitive (the fix consists of r200447, 201703, 201709-201710).
While the approach chosen in HEAD is optimal, it unfortunately
introduces an ABI breakage.
To allow an MFC, a similar approach, slightly sub-optimal
but not breaking the ABI, has been prepared for STABLE_8:
http://www.freebsd.org/~attilio/lockmgr_fix8.diff

I'm looking for testers here.
Any report would be very much appreciated.

Thanks,
Attilio




Re: Possible scheduler (SCHED_ULE) bug?

2009-11-08 Thread Attilio Rao
2009/10/23 Jaime Bozza jbo...@mindsites.com:
 I believe I found a problem with the ULE scheduler - At least the fact that 
 there is a problem, but I'm not sure where to go from here.   The system 
 locks all processes, but doesn't panic, so I have no output to give.

 I was able to duplicate this on three different machines and solved it by
 switching the scheduler to 4BSD.

 Here's the environment:

 FreeBSD 7.2 i386, installed from bootonly ISO, Custom install, minimal, no 
 other changes other than setting timezone, changing root password, and 
 turning on sshd (allowing root and password connection).

Did you recompile your kernel? Can you show me the revision of
src/sys/kern/sched_ule.c you used?

Attilio




Re: resource leak in fifo_vnops.c: 6.x/7.x/8.x

2009-11-06 Thread Attilio Rao
2009/11/6 Dorr H. Clark dcl...@engr.scu.edu:


 We believe we have identified a significant resource leak
 present in 6.x, 7.x, and 8.x.  We believe this is a regression
 versus FreeBSD 4.x which appears to do the Right Thing (tm).

 We have a test program (see below) which will run the system
 out of sockets by repeated exercise of the failing code
 path in the kernel.

 Our proposed fix is applied to the file usr/src/sys/fs/fifofs/fifo_vnops.c


 @@ -237,6 +237,8 @@
         if (ap->a_mode & FWRITE) {
                 if ((ap->a_mode & O_NONBLOCK) && fip->fi_readers == 0) {
                         mtx_unlock(&fifo_mtx);
 +                       /* Exclusive VOP lock is held - safe to clean */
 +                       fifo_cleanup(vp);
                         return (ENXIO);
                 }
                 fip->fi_writers++;

I think it should also check that fip->fi_writers == 0 (and possibly
the checks within fifo_cleanup() should just be assertions, but that's
somewhat orthogonal), and the comment is not needed.
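
To make the suggestion concrete, here is a minimal userspace sketch, not
the actual fifo_vnops.c code. The struct fifo_model type and the
fifo_model_open_write() and fifo_model_cleanup() helpers are hypothetical
stand-ins for the kernel's per-FIFO state handling. The only point is that
the early ENXIO return releases the state only when both fi_readers and
fi_writers are zero, with the checks inside the cleanup routine reduced to
an assertion:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model of the per-FIFO state; not the kernel struct. */
struct fifo_model {
	long fi_readers;	/* number of open readers */
	long fi_writers;	/* number of open writers */
	bool allocated;		/* stands in for the sockets held by the FIFO */
};

/* Release the state; the caller must have checked that nobody uses it. */
static void
fifo_model_cleanup(struct fifo_model *fip)
{
	assert(fip->fi_readers == 0 && fip->fi_writers == 0);
	fip->allocated = false;
}

/*
 * Model of opening the FIFO for writing.  Returns 0 on success or ENXIO
 * when a non-blocking writer finds no reader.
 */
static int
fifo_model_open_write(struct fifo_model *fip, bool nonblock)
{
	fip->allocated = true;	/* the open path has set up the FIFO state */
	if (nonblock && fip->fi_readers == 0) {
		/* Clean up only if no other writer still needs the state. */
		if (fip->fi_writers == 0)
			fifo_model_cleanup(fip);
		return (ENXIO);
	}
	fip->fi_writers++;
	return (0);
}
```

With a writer already present, the failed non-blocking open leaves the
state alone; with no users at all, it is released before returning.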

Attilio




Re: interrupt threads CPU usage in FreeBSD 8.0

2009-11-03 Thread Attilio Rao
2009/10/21 Igor Sysoev i...@rambler-co.ru:
 Hi,

 for some reason in 8.0 top always shows 0% CPU usage for intr kernel
 process and active interrupt thread, irq19 bge0 in my case.

 8.0 RC1 top -PS:

 CPU 0: 27.8% user,  0.0% nice,  7.1% system,  0.0% interrupt, 65.0% idle
 CPU 1:  3.0% user,  0.0% nice,  2.3% system,  7.1% interrupt, 87.6% idle

   PID USERNAME THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
    11 root       2 171 ki31     0K    32K RUN     0 140.7H 152.54% idle
 61371 nobody     1  69  -10   384M   289M kqread  0 105:56  17.77% nginx
 61372 nobody     1  67  -10   384M   293M CPU0    0 106:15  16.99% nginx
    12 root      15 -60    -     0K   240K WAIT    0  54:50   0.00% intr

 8.0 RC1 top -PSH:

   PID USERNAME PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
    11 root     171 ki31     0K    32K RUN     1  71.5H  81.05% {idle: cpu1}
    11 root     171 ki31     0K    32K CPU0    0  69.3H  69.19% {idle: cpu0}
 61372 nobody    68  -10   384M   294M kqread  0 107:06  18.99% nginx
 61371 nobody    68  -10   384M   291M kqread  0 106:45  16.99% nginx
    12 root     -68    -     0K   240K WAIT    1  50:48   0.00% {irq19: bge0}
    17 root      44    -     0K    16K syncer  1   5:23   0.00% syncer
    12 root     -32    -     0K   240K WAIT    1   3:06   0.00% {swi4: clock}

 7.2-STABLE top -PS:

 CPU 0:  9.0% user,  0.0% nice,  7.9% system,  9.0% interrupt, 74.1% idle
 CPU 1: 23.3% user,  0.0% nice,  8.3% system,  0.0% interrupt, 68.4% idle

   PID USERNAME THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
    12 root       1 171 ki31     0K    16K RUN     0 275.0H  83.59% idle: cpu0
    11 root       1 171 ki31     0K    16K RUN     1 264.2H  76.27% idle: cpu1
 16109 nobody     1  68  -10   376M   307M CPU0    1  28:05  21.97% nginx
 16110 nobody     1   4  -10   376M   316M RUN     0  28:05  20.17% nginx
    26 root       1 -68    -     0K    16K WAIT    0 902:39   6.69% irq19: bge0

How old is your 7.2-STABLE?

Attilio




Re: 7.2-release/amd64: panic, spin lock held too long

2009-09-28 Thread Attilio Rao
2009/9/28 C. C. Tang hiyo...@gmail.com:
 C. C. Tang wrote:

 Attilio Rao wrote:

 2009/9/22 C. C. Tang hiyo...@gmail.com:

 I have patched sched_ule.c and did a make buildkernel && make
 installkernel (are buildworld and installworld necessary?), rebooted, and
 the machine is running now.
 I will post here again if there is any update.

 My server has been up for 3.5 days now with HyperThreading and powerd enabled.
 No panic has occurred yet.

 Usually how long did it take to panic?

 Attilio


 It is rather random, but will usually panic within one week.
 Anyway my server will keep running and I will report if it has any
 problem.

 Thanks,
 C.C.

 My server is up for 9.5 days now. Seems working fine.

The patch has been committed to STABLE_7 as well.

Attilio




Re: 8.0-RC1 panic attaching ppc

2009-09-24 Thread Attilio Rao
2009/9/24 Daniel O'Connor docon...@gsoft.com.au:
 On Wed, 23 Sep 2009, Attilio Rao wrote:
 2009/9/23 Daniel O'Connor docon...@gsoft.com.au:
  If I enable the parallel port on this Gigabyte MA7785GM-US2H I get
  a trap 12 when booting up.
 
  I forgot to take a picture of it at the time but I should be able
  to reproduce it tomorrow.
 
  Has anyone seen anything before? (a quick google showed nothing). I
  did not see it on 7.2(ish) on the same hardware.

 Are you able to enable KDB in your kernel config and return a
 backtrace here?

 Yes, here it is..

 pmap_extract() at pmap_extract+0x13a
 isa_dmarangecheck() at isa_dmarangecheck+0x7a
 isa_dma_init() at isa_dma_init+0xda
 ppc_isa_attach() at ppc_isa_attach+0x40
 device_attach() at device_attach+0x69
 bus_generic_attach() at bus_generic_attach+0x1a
 acpi_attach() at acpi_attach+0x9f8

 (there's more but I imagine the above is probably sufficient).

 I took pictures, they are here
 http://www.gsoft.com.au/~doconnor/SNC00111.jpg
 http://www.gsoft.com.au/~doconnor/SNC00112.jpg

 If I put the parallel port in EPP mode then it works; I presume that's
 because it doesn't require a DMA channel whereas ECP does. I haven't
 enumerated the possibilities though :)

Can you try to get a kernel dump that we can analyze?
You would just need to recompile the kernel with options KDB, GDB and
debugging symbols.
Then we can do more on that.

Thanks,
Attilio




Re: 8.0-RC1 panic attaching ppc

2009-09-23 Thread Attilio Rao
2009/9/23 Daniel O'Connor docon...@gsoft.com.au:
 If I enable the parallel port on this Gigabyte MA7785GM-US2H I get a
 trap 12 when booting up.

 I forgot to take a picture of it at the time but I should be able to
 reproduce it tomorrow.

 Has anyone seen anything before? (a quick google showed nothing). I did
 not see it on 7.2(ish) on the same hardware.

Are you able to enable KDB in your kernel config and return a backtrace here?

Attilio




Re: 7.2-release/amd64: panic, spin lock held too long

2009-09-21 Thread Attilio Rao
2009/9/22 C. C. Tang hiyo...@gmail.com:


 I have patched sched_ule.c and did a make buildkernel && make
 installkernel (are buildworld and installworld necessary?), rebooted, and
 the machine is running now.
 I will post here again if there is any update.

 My server has been up for 3.5 days now with HyperThreading and powerd enabled.
 No panic has occurred yet.

Usually how long did it take to panic?

Attilio




Re: 7.2-release/amd64: panic, spin lock held too long

2009-09-19 Thread Attilio Rao
2009/9/19 Dan Naumov dan.nau...@gmail.com:
 On Fri, Sep 18, 2009 at 2:25 PM, C. C. Tang hiyo...@gmail.com wrote:
 Attilio Rao wrote:

 2009/9/17 C. C. Tang hiyo...@gmail.com:

 Dan, is that machine equipped with Hyperthreading?

 Attilio

 Yes. It's an Intel Atom 330, which is a dualcore CPU with HT (4 cores
 visible in top as a result)

 Yes, mine is also Atom 330.

 I cannot test the patch because my machine is also in production now.
 But I have tested it with hyperthreading:
 powerd with HyperThreading - spin lock held too long
 powerd without HyperThreading - no problem
 no powerd with/without HyperThreading - no problem

 But are these results with the last patch I posted applied?
 (specifically, for 7.2:
 http://www.freebsd.org/~attilio/sched_ule.diff
 )

 So with the patch in, and powerd and hyperthreading on, you still get a
 deadlock?

 Attilio


 I have patched sched_ule.c and did a make buildkernel && make
 installkernel (are buildworld and installworld necessary?), rebooted, and the
 machine is running now.
 I will post here again if there is any update.

 Considering we are at RC1 right now, is there any chance this patch
 makes it into 8.0 release if the patch fixes the issue and doesn't
 cause any regressions? Unfortunately I can't test it myself right now,
 so I have to rely on other people experiencing the same issue to see
 if the patch fixes it.

I already committed it to STABLE_8, so it will make it for sure.

Attilio




Re: 7.2-release/amd64: panic, spin lock held too long

2009-09-17 Thread Attilio Rao
2009/9/17 C. C. Tang hiyo...@gmail.com:
 Dan, is that machine equipped with Hyperthreading?

 Attilio

 Yes. It's an Intel Atom 330, which is a dualcore CPU with HT (4 cores
 visible in top as a result)

 Yes, mine is also Atom 330.

 I cannot test the patch because my machine is also in production now. But I
 have tested it with hyperthreading.
 powerd with HyperThreading - spin lock held too long
 powerd without HyperThreading - no problem
 no powerd with/without HyperThreading - no problem

But are these results with the last patch I posted applied?
(specifically, for 7.2:
http://www.freebsd.org/~attilio/sched_ule.diff
)

So with the patch in, and powerd and hyperthreading on, you still get a deadlock?

Attilio




Re: 7.2-release/amd64: panic, spin lock held too long

2009-09-14 Thread Attilio Rao
2009/7/23 C. C. Tang hiyo...@gmail.com:
 Attilio Rao wrote:

 2009/7/22 C. C. Tang hiyo...@gmail.com:

 Could that one (on i386) be related?
 http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/134584

 I have no idea about it but I can tell the difference...
 My machine panics randomly rather than on shutdown, and I remember that
 it failed to write a core dump. It also failed to reboot automatically.

 Is your problem on -CURRENT and amd64?
 At some point there was a problem with PAT support (and
 tlb_shootdowns() could lead to a livelock hanging forever, leading to
 such a bug) but I expect it is fixed now.
 Can you try with a fresh -CURRENT, if possible?

 My problem is on the i386 version of 7.2-RELEASE-p2 on an Intel Atom 330 CPU.
 And my system just panics randomly with spin lock held too long.
 It didn't panic at reboot or shutdown, so I think the problem is somewhat
 different from the one mentioned in Barbara's PR?

 Anyway, I disabled powerd and it seems to have become stable now.

 And I am sorry that my system has been put into service, so it would be hard
 for me to switch to -CURRENT...  :(

Can you re-enable powerd and try the attached patch?:
http://www.freebsd.org/~attilio/sched_ule.diff

The patch is against STABLE_7, but I think HEAD has the same bug.
Please try it and report to me.

Attilio




Re: 7.2-release/amd64: panic, spin lock held too long

2009-09-12 Thread Attilio Rao
2009/7/7 Dan Naumov dan.nau...@gmail.com:
 I just got a panic followed by a reboot a few seconds after running
 portsnap update; /var/log/messages shows the following:

 Jul  7 03:49:38 atom syslogd: kernel boot file is /boot/kernel/kernel
 Jul  7 03:49:38 atom kernel: spin lock 0x80b3edc0 (sched lock
 1) held by 0xff00017d8370 (tid 100054) too long
 Jul  7 03:49:38 atom kernel: panic: spin lock held too long

 /var/crash looks empty. This is a system running official 7.2-p1
 binaries since I am using freebsd-update to keep up with the patches
 (just updated to -p2 after this panic) running with very low load,
 mostly serving files to my home network over Samba and running a few
 irssi instances in a screen. What do I need to do to catch more
 information if/when this happens again?

Dan, is that machine equipped with Hyperthreading?

Attilio




Re: spinlock held too long on reboot

2009-08-04 Thread Attilio Rao
2009/7/29 Attilio Rao atti...@freebsd.org:
 2009/5/23 Stefan Bethke s...@lassitu.de:
 I wrote:

 Syncing disks, vnodes remaining...0 done
 All buffers synced.
 GEOM_MIRROR: Device diesel_root: provider mirror/diesel_root destroyed.
 Uptime: 6m32s
 GEOM_MIRROR: Device diesel_root destroyed.
 Rebooting...
 cpu_reset: Stopping other CPUs
 spin lock 0x8078c900 (sched lock 1) held by 0xff00014d4ab0
 (tid 12) too long
 panic: spin lock held too long
 cpuid = 0
 KDB: enter: panic
 [thread pid 77 tid 100090 ]
 Stopped at  kdb_enter+0x3d: movq    $0,0x48bbd0(%rip)
 db> bt
 Tracing pid 77 tid 100090 td 0xff000457bab0
 kdb_enter() at kdb_enter+0x3d
 panic() at panic+0x17b
 _mtx_lock_spin_failed() at _mtx_lock_spin_failed+0x39
 _mtx_lock_spin() at _mtx_lock_spin+0x9e
 _mtx_lock_spin_flags() at _mtx_lock_spin_flags+0x72
 sched_balance_group() at sched_balance_group+0xc5
 sched_balance_group() at sched_balance_group+0x1f8
 sched_balance() at sched_balance+0xa2
 sched_clock() at sched_clock+0xf6
 statclock() at statclock+0xbd
 lapic_handle_timer() at lapic_handle_timer+0x197
 Xtimerint() at Xtimerint+0x8c
 --- interrupt, rip = 0x80541cc4, rsp = 0xff80771dba90, rbp =
 0xff80771dbab0 ---
 DELAY() at DELAY+0x64
 cpu_reset() at cpu_reset+0xdd
 boot() at boot+0x2e6
 reboot() at reboot+0x42
 syscall() at syscall+0x1a5
 Xfast_syscall() at Xfast_syscall+0xd0
 --- syscall (55, FreeBSD ELF64, reboot), rip = 0x800788eec, rsp =
 0x7fffeca8, rbp = 0 ---


 I've only seen this once.  If I should encounter it again, is there
 something you'd like me to look at?

 [ Sorry, trying to add anyone who already reported such a problem, even
 if I know many of you experienced it on -STABLE ]

If you are experiencing this problem, you may want to test rink@'s port
to 7.2 of the new version of the patch:
http://people.freebsd.org/~rink/tmp/ipi_7stable.diff

The -CURRENT version, which is probably going to be committed
soon, is here:
http://www.freebsd.org/~attilio/stop_nmi2.diff

Thanks,
Attilio




Re: kern/134584: [panic] spin lock held too long

2009-07-27 Thread Attilio Rao
2009/7/26 barbara barbara.xxx1...@libero.it:
 It happened again, on shutdown.
 As the previous time, it happened after a high (for a desktop) uptime and, if
 it matters, after running net-p2p/transmission-gtk2 for several hours.
 I don't know if it's related, but quitting transmission often doesn't
 terminate the process. Sometimes it ends several minutes after the GUI exited,
 sometimes it's still running hours later.
 I noticed this because the destination folder is on a manually mounted device and
 I can't umount it, as fstat reports the device in use by a transmission process.
 So I often have to kill it.
 This happened both times I had this kind of panic.

What hw is that? How many CPUs does it have?

Attilio




Re: kern/134584: [panic] spin lock held too long

2009-07-26 Thread Attilio Rao
2009/7/26 barbara barbara.xxx1...@libero.it:
 It happened again, on shutdown.
 As the previous time, it happened after a high (for a desktop) uptime and, if
 it matters, after running net-p2p/transmission-gtk2 for several hours.
 I don't know if it's related, but quitting transmission often doesn't
 terminate the process. Sometimes it ends several minutes after the GUI exited,
 sometimes it's still running hours later.
 I noticed this because the destination folder is on a manually mounted device and
 I can't umount it, as fstat reports the device in use by a transmission process.
 So I often have to kill it.
 This happened both times I had this kind of panic.

Can you try to reproduce it with WITNESS and *without*
WITNESS_SKIPSPIN? I would need to look at the output of show alllocks and
possibly ps, because it seems that the lock owner is preempted, which
should not happen while holding a spinlock (unless the acquired
spinlock is the one in the preempting path; in that case, though, it
should be dropped inside sched_switch() and we can try to understand why
that doesn't happen).
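
For readers unfamiliar with the panic itself, here is a simplified
userspace model of a spin lock that gives up after a bounded number of
attempts, which is the general idea behind a "spin lock held too long"
report. This is only a sketch under stated assumptions: struct model_spin,
model_spin_try(), model_spin_release() and the limit argument are
hypothetical, and the real kernel code differs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_UNOWNED	((uintptr_t)0)

/* Hypothetical spin lock: the lock word stores the owner's thread id. */
struct model_spin {
	_Atomic uintptr_t owner;
};

/*
 * Try to take the lock for thread 'tid', spinning at most 'limit' times.
 * Returns true on success; false models the "held too long" case, where
 * the kernel would report the current owner and panic.
 */
static bool
model_spin_try(struct model_spin *sp, uintptr_t tid, unsigned long limit)
{
	assert(tid != MODEL_UNOWNED);
	for (unsigned long i = 0; i < limit; i++) {
		uintptr_t expected = MODEL_UNOWNED;
		if (atomic_compare_exchange_strong(&sp->owner, &expected, tid))
			return (true);
	}
	return (false);
}

/* Drop the lock by marking it unowned again. */
static void
model_spin_release(struct model_spin *sp)
{
	atomic_store(&sp->owner, MODEL_UNOWNED);
}
```

If the owner never runs again (for example because it was preempted or
stopped), no amount of spinning helps, which is why the owner's state is
the interesting part of the debug output.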

Thanks,
Attilio




Re: 7.2-release/amd64: panic, spin lock held too long

2009-07-22 Thread Attilio Rao
2009/7/22 C. C. Tang hiyo...@gmail.com:
 Could that one (on i386) be related?
 http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/134584


 I have no idea about it but I can tell the difference...
 My machine panics randomly rather than on shutdown, and I remember that it
 failed to write a core dump. It also failed to reboot automatically.

Is your problem on -CURRENT and amd64?
At some point there was a problem with PAT support (and
tlb_shootdowns() could lead to a livelock hanging forever, leading to
such a bug) but I expect it is fixed now.
Can you try with a fresh -CURRENT, if possible?

Thanks,
Attilio





Re: 7.2-release/amd64: panic, spin lock held too long

2009-07-22 Thread Attilio Rao
2009/7/23 NAKAJI Hiroyuki nak...@jp.freebsd.org:
 In 4a667469.1080...@gmail.com
   C. C. Tang hiyo...@gmail.com wrote:
  Could that one (on i386) be related?
  http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/134584
 

 I have no idea about it but I can tell the difference...
 My machine panics randomly rather than on shutdown, and I remember
 that it failed to write a core dump. It also failed to reboot
 automatically.

 I also have trouble like yours.
 http://lists.freebsd.org/pipermail/freebsd-stable/2009-June/050526.html

 I've heard from Attilio Rao that he had found the problem and is working
 on it.

Your problem should be linked to a well-known deadlock in the VM. kib@
and jeff@ were already working on a patch for that, so I just passed
them the ball.

Thanks,
Attilio




Re: smbfs panic when lost connection or unmount --force

2009-07-10 Thread Attilio Rao
2009/7/10 Oliver Pinter oliver.p...@gmail.com:
 Hello!

 Here is the bt:
 http://centaur.sch.bme.hu/~oliverp/freebsd/smbfs_panic/DSC01845.JPG
 http://centaur.sch.bme.hu/~oliverp/freebsd/smbfs_panic/DSC01846.JPG
 http://centaur.sch.bme.hu/~oliverp/freebsd/smbfs_panic/DSC01847.JPG

Could you please add the register state and the locked vnodes to this information?

Thanks,
Attilio




Re: smbfs panic when lost connection or unmount --force

2009-07-10 Thread Attilio Rao
2009/7/11 Oliver Pinter oliver.p...@gmail.com:
 regs and vnodes:

 http://centaur.sch.bme.hu/~oliverp/freebsd/smbfs_panic/DSC01854.JPG
 http://centaur.sch.bme.hu/~oliverp/freebsd/smbfs_panic/DSC01855.JPG
 http://centaur.sch.bme.hu/~oliverp/freebsd/smbfs_panic/DSC01856.JPG
 http://centaur.sch.bme.hu/~oliverp/freebsd/smbfs_panic/DSC01857.JPG
 http://centaur.sch.bme.hu/~oliverp/freebsd/smbfs_panic/DSC01858.JPG
 http://centaur.sch.bme.hu/~oliverp/freebsd/smbfs_panic/DSC01859.JPG
 http://centaur.sch.bme.hu/~oliverp/freebsd/smbfs_panic/DSC01860.JPG

Sorry, maybe I wasn't clear: the command is spelled 'lockedvnods'.

Thanks,
Attilio





Re: smbfs panic when lost connection or unmount --force

2009-07-09 Thread Attilio Rao
2009/7/10 Oliver Pinter oliver.p...@gmail.com:
 Hi all!

 I get a kernel panic when I force-unmount the smbfs volume or lose the
 connection to the samba server.

 --
 The OS is:


 kern.ostype: FreeBSD
 kern.osrelease: 7.2-STABLE
 kern.osrevision: 199506
 kern.version: FreeBSD 7.2-STABLE #4: Sat Jun 27 21:44:32 CEST 2009
r...@oliverp:/usr/obj/usr/src/sys/stable
 kern.osreldate: 702103

 --
 make.conf:


 CPUTYPE?=core2
 CFLAGS= -O2 -fno-strict-aliasing -pipe
 MODULES_OVERRIDE=smbfs libiconv libmchain zfs opensolaris drm cd9660 cd9660_iconv

 --
 panic message:

 Jul 10 01:58:39 oliverp syslogd: kernel boot file is /boot/kernel/kernel
 Jul 10 01:58:39 oliverp kernel: kernel trap 12 with interrupts disabled
 Jul 10 01:58:39 oliverp kernel:
 Jul 10 01:58:39 oliverp kernel:
 Jul 10 01:58:39 oliverp kernel: Fatal trap 12: page fault while in kernel mode
 Jul 10 01:58:39 oliverp kernel: cpuid = 2; apic id = 02
 Jul 10 01:58:39 oliverp kernel: fault virtual address   = 0x30
 Jul 10 01:58:39 oliverp kernel: fault code  = supervisor read data, page not present
 Jul 10 01:58:39 oliverp kernel: instruction pointer = 0x8:0x80327fd0
 Jul 10 01:58:39 oliverp kernel: stack pointer   = 0x10:0xff8078360940
 Jul 10 01:58:39 oliverp kernel: frame pointer   = 0x10:0xff0004c31390
 Jul 10 01:58:39 oliverp kernel: code segment= base 0x0, limit 0xf, type 0x1b
 Jul 10 01:58:39 oliverp kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
 Jul 10 01:58:39 oliverp kernel: processor eflags= resume, IOPL = 0
 Jul 10 01:58:39 oliverp kernel: current process = 60406 (smbiod0)
 Jul 10 01:58:39 oliverp kernel: trap number = 12
 Jul 10 01:58:39 oliverp kernel: panic: page fault
 Jul 10 01:58:39 oliverp kernel: cpuid = 2
 Jul 10 01:58:39 oliverp kernel: Uptime: 6h51m16s
 Jul 10 01:58:39 oliverp kernel: Physical memory: 4087 MB
 Jul 10 01:58:39 oliverp kernel: Dumping 2448 MB:Copyright (c)
 1992-2009 The FreeBSD Project.

Can you at least produce a backtrace for that?

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: 7.2-release/amd64: panic, spin lock held too long

2009-07-08 Thread Attilio Rao
2009/7/8 Dan Naumov dan.nau...@gmail.com:
 On Wed, Jul 8, 2009 at 3:57 AM, Dan Naumov dan.nau...@gmail.com wrote:
 On Tue, Jul 7, 2009 at 4:27 AM, Attilio Rao atti...@freebsd.org wrote:
 2009/7/7 Dan Naumov dan.nau...@gmail.com:
 On Tue, Jul 7, 2009 at 4:18 AM, Attilio Rao atti...@freebsd.org wrote:
 2009/7/7 Dan Naumov dan.nau...@gmail.com:
 I just got a panic followed by a reboot a few seconds after running
 portsnap update; /var/log/messages shows the following:

 Jul  7 03:49:38 atom syslogd: kernel boot file is /boot/kernel/kernel
 Jul  7 03:49:38 atom kernel: spin lock 0x80b3edc0 (sched lock
 1) held by 0xff00017d8370 (tid 100054) too long
 Jul  7 03:49:38 atom kernel: panic: spin lock held too long

 That's a known bug, affecting -CURRENT as well.
 The cpustop IPI is handled through an NMI, which means it can
 interrupt a CPU at any moment, even while it is holding a spinlock,
 violating a well-known FreeBSD rule.
 That means a CPU can stop itself while its thread is holding the
 sched lock spinlock, without ever releasing it (there is no way, short
 of something highly hackish, to fix that).
 In the meantime hardclock() wants to schedule something else to run and
 gets stuck on the thread lock.

 The ideal fix would involve not using an NMI to serve the cpustop, while
 having a cheap way (one that doesn't burden the common path) to tell
 hardclock() to avoid scheduling while a cpustop is in flight.

 Thanks,
 Attilio

 Any idea if a fix is being worked on and how unlucky must one be to
 run into this issue, should I expect it to happen again? Is it
 basically completely random?

 I'd like to work on that issue before BETA3 (and backport to
 STABLE_7); I'm just time-constrained right now.
 It is completely random.

 Thanks,
 Attilio

 Ok, this is getting pretty bad, 23 hours later, I get the same kind of
 panic, the only difference is that instead of portsnap update, this
 was triggered by portsnap cron which I have running between 3 and 4
 am every day:

 Jul  8 03:03:49 atom kernel: ssppiinn  lloocckk
 00xx8800bb33eeddc400  ((sscchheedd  lloocck k1 )0 )h
 ehledl db yb y 0x0xfff0f1081735339760e 0( t(itdi d
 1016070)5 )t otoo ol olnogng
 Jul  8 03:03:49 atom kernel: p
 Jul  8 03:03:49 atom kernel: anic: spin lock held too long
 Jul  8 03:03:49 atom kernel: cpuid = 0
 Jul  8 03:03:49 atom kernel: Uptime: 23h2m38s

 I have now tried repeating the problem by running stress --cpu 8 --io
 8 --vm 4 --vm-bytes 1024M --timeout 600s --verbose which pushed
 system load into the 15.50 ballpark and simultaneously running
 portsnap fetch and portsnap update but I couldn't manually trigger
 the panic, it seems that this problem is indeed random (although it
 baffles me why it is specifically portsnap triggering it). I have now
 disabled powerd to check whether that makes any difference to system
 stability.

But is that happening at reboot time?

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: 7.2-release/amd64: panic, spin lock held too long

2009-07-06 Thread Attilio Rao
2009/7/7 Dan Naumov dan.nau...@gmail.com:
 I just got a panic followed by a reboot a few seconds after running
 portsnap update; /var/log/messages shows the following:

 Jul  7 03:49:38 atom syslogd: kernel boot file is /boot/kernel/kernel
 Jul  7 03:49:38 atom kernel: spin lock 0x80b3edc0 (sched lock
 1) held by 0xff00017d8370 (tid 100054) too long
 Jul  7 03:49:38 atom kernel: panic: spin lock held too long

That's a known bug, affecting -CURRENT as well.
The cpustop IPI is handled through an NMI, which means it can
interrupt a CPU at any moment, even while it is holding a spinlock,
violating a well-known FreeBSD rule.
That means a CPU can stop itself while its thread is holding the
sched lock spinlock, without ever releasing it (there is no way, short
of something highly hackish, to fix that).
In the meantime hardclock() wants to schedule something else to run and
gets stuck on the thread lock.

The ideal fix would involve not using an NMI to serve the cpustop, while
having a cheap way (one that doesn't burden the common path) to tell
hardclock() to avoid scheduling while a cpustop is in flight.
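
A minimal userland sketch of that last idea, not the actual kernel
change: the names stop_in_flight, cpustop_begin(), cpustop_end() and
hardclock_tick() are hypothetical, and C11 atomics stand in for the
kernel's own primitives. It only illustrates that the check can cost a
single load on the common path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag: nonzero while a cpustop request is being served. */
atomic_int stop_in_flight;

/* Hypothetical counters standing in for real scheduler work. */
int scheduled_ticks;
int skipped_ticks;

/* Called by the CPU that asks the others to stop (assumed entry point). */
void cpustop_begin(void)
{
	atomic_store_explicit(&stop_in_flight, 1, memory_order_release);
}

/* Called once the stopped CPUs have been restarted. */
void cpustop_end(void)
{
	atomic_store_explicit(&stop_in_flight, 0, memory_order_release);
}

/*
 * Stand-in for hardclock(): the common path pays for one atomic load,
 * and scheduling is skipped while a stop is in flight.
 * Returns true if scheduling work was done on this tick.
 */
bool hardclock_tick(void)
{
	if (atomic_load_explicit(&stop_in_flight, memory_order_acquire) != 0) {
		skipped_ticks++;
		return false;
	}
	scheduled_ticks++;
	return true;
}
```

Calling hardclock_tick() between cpustop_begin() and cpustop_end()
returns false and bumps skipped_ticks; outside that window it returns
true. How the flag would be raised before the NMI is delivered is left
open here.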

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: 7.2-release/amd64: panic, spin lock held too long

2009-07-06 Thread Attilio Rao
2009/7/7 Dan Naumov dan.nau...@gmail.com:
 On Tue, Jul 7, 2009 at 4:18 AM, Attilio Rao atti...@freebsd.org wrote:
 2009/7/7 Dan Naumov dan.nau...@gmail.com:
 I just got a panic followed by a reboot a few seconds after running
 portsnap update; /var/log/messages shows the following:

 Jul  7 03:49:38 atom syslogd: kernel boot file is /boot/kernel/kernel
 Jul  7 03:49:38 atom kernel: spin lock 0x80b3edc0 (sched lock
 1) held by 0xff00017d8370 (tid 100054) too long
 Jul  7 03:49:38 atom kernel: panic: spin lock held too long

 That's a known bug, affecting -CURRENT as well.
 The cpustop IPI is handled through an NMI, which means it can
 interrupt a CPU at any moment, even while it is holding a spinlock,
 violating a well-known FreeBSD rule.
 That means a CPU can stop itself while its thread is holding the
 sched lock spinlock, without ever releasing it (there is no way, short
 of something highly hackish, to fix that).
 In the meantime hardclock() wants to schedule something else to run and
 gets stuck on the thread lock.

 The ideal fix would involve not using an NMI to serve the cpustop, while
 having a cheap way (one that doesn't burden the common path) to tell
 hardclock() to avoid scheduling while a cpustop is in flight.

 Thanks,
 Attilio

 Any idea if a fix is being worked on and how unlucky must one be to
 run into this issue, should I expect it to happen again? Is it
 basically completely random?

I'd like to work on that issue before BETA3 (and backport to
STABLE_7); I'm just time-constrained right now.
It is completely random.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: [nfs] process locks in bo_wwait on 6.4

2009-06-29 Thread Attilio Rao
2009/6/29 pluknet pluk...@gmail.com:
 2009/6/26 pluknet pluk...@gmail.com:
 2009/6/26 pluknet pluk...@gmail.com:
 Hello.

 While building a module on nfs mounted /usr/src
 I got an unkillable process waiting forever in bo_wwait.

 Small note: iface on NFS server has mtu changed from 1500 to 1450.
 Can this be a source of the problem?

 This is 100% reproducible. Lock in the same place. Any hints?

Can you also show the output of ps?
A precise map of what the processes are doing would help.
It would also be useful to print out traces for the other threads, not
only the stuck one.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: [nfs] process locks in bo_wwait on 6.4

2009-06-29 Thread Attilio Rao
2009/6/29 pluknet pluk...@gmail.com:
 2009/6/29 Attilio Rao atti...@freebsd.org:
 2009/6/29 pluknet pluk...@gmail.com:
 2009/6/26 pluknet pluk...@gmail.com:
 2009/6/26 pluknet pluk...@gmail.com:
 Hello.

 While building a module on nfs mounted /usr/src
 I got an unkillable process waiting forever in bo_wwait.

 Small note: iface on NFS server has mtu changed from 1500 to 1450.
 Can this be a source of the problem?

 This is 100% reproducible. Lock in the same place. Any hints?

 Can you also show the output of ps?
 A precise map of what the processes are doing would help.
 It would also be useful to print out traces for the other threads, not
 only the stuck one.


 From another run:

I'm unable to see who would be locking the buffer object in question.
Do you have INVARIANT_SUPPORT/INVARIANTS on?
What revision of /usr/src/sys/kern/vfs_bio.c are you running with?

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: [nfs] process locks in bo_wwait on 6.4

2009-06-29 Thread Attilio Rao
2009/6/29 pluknet pluk...@gmail.com:
 2009/6/29 Attilio Rao atti...@freebsd.org:
 2009/6/29 pluknet pluk...@gmail.com:
 2009/6/29 Attilio Rao atti...@freebsd.org:
 2009/6/29 pluknet pluk...@gmail.com:
 2009/6/26 pluknet pluk...@gmail.com:
 2009/6/26 pluknet pluk...@gmail.com:
 Hello.

 While building a module on nfs mounted /usr/src
 I got an unkillable process waiting forever in bo_wwait.

 Small note: iface on NFS server has mtu changed from 1500 to 1450.
 Can this be a source of the problem?

 This is 100% reproducible. Lock in the same place. Any hints?

 Can you also show the output of ps?
 A precise map of what the processes are doing would help.
 It would also be useful to print out traces for the other threads, not
 only the stuck one.


 From another run:

 I'm unable to see who would be locking the buffer object in question.
 Do you have INVARIANT_SUPPORT/INVARIANTS on?

 Yes, I do both.

 What revision of /usr/src/sys/kern/vfs_bio.c are you running with?


 As of 6.4-R: CVS rev 1.491.2.12.4.1 / SVN rev 183531.

Please try this patch and report.

Thanks,
Attilio

--- src/sys/nfsclient/nfs_vnops.c   2008/02/13 20:44:18 1.281
+++ src/sys/nfsclient/nfs_vnops.c   2008/03/22 09:15:15 1.282
@@ -33,7 +33,7 @@
  */
 
 #include <sys/cdefs.h>
-__FBSDID("$FreeBSD: /usr/local/www/cvsroot/FreeBSD/src/sys/nfsclient/nfs_vnops.c,v 1.281 2008/02/13 20:44:18 attilio Exp $");
+__FBSDID("$FreeBSD: /usr/local/www/cvsroot/FreeBSD/src/sys/nfsclient/nfs_vnops.c,v 1.282 2008/03/22 09:15:15 jeff Exp $");
 
 /*
  * vnode op calls for Sun NFS version 2 and 3
@@ -2736,11 +2736,12 @@ nfs_flush(struct vnode *vp, int waitfor,
 	int i;
 	struct buf *nbp;
 	struct nfsmount *nmp = VFSTONFS(vp->v_mount);
-	int s, error = 0, slptimeo = 0, slpflag = 0, retv, bvecpos;
+	int error = 0, slptimeo = 0, slpflag = 0, retv, bvecpos;
 	int passone = 1;
 	u_quad_t off, endoff, toff;
 	struct ucred* wcred = NULL;
 	struct buf **bvec = NULL;
+	struct bufobj *bo;
 #ifndef NFS_COMMITBVECSIZ
 #define NFS_COMMITBVECSIZ	20
 #endif
@@ -2751,6 +2752,7 @@ nfs_flush(struct vnode *vp, int waitfor,
 		slpflag = PCATCH;
 	if (!commit)
 		passone = 0;
+	bo = &vp->v_bufobj;
 	/*
 	 * A b_flags == (B_DELWRI | B_NEEDCOMMIT) block has been written to the
 	 * server, but has not been committed to stable storage on the server
@@ -2763,15 +2765,14 @@ again:
 	endoff = 0;
 	bvecpos = 0;
 	if (NFS_ISV3(vp) && commit) {
-		s = splbio();
 		if (bvec != NULL && bvec != bvec_on_stack)
 			free(bvec, M_TEMP);
 		/*
 		 * Count up how many buffers waiting for a commit.
 		 */
 		bveccount = 0;
-		VI_LOCK(vp);
-		TAILQ_FOREACH_SAFE(bp, &vp->v_bufobj.bo_dirty.bv_hd, b_bobufs, nbp) {
+		BO_LOCK(bo);
+		TAILQ_FOREACH_SAFE(bp, &bo->bo_dirty.bv_hd, b_bobufs, nbp) {
 			if (!BUF_ISLOCKED(bp) &&
 			    (bp->b_flags & (B_DELWRI | B_NEEDCOMMIT))
 				== (B_DELWRI | B_NEEDCOMMIT))
@@ -2788,11 +2789,11 @@ again:
 			 * Release the vnode interlock to avoid a lock
 			 * order reversal.
 			 */
-			VI_UNLOCK(vp);
+			BO_UNLOCK(bo);
 			bvec = (struct buf **)
 				malloc(bveccount * sizeof(struct buf *),
 				       M_TEMP, M_NOWAIT);
-			VI_LOCK(vp);
+			BO_LOCK(bo);
 			if (bvec == NULL) {
 				bvec = bvec_on_stack;
 				bvecsize = NFS_COMMITBVECSIZ;
@@ -2802,7 +2803,7 @@ again:
 			bvec = bvec_on_stack;
 			bvecsize = NFS_COMMITBVECSIZ;
 		}
-		TAILQ_FOREACH_SAFE(bp, &vp->v_bufobj.bo_dirty.bv_hd, b_bobufs, nbp) {
+		TAILQ_FOREACH_SAFE(bp, &bo->bo_dirty.bv_hd, b_bobufs, nbp) {
 			if (bvecpos >= bvecsize)
 				break;
 			if (BUF_LOCK(bp, LK_EXCLUSIVE | LK_NOWAIT, NULL)) {
@@ -2815,7 +2816,7 @@ again:
 				nbp = TAILQ_NEXT(bp, b_bobufs);
 				continue;
 			}
-			VI_UNLOCK(vp);
+			BO_UNLOCK(bo);
 			bremfree(bp);
 			/*
 			 * Work out if all buffers are using the same cred
@@ -2834,7 +2835,7 @@ again:
 				wcred = NOCRED;
 			vfs_busy_pages(bp, 1);
 
-			VI_LOCK(vp);
+			BO_LOCK(bo);
 			/*
 			 * bp is protected by being

Re: [nfs] process locks in bo_wwait on 6.4

2009-06-29 Thread Attilio Rao
2009/6/29 pluknet pluk...@gmail.com:
 2009/6/29 Attilio Rao atti...@freebsd.org:
 2009/6/29 pluknet pluk...@gmail.com:
 2009/6/29 Attilio Rao atti...@freebsd.org:
 2009/6/29 pluknet pluk...@gmail.com:
 2009/6/29 Attilio Rao atti...@freebsd.org:
 2009/6/29 pluknet pluk...@gmail.com:
 2009/6/26 pluknet pluk...@gmail.com:
 2009/6/26 pluknet pluk...@gmail.com:
 Hello.

 While building a module on nfs mounted /usr/src
 I got an unkillable process waiting forever in bo_wwait.

 Small note: iface on NFS server has mtu changed from 1500 to 1450.
 Can this be a source of the problem?

 This is 100% reproducible. Lock in the same place. Any hints?

 Can you also show the output of ps?
 A precise map of what the processes are doing would help.
 It would also be useful to print out traces for the other threads, not
 only the stuck one.


 From another run:

 I'm unable to see who would be locking the buffer object in question.
 Do you have INVARIANT_SUPPORT/INVARIANTS on?

 Yes, I do both.

 What revision of /usr/src/sys/kern/vfs_bio.c are you running with?


 As of 6.4-R: CVS rev 1.491.2.12.4.1 / SVN rev 183531.

 Please try this patch and report.

 Thanks,
 Attilio

 --- src/sys/nfsclient/nfs_vnops.c   2008/02/13 20:44:18 1.281
 +++ src/sys/nfsclient/nfs_vnops.c   2008/03/22 09:15:15 1.282
 @@ -33,7 +33,7 @@
  */

  #include <sys/cdefs.h>
 -__FBSDID("$FreeBSD: /usr/local/www/cvsroot/FreeBSD/src/sys/nfsclient/nfs_vnops.c,v 1.281 2008/02/13 20:44:18 attilio Exp $");
 +__FBSDID("$FreeBSD: /usr/local/www/cvsroot/FreeBSD/src/sys/nfsclient/nfs_vnops.c,v 1.282 2008/03/22 09:15:15 jeff Exp $");


 Do you refer to the whole svn r177493, or will its nfs part be enough?
 This vfs_vnops.c-only diff does not seem applicable without the
 underlying kernel part changes.

 I'll try. Thanks.

The NFS part should be enough, though I don't understand why it
doesn't trigger a panic on STABLE_6, given that, at least in my
revision, there is an assert in bufobj_wwait() that the buffer object
lock is held. What's your sys/kern/vfs_bio.c rev?
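
To illustrate what such an assert buys, here is a small userland
sketch of the pattern, not the vfs_bio.c code: struct sk_bufobj and the
sk_* functions are hypothetical stand-ins, and the wait routine returns
-1 at the point where a kernel built with INVARIANTS would panic
instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a buffer object and its lock. */
struct sk_bufobj {
	bool locked;     /* true while the (simulated) lock is held */
	int  numoutput;  /* writes still in progress */
};

void sk_bo_lock(struct sk_bufobj *bo)
{
	assert(!bo->locked);   /* no recursion in this sketch */
	bo->locked = true;
}

void sk_bo_unlock(struct sk_bufobj *bo)
{
	assert(bo->locked);
	bo->locked = false;
}

/*
 * Stand-in for a wait routine that requires the lock on entry.
 * Returns -1 where the assertion would fire, otherwise the number of
 * writes still outstanding.
 */
int sk_wwait_check(const struct sk_bufobj *bo)
{
	if (!bo->locked)
		return -1;
	return bo->numoutput;
}
```

A caller that holds some other lock, or none, gets -1 here; with the
real assert compiled in, the same mistake would show up as an immediate
panic rather than a later hang.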

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: [nfs] process locks in bo_wwait on 6.4

2009-06-29 Thread Attilio Rao
2009/6/29 pluknet pluk...@gmail.com:
 2009/6/29 Attilio Rao atti...@freebsd.org:
 2009/6/29 pluknet pluk...@gmail.com:
 2009/6/29 Attilio Rao atti...@freebsd.org:
 2009/6/29 pluknet pluk...@gmail.com:
 2009/6/29 Attilio Rao atti...@freebsd.org:
 2009/6/29 pluknet pluk...@gmail.com:
 2009/6/29 Attilio Rao atti...@freebsd.org:
 2009/6/29 pluknet pluk...@gmail.com:
 2009/6/26 pluknet pluk...@gmail.com:
 2009/6/26 pluknet pluk...@gmail.com:
 Hello.

 While building a module on nfs mounted /usr/src
 I got an unkillable process waiting forever in bo_wwait.

 Small note: iface on NFS server has mtu changed from 1500 to 1450.
 Can this be a source of the problem?

 This is 100% reproducible. Lock in the same place. Any hints?

 Can you also show the output of ps?
 A precise map of what the processes are doing would help.
 It would also be useful to print out traces for the other threads, not
 only the stuck one.


 From another run:

 I'm unable to see who would be locking the buffer object in question.
 Do you have INVARIANT_SUPPORT/INVARIANTS on?

 Yes, I do both.

 What revision of /usr/src/sys/kern/vfs_bio.c are you running with?


 As of 6.4-R: CVS rev 1.491.2.12.4.1 / SVN rev 183531.

 Please try this patch and report.

 Thanks,
 Attilio

 --- src/sys/nfsclient/nfs_vnops.c   2008/02/13 20:44:18 1.281
 +++ src/sys/nfsclient/nfs_vnops.c   2008/03/22 09:15:15 1.282
 @@ -33,7 +33,7 @@
  */

  #include <sys/cdefs.h>
 -__FBSDID("$FreeBSD: /usr/local/www/cvsroot/FreeBSD/src/sys/nfsclient/nfs_vnops.c,v 1.281 2008/02/13 20:44:18 attilio Exp $");
 +__FBSDID("$FreeBSD: /usr/local/www/cvsroot/FreeBSD/src/sys/nfsclient/nfs_vnops.c,v 1.282 2008/03/22 09:15:15 jeff Exp $");


 Do you refer to the whole svn r177493, or will its nfs part be enough?
 This vfs_vnops.c-only diff does not seem applicable without the
 underlying kernel part changes.

 I'll try. Thanks.

 The NFS part should be enough, though I don't understand why it
 doesn't trigger a panic on STABLE_6, given that, at least in my
 revision, there is an assert in bufobj_wwait() that the buffer object
 lock is held. What's your sys/kern/vfs_bio.c rev?


 As of 6.4-R.
 $FreeBSD: src/sys/kern/vfs_bio.c,v 1.491.2.12.4.1 2008/10/02 02:57:24
 kensmith Exp $

That's it: that revision doesn't have the assert.
If it does fix the problem for you, I will let you test a more
comprehensive patch, as there is at least one more fix I want to
bring in along with this one (and the related asserts).

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: [nfs] process locks in bo_wwait on 6.4

2009-06-29 Thread Attilio Rao
2009/6/29 Attilio Rao atti...@freebsd.org:
 2009/6/29 pluknet pluk...@gmail.com:
 2009/6/29 Attilio Rao atti...@freebsd.org:
 2009/6/29 pluknet pluk...@gmail.com:
 2009/6/29 Attilio Rao atti...@freebsd.org:
 2009/6/29 pluknet pluk...@gmail.com:
 2009/6/29 Attilio Rao atti...@freebsd.org:
 2009/6/29 pluknet pluk...@gmail.com:
 2009/6/29 Attilio Rao atti...@freebsd.org:
 2009/6/29 pluknet pluk...@gmail.com:
 2009/6/26 pluknet pluk...@gmail.com:
 2009/6/26 pluknet pluk...@gmail.com:
 Hello.

 While building a module on nfs mounted /usr/src
 I got an unkillable process waiting forever in bo_wwait.

 Small note: iface on NFS server has mtu changed from 1500 to 1450.
 Can this be a source of the problem?

 This is 100% reproducible. Lock in the same place. Any hints?

 Can you also show the output of ps?
 A precise map of what the processes are doing would help.
 It would also be useful to print out traces for the other threads, not
 only the stuck one.


 From another run:

 I'm unable to see who would be locking the buffer object in question.
 Do you have INVARIANT_SUPPORT/INVARIANTS on?

 Yes, I do both.

 What revision of /usr/src/sys/kern/vfs_bio.c are you running with?


 As of 6.4-R: CVS rev 1.491.2.12.4.1 / SVN rev 183531.

 Please try this patch and report.

 Thanks,
 Attilio

 --- src/sys/nfsclient/nfs_vnops.c   2008/02/13 20:44:18 1.281
 +++ src/sys/nfsclient/nfs_vnops.c   2008/03/22 09:15:15 1.282
 @@ -33,7 +33,7 @@
  */

  #include <sys/cdefs.h>
 -__FBSDID("$FreeBSD: /usr/local/www/cvsroot/FreeBSD/src/sys/nfsclient/nfs_vnops.c,v 1.281 2008/02/13 20:44:18 attilio Exp $");
 +__FBSDID("$FreeBSD: /usr/local/www/cvsroot/FreeBSD/src/sys/nfsclient/nfs_vnops.c,v 1.282 2008/03/22 09:15:15 jeff Exp $");


 Do you refer to the whole svn r177493, or will its nfs part be enough?
 This vfs_vnops.c-only diff does not seem applicable without the
 underlying kernel part changes.

 I'll try. Thanks.

 The NFS part should be enough, though I don't understand why it
 doesn't trigger a panic on STABLE_6, given that, at least in my
 revision, there is an assert in bufobj_wwait() that the buffer object
 lock is held. What's your sys/kern/vfs_bio.c rev?


 As of 6.4-R.
 $FreeBSD: src/sys/kern/vfs_bio.c,v 1.491.2.12.4.1 2008/10/02 02:57:24
 kensmith Exp $

 That's it: that revision doesn't have the assert.
 If it does fix the problem for you, I will let you test a more
 comprehensive patch, as there is at least one more fix I want to
 bring in along with this one (and the related asserts).

Uhm, wait: after looking at the code more closely, I don't think this
patch can fix your problem.
I will let you know once I have had a bit more time to study the deadlock.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: Big problem still remains with 7.2-STABLE locking up

2009-06-09 Thread Attilio Rao
2009/6/10 NAKAJI Hiroyuki nak...@jp.freebsd.org:
 Thanks Attilio,

 I set up dcons target/host pair. Target is 7.2-STABLE and host is
 6.4-STABLE.

 Dcons session was recorded with script.
 http://www.heimat.gr.jp/localhost/dcons.log

I'm following up privately with the user; news to come, hopefully.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: Big problem still remains with 7.2-STABLE locking up

2009-06-06 Thread Attilio Rao
2009/6/6 NAKAJI Hiroyuki nak...@jp.freebsd.org:
 Hi,

 I noticed, some months ago, frequent lockups on my RELENG_6 server with
 ECS PM800-M2, Celeron 2.6GHz (UP), 2GB ram, ATA HDDs and 3Com NIC(xl0),
 and then I gave up this old server.

 Last month, I replaced this 'unstable' server with a new one running
 7.2-RELEASE, which worked very well until I set it up as 'a server'. The
 problem began just after it started 'the services'.

 My story is very similar to Pete's.
 http://lists.freebsd.org/pipermail/freebsd-stable/2009-January/047487.html

 I followed some instructions in the list thread. But unfortunately, the
 big problem still remains. 7.2-STABLE server locks up frequently.

 Help! :-(

 The server is NEC Express5800 S70/SD.

 o CPU: Intel(R) Celeron(R) CPU 440 @ 2.00GHz (2280.25-MHz K8-class CPU)
 o 6GB RAM
 o ACPI APIC Table: NEC DT20
 o 80GB and 250GB SATA HDDs
 o http://www.heimat.gr.jp/~nakaji/localhost/dmesg.boot

 The kernel configuration is:

 include GENERIC
 ident   HEIMAT
 options MSGBUF_SIZE=81920
 makeoptions     DEBUG=-g
 options KDB
 options DDB
 options BREAK_TO_DEBUGGER
 options QUOTA

Were you unmounting any of the QUOTA'ed filesystems?
I'm aware of a possible deadlock between the quota and unmount paths,
which is very difficult to trigger though.

Anyway, the only way we have to debug this is with some help from
the user.
1) Drop the option WITNESS_SKIPSPIN (as we would like to debug
spinlocks too) and LOCK_PROFILING (in order to create higher
contention and kill some barriers)
2) Once you get the deadlock, break into the DDB debugger
3) Once you are in DDB, the information that could be very useful is:
db> show allpcpu
db> show alllocks
db> show lockedvnods
db> ps
db> allthreads

Note that this is a lot of output, so you won't be able to collect
all of this information without a serial connection.
4) Take a dump so that we can look further at the state of the lock
structures once we identify something useful (ideally, keeping the machine
up in DDB for that would be very useful, but it is often not viable)

Let me know.
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: kern/130330: [mpt] [panic] Panic and reboot machine MPT ...

2009-05-21 Thread Attilio Rao
2009/5/21 Riccardo Torrini riccardo.torr...@esaote.com:
 On Wed, May 20, 2009 at 10:21:23AM -0400, John Baldwin wrote:

 Try this.  It reverts the single-CCB part of the previous
 commit while keeping the other fixes.  I missed that the
 CCB might still be in flight when we schedule another rescan.

 Applied to mpt_raid.c,v 1.15.2.1 2008/07/28 17:05:09 jhb (it
 differs only in line positions, but the adjacent lines are the same).
 Also redid a diff -u4 to verify, recompiled, installed, and...

 YOO-HOO.  Now it rebuilds _without_ crashing.

 May 20 17:39:08 horse kernel: \
mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
mpt0:vol0(mpt0:0:0): Status ( Enabled Re-Syncing )
mpt0:vol0(mpt0:0:0): Low Priority Re-Sync
mpt0:vol0(mpt0:0:0): 64461754 of 71087625 blocks remaining

 Let me test against a 7.2-STABLE (and even to some -CURRENT)...

 [some time later]

 Bad news: I removed the second disk during the rebuild and it
 still crashes.  I took a screen snapshot with a camera because there
 were too many messages to write down by hand  :)

 Image, src tarball and info here (about 2.2MB):
 ftp://ftp.torrini.org/pub/FreeBSD/mpt_crash_on_rebuild/

Please try the patch here:
http://www.freebsd.org/~attilio/notify.diff

I think this approach is perfectly fine, because devctl_notify()
will also silently fail if no memory is available.
Note that this is more a CAM bug than one that the driver raises.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


[HEADS UP] lockmgr: use of sys/lockmgr.h in third-party code

2008-05-05 Thread Attilio Rao
Hello,
after the MFC of the LOCK_FILE and LOCK_LINE usage for lockmgr(9),
third-party code now needs to include <sys/lock.h> just before
<sys/lockmgr.h>.
Even if the patch doesn't break the ABI / KPI (so third-party KLDs
don't need to be recompiled), it is worth noting that new code needs
this extra care in order to be fully compliant.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: Lock Order Reversal on 7.0-STABLE with pf and ipfw / dummynet (traces)

2008-03-27 Thread Attilio Rao
2008/3/25, Max Laier [EMAIL PROTECTED]:
 Hi Alex,

  so it's basically back to square one.  We only have LORs between the pfil
  R/W lock (read instance) and mutexes that don't have any lock order with
  the pfil R/W lock (write instance) at all.  This means the deadlock can't
  be explained by the LORs that are reported (unless there is something I'm
  missing).  Unless somebody who is seeing these kind of deadlocks can
  actually break into a debugger to identify the locks at play, everything
  else is just speculation.

  I will fix the fastroute LOR with the patch you have been testing,
  even though it didn't fix your problem.  For the remaining issue, we need
  more IPFW or lock primitives knowledge (extending CC-list).

  Note that the first LOR features a recursive pickup of the pfil R/W lock.
  I remember that Attilio committed a patch to forbid this for CURRENT.
  Could this be the cause of a deadlock?  Would it make sense to MFC
  rm_locks and try if they hold up under this scenario?

I decided not to commit this patch to CURRENT, based on Robert's
feedback that read recursion in the network stack is (will be?)
fundamental.
Still, it likely would not explain the deadlock.
As you point out, the best thing would be to use a machine with stock
CVS + DDB + INVARIANTS and check the state of the threads and of the
locks.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: kqueue LOR

2006-12-12 Thread Attilio Rao

2006/12/12, Kostik Belousov [EMAIL PROTECTED]:

On Tue, Dec 12, 2006 at 12:44:54AM -0800, Suleiman Souhlal wrote:
 Kostik Belousov wrote:
 On Sun, Nov 26, 2006 at 09:30:39AM +0100, Václav Haisman wrote:
 
 Hi,
 the attached lor.txt contains LOR I got this yesterday. It is FreeBSD 6.1
 with relatively recent kernel, from last week or so.
 
 --
 VH
 
 
 +lock order reversal:
 + 1st 0xc537f300 kqueue (kqueue) @ /usr/src/sys/kern/kern_event.c:1547
 + 2nd 0xc45c22dc struct mount mtx (struct mount mtx) @
 /usr/src/sys/ufs/ufs/ufs_vnops.c:138
 +KDB: stack backtrace:
 +kdb_backtrace(c07f9879,c45c22dc,c07fd31c,c07fd31c,c080c7b2,...) at
 kdb_backtrace+0x2f
 +witness_checkorder(c45c22dc,9,c080c7b2,8a,c07fc6bd,...) at
 witness_checkorder+0x5fe
 +_mtx_lock_flags(c45c22dc,0,c080c7b2,8a,e790ba20,...) at
 _mtx_lock_flags+0x32
 +ufs_itimes(c47a0dd0,c47a0e90,e790ba78,c060e1cc,c47a0dd0,...) at
 ufs_itimes+0x6c
 +ufs_getattr(e790ba54,e790baec,c0622af6,c0896f40,e790ba54,...) at
 ufs_getattr+0x20
 +VOP_GETATTR_APV(c0896f40,e790ba54,c08a5760,c47a0dd0,e790ba74,...) at
 VOP_GETATTR_APV+0x3a
 +filt_vfsread(c4cf261c,6,c07f445e,60b,0,...) at filt_vfsread+0x75
 +knote(c4f57114,6,1,1f30c2af,1f30c2af,...) at knote+0x75
 +VOP_WRITE_APV(c0896f40,e790bbec,c47a0dd0,227,e790bcb4,...) at
 VOP_WRITE_APV+0x148
 +vn_write(c45d5120,e790bcb4,c5802a00,0,c4b73a80,...) at vn_write+0x201
 +dofilewrite(c4b73a80,1b,c45d5120,e790bcb4,,...) at
 dofilewrite+0x84
 +kern_writev(c4b73a80,1b,e790bcb4,8220c71,0,...) at kern_writev+0x65
 +write(c4b73a80,e790bd04,c,c07d899c,3,...) at write+0x4f
 +syscall(3b,3b,bfbf003b,0,bfbfeae4,...) at syscall+0x295
 +Xint0x80_syscall() at Xint0x80_syscall+0x1f
 +--- syscall (4, FreeBSD ELF32, write), eip = 0x2831d727, esp =
 0xbfbfea1c, ebp = 0xbfbfea48 ---
 
 
 Thank you for the report. The LOR is caused by my commit into
 sys/ufs/ufs/ufs_vnops.c, rev. 1.280.

 Is the mount lock really required, if all we're doing is a single read of a
 single word (mnt_kern_flags) (v_mount should be read-only for the whole
 lifetime of the vnode, I believe)? After all, reads of a single word are
 atomic on all our supported architectures.
 The only situation I see where there MIGHT be problems are forced unmounts,
 but I think there are bigger issues with those.
 Sorry for noticing this email only now.

The problem is real with snapshotting. Ignoring
MNTK_SUSPEND/MNTK_SUSPENDED flags (in particular, reading stale value of
mnt_kern_flag) while setting IN_MODIFIED caused deadlock at ufs vnode
inactivation time. This was the big trouble with nfsd and snapshots. As
such, I think that the precise value of mnt_kern_flag is critical there,
and the mount interlock is needed.


This can be avoided by using a memory barrier when setting the flags.
Even if the use of memory barriers is not encouraged, some critical code
should really use them in place of mutex semantics (if it is worth it).
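
As a rough illustration of what I mean, here is a userland sketch with
C11 atomics rather than the kernel's atomic(9) primitives. Everything
in it is hypothetical: struct sk_mount, the SK_* flag values and the
sk_* functions are stand-ins, not the real struct mount or MNTK_*
definitions. The writer publishes flag changes with release semantics
and the reader picks them up with an acquire load, so no mutex is taken
on the read side.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the suspend-related mount flags. */
#define SK_SUSPEND   0x1u
#define SK_SUSPENDED 0x2u

/* Hypothetical stand-in for the flag word of a mount point. */
struct sk_mount {
	atomic_uint kern_flag;
};

/* Writer side: set a flag; the release ordering publishes prior writes. */
void sk_set_flag(struct sk_mount *mp, unsigned int flag)
{
	atomic_fetch_or_explicit(&mp->kern_flag, flag, memory_order_release);
}

/* Writer side: clear a flag with the same ordering. */
void sk_clear_flag(struct sk_mount *mp, unsigned int flag)
{
	atomic_fetch_and_explicit(&mp->kern_flag, ~flag, memory_order_release);
}

/* Reader side: one acquire load, no lock taken. */
bool sk_is_suspending(struct sk_mount *mp)
{
	unsigned int f;

	f = atomic_load_explicit(&mp->kern_flag, memory_order_acquire);
	return (f & (SK_SUSPEND | SK_SUSPENDED)) != 0;
}
```

Whether this ordering alone is enough for the snapshot case described
above is not something the sketch settles; it only shows the shape of
the barrier-based alternative to taking the mount mutex for a
single-word read.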

Attilio

--
Peace can only be achieved by understanding - A. Einstein

