Re: make release fails on find

2012-10-31 Thread Andreas Nilsson
On Wed, Oct 31, 2012 at 4:07 AM, Glen Barber g...@freebsd.org wrote:

 On Tue, Oct 30, 2012 at 08:11:15PM -0400, Glen Barber wrote:
   Oops, my bad. Yes, exact same behavior; make -C release cdrom fails with
   ...
   find //tank/cvs/9.1/src/release/dist/doc -empty -delete
   find //tank/cvs/9.1/src/release/dist/games -empty -delete
   find: -delete: //tank/cvs/9.1/src/release/dist/games: relative path potentially not safe
   *** [distributeworld] Error code 1
   on 9.1-RC3. I can try with 9-stable as well (tomorrow).
  
 
  Ok, thanks.  I do not want to assume anything more at this point.
 
  I am still waiting for my build machine to finish a few queued things.
  Once it frees up, I will roll a release using sudo (just for my own
  sanity), and without sudo, with your src.conf and make.conf.
 
  Anyway, thanks for all of the details you have provided.  It is all
  helpful, and hopefully this will finally be tracked down.
 

 Ugh...  Ok, so this is my fault.

 I do not remember why, specifically, but the change in question was not
 merged to the releng/9.1 branch.

 Please try the following, in the top-level directory of your releng/9.1
 source checkout:

 svn merge -c240077 ^/head/Makefile.inc1 Makefile.inc1

 It worked for me fine.  Unfortunately, it is far too late in the release
 cycle for that change to make it into 9.1-RELEASE.

Great that you found the bug!


 Unfortunately, this does not have anything to do with the recursing in
 the usr/src tarball.  Please let me know if you continue to see that
 happen, as this is the _single_ most reported issue that I have had zero
 luck reproducing...

With just the merge above, 9.1-RC3 now ends up recursing. (I just tried
them one at a time.)

 Thanks.

 Glen

 PS:  Sorry about being the cause of your release build failure...

No problem really :) Thanks for hunting this down now.

/A
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: make release fails on find

2012-10-31 Thread Andreas Nilsson
On Wed, Oct 31, 2012 at 4:35 AM, Glen Barber g...@freebsd.org wrote:

 On Tue, Oct 30, 2012 at 11:19:12PM -0400, Glen Barber wrote:
  So, please also do:
 
  svn merge -c241451 ^/head/release release
 

 You'll want to merge one more revision:

 svn merge -c241596 ^/head/release release

 Same as before - I _think_ this should work. :-)

 Glen



Excellent :) That did the trick, i.e. no recursion :) Thank you very much for
finding the bugs.

Will this be merged to 9-stable?

On a more wishlist topic: I'd really appreciate it if .zfs dirs would be
excluded from the tarballs.


Best regards
Andreas
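
For reference, the three merges mentioned in this thread can be collected into one small script. This is only a sketch: the function name merge_release_fixes and the RUN switch are made up for illustration, and it assumes it is run from the top-level directory of a releng/9.1 checkout with svn installed. By default it only prints the commands.

```sh
#!/bin/sh
# Hypothetical helper: print (or, with RUN=1, execute) the three merges
# from this thread, in the order they were suggested.
merge_release_fixes() {
    run=echo                      # dry run by default
    [ "${RUN:-0}" = 1 ] && run=   # RUN=1 executes svn for real
    $run svn merge -c240077 '^/head/Makefile.inc1' Makefile.inc1
    $run svn merge -c241451 '^/head/release' release
    $run svn merge -c241596 '^/head/release' release
}
```

Calling merge_release_fixes prints the three commands for review; RUN=1 merge_release_fixes would run them.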


9-Stable panic: resource_list_unreserve: can't find resource

2012-10-31 Thread Tom Lislegaard
Hi

I'm running
FreeBSD stingray 9.1-PRERELEASE FreeBSD 9.1-PRERELEASE #3: Mon Oct 29 16:11:35 CET 2012 tl@stingray:/usr/obj/usr/src/sys/stingray  amd64
on a new Dell laptop and keep getting these panics (typically once or twice per day)

(kgdb) set pagination off
(kgdb) bt
#0  doadump (textdump=Variable textdump is not available.
) at pcpu.h:229
#1  0x80425e64 in kern_reboot (howto=260) at 
/usr/src/sys/kern/kern_shutdown.c:448
#2  0x8042634c in panic (fmt=0x1 Address 0x1 out of bounds) at 
/usr/src/sys/kern/kern_shutdown.c:636
#3  0x8045773e in resource_list_unreserve (rl=Variable rl is not 
available.
) at /usr/src/sys/kern/subr_bus.c:3338
#4  0x802c3ee4 in acpi_delete_resource (bus=0xfe00052c1100, 
child=0xfe00052c1500, type=4, rid=3323) at 
/usr/src/sys/dev/acpica/acpi.c:1405
#5  0x802c62bc in acpi_bus_alloc_gas (dev=0xfe00052c1500, 
type=0xfe00052b786c, rid=0xfe00052b7978, gas=Variable gas is not 
available.
) at /usr/src/sys/dev/acpica/acpi.c:1450
#6  0x802d1663 in acpi_PkgGas (dev=0xfe00052c1500, res=Variable 
res is not available.
) at /usr/src/sys/dev/acpica/acpi_package.c:120
#7  0x802cbf6b in acpi_cpu_cx_cst (sc=0xfe00052b7800) at 
/usr/src/sys/dev/acpica/acpi_cpu.c:782
#8  0x802cc3a4 in acpi_cpu_notify (h=Variable h is not available.
) at /usr/src/sys/dev/acpica/acpi_cpu.c:1050
#9  0x802a3fca in AcpiEvNotifyDispatch (Context=0x0) at 
/usr/src/sys/contrib/dev/acpica/events/evmisc.c:283
#10 0x802c26c3 in acpi_task_execute (context=0xfe00051d6800, 
pending=Variable pending is not available.
) at /usr/src/sys/dev/acpica/Osd/OsdSchedule.c:134
#11 0x804683c4 in taskqueue_run_locked (queue=0xfe00052bc100) at 
/usr/src/sys/kern/subr_taskqueue.c:308
#12 0x80469366 in taskqueue_thread_loop (arg=Variable arg is not 
available.
) at /usr/src/sys/kern/subr_taskqueue.c:497
#13 0x803f762f in fork_exit (callout=0x80469320 
taskqueue_thread_loop, arg=0x80a20cc8, frame=0xff80002cdb00) at 
/usr/src/sys/kern/kern_fork.c:992
#14 0x806be6be in fork_trampoline () at 
/usr/src/sys/amd64/amd64/exception.S:602
[frames #15 through #45 omitted: unresolved addresses, all reported as "in ?? ()"]
#46 0x8044e9b9 in sched_switch (td=0x80469320, 
newtd=0x80a20cc8, flags=Variable flags is not available.
) at /usr/src/sys/kern/sched_ule.c:1913
Previous frame inner to this frame (corrupt stack?)

Hardware details are as follows

Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 9.1-PRERELEASE #3: Mon Oct 29 16:11:35 CET 2012
tl@stingray:/usr/obj/usr/src/sys/stingray amd64
CPU: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz (2591.64-MHz K8-class CPU)
  Origin = GenuineIntel  Id = 0x306a9  Family = 0x6  Model = 0x3a  Stepping = 9
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x7fbae3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant, performance statistics
real memory  = 8589934592 (8192 MB)
avail memory = 8166604800 (7788 MB)
Event timer LAPIC quality 600
ACPI APIC Table: <DELL   CBX3   >
FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs
FreeBSD/SMP: 1 package(s) x 4 core(s) x 2 SMT threads
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
 cpu2 (AP): APIC ID:  2
 cpu3 (AP): APIC ID:  3
 cpu4 (AP): APIC ID:  4
 cpu5 (AP): APIC ID:  5
 cpu6 (AP): APIC ID:  6
 cpu7 (AP): APIC ID:  7
ioapic0 Version 2.0 irqs 0-23 on motherboard
kbd1 at kbdmux0
acpi0: <DELL CBX3> on motherboard
acpi0: Power Button (fixed)
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
cpu2: <ACPI CPU> on acpi0
cpu3: ACPI 

Re: lock violation in unionfs (9.0-STABLE r230270)

2012-10-31 Thread Harald Schmalzbauer
 Attilio Rao wrote on 29.10.2012 23:02 (localtime):
 On Mon, Oct 29, 2012 at 7:37 PM, Harald Schmalzbauer
 h.schmalzba...@omnilan.de wrote:
  Attilio Rao wrote on 27.10.2012 23:07 (localtime):
 On Sat, Oct 27, 2012 at 9:46 PM, Attilio Rao atti...@freebsd.org wrote:
 On Sat, Sep 8, 2012 at 12:48 AM, Attilio Rao atti...@freebsd.org wrote:
 On Thu, Sep 6, 2012 at 4:52 PM, Harald Schmalzbauer
 h.schmalzba...@omnilan.de wrote:
  Attilio Rao wrote on 09.08.2012 20:26 (localtime):
 On 8/8/12, Harald Schmalzbauer h.schmalzba...@omnilan.de wrote:
  Pavel Polyakov wrote on 06.03.2012 11:20 (localtime):
 mount -t unionfs -o noatime /usr /mnt

 insmntque: mp-safe fs and non-locked vp: 0xfe01d96704f0 is not
 exclusive locked but should be
 KDB: enter: lock violation
 Pavel,
 can you give a spin to this patch?:
 http://www.freebsd.org/~attilio/unionfs_missing_insmntque_lock.patch

 I think that the unlocking is due at that point as the vnode lock can
 be switched later on.

 Let me know what you think about it and what the test does.
 Thanks!
 This patch fixes the problem with lock violation. Sorry I've tested it so
 late.
 Hello,

 this patch still applies cleanly to RELENG_9_1. Was there another fix
 for the issue or has it just not been PR-sent and thus forgotten?
 Can you and Pavel try the attached patch? Unfortunately I had no time
 to test it, I just made it in 5 free minutes from a non-FreeBSD workstation,
 Sorry, couldn't test earlier, but now I did:
 With this patch applied the machine hangs without debug kernel and the
 latter gives the following panic:
 System call nmount returning with the following locks held:
 exclusive lockmgr ufs (ufs) r = 0 (0xc5438278) locked @
 src/sys/fs/unionfs/union_vnops.c:1938
 panic: witness_warn
 cpuid = 0
 KDB: stack backtrace:
 db_trace_self_wrapper(c0a04f7f,c0c112c4,d1de3bb4,c097aa8c,fc,...) at
 db_trace_self_wrapper+0x26
 kdb_backtrace(c0a4965f,0,c09c2ede3c1c,0,...) at kdb_backtrace+0x2a
 witness_warn(2,0,c0a4ac34,c0a0990a,286,...) at witness_warn+0x1e4
 syscall(d1de3d08) at syscall+0x415
 Xint0x80_syscall() at Xint0x80_syscall+0x21
 --- syscall (0, FreeBSD ELF32, nosys), eip = 0x280b883f, esp =
 0xbfbfe46c, ebp = 0xbfbfede8 ---
 KDB: enter: panic
 [ thread pid 86 tid 100054 ]
 Stopped at kdb_enter+0x3a: movl $0,kdb_why
 db> bt
 Tracing pid 86 tid 100054 td 0xc541b000
 kdb_enter(c0a00d16,c0a09130,0,0,0,...) at panic+0x190
 witness_warn(2,0,c0a4ac34,c0a0990a,286,...) at witness_warn+0x1e4
 syscall(d1de3d08) at syscall+0x415
 Xint0x80_syscall() at Xint0x80_syscall+0x21

 Hmm, I guess I forgot to install kernel debug symbols...
 Coming back if I have more
 Unfortunately unionfs does very wrong things with the insmntque() locking.
 It basically expects the vnode to return locked in the same way
 requested by the precedent namei() (when that happens) but when you do
 insmntque() you can only have an LK_EXCLUSIVE lock on the vnode.
 Hello,
 the following patch should work out the issues around unionfs_nodeget() a
 bit:
 http://www.freebsd.org/~attilio/unionfs_nodeget2.patch

 Unfortunately unionfs code is rather messy in the lookup path about
 locking requirements, so following what needs to be done there is a bit
 difficult.
 I have no way to test this patch, so it is just test-compiled at the
 moment, but I would need you to also test the lookup path (so directory
 ls, find(1) on the whole unionfs volume, etc.) to validate it
 somehow.
 On a second thought, I think that locking in lookup (and also other
 operations) is so fragile and difficult to follow that it makes all
 vnops real locking landmines.
 I think that the following patch fixes the insmntque insertion and
 follows the old approach well enough to be committed separately:
 http://www.freebsd.org/~attilio/unionfs_nodeget3.patch

 Unfortunately I have no idea about all those locking strategies and
 implementations.
 Applying unionfs_nodeget3.patch results in:
 sys/fs/unionfs/union_subr.c: In function 'unionfs_nodeget':
 sys/fs/unionfs/union_subr.c:332: error: expected statement
 before ')' token
 *** [union_subr.o] Error code 1

 I guess there is a typo in this chunk:
 @@ -317,11 +328,11 @@ unionfs_nodeget(struct mount *mp, struct vnode *up

 vref(vp);
 } else
 *vpp = vp;
 -
 -unionfs_nodeget_out:
 -   if (lkflags & LK_TYPE_MASK)
 -   vn_lock(vp, lkflags | LK_RETRY);
 -
 +   if (lkflags & LK_TYPE_MASK) {
 +   if (lkflags == LK_SHARED))
  ^
 +   vn_lock(vp, LK_DOWNGRADE | LK_RETRY);
 +   } else
 +   VOP_UNLOCK(vp, LK_RELEASE);
 return (0);
  }

 After removing the second right parenthesis kernel compiles.
 But it still crashes:
 panic: Lock (lockmgr) ufs not locked @ sys/kern/vfs_default.c:512
 cpuid = 1
 KDB: stack backtrace:
 ...
 If you can use the bt info I'll transcribe - no serial console available :-(

 Am I right that I should only 

Re: make release fails on find

2012-10-31 Thread Andreas Nilsson
First, let me state the status more clearly: solved :) Big thanks for fixing
it.

On a side note, how has re-team not run into this?

Best regards
Andreas


Re: make release fails on find

2012-10-31 Thread Glen Barber
On Wed, Oct 31, 2012 at 02:07:56PM +0100, Andreas Nilsson wrote:
 First, let me state the status more clearly: solved :) Big thanks for fixing
 it.
 

Glad to help.  To answer one of your previous questions, I've already
merged this to stable/9.

 On a side note, how has re-team not run into this?
 

No, the releases are built within a chroot, and this issue is specific
to a few edge-cases outside of that environment.

Glen





Re: make release fails on find

2012-10-31 Thread Glen Barber
On Wed, Oct 31, 2012 at 08:30:29AM +0100, Andreas Nilsson wrote:
 On a more wishlist topic: I'd really appreciate it if .zfs dirs would be
 excluded from the tarballs.
 

Hmm, I didn't realize this was happening.

So I can verify my change works for all environments, are you using any
local zfs dataset properties, specifically unhiding the snapshot
directory?
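
Whether .zfs shows up in a directory walk depends on the dataset's snapdir property (zfs get snapdir <dataset> reports hidden or visible). Independent of that setting, an archiving step can skip the directory explicitly. A minimal sketch, assuming a tar that accepts --exclude (bsdtar and GNU tar both do); the function name and paths are placeholders, and this is not necessarily how the release Makefile builds its tarballs:

```sh
#!/bin/sh
# Hypothetical helper: archive a source tree while skipping any
# directory named .zfs (the ZFS snapshot control directory).
archive_without_zfs() {
    srcdir=$1     # tree to archive, e.g. a src checkout
    tarball=$2    # output file
    tar -cf "$tarball" --exclude .zfs -C "$srcdir" .
}
```

For example, archive_without_zfs /usr/src /tmp/src.tar would produce an archive with no .zfs entries even when snapdir is set to visible.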

Glen





ACPI Error: No handler for Region [POWS] (0xffffff000994f380) [IPMI] on Cisco UCS C200 M2

2012-10-31 Thread Miroslav Lachman

Hi,

I am getting the following error on server Cisco UCS C200 M2 running 
FreeBSD 8.3 amd64



Oct 31 02:15:22 ucs200 kernel: ACPI Error: No handler for Region [POWS] 
(0xff000994f380) [IPMI] (20101013/evregion-487)
Oct 31 02:15:22 ucs200 kernel: ACPI Error: Region IPMI(0x7) has no 
handler (20101013/exfldio-382)
Oct 31 02:15:22 ucs200 kernel: ACPI Error: Method parse/execution failed 
[\_SB_.PCI0.LPC0.P111._PSR] (Node 0xff0009934080), AE_NOT_EXIST 
(20101013/psparse-633)
Oct 31 02:15:23 ucs200 kernel: ACPI Error: No handler for Region [POWS] 
(0xff000994f380) [IPMI] (20101013/evregion-487)
Oct 31 02:15:23 ucs200 kernel: ACPI Error: Region IPMI(0x7) has no 
handler (20101013/exfldio-382)
Oct 31 02:15:23 ucs200 kernel: ACPI Error: Method parse/execution failed 
[\_SB_.PCI0.LPC0.P111._PSR] (Node 0xff0009934080), AE_NOT_EXIST 
(20101013/psparse-633)
Oct 31 02:15:23 ucs200 kernel: ACPI Error: No handler for Region [POWS] 
(0xff000994f380) [IPMI] (20101013/evregion-487)
Oct 31 02:15:23 ucs200 kernel: ACPI Error: Region IPMI(0x7) has no 
handler (20101013/exfldio-382)
Oct 31 02:15:23 ucs200 kernel: ACPI Error: Method parse/execution failed 
[\_SB_.PCI0.LPC0.P111._PSR] (Node 0xff0009934080), AE_NOT_EXIST 
(20101013/psparse-633)



# uname -srmi
FreeBSD 8.3-RELEASE amd64 GENERIC

I don't know what it means. Should I be worried about it or should I 
ignore it?
Is there something that I can tune to turn this message off, or is there 
something which needs to be fixed on the FreeBSD side?


We are planning to put this machine into production in one or two 
weeks, but until then I can test patches etc.


Miroslav Lachman




Re: ZFS corruption due to lack of space?

2012-10-31 Thread Steven Hartland

Other info:
zpool list tank2
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
tank2    19T  18.7T   304G    98%  1.00x  ONLINE  -

zfs list tank2
NAME    USED  AVAIL  REFER  MOUNTPOINT
tank2  13.3T      0  13.3T  /tank2

Running: 8.3-RELEASE-p4, zpool: v28, zfs: v5


- Original Message - 
From: Steven Hartland ste...@multiplay.co.uk

To: freebsd-stable@freebsd.org; freebsd...@freebsd.org
Sent: Wednesday, October 31, 2012 5:25 PM
Subject: ZFS corruption due to lack of space?



Been running some tests on new hardware here to verify all
is good. One of the tests was to fill the zfs array which
seems like it's totally corrupted the tank.

The HW is 7 x 3TB disks in RAIDZ2 with dual 13GB ZIL
partitions and dual 100GB L2ARC on Enterprise SSD's.

All disks are connected to an LSI 2208 RAID controller
run by the mfi driver. HD's via a SAS2X28 backplane and
SSD's via a passive backplane.

The file system has 31 test files, most with random data from
/dev/random and one blank from /dev/zero.

The test run was ~20 simultaneous dd's under screen, with
all but one reading from /dev/random and the final one from /dev/zero

e.g. dd if=/dev/random bs=1m of=/tank2/random10

No hardware errors have been raised, so no disk timeouts etc.

On completion each dd reported no space as you would expect
e.g. dd if=/dev/random bs=1m of=/tank2/random13
dd: /tank2/random13: No space left on device
503478+0 records in
503477+0 records out
527933898752 bytes transferred in 126718.731762 secs (4166187 bytes/sec)
You have new mail.

At that point with the test seemingly successful I went
to delete test files which resulted in:-
rm random*
rm: random1: Unknown error: 122
rm: random10: Unknown error: 122
rm: random11: Unknown error: 122
rm: random12: Unknown error: 122
rm: random13: Unknown error: 122
rm: random14: Unknown error: 122
rm: random18: Unknown error: 122
rm: random2: Unknown error: 122
rm: random3: Unknown error: 122
rm: random4: Unknown error: 122
rm: random5: Unknown error: 122
rm: random6: Unknown error: 122
rm: random7: Unknown error: 122
rm: random9: Unknown error: 122

Error 122 I assume is ECKSUM

At this point the pool was showing checksum errors
zpool status
 pool: tank
state: ONLINE
 scan: none requested
config:

   NAME                                            STATE   READ WRITE CKSUM
   tank                                            ONLINE     0     0     0
     mirror-0                                      ONLINE     0     0     0
       gptid/41fb7e5c-21cf-11e2-92a3-002590881138  ONLINE     0     0     0
       gptid/42a1b53c-21cf-11e2-92a3-002590881138  ONLINE     0     0     0

errors: No known data errors

 pool: tank2
state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
   entire pool from backup.
  see: http://www.sun.com/msg/ZFS-8000-8A
 scan: none requested
config:

   NAME           STATE   READ WRITE CKSUM
   tank2          ONLINE     0     0 4.22K
     raidz2-0     ONLINE     0     0 16.9K
       mfisyspd0  ONLINE     0     0     0
       mfisyspd1  ONLINE     0     0     0
       mfisyspd2  ONLINE     0     0     0
       mfisyspd3  ONLINE     0     0     0
       mfisyspd4  ONLINE     0     0     0
       mfisyspd5  ONLINE     0     0     0
       mfisyspd6  ONLINE     0     0     0
   logs
     mfisyspd7p3  ONLINE     0     0     0
     mfisyspd8p3  ONLINE     0     0     0
   cache
     mfisyspd9    ONLINE     0     0     0
     mfisyspd10   ONLINE     0     0     0

errors: Permanent errors have been detected in the following files:

   tank2:0x3
   tank2:0x8
   tank2:0x9
   tank2:0xa
   tank2:0xb
   tank2:0xf
   tank2:0x10
   tank2:0x11
   tank2:0x12
   tank2:0x13
   tank2:0x14
   tank2:0x15

So I tried a scrub, which looks like it's going to
take 5 days to complete and is reporting many, many more
errors:-

 pool: tank2
state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
   entire pool from backup.
  see: http://www.sun.com/msg/ZFS-8000-8A
 scan: scrub in progress since Wed Oct 31 16:13:53 2012
   118G scanned out of 18.7T at 42.2M/s, 128h19m to go
   49.0M repaired, 0.62% done
config:

   NAME   STATE READ WRITE CKSUM
   tank2  ONLINE   0 0  596K
 raidz2-0 ONLINE   0 0 1.20M
   mfisyspd0  ONLINE   0 0 0  (repairing)
   mfisyspd1  ONLINE   0 0 0  (repairing)
   mfisyspd2  ONLINE   0 0 0  (repairing)
   mfisyspd3  ONLINE   0 0 2  (repairing)
   mfisyspd4  ONLINE   

Panic during kernel boot, igb-init related? (8.3-RELEASE)

2012-10-31 Thread Charles Owens

Hello,

We're seeing boot-time panics in about 4% of cases when upgrading from 
FreeBSD 8.1 to 8.3-RELEASE (i386).  This problem is subtle enough that 
it escaped detection during our regular testing cycle... now with over 
100 systems upgraded we're convinced there's a real issue.  Our kernel 
config is essentially PAE (ie. static modules ... with a few drivers 
added/removed).  The hardware is Intel Server System SR1625UR.


This appears to match a finding discussed in these threads, having to do 
with timing of initialization of the igb(4)-based NICs (if I'm 
understanding it properly):


http://lists.freebsd.org/pipermail/freebsd-stable/2011-May/062596.html
http://lists.freebsd.org/pipermail/freebsd-stable/2011-June/062949.html
http://lists.freebsd.org/pipermail/freebsd-stable/2011-September/063867.html
http://lists.freebsd.org/pipermail/freebsd-stable/2011-September/063958.html


These threads include some potential patches and possibility of 
commit/MFC... but it isn't clear that there was ever final resolution 
(and MFC to 8-stable).  I've cc'd a few folks from back then.


A real challenge here is the frequency of occurrence. As mentioned, it 
only hits a fraction of our systems.  When it _does_ hit, the system 
may enter a reboot loop for days and then mysteriously break out of 
it... and thereafter seem to work fine.


I'd be very grateful for any help.  Some questions:

 * Was there ever a final blessed patch?
     o If so, will it apply to RELENG_8_3?
 * Is there anything that could be said that might help us with
   reproducing-the-problem / testing / validating-a-fix?


Panic message is --

panic: m_getzone: m_getjcl: invalid cluster type
cpuid = 0
KDB: stack backtrace:
#0 0xc059c717 at kdb_backtrace+0x47
#1 0xc056caf7 at panic+0x117
#2 0xc03c979e at igb_refresh_mbufs+0x25e
#3 0xc03c9f98 at igb_rxeof+0x638
#4 0xc03ca135 at igb_msix_que+0x105
#5 0xc0541e2b at intr_event_execute_handlers+0x13b
#6 0xc05434eb at ithread_loop+0x6b
#7 0xc053efb7 at fork_exit+0x97
#8 0xc0806744 at fork_trampoline+0x8

Thanks very much,

Charles


--
Charles Owens
Great Bay Software, Inc.



Re: 9-Stable panic: resource_list_unreserve: can't find resource

2012-10-31 Thread Andriy Gapon
on 31/10/2012 12:14 Tom Lislegaard said the following:
 Hi
 
 I'm running
 FreeBSD stingray 9.1-PRERELEASE FreeBSD 9.1-PRERELEASE #3: Mon Oct 29 
 16:11:35 CET 2012 tl@stingray:/usr/obj/usr/src/sys/stingray  amd64
 on a new Dell laptop and keep getting these panics (typically once or twice 
 per day)
 
 (kgdb) set pagination off
 (kgdb) bt
 #0  doadump (textdump=Variable textdump is not available.
 ) at pcpu.h:229
 #1  0x80425e64 in kern_reboot (howto=260) at 
 /usr/src/sys/kern/kern_shutdown.c:448
 #2  0x8042634c in panic (fmt=0x1 Address 0x1 out of bounds) at 
 /usr/src/sys/kern/kern_shutdown.c:636
 #3  0x8045773e in resource_list_unreserve (rl=Variable rl is not 
 available.
 ) at /usr/src/sys/kern/subr_bus.c:3338
 #4  0x802c3ee4 in acpi_delete_resource (bus=0xfe00052c1100, 
 child=0xfe00052c1500, type=4, rid=3323) at 
 /usr/src/sys/dev/acpica/acpi.c:1405
 #5  0x802c62bc in acpi_bus_alloc_gas (dev=0xfe00052c1500, 
 type=0xfe00052b786c, rid=0xfe00052b7978, gas=Variable gas is not 
 available.
 ) at /usr/src/sys/dev/acpica/acpi.c:1450
 #6  0x802d1663 in acpi_PkgGas (dev=0xfe00052c1500, res=Variable 
 res is not available.
 ) at /usr/src/sys/dev/acpica/acpi_package.c:120
 #7  0x802cbf6b in acpi_cpu_cx_cst (sc=0xfe00052b7800) at 
 /usr/src/sys/dev/acpica/acpi_cpu.c:782
 #8  0x802cc3a4 in acpi_cpu_notify (h=Variable h is not available.
 ) at /usr/src/sys/dev/acpica/acpi_cpu.c:1050
 #9  0x802a3fca in AcpiEvNotifyDispatch (Context=0x0) at 
 /usr/src/sys/contrib/dev/acpica/events/evmisc.c:283
 #10 0x802c26c3 in acpi_task_execute (context=0xfe00051d6800, 
 pending=Variable pending is not available.
 ) at /usr/src/sys/dev/acpica/Osd/OsdSchedule.c:134
 #11 0x804683c4 in taskqueue_run_locked (queue=0xfe00052bc100) at 
 /usr/src/sys/kern/subr_taskqueue.c:308
 #12 0x80469366 in taskqueue_thread_loop (arg=Variable arg is not 
 available.
 ) at /usr/src/sys/kern/subr_taskqueue.c:497
 #13 0x803f762f in fork_exit (callout=0x80469320 
 taskqueue_thread_loop, arg=0x80a20cc8, frame=0xff80002cdb00) at 
 /usr/src/sys/kern/kern_fork.c:992
 #14 0x806be6be in fork_trampoline () at 
 /usr/src/sys/amd64/amd64/exception.S:602

Could you please provide *sc from frame 7?

-- 
Andriy Gapon


Re: ZFS corruption due to lack of space?

2012-10-31 Thread Artem Belevich
On Wed, Oct 31, 2012 at 10:55 AM, Steven Hartland
kill...@multiplay.co.uk wrote:
 At that point with the test seemingly successful I went
 to delete test files which resulted in:-
 rm random*
 rm: random1: Unknown error: 122

ZFS is a logging filesystem. Even removing a file apparently requires
some space to write a new record saying that the file is not
referenced any more.

One way out of this jam is to try truncating some large file in place.
Make sure that file is not part of any snapshot.
Something like this may do the trick:
# dd if=/dev/null of=existing_large_file

Or, perhaps even something as simple as 'echo -n > large_file' may work.

Good luck,
--Artem


Re: ZFS corruption due to lack of space?

2012-10-31 Thread Peter Jeremy
On 2012-Oct-31 17:25:09 -, Steven Hartland ste...@multiplay.co.uk wrote:
Been running some tests on new hardware here to verify all
is good. One of the tests was to fill the zfs array which
seems like it's totally corrupted the tank.

I've accidentally filled a pool, and had multiple processes try to
write to the full pool, without either emptying the free space reserve
(so I could still delete the offending files) or corrupting the pool.

Had you tried to read/write the raw disks before you tried the
ZFS testing?  Do you have compression and/or dedupe enabled on
the pool?

1. Given the information it seems like the multiple writes filling
the disk may have caused metadata corruption?

I don't recall seeing this reported before.

2. Is there any way to stop the scrub?

Other than freeing up some space, I don't think so.  If this is a test
pool that you don't need, you could try destroying it and re-creating
it - that may be quicker and easier than recovering the existing pool.

3. Surely low space should never prevent stopping a scrub?

As Artem noted, ZFS is a copy-on-write filesystem.  It is supposed to
reserve some free space to allow metadata updates (stop scrubs, delete
files, etc) even when it is full but I have seen reports of this not
working correctly in the past.  A truncate-in-place may work.

You could also try asking on zfs-disc...@opensolaris.org 

-- 
Peter Jeremy




Re: ZFS corruption due to lack of space?

2012-10-31 Thread Ryan Stone
On Wed, Oct 31, 2012 at 4:48 PM, Artem Belevich a...@freebsd.org wrote:
 One way out of this jam is to try truncating some large file in place.
 Make sure that file is not part of any snapshot.
 Something like this may do the trick:
 # dd if=/dev/null of=existing_large_file

 Or, perhaps even something as simple as 'echo -n > large_file' may work.

truncate -s 0?
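
The in-place truncation variants suggested in this thread can be sketched as one small helper. The function name is made up for illustration, and whether any of these succeeds on a completely full pool is not guaranteed. As noted earlier in the thread, the file should not be part of any snapshot, otherwise its blocks stay referenced and no space is freed.

```sh
#!/bin/sh
# Hypothetical helper: empty a file in place without unlinking it.
# Any one of the three commands is enough; two are left as comments.
truncate_in_place() {
    file=$1
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
    : > "$file"                                # plain shell redirection
    # dd if=/dev/null of="$file" 2>/dev/null   # the dd variant
    # truncate -s 0 "$file"                    # the truncate(1) variant
}
```

For example, truncate_in_place /tank2/bigfile leaves an empty file behind, which can then be removed normally once space has been released.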


Re: ZFS corruption due to lack of space?

2012-10-31 Thread Steven Hartland

On 2012-Oct-31 17:25:09 -, Steven Hartland ste...@multiplay.co.uk wrote:

Been running some tests on new hardware here to verify all
is good. One of the tests was to fill the zfs array which
seems like it's totally corrupted the tank.


I've accidentally filled a pool, and had multiple processes try to
write to the full pool, without either emptying the free space reserve
(so I could still delete the offending files) or corrupting the pool.


Same here, but it's the first time I've had a ZIL in place at the time, so
I'm wondering if that may be a factor.


Had you tried to read/write the raw disks before you tried the
ZFS testing? 


Yes, didn't see any issues, but then it wasn't checksumming, so tbh I
wouldn't have noticed if it was silently corrupting data.


Do you have compression and/or dedupe enabled on the pool?


Nope, bog-standard raidz2, no additional settings.


1. Given the information it seems like the multiple writes filling
the disk may have caused metadata corruption?


I don't recall seeing this reported before.


Nor me, and we've been using ZFS for years, but never filled a pool
with such known simultaneous access + ZIL before.


2. Is there any way to stop the scrub?


Other than freeing up some space, I don't think so.  If this is a test
pool that you don't need, you could try destroying it and re-creating
it - that may be quicker and easier than recovering the existing pool.


Artem's trick of cat /dev/null > /tank2/bigfile worked and I've now
managed to stop the scrub :)


3. Surely low space should never prevent stopping a scrub?


As Artem noted, ZFS is a copy-on-write filesystem.  It is supposed to
reserve some free space to allow metadata updates (stop scrubs, delete
files, etc) even when it is full but I have seen reports of this not
working correctly in the past.  A truncate-in-place may work.


Yes it did, thanks. But as you said, if this metadata update was failing
for lack of space, that lends credibility to the idea that the same lack of
space, and hence failure to update metadata, could also have caused the
corruption in the first place.

It's interesting to note that the zpool is reporting plenty of free space
even when the root zfs volume was showing 0, so you would expect there
to be plenty of space for it to be able to stop the scrub, but it appears
not, which is definitely interesting and could point to the underlying
cause?

zpool list tank2
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
tank2    19T  18.7T   304G    98%  1.00x  ONLINE  -

zfs list tank2
NAME    USED  AVAIL  REFER  MOUNTPOINT
tank2  13.3T      0  13.3T  /tank2
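
One possible reading of the two listings above, offered as an assumption rather than a diagnosis: zpool list reports raw capacity including raidz2 parity, while zfs list reports space after parity, so on a 7-disk raidz2 roughly 5/7 of the raw figure holds data. ZFS of this vintage is also understood to hold back about 1/64 of usable pool space, which would leave AVAIL at 0 while FREE is still non-zero. The sketch below only redoes that arithmetic with the figures from this message; the helper names are made up.

```sh
#!/bin/sh
# Hypothetical helpers; the figures come from the zpool/zfs listings above.

# Usable share of a raw raidz figure: raw * (disks - parity) / disks
raidz_usable() {
    awk -v raw="$1" -v n="$2" -v p="$3" \
        'BEGIN { printf "%.1f\n", raw * (n - p) / n }'
}

# Assumed 1/64 reserve of usable space, in GiB, for a raw pool size in TiB
reserve_gib() {
    awk -v size="$1" -v n="$2" -v p="$3" \
        'BEGIN { printf "%.1f\n", size * (n - p) / n * 1024 / 64 }'
}

raidz_usable 18.7 7 2   # ALLOC 18.7T raw -> 13.4 (zfs list shows USED 13.3T)
raidz_usable 304 7 2    # FREE 304G raw   -> 217.1 usable
reserve_gib 19 7 2      # assumed reserve in a 19T pool -> 217.1
```

If those assumptions hold, the remaining 304G of raw free space is about the size of the reserve, which would be consistent with AVAIL showing 0.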

Current state is:-
 scan: scrub in progress since Wed Oct 31 16:13:53 2012
   1.64T scanned out of 18.7T at 62.8M/s, 79h12m to go
   280M repaired, 8.76% done

Something else that was interesting is that while the scrub was running,
devd was using a good amount of CPU, 40% of a 3.3GHz core, which I've
never seen before. Any ideas why its usage would be so high?

   Regards
   Steve




