Re: [zfs-discuss] chgrp -R hangs all writes to pool

2007-10-04 Thread Stuart Anderson
On Mon, Jul 16, 2007 at 09:36:06PM -0700, Stuart Anderson wrote:
 Running Solaris 10 Update 3 on an X4500 I have found that it is possible
 to reproducibly block all writes to a ZFS pool by running chgrp -R
 on any large filesystem in that pool.  As can be seen in the zpool
 iostat output below, after about 10 seconds of running the chgrp command all
 writes to the pool stop, and the pool starts exclusively running a slow
 background task of 1kB reads.
 
 At this point the chgrp -R command is not killable via root kill -9,
 and in fact even the command halt -d does not do anything.
 

For posterity: this appears to have been fixed in S10U4. At least, I am
unable to reproduce the problem that was easy to trigger with S10U3.

Thanks.

-- 
Stuart Anderson  [EMAIL PROTECTED]
http://www.ligo.caltech.edu/~anderson
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] chgrp -R hangs all writes to pool

2007-07-18 Thread Robert Milkowski
Hello Stuart,

  It looks like the crash dump went OK.
  After the system boots up again, check the logs for a warning that
  there is not enough space in /var/crash/x4500gc to save the crash dump.
  When using ZFS on a file server, crash dumps will usually be almost
  the size of the server's memory...

  Alternatively, just run 'savecore path_to_dir', where path_to_dir is a
  path to a directory with enough free space.
  This assumes, of course, that you haven't touched the swap device up to
  this point.
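
As a rough sanity check on the space question, the dump statistics reported elsewhere in this thread (3268790 pages, compression ratio 12.39) can be turned into byte counts. This is only a minimal Python sketch: the 4 KiB page size is an assumption (the usual base page size on x86/x64), and the helper names are made up for illustration. Whether savecore needs the compressed or the uncompressed amount may depend on the Solaris release, so both figures are computed.

```python
import shutil

PAGE_SIZE = 4096  # assumption: 4 KiB base pages on x86/x64


def dump_sizes(pages, compression_ratio, page_size=PAGE_SIZE):
    """Return (uncompressed_bytes, compressed_bytes) for a kernel dump."""
    uncompressed = pages * page_size
    compressed = uncompressed / compression_ratio
    return uncompressed, compressed


def has_room(directory, needed_bytes):
    """True if the filesystem holding `directory` has at least needed_bytes free."""
    return shutil.disk_usage(directory).free >= needed_bytes


def gib(n):
    """Convert a byte count to GiB."""
    return n / 2**30


if __name__ == "__main__":
    # Figures taken from the "pages dumped, compression ratio" console line.
    uncompressed, compressed = dump_sizes(3268790, 12.39)
    print(f"uncompressed kernel pages: {gib(uncompressed):.1f} GiB")
    print(f"compressed on dump device: {gib(compressed):.1f} GiB")
```

Under these assumptions the uncompressed figure is on the order of 12 GiB, which is consistent with the remark that a crash dump can approach the size of the server's memory. On the affected host, has_room("/var/crash/x4500gc", uncompressed) would be the check to make before rerunning savecore.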




-- 
Best regards,
 Robert Milkowski  mailto:[EMAIL PROTECTED]
   http://milek.blogspot.com

   

Re: [zfs-discuss] chgrp -R hangs all writes to pool

2007-07-17 Thread Joshua . Goodall
[EMAIL PROTECTED] wrote on 17/07/2007 02:36:06 PM:

 Running Solaris 10 Update 3 on an X4500 I have found that it is possible
 to reproducibly block all writes to a ZFS pool by running chgrp -R
 on any large filesystem in that pool.  As can be seen in the zpool
 iostat output below, after about 10 seconds of running the chgrp command all
 writes to the pool stop, and the pool starts exclusively running a slow
 background task of 1kB reads.

Related or not, I can hang all reads on a nv_65 zpool simply by dd'ing a 
5GB server image to a zvol.

- JG





Re: [zfs-discuss] chgrp -R hangs all writes to pool

2007-07-17 Thread Stuart Anderson
On Tue, Jul 17, 2007 at 03:08:44PM +1000, James C. McPherson wrote:
 Log a new case with Sun, and make sure you supply
 a crash dump so people who know ZFS can analyze
 the issue.
 
 You can use stop-A sync, break sync, or
 
 reboot -dq
 

That does appear to have caused a panic/kernel dump. However, I cannot
find the dump image after rebooting to Solaris, even though savecore
appears to be configured:

# reboot -dq
Jul 17 12:27:35 x4500gc reboot: rebooted by root

panic[cpu2]/thread=9823c460: forced crash dump initiated at user request

fe8000e18d60 genunix:kadmin+4b4 ()
fe8000e18ec0 genunix:uadmin+93 ()
fe8000e18f10 unix:sys_syscall32+101 ()

syncing file systems... 1 1 done
dumping to /dev/md/dsk/d2, offset 3436511232, content: kernel
100% done: 3268790 pages dumped, compression ratio 12.39, dump succeeded
rebooting...


# dumpadm
  Dump content: kernel pages
   Dump device: /dev/md/dsk/d2 (swap)
Savecore directory: /var/crash/x4500gc
  Savecore enabled: yes

# ls -laR /var/crash/x4500gc/
/var/crash/x4500gc/:
total 2
drwx--  2 root root 512 Jul 12 16:26 .
drwxr-xr-x  3 root root 512 Jul 12 16:26 ..
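
The check done by hand above (read the dumpadm settings, then look in the savecore directory) could be scripted along the following lines. This is a hedged Python sketch, not an official tool: it only parses the "key: value" text that dumpadm printed in this message, and it assumes saved dumps use the usual unix.N / vmcore.N file names. The function names are hypothetical.

```python
import os

# dumpadm output as posted in this message.
SAMPLE = """\
  Dump content: kernel pages
   Dump device: /dev/md/dsk/d2 (swap)
Savecore directory: /var/crash/x4500gc
  Savecore enabled: yes
"""


def parse_dumpadm(text):
    """Turn 'key: value' lines of dumpadm output into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def saved_dumps(directory):
    """List unix.N / vmcore.N files in directory; empty if none or unreadable."""
    try:
        names = os.listdir(directory)
    except OSError:
        return []
    return sorted(n for n in names if n.startswith(("unix.", "vmcore.")))


if __name__ == "__main__":
    cfg = parse_dumpadm(SAMPLE)
    crash_dir = cfg["Savecore directory"]
    print(crash_dir, "->", saved_dumps(crash_dir) or "no saved dump found")
```

An empty list here corresponds to the situation shown in the ls output: savecore is enabled, but nothing was written to the directory.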


Thanks.


-- 
Stuart Anderson  [EMAIL PROTECTED]
http://www.ligo.caltech.edu/~anderson


Re: [zfs-discuss] chgrp -R hangs all writes to pool

2007-07-17 Thread Stuart Anderson
It looks like there is a problem dumping a kernel panic on an X4500.
During the self-induced panic, there were additional syslog messages
that indicate a problem writing to the two disks that make up
/dev/md/dsk/d2 in my case.  It is as if the SATA controllers are being
reset during the crash dump.

At any rate I will send this all to Sun support.

Thanks.


Jul 17 12:27:35 x4500gc unix: [ID 836849 kern.notice]
Jul 17 12:27:35 x4500gc ^Mpanic[cpu2]/thread=9823c460:
Jul 17 12:27:35 x4500gc genunix: [ID 156897 kern.notice] forced crash dump initiated at user request
Jul 17 12:27:35 x4500gc unix: [ID 10 kern.notice]
Jul 17 12:27:35 x4500gc genunix: [ID 655072 kern.notice] fe8000e18d60 genunix:kadmin+4b4 ()
Jul 17 12:27:35 x4500gc genunix: [ID 655072 kern.notice] fe8000e18ec0 genunix:uadmin+93 ()
Jul 17 12:27:35 x4500gc genunix: [ID 655072 kern.notice] fe8000e18f10 unix:sys_syscall32+101 ()
Jul 17 12:27:35 x4500gc unix: [ID 10 kern.notice]
Jul 17 12:27:35 x4500gc genunix: [ID 672855 kern.notice] syncing file systems...
Jul 17 12:27:35 x4500gc genunix: [ID 733762 kern.notice]  1
Jul 17 12:27:37 x4500gc last message repeated 1 time
Jul 17 12:27:38 x4500gc genunix: [ID 904073 kern.notice]  done
Jul 17 12:27:39 x4500gc genunix: [ID 111219 kern.notice] dumping to /dev/md/dsk/d2, offset 3436511232, content: kernel
Jul 17 12:27:39 x4500gc marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx3: error on port 0:
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]  device disconnected
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]  device connected
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]  SError interrupt
Jul 17 12:27:39 x4500gc marvell88sx: [ID 131198 kern.info]  SErrors:
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]  Recovered communication error
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]  PHY ready change
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]  10-bit to 8-bit decode error
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]  Disparity error
Jul 17 12:27:39 x4500gc marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx3: error on port 4:
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]  device disconnected
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]  device connected
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]  SError interrupt
Jul 17 12:27:39 x4500gc marvell88sx: [ID 131198 kern.info]  SErrors:
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]  Recovered communication error
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]  PHY ready change
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]  10-bit to 8-bit decode error
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]  Disparity error
Jul 17 12:28:39 x4500gc genunix: [ID 409368 kern.notice] ^M100% done: 3268790 pages dumped, compression ratio 12.39,
Jul 17 12:28:39 x4500gc genunix: [ID 851671 kern.notice] dump succeeded
Jul 17 12:30:38 x4500gc genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_125101-10 64-bit
Jul 17 12:30:38 x4500gc genunix: [ID 943907 kern.notice] Copyright 1983-2007 Sun Microsystems, Inc.  All rights reserved.
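
To pick out which controller ports reported errors while the dump was being written, the syslog text can be filtered between the "dumping to" and "dump succeeded" messages. The Python sketch below is illustrative only: the embedded log is an abridged excerpt of the messages above, and the regular expression and function name are assumptions rather than part of any Solaris tooling.

```python
import re

# Abridged excerpt of the syslog messages quoted in this message.
LOG = """\
Jul 17 12:27:39 x4500gc genunix: [ID 111219 kern.notice] dumping to /dev/md/dsk/d2, offset 3436511232, content: kernel
Jul 17 12:27:39 x4500gc marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx3: error on port 0:
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info]  device disconnected
Jul 17 12:27:39 x4500gc marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx3: error on port 4:
Jul 17 12:28:39 x4500gc genunix: [ID 851671 kern.notice] dump succeeded
"""

# Matches the driver instance and port number in a marvell88sx warning.
PORT_ERROR = re.compile(r"WARNING: (marvell88sx\d+): error on port (\d+):")


def port_errors_during_dump(text):
    """Return (controller, port) pairs warned about while a dump was in progress."""
    in_dump = False
    hits = []
    for line in text.splitlines():
        if "dumping to" in line:
            in_dump = True
        elif "dump succeeded" in line:
            in_dump = False
        elif in_dump:
            match = PORT_ERROR.search(line)
            if match:
                hits.append((match.group(1), int(match.group(2))))
    return hits


if __name__ == "__main__":
    for controller, port in port_errors_during_dump(LOG):
        print(f"{controller} port {port} reported an error during the dump")
```

For the excerpt above this reports ports 0 and 4 of marvell88sx3, the two ports named in the warnings logged while the dump to /dev/md/dsk/d2 was in progress.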





-- 
Stuart Anderson  [EMAIL PROTECTED]
http://www.ligo.caltech.edu/~anderson

Re: [zfs-discuss] chgrp -R hangs all writes to pool

2007-07-16 Thread James C. McPherson
Stuart Anderson wrote:
 Running Solaris 10 Update 3 on an X4500 I have found that it is possible
 to reproducibly block all writes to a ZFS pool by running chgrp -R
 on any large filesystem in that pool.  As can be seen in the zpool
 iostat output below, after about 10 seconds of running the chgrp command all
 writes to the pool stop, and the pool starts exclusively running a slow
 background task of 1kB reads.
 
 At this point the chgrp -R command is not killable via root kill -9,
 and in fact even the command halt -d does not do anything.
 
 In at least one instance I have seen the chgrp command eventually
 respond to the kill command after ~30 minutes, and the pool was
 writable again. However, while waiting for this to happen the
 kernel was generating "No more processes." when simple commands
 were attempted in pre-existing shells, e.g., uname or uptime.
...

 There is nothing in the output of dmesg, svcs -xv, or fmdump associated
 with this event.
 
 Is this a known issue or should I open a new case with Sun?

Log a new case with Sun, and make sure you supply
a crash dump so people who know ZFS can analyze
the issue.

You can use stop-A sync, break sync, or

reboot -dq




cheers,
James C. McPherson
--
Solaris kernel software engineer
Sun Microsystems


Re: [zfs-discuss] chgrp -R hangs all writes to pool

2007-07-16 Thread Stuart Anderson
On Tue, Jul 17, 2007 at 02:49:08PM +1000, James C. McPherson wrote:
 Stuart Anderson wrote:
 Running Solaris 10 Update 3 on an X4500 I have found that it is possible
 to reproducibly block all writes to a ZFS pool by running chgrp -R
 on any large filesystem in that pool.  As can be seen in the zpool
 iostat output below, after about 10 seconds of running the chgrp command all
 writes to the pool stop, and the pool starts exclusively running a slow
 background task of 1kB reads.
 

...

 
 Is this a known issue or should I open a new case with Sun?
 
 Log a new case with Sun, and make sure you supply
 a crash dump so people who know ZFS can analyze
 the issue.
 
 You can use stop-A sync, break sync, or
 
 reboot -dq
 

In previous attempts, neither halt -d nor reboot (with no arguments)
was able to shut down the machine. Is reboot -dq really a bigger hammer
than halt -d?

Sorry to be pedantic, but what is the exact key sequence on a Sun
USB keyboard one should use to force a kernel dump on Solx86?
Since there is no OBP on an X4500 where do I type the sync command?

Thanks.

-- 
Stuart Anderson  [EMAIL PROTECTED]
http://www.ligo.caltech.edu/~anderson


Re: [zfs-discuss] chgrp -R hangs all writes to pool

2007-07-16 Thread James C. McPherson
Stuart Anderson wrote:
 On Tue, Jul 17, 2007 at 02:49:08PM +1000, James C. McPherson wrote:
 Stuart Anderson wrote:
 Running Solaris 10 Update 3 on an X4500 I have found that it is possible
 to reproducibly block all writes to a ZFS pool by running chgrp -R
 on any large filesystem in that pool.  As can be seen in the zpool
 iostat output below, after about 10 seconds of running the chgrp command all
 writes to the pool stop, and the pool starts exclusively running a slow
 background task of 1kB reads.
 Is this a known issue or should I open a new case with Sun?
 Log a new case with Sun, and make sure you supply
 a crash dump so people who know ZFS can analyze
 the issue.

 You can use stop-A sync, break sync, or

 reboot -dq

 
 In previous attempts, neither halt -d nor reboot (with no arguments)
 was able to shut down the machine. Is reboot -dq really a bigger hammer
 than halt -d?

Kindasorta - the q option tells reboot to do its stuff with
all guns blazing, as it were.

 Sorry to be pedantic, but what is the exact key sequence on a Sun
 USB keyboard one should use to force a kernel dump on Solx86?
 Since there is no OBP on an X4500 where do I type the sync command?

First, either boot with -k or, shortly after you get to
multiuser, run mdb -K on the console (and hit :c enter).

Then you can use F1-A to drop to kmdb, and then run

::systemdump

or

0>rip
:c
:c

or for 32-bit mode

0>eip
:c
:c


cheers,
James C. McPherson
--
Solaris kernel software engineer
Sun Microsystems


Re: [zfs-discuss] chgrp -R hangs all writes to pool

2007-07-16 Thread Rayson Ho
I found a very nice document that describes the steps to create a kernel dump:

The Solaris Operating System on x86 Platforms - Crashdump Analysis
Operating System Internals

http://opensolaris.org/os/community/documentation/files/book.pdf

- Section 7.2.2, Forcing system crashdumps

Rayson


