Re: [zfs-discuss] chgrp -R hangs all writes to pool
On Mon, Jul 16, 2007 at 09:36:06PM -0700, Stuart Anderson wrote:
> Running Solaris 10 Update 3 on an X4500 I have found that it is possible to reproducibly block all writes to a ZFS pool by running chgrp -R on any large filesystem in that pool.
>
> As can be seen in the zpool iostat output below, after about 10 seconds of running the chgrp command all writes to the pool stop, and the pool starts exclusively running a slow background task of 1kB reads.
>
> At this point the chgrp -R command is not killable via root kill -9, and in fact even the command halt -d does not do anything.

For posterity, this appears to have been fixed in S10U4; at least I am unable to reproduce the problem that was easy to trigger with S10U3.

Thanks.

--
Stuart Anderson [EMAIL PROTECTED]
http://www.ligo.caltech.edu/~anderson
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] chgrp -R hangs all writes to pool
Hello Stuart,

Looks like the crash dump went OK. After the system has booted up again, check the logs for a warning that there is not enough space in /var/crash/x4500gc to save the crash dump. When using ZFS on a file server, crash dumps will usually be almost the size of the server's memory...

Alternatively, just run 'savecore path_to_dir', where path_to_dir is a path to a directory with enough free space. That is, of course, assuming you haven't touched the swap device up to this time.

--
Best regards,
Robert Milkowski mailto:[EMAIL PROTECTED]
http://milek.blogspot.com

Tuesday, July 17, 2007, 9:04:55 PM, you wrote:

SA> It looks like there is a problem dumping a kernel panic on an X4500.
SA> During the self induced panic, there were additional syslog messages
SA> that indicate a problem writing to the two disks that make up
SA> /dev/md/dsk/d2 in my case. It is as if the SATA controllers are being
SA> reset during the crash dump.
SA> At any rate I will send this all to Sun support.
SA> Thanks.
SA> Jul 17 12:27:35 x4500gc unix: [ID 836849 kern.notice]
SA> Jul 17 12:27:35 x4500gc ^Mpanic[cpu2]/thread=9823c460:
SA> Jul 17 12:27:35 x4500gc genunix: [ID 156897 kern.notice] forced crash dump initiated at user request
SA> Jul 17 12:27:35 x4500gc unix: [ID 10 kern.notice]
SA> Jul 17 12:27:35 x4500gc genunix: [ID 655072 kern.notice] fe8000e18d60 genunix:kadmin+4b4 ()
SA> Jul 17 12:27:35 x4500gc genunix: [ID 655072 kern.notice] fe8000e18ec0 genunix:uadmin+93 ()
SA> Jul 17 12:27:35 x4500gc genunix: [ID 655072 kern.notice] fe8000e18f10 unix:sys_syscall32+101 ()
SA> Jul 17 12:27:35 x4500gc unix: [ID 10 kern.notice]
SA> Jul 17 12:27:35 x4500gc genunix: [ID 672855 kern.notice] syncing file systems...
SA> Jul 17 12:27:35 x4500gc genunix: [ID 733762 kern.notice] 1
SA> Jul 17 12:27:37 x4500gc last message repeated 1 time
SA> Jul 17 12:27:38 x4500gc genunix: [ID 904073 kern.notice] done
SA> Jul 17 12:27:39 x4500gc genunix: [ID 111219 kern.notice] dumping to /dev/md/dsk/d2, offset 3436511232, content: kernel
SA> Jul 17 12:27:39 x4500gc marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx3: error on port 0:
SA> Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] device disconnected
SA> Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] device connected
SA> Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] SError interrupt
SA> Jul 17 12:27:39 x4500gc marvell88sx: [ID 131198 kern.info] SErrors:
SA> Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] Recovered communication error
SA> Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] PHY ready change
SA> Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] 10-bit to 8-bit decode error
SA> Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] Disparity error
SA> Jul 17 12:27:39 x4500gc marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx3: error on port 4:
SA> Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] device disconnected
SA> Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] device connected
SA> Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] SError interrupt
SA> Jul 17 12:27:39 x4500gc marvell88sx: [ID 131198 kern.info] SErrors:
SA> Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] Recovered communication error
SA> Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] PHY ready change
SA> Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] 10-bit to 8-bit decode error
SA> Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] Disparity error
SA> Jul 17 12:28:39 x4500gc genunix: [ID 409368 kern.notice] ^M100% done: 3268790 pages dumped, compression ratio 12.39,
SA> Jul 17 12:28:39 x4500gc genunix: [ID 851671 kern.notice] dump succeeded
SA> Jul 17 12:30:38 x4500gc genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_125101-10 64-bit
SA> Jul 17 12:30:38 x4500gc genunix: [ID 943907 kern.notice] Copyright 1983-2007 Sun Microsystems, Inc. All rights reserved.

SA> On Tue, Jul 17, 2007 at 12:40:16PM -0700, Stuart Anderson wrote:
>> On Tue, Jul 17, 2007 at 03:08:44PM +1000, James C. McPherson wrote:
>>> Log a new case with Sun, and make sure you supply a crash dump so people who know ZFS can analyze the issue. You can use stop-A sync, break sync, or reboot -dq
>> That does appear to have caused a panic/kernel dump. However, I cannot find the dump image after rebooting to Solaris even though savecore appears to be configured,
>> # reboot -dq
>> Jul 17 12:27:35 x4500gc reboot: rebooted by root
>> panic[cpu2]/thread=9823c460: forced crash dump initiated at user request
>> fe8000e18d60 genunix:kadmin+4b4 ()
>> fe8000e18ec0 genunix:uadmin+93 ()
>> fe8000e18f10 unix:sys_syscall32+101 ()
>> syncing file
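The remark that crash dumps approach the size of memory can be checked against the figures in the quoted log. What follows is only a rough sketch: it assumes 4096-byte pages (the thread does not state the page size) and assumes savecore expands the compressed dump when it writes it out. The function names are made up for illustration, and the arithmetic needs a POSIX shell such as ksh or bash.

```shell
# Rough size estimate for a crash dump, from the "pages dumped" and
# "compression ratio" figures the kernel prints while dumping.
# ASSUMPTION: 4096-byte pages unless a page size is passed explicitly.

# dump_bytes PAGES [PAGESIZE] -> uncompressed size in bytes
dump_bytes() {
    echo $(( $1 * ${2:-4096} ))
}

# dump_bytes_on_device PAGES RATIO [PAGESIZE] -> approximate compressed size
dump_bytes_on_device() {
    awk -v b="$(dump_bytes "$1" "${3:-4096}")" -v r="$2" \
        'BEGIN { printf "%.0f\n", b / r }'
}

# to_gib BYTES -> size in GiB, one decimal place
to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'
}
```

For the 3268790 pages reported in the log, `to_gib "$(dump_bytes 3268790)"` comes to about 12.5 GiB under these assumptions, which is the amount of free space to look for in the directory handed to savecore.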
Re: [zfs-discuss] chgrp -R hangs all writes to pool
[EMAIL PROTECTED] wrote on 17/07/2007 02:36:06 PM:
> Running Solaris 10 Update 3 on an X4500 I have found that it is possible to reproducibly block all writes to a ZFS pool by running chgrp -R on any large filesystem in that pool.
>
> As can be seen in the zpool iostat output below, after about 10 seconds of running the chgrp command all writes to the pool stop, and the pool starts exclusively running a slow background task of 1kB reads.

Related or not, I can hang all reads on an nv_65 zpool simply by dd'ing a 5GB server image to a zvol.

- JG
Re: [zfs-discuss] chgrp -R hangs all writes to pool
On Tue, Jul 17, 2007 at 03:08:44PM +1000, James C. McPherson wrote:
> Log a new case with Sun, and make sure you supply a crash dump so people who know ZFS can analyze the issue. You can use stop-A sync, break sync, or reboot -dq

That does appear to have caused a panic/kernel dump. However, I cannot find the dump image after rebooting to Solaris even though savecore appears to be configured,

# reboot -dq
Jul 17 12:27:35 x4500gc reboot: rebooted by root
panic[cpu2]/thread=9823c460: forced crash dump initiated at user request
fe8000e18d60 genunix:kadmin+4b4 ()
fe8000e18ec0 genunix:uadmin+93 ()
fe8000e18f10 unix:sys_syscall32+101 ()
syncing file systems... 1 1 done
dumping to /dev/md/dsk/d2, offset 3436511232, content: kernel
100% done: 3268790 pages dumped, compression ratio 12.39, dump succeeded
rebooting...

# dumpadm
      Dump content: kernel pages
       Dump device: /dev/md/dsk/d2 (swap)
Savecore directory: /var/crash/x4500gc
  Savecore enabled: yes

# ls -laR /var/crash/x4500gc/
/var/crash/x4500gc/:
total 2
drwx-- 2 root root 512 Jul 12 16:26 .
drwxr-xr-x 3 root root 512 Jul 12 16:26 ..

Thanks.

--
Stuart Anderson [EMAIL PROTECTED]
http://www.ligo.caltech.edu/~anderson
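The ls output above shows an empty savecore directory. A small sketch of the same check in script form, assuming savecore's usual unix.N / vmcore.N file naming; the helper name list_saved_dumps is hypothetical and not part of Solaris:

```shell
# list_saved_dumps DIR
# Prints each saved dump found in DIR (unix.N / vmcore.N pairs) and
# returns 0, or prints a note and returns 1 when DIR holds none.
list_saved_dumps() {
    dir=$1
    found=1
    for core in "$dir"/vmcore.*; do
        # An unmatched glob stays literal, so skip anything that is not a file.
        [ -f "$core" ] || continue
        n=${core##*.}
        if [ -f "$dir/unix.$n" ]; then
            echo "dump $n: $dir/unix.$n $core"
        else
            echo "dump $n: $core (unix.$n missing)"
        fi
        found=0
    done
    [ "$found" -eq 0 ] || echo "no saved dumps in $dir"
    return "$found"
}
```

Run as `list_saved_dumps /var/crash/x4500gc`, it would have reported no saved dumps for the directory listing shown in this message.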
Re: [zfs-discuss] chgrp -R hangs all writes to pool
It looks like there is a problem dumping a kernel panic on an X4500. During the self induced panic, there were additional syslog messages that indicate a problem writing to the two disks that make up /dev/md/dsk/d2 in my case. It is as if the SATA controllers are being reset during the crash dump.

At any rate I will send this all to Sun support.

Thanks.

Jul 17 12:27:35 x4500gc unix: [ID 836849 kern.notice]
Jul 17 12:27:35 x4500gc ^Mpanic[cpu2]/thread=9823c460:
Jul 17 12:27:35 x4500gc genunix: [ID 156897 kern.notice] forced crash dump initiated at user request
Jul 17 12:27:35 x4500gc unix: [ID 10 kern.notice]
Jul 17 12:27:35 x4500gc genunix: [ID 655072 kern.notice] fe8000e18d60 genunix:kadmin+4b4 ()
Jul 17 12:27:35 x4500gc genunix: [ID 655072 kern.notice] fe8000e18ec0 genunix:uadmin+93 ()
Jul 17 12:27:35 x4500gc genunix: [ID 655072 kern.notice] fe8000e18f10 unix:sys_syscall32+101 ()
Jul 17 12:27:35 x4500gc unix: [ID 10 kern.notice]
Jul 17 12:27:35 x4500gc genunix: [ID 672855 kern.notice] syncing file systems...
Jul 17 12:27:35 x4500gc genunix: [ID 733762 kern.notice] 1
Jul 17 12:27:37 x4500gc last message repeated 1 time
Jul 17 12:27:38 x4500gc genunix: [ID 904073 kern.notice] done
Jul 17 12:27:39 x4500gc genunix: [ID 111219 kern.notice] dumping to /dev/md/dsk/d2, offset 3436511232, content: kernel
Jul 17 12:27:39 x4500gc marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx3: error on port 0:
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] device disconnected
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] device connected
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] SError interrupt
Jul 17 12:27:39 x4500gc marvell88sx: [ID 131198 kern.info] SErrors:
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] Recovered communication error
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] PHY ready change
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] 10-bit to 8-bit decode error
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] Disparity error
Jul 17 12:27:39 x4500gc marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx3: error on port 4:
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] device disconnected
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] device connected
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] SError interrupt
Jul 17 12:27:39 x4500gc marvell88sx: [ID 131198 kern.info] SErrors:
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] Recovered communication error
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] PHY ready change
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] 10-bit to 8-bit decode error
Jul 17 12:27:39 x4500gc marvell88sx: [ID 517869 kern.info] Disparity error
Jul 17 12:28:39 x4500gc genunix: [ID 409368 kern.notice] ^M100% done: 3268790 pages dumped, compression ratio 12.39,
Jul 17 12:28:39 x4500gc genunix: [ID 851671 kern.notice] dump succeeded
Jul 17 12:30:38 x4500gc genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_125101-10 64-bit
Jul 17 12:30:38 x4500gc genunix: [ID 943907 kern.notice] Copyright 1983-2007 Sun Microsystems, Inc. All rights reserved.

On Tue, Jul 17, 2007 at 12:40:16PM -0700, Stuart Anderson wrote:
> On Tue, Jul 17, 2007 at 03:08:44PM +1000, James C. McPherson wrote:
>> Log a new case with Sun, and make sure you supply a crash dump so people who know ZFS can analyze the issue. You can use stop-A sync, break sync, or reboot -dq
>
> That does appear to have caused a panic/kernel dump. However, I cannot find the dump image after rebooting to Solaris even though savecore appears to be configured,
>
> # reboot -dq
> Jul 17 12:27:35 x4500gc reboot: rebooted by root
> panic[cpu2]/thread=9823c460: forced crash dump initiated at user request
> fe8000e18d60 genunix:kadmin+4b4 ()
> fe8000e18ec0 genunix:uadmin+93 ()
> fe8000e18f10 unix:sys_syscall32+101 ()
> syncing file systems... 1 1 done
> dumping to /dev/md/dsk/d2, offset 3436511232, content: kernel
> 100% done: 3268790 pages dumped, compression ratio 12.39, dump succeeded
> rebooting...
>
> # dumpadm
>       Dump content: kernel pages
>        Dump device: /dev/md/dsk/d2 (swap)
> Savecore directory: /var/crash/x4500gc
>   Savecore enabled: yes
>
> # ls -laR /var/crash/x4500gc/
> /var/crash/x4500gc/:
> total 2
> drwx-- 2 root root 512 Jul 12 16:26 .
> drwxr-xr-x 3 root root 512 Jul 12 16:26 ..
>
> Thanks.
>
> --
> Stuart Anderson [EMAIL PROTECTED]
> http://www.ligo.caltech.edu/~anderson

--
Stuart Anderson [EMAIL PROTECTED]
http://www.ligo.caltech.edu/~anderson
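To see at a glance which SATA ports reported errors during the dump, the warnings in a log like the one above can be tallied per controller and port. This is a sketch only: the function name sata_port_errors is hypothetical, and it assumes the warning lines keep the "WARNING: marvell88sxN: error on port P:" shape seen in this message.

```shell
# sata_port_errors [FILE...]
# Counts "error on port N" warnings per controller in syslog-format
# input (named files, or stdin when no file is given).
sata_port_errors() {
    awk '/WARNING: marvell88sx[0-9]*: error on port [0-9]+/ {
        for (i = 1; i <= NF; i++) {
            # The controller name follows "WARNING:", the port follows "port".
            if ($i == "WARNING:") { ctrl = $(i + 1); sub(/:$/, "", ctrl) }
            if ($i == "port")     { port = $(i + 1); sub(/:$/, "", port) }
        }
        count[ctrl " port " port]++
    }
    END { for (k in count) print k ": " count[k] }' "$@" | sort
}
```

It could be pointed at the system log, for example `sata_port_errors /var/adm/messages`, where the path is an assumption about where these syslog entries are stored on the machine in question.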
Re: [zfs-discuss] chgrp -R hangs all writes to pool
Stuart Anderson wrote:
> Running Solaris 10 Update 3 on an X4500 I have found that it is possible to reproducibly block all writes to a ZFS pool by running chgrp -R on any large filesystem in that pool.
>
> As can be seen in the zpool iostat output below, after about 10 seconds of running the chgrp command all writes to the pool stop, and the pool starts exclusively running a slow background task of 1kB reads.
>
> At this point the chgrp -R command is not killable via root kill -9, and in fact even the command halt -d does not do anything.
>
> In at least one instance I have seen the chgrp command eventually respond to the kill command after ~30 minutes, and the pool was writable again. However, while waiting for this to happen the kernel was generating "No more processes." when simple commands, e.g., uname or uptime, were run in pre-existing shells.
> ...
> There is nothing in the output of dmesg, svcs -xv, or fmdump associated with this event.
>
> Is this a known issue or should I open a new case with Sun?

Log a new case with Sun, and make sure you supply a crash dump so people who know ZFS can analyze the issue.

You can use stop-A sync, break sync, or reboot -dq

cheers,
James C. McPherson
--
Solaris kernel software engineer
Sun Microsystems
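The symptom described in this thread (writes dropping to zero while chgrp -R runs) can be watched for while reproducing the problem. A minimal sketch, assuming the usual seven-column zpool iostat layout (name, used, avail, read ops, write ops, read bandwidth, write bandwidth); the function name writes_stalled and the pool name tank are hypothetical:

```shell
# writes_stalled POOL N
# Reads "zpool iostat POOL INTERVAL" output on stdin. Returns 0 as soon
# as N consecutive samples for POOL show zero write operations, and 1
# if the input ends before that happens.
writes_stalled() {
    awk -v pool="$1" -v need="$2" '
        $1 == pool {
            # Field 5 is the write-operations column in the assumed layout.
            if ($5 == "0") zero++; else zero = 0
            if (zero >= need) { hit = 1; exit }
        }
        END { exit (hit ? 0 : 1) }'
}
```

A possible use is `zpool iostat tank 1 | writes_stalled tank 10 && echo "writes stalled"`, run in a second terminal before starting the chgrp -R.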
Re: [zfs-discuss] chgrp -R hangs all writes to pool
On Tue, Jul 17, 2007 at 02:49:08PM +1000, James C. McPherson wrote:
> Stuart Anderson wrote:
>> Running Solaris 10 Update 3 on an X4500 I have found that it is possible to reproducibly block all writes to a ZFS pool by running chgrp -R on any large filesystem in that pool.
>>
>> As can be seen in the zpool iostat output below, after about 10 seconds of running the chgrp command all writes to the pool stop, and the pool starts exclusively running a slow background task of 1kB reads.
>> ...
>> Is this a known issue or should I open a new case with Sun?
>
> Log a new case with Sun, and make sure you supply a crash dump so people who know ZFS can analyze the issue.
>
> You can use stop-A sync, break sync, or reboot -dq

In previous attempts, neither halt -d nor reboot (with no arguments) were able to shut down the machine. Is reboot -dq really a bigger hammer than halt -d?

Sorry to be pedantic, but what is the exact key sequence on a Sun USB keyboard one should use to force a kernel dump on Solx86? Since there is no OBP on an X4500, where do I type the sync command?

Thanks.

--
Stuart Anderson [EMAIL PROTECTED]
http://www.ligo.caltech.edu/~anderson
Re: [zfs-discuss] chgrp -R hangs all writes to pool
Stuart Anderson wrote:
> On Tue, Jul 17, 2007 at 02:49:08PM +1000, James C. McPherson wrote:
>> Stuart Anderson wrote:
>>> Running Solaris 10 Update 3 on an X4500 I have found that it is possible to reproducibly block all writes to a ZFS pool by running chgrp -R on any large filesystem in that pool.
>>>
>>> As can be seen in the zpool iostat output below, after about 10 seconds of running the chgrp command all writes to the pool stop, and the pool starts exclusively running a slow background task of 1kB reads.
>>> Is this a known issue or should I open a new case with Sun?
>>
>> Log a new case with Sun, and make sure you supply a crash dump so people who know ZFS can analyze the issue.
>>
>> You can use stop-A sync, break sync, or reboot -dq
>
> In previous attempts, neither halt -d nor reboot (with no arguments) were able to shut down the machine. Is reboot -dq really a bigger hammer than halt -d?

Kindasorta - the q option tells reboot to do its stuff with all guns blazing, as it were.

> Sorry to be pedantic, but what is the exact key sequence on a Sun USB keyboard one should use to force a kernel dump on Solx86? Since there is no OBP on an X4500, where do I type the sync command?

First, either boot with -k or, shortly after you get to multiuser, run

mdb -K

on the console (and hit :c enter). Then you can use F1-A to drop to kmdb, and then run

::systemdump

or

0>rip
:c
:c

or for 32bit mode

0>eip
:c
:c

cheers,
James C. McPherson
--
Solaris kernel software engineer
Sun Microsystems
Re: [zfs-discuss] chgrp -R hangs all writes to pool
I found a very nice doc that describes the steps to create a kernel dump:

The Solaris Operating System on x86 Platforms - Crashdump Analysis Operating System Internals
http://opensolaris.org/os/community/documentation/files/book.pdf
- 7.2.2. Forcing system crashdumps

Rayson

On 7/17/07, James C. McPherson [EMAIL PROTECTED] wrote:
> First, either boot with -k or, shortly after you get to multiuser, run
>
> mdb -K
>
> on the console (and hit :c enter). Then you can use F1-A to drop to kmdb, and then run
>
> ::systemdump
>
> or
>
> 0>rip
> :c
> :c
>
> or for 32bit mode
>
> 0>eip
> :c
> :c
>
> cheers,
> James C. McPherson
> --
> Solaris kernel software engineer
> Sun Microsystems