Hi Cluster Team,

We are regularly running into a ZFS hang, and after it happens the node can end up with a corrupted pool, causing complete data loss. In addition, since reboot doesn't work, we end up with a corrupted boot partition, which causes the boot to panic. We are using clustering with a single-node configuration and aim to expand to a 2-node HA configuration. Looking at the stack traces and our application logs, it is clear that a "zfs list -t all" command causes the pool to get stuck. The system works without any issues except when we run into this hang.
I tried to analyze the root cause, and I see that the zfs list thread was stuck in I/O wait. It seems that this is a ZFS hang and not related to clustering. We even tried disabling cluster disk path monitoring and still ran into this issue. If we can get some insight as to why this is an invalid cluster configuration and how this can lead to the ZFS hang, we would appreciate it. I have filed 2 bugs, but the following bug explains the situation much better:

http://defect.opensolaris.org/bz/show_bug.cgi?id=15058

Any insight or help with this issue is highly appreciated.

Thanks,
Satya

---------- Forwarded message ----------
From: <bugzi...@defect.opensolaris.org>
Date: Thu, Mar 11, 2010 at 10:19 AM
Subject: [Bug 13774] zfs hangs on a 10 disk raidz pool
To: opensolarisuser2009 at gmail.com

http://defect.opensolaris.org/bz/show_bug.cgi?id=13774

manthavish <vishwanath.mantha at sun.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vishwanath.mantha at sun.com

--
Configure bugmail: http://defect.opensolaris.org/bz/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug.