Thank you, Ashu. Do you know how we can get minimal support for OpenSolaris? From the website I seem to be going in circles trying to find this information. Any pointers in the right direction are highly appreciated.
On Mon, Mar 22, 2010 at 12:37 PM, Ashutosh Tripathi <Ashutosh.Tripathi at sun.com> wrote:

> Hi Satya,
>
> Hmmm... it indeed sounds like the people from ZFS and SC should be providing you with a little more justification/information about your situation than they have.
>
> If I may, I think that apart from looking at this from the storage perspective, on the Solaris/ZFS/SC side you might want to open an escalation with Sun using your support contract, assuming you have one. Otherwise you are liable to just be bounced around, given that this is a rather deep issue to debug.
>
> Sun (now Oracle) support engineers are trained for situations like these and know how to collect detailed data from the system to help with deeper analysis. Just looking at kernel thread stacks takes you only so far. Brute-force kernel coredump analysis takes you a little further, at the cost of a LOT of effort. Doing targeted debugging (with DTrace scripts for one example, debug kernel modules for another), with several iterations of back and forth, is what it takes to nail down deeper issues like this.
>
> Hope that didn't sound too much like a pushback. It was intended as good-faith feedback on how you are trying to go about this problem.
>
> Regards,
>
> -ashu
>
> opensolaris_user hello wrote:
>
>> Hi Ashu,
>>
>> Thank you very much for the response.
>>
>> In http://defect.opensolaris.org/bz/show_bug.cgi?id=15058 I indicated that the "zfs list -t all" thread was the oldest idle thread and it seems to be stuck for whatever reason. All the threads that are in biowait() appear much later than that thread. The arc_read_no_lock() thread could be the cause of the other I/O waits. While I agree that biowait() on the scdpmd could indicate something got stuck at the SCSI layer, which may or may not be because of the zfs list thread, since it was the oldest idle thread I was interested in knowing why it is stuck there forever.
>>
>> The cluster team had indicated it is an invalid configuration, but did not give any further details as to why that is the case or how we can modify the configuration to prevent this. If you think the ZFS team needs to take a look, please assign it to ZFS and I can follow up on zfs-discuss.
>>
>> Once again, I appreciate your response. We will try to follow up from the storage perspective as well.
>>
>> Regards,
>> Satya
>>
>> On Fri, Mar 19, 2010 at 4:57 PM, Ashutosh Tripathi <Ashutosh.Tripathi at sun.com> wrote:
>>
>> Hi Satya,
>>
>> While I don't know why the ZFS I/O is hung in biowait(), from past experience I can tell you that biowait() issues tend to be very hard to debug. In many cases these actually turn out to be issues related to storage, i.e. the storage or storage driver simply takes too long with (or loses track of) a given I/O. At the upper layer (Solaris/filesystem) there is nothing the system can do except wait for the I/O to complete.
>>
>> Note that this is different from a SCSI timeout, i.e. a SCSI packet sent by the server to the storage gets lost, so the host never gets an ACK back; in that case the SCSI command is retried. Here I am talking about a case where the SCSI command has been ACKed properly by the storage, but the storage just never comes back with the completed I/O.
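As an illustration of the kind of targeted probing described above, here is a minimal sketch, assuming DTrace's fbt provider can attach to biowait() on the affected kernel build and that the standard kernel mdb dcmds are available; the probe and module filters are illustrative choices, not commands taken from this thread:

    # Aggregate the kernel stacks of threads entering biowait(), to see
    # which callers are piling up behind un-completed I/Os:
    dtrace -n 'fbt::biowait:entry { @waiters[execname, stack()] = count(); }'

    # Group kernel thread stacks by module to spot the oldest idle thread
    # sitting under zfs (e.g. the "zfs list -t all" worker), or dump them all:
    echo "::stacks -m zfs" | mdb -k
    echo "::threadlist -v" | mdb -k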
>>
>> While your mention of the "zfs list -t all" command sounds a bit suspicious, when I actually look at the thread stack you posted in the CR, it lists a bunch of Java threads and scdpmd threads stuck behind a biowait(). So, at least in that case, the hang could be independent of the zfs list command (it is always possible that the zfs list is triggering a particular pattern of I/O which leads to this...).
>>
>> Anyhow, where does that leave you... Have you tried approaching your storage vendor with this problem? The leading question to them would be: why isn't the storage completing this particular I/O request from the host?
>>
>> HTH,
>> -ashu
>>
>> opensolaris_user hello wrote:
>>
>> Hi Cluster Team,
>>
>> We are currently running into a ZFS hang regularly, and after this happens the node can end up with a corrupted pool, causing complete data loss. On top of that, since reboot doesn't work, we end up with a corrupted boot partition that causes a panic at boot. We are using clustering in a single-node configuration and aim to expand to a two-node HA configuration. Looking at the stack traces and our application logs, it is clear that a "zfs list -t all" command causes the pool to get stuck. The system works without any issues except when we run into this hang.
>>
>> I tried to analyze the root cause and I see that the zfs list thread was stuck in I/O wait. It seems that this is a ZFS hang and not related to clustering. We even tried disabling cluster disk path monitoring and still run into this issue.
>>
>> If we can get some insight into why this is an invalid cluster configuration and how it can lead to the ZFS hang, we would appreciate that. I have filed 2 bugs, but the following bug explains the situation much better:
>>
>> http://defect.opensolaris.org/bz/show_bug.cgi?id=15058
>>
>> Any insight/help with this issue is highly appreciated.
>>
>> Thanks,
>> Satya
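To back up the question to the storage vendor, a small sketch of checks one might run on the affected node while the hang is in progress, assuming a stock OpenSolaris install; these are generic commands, not something quoted from the thread:

    # Any pool that ZFS itself considers unhealthy:
    zpool status -x

    # Per-device queues: a non-zero "actv" column with no throughput for
    # minutes suggests I/Os are outstanding at or below the driver layer:
    iostat -xnz 5

    # Recent FMA error telemetry from disks/HBAs, if any was generated:
    fmdump -e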