Well, something must have changed. Is the logfile growing in size? Is the volume close to full? Are there a lot of files in a directory?
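A few quick checks along those lines (the /ocfs2vol paths below are placeholders for wherever the volume is mounted):

    # is the log still growing? sample its size twice, a few seconds apart
    ls -l /ocfs2vol/logs/access_log; sleep 5; ls -l /ocfs2vol/logs/access_log

    # is the volume close to full?
    df -h /ocfs2vol

    # how many entries are in the log directory? (-f skips sorting,
    # which matters if the directory is huge)
    ls -f /ocfs2vol/logs | wc -l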
If the ioutil is high, the io subsystem is being saturated. That's a
given. The question is: who is the culprit? If the webserver load has not
changed, then the fs itself must be contributing to the load. strace -T
one of the processes. That should narrow down the scope of the problem.

Sunil

David Johle wrote:
> Well, it does pretty much make the system (or at least anything doing
> I/O to the volume) unresponsive, but it does recover after 10-15
> seconds typically. I guess that is considered a "temporary slowdown"
> rather than a hang?
>
> Yes, the log files are being written to the OCFS2 volume, and are
> actually being written to by both nodes in parallel. I did much
> testing on this before going into production and never saw any
> problems or slowdowns, even on much less powerful systems. And, as I
> mentioned, there were no problems on these systems for over a month in
> production (same load all along).
>
> I do wonder if the 1.4 release would be any better for my situation,
> and would like to put it on my test environment first, of course.
> However, I do have an issue in that I am making use of the CDSL
> feature that was removed after 1.2.x, and thus I will have to figure
> out some way to accomplish the desired configuration w/o it before I
> can upgrade.
>
> The problem is continuing, and getting really annoying as it's
> tripping up our monitoring system like crazy. Is there anything else
> I can try to get more details about what is going on and help find a
> solution? Any parameters that could be tweaked to account for the
> fact that there is a steady stream of small writes from all nodes to
> this volume?
>
>
> At 04:28 PM 1/23/2009, Sunil Mushran wrote:
>> The two issues are different. For starters, the issue in bugzilla
>> #882 was a hang, not a temporary slowdown. And the stack showed it
>> was related to flock(), which ocfs2 1.2 did not support. Well, not
>> cluster-aware flock(). Support for clustered flock() was added in
>> kernel 2.6.25-ish / ocfs2 1.4.
>>
>> Are the apache logs also hitting the iscsi storage? If so, one
>> explanation could be that the log flush is saturating the network.
>> That would cause the ioutil to jump higher, affecting the httpd ios.
>>
>>
>> David Johle wrote:
>>> [snip]
>>> The cluster has been in a production environment for about 1.5
>>> months now, and just in the past week it has started to have
>>> problems. The user experience is an occasional lag of 5 to 15
>>> seconds, after which everything appears normal. Digging deeper into
>>> the problem, I narrowed it down to an I/O issue, and iostat shows
>>> near 100% utilization on said device during the lag. Once it
>>> clears, the utilization is back down to a consistent 0-5% average.
>>> Also, when the lag is happening, a process listing shows the
>>> affected processes in the D state.
>>> [snip]
>>>  PID STAT COMMAND  WIDE-WCHAN-COLUMN
>>> 8511 D    cronolog ocfs2_wait_for_status_completion
>>> 8510 D    cronolog ocfs2_wait_for_status_completion
>>> [snip]
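As a concrete sketch of the tracing suggested above (the PID is taken from the quoted process listing; the device name sdb is only an example, substitute the iscsi device iostat flagged):

    # attach to one of the D-state processes and time every syscall;
    # the calls with large times show where the 10-15 second stall is spent
    strace -T -p 8511 -o /tmp/cronolog.trace

    # extended per-device stats every 5 seconds; watch %util spike
    # on the iscsi device during a lag
    iostat -x sdb 5

    # one way to produce a wide-wchan listing like the one quoted above,
    # filtered to D-state processes
    ps -eo pid,stat,comm,wchan=WIDE-WCHAN-COLUMN | grep ' D '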