Well, it does pretty much make the system (or at least anything doing I/O to that volume) unresponsive, but it typically recovers after 10-15 seconds. I guess that is considered a "temporary slowdown" rather than a hang?
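For reference, here is roughly what I plan to capture the next time it stalls, to get kernel stacks for the blocked processes plus the I/O stats during the event. This is just a sketch -- the output paths are arbitrary, and I'm using the sysrq 't' trigger (dump all tasks) since I'm not sure my kernel has the blocked-tasks-only 'w' trigger:

    # Enable the magic sysrq interface if it isn't already
    echo 1 > /proc/sys/kernel/sysrq

    # While the stall is happening, dump task states and kernel
    # stacks (including the D-state cronolog processes) to the
    # kernel ring buffer, then save it off
    echo t > /proc/sysrq-trigger
    dmesg > /tmp/task-stacks.txt

    # Extended per-device I/O stats, 1-second samples for 30 seconds
    iostat -x 1 30 > /tmp/iostat-during-stall.txt

If those stacks show something other than ocfs2_wait_for_status_completion, that would at least tell us whether this is the same code path every time.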
Yes, the log files are being written to the OCFS2 volume, and in fact both nodes write to them in parallel. I did a lot of testing on this before going into production and never saw any problems or slowdowns, even on much less powerful systems. And, as I mentioned, there were no problems on these systems for over a month in production (under the same load all along).

I do wonder if the 1.4 release would be any better for my situation, and of course I would like to put it on my test environment first. However, there is a complication: I am making use of the CDSL feature that was removed after 1.2.x, so I will have to figure out some way to accomplish the desired configuration without CDSLs before I can upgrade (one idea I'm considering is sketched at the end of this message).

The problem is continuing, and it's getting really annoying, as it's tripping up our monitoring system like crazy. Is there anything else I can try to get more details about what is going on, to help find a solution? Are there any parameters that could be tweaked to account for the steady stream of small writes from all nodes to this volume?

At 04:28 PM 1/23/2009, Sunil Mushran wrote:
>The two issues are different. For starters, the issue in bugzilla #882
>was a hang, not a temporary slowdown. And the stack showed it was related
>to flock(), which ocfs2 1.2 did not support. Well, not cluster-aware flock().
>Support for clustered flock() was added in kernel 2.6.25-ish / ocfs2 1.4.
>
>Are the apache logs also hitting the iSCSI storage? If so, one explanation
>could be that the log flush is saturating the network. That would cause
>the ioutil to jump higher, affecting the httpd I/Os.
>
>
>David Johle wrote:
>>[snip]
>>The cluster has been in a production environment for about 1.5
>>months now, and just in the past week it has started to have
>>problems. The user experience is an occasional lag of 5 to 15
>>seconds, after which everything appears normal. Digging deeper
>>into the problem, I narrowed it down to an I/O issue, and iostat
>>shows near 100% utilization on said device during the lag. Once it
>>clears, the utilization is back down to a consistent 0-5%
>>average. Also, when the lag is happening, a process listing shows
>>the affected processes in the D state.
>>[snip]
>> PID STAT COMMAND WIDE-WCHAN-COLUMN
>> 8511 D    cronolog ocfs2_wait_for_status_completion
>> 8510 D    cronolog ocfs2_wait_for_status_completion
>>[snip]
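P.S. Regarding the CDSL replacement mentioned above, the rough idea I'm considering is to emulate the per-node resolution with ordinary symlinks keyed on hostname, along these lines (a sketch only; all paths and node names here are made up):

    # On the shared OCFS2 volume, one directory per node:
    mkdir -p /ocfs2/pernode/web1 /ocfs2/pernode/web2

    # On each node at boot (e.g. from rc.local), point a regular
    # symlink on local disk at that node's directory on the volume:
    ln -sfn /ocfs2/pernode/$(hostname -s) /etc/myapp/local

Unlike a CDSL this resolves at boot rather than at lookup time, but for a fixed set of nodes it should amount to the same thing.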