I have been running a 2-node cluster of web servers handling around 800,000 to 1,200,000 requests on a typical weekday. The systems are each dual quad-core Xeons with 4 GB of RAM, and generally maintain a fairly low load average. The servers share a device over an iSCSI SAN for the OCFS2 volume. They run RHEL5 U2 (kernel 2.6.18-92.1.10.el5) with the OCFS2 1.2.9-1 binaries built for that kernel version.
The cluster has been in production for about a month and a half, and just in the past week it has started to have problems. The user-visible symptom is an occasional lag of 5 to 15 seconds, after which everything appears normal. Digging deeper, I narrowed it down to an I/O issue: iostat shows near-100% utilization on the shared device during the lag, and once it clears, utilization drops back to a consistent 0-5% average. Also, while the lag is happening, a process listing shows the affected processes in the D state.

After some reading of the mailing list archives and much Google searching, I found this old thread, which seems quite similar:

  http://oss.oracle.com/pipermail/ocfs2-users/2007-May/001643.html

and the corresponding bug from the same author:

  http://oss.oracle.com/bugzilla/show_bug.cgi?id=882

It has been a long time since this bug was last updated, but there seems to have been no [documented] resolution.

I wrote a small script which hits a very simple web page at 10-second intervals. If the server responds in less than 1 second, it outputs a dot; if the page takes more than 1 second to load, it outputs a line indicating how long it took, which node processed it, and a timestamp. (A rough version of the script is appended at the end of this message.) Here is some sample output over the course of about an hour:

  ................
  Real time of 1 for wolf at Thu Jan 22 16:35:32 CST 2009
  ............................................................
  Real time of 6 for cow at Thu Jan 22 16:45:52 CST 2009
  .....
  Real time of 7 for cow at Thu Jan 22 16:46:59 CST 2009
  .................................................................................
  Real time of 12 for wolf at Thu Jan 22 17:00:54 CST 2009
  ...................................
  Real time of 3 for wolf at Thu Jan 22 17:06:58 CST 2009
  ....................
  Real time of 13 for cow at Thu Jan 22 17:10:42 CST 2009
  ...
  Real time of 11 for cow at Thu Jan 22 17:11:33 CST 2009
  .......................
  Real time of 9 for cow at Thu Jan 22 17:15:43 CST 2009
  .......................................
  Real time of 2 for wolf at Thu Jan 22 17:22:27 CST 2009
  ..
  Real time of 8 for wolf at Thu Jan 22 17:23:05 CST 2009
  .............................
  Real time of 10 for wolf at Thu Jan 22 17:28:17 CST 2009
  ....................................
  Real time of 14 for wolf at Thu Jan 22 17:34:42 CST 2009

There are no real patterns here, but the output is interesting nonetheless: it shows that the problem is intermittent, but quite frequent.

I tried using

  ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN

to see which processes were hanging on each system during the lags. Like Stephan's output, I see a few processes in the D (uninterruptible sleep) state, which makes sense if they are blocked on I/O. And also like Stephan, one node shows "-" for the wait channel, while the other shows "ocfs2_wait_for_status_completion":

node cow:

    PID STAT COMMAND          WIDE-WCHAN-COLUMN
   8511 D    cronolog         ocfs2_wait_for_status_completion
   8510 D    cronolog         ocfs2_wait_for_status_completion

node wolf:

    PID STAT COMMAND          WIDE-WCHAN-COLUMN
   9439 D    cronolog         -

In my case, however, it was not the httpd process that was listed, but rather cronolog (which I use with Apache for automatic daily log rotation); I assume that in his case httpd was writing directly to the log files at the time of the hangs. Also, I typically see only a single process in the D state on each system, and only sometimes more.
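For anyone wanting to catch these as they happen rather than racing a lag with a manual ps, a quick polling loop along these lines should work (a rough sketch only; the one-second interval is arbitrary):

  #!/bin/bash
  # Rough sketch: poll for processes in uninterruptible sleep (D state)
  # and print them with a timestamp whenever any show up.
  while true; do
      out=$(ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN | awk '$2 ~ /^D/')
      if [ -n "$out" ]; then
          echo "=== $(date) ==="
          echo "$out"
      fi
      sleep 1
  done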
Another thing I tried was shutting down httpd on one of the nodes to see if the lags went away. The result was unchanged: still similar lags at similar intervals. This time the ps output on the non-participating node was clean, but the other node had multiple processes in the D state, and not all of them web-server related:

node wolf:

    PID STAT COMMAND          WIDE-WCHAN-COLUMN
   9436 D    cronolog         -
   9438 D    cronolog         -
   9439 D    cronolog         -
   9448 D    cronolog         -
   9457 D    cronolog         -
  31201 D    httpd            -
  31202 D    httpd            -
  31710 D    vsftpd           sync_buffer

So apparently this problem can occur between processes on the same node, not just between nodes. Watching top on each box during the lags shows CPU time going only to iowait, with no significant user or system CPU.
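For reference, the probe script mentioned above is roughly equivalent to the following. The URL is a placeholder here; the real page simply returns the name of the node that served the request:

  #!/bin/bash
  # Probe a simple page every 10 seconds and flag slow responses.
  # URL is a placeholder; the real page returns the serving node's name.
  URL="http://example.com/node-check.html"

  while true; do
      start=$(date +%s)
      node=$(curl -s --max-time 60 "$URL")
      elapsed=$(( $(date +%s) - start ))
      if [ "$elapsed" -lt 1 ]; then
          echo -n "."
      else
          echo ""
          echo "Real time of $elapsed for $node at $(date)"
      fi
      sleep 10
  done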