Can you please add this info to a bugzilla to make it easier to track? I
delete emails frequently and thus lose context.

In addition, attach the output of the following script:
http://oss.oracle.com/~smushran/debug/scripts/stat_sysdir.sh

Also, can you isolate the issue to a file? That is, even if you are seeing
performance down across the board, find one example, dump its layout, and
attach it to the bz:

debugfs.ocfs2 -R "stat /path/to/file" /dev/sdX

Note that the path should be given minus the mount path.
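For example, a rough sketch (untested; /home and the scratch file name are
assumptions, so substitute your own mount point, file, and device) that
writes a scratch file and then dumps its layout:

    # write a 100MB scratch file on the ocfs2 mount and note the rate
    dd if=/dev/zero of=/home/layout-probe bs=1M count=100
    # dump its layout; the path is relative to the mount point, not absolute
    debugfs.ocfs2 -R "stat /layout-probe" /dev/sdX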
Sunil

On 12/13/2010 01:13 PM, Daniel McDonald wrote:
> Sunil,
>
> I'm back in close proximity to my machines. We are still experiencing
> extremely odd behavior. I've reduced the cluster down to a single node
> and recently rebooted that machine as well. I'm currently observing
> highly variable write speeds to the volume. This volume is the only
> volume on the disk array, and this machine is the only machine
> accessing the disk array right now. It is also the only powered-on
> machine connected to the fibre SAN. I'm seeing write speeds from
> 33 kB/s to 50 MB/s. When the rate was really low, it drove the system
> to become unresponsive. I've previously been able to obtain consistent
> write speeds > 150 MB/s.
>
> The initial email said that we were using 1.4.4; however, that was the
> tools release. We are using ocfs2 1.4.7. I've also attached the stats
> output for the device. I'm not sure how to proceed from here.
>
> $ yum info ocfs2-2.6.18-164.el5.x86_64
> Loaded plugins: security
> Installed Packages
> Name       : ocfs2-2.6.18-164.el5
> Arch       : x86_64
> Version    : 1.4.7
> Release    : 1.el5
> Size       : 1.3 M
> Repo       : installed
> Summary    : The Oracle Cluster Filesystem 2
> URL        : http://oss.oracle.com/projects/ocfs2/
> License    : GPL
> Description: OCFS2 is the Oracle Cluster Filesystem 2. This package is compiled for
>            : the default kernel.
>
> $ debugfs.ocfs2 -R "stats" /dev/disk/by-id/scsi-3600a0b80005bc2ab000016a04bbb0f88-part1
> Revision: 0.90
> Mount Count: 0   Max Mount Count: 20
> State: 0   Errors: 0
> Check Interval: 0   Last Check: Wed Apr 7 15:42:39 2010
> Creator OS: 0
> Feature Compat: 3 backup-super strict-journal-super
> Feature Incompat: 80 sparse inline-data
> Tunefs Incomplete: 0
> Feature RO compat: 1 unwritten
> Root Blknum: 5   System Dir Blknum: 6
> First Cluster Group Blknum: 3
> Block Size Bits: 12   Cluster Size Bits: 12
> Max Node Slots: 20
> Extended Attributes Inline Size: 0
> Label: home
> UUID: 84B7C6421A6C4280AB87F569035C5368
> Hash: 0 (0x0)
> Cluster stack: classic o2cb
> Inode: 2   Mode: 00   Generation: 1165331413 (0x45758bd5)
> FS Generation: 1165331413 (0x45758bd5)
> CRC32: 00000000   ECC: 0000
> Type: Unknown   Attr: 0x0   Flags: Valid System Superblock
> Dynamic Features: (0x0)
> User: 0 (root)   Group: 0 (root)   Size: 0
> Links: 0   Clusters: 1464357111
> ctime: 0x4bbcfc4f -- Wed Apr 7 15:42:39 2010
> atime: 0x0 -- Wed Dec 31 17:00:00 1969
> mtime: 0x4bbcfc4f -- Wed Apr 7 15:42:39 2010
> dtime: 0x0 -- Wed Dec 31 17:00:00 1969
> ctime_nsec: 0x00000000 -- 0
> atime_nsec: 0x00000000 -- 0
> mtime_nsec: 0x00000000 -- 0
> Last Extblk: 0   Orphan Slot: 0
> Sub Alloc Slot: Global   Sub Alloc Bit: 65535
>
> Thanks,
> -Daniel
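One quick way to put numbers on the variability described above (a rough
sketch, not from the thread; assumes bash, GNU dd, and the /home/testdump
scratch path used later in the thread) is to repeat the write and log the
rate each pass:

    # run the 1GB write test ten times; dd prints the rate on its last line
    for i in $(seq 1 10); do
        dd if=/dev/zero of=/home/testdump bs=1024k count=1000 2>&1 | tail -n1
        rm -f /home/testdump
    done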
>
> On Fri, Dec 10, 2010 at 4:56 AM, Daniel McDonald <was...@gmail.com> wrote:
>> Thank you. I'll investigate as soon as I can get decent access.
>> Rebooting nodes seemed to have temporarily solved the issue; however,
>> things are not completely back to normal yet.
>> -Daniel
>>
>> On Thu, Dec 9, 2010 at 8:49 AM, Sunil Mushran <sunil.mush...@oracle.com> wrote:
>>> http://oss.oracle.com/git/?p=ocfs2-1.4.git;a=commitdiff;h=1f667766cb67ed05b4d706aa82e8ad0b12eaae8b
>>>
>>> That specific error has been addressed in the upcoming 1.4.8.
>>>
>>> Attach the logs and all other info to a bugzilla.
>>>
>>> On 12/08/2010 05:07 PM, Daniel McDonald wrote:
>>>> Hello,
>>>>
>>>> I'm writing from the other side of the world from where my systems
>>>> are, so details are coming in slowly. We have a 6TB OCFS2 volume
>>>> across 20 or so nodes, all running OEL5.4 with ocfs2-1.4.4. The
>>>> system has worked fairly well for the last 6-8 months. Something has
>>>> happened over the last few weeks that has brought write performance
>>>> nearly to a halt. I'm not sure how to proceed, and very poor
>>>> internet is hindering my abilities further. I've verified that the
>>>> disk array is in good health. I'm seeing a few awkward kernel log
>>>> messages; an example follows. I have not been able to verify all
>>>> nodes due to limited time and slow internet in my present location.
>>>> Any assistance would be greatly appreciated. I should be able to
>>>> provide log files in about 12 hours. At this moment, loadavgs on
>>>> each node are 0.00 to 0.09.
>>>>
>>>> Here is a test write and associated iostat -xm 5 output. Previously
>>>> I was obtaining > 90 MB/s:
>>>>
>>>> $ dd if=/dev/zero of=/home/testdump count=1000 bs=1024k
>>>>
>>>> ...and associated iostat output:
>>>>
>>>> avg-cpu:  %user %nice %system %iowait %steal  %idle
>>>>            0.10  0.00    0.43   12.25   0.00  87.22
>>>>
>>>> Device: rrqm/s wrqm/s    r/s  w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm  %util
>>>> sda       0.00   1.80   0.00 8.40  0.00  0.04     9.71     0.01  0.64  0.05   0.04
>>>> sda1      0.00   0.00   0.00 0.00  0.00  0.00     0.00     0.00  0.00  0.00   0.00
>>>> sda2      0.00   0.00   0.00 0.00  0.00  0.00     0.00     0.00  0.00  0.00   0.00
>>>> sda3      0.00   1.80   0.00 8.40  0.00  0.04     9.71     0.01  0.64  0.05   0.04
>>>> sdc       0.00   0.00 115.80 0.60  0.46  0.00     8.04     0.99  8.48  8.47  98.54
>>>> sdc1      0.00   0.00 115.80 0.60  0.46  0.00     8.04     0.99  8.48  8.47  98.54
>>>>
>>>> avg-cpu:  %user %nice %system %iowait %steal  %idle
>>>>            0.07  0.00    0.55   12.25   0.00  87.13
>>>>
>>>> Device: rrqm/s wrqm/s    r/s  w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm  %util
>>>> sda       0.00   0.40   0.00 0.80  0.00  0.00    12.00     0.00  2.00  1.25   0.10
>>>> sda1      0.00   0.00   0.00 0.00  0.00  0.00     0.00     0.00  0.00  0.00   0.00
>>>> sda2      0.00   0.00   0.00 0.00  0.00  0.00     0.00     0.00  0.00  0.00   0.00
>>>> sda3      0.00   0.40   0.00 0.80  0.00  0.00    12.00     0.00  2.00  1.25   0.10
>>>> sdc       0.00   0.00 112.80 0.40  0.44  0.00     8.03     0.98  8.68  8.69  98.38
>>>> sdc1      0.00   0.00 112.80 0.40  0.44  0.00     8.03     0.98  8.68  8.69  98.38
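One variable worth ruling out in the write test above is the page cache. A
direct-I/O variant of the same command (a suggested sketch, not part of the
original test; assumes GNU dd with oflag=direct support) makes the reported
rate reflect the disk path alone:

$ dd if=/dev/zero of=/home/testdump count=1000 bs=1024k oflag=direct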
>>>>
>>>> Here is a test read and associated iostat output. I'm intentionally
>>>> reading from a different test file so as to avoid caching effects:
>>>>
>>>> $ dd if=/home/someothertestdump of=/dev/null bs=1024k
>>>>
>>>> avg-cpu:  %user %nice %system %iowait %steal  %idle
>>>>            0.10  0.00    3.60   10.85   0.00  85.45
>>>>
>>>> Device: rrqm/s wrqm/s    r/s  w/s  rMB/s wMB/s avgrq-sz avgqu-sz await svctm  %util
>>>> sda       0.00   3.79   0.00 1.40   0.00  0.02    29.71     0.00  1.29  0.43   0.06
>>>> sda1      0.00   0.00   0.00 0.00   0.00  0.00     0.00     0.00  0.00  0.00   0.00
>>>> sda2      0.00   0.00   0.00 0.00   0.00  0.00     0.00     0.00  0.00  0.00   0.00
>>>> sda3      0.00   3.79   0.00 1.40   0.00  0.02    29.71     0.00  1.29  0.43   0.06
>>>> sdc       7.98   0.20 813.17 1.00 102.50  0.00   257.84     1.92  2.34  1.19  96.71
>>>> sdc1      7.98   0.20 813.17 1.00 102.50  0.00   257.84     1.92  2.34  1.19  96.67
>>>>
>>>> avg-cpu:  %user %nice %system %iowait %steal  %idle
>>>>            0.07  0.00    3.67   10.22   0.00  86.03
>>>>
>>>> Device: rrqm/s wrqm/s    r/s  w/s  rMB/s wMB/s avgrq-sz avgqu-sz await svctm  %util
>>>> sda       0.00   0.20   0.00 0.40   0.00  0.00    12.00     0.00  0.50  0.50   0.02
>>>> sda1      0.00   0.00   0.00 0.00   0.00  0.00     0.00     0.00  0.00  0.00   0.00
>>>> sda2      0.00   0.00   0.00 0.00   0.00  0.00     0.00     0.00  0.00  0.00   0.00
>>>> sda3      0.00   0.20   0.00 0.40   0.00  0.00    12.00     0.00  0.50  0.50   0.02
>>>> sdc       6.60   0.20 829.00 1.00 104.28  0.00   257.32     1.90  2.31  1.17  97.28
>>>> sdc1      6.60   0.20 829.00 1.00 104.28  0.00   257.32     1.90  2.31  1.17  97.28
>>>>
>>>> I'm seeing a few weird kernel messages, such as:
>>>>
>>>> Dec 7 14:07:50 growler kernel: (dlm_wq,4793,4):dlm_deref_lockres_worker:2344 ERROR:
>>>> 84B7C6421A6C4280AB87F569035C5368:O0000000000000016296ce900000000: node
>>>> 14 trying to drop ref but it is already dropped!
>>>> Dec 7 14:07:50 growler kernel: lockres: O0000000000000016296ce900000000, owner=0, state=0
>>>> Dec 7 14:07:50 growler kernel: last used: 0, refcnt: 6, on purge list: no
>>>> Dec 7 14:07:50 growler kernel: on dirty list: no, on reco list: no, migrating pending: no
>>>> Dec 7 14:07:50 growler kernel: inflight locks: 0, asts reserved: 0
>>>> Dec 7 14:07:50 growler kernel: refmap nodes: [ 21 ], inflight=0
>>>> Dec 7 14:07:50 growler kernel: granted queue:
>>>> Dec 7 14:07:50 growler kernel: type=3, conv=-1, node=21,
>>>> cookie=21:213370, ref=2, ast=(empty=y,pend=n), bast=(empty=y,pend=n),
>>>> pending=(conv=n,lock=n,cancel=n,unlock=n)
>>>> Dec 7 14:07:50 growler kernel: converting queue:
>>>> Dec 7 14:07:50 growler kernel: blocked queue:
>>>>
>>>> Here is the df output:
>>>>
>>>> r...@growler:~$ df
>>>> Filesystem           1K-blocks       Used  Available Use% Mounted on
>>>> /dev/sda3            245695888   29469416  203544360  13% /
>>>> /dev/sda1               101086      15133      80734  16% /boot
>>>> tmpfs                 33005580          0   33005580   0% /dev/shm
>>>> /dev/sdc1           5857428444 5234400436  623028008  90% /home
>>>>
>>>> Thanks
>>>> -Daniel
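Since the thread notes that not all nodes could be verified, one way to see
which nodes are logging the same dlm error (a rough sketch; node names are
hypothetical, and it assumes passwordless ssh as root with syslog going to
/var/log/messages) is:

    # count dlm_deref_lockres_worker errors on each cluster node
    for n in node01 node02 node03; do
        echo -n "$n: "
        ssh "$n" grep -c dlm_deref_lockres_worker /var/log/messages
    done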