Dear Sunil,

On Tue, 2010-12-07 at 09:07 -0800, Sunil Mushran wrote:
> Check the kernel stack of the D state processes.
>
> cat /proc/PID/stack
>
> The kernel stack will tell us where it is waiting. My guess is that
> the io stack is slow. Slow ios appear as temporary hangs to the
> users.
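[Editor's note: Sunil's suggestion can be scripted so that every D-state process is captured in one pass. This is a minimal sketch, not from the thread, assuming Linux procps `ps`; reading /proc/<pid>/stack typically requires root.]

```shell
#!/bin/sh
# Dump the kernel stack of every process currently in
# D (uninterruptible sleep) state.
for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'); do
    comm=$(cat "/proc/$pid/comm" 2>/dev/null)
    echo "=== PID $pid ($comm) ==="
    cat "/proc/$pid/stack" 2>/dev/null
done
```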
Thanks. First I need to sort out a packaging issue with the latest
Debian testing version that causes the System.map to be out of sync
with the running kernel, so there are no symbols available :-( I doubt
that is the problem, though. Both systems use one gigabit link to the
NAS device for AoE and are connected by another gigabit link to a
switch that links them to the outside world. There are no
errors/collisions reported in /proc for the network devices, and in
general network performance between the systems is excellent. Let us
wait for the stack dumps and see whether they shed new light.

Regards --- Jan

> On 12/07/2010 07:45 AM, Jan Wielemaker wrote:
> > Hi,
> >
> > I'm pretty new to ocfs2 and a bit stuck. I have two Debian/Squeeze
> > (testing) machines accessing an ocfs2 filesystem over AoE. The
> > filesystem sits on an lvm2 volume, but I guess that is irrelevant.
> >
> > Even when mostly idle, everything accessing the cluster sometimes
> > hangs for about 20 seconds. This happens rather frequently, say
> > every 5 minutes; the interval seems irregular, while the duration
> > of the hangs is quite consistent. This behaviour seems pretty much
> > independent of the (I/O) load on the nodes (as long as it is not
> > really high).
> >
> > I ran a ps, grepping for D state, repeated every second on both
> > nodes. While hanging, both show this:
> >
> > 1649 D<  o2hb-02BC250CDB  ?
> > 3507 R+  ps               -
> > 1649 D<  o2hb-02BC250CDB  ?
> > 3511 R+  ps               -
> > 1649 D<  o2hb-02BC250CDB  ?
> > 3515 R+  ps               -
> > 1649 D<  o2hb-02BC250CDB  ?
> > 3519 R+  ps               -
> > 1649 D<  o2hb-02BC250CDB  ?
> > 3523 R+  ps               -
> > 1649 D<  o2hb-02BC250CDB  ?
> > 3527 R+  ps               -
> > 1649 D<  o2hb-02BC250CDB  ?
> > 3531 R+  ps               -
> > 1649 D<  o2hb-02BC250CDB  ?
> > 1670 D   jbd2/dm-4-18     ?
> > 3535 R+  ps               -
> > 1649 D<  o2hb-02BC250CDB  ?
> > 1670 D   jbd2/dm-4-18     ?
> > 3539 R+  ps               -
> > 1649 D<  o2hb-02BC250CDB  ?
> > 1670 D   jbd2/dm-4-18     ?
> > 3543 R+  ps               -
> >
> > ocfs2-tools is at version 1.4.4-3. The kernel is version
> > 2.6.32-5-amd64.
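[Editor's note: the once-a-second sampling described above can be done with a small loop like the sketch below (not from the thread). The `wchan` column names the kernel function each process is blocked in; assumes Linux procps. Interrupt with Ctrl-C.]

```shell
#!/bin/sh
# Every second, print a timestamp plus all processes currently in
# D (uninterruptible sleep) state, including their kernel wait channel.
while :; do
    date '+%T'
    ps -eo pid=,stat=,comm=,wchan= | awk '$2 ~ /^D/'
    sleep 1
done
```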
> > The kernel log of the mount at boot is here:
> >
> > [ 18.911452] aoe: AoE v47 initialised.
> > [ 19.686358] fuse init (API version 7.13)
> > [ 29.000017] eth2: no IPv6 routers present
> > [ 36.212109] aoe: 003048f28d36 e1.1 vace0 has 9767625597 sectors
> > [ 36.212218] etherd/e1.1: unknown partition table
> > [ 59.715506] OCFS2 Node Manager 1.5.0
> > [ 59.732002] OCFS2 DLM 1.5.0
> > [ 59.733343] ocfs2: Registered cluster interface o2cb
> > [ 59.749185] OCFS2 DLMFS 1.5.0
> > [ 59.749304] OCFS2 User DLM kernel interface loaded
> > [ 65.347517] o2net: accepted connection from node eculture (num 1)
> >     at 130.37.193.11:7777
> > [ 67.884256] OCFS2 1.5.0
> > [ 67.886984] ocfs2_dlm: Nodes in domain
> >     ("02BC250CDB0A4B468F845C68BE99B90E"): 0 1
> > [ 67.890075] ocfs2: Mounting device (254,4) on (node 0, slot 0)
> >     with ordered data mode.
> >
> > Installation and formatting are totally standard.
> >
> > I've been spending quite a bit of time trying to get a clue about
> > what might be wrong, but so far I have failed. Today I played a
> > fair bit with debugfs, but I do not have enough experience to see
> > what is odd. Dumping all the locks showed just over 100,000 of
> > them, which I thought might be a lot, but posts suggest it isn't.
> > There are no busy, or at most very few (-B), locks.
> >
> > I checked the cabling and low-level network activity. It seems OK.
> >
> > Does anyone have similar experiences and/or an idea where to look?
> >
> > Thanks --- Jan

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users