Dear Sunil,

On Tue, 2010-12-07 at 09:07 -0800, Sunil Mushran wrote:
> Check the kernel stack of the D state processes.
>
> cat /proc/PID/stack
>
> The kernel stack will tell us where it is waiting. My guess is that
> the io stack is slow. Slow ios appear as temporary hangs to the
> users.
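[Editor's note: Sunil's suggestion can be scripted so that every D-state process is captured in one pass. This is a minimal sketch, not from the thread, assuming Linux procps `ps`; reading /proc/<pid>/stack typically requires root.]

```shell
#!/bin/sh
# Dump the kernel stack of every process currently in
# D (uninterruptible sleep) state.
for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'); do
    comm=$(cat "/proc/$pid/comm" 2>/dev/null)
    echo "=== PID $pid ($comm) ==="
    cat "/proc/$pid/stack" 2>/dev/null
done
```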
Thanks. First I need to sort out a packaging issue with the latest
Debian testing version that causes the System.map to be out of sync
with the running kernel, so there are no symbols available :-( I doubt
that is the problem, though. Both systems use one gigabit link to the
NAS device for AoE and are connected by another gigabit link to a
switch that links them to the outside world. There are no
errors/collisions reported in /proc for the network devices, and in
general network performance between the systems is excellent. Let us
wait for the stack dumps and see whether they shed new light.

Regards --- Jan

> On 12/07/2010 07:45 AM, Jan Wielemaker wrote:
> > Hi,
> >
> > I'm pretty new to ocfs2 and a bit stuck. I have two Debian/Squeeze
> > (testing) machines accessing an ocfs2 filesystem over AoE. The
> > filesystem sits on an lvm2 volume, but I guess that is irrelevant.
> >
> > Even when mostly idle, everything accessing the cluster sometimes
> > hangs for about 20 seconds. This happens rather frequently, say
> > every 5 minutes; the interval seems irregular, while the duration
> > of the hangs is quite consistent. This behaviour seems pretty much
> > independent of the (I/O) load on the nodes (as long as it is not
> > really high).
> >
> > I ran a ps, grepping for D state, repeated every second on both
> > nodes. While hanging, both show this:
> >
> > 1649 D<  o2hb-02BC250CDB  ?
> > 3507 R+  ps               -
> > 1649 D<  o2hb-02BC250CDB  ?
> > 3511 R+  ps               -
> > 1649 D<  o2hb-02BC250CDB  ?
> > 3515 R+  ps               -
> > 1649 D<  o2hb-02BC250CDB  ?
> > 3519 R+  ps               -
> > 1649 D<  o2hb-02BC250CDB  ?
> > 3523 R+  ps               -
> > 1649 D<  o2hb-02BC250CDB  ?
> > 3527 R+  ps               -
> > 1649 D<  o2hb-02BC250CDB  ?
> > 3531 R+  ps               -
> > 1649 D<  o2hb-02BC250CDB  ?
> > 1670 D   jbd2/dm-4-18     ?
> > 3535 R+  ps               -
> > 1649 D<  o2hb-02BC250CDB  ?
> > 1670 D   jbd2/dm-4-18     ?
> > 3539 R+  ps               -
> > 1649 D<  o2hb-02BC250CDB  ?
> > 1670 D   jbd2/dm-4-18     ?
> > 3543 R+  ps               -
> >
> > ocfs2-tools is at version 1.4.4-3. The kernel is version
> > 2.6.32-5-amd64.
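[Editor's note: the once-a-second sampling described above can be done with a small loop like the sketch below (not from the thread). The `wchan` column names the kernel function each process is blocked in; assumes Linux procps. Interrupt with Ctrl-C.]

```shell
#!/bin/sh
# Every second, print a timestamp plus all processes currently in
# D (uninterruptible sleep) state, including their kernel wait channel.
while :; do
    date '+%T'
    ps -eo pid=,stat=,comm=,wchan= | awk '$2 ~ /^D/'
    sleep 1
done
```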
> > The kernel log of the mount at boot is here:
> >
> > [ 18.911452] aoe: AoE v47 initialised.
> > [ 19.686358] fuse init (API version 7.13)
> > [ 29.000017] eth2: no IPv6 routers present
> > [ 36.212109] aoe: 003048f28d36 e1.1 vace0 has 9767625597 sectors
> > [ 36.212218] etherd/e1.1: unknown partition table
> > [ 59.715506] OCFS2 Node Manager 1.5.0
> > [ 59.732002] OCFS2 DLM 1.5.0
> > [ 59.733343] ocfs2: Registered cluster interface o2cb
> > [ 59.749185] OCFS2 DLMFS 1.5.0
> > [ 59.749304] OCFS2 User DLM kernel interface loaded
> > [ 65.347517] o2net: accepted connection from node eculture (num 1)
> >     at 130.37.193.11:7777
> > [ 67.884256] OCFS2 1.5.0
> > [ 67.886984] ocfs2_dlm: Nodes in domain
> >     ("02BC250CDB0A4B468F845C68BE99B90E"): 0 1
> > [ 67.890075] ocfs2: Mounting device (254,4) on (node 0, slot 0)
> >     with ordered data mode.
> >
> > Installation and formatting are totally standard.
> >
> > I've been spending quite a bit of time trying to get a clue about
> > what might be wrong, but so far I have failed. Today I played a
> > fair bit with debugfs, but I do not have enough experience to see
> > what is odd. Dumping all the locks showed just over 100,000 of
> > them, which I thought might be a lot, but posts suggest it isn't.
> > There are no busy, or at most very few (-B), locks.
> >
> > I checked the cabling and low-level network activity. It seems OK.
> >
> > Does anyone have similar experiences and/or an idea where to look?
> >
> > Thanks --- Jan

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users