Not sure if it would help, but try kernel 2.4.21.

On 6/10/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
Send Ocfs2-users mailing list submissions to
       [email protected]

To subscribe or unsubscribe via the World Wide Web, visit
       http://oss.oracle.com/mailman/listinfo/ocfs2-users
or, via email, send a message with subject or body 'help' to
       [EMAIL PROTECTED]

You can reach the person managing the list at
       [EMAIL PROTECTED]

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Ocfs2-users digest..."


Today's Topics:

  1. RHEL 4 U2 / OCFS 1.2.1 weekly crash? (Brian Long)
  2. Re: RHEL 4 U2 / OCFS 1.2.1 weekly crash? (Sunil Mushran)
  3. Re: RHEL 4 U2 / OCFS 1.2.1 weekly crash? (Brian Long)
  4. Re: RHEL 4 U2 / OCFS 1.2.1 weekly crash? (Sunil Mushran)


----------------------------------------------------------------------

Message: 1
Date: Fri, 09 Jun 2006 13:38:58 -0400
From: Brian Long <[EMAIL PROTECTED]>
Subject: [Ocfs2-users] RHEL 4 U2 / OCFS 1.2.1 weekly crash?
To: [email protected]
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain

Hello,

I have two nodes running the 2.6.9-22.0.2.ELsmp kernel and the OCFS2
1.2.1 RPMs.  About once a week, one of the nodes crashes itself (self-
fencing) and I get a full vmcore on my netdump server.  The netdump log
file shows the shared filesystem LUN (/dev/dm-6) did not respond within
12000ms.  I have not changed the default heartbeat values
in /etc/sysconfig/o2cb.  There was no other I/O ongoing when this
happened, but the nodes are HP ProLiant servers running the Insight
Manager agents.

Why would the heartbeat fail roughly once a week?  Should I open a
bugzilla and upload my netdump log file?

Thanks.

/Brian/
--
      Brian Long                      |         |           |
      IT Data Center Systems          |       .|||.       .|||.
      Cisco Linux Developer           |   ..:|||||||:...:|||||||:..
      Phone: (919) 392-7363           |   C i s c o   S y s t e m s




------------------------------

Message: 2
Date: Fri, 09 Jun 2006 10:49:48 -0700
From: Sunil Mushran <[EMAIL PROTECTED]>
Subject: Re: [Ocfs2-users] RHEL 4 U2 / OCFS 1.2.1 weekly crash?
To: Brian Long <[EMAIL PROTECTED]>
Cc: [email protected]
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

The hb failure is just the effect of the ios not completing within 12 secs.
The full oops trace gives the last 24 ops and their timings.

One solution is to double the hb timeout. Set:
O2CB_HEARTBEAT_THRESHOLD = 14
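[Editor's note: a hedged sketch of making that change on RHEL 4. The file path and the init script are the o2cb defaults; the cluster should be taken offline on the node before restarting o2cb.]

```shell
# Hedged sketch: raising the o2cb heartbeat threshold on RHEL 4.
# The fence window is (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds, so the
# default of 7 gives the 12000 ms seen in the log, and 14 gives 26 s.
grep O2CB_HEARTBEAT_THRESHOLD /etc/sysconfig/o2cb      # check current value
sed -i 's/^O2CB_HEARTBEAT_THRESHOLD=.*/O2CB_HEARTBEAT_THRESHOLD=14/' \
    /etc/sysconfig/o2cb
/etc/init.d/o2cb restart    # with the cluster offline on this node first
```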

Brian Long wrote:
> Hello,
>
> I have two nodes running the 2.6.9-22.0.2.ELsmp kernel and the OCFS2
> 1.2.1 RPMs.  About once a week, one of the nodes crashes itself (self-
> fencing) and I get a full vmcore on my netdump server.  The netdump log
> file shows the shared filesystem LUN (/dev/dm-6) did not respond within
> 12000ms.  I have not changed the default heartbeat values
> in /etc/sysconfig/o2cb.  There was no other IO ongoing when this
> happens, but they are HP Proliant servers running the Insight Manager
> agents.
>
> Why would the heartbeat fail roughly once a week?  Should I open a
> bugzilla and upload my netdump log file?
>
> Thanks.
>
> /Brian/
>



------------------------------

Message: 3
Date: Fri, 09 Jun 2006 15:30:05 -0400
From: Brian Long <[EMAIL PROTECTED]>
Subject: Re: [Ocfs2-users] RHEL 4 U2 / OCFS 1.2.1 weekly crash?
To: Sunil Mushran <[EMAIL PROTECTED]>
Cc: [email protected]
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain

Understood, but how do I determine why once a week I'm failing the 12
second heartbeat?  Before I bump the HB, shouldn't I figure out why dm-6
is gone for 12 seconds?  The last 24 ops are as follows:

(7,1):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device
dm-6 after 12000 milliseconds
Heartbeat thread (7) printing last 24 blocking operations (cur = 3):
Heartbeat thread stuck at waiting for read completion, stuffing current
time into that blocker (index 3)
Index 4: took 0 ms to do submit_bio for read
Index 5: took 0 ms to do waiting for read completion
Index 6: took 0 ms to do bio alloc write
Index 7: took 0 ms to do bio add page write
Index 8: took 0 ms to do submit_bio for write
Index 9: took 0 ms to do checking slots
Index 10: took 0 ms to do waiting for write completion
Index 11: took 1998 ms to do msleep
Index 12: took 0 ms to do allocating bios for read
Index 13: took 0 ms to do bio alloc read
Index 14: took 0 ms to do bio add page read
Index 15: took 0 ms to do submit_bio for read
Index 16: took 0 ms to do waiting for read completion
Index 17: took 0 ms to do bio alloc write
Index 18: took 0 ms to do bio add page write
Index 19: took 0 ms to do submit_bio for write
Index 20: took 0 ms to do checking slots
Index 21: took 0 ms to do waiting for write completion
Index 22: took 1999 ms to do msleep
Index 23: took 0 ms to do allocating bios for read
Index 0: took 0 ms to do bio alloc read
Index 1: took 0 ms to do bio add page read
Index 2: took 0 ms to do submit_bio for read
Index 3: took 9998 ms to do waiting for read completion
(7,1):o2hb_stop_all_regions:1888 ERROR: stopping heartbeat on all active
regions.
Kernel panic - not syncing: ocfs2 is very sorry to be fencing this
system by panicing
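[Editor's note: the 12000 ms in the trace follows from o2cb's threshold arithmetic: the fence window is (threshold - 1) heartbeat intervals of 2000 ms, and the ~2000 ms msleep lines above are those intervals. A small illustrative check, assuming that relation:]

```python
# Illustrative sketch: o2cb dead-threshold to fence-window conversion.
# Assumes window_ms = (threshold - 1) * interval_ms, with the 2000 ms
# heartbeat interval visible as the ~2000 ms msleep rounds in the trace.
def hb_window_ms(threshold: int, interval_ms: int = 2000) -> int:
    """Milliseconds of unresponsive disk before a node self-fences."""
    return (threshold - 1) * interval_ms

print(hb_window_ms(7))   # default threshold -> 12000 ms, matching the log
print(hb_window_ms(14))  # suggested value   -> 26000 ms
```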

/Brian/

On Fri, 2006-06-09 at 10:49 -0700, Sunil Mushran wrote:
> The hb failure is just the effect of the ios not completing within 12 secs.
> The full oops trace gives the last 24 ops and their timings.
>
> One solution is to double up the hb timeout. Set,
> O2CB_HEARTBEAT_THRESHOLD = 14
>
> Brian Long wrote:
> > Hello,
> >
> > I have two nodes running the 2.6.9-22.0.2.ELsmp kernel and the OCFS2
> > 1.2.1 RPMs.  About once a week, one of the nodes crashes itself (self-
> > fencing) and I get a full vmcore on my netdump server.  The netdump log
> > file shows the shared filesystem LUN (/dev/dm-6) did not respond within
> > 12000ms.  I have not changed the default heartbeat values
> > in /etc/sysconfig/o2cb.  There was no other IO ongoing when this
> > happens, but they are HP Proliant servers running the Insight Manager
> > agents.
> >
> > Why would the heartbeat fail roughly once a week?  Should I open a
> > bugzilla and upload my netdump log file?
> >
> > Thanks.
> >
> > /Brian/
> >
--
      Brian Long                      |         |           |
      IT Data Center Systems          |       .|||.       .|||.
      Cisco Linux Developer           |   ..:|||||||:...:|||||||:..
      Phone: (919) 392-7363           |   C i s c o   S y s t e m s




------------------------------

Message: 4
Date: Fri, 09 Jun 2006 13:00:48 -0700
From: Sunil Mushran <[EMAIL PROTECTED]>
Subject: Re: [Ocfs2-users] RHEL 4 U2 / OCFS 1.2.1 weekly crash?
To: Brian Long <[EMAIL PROTECTED]>
Cc: [email protected]
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

This dump is very much like the one we used to see with the
cfq io scheduler. The very last io op would consume all the time.
I am assuming that you are running with the DEADLINE io sched.

Are there any other common factors in all the crashes? For example,
does it always happen on the same node, or around the same time? How do
you know there is no other io happening at that time? What about cron
jobs?

Also, is the shared disk connected to any other nodes that could be the
cause of the io spike?
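[Editor's note: the active io scheduler can be inspected per block device; whether it can be switched at runtime depends on the kernel build, and sda below is only an example device name.]

```shell
# Hedged sketch: checking and selecting the io scheduler.
cat /sys/block/sda/queue/scheduler   # e.g. "noop anticipatory deadline [cfq]"
# On kernels that support runtime switching:
echo deadline > /sys/block/sda/queue/scheduler
# Otherwise, set it for all devices at boot via the kernel command line:
#   elevator=deadline
```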

Brian Long wrote:
> Understood, but how do I determine why once a week I'm failing the 12
> second heartbeat?  Before I bump the HB, shouldn't I figure out why dm-6
> is gone for 12 seconds?  The last 24 ops are as follows:
>
> (7,1):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device
> dm-6 after 12000 milliseconds
> Heartbeat thread (7) printing last 24 blocking operations (cur = 3):
> Heartbeat thread stuck at waiting for read completion, stuffing current
> time into that blocker (index 3)
> Index 4: took 0 ms to do submit_bio for read
> Index 5: took 0 ms to do waiting for read completion
> Index 6: took 0 ms to do bio alloc write
> Index 7: took 0 ms to do bio add page write
> Index 8: took 0 ms to do submit_bio for write
> Index 9: took 0 ms to do checking slots
> Index 10: took 0 ms to do waiting for write completion
> Index 11: took 1998 ms to do msleep
> Index 12: took 0 ms to do allocating bios for read
> Index 13: took 0 ms to do bio alloc read
> Index 14: took 0 ms to do bio add page read
> Index 15: took 0 ms to do submit_bio for read
> Index 16: took 0 ms to do waiting for read completion
> Index 17: took 0 ms to do bio alloc write
> Index 18: took 0 ms to do bio add page write
> Index 19: took 0 ms to do submit_bio for write
> Index 20: took 0 ms to do checking slots
> Index 21: took 0 ms to do waiting for write completion
> Index 22: took 1999 ms to do msleep
> Index 23: took 0 ms to do allocating bios for read
> Index 0: took 0 ms to do bio alloc read
> Index 1: took 0 ms to do bio add page read
> Index 2: took 0 ms to do submit_bio for read
> Index 3: took 9998 ms to do waiting for read completion
> (7,1):o2hb_stop_all_regions:1888 ERROR: stopping heartbeat on all active
> regions.
> Kernel panic - not syncing: ocfs2 is very sorry to be fencing this
> system by panicing
>
> /Brian/
>
> On Fri, 2006-06-09 at 10:49 -0700, Sunil Mushran wrote:
>
>> The hb failure is just the effect of the ios not completing within 12 secs.
>> The full oops trace gives the last 24 ops and their timings.
>>
>> One solution is to double up the hb timeout. Set,
>> O2CB_HEARTBEAT_THRESHOLD = 14
>>
>> Brian Long wrote:
>>
>>> Hello,
>>>
>>> I have two nodes running the 2.6.9-22.0.2.ELsmp kernel and the OCFS2
>>> 1.2.1 RPMs.  About once a week, one of the nodes crashes itself (self-
>>> fencing) and I get a full vmcore on my netdump server.  The netdump log
>>> file shows the shared filesystem LUN (/dev/dm-6) did not respond within
>>> 12000ms.  I have not changed the default heartbeat values
>>> in /etc/sysconfig/o2cb.  There was no other IO ongoing when this
>>> happens, but they are HP Proliant servers running the Insight Manager
>>> agents.
>>>
>>> Why would the heartbeat fail roughly once a week?  Should I open a
>>> bugzilla and upload my netdump log file?
>>>
>>> Thanks.
>>>
>>> /Brian/
>>>
>>>



------------------------------

_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users


End of Ocfs2-users Digest, Vol 30, Issue 7
******************************************
