Re: [Linux-HA] heartbeat 2.0.8: causing nfs kernel oops

Gerry Reno Mon, 14 May 2007 12:44:56 -0700

Alan Robertson wrote:

Gerry Reno wrote:

Alan Robertson wrote:

Gerry Reno wrote:

Alan Robertson wrote:

Gerry Reno wrote:

I'm seeing some very strange things lately.  Whenever heartbeat is
running there are these messages in the log:
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: write failure on bcast eth0.: No such device
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: glib: Unable to send bcast [-1] packet(len=214): No such device
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG: Dumping message with 10 fields
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[0] : [t=NS_ackmsg]
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[1] : [dest=grp-01-30-02]
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[2] : [ackseq=40cd2]
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[3] : [(1)destuuid=0x835cfc8(37 28)]
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[4] : [src=grp-01-30-01]
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[5] : [(1)srcuuid=0x8361848(36 27)]
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[6] : [hg=a1]
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[7] : [ts=46367de0]
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[8] : [ttl=4]
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[9] : [auth=1 dcf0feb393f46354b060306713eb72adc15eecf3]


And yet, in most other respects eth0 seems to behave perfectly
normally.  I even went so far as to swap out the NIC for eth0, with the
same result.  I can ping, ftp, ssh, etc. using eth0 with no problems.
Where I do see a problem is with NFS.  If I mount a remote NFS export
and try to push a compressed tar to the NFS-mounted directory, after
about 1GB of transfer I get a kernel oops in the NFS code.  Now, if I
shut down heartbeat and perform the same compressed tar, it completes
correctly without any oops.  So I'm baffled by this.  Is there any known
problem that would cause the above log messages on an otherwise
perfectly good network connection and also cause some type of
interaction with NFS?  This problem seems to follow the primary node.
In other words, the lockup occurs on whichever node has the primary
IPaddr.  I can post the log, but it's hundreds of megabytes of this
same message.

Yes.

Running DHCP on a network link.  Taking the link down manually.  Other
things that involve messing around with eth0.

Alan,
 Where do you think this problem lies?  Is it a kernel problem or a
heartbeat problem?  Is this something that is being, has been, or can
be addressed by the heartbeat team?  Is there a workaround or fix?  This
problem greatly interferes with other network activities that need to
take place on our servers, such as backups.  That is how I discovered
it: none of the backups were completing overnight, and the whole machine
would be locked up due to the kernel oops.

It's either a misconfiguration (like running dhcp on a link without
dhcp), or someone is causing it by hand or with a script.

It's probably not a kernel bug, and it's probably not a heartbeat bug.

Easiest mistake to make: you're running dhcpcd or dhclient on eth0, and
you shouldn't be.

I saw someone doing that with exactly the same symptoms as you're seeing.

In Red Hat, you have to disable DHCP AND you have to disable the ifplugd
from managing that interface.  In the past just disabling DHCP would
have done it.  So, it's easy to be running DHCP without meaning to...

What this does is periodically drop the link, then bring it back up.  It
does this over and over and over and over...
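
A quick way to check for the daemons mentioned above is to scan the
process table. This is only a minimal sketch: it assumes a Linux system
where /proc/<pid>/comm is available, and the helper names
(find_suspects, running_process_names) are made up for illustration.

```python
import os

# Daemons named in this thread that can take an interface down and up again.
SUSPECTS = {"dhclient", "dhcpcd", "ifplugd"}


def find_suspects(process_names):
    """Return the suspect daemon names present in process_names, sorted."""
    return sorted(SUSPECTS.intersection(process_names))


def running_process_names(proc_root="/proc"):
    """Collect command names from <proc_root>/<pid>/comm (Linux only)."""
    names = set()
    if not os.path.isdir(proc_root):
        return names
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                names.add(fh.read().strip())
        except OSError:
            continue  # process exited or is not readable
    return names


# Usage: an empty list means none of the three daemons is running.
print(find_suspects(running_process_names()))
```

An empty result only rules out these three names; it does not show that
nothing else is reconfiguring eth0.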

I checked and no dhcp client or server running and we don't use
ifplugd.  Isn't ifplugd mainly for laptops?  I filtered the logs and the
only entries for eth0 are during the heartbeat error logging and from
avahi-daemon during bootup.  That's it.  We don't have any scripts that
we run to manage the network interfaces manually.  And I have not run
any manual commands for doing this with the exception of when I swapped
out the NIC card on the one machine.  But the problem was already
happening for several weeks by then.  Also, I have not changed any of
the heartbeat config files for a long time.  No one else has been on
these machines recently.  I don't understand why heartbeat log says "no
such device" for eth0 when it is there.  I look at the link light and
you can see the heartbeats.  As I said, rather baffling.  Could this be
some type of uuid issue on the heartbeat connections?  Like it's
connected but doesn't know that it's connected?


It is printing "no such device" because that's the message that goes
with the errno it's getting from the kernel.  This is a real system call
error.
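
To illustrate the errno-to-message mapping: on Linux the text "No such
device" normally corresponds to ENODEV. That heartbeat received exactly
ENODEV is an assumption based on the log text, not something confirmed
in this thread. The helper name describe_errno is hypothetical.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, C library message) for an errno value."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


# ENODEV is assumed here; the message text comes from the local C library.
name, message = describe_errno(errno.ENODEV)
print(name, "->", message)
```

The same lookup works for any errno value, which helps when a log shows
only the message text and not the number.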

ifplugd gets used by default on RH I think.  The other place I saw this
was servers too...

Can you give me the complete logs starting from an hour before the first
problem, and 100 lines after the first error?  Send them to [EMAIL PROTECTED]
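
Since the full log is hundreds of megabytes, the requested window could
be cut out with something like the sketch below. It assumes classic
syslog timestamps ("Apr 30 19:38:08") at the start of each line, that
the first line containing "ERROR" is the first error, and that the
caller supplies the year, since syslog omits it. The function names are
made up for illustration.

```python
from datetime import datetime, timedelta


def parse_ts(line, year):
    """Parse a syslog prefix such as 'Apr 30 19:38:08'; None if absent."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def extract_window(lines, year, marker="ERROR",
                   before=timedelta(hours=1), after=100):
    """Return lines from `before` ahead of the first marker line
    through `after` lines past it."""
    first = next((i for i, l in enumerate(lines) if marker in l), None)
    if first is None:
        return []
    err_ts = parse_ts(lines[first], year)
    start = first
    if err_ts is not None:
        # Walk backwards until a line is older than the window.
        while start > 0:
            ts = parse_ts(lines[start - 1], year)
            if ts is not None and ts < err_ts - before:
                break
            start -= 1
    return lines[start:first + after + 1]
```

Lines without a parsable timestamp (for example wrapped continuation
lines) are kept as long as they fall inside the window.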

Alan,

Did you ever receive my email and log file?  I sent it on May 2.  I
haven't heard anything back so I thought I would check.


Thanks,
Gerry


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
