Gerry Reno wrote:
> Alan Robertson wrote:
>> Gerry Reno wrote:
>>
>>> I'm seeing some very strange things lately. Whenever heartbeat is
>>> running there are these messages in the log:
>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: write failure on
>>> bcast eth0.: No such device
>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: glib: Unable to
>>> send bcast [-1] packet(len=214): No such device
>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG: Dumping
>>> message with 10 fields
>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[0] :
>>> [t=NS_ackmsg]
>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[1] :
>>> [dest=grp-01-30-02]
>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[2] :
>>> [ackseq=40cd2]
>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[3] :
>>> [(1)destuuid=0x835cfc8(37 28)]
>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[4] :
>>> [src=grp-01-30-01]
>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[5] :
>>> [(1)srcuuid=0x8361848(36 27)]
>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[6] : [hg=a1]
>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[7] :
>>> [ts=46367de0]
>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[8] : [ttl=4]
>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[9] : [auth=1
>>> dcf0feb393f46354b060306713eb72adc15eecf3]
>>>
>>> Yet in most other respects eth0 seems to behave perfectly
>>> normally. I even went so far as to swap out the NIC for eth0, with the
>>> same result. I can ping, ftp, ssh, etc. using eth0 with no problems. Where
>>> I do see a problem is with NFS. If I mount a remote NFS export and
>>> try to push a compressed tar to the NFS-mounted directory, after about
>>> 1GB of transfer I get a kernel oops in the NFS code. Now, if I shut down
>>> heartbeat and perform the same compressed tar, it completes correctly
>>> without any oops. So I'm baffled by this. Is there any known problem
>>> that would cause the above log messages on an otherwise perfectly good
>>> network connection and also cause some type of interaction with NFS?
>>> This problem seems to follow the primary node. In other words the
>>> lockup occurs on whichever node has the primary IPaddr. I can post the
>>> log, but it's hundreds of megabytes of this same message.
>>>
>>
>> Yes.
>>
>> Running DHCP on a network link. Taking the link down manually. Other
>> things that involve messing around with eth0.
>>
> Alan,
> Where do you think this problem lies? Is it a kernel problem or a
> heartbeat problem? Is this something that is/has been/can be addressed
> by the heartbeat team? Is there a workaround/fix? This problem greatly
> interferes with other network activities that need to take place on our
> servers, such as backups. That is how I discovered it: none of
> the backups were completing overnight, and the whole machine would be
> locked up due to the kernel oops.
It's either a misconfiguration (like running a DHCP client on a link that
has no DHCP server), or someone is causing it by hand or with a script.
It's probably not a kernel bug, and it's probably not a heartbeat bug.
Easiest mistake to make: you're running dhcpcd or dhclient on eth0 when
you shouldn't be.
I saw someone doing that with exactly the same symptoms you're seeing.
On Red Hat, you have to disable DHCP AND you have to stop ifplugd from
managing that interface. In the past, just disabling DHCP would
have been enough. So it's easy to be running DHCP without meaning to...
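A quick way to check for a DHCP client you didn't mean to run is to look at the process list. The following is only a rough Python sketch, not anything shipped with heartbeat: the function names are made up for illustration, and it assumes a Linux /proc filesystem and that the client shows up under the name dhclient or dhcpcd.

```python
import os

# Client names mentioned in this thread; other DHCP clients are not covered.
DHCP_CLIENTS = {"dhclient", "dhcpcd"}


def find_dhcp_clients(cmdlines, iface="eth0"):
    """Return (argv, names_iface) for every command line that runs a DHCP client.

    names_iface is True when the interface name appears among the arguments.
    """
    hits = []
    for argv in cmdlines:
        if not argv:
            continue
        if os.path.basename(argv[0]) in DHCP_CLIENTS:
            hits.append((argv, iface in argv[1:]))
    return hits


def read_proc_cmdlines(proc_root="/proc"):
    """Collect argv lists of running processes from /proc/<pid>/cmdline."""
    cmdlines = []
    if not os.path.isdir(proc_root):
        return cmdlines  # not a Linux-style /proc; nothing to inspect
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                raw = fh.read()
        except OSError:
            continue  # process exited or is not readable
        argv = [part.decode(errors="replace") for part in raw.split(b"\0") if part]
        if argv:
            cmdlines.append(argv)
    return cmdlines
```

Calling find_dhcp_clients(read_proc_cmdlines()) on the affected node lists any dhclient or dhcpcd process. A client started without an interface argument may still be managing eth0, so treat a False flag as "look closer" and not as "all clear".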
A DHCP client running this way periodically drops the link, then brings
it back up. It does this over and over and over and over...
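To see how often the interface is going away, you can count the "write failure" lines per minute instead of wading through hundreds of megabytes of log. This is a sketch in Python that assumes each log entry is on one line in the syslog format quoted at the top of this thread; the function name is invented for the example.

```python
import re
from collections import Counter

# Matches lines such as:
# Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: write failure on bcast eth0.: No such device
WRITE_FAILURE = re.compile(
    r"(\w{3}\s+\d+\s+\d\d:\d\d):\d\d\s+\S+\s+heartbeat: \[\d+\]: "
    r"ERROR: write failure on bcast (\S+?)\.: No such device"
)


def count_write_failures(lines, iface="eth0"):
    """Count bcast write failures for iface, keyed by 'Mon DD HH:MM'."""
    per_minute = Counter()
    for line in lines:
        match = WRITE_FAILURE.match(line)
        if match and match.group(2) == iface:
            per_minute[match.group(1)] += 1
    return per_minute
```

Feed it an open log file (for example `count_write_failures(open(path))`, where path is wherever your heartbeat messages are logged) and look at the minutes that show up. Failures that recur at regular intervals would fit the pattern of something repeatedly dropping and restoring the link.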
--
Alan Robertson <[EMAIL PROTECTED]>
"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce