Gerry Reno wrote:
> Alan Robertson wrote:
>> Gerry Reno wrote:
>>
>>> Alan Robertson wrote:
>>>
>>>> Gerry Reno wrote:
>>>>
>>>>
>>>>> I'm seeing some very strange things lately. Whenever heartbeat is
>>>>> running there are these messages in the log:
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: write failure on bcast eth0.: No such device
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: glib: Unable to send bcast [-1] packet(len=214): No such device
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG: Dumping message with 10 fields
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[0] : [t=NS_ackmsg]
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[1] : [dest=grp-01-30-02]
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[2] : [ackseq=40cd2]
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[3] : [(1)destuuid=0x835cfc8(37 28)]
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[4] : [src=grp-01-30-01]
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[5] : [(1)srcuuid=0x8361848(36 27)]
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[6] : [hg=a1]
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[7] : [ts=46367de0]
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[8] : [ttl=4]
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[9] : [auth=1 dcf0feb393f46354b060306713eb72adc15eecf3]
>>>>>
>>>>> But yet, in most other respects eth0 seems to behave perfectly
>>>>> normally. I even went so far as to swap out the NIC for eth0, with
>>>>> the same result. I can ping, ftp, ssh, etc. using eth0 with no
>>>>> problems. Where I do see a problem is with NFS. If I mount a remote
>>>>> NFS share and try to push a compressed tar to the NFS-mounted
>>>>> directory, after about 1GB of transfer I get a kernel oops in the
>>>>> NFS code. Now, if I shut down heartbeat and perform the same
>>>>> compressed tar, it completes correctly without any oops. So I'm
>>>>> baffled by this. Is there any known problem that would cause the
>>>>> above log messages on an otherwise perfectly good network
>>>>> connection and also cause some type of interaction with NFS? This
>>>>> problem seems to follow the primary node. In other words, the
>>>>> lockup occurs on whichever node has the primary IPaddr. I can post
>>>>> the log, but it's hundreds of megabytes of this same message.
>>>>>
>>>> Yes.
>>>>
>>>> Running DHCP on a network link. Taking the link down manually. Other
>>>> things that involve messing around with eth0.
>>>>
>>> Alan,
>>> Where do you think this problem lies? Is it a kernel problem; a
>>> heartbeat problem? Is this something that is/has been/can be addressed
>>> by the heartbeat team? Is there a workaround/fix? This problem greatly
>>> interferes with other network activities that need to take place on our
>>> servers such as backups and that is how I discovered it because none of
>>> the backups were completing overnight and the whole machine would be
>>> locked up due to the kernel oops.
>>>
>>
>> It's either a misconfiguration (like running dhcp on a link without
>> dhcp), or someone is causing it by hand or with a script.
>>
>> It's probably not a kernel bug, and it's probably not a heartbeat bug.
>>
>> Easiest mistake to make: you're running dhcpcd or dhclient on eth0 when
>> you shouldn't be.
>>
>> I saw someone doing that with exactly the same symptoms as you're seeing.
>>
>> In Red Hat, you have to disable DHCP AND you have to stop ifplugd
>> from managing that interface. In the past just disabling DHCP would
>> have done it. So, it's easy to be running DHCP without meaning to...
>>
>> What this does is periodically drop the link, then bring it back up. It
>> does this over and over and over and over...
>>
>>
>>
> I checked, and no DHCP client or server is running, and we don't use
> ifplugd. Isn't ifplugd mainly for laptops? I filtered the logs and the
> only entries for eth0 are during the heartbeat error logging and from
> avahi-daemon during bootup. That's it. We don't have any scripts that
> we run to manage the network interfaces manually. And I have not run
> any manual commands for doing this with the exception of when I swapped
> out the NIC card on the one machine. But the problem was already
> happening for several weeks by then. Also, I have not changed any of
> the heartbeat config files for a long time. No one else has been on
> these machines recently. I don't understand why heartbeat log says "no
> such device" for eth0 when it is there. I look at the link light and
> you can see the heartbeats. As I said, rather baffling. Could this be
> some type of uuid issue on the heartbeat connections? Like it's
> connected but doesn't know that it's connected?
It is printing "No such device" because that's the message that goes
with the errno (ENODEV) it's getting back from the kernel. This is a
real system call error, so something really is making eth0 go away
from the kernel's point of view at the moment heartbeat tries to write.
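
To make the errno-to-message mapping concrete, here is a small sketch
using only the Python standard library. The helper name describe_errno
is made up for illustration; it is not part of heartbeat. On Linux,
ENODEV is the errno whose text is "No such device".

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message text) for an errno value.

    describe_errno is a hypothetical helper for illustration only.
    """
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# Look up the errno behind the message heartbeat is logging.
name, text = describe_errno(errno.ENODEV)
print(name, "->", text)
```

The text comes from the C library, not from heartbeat, which is why the
log wording tells you which errno the failing write returned.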
ifplugd gets used by default on RH, I think. The other place I saw this
was on servers too...
Can you give me the complete logs starting from an hour before the first
problem, and 100 lines after the first error? Send them to [EMAIL PROTECTED]
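
If pulling that slice out of a multi-hundred-megabyte log by hand is
painful, something along these lines could do it. This is only a rough
sketch: extract_window, parse_stamp, and the year argument are names
made up here (syslog stamps carry no year, so one has to be supplied),
and it assumes the usual "Apr 30 19:38:08" prefix seen in the messages
quoted above and that "ERROR" marks the first problem.

```python
from datetime import datetime, timedelta


def parse_stamp(line, year):
    """Parse a leading 'Apr 30 19:38:08' syslog stamp; None if absent."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def extract_window(lines, year, marker="ERROR",
                   before=timedelta(hours=1), after=100):
    """Return the lines from `before` ahead of the first line containing
    `marker` through `after` lines past it (hypothetical helper)."""
    first = next((i for i, l in enumerate(lines) if marker in l), None)
    if first is None:
        return []
    t_err = parse_stamp(lines[first], year)
    start = first
    if t_err is not None:
        # Walk backwards until a line is older than the allowed window.
        # Lines without a parsable stamp are kept as continuation lines.
        while start > 0:
            t = parse_stamp(lines[start - 1], year)
            if t is not None and t_err - t > before:
                break
            start -= 1
    return lines[start:first + after + 1]
```

To use it, read the file heartbeat logs to on your system into a list
of lines, call extract_window(lines, year) with the year the log was
written, and write the result to a new file.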
--
Alan Robertson <[EMAIL PROTECTED]>
"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems