Re: [Linux-HA] heartbeat 2.0.8: causing nfs kernel oops

Alan Robertson Tue, 01 May 2007 19:54:04 -0700

Gerry Reno wrote:
> Alan Robertson wrote:
>> Gerry Reno wrote:
>>  
>>> Alan Robertson wrote:
>>>    
>>>> Gerry Reno wrote:
>>>>  
>>>>      
>>>>> I'm seeing some very strange things lately.  Whenever heartbeat is
>>>>> running there are these messages in the log:
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: write failure on bcast eth0.: No such device
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: glib: Unable to send bcast [-1] packet(len=214): No such device
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG: Dumping message with 10 fields
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[0] : [t=NS_ackmsg]
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[1] : [dest=grp-01-30-02]
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[2] : [ackseq=40cd2]
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[3] : [(1)destuuid=0x835cfc8(37 28)]
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[4] : [src=grp-01-30-01]
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[5] : [(1)srcuuid=0x8361848(36 27)]
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[6] : [hg=a1]
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[7] : [ts=46367de0]
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[8] : [ttl=4]
>>>>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[9] : [auth=1 dcf0feb393f46354b060306713eb72adc15eecf3]
>>>>>
>>>>> But yet, in most other respects eth0 seems to behave perfectly
>>>>> normal. I even went so far as to swap out the NIC card for eth0 and
>>>>> same result.  I can ping, ftp, ssh, etc. using eth0 with no problems.
>>>>> Where I do see a problem is with using NFS.  If I mount a remote NFS
>>>>> mount and try to push a compressed tar to the NFS mounted directory,
>>>>> after about 1GB of transfer I get a kernel oops in the NFS code.  Now,
>>>>> if I shutdown heartbeat and perform the same compressed tar it
>>>>> completes correctly without any oops.  So I'm baffled by this.  Is
>>>>> there any known problem that would cause the above log messages on an
>>>>> otherwise perfectly good network connection and also cause some type
>>>>> of interaction with NFS?  This problem seems to follow the primary
>>>>> node.  In other words the lockup occurs on whichever node has the
>>>>> primary IPaddr.  I can post the log, but it's hundreds of megabytes of
>>>>> this same message.
>>>>>             
>>>> Yes.
>>>>
>>>> Running DHCP on a network link.  Taking the link down manually.  Other
>>>> things that involve messing around with eth0.
>>>>         
>>> Alan,
>>>  Where do you think this problem lies?  Is it a kernel problem; a
>>> heartbeat problem?  Is this something that is/has been/can be addressed
>>> by the heartbeat team?  Is there a workaround/fix?  This problem greatly
>>> interferes with other network activities that need to take place on our
>>> servers such as backups and that is how I discovered it because none of
>>> the backups were completing overnight and the whole machine would be
>>> locked up due to the kernel oops.
>>>     
>>
>> It's either a misconfiguration (like running dhcp on a link without
>> dhcp), or someone is causing it by hand or with a script.
>>
>> It's probably not a kernel bug, and it's probably not a heartbeat bug.
>>
>> Easiest mistake to make: you're running dhcpcd or dhclient on eth0, and
>> you shouldn't be.
>>
>> I saw someone doing that with exactly the same symptoms as you're seeing.
>>
>> In Red Hat, you have to disable DHCP AND you have to disable the ifplugd
>> from managing that interface.  In the past just disabling DHCP would
>> have done it.  So, it's easy to be running DHCP without meaning to...
>>
>> What this does is periodically drop the link, then bring it back up.  It
>> does this over and over and over and over...
>>
>>
>>   
> I checked and no dhcp client or server running and we don't use
> ifplugd.  Isn't ifplugd mainly for laptops?  I filtered the logs and the
> only entries for eth0 are during the heartbeat error logging and from
> avahi-daemon during bootup.  That's it.  We don't have any scripts that
> we run to manage the network interfaces manually.  And I have not run
> any manual commands for doing this with the exception of when I swapped
> out the NIC card on the one machine.  But the problem was already
> happening for several weeks by then.  Also, I have not changed any of
> the heartbeat config files for a long time.  No one else has been on
> these machines recently.  I don't understand why heartbeat log says "no
> such device" for eth0 when it is there.  I look at the link light and
> you can see the heartbeats.  As I said, rather baffling.  Could this be
> some type of uuid issue on the heartbeat connections?  Like it's
> connected but doesn't know that it's connected?


It is printing "no such device" because that's the message that goes
with the errno (ENODEV) it's getting from the kernel.  This is a real
system call error, not something heartbeat is making up.
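
As an illustration only (this is not heartbeat's code), the link between an
errno and its message text can be seen with a short Python sketch.  It
assumes the errno involved is ENODEV, whose standard Linux message is "No
such device"; the helper name describe_errno is hypothetical.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for an errno value, using the C library's text."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    # ENODEV is the errno whose message reads "No such device" on Linux.
    print(describe_errno(errno.ENODEV))
```

The text comes from the C library's errno table, so a program that logs it
is only reporting what the failed system call returned.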

I think ifplugd gets used by default on RH.  The other place I saw this
was on servers too...
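
One rough way to double-check for the daemons mentioned in this thread is to
scan /proc for them.  This is a sketch under assumptions: it relies on the
Linux /proc/<pid>/cmdline layout, matches only the exact program names
dhclient, dhcpcd and ifplugd, and the function name find_suspects is
hypothetical.

```python
import os

# Program names discussed in this thread as likely to bounce the link.
SUSPECTS = {"dhclient", "dhcpcd", "ifplugd"}


def find_suspects(proc_root="/proc", names=SUSPECTS):
    """Return sorted (pid, name) pairs for running processes named in `names`."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                argv0 = fh.read().split(b"\0", 1)[0].decode(errors="replace")
        except OSError:
            continue  # process exited or is not readable
        name = os.path.basename(argv0)
        if name in names:
            found.append((int(entry), name))
    return sorted(found)


if __name__ == "__main__":
    for pid, name in find_suspects():
        print(pid, name)
```

An empty result only says that no process with one of those names is
running right now; it does not rule out something else touching eth0.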

Can you give me the complete logs starting from an hour before the first
problem, and 100 lines after the first error?  Send them to [EMAIL PROTECTED]
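
Since the full log is hundreds of megabytes, a small script can cut out just
that window.  The following is a sketch, assuming syslog-style lines that
begin with a timestamp like "Apr 30 19:38:08" (as in the quoted log) and
that the first line containing "ERROR:" marks the first problem.  The names
parse_stamp and excerpt, and the year argument (syslog stamps carry no
year), are hypothetical.

```python
import sys
from datetime import datetime, timedelta


def parse_stamp(line, year):
    """Parse a leading 'Apr 30 19:38:08' syslog stamp; return None if absent."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def excerpt(lines, marker="ERROR:", before=timedelta(hours=1), after=100,
            year=2007):
    """Return lines from `before` ahead of the first marker line
    through `after` lines past it."""
    first = next((i for i, text in enumerate(lines) if marker in text), None)
    if first is None:
        return []
    start = first
    t_err = parse_stamp(lines[first], year)
    if t_err is not None:
        cutoff = t_err - before
        # Walk backwards until a line is older than the cutoff.
        while start > 0:
            t = parse_stamp(lines[start - 1], year)
            if t is not None and t < cutoff:
                break
            start -= 1
    return lines[start:first + after + 1]


if __name__ == "__main__":
    # Usage: python excerpt.py LOGFILE > excerpt.txt
    with open(sys.argv[1], errors="replace") as fh:
        sys.stdout.writelines(excerpt(fh.readlines()))
```

Reading the whole file into memory keeps the sketch short; for a log of
that size a streaming version would be kinder to the machine.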


-- 
    Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
