Alan Robertson wrote:
Gerry Reno wrote:
I'm seeing some very strange things lately. Whenever heartbeat is
running there are these messages in the log:
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: write failure on bcast eth0.: No such device
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: glib: Unable to send bcast [-1] packet(len=214): No such device
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG: Dumping message with 10 fields
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[0] : [t=NS_ackmsg]
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[1] : [dest=grp-01-30-02]
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[2] : [ackseq=40cd2]
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[3] : [(1)destuuid=0x835cfc8(37 28)]
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[4] : [src=grp-01-30-01]
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[5] : [(1)srcuuid=0x8361848(36 27)]
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[6] : [hg=a1]
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[7] : [ts=46367de0]
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[8] : [ttl=4]
Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[9] : [auth=1 dcf0feb393f46354b060306713eb72adc15eecf3]
Yet in most other respects eth0 seems to behave perfectly normally.
I even went so far as to swap out the NIC for eth0, with the same
result. I can ping, ftp, ssh, etc. over eth0 with no problems. Where
I do see a problem is with NFS. If I mount a remote NFS share and
try to push a compressed tar to the NFS-mounted directory, I get a
kernel oops in the NFS code after about 1GB of transfer. If I shut
down heartbeat and run the same compressed tar, it completes correctly
without any oops. So I'm baffled. Is there any known problem
that would cause the above log messages on an otherwise perfectly good
network connection and also cause some kind of interaction with NFS?
This problem seems to follow the primary node: the lockup occurs on
whichever node has the primary IPaddr. I can post the log, but it's
hundreds of megabytes of this same message.
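For a log that large, collapsing the repeats into counts may be easier to share than the raw file. The following is only a minimal sketch: it assumes each entry sits on a single line in the real log and carries the same "date host heartbeat: [pid]:" prefix as the excerpt above. The names PREFIX and summarize are hypothetical and are not part of heartbeat.

```python
import re
from collections import Counter

# Assumed prefix, modelled on the excerpt above:
# "Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: "
PREFIX = re.compile(
    r"^\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \S+ heartbeat: \[\d+\]: "
)

def summarize(lines):
    """Count distinct heartbeat messages, ignoring timestamp, host, and PID.

    Lines that do not carry the assumed prefix are skipped.
    """
    counts = Counter()
    for line in lines:
        match = PREFIX.match(line)
        if match:
            counts[line[match.end():].strip()] += 1
    return counts
```

Passing an open file object for the log to summarize() and printing the result of .most_common(20) should show which messages dominate.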
Yes.
Running DHCP on a network link. Taking the link down manually. Other
things that involve messing around with eth0.
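One way to check whether something is changing eth0 underneath heartbeat is to poll the interface while the errors are being logged. This is a rough sketch under stated assumptions, not a diagnosis: it assumes a Linux host with sysfs mounted at /sys, and the helper names interface_exists and operstate are hypothetical.

```python
import socket

def interface_exists(name):
    """Return True if the kernel currently has an interface with this name."""
    try:
        socket.if_nametoindex(name)
        return True
    except OSError:
        return False

def operstate(name):
    """Return the Linux link state from sysfs (e.g. "up" or "down").

    Returns None if the file cannot be read, for example because the
    interface does not exist or sysfs is not available.
    """
    try:
        with open(f"/sys/class/net/{name}/operstate") as f:
            return f.read().strip()
    except OSError:
        return None
```

Calling interface_exists("eth0") and operstate("eth0") once a second and logging any change, alongside the heartbeat log, could help correlate the "No such device" errors with a DHCP client or script touching the link.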
Alan,
Where do you think this problem lies? Is it a kernel problem or a
heartbeat problem? Is this something that is being, has been, or can
be addressed by the heartbeat team? Is there a workaround or fix? This
problem greatly interferes with other network activity that needs to
take place on our servers, such as backups. That is how I discovered
it: none of the backups were completing overnight, and the whole
machine would be locked up due to the kernel oops.
thx,
-Gerry
_______________________________________________
Linux-HA mailing list
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems