On Fri, Feb 29, 2008 at 5:19 PM, rupert <[EMAIL PROTECTED]> wrote:
> I did some googling about the ucast errors, but not much info came around.
>
>  What can be the cause of this? I rebooted and/or restarted the
>  machines, but on both machines the log always fills with the following:
>
>
>  Feb 29 16:17:15 xen-B1 heartbeat: [2974]: ERROR: write failure on ucast eth0.: No such device
>  Feb 29 16:17:17 xen-B1 heartbeat: [2974]: ERROR: glib: Unable to send [-1] ucast packet: No such device
>  Feb 29 16:17:17 xen-B1 heartbeat: [2974]: ERROR: write failure on ucast eth0.: No such device
>  Feb 29 16:17:19 xen-B1 heartbeat: [2974]: ERROR: glib: Unable to send [-1] ucast packet: No such device
>  Feb 29 16:17:19 xen-B1 heartbeat: [2974]: ERROR: write failure on ucast eth0.: No such device
>  --
>  Feb 29 16:18:39 xen-A1 heartbeat: [2936]: ERROR: glib: Unable to send [-1] ucast packet: No such device
>  Feb 29 16:18:39 xen-A1 heartbeat: [2936]: ERROR: write failure on ucast eth0.: No such device
>
>  --are these related?
>  [2008-02-29 09:31:26 xend 3575] DEBUG (DevController:149) Waiting for 2050.
>  [2008-02-29 09:31:26 xend 3575] DEBUG (DevController:476) hotplugStatusCallback /local/domain/0/backend/vbd/2/2050/hotplug-status.
>  [2008-02-29 09:31:26 xend 3575] DEBUG (DevController:490) hotplugStatusCallback 1.
>  [2008-02-29 09:31:26 xend 3575] DEBUG (DevController:143) Waiting for devices irq.
>  [2008-02-29 09:31:26 xend 3575] DEBUG (DevController:143) Waiting for devices vkbd.
>  [2008-02-29 09:31:26 xend 3575] DEBUG (DevController:143) Waiting for devices vfb.
>  [2008-02-29 09:31:26 xend 3575] DEBUG (DevController:143) Waiting for devices pci.
>  [2008-02-29 09:31:26 xend 3575] DEBUG (DevController:143) Waiting for devices ioports.
>  [2008-02-29 09:31:26 xend 3575] DEBUG (DevController:143) Waiting for devices tap.
>  [2008-02-29 09:31:26 xend 3575] DEBUG (DevController:143) Waiting for devices vtpm.
>
>
>  On both machines there are a couple of network services that run fine
>  through eth0, so the device is up. Can this be because Xen created
>  some iptables rules?
>
>
iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     udp  --  anywhere             anywhere            udp dpt:domain
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:domain
ACCEPT     udp  --  anywhere             anywhere            udp dpt:bootps
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:bootps

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all  --  anywhere             192.168.122.0/24    state RELATED,ESTABLISHED
ACCEPT     all  --  192.168.122.0/24     anywhere
ACCEPT     all  --  anywhere             anywhere
REJECT     all  --  anywhere             anywhere            reject-with icmp-port-unreachable
REJECT     all  --  anywhere             anywhere            reject-with icmp-port-unreachable
ACCEPT     all  --  mx2.mailcluster.solvians.com  anywhere    PHYSDEV match --physdev-in vif2.1
ACCEPT     udp  --  anywhere             anywhere            PHYSDEV match --physdev-in vif2.1 udp spt:bootpc dpt:bootps
ACCEPT     all  --  mx2.fra1.mailcluster  anywhere            PHYSDEV match --physdev-in vif2.0
ACCEPT     udp  --  anywhere             anywhere            PHYSDEV match --physdev-in vif2.0 udp spt:bootpc dpt:bootps
ACCEPT     all  --  anywhere             anywhere            PHYSDEV match --physdev-in vif3.0

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination


Why does Xen create something for the 192.168.122.0/24 net? I have never used it here!
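[One possibility worth ruling out (an assumption, not confirmed in this thread): 192.168.122.0/24 is the subnet of libvirt's default NAT network, and libvirt installs FORWARD rules very much like the ones above; they may not come from Xen at all. A sketch to check:]

```shell
# Assumption: libvirt (not Xen) installed the 192.168.122.0/24 rules.
# libvirt's default NAT network uses exactly that subnet.
check_net_owner() {
    if command -v virsh >/dev/null 2>&1; then
        # list libvirt networks; a running "default" network is the likely culprit
        virsh net-list --all
    else
        echo "no libvirt on this host; the rules come from somewhere else"
    fi
}
check_net_owner
```

[If it does turn out to be libvirt's "default" network, `virsh net-destroy default` stops it and removes those rules; verify first rather than deleting rules blindly.]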

>  thx for your help
>
>  Heiko
>
>
>
>  On Fri, Feb 29, 2008 at 9:33 AM, rupert <[EMAIL PROTECTED]> wrote:
>  > it works much better now; both systems did a reboot (don't know why),
>  >  and now both VMs are running on the first server. So how can I get the
>  >  second server to take back the 2nd VM?
>  >
>  >
>  >
>  >  On Thu, Feb 28, 2008 at 1:19 PM, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
>  >  > Hi,
>  >  >
>  >  >
>  >  >
>  >  >  On Thu, Feb 28, 2008 at 12:11:31PM +0100, rupert wrote:
>  >  >  > Mmh, I just restarted the 2nd server to check whether heartbeat moves
>  >  >  > the VM to server1.
>  >  >  > I couldn't find any info about that in the logfiles on the first
>  >  >  > server, something like taking over backend-B1,
>  >  >  > and one VM did not start. But after the reboot of server2, after
>  >  >  > some time it correctly starts backend-B1.
>  >  >  >
>  >  >  > heartbeat[4959]: 2008/02/28_10:36:19 WARN: Logging daemon is disabled --enabling logging daemon is recommended
>  >  >  > heartbeat[4959]: 2008/02/28_10:36:19 info: **************************
>  >  >  > heartbeat[4959]: 2008/02/28_10:36:19 info: Configuration validated. Starting heartbeat 2.1.2
>  >  >  > heartbeat[4960]: 2008/02/28_10:36:19 info: heartbeat: version 2.1.2
>  >  >  > heartbeat[4960]: 2008/02/28_10:36:19 info: Heartbeat generation: 1202824451
>  >  >  > heartbeat[4960]: 2008/02/28_10:36:19 info: G_main_add_TriggerHandler: Added signal manual handler
>  >  >  > heartbeat[4960]: 2008/02/28_10:36:19 info: G_main_add_TriggerHandler: Added signal manual handler
>  >  >  > heartbeat[4960]: 2008/02/28_10:36:19 info: Removing /var/run/heartbeat/rsctmp failed, recreating.
>  >  >  > heartbeat[4960]: 2008/02/28_10:36:19 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
>  >  >  > heartbeat[4960]: 2008/02/28_10:36:19 info: glib: ucast: bound send socket to device: eth0
>  >  >  > heartbeat[4960]: 2008/02/28_10:36:19 info: glib: ucast: bound receive socket to device: eth0
>  >  >  > heartbeat[4960]: 2008/02/28_10:36:19 info: glib: ucast: started on port 694 interface eth0 to 172.20.2.1
>  >  >  > heartbeat[4960]: 2008/02/28_10:36:19 info: G_main_add_SignalHandler: Added signal handler for signal 17
>  >  >  > heartbeat[4960]: 2008/02/28_10:36:19 info: Local status now set to: 'up'
>  >  >  > heartbeat[4960]: 2008/02/28_10:38:20 WARN: node xen-a1.fra1.mailcluster: is dead
>  >  >  > heartbeat[4960]: 2008/02/28_10:38:20 info: Comm_now_up(): updating status to active
>  >  >  > heartbeat[4960]: 2008/02/28_10:38:20 info: Local status now set to: 'active'
>  >  >  > heartbeat[4960]: 2008/02/28_10:38:20 WARN: No STONITH device configured.
>  >  >  > heartbeat[4960]: 2008/02/28_10:38:20 WARN: Shared disks are not protected.
>  >  >  > heartbeat[4960]: 2008/02/28_10:38:20 info: Resources being acquired from xen-a1.fra1.mailcluster.
>  >  >  > harc[4989]:     2008/02/28_10:38:20 info: Running /etc/ha.d/rc.d/status status
>  >  >  > heartbeat[4990]: 2008/02/28_10:38:20 info: Local Resource acquisition completed.
>  >  >  > mach_down[5019]:        2008/02/28_10:38:20 info: Taking over resource group drbddisk::drbd_backend
>  >  >  > ResourceManager[5073]:  2008/02/28_10:38:20 info: Acquiring resource group: xen-a1.fra1.mailcluster drbddisk::drbd_backend xen::backend-A1
>  >  >  > ResourceManager[5073]:  2008/02/28_10:38:20 info: Running /etc/ha.d/resource.d/drbddisk drbd_backend start
>  >  >  > heartbeat[4960]: 2008/02/28_10:38:30 info: Local Resource acquisition completed. (none)
>  >  >  > heartbeat[4960]: 2008/02/28_10:38:30 info: local resource transition completed.
>  >  >  > ResourceManager[5073]:  2008/02/28_10:38:32 ERROR: Return code 1 from /etc/ha.d/resource.d/drbddisk
>  >  >  > ResourceManager[5073]:  2008/02/28_10:38:32 CRIT: Giving up resources due to failure of drbddisk::drbd_backend
>  >  >
>  >  >  You have to find out why drbddisk is failing.
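[A hedged sketch of where to look first: the heartbeat drbddisk script essentially runs "drbdadm primary <resource>" on start and returns 1 when drbd refuses, so the drbd state itself is the thing to inspect before blaming heartbeat.]

```shell
# Sketch: gather drbd state before digging into heartbeat. The drbddisk
# resource script fails with return code 1 when "drbdadm primary" is refused,
# e.g. because the peer is Primary or the local data is not UpToDate.
drbd_status() {
    if [ -r /proc/drbd ]; then
        cat /proc/drbd        # connection state, roles, sync status per device
    else
        echo "drbd module not loaded on this host"
    fi
}
drbd_status
```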
>  >  >
>  >  >
>  >  >
>  >  >  > ResourceManager[5073]:  2008/02/28_10:38:32 info: Releasing resource group: xen-a1.fra1.mailcluster drbddisk::drbd_backend xen::backend-A1
>  >  >  > ResourceManager[5073]:  2008/02/28_10:38:32 info: Running /etc/ha.d/resource.d/xen backend-A1 stop
>  >  >  > ResourceManager[5073]:  2008/02/28_10:38:33 info: Running /etc/ha.d/resource.d/drbddisk drbd_backend stop
>  >  >  > mach_down[5019]:        2008/02/28_10:38:33 info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
>  >  >  > mach_down[5019]:        2008/02/28_10:38:33 info: mach_down takeover complete for node xen-a1.fra1.mailcluster.
>  >  >  > heartbeat[4960]: 2008/02/28_10:38:33 info: mach_down takeover complete.
>  >  >  > heartbeat[4960]: 2008/02/28_10:38:33 info: Initial resource acquisition complete (mach_down)
>  >  >  > harc[5232]:     2008/02/28_10:38:33 info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
>  >  >  > ip-request-resp[5232]:  2008/02/28_10:38:33 received ip-request-resp drbddisk::drbd_backend_2 OK yes
>  >  >  > ResourceManager[5253]:  2008/02/28_10:38:33 info: Acquiring resource group: xen-b1.fra1.mailcluster drbddisk::drbd_backend_2 xen::backend-B1
>  >  >  > ResourceManager[5253]:  2008/02/28_10:38:33 info: Running /etc/ha.d/resource.d/drbddisk drbd_backend_2 start
>  >  >  > ResourceManager[5253]:  2008/02/28_10:38:33 info: Running /etc/ha.d/resource.d/xen backend-B1 start
>  >  >  > hb_standby[5588]:       2008/02/28_10:39:03 Going standby [foreign].
>  >  >  > heartbeat[4960]: 2008/02/28_10:39:03 info: xen-b1.fra1.mailcluster wants to go standby [foreign]
>  >  >  > heartbeat[4960]: 2008/02/28_10:39:13 WARN: No reply to standby request.  Standby request cancelled
>  >  >  >
>  >  >  > But after a reboot some minutes before, I had the logfile flooding with
>  >  >  > this message:
>  >  >  >
>  >  >  > heartbeat[2966]: 2008/02/28_10:15:34 ERROR: glib: Unable to send [-1]
>  >  >  > ucast packet: No such device
>  >  >  > heartbeat[2966]: 2008/02/28_10:15:34 ERROR: write failure on ucast
>  >  >  > eth0.: No such device
>  >  >  > heartbeat[2966]: 2008/02/28_10:15:34 ERROR: glib: Unable to send [-1]
>  >  >  > ucast packet: No such device
>  >  >  > heartbeat[2966]: 2008/02/28_10:15:34 ERROR: write failure on ucast
>  >  >  > eth0.: No such device
>  >  >
>  >  >  Well, looks like eth0 doesn't exist.
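[The "No such device" errno points the same way. A quick sketch to confirm whether the interface named in ha.cf's ucast directive actually exists (assumes a Linux host and the eth0 name from this thread; one hypothesis worth checking is Xen's classic network-bridge script, which renames the physical eth0 to peth0 while setting up the bridge, so eth0 can briefly vanish):]

```shell
# Sketch: verify the interface named in ha.cf's "ucast eth0 ..." directive
# exists. heartbeat reports ENODEV ("No such device") when it doesn't.
iface_exists() {
    [ -e "/sys/class/net/$1" ]
}

for ifname in eth0 lo; do
    if iface_exists "$ifname"; then
        echo "$ifname: present"
    else
        echo "$ifname: missing -- matches the ucast write failures"
    fi
done
```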
>  >  >
>  >  >
>  >  >  > I stopped iptables, but it didn't go away; it only stopped after a new
>  >  >  > reboot. What's the reason for this error?
>  >  >  >
>  >  >  > In ha.cf, should both nodes have a "ucast eth0 172.20.2.1" entry?
>  >  >
>  >  >  No. It should be ucast eth0 node2-ipaddress on node1 and vice
>  >  >  versa on node2. To simplify management, you can put both ucast
>  >  >  directives on both nodes. I believe that this is well documented
>  >  >  in ha.cf.
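[Using the two addresses that appear earlier in this thread, 172.20.1.1 and 172.20.2.1 (which node owns which address is an assumption), the crossed setup Dejan describes would look like this sketch:]

```
# /etc/ha.d/ha.cf on node1 (assumed local address 172.20.1.1)
# ucast names the PEER's address, not the local one:
ucast eth0 172.20.2.1

# /etc/ha.d/ha.cf on node2 (assumed local address 172.20.2.1)
ucast eth0 172.20.1.1

# Or, to keep the file identical on both nodes, list both directives;
# per Dejan's note, both ucast lines can go on both nodes:
ucast eth0 172.20.1.1
ucast eth0 172.20.2.1
```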
>  >  >
>  >  >  Thanks,
>  >  >
>  >  >  Dejan
>  >  >
>  >  >
>  >  >
>  >  >  > thx
>  >  >  >
>  >  >  > > On Thu, Feb 28, 2008 at 11:18 AM, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
>  >  >  > > Hi,
>  >  >  > >
>  >  >  > >
>  >  >  > >  On Thu, Feb 28, 2008 at 08:36:33AM +0100, rupert wrote:
>  >  >  > >  > has no one some ideas to this matter?
>  >  >  > >
>  >  >  > >  This is a drbd related issue. You should be better off in a drbd
>  >  >  > >  forum.
>  >  >  > >
>  >  >  > >  Thanks,
>  >  >  > >
>  >  >  > >  Dejan
>  >  >  > >
>  >  >  > >
>  >  >  > >
>  >  >  > >  > thx
>  >  >  > >  >
>  >  >  > >  > On Tue, Feb 26, 2008 at 12:10 PM, rupert <[EMAIL PROTECTED]> wrote:
>  >  >  > >  > > Hello,
>  >  >  > >  > >
>  >  >  > >  > >  I set up a cluster with 2 drbd devices and 2 VMs on each server.
>  >  >  > >  > >  When one server goes down, the other should take over the part of the down one.
>  >  >  > >  > >  The drbd goes like this:
>  >  >  > >  > >  a -> a
>  >  >  > >  > >  b <- b
>  >  >  > >  > >
>  >  >  > >  > >  The other machines are not drbd devices, just some loopback VMs which
>  >  >  > >  > >  carry no data; can they be in the config for heartbeat?
>  >  >  > >  > >
>  >  >  > >  > >  In my haresources I have the following entries on both servers:
>  >  >  > >  > >
>  >  >  > >  > >  xen-A1.fra1.mailcluster drbddisk::drbd_backend xen::backend-A1 xen::MX1-A1
>  >  >  > >  > >  xen-B1.fra1.mailcluster drbddisk::drbd_backend_2 xen::backend-B1 xen::MX2-B1
>  >  >  > >  > >
>  >  >  > >  > >  In ha.cf on the first server I set ucast to
>  >  >  > >  > >  ucast eth0 172.20.1.1
>  >  >  > >  > >  and
>  >  >  > >  > >  ucast eth0 172.20.2.1
>  >  >  > >  > >  on the second server.
>  >  >  > >  > >
>  >  >  > >  > >  When I restart the ha daemon it powers down all the VMs and makes
>  >  >  > >  > >  all the drbd devices on the first server primary, but they should be
>  >  >  > >  > >  on the first server
>  >  >  > >  > >
>  >  >  > >  > >  GIT-hash: b3fe2bdfd3b9f7c2f923186883eb9e2a0d3a5b1b build by [EMAIL PROTECTED], 2008-02-13 19:17:43
>  >  >  > >  > >   0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
>  >  >  > >  > >      ns:135995280 nr:0 dw:779680 dr:135790386 al:224 bm:8602 lo:0 pe:0 ua:0 ap:0
>  >  >  > >  > >          resync: used:0/31 hits:8442668 misses:8308 starving:0 dirty:0 changed:8308
>  >  >  > >  > >          act_log: used:0/257 hits:136296 misses:224 starving:0 dirty:0 changed:224
>  >  >  > >  > >   1: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
>  >  >  > >  > >      ns:0 nr:663968 dw:663968 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
>  >  >  > >  > >          resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
>  >  >  > >  > >
>  >  >  > >  > >
>  >  >  > >  > >  On my first start heartbeat told me that the drbddisk is active and it
>  >  >  > >  > >  shouldn't be, but it's the one that is the main drbddisk on each server;
>  >  >  > >  > >  the other is the backup for failovers.
>  >  >  > >  > >
>  >  >  > >  > >  Resource drbddisk::drbd_backend_2 is active, and should not be!
>  >  >  > >  > >  2008/02/26_07:42:58 CRITICAL: Non-idle resources can affect data integrity!
>  >  >  > >  > >  2008/02/26_07:42:58 info: If you don't know what this means, then get help!
>  >  >  > >  > >  2008/02/26_07:42:58 info: Read the docs and/or source to /usr/share/heartbeat/ResourceManager for more details.
>  >  >  > >  > >  CRITICAL: Resource drbddisk::drbd_backend_2 is active, and should not be!
>  >  >  > >  > >  CRITICAL: Non-idle resources can affect data integrity!
>  >  >  > >  > >  info: If you don't know what this means, then get help!
>  >  >  > >  > >  info: Read the docs and/or the source to /usr/share/heartbeat/ResourceManager for more details.
>  >  >  > >  > >  2008/02/26_07:42:58 CRITICAL: Non-idle resources will affect resource takeback!
>  >  >  > >  > >  2008/02/26_07:42:58 CRITICAL: Non-idle resources may affect data integrity!
>  >  >  > >  > >
>  >  >  > >  > >
>  >  >  > >  > >  thx for your help
>  >  >  > >  > >
>  >  >  > >  > _______________________________________________
>  >  >  > >  > Linux-HA mailing list
>  >  >  > >  > [email protected]
>  >  >  > >  > http://lists.linux-ha.org/mailman/listinfo/linux-ha
>  >  >  > >  > See also: http://linux-ha.org/ReportingProblems
>  >  >  > >
>  >  >  > >  --
>  >  >  > >  Dejan
>  >  >  > >
>  >  >
>  >
>
