Hi,

On Thu, Feb 28, 2008 at 12:11:31PM +0100, rupert wrote:
> Hmm, I just restarted the 2nd server to check whether heartbeat moves
> the VM to server1.
> I couldn't find any info about that in the logfiles on the first
> server (something like "taking over backend-B1"),
> and one VM did not start. But some time after the reboot of server2
> it correctly started backend-B1.
> 
> heartbeat[4959]: 2008/02/28_10:36:19 WARN: Logging daemon is disabled
> --enabling logging daemon is recommended
> heartbeat[4959]: 2008/02/28_10:36:19 info: **************************
> heartbeat[4959]: 2008/02/28_10:36:19 info: Configuration validated.
> Starting heartbeat 2.1.2
> heartbeat[4960]: 2008/02/28_10:36:19 info: heartbeat: version 2.1.2
> heartbeat[4960]: 2008/02/28_10:36:19 info: Heartbeat generation: 1202824451
> heartbeat[4960]: 2008/02/28_10:36:19 info: G_main_add_TriggerHandler:
> Added signal manual handler
> heartbeat[4960]: 2008/02/28_10:36:19 info: G_main_add_TriggerHandler:
> Added signal manual handler
> heartbeat[4960]: 2008/02/28_10:36:19 info: Removing
> /var/run/heartbeat/rsctmp failed, recreating.
> heartbeat[4960]: 2008/02/28_10:36:19 info: glib: ucast: write socket
> priority set to IPTOS_LOWDELAY on eth0
> heartbeat[4960]: 2008/02/28_10:36:19 info: glib: ucast: bound send
> socket to device: eth0
> heartbeat[4960]: 2008/02/28_10:36:19 info: glib: ucast: bound receive
> socket to device: eth0
> heartbeat[4960]: 2008/02/28_10:36:19 info: glib: ucast: started on
> port 694 interface eth0 to 172.20.2.1
> heartbeat[4960]: 2008/02/28_10:36:19 info: G_main_add_SignalHandler:
> Added signal handler for signal 17
> heartbeat[4960]: 2008/02/28_10:36:19 info: Local status now set to: 'up'
> heartbeat[4960]: 2008/02/28_10:38:20 WARN: node xen-a1.fra1.mailcluster: is 
> dead
> heartbeat[4960]: 2008/02/28_10:38:20 info: Comm_now_up(): updating
> status to active
> heartbeat[4960]: 2008/02/28_10:38:20 info: Local status now set to: 'active'
> heartbeat[4960]: 2008/02/28_10:38:20 WARN: No STONITH device configured.
> heartbeat[4960]: 2008/02/28_10:38:20 WARN: Shared disks are not protected.
> heartbeat[4960]: 2008/02/28_10:38:20 info: Resources being acquired
> from xen-a1.fra1.mailcluster.
> harc[4989]:     2008/02/28_10:38:20 info: Running /etc/ha.d/rc.d/status status
> heartbeat[4990]: 2008/02/28_10:38:20 info: Local Resource acquisition 
> completed.
> mach_down[5019]:        2008/02/28_10:38:20 info: Taking over resource
> group drbddisk::drbd_backend
> ResourceManager[5073]:  2008/02/28_10:38:20 info: Acquiring resource
> group: xen-a1.fra1.mailcluster drbddisk::drbd_backend xen::backend-A1
> ResourceManager[5073]:  2008/02/28_10:38:20 info: Running
> /etc/ha.d/resource.d/drbddisk drbd_backend start
> heartbeat[4960]: 2008/02/28_10:38:30 info: Local Resource acquisition
> completed. (none)
> heartbeat[4960]: 2008/02/28_10:38:30 info: local resource transition 
> completed.
> ResourceManager[5073]:  2008/02/28_10:38:32 ERROR: Return code 1 from
> /etc/ha.d/resource.d/drbddisk
> ResourceManager[5073]:  2008/02/28_10:38:32 CRIT: Giving up resources
> due to failure of drbddisk::drbd_backend

You have to find out why drbddisk is failing.
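For instance, you could run the resource script by hand. This is just a sketch: the paths are the stock heartbeat locations, and "drbd_backend" is the resource name from your logs; adjust both to your setup.

```shell
# Run the drbddisk resource script manually to see why it returned 1.
if [ -x /etc/ha.d/resource.d/drbddisk ]; then
    /etc/ha.d/resource.d/drbddisk drbd_backend status
else
    echo "drbddisk script not found on this host"
fi
# The kernel-side DRBD state usually explains a start failure:
if [ -r /proc/drbd ]; then
    cat /proc/drbd
else
    echo "/proc/drbd not readable (drbd module not loaded?)"
fi
```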

> ResourceManager[5073]:  2008/02/28_10:38:32 info: Releasing resource
> group: xen-a1.fra1.mailcluster drbddisk::drbd_backend xen::backend-A1
> ResourceManager[5073]:  2008/02/28_10:38:32 info: Running
> /etc/ha.d/resource.d/xen backend-A1 stop
> ResourceManager[5073]:  2008/02/28_10:38:33 info: Running
> /etc/ha.d/resource.d/drbddisk drbd_backend stop
> mach_down[5019]:        2008/02/28_10:38:33 info:
> /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
> mach_down[5019]:        2008/02/28_10:38:33 info: mach_down takeover
> complete for node xen-a1.fra1.mailcluster.
> heartbeat[4960]: 2008/02/28_10:38:33 info: mach_down takeover complete.
> heartbeat[4960]: 2008/02/28_10:38:33 info: Initial resource
> acquisition complete (mach_down)
> harc[5232]:     2008/02/28_10:38:33 info: Running
> /etc/ha.d/rc.d/ip-request-resp ip-request-resp
> ip-request-resp[5232]:  2008/02/28_10:38:33 received ip-request-resp
> drbddisk::drbd_backend_2 OK yes
> ResourceManager[5253]:  2008/02/28_10:38:33 info: Acquiring resource
> group: xen-b1.fra1.mailcluster drbddisk::drbd_backend_2 xen::backend-B1
> ResourceManager[5253]:  2008/02/28_10:38:33 info: Running
> /etc/ha.d/resource.d/drbddisk drbd_backend_2 start
> ResourceManager[5253]:  2008/02/28_10:38:33 info: Running
> /etc/ha.d/resource.d/xen backend-B1 start
> hb_standby[5588]:       2008/02/28_10:39:03 Going standby [foreign].
> heartbeat[4960]: 2008/02/28_10:39:03 info: xen-b1.fra1.mailcluster
> wants to go standby [foreign]
> heartbeat[4960]: 2008/02/28_10:39:13 WARN: No reply to standby
> request.  Standby request cancelled
> 
> But after a reboot some minutes earlier, the logfile was flooded with
> this message:
> 
> heartbeat[2966]: 2008/02/28_10:15:34 ERROR: glib: Unable to send [-1]
> ucast packet: No such device
> heartbeat[2966]: 2008/02/28_10:15:34 ERROR: write failure on ucast
> eth0.: No such device
> heartbeat[2966]: 2008/02/28_10:15:34 ERROR: glib: Unable to send [-1]
> ucast packet: No such device
> heartbeat[2966]: 2008/02/28_10:15:34 ERROR: write failure on ucast
> eth0.: No such device

Well, looks like eth0 doesn't exist.

> I stopped iptables, but the error didn't go away; it only stopped
> after another reboot. What is the reason for this error?
> 
> Should both nodes have a "ucast eth0 172.20.2.1" entry in ha.cf?

No. It should be ucast eth0 node2-ipaddress on node1 and vice
versa on node2. To simplify management, you can put both ucast
directives on both nodes. I believe that this is well documented
in ha.cf.
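For example, a minimal sketch, assuming (as your earlier mail suggests) that node1's own address is 172.20.1.1 and node2's is 172.20.2.1:

```
# /etc/ha.d/ha.cf on node1 -- point ucast at the *other* node:
ucast eth0 172.20.2.1

# /etc/ha.d/ha.cf on node2:
ucast eth0 172.20.1.1

# Or keep both files identical by listing both directives on each node;
# heartbeat skips the entry that matches a local address.
```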

Thanks,

Dejan

> thx
> 
> On Thu, Feb 28, 2008 at 11:18 AM, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> >
> >  On Thu, Feb 28, 2008 at 08:36:33AM +0100, rupert wrote:
> >  > Does no one have any ideas on this matter?
> >
> >  This is a drbd-related issue. You would be better off in a drbd
> >  forum.
> >
> >  Thanks,
> >
> >  Dejan
> >
> >
> >
> >  > thx
> >  >
> >  > On Tue, Feb 26, 2008 at 12:10 PM, rupert <[EMAIL PROTECTED]> wrote:
> >  > > Hello,
> >  > >
> >  > >  I set up a cluster with 2 drbd devices and 2 VMs on each server.
> >  > >  When one server goes down, the other should take over the part of
> >  > >  the down one.
> >  > >  The drbd goes like this:
> >  > >  a -> a
> >  > >  b <- b
> >  > >
> >  > >  The other machines are not on drbd devices, just some loopback VMs
> >  > >  which carry no data; can they be in the heartbeat config?
> >  > >
> >  > >  in my haresources I have the following entries on both servers
> >  > >
> >  > >  xen-A1.fra1.mailcluster drbddisk::drbd_backend xen::backend-A1 xen::MX1-A1
> >  > >  xen-B1.fra1.mailcluster drbddisk::drbd_backend_2 xen::backend-B1 xen::MX2-B1
> >  > >
> >  > >  in ha.cf on the first server I set ucast to
> >  > >  ucast eth0 172.20.1.1
> >  > >  and
> >  > >  ucast eth0 172.20.2.1
> >  > >  on the second server
> >  > >
> >  > >  When I restart the ha daemon it powers down all the VMs and makes
> >  > >  all the drbd devices primary on the first server, but not all of
> >  > >  them should be on the first server.
> >  > >  GIT-hash: b3fe2bdfd3b9f7c2f923186883eb9e2a0d3a5b1b build by
> >  > >  [EMAIL PROTECTED], 2008-02-13 19:17:43
> >  > >   0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
> >  > >     ns:135995280 nr:0 dw:779680 dr:135790386 al:224 bm:8602 lo:0 pe:0 ua:0 ap:0
> >  > >         resync: used:0/31 hits:8442668 misses:8308 starving:0 dirty:0 changed:8308
> >  > >         act_log: used:0/257 hits:136296 misses:224 starving:0 dirty:0 changed:224
> >  > >   1: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
> >  > >     ns:0 nr:663968 dw:663968 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
> >  > >         resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
> >  > >
> >  > >
> >  > >  On my first start heartbeat told me that the drbddisk is active
> >  > >  and it shouldn't be, but it's the main drbd disk on each server;
> >  > >  the other one is the backup for failover.
> >  > >
> >  > >  Resource drbddisk::drbd_backend_2 is active, and should not be!
> >  > >  2008/02/26_07:42:58 CRITICAL: Non-idle resources can affect data integrity!
> >  > >  2008/02/26_07:42:58 info: If you don't know what this means, then get help!
> >  > >  2008/02/26_07:42:58 info: Read the docs and/or source to
> >  > >  /usr/share/heartbeat/ResourceManager for more details.
> >  > >  CRITICAL: Resource drbddisk::drbd_backend_2 is active, and should not be!
> >  > >  CRITICAL: Non-idle resources can affect data integrity!
> >  > >  info: If you don't know what this means, then get help!
> >  > >  info: Read the docs and/or the source to
> >  > >  /usr/share/heartbeat/ResourceManager for more details.
> >  > >  2008/02/26_07:42:58 CRITICAL: Non-idle resources will affect resource takeback!
> >  > >  2008/02/26_07:42:58 CRITICAL: Non-idle resources may affect data integrity!
> >  > >
> >  > >
> >  > >  thx for your help
> >  > >
> >  > _______________________________________________
> >  > Linux-HA mailing list
> >  > [email protected]
> >  > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> >  > See also: http://linux-ha.org/ReportingProblems
> >
> >  --
> >  Dejan
> >
