[Linux-HA] HA fails when stopping master director

Alejandro Sánchez Meroño Tue, 03 Mar 2009 07:45:59 -0800

Hello everybody, 

This is Alejandro from Valencia, Spain. I'm glad to join this mailing list, and 
though at present I'm a complete rookie at HA (and a "sophomore" in Linux), I'd 
like to think that some day I might help others with this subject.


Unfortunately, at present it's me who needs a helping hand from you...

OK, I'll try to put all the data in order: 

     A) Abstract of the issue: I have configured load balancing and high 
availability with two web servers and two directors, using ldirectord and 
heartbeat. Load balancing works fine, but when testing the HA, if I stop 
heartbeat on the main director, the system switches to the backup director 
but... only for a few seconds!! Then everything is dead. The ha-debug log on 
the main director seems happy, while the ha-debug log on the backup director 
just repeats the same error hundreds of times (quoted near the end of this 
message).

     B) What I am actually trying to do:  
My main objective is rather simple: Obtain load balancing and high availability 
from two mirror web servers -Apache. At present we have just one single web 
server with rather heavy work load and running important web applications, so 
we need to secure it. Some day we will have four physical servers, two of them 
running as Load Directors (master and backup) and two of them as replicated web 
servers. But first I must learn how to do it, of course. So I set up a pilot 
system.

     C) My pilot system: 
I'm working on an Apple Xserve, where I have created four virtual machines. On 
each one of them I have installed Ubuntu 8.10. I assigned a static IP to each 
of the VMs, and reserved a virtual IP to access the web servers.
So, I have: 
        director1: 172.25.146.32
        director2: 172.25.146.33
        web1: 172.25.146.37
        web2: 172.25.146.38
        Virtual IP: 172.25.146.31
director1 and web1 access the network via eth0, while director2 and web2 do it 
via eth1 (I don't know why, it simply was configured like that when I created 
the virtual machines and installed Ubuntu). 

Each machine has the same /etc/hosts: 
127.0.0.1               localhost
172.25.146.32   director1
172.25.146.33   director2
172.25.146.37   web1
172.25.146.38   web2

     D) What I have installed and configured: 

                D1) Apache and PHP5 on web1 and web2. I can access from the 
browser http://172.25.146.37, and http://172.25.146.38 with no problems. 
                D2) I wrote the following script on director1 and director2: 
/etc/network/if-up.d/loadmodules

###################
#!/bin/bash

echo ip_vs_dh >> /etc/modules
echo ip_vs_ftp >> /etc/modules
echo ip_vs >> /etc/modules
echo ip_vs_lblc >> /etc/modules
echo ip_vs_lblcr >> /etc/modules
echo ip_vs_lc >> /etc/modules
echo ip_vs_nq >> /etc/modules
echo ip_vs_rr >> /etc/modules
echo ip_vs_sed >> /etc/modules
echo ip_vs_sh >> /etc/modules
echo ip_vs_wlc >> /etc/modules
echo ip_vs_wrr >> /etc/modules

modprobe ip_vs_dh
modprobe ip_vs_ftp
modprobe ip_vs
modprobe ip_vs_lblc
modprobe ip_vs_lblcr
modprobe ip_vs_lc
modprobe ip_vs_nq
modprobe ip_vs_rr
modprobe ip_vs_sed
modprobe ip_vs_sh
modprobe ip_vs_wlc
modprobe ip_vs_wrr
######################

But I noticed that when restarting the machines, the modules weren't reloaded. 
So I edited the file /etc/modules and added the lines manually (ip_vs_dh and so 
on)... I don't know if that was the right thing to do...
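
A note on the script above: anything placed in /etc/network/if-up.d is run each
time an interface comes up, so the echo lines would append the same names to
/etc/modules again and again. A minimal sketch of an idempotent alternative
follows. The function name add_module_once and its optional file argument are
hypothetical, introduced only for illustration.

```shell
#!/bin/bash
# Hypothetical helper: add a module name to a modules file only once.
# Usage: add_module_once MODULE [FILE]   (FILE defaults to /etc/modules)
add_module_once() {
    local module="$1"
    local file="${2:-/etc/modules}"
    # -q quiet, -x whole-line match, -F literal string (no regex)
    if ! grep -qxF -- "$module" "$file" 2>/dev/null; then
        echo "$module" >> "$file"
    fi
}

# Example (commented out so that pasting this block changes nothing):
# for m in ip_vs ip_vs_rr ip_vs_wrr; do add_module_once "$m"; done
```

Run once as root, this would leave a single line per module in /etc/modules
no matter how often it is repeated; the modprobe calls could stay as they are.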

                D3) On director1 and director2, I did: apt-get install ipvsadm 
ldirectord heartbeat
                D4) Enabled packet forwarding on /etc/sysctl.conf: 
                        net.ipv4.ip_forward = 1
and then 
                        sysctl -p
                D5) The files: ha.cf, haresources, authkeys, ldirectord.cf and 
logd.cf on director1 and director2: 

/etc/ha.d/ha.cf: 

#This is for director1
#Changed eth0 by eth1 on director2
#
debugfile /var/log/ha-debug
logfile /var/log/ha-log
use_logd yes
logfacility local0
keepalive 1
warntime 10
deadtime 30
initdead 120
updport 694
ucast eth0 172.25.146.32
ucast eth0 172.25.146.33
auto_failback on
node director1
node director2
ping 172.25.146.1 #gateway
respawn hacluster /usr/lib/heartbeat/ipfail

/etc/ha.d/haresources: 

director1 \
  ldirectord::ldirector.cf \
  LVSSyncDaemonSwap::master \
  IPaddr2::172.25.146.31/24/eth0/172.25.146.255
#172.25.146.255 broadcast address
#changed eth0 by eth1 on director2
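
The IPaddr2 entry names the interface that will carry the virtual IP, so it
has to match a device that really exists on the node where the file lives. A
small sketch that pulls the interface out of such an entry and looks for it
under /sys/class/net follows. The function names are hypothetical, and the
address/prefix/interface/broadcast layout is simply read off the entry above.

```shell
#!/bin/bash
# Hypothetical helper: print the interface field of an IPaddr2 resource
# written as IPaddr2::address/prefix/interface/broadcast
ipaddr2_iface() {
    # Strip the "IPaddr2::" prefix, then take the third "/"-separated field.
    echo "${1#IPaddr2::}" | cut -d/ -f3
}

# Hypothetical helper: succeed if the named interface exists on this host
# (relies on Linux sysfs being mounted at /sys).
iface_exists() {
    [ -e "/sys/class/net/$1" ]
}

# Example with the entry from haresources:
iface=$(ipaddr2_iface "IPaddr2::172.25.146.31/24/eth0/172.25.146.255")
if iface_exists "$iface"; then
    echo "$iface is present on this node"
else
    echo "$iface is missing on this node"
fi
```

Running it on each director would show whether the interface written in that
node's haresources is one the node actually has.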
                
/etc/ha.d/authkeys: (same for director1 and director2) 

auth 3
3 md5 mypassword

/etc/ha.d/ldirectord.cf: (same for director1 and director2)

checktimeout=10
checkinterval=2
autoreload=no
logfile="local0"
quiescent=yes
virtual=172.25.146.31:80
        real=172.25.146.37:80 gate
        real=172.25.146.38:80 gate
        fallback=127.0.0.1:80 gate
        service=http
        request="test.html"
        receive="test"
        scheduler=rr
        protocol=tcp
        checktype=negotiate

/etc/logd.cf

debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility daemon
entity logd
useapphbd no
sendqlen 256
recvqlen 256

                D6) Created the proper /var/www/test.html on web1 and web2

                D7) Typed: 
update-rc.d heartbeat start 75 2 3 4 5 . stop 05 0 1 6 .
update-rc.d -f ldirectord remove
/etc/init.d/ldirectord stop
/etc/init.d/heartbeat start

                D8) I checked: 
ip add sh eth0 on director1, OK
ip add sh eth1 on director2, OK
ldirectord ldirectord.cf status on director1 and director2: running on 
director1 and stopped on director2, OK
ipvsadm -L -n on director1 and director2: shows the routing table on director1 
and nothing on director2, OK
/etc/ha.d/resource.d/LVSSyncDaemonSwap master status on director1 and 
director2: running on director1 and stopped on director2, OK

                D9) On both web servers, I enabled arp_ignore and arp_announce 
in /etc/sysctl.conf: 
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.eth0.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 1
net.ipv4.conf.eth0.arp_announce = 1
(with eth1 instead of eth0 on web2). 
And then: sysctl -p

                D10) On both web servers, I added the following on 
/etc/network/interfaces: 

auto lo:0
iface lo:0 inet static
        address 172.25.146.31
        netmask 255.255.255.255
        pre-up sysctl -p > /dev/null

And then: ifup lo:0
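
Since sysctl -p only applies what is in the file at that moment, it can be
worth confirming that the running kernel values match /etc/sysctl.conf. A
minimal sketch follows; all three function names are hypothetical, and the
parser assumes simple "key = value" lines with a single value.

```shell
#!/bin/bash
# Hypothetical helper: map a sysctl key to its path under /proc/sys.
# (Does not handle interface names that themselves contain dots.)
sysctl_proc_path() {
    echo "/proc/sys/$(echo "$1" | tr . /)"
}

# Hypothetical helper: print the value assigned to KEY in a sysctl.conf-style FILE.
# Usage: sysctl_conf_value FILE KEY
sysctl_conf_value() {
    awk -F= -v key="$2" '
        /^[ \t]*#/ { next }                       # skip comment lines
        { k = $1; gsub(/[ \t]/, "", k) }          # key with whitespace removed
        k == key { v = $2; gsub(/[ \t]/, "", v); found = v }   # keep last match
        END { if (found != "") print found }
    ' "$1"
}

# Report whether the running value of KEY matches the one configured in FILE.
# Usage: check_sysctl FILE KEY
check_sysctl() {
    local want have
    want=$(sysctl_conf_value "$1" "$2")
    have=$(cat "$(sysctl_proc_path "$2")" 2>/dev/null)
    if [ -n "$want" ] && [ "$want" = "$have" ]; then
        echo "$2 = $have (matches $1)"
    else
        echo "$2: configured '$want', running '$have'"
        return 1
    fi
}
```

On a web server this could be called as, for example,
check_sysctl /etc/sysctl.conf net.ipv4.conf.all.arp_ignore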

        E) Done. Final tests: 

                E1) I try to access http://172.25.146.31 on my browser. 
Success. I can check which server is serving with: 
ipvsadm -L -n --stats 
Both servers are serving alternately, as expected (round robin -rr- 
algorithm).

                E2) I kill web1. http://172.25.146.31 keeps responding. Same if 
I start web1 again and kill web2. Success.

So I achieved Load Balancing. Let's see what happens with the High Availability.

                E3) I stop heartbeat on director1 with: 
/etc/init.d/heartbeat stop

And... http://172.25.146.31 doesn't answer anymore... Ouch!!!!!!

                E4) OK, OK, wait a second, let's go back: 
/etc/init.d/heartbeat start (on director1)

And http://172.25.146.31 still doesn't answer... Ooooouch!!!!!!

If I do: 
ipvsadm -L -n
No routes are listed anymore (on either director1 or director2).

Feeling miserable, I try, on a desperate hunch: 
/etc/init.d/heartbeat start (on director1, again)

And, surprise, http.... is alive again!!

So, if I put director1 down, heartbeat doesn't swap to director2, and if I want 
to put it up again, I must start heartbeat twice!! (so, "auto_failback on" 
doesn't work either)...

I tried then to put director1 down, and start heartbeat thousands of times on 
director2. Nothing happens anyway... 

So I have achieved Lousy Availability instead!!! :_(

I have attached the ha-debug log files to this e-mail; I guess that they will 
be meaningful to more experienced people... Especially the ha-debug of 
director2, which only repeats the same message over and over again: 

ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device

So I sense that something is trying to use eth0 on director2, which doesn't 
exist there, as its interface is eth1. But I have gone through every 
configuration file many times and I can't find where the error is.
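
One way to narrow this down is to search everything under /etc/ha.d, including
the scripts in /etc/ha.d/resource.d, for the interface name, since the name
could also be set as a default inside a script rather than in a configuration
file. A minimal sketch follows; find_iface_refs is a hypothetical name, and
whether any script really carries such a default is an assumption to verify,
not something the logs confirm.

```shell
#!/bin/bash
# Hypothetical helper: list every line under DIR that mentions IFACE.
# Usage: find_iface_refs DIR IFACE
find_iface_refs() {
    # -r recurse into subdirectories, -n show line numbers,
    # -w match whole words only (so "eth0" does not match "eth00")
    grep -rnw -- "$2" "$1"
}

# Example, to be run on director2 (commented out; needs the real directory):
# find_iface_refs /etc/ha.d eth0
```

Each hit is printed as file:line:text, which would show at a glance whether
eth0 still appears somewhere on the node that only has eth1.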

So... please please please, may I get any hint?

Thanks in advance!!!!

Best regards, 

         Alejandro

==      
Alejandro Sanchez Merono - [email protected]
TIC Department
Institute of Electrical Technology
Parque Tecnologico de Valencia
PATERNA (Valencia)
Spain

Tel.: (+34) 96 136 66 70
Fax: (+34) 96 136 66 80
Web: http://www.ite.es
E-mail: [email protected]

No such device
ERROR: ipvsadm --start-daemon master --mcast-interface=eth0 failed.
INFO:  Success
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
INFO:  Success
No such device
ERROR: ipvsadm --start-daemon master --mcast-interface=eth0 failed.
INFO:  Success
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
INFO:  Success
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon master --mcast-interface=eth0 failed.
INFO:  Success
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.

INFO:  Success
INFO:  Success
heartbeat[15720]: 2009/03/03_10:47:24 WARN: Initializing connection to logging 
daemon failed. Logging daemon may not be running
heartbeat[15720]: 2009/03/03_10:47:24 info: Enabling logging daemon
heartbeat[15720]: 2009/03/03_10:47:24 info: logfile and debug file are those 
specified in logd config file (default /etc/logd.cf)
heartbeat[15720]: 2009/03/03_10:47:24 WARN: Core dumps could be lost if 
multiple dumps occur.
heartbeat[15720]: 2009/03/03_10:47:24 WARN: Consider setting non-default value 
in /proc/sys/kernel/core_pattern (or equivalent) for maximum supportability
heartbeat[15720]: 2009/03/03_10:47:24 WARN: Consider setting 
/proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
heartbeat[15720]: 2009/03/03_10:47:24 info: Version 2 support: false
heartbeat[15720]: 2009/03/03_10:47:24 WARN: logd is enabled but 
logfile/debugfile/logfacility is still configured in ha.cf
heartbeat[15720]: 2009/03/03_10:47:24 info: **************************
heartbeat[15720]: 2009/03/03_10:47:24 info: Configuration validated. Starting 
heartbeat 2.1.3
heartbeat[15721]: 2009/03/03_10:47:24 info: heartbeat: version 2.1.3
heartbeat[15721]: 2009/03/03_10:47:24 info: Heartbeat generation: 1236015595
heartbeat[15721]: 2009/03/03_10:47:24 info: glib: ucast: write socket priority 
set to IPTOS_LOWDELAY on eth0
heartbeat[15721]: 2009/03/03_10:47:24 info: glib: ucast: bound send socket to 
device: eth0
heartbeat[15721]: 2009/03/03_10:47:24 info: glib: ucast: bound receive socket 
to device: eth0
heartbeat[15721]: 2009/03/03_10:47:24 info: glib: ucast: started on port 694 
interface eth0 to 172.25.146.32
heartbeat[15721]: 2009/03/03_10:47:24 info: glib: ucast: write socket priority 
set to IPTOS_LOWDELAY on eth0
heartbeat[15721]: 2009/03/03_10:47:24 info: glib: ucast: bound send socket to 
device: eth0
heartbeat[15721]: 2009/03/03_10:47:24 info: glib: ucast: bound receive socket 
to device: eth0
heartbeat[15721]: 2009/03/03_10:47:24 info: glib: ucast: started on port 694 
interface eth0 to 172.25.146.33
heartbeat[15721]: 2009/03/03_10:47:24 info: glib: ping heartbeat started.
heartbeat[15721]: 2009/03/03_10:47:24 info: G_main_add_TriggerHandler: Added 
signal manual handler
heartbeat[15721]: 2009/03/03_10:47:24 info: G_main_add_TriggerHandler: Added 
signal manual handler
heartbeat[15721]: 2009/03/03_10:47:24 info: G_main_add_SignalHandler: Added 
signal handler for signal 17
heartbeat[15721]: 2009/03/03_10:47:24 info: Local status now set to: 'up'
heartbeat[15721]: 2009/03/03_10:47:25 info: Link 172.25.146.1:172.25.146.1 up.
heartbeat[15721]: 2009/03/03_10:47:25 info: Status update for node 
172.25.146.1: status ping
heartbeat[15721]: 2009/03/03_10:47:25 info: Link director1:eth0 up.
heartbeat[15721]: 2009/03/03_10:47:48 info: Link director2:eth0 up.
heartbeat[15721]: 2009/03/03_10:47:48 info: Status update for node director2: 
status up
heartbeat[15721]: 2009/03/03_10:47:48 debug: get_delnodelist: delnodelist=
heartbeat[15735]: 2009/03/03_10:47:48 debug: notify_world: setting SIGCHLD 
Handler to SIG_DFL
logd is not runningharc[15735]:    2009/03/03_10:47:48 info: Running 
/etc/ha.d/rc.d/status status
heartbeat[15721]: 2009/03/03_10:47:49 info: Comm_now_up(): updating status to 
active
heartbeat[15721]: 2009/03/03_10:47:49 info: Local status now set to: 'active'
heartbeat[15721]: 2009/03/03_10:47:49 info: Starting child client 
"/usr/lib/heartbeat/ipfail" (113,125)
heartbeat[15721]: 2009/03/03_10:47:49 info: Status update for node director2: 
status active
heartbeat[15752]: 2009/03/03_10:47:49 debug: notify_world: setting SIGCHLD 
Handler to SIG_DFL
heartbeat[15751]: 2009/03/03_10:47:49 info: Starting 
"/usr/lib/heartbeat/ipfail" as uid 113  gid 125 (pid 15751)
ipfail[15751]: 2009/03/03_10:47:49 WARN: Initializing connection to logging 
daemon failed. Logging daemon may not be running
ipfail[15751]: 2009/03/03_10:47:49 debug: PID=15751
ipfail[15751]: 2009/03/03_10:47:49 debug: Signing in with heartbeat
logd is not runningharc[15752]:    2009/03/03_10:47:49 info: Running 
/etc/ha.d/rc.d/status status
ipfail[15751]: 2009/03/03_10:47:49 debug: [We are director1]
ipfail[15751]: 2009/03/03_10:47:49 debug: auto_failback -> 1 (on)
ipfail[15751]: 2009/03/03_10:47:49 debug: Setting message filter mode
ipfail[15751]: 2009/03/03_10:47:49 debug: Starting node walk
ipfail[15751]: 2009/03/03_10:47:50 debug: Cluster node: 172.25.146.1: status: 
ping
ipfail[15751]: 2009/03/03_10:47:51 debug: Cluster node: director2: status: 
active
ipfail[15751]: 2009/03/03_10:47:51 debug: [They are director2]
ipfail[15751]: 2009/03/03_10:47:51 debug: Cluster node: director1: status: 
active
ipfail[15751]: 2009/03/03_10:47:51 debug: Setting message signal
ipfail[15751]: 2009/03/03_10:47:52 debug: Waiting for messages...
ipfail[15751]: 2009/03/03_10:47:53 debug: Other side is unstable.
ipfail[15751]: 2009/03/03_10:47:57 debug: Got asked for num_ping.
ipfail[15751]: 2009/03/03_10:47:57 debug: Found ping node 172.25.146.1!
ipfail[15751]: 2009/03/03_10:47:58 info: Ping node count is balanced.
ipfail[15751]: 2009/03/03_10:47:58 debug: Abort message sent.
heartbeat[15721]: 2009/03/03_10:47:59 info: local resource transition completed.
heartbeat[15721]: 2009/03/03_10:47:59 info: Initial resource acquisition 
complete (T_RESOURCES(us))
heartbeat[15721]: 2009/03/03_10:47:59 info: remote resource transition 
completed.
ipfail[15751]: 2009/03/03_10:47:59 debug: Other side is unstable.
ipfail[15751]: 2009/03/03_10:47:59 debug: Other side is now stable.
heartbeat[15823]: 2009/03/03_10:48:00 info: Local Resource acquisition 
completed.
heartbeat[15721]: 2009/03/03_10:48:00 debug: StartNextRemoteRscReq(): child 
count 1
heartbeat[15862]: 2009/03/03_10:48:00 debug: notify_world: setting SIGCHLD 
Handler to SIG_DFL
INFO:  Success
INFO:  Success

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
