Hello everybody,
This is Alejandro from Valencia, Spain. I'm glad to join this mailing list, and
though at present I'm a complete rookie at HA (and a "sophomore" in Linux), I'd
like to think that some day I might help others with this subject.
Unfortunately, at present it's me who needs a helping hand from you...
OK, I'll try to put all the data in order:
A) Abstract of the issue: I have configured load balancing and high
availability with two web servers and two directors running ldirectord and
heartbeat. Load balancing works fine, but when testing the HA, if I stop
heartbeat on the main director, the system switches to the backup director
but... only for a few seconds!! Then everything is dead. The ha-debug log on
the main director seems happy, while the ha-debug log on the backup director
just repeats the same error hundreds of times.
B) What I am actually trying to do:
My main objective is rather simple: to obtain load balancing and high
availability from two mirrored Apache web servers. At present we have just a
single web server with a rather heavy workload, running important web
applications, so we need to secure it. Some day we will have four physical
servers, two of them running as load directors (master and backup) and two of
them as replicated web servers. But first I must learn how to do it, of
course. So I set up a pilot system.
C) My pilot system:
I'm working on an Apple Xserve, where I have created four virtual machines. On
each of them I have installed Ubuntu 8.10. I assigned a static IP to each VM
and reserved a virtual IP for accessing the web servers.
So, I have:
director1: 172.25.146.32
director2: 172.25.146.33
web1: 172.25.146.37
web2: 172.25.146.38
Virtual IP: 172.25.146.31
director1 and web1 access the network via eth0, while director2 and web2 use
eth1 (I don't know why; it was simply configured like that when I created
the virtual machines and installed Ubuntu).
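Since the interface name differs between machines, here is a minimal sketch for confirming which interface actually carries a given address. It assumes the iproute2 `ip` tool is installed; the helper name iface_for_ip is made up for illustration.

```shell
#!/bin/bash
# iface_for_ip IP
# Reads `ip -o -4 addr show` style lines on stdin and prints the name of
# the interface that carries the given IPv4 address (hypothetical helper).
iface_for_ip() {
    local ip="$1"
    # Field 2 is the interface name, field 4 is "address/prefix".
    awk -v ip="$ip" '$3 == "inet" {
        split($4, a, "/")
        if (a[1] == ip) { print $2; exit }
    }'
}

# Usage on a live machine, e.g. on director2:
#   ip -o -4 addr show | iface_for_ip 172.25.146.33
```

Running this on each of the four machines would show at a glance whether eth0 or eth1 is the name to use in that machine's configuration files.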
Each machine has the same /etc/hosts:
127.0.0.1 localhost
172.25.146.32 director1
172.25.146.33 director2
172.25.146.37 web1
172.25.146.38 web2
D) What I have installed and configured:
D1) Apache and PHP5 on web1 and web2. I can access both http://172.25.146.37
and http://172.25.146.38 from the browser with no problems.
D2) I wrote the following script on director1 and director2:
/etc/network/if-up.d/loadmodules
###################
#!/bin/bash
echo ip_vs_dh >> /etc/modules
echo ip_vs_ftp >> /etc/modules
echo ip_vs >> /etc/modules
echo ip_vs_lblc >> /etc/modules
echo ip_vs_lblcr >> /etc/modules
echo ip_vs_lc >> /etc/modules
echo ip_vs_nq >> /etc/modules
echo ip_vs_rr >> /etc/modules
echo ip_vs_sed >> /etc/modules
echo ip_vs_sh >> /etc/modules
echo ip_vs_wlc >> /etc/modules
echo ip_vs_wrr >> /etc/modules
modprobe ip_vs_dh
modprobe ip_vs_ftp
modprobe ip_vs
modprobe ip_vs_lblc
modprobe ip_vs_lblcr
modprobe ip_vs_lc
modprobe ip_vs_nq
modprobe ip_vs_rr
modprobe ip_vs_sed
modprobe ip_vs_sh
modprobe ip_vs_wlc
modprobe ip_vs_wrr
######################
But I noticed that after restarting the machines, the modules weren't loaded.
So I edited the file /etc/modules and added the lines manually (ip_vs_dh and so
on)... I don't know whether that was the right thing to do...
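One thing worth noting about the script above is that it appends to /etc/modules every time it runs, so entries pile up. A minimal sketch of an idempotent variant follows; the function name register_modules and the MODULES_FILE and DRY_RUN variables are assumptions for illustration, not part of any package.

```shell
#!/bin/bash
# Sketch: list each IPVS module in /etc/modules at most once, then load it.
# MODULES_FILE and DRY_RUN are hypothetical knobs added for illustration.
MODULES_FILE="${MODULES_FILE:-/etc/modules}"
IPVS_MODULES="ip_vs ip_vs_dh ip_vs_ftp ip_vs_lblc ip_vs_lblcr ip_vs_lc ip_vs_nq ip_vs_rr ip_vs_sed ip_vs_sh ip_vs_wlc ip_vs_wrr"

register_modules() {
    local mod
    for mod in $IPVS_MODULES; do
        # Append only when the exact line is not already present.
        grep -qxF "$mod" "$MODULES_FILE" 2>/dev/null || echo "$mod" >> "$MODULES_FILE"
        # Load the module now unless DRY_RUN is set (modprobe needs root).
        [ -n "${DRY_RUN:-}" ] || modprobe "$mod"
    done
}

# On a director, as root:
#   register_modules
```

Calling it twice leaves /etc/modules unchanged the second time, so it is safe to run from a hook that fires repeatedly.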
D3) On director1 and director2, I ran:
apt-get install ipvsadm ldirectord heartbeat
D4) Enabled packet forwarding in /etc/sysctl.conf:
net.ipv4.ip_forward = 1
and then
sysctl -p
D5) The files: ha.cf, haresources, authkeys, ldirectord.cf and
logd.cf on director1 and director2:
/etc/ha.d/ha.cf:
#This is for director1
#eth0 is replaced by eth1 on director2
#
debugfile /var/log/ha-debug
logfile /var/log/ha-log
use_logd yes
logfacility local0
keepalive 1
warntime 10
deadtime 30
initdead 120
udpport 694
ucast eth0 172.25.146.32
ucast eth0 172.25.146.33
auto_failback on
node director1
node director2
ping 172.25.146.1 #gateway
respawn hacluster /usr/lib/heartbeat/ipfail
/etc/ha.d/haresources:
director1 \
ldirectord::ldirectord.cf \
LVSSyncDaemonSwap::master \
IPaddr2::172.25.146.31/24/eth0/172.25.146.255
#172.25.146.255 is the broadcast address
#eth0 is replaced by eth1 on director2
/etc/ha.d/authkeys: (same for director1 and director2)
auth 3
3 md5 mypassword
/etc/ha.d/ldirectord.cf: (same for director1 and director2)
checktimeout=10
checkinterval=2
autoreload=no
logfile="local0"
quiescent=yes
virtual=172.25.146.31:80
        real=172.25.146.37:80 gate
        real=172.25.146.38:80 gate
        fallback=127.0.0.1:80 gate
        service=http
        request="test.html"
        receive="test"
        scheduler=rr
        protocol=tcp
        checktype=negotiate
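The negotiate check configured above can be approximated by hand, which helps to rule out the real servers when debugging. This is only a rough sketch of the idea, not ldirectord's actual implementation; it assumes curl is installed, and the function names body_matches and check_real are made up for illustration.

```shell
#!/bin/bash
# body_matches PATTERN
# Succeeds if stdin contains a match for PATTERN (cf. receive="test").
body_matches() {
    grep -qE -- "$1"
}

# check_real HOST PORT
# Fetches /test.html with a 10 second limit (cf. request="test.html" and
# checktimeout=10) and checks the body for the expected string.
check_real() {
    curl --silent --fail --max-time 10 "http://$1:$2/test.html" | body_matches "test"
}

# Usage from a director:
#   for rs in 172.25.146.37 172.25.146.38; do
#       if check_real "$rs" 80; then echo "$rs up"; else echo "$rs down"; fi
#   done
```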
/etc/logd.cf
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility daemon
entity logd
useapphbd no
sendqlen 256
recvqlen 256
D6) Created the proper /var/www/test.html on web1 and web2
D7) Typed:
update-rc.d heartbeat start 75 2 3 4 5 . stop 05 0 1 6 .
update-rc.d -f ldirectord remove
/etc/init.d/ldirectord stop
/etc/init.d/heartbeat start
D8) I checked:
ip addr show eth0 on director1: OK
ip addr show eth1 on director2: OK
ldirectord ldirectord.cf status: running on director1 and stopped on
director2, OK
ipvsadm -L -n: shows the routing table on director1 and nothing on
director2, OK
/etc/ha.d/resource.d/LVSSyncDaemonSwap master status: running on director1
and stopped on director2, OK
D9) On both web servers, I enabled arp_ignore and arp_announce
in /etc/sysctl.conf:
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.eth0.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 1
net.ipv4.conf.eth0.arp_announce = 1
(with eth0 replaced by eth1 on web2).
And then: sysctl -p
D10) On both web servers, I added the following to
/etc/network/interfaces:
auto lo:0
iface lo:0 inet static
address 172.25.146.31
netmask 255.255.255.255
pre-up sysctl -p > /dev/null
And then: ifup lo:0
E) Done. Final tests:
E1) I try to access http://172.25.146.31 in my browser.
Success. I can check which server is serving with:
ipvsadm -L -n --stats
Both servers serve alternately, as expected (round-robin -rr-
algorithm).
E2) I kill web1. http://172.25.146.31 keeps responding. Same if I start
web1 again and kill web2. Success.
So I achieved load balancing. Let's see what happens with the high availability.
E3) I stop heartbeat on director1 with:
/etc/init.d/heartbeat stop
And... http://172.25.146.31 doesn't answer anymore... Ouch!!!!!!
E4) OK, OK, wait a second, let's go back:
/etc/init.d/heartbeat start (on director1)
And http://172.25.146.31 still gives no answer... Ooooouch!!!!!!
If I do:
ipvsadm -L -n
No routes appear anymore (on either director1 or director2).
Feeling miserable, I try on a hopeless hunch:
/etc/init.d/heartbeat start (on director1, again)
And, surprise, http.... is alive again!!
So, if I take director1 down, heartbeat doesn't switch to director2, and if I
want to bring it up again, I must start heartbeat twice!! (so "auto_failback
on" doesn't work either)...
I then tried taking director1 down and starting heartbeat thousands of times
on director2. Nothing happens anyway...
So I have achieved Lousy Availability instead!!! :_(
I have attached the ha-debug log files to this e-mail; I guess they will
be meaningful to more experienced people... Especially the ha-debug of
director2, which only repeats the same message over and over again:
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
No such device
So I sense that something is trying to access director2 through eth0, which
doesn't exist, as its interface is eth1. But I have gone through every
configuration file many times and I can't find where the error is.
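To chase that hunch more systematically than by eye, something like the following could list every ethN name mentioned under a directory that does not exist on the machine. It is a minimal sketch assuming GNU grep and the usual /sys/class/net layout; the function name list_missing_ifaces is hypothetical.

```shell
#!/bin/bash
# list_missing_ifaces CONF_DIR [NET_DIR]
# Prints each ethN name found in files under CONF_DIR that has no entry
# in NET_DIR (default /sys/class/net, where the kernel lists interfaces).
list_missing_ifaces() {
    local conf_dir="$1" net_dir="${2:-/sys/class/net}" ifc
    # -r recurse, -h no file names, -o only the match, -w whole words
    grep -rhowE 'eth[0-9]+' "$conf_dir" 2>/dev/null | sort -u |
    while read -r ifc; do
        [ -e "$net_dir/$ifc" ] || echo "$ifc"
    done
}

# Usage on director2, covering the config files and the resource scripts:
#   list_missing_ifaces /etc/ha.d
# To see which files mention a name it reports:
#   grep -rn 'eth0' /etc/ha.d
```

Because /etc/ha.d also holds the resource scripts in resource.d, this would catch an interface name that is a default inside a script as well as one written in a configuration file.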
So... please please please, may I get any hint?
Thanks in advance!!!!
Best regards,
Alejandro
==
Alejandro Sanchez Merono - [email protected]
TIC Department
Institute of Electrical Technology
Parque Tecnologico de Valencia
PATERNA (Valencia)
Spain
Tel.: (+34) 96 136 66 70
Fax: (+34) 96 136 66 80
Web: http://www.ite.es
E-mail: [email protected]
No such device
ERROR: ipvsadm --start-daemon master --mcast-interface=eth0 failed.
INFO: Success
No such device
ERROR: ipvsadm --start-daemon backup --mcast-interface=eth0 failed.
[... the "backup" pair above appears 44 times in total and the "master" pair 3 times, with occasional "INFO: Success" lines in between ...]
INFO: Success
INFO: Success
heartbeat[15720]: 2009/03/03_10:47:24 WARN: Initializing connection to logging
daemon failed. Logging daemon may not be running
heartbeat[15720]: 2009/03/03_10:47:24 info: Enabling logging daemon
heartbeat[15720]: 2009/03/03_10:47:24 info: logfile and debug file are those
specified in logd config file (default /etc/logd.cf)
heartbeat[15720]: 2009/03/03_10:47:24 WARN: Core dumps could be lost if
multiple dumps occur.
heartbeat[15720]: 2009/03/03_10:47:24 WARN: Consider setting non-default value
in /proc/sys/kernel/core_pattern (or equivalent) for maximum supportability
heartbeat[15720]: 2009/03/03_10:47:24 WARN: Consider setting
/proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
heartbeat[15720]: 2009/03/03_10:47:24 info: Version 2 support: false
heartbeat[15720]: 2009/03/03_10:47:24 WARN: logd is enabled but
logfile/debugfile/logfacility is still configured in ha.cf
heartbeat[15720]: 2009/03/03_10:47:24 info: **************************
heartbeat[15720]: 2009/03/03_10:47:24 info: Configuration validated. Starting
heartbeat 2.1.3
heartbeat[15721]: 2009/03/03_10:47:24 info: heartbeat: version 2.1.3
heartbeat[15721]: 2009/03/03_10:47:24 info: Heartbeat generation: 1236015595
heartbeat[15721]: 2009/03/03_10:47:24 info: glib: ucast: write socket priority
set to IPTOS_LOWDELAY on eth0
heartbeat[15721]: 2009/03/03_10:47:24 info: glib: ucast: bound send socket to
device: eth0
heartbeat[15721]: 2009/03/03_10:47:24 info: glib: ucast: bound receive socket
to device: eth0
heartbeat[15721]: 2009/03/03_10:47:24 info: glib: ucast: started on port 694
interface eth0 to 172.25.146.32
heartbeat[15721]: 2009/03/03_10:47:24 info: glib: ucast: write socket priority
set to IPTOS_LOWDELAY on eth0
heartbeat[15721]: 2009/03/03_10:47:24 info: glib: ucast: bound send socket to
device: eth0
heartbeat[15721]: 2009/03/03_10:47:24 info: glib: ucast: bound receive socket
to device: eth0
heartbeat[15721]: 2009/03/03_10:47:24 info: glib: ucast: started on port 694
interface eth0 to 172.25.146.33
heartbeat[15721]: 2009/03/03_10:47:24 info: glib: ping heartbeat started.
heartbeat[15721]: 2009/03/03_10:47:24 info: G_main_add_TriggerHandler: Added
signal manual handler
heartbeat[15721]: 2009/03/03_10:47:24 info: G_main_add_TriggerHandler: Added
signal manual handler
heartbeat[15721]: 2009/03/03_10:47:24 info: G_main_add_SignalHandler: Added
signal handler for signal 17
heartbeat[15721]: 2009/03/03_10:47:24 info: Local status now set to: 'up'
heartbeat[15721]: 2009/03/03_10:47:25 info: Link 172.25.146.1:172.25.146.1 up.
heartbeat[15721]: 2009/03/03_10:47:25 info: Status update for node
172.25.146.1: status ping
heartbeat[15721]: 2009/03/03_10:47:25 info: Link director1:eth0 up.
heartbeat[15721]: 2009/03/03_10:47:48 info: Link director2:eth0 up.
heartbeat[15721]: 2009/03/03_10:47:48 info: Status update for node director2:
status up
heartbeat[15721]: 2009/03/03_10:47:48 debug: get_delnodelist: delnodelist=
heartbeat[15735]: 2009/03/03_10:47:48 debug: notify_world: setting SIGCHLD
Handler to SIG_DFL
logd is not running
harc[15735]: 2009/03/03_10:47:48 info: Running
/etc/ha.d/rc.d/status status
heartbeat[15721]: 2009/03/03_10:47:49 info: Comm_now_up(): updating status to
active
heartbeat[15721]: 2009/03/03_10:47:49 info: Local status now set to: 'active'
heartbeat[15721]: 2009/03/03_10:47:49 info: Starting child client
"/usr/lib/heartbeat/ipfail" (113,125)
heartbeat[15721]: 2009/03/03_10:47:49 info: Status update for node director2:
status active
heartbeat[15752]: 2009/03/03_10:47:49 debug: notify_world: setting SIGCHLD
Handler to SIG_DFL
heartbeat[15751]: 2009/03/03_10:47:49 info: Starting
"/usr/lib/heartbeat/ipfail" as uid 113 gid 125 (pid 15751)
ipfail[15751]: 2009/03/03_10:47:49 WARN: Initializing connection to logging
daemon failed. Logging daemon may not be running
ipfail[15751]: 2009/03/03_10:47:49 debug: PID=15751
ipfail[15751]: 2009/03/03_10:47:49 debug: Signing in with heartbeat
logd is not running
harc[15752]: 2009/03/03_10:47:49 info: Running
/etc/ha.d/rc.d/status status
ipfail[15751]: 2009/03/03_10:47:49 debug: [We are director1]
ipfail[15751]: 2009/03/03_10:47:49 debug: auto_failback -> 1 (on)
ipfail[15751]: 2009/03/03_10:47:49 debug: Setting message filter mode
ipfail[15751]: 2009/03/03_10:47:49 debug: Starting node walk
ipfail[15751]: 2009/03/03_10:47:50 debug: Cluster node: 172.25.146.1: status:
ping
ipfail[15751]: 2009/03/03_10:47:51 debug: Cluster node: director2: status:
active
ipfail[15751]: 2009/03/03_10:47:51 debug: [They are director2]
ipfail[15751]: 2009/03/03_10:47:51 debug: Cluster node: director1: status:
active
ipfail[15751]: 2009/03/03_10:47:51 debug: Setting message signal
ipfail[15751]: 2009/03/03_10:47:52 debug: Waiting for messages...
ipfail[15751]: 2009/03/03_10:47:53 debug: Other side is unstable.
ipfail[15751]: 2009/03/03_10:47:57 debug: Got asked for num_ping.
ipfail[15751]: 2009/03/03_10:47:57 debug: Found ping node 172.25.146.1!
ipfail[15751]: 2009/03/03_10:47:58 info: Ping node count is balanced.
ipfail[15751]: 2009/03/03_10:47:58 debug: Abort message sent.
heartbeat[15721]: 2009/03/03_10:47:59 info: local resource transition completed.
heartbeat[15721]: 2009/03/03_10:47:59 info: Initial resource acquisition
complete (T_RESOURCES(us))
heartbeat[15721]: 2009/03/03_10:47:59 info: remote resource transition
completed.
ipfail[15751]: 2009/03/03_10:47:59 debug: Other side is unstable.
ipfail[15751]: 2009/03/03_10:47:59 debug: Other side is now stable.
heartbeat[15823]: 2009/03/03_10:48:00 info: Local Resource acquisition
completed.
heartbeat[15721]: 2009/03/03_10:48:00 debug: StartNextRemoteRscReq(): child
count 1
heartbeat[15862]: 2009/03/03_10:48:00 debug: notify_world: setting SIGCHLD
Handler to SIG_DFL
INFO: Success
INFO: Success
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems