We have a number of LVS clusters set up and have been running them successfully for a few years; occasionally, however, the load balancing plays up and traffic becomes unbalanced: all requests start getting sent to a single backend server and the other backend servers get no traffic.
It doesn't matter which scheduler is used (i.e. rr, wrr, lc, wlc); we still occasionally see this pattern. It's also not the monitoring script taking the backend nodes out of the IPVS table. Our LVS directors run CentOS 4.6 and OpenSUSE 10.1 (unrelated clusters), but we see it happen on both. The following URL [ http://img208.imageshack.us/my.php?image=lvsunbalancedsf2.gif ] has a graph of the backend servers' monitoring; you can see that at about 2:00 traffic to three of the backend servers drops off to nothing and "server 4" starts handling all the traffic. While this was occurring, ipvsadm gave the following output:

$ ipvsadm -ln
IP Virtual Server version 1.2.0 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  xxx.xxx.xxx.102:80 rr
  -> xxx.xxx.xxx.9:80             Route   100    0          0
  -> xxx.xxx.xxx.10:80            Route   100    0          0
  -> xxx.xxx.xxx.11:80            Route   100    0          0
  -> xxx.xxx.xxx.12:80            Route   100    0          0
TCP  xxx.xxx.xxx.103:80 rr
  -> xxx.xxx.xxx.12:80            Route   100    0          0
  -> xxx.xxx.xxx.11:80            Route   100    0          0
  -> xxx.xxx.xxx.10:80            Route   100    0          0
  -> xxx.xxx.xxx.9:80             Route   100    0          0
TCP  xxx.xxx.xxx.100:80 rr
  -> xxx.xxx.xxx.9:80             Route   100    0          0
  -> xxx.xxx.xxx.10:80            Route   100    0          0
  -> xxx.xxx.xxx.11:80            Route   100    0          0
  -> xxx.xxx.xxx.12:80            Route   100    0          0
TCP  xxx.xxx.xxx.101:80 rr
  -> xxx.xxx.xxx.12:80            Route   100    1          1
  -> xxx.xxx.xxx.11:80            Route   100    6          0
  -> xxx.xxx.xxx.10:80            Route   100    1          3
  -> xxx.xxx.xxx.9:80             Route   100    1          0
TCP  xxx.xxx.xxx.104:80 rr
  -> xxx.xxx.xxx.9:80             Route   100    0          0
  -> xxx.xxx.xxx.10:80            Route   100    0          0
  -> xxx.xxx.xxx.11:80            Route   100    0          0
  -> xxx.xxx.xxx.12:80            Route   100    0          0

The connection columns have pretty much dropped to zero, even though traffic is still being served on the site. Note that server4 in the graph corresponds to xxx.xxx.xxx.12. To work around the problem we run a script which monitors the VIP; if HTTP requests take too long (due to the single backend node being overloaded), we tell heartbeat to go to standby and swap the VIPs over to the other LVS director.
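For reference, our watchdog does something roughly like the following (this is a simplified sketch, not the exact script: the VIP address, the 5-second threshold, and the hb_standby path are placeholders, and the heartbeat binary location varies by distro):

```shell
#!/bin/sh
# Sketch of the VIP watchdog: time an HTTP request to the VIP and, if it
# is too slow, push the VIPs to the standby director via heartbeat.
# VIP, THRESHOLD_MS and the hb_standby path below are illustrative.

VIP="xxx.xxx.xxx.100"
THRESHOLD_MS=5000

# Return success (0) when a measured response time warrants a failover.
should_failover() {
    elapsed_ms=$1
    [ "$elapsed_ms" -gt "$THRESHOLD_MS" ]
}

# Time one HTTP request to the VIP, in milliseconds. curl's
# %{time_total} is in seconds, so scale it with awk.
probe_ms() {
    curl -s -o /dev/null -m 10 -w '%{time_total}' "http://$VIP/" |
        awk '{ printf "%d", $1 * 1000 }'
}

# Run one probe when invoked as: vip-watchdog.sh check (e.g. from cron).
if [ "$1" = "check" ]; then
    elapsed=$(probe_ms)
    if should_failover "$elapsed"; then
        # Path varies by distro/heartbeat version.
        /usr/lib/heartbeat/hb_standby
    fi
fi
```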
The failover then causes everything to be rebalanced. Anyway, does anyone else have this problem and, ideally, a fix?

Thanks,
Paul
_______________________________________________
LinuxVirtualServer.org mailing list - [email protected]
Send requests to [EMAIL PROTECTED] or go to http://lists.graemef.net/mailman/listinfo/lvs-users
