Hi,

I am currently trying to get to the bottom of a problem where my LVS-director seems to drop a packet coming from a client from time to time. We have this problem on our production systems and can reproduce it on staging.
Our setup:
===========

We are using ipvsadm with Linux CentOS5 x86_64 in a PV XEN-DomU.

Current Version details:
  Kernel:  2.6.18-348.1.1.el5xen
  ipvsadm: 1.24-13.el5

LVS-Setup:
- We use IPVS in DR-mode; for managing the running connections we use lvs-kiss.
- lvs is running in a heartbeat-v1-cluster (two virtual nodes); master and backup are running constantly on both nodes.
- For the LVS-services we use logical IPs that are set up by heartbeat (active/passive-clustermode).
- The real-servers are physical Linux-machines.

Network-Setup:
The VM acting as director is running as XEN-PV-DomU on a Dom0 using bridged networks.

Networks "in play":
- abn-network: the staging-network. It connects the client to the director, is used by the real-servers to send the answer to the clients (direct routing approach), and carries the ipvsadm slave/master multicast-traffic.
- lvs-network: a dedicated VLAN which connects director and real-servers.

dr-arp-problem: solved by suppressing arp-answers on the real-servers for the service-ip. The service-IP is configured as logical IP on the lvs-interface on the real-servers.

In this setup ip_forwarding is not needed anywhere (neither on the director nor on the real-servers).

VM details:
1 GB RAM, 2 vCPUs, system-load almost 0, memory 73M free, 224M buffers, 536M cache, no swap. top shows almost always 100% idle, 0% us/sy/ni/wa/hi/si/st.

Configuration details:
ipvsadm -Ln for the service in question shows:

  TCP  x.y.183.217:12405 wrr persistent 7200
    -> 192.168.83.234:12405   Route   1000   0   0
    -> 192.168.83.235:12405   Route   1000   0   0

x.y stands for the first two octets of our internal class-B-range. We use 192.168.83.x as lvs-network for staging.
Persistent ipvsadm-configuration:
/etc/sysconfig/ipvsadm:

  --set 20 20 20

Cluster-configuration:
/etc/ha.d/haresources:

  $primary_directorname lvs-kiss x.y.183.217

lvs-kiss-configuration-snippet for the service above:

  <VirtualServer idm-abn:12405>
    ServiceType tcp
    Scheduler wrr
    DynamicScheduler 0
    Persistance 7200
    QueueSize 2
    Fuzz 0.1
    <RealServer rs1-lvs:12405>
      PacketForwardingMethod gatewaying
      Test ping -c 1 -nq -W 1 rs1-lvs >/dev/null
      RunOnFailure "/sbin/ipvsadm -d -t idm-abn:12405 -r rs1-lvs"
      RunOnRecovery "/sbin/ipvsadm -a -t idm-abn:12405 -r rs1-lvs"
    </RealServer>
    <RealServer rs2-lvs:12405>
      PacketForwardingMethod gatewaying
      Test ping -c 1 -nq -W 1 rs2-lvs >/dev/null
      RunOnFailure "/sbin/ipvsadm -d -t idm-abn:12405 -r rs2-lvs"
      RunOnRecovery "/sbin/ipvsadm -a -t idm-abn:12405 -r rs2-lvs"
    </RealServer>
  </VirtualServer>

idm-abn, rs1 and rs2 resolve via /etc/hosts.

About the service:
This is a SOA web-service.

How we reproduce the error:
From a client we run constant calls to the web-service at an interval of one call every three seconds. From time to time the client receives a connection reset from the director. Interesting: this happens on attempt number n x 100 + 1 (for example 101, 201, ...). The notable part is the "+ 1".

What we did to trace down the problem:
- Checked /proc/sys/net/ipv4/vs: all values are set to default, so drop_packet is NOT in place (=0).
- Ran tcpdump on the client, on the frontend/abn and backend/lvs interfaces of the director, and on the lvs and abn interfaces of the real-servers.

In this tcpdump we could see a request from the client, answered by a connection-reset by the director. The packet was NOT forwarded via LVS.

I welcome any ideas on how to track this problem down further. If any information is unclear or missing, please ask.
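
For anyone who wants to repeat the test, the reproduction loop described above can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not our actual test client: it only opens and closes a TCP connection instead of making a real web-service call, and the names probe, run_probes and follows_period are made up for illustration. Note that a reset sent in answer to the initial SYN usually shows up on the client as "connection refused", so the sketch reports both cases.

```python
import socket
import time


def probe(host, port, timeout=5.0):
    """Open one TCP connection and close it again (hypothetical helper).

    Returns "ok", "reset", "refused", or "error:<ExceptionName>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except ConnectionResetError:
        return "reset"
    except ConnectionRefusedError:
        # A RST answering the SYN is reported as ECONNREFUSED.
        return "refused"
    except OSError as exc:
        return "error:" + type(exc).__name__


def run_probes(host, port, count, interval=3.0,
               probe_fn=probe, sleep_fn=time.sleep):
    """Run `count` probes, one every `interval` seconds.

    Returns a dict {attempt_number: result} for every attempt that
    did not return "ok". Attempt numbers start at 1.
    """
    failures = {}
    for attempt in range(1, count + 1):
        result = probe_fn(host, port)
        if result != "ok":
            failures[attempt] = result
            print(f"{time.strftime('%H:%M:%S')} attempt {attempt}: {result}")
        if attempt < count:
            sleep_fn(interval)
    return failures


def follows_period(attempts, period=100, offset=1):
    """True if there is at least one attempt number and every one of
    them has the form n * period + offset."""
    return bool(attempts) and all(
        a % period == offset % period for a in attempts
    )
```

A call such as run_probes("idm-abn", 12405, count=1000) from the client, followed by follows_period(failures), would show whether the failures really line up with the n x 100 + 1 pattern. With one call every three seconds, 100 attempts correspond to roughly 300 seconds, assuming each call returns quickly.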
Kind regards
Nils Hildebrand