2 node CentOS 4.8 cluster on ESX 4 cluster (cluster across boxes)

[r...@host ~]# uname -a
Linux hostname 2.6.9-89.0.19.ELlargesmp

2 GB RAM
2 vCPU
1 200 GB RDM - GFS1
VMware fencing

Member Status: Quorate

  Member Name        Status
  ------ ----        ------
  Host1              Online, Local, rgmanager
  Host2              Online, rgmanager

  Service Name       Owner (Last)       State
  ------- ----       ----- ------       -----
  www-http           host1              started
  www-nfs            host2              started
  vhostip-http       host2              started
  vhost-http         host2              started

[r...@host ~]# rpm -qa | grep cman
cman-kernel-2.6.9-56.7.el4_8.10
cman-kernel-smp-2.6.9-56.7.el4_8.10
cman-devel-1.0.24-1
cman-kernel-largesmp-2.6.9-56.7.el4_8.10
cman-1.0.24-1
cman-kernheaders-2.6.9-56.7.el4_8.10

/var/log/messages

Apr 26 18:45:32 tesla kernel: oom-killer: gfp_mask=0xd0
Apr 26 18:45:32 tesla kernel: Mem-info:
Apr 26 18:45:32 tesla kernel: Node 0 DMA per-cpu:
Apr 26 18:45:32 tesla kernel: cpu 0 hot: low 2, high 6, batch 1
Apr 26 18:45:32 tesla kernel: cpu 0 cold: low 0, high 2, batch 1
Apr 26 18:45:32 tesla kernel: cpu 1 hot: low 2, high 6, batch 1
Apr 26 18:45:32 tesla kernel: cpu 1 cold: low 0, high 2, batch 1
Apr 26 18:45:32 tesla kernel: Node 0 Normal per-cpu:
Apr 26 18:45:32 tesla kernel: cpu 0 hot: low 32, high 96, batch 16
Apr 26 18:45:32 tesla kernel: cpu 0 cold: low 0, high 32, batch 16
Apr 26 18:45:32 tesla kernel: cpu 1 hot: low 32, high 96, batch 16
Apr 26 18:45:32 tesla kernel: cpu 1 cold: low 0, high 32, batch 16
Apr 26 18:45:32 tesla kernel: Node 0 HighMem per-cpu: empty
Apr 26 18:45:32 tesla kernel:
Apr 26 18:45:32 tesla kernel: Free pages: 6352kB (0kB HighMem)
Apr 26 18:45:32 tesla kernel: Active:3245 inactive:3129 dirty:0 writeback:0 unstable:0 free:1588 slab:499421 mapped:4514 pagetables:914
Apr 26 18:45:32 tesla kernel: Node 0 DMA free:752kB min:44kB low:88kB high:132kB active:0kB inactive:0kB present:15996kB pages_scanned:0 all_unreclaimable? yes
Apr 26 18:45:32 tesla kernel: protections[]: 0 286000 286000
Apr 26 18:45:32 tesla kernel: Node 0 Normal free:5600kB min:5720kB low:11440kB high:17160kB active:12980kB inactive:12516kB present:2080704kB pages_scanned:20031 all_unreclaimable? yes
Apr 26 18:45:32 tesla kernel: protections[]: 0 0 0
Apr 26 18:45:32 tesla kernel: Node 0 HighMem free:0kB min:128kB low:256kB high:384kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Apr 26 18:45:32 tesla kernel: protections[]: 0 0 0
Apr 26 18:45:32 tesla kernel: Node 0 DMA: 4*4kB 4*8kB 2*16kB 3*32kB 3*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 752kB
Apr 26 18:45:32 tesla kernel: Node 0 Normal: 0*4kB 0*8kB 0*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 1*4096kB = 5600kB
Apr 26 18:45:32 tesla kernel: Node 0 HighMem: empty
Apr 26 18:45:32 tesla kernel: 6192 pagecache pages

Every 4 days the host2 system (running the NFS service) starts running
oom-killer, goes brain dead, and gets fenced. The http processes are
restarted every morning at 4:00 AM for log rotation, so I don't think they
are the problem.

Attempts to fix:

http://kbase.redhat.com/faq/docs/DOC-3993
http://kbase.redhat.com/faq/docs/DOC-7317
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1002704

Release Found: Red Hat Enterprise Linux 4 Update 4

Symptom: The command top shows a lot of memory is being cached and swap is
hardly being used.

Solution: On Red Hat Enterprise Linux 4 Update 4, a workaround for the
oom-killer killing random processes while there is still memory available
is to issue the following command:

This will cause page reclamation to happen sooner, thus providing more
'protection' for the zones.

Changes to Tesla:

[r...@host ~]# echo 100 > /proc/sys/vm/lower_zone_protection

Anybody have any ideas?

Thanks,
Eric
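For anyone reading the Mem-info counters in the /var/log/messages excerpt, here is a minimal sketch that converts the page counts to kB and shows how much of the node each one accounts for. The 4 kB page size is an assumption inferred from the log itself (free:1588 pages is reported as 6352kB), and the names parse_counters, to_kb, PAGE_KB and PRESENT_KB are made up for illustration only.

```python
import re

# Assumption: 4 kB pages, inferred from the log (free:1588 pages = 6352 kB).
PAGE_KB = 4

# Counter line copied from the oom-killer output in /var/log/messages.
MEM_INFO = ("Active:3245 inactive:3129 dirty:0 writeback:0 unstable:0 "
            "free:1588 slab:499421 mapped:4514 pagetables:914")

# Sum of the "present:" values for the DMA and Normal zones (HighMem is 0).
PRESENT_KB = 15996 + 2080704


def parse_counters(line):
    """Return {name: page_count} for every name:number pair in the line."""
    return {name.lower(): int(value)
            for name, value in re.findall(r"(\w+):(\d+)", line)}


def to_kb(pages):
    """Convert a page count to kB using the assumed page size."""
    return pages * PAGE_KB


counters = parse_counters(MEM_INFO)
slab_kb = to_kb(counters["slab"])
slab_share = slab_kb / PRESENT_KB

# Print the counters, largest first, followed by the slab share.
for name, pages in sorted(counters.items(), key=lambda kv: -kv[1]):
    print(f"{name:<11}{to_kb(pages):>9} kB")
print(f"slab share of present memory: {slab_share:.0%}")
```

Under that assumption, slab works out to 1997684 kB of the 2096700 kB present, while active, inactive and mapped pages together are only a few tens of MB. That would suggest the memory is held in kernel slab caches rather than by user processes; which cache is growing could be checked in /proc/slabinfo on host2 before the next occurrence.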
--
Linux-cluster mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/linux-cluster
