2011/7/11 Arne Redlich <[email protected]>:
> 2011/7/10 Caspar Smit <[email protected]>:
>> Hi all,
>>
>> I'm having stability problems with my setup.
>>
>> Here's my setup
>>
>> Two 24 bay supermicro servers (x8dth-6f) filled with 24x Western Digital
>> 2TB SATA RE4-GP disks.
>>
>> The OS on both nodes is Debian Lenny with the backports
>> linux-image-2.6.32-bpo.5-amd64 kernel, the OS resides on an SSD.
>>
>> I created 4x 8TB software raid5 sets (md0/1/2/3) containing 5 disks
>> each. The four remaining disks are hot spares.
>>
>> Then I created 4x DRBD volumes of the raid sets with the other server.
>> Using pacemaker and the iSCSITarget/iSCSILogicalUnit RAs I created 4
>> targets/lun/failover ips on top of the 4 DRBD volumes.
>>
>> Each DRBD target/lun has its own subnet and a dedicated 1gbit ethernet link.
>>
>> I compiled the stable version v1.4.20.2 of IETD from the tarball from
>> the stable ietd website.
>>
>> After a while of operation (sometimes a few days, sometimes a week)
>> the system freezes and I have to hard reset the system to get it back
>> online.
>> Pacemaker does a failover to the second node and then that node
>> freezes also after a while, sometimes instantly, sometimes it takes
>> longer.
>>
>> I see these call traces in the syslog:
>
> Caspar,
>
> Can you post more details about your IET configuration?
> Are there any other error messages except the ones below in the syslog?
> Anyway, this one seems not to involve IET, but looks similar:
> http://lists.debian.org/debian-backports/2011/01/msg00037.html
>
> So that suggests that the problem might not lie with IET.

Yes, I saw that post too. Unfortunately there were no replies.

I was using fileio with wb, but I have already switched back to wt mode
because I think wb is not really a good idea in cluster (failover)
situations. Am I right?

Will blockio make any difference? As I remember, blockio doesn't use the
local page cache, so maybe the system then doesn't have to allocate local
memory?

>
>> Jul 10 13:14:15 node01 kernel: [ 477.389011] istd1: page allocation
>> failure. order:0, mode:0x4020
>> Jul 10 13:14:17 node01 kernel: [ 477.389242] Pid: 4213, comm: istd1
>> Not tainted 2.6.32-bpo.5-amd64 #1
>> Jul 10 13:14:17 node01 kernel: [ 477.389486] Call Trace:
>> Jul 10 13:14:17 node01 kernel: [ 477.389675] <IRQ>
>> [<ffffffff810ba4b1>] ? __alloc_pages_nodemask+0x592/0x5f5
>> Jul 10 13:14:17 node01 kernel: [ 477.390014] [<ffffffff810e67c2>] ?
>> new_slab+0x5b/0x1ca
>
> <snip>
>
>> Jul 10 13:14:17 node01 kernel: [ 477.393109] Mem-Info:
>> Jul 10 13:14:17 node01 kernel: [ 477.393111] Node 0 DMA per-cpu:
>> Jul 10 13:14:17 node01 kernel: [ 477.393113] CPU 0: hi: 0,
>> btch: 1 usd: 0
>> Jul 10 13:14:17 node01 kernel: [ 477.393114] Node 0 DMA32 per-cpu:
>> Jul 10 13:14:17 node01 kernel: [ 477.393116] CPU 0: hi: 186,
>> btch: 31 usd: 30
>> Jul 10 13:14:17 node01 kernel: [ 477.393121] active_anon:28148
>> inactive_anon:28740 isolated_anon:0
>> Jul 10 13:14:17 node01 kernel: [ 477.393122] active_file:14658
>> inactive_file:24493 isolated_file:0
>> Jul 10 13:14:17 node01 kernel: [ 477.393123] unevictable:4547
>> dirty:382 writeback:19817 unstable:0
>> Jul 10 13:14:17 node01 kernel: [ 477.393124] free:751
>> slab_reclaimable:2325 slab_unreclaimable:6997
>> Jul 10 13:14:17 node01 kernel: [ 477.393125] mapped:3650 shmem:152
>> pagetables:986 bounce:0
>> Jul 10 13:14:17 node01 kernel: [ 477.393126] Node 0 DMA free:1984kB
>> min:84kB low:104kB high:124kB active_anon:0kB inactive_anon:12kB
>> active_file:7884kB inactive_file:5152kB unevictable:0kB
>> isolated(anon):0kB isolated(file):0kB present:15312kB mlocked:0kB
>> dirty:8kB writeback:808kB mapped:0kB shmem:0kB slab_reclaimable:392kB
>> slab_unreclaimable:352kB kernel_stack:0kB pagetables:0kB unstable:0kB
>> bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
>> Jul 10 13:14:17 node01 kernel: [ 477.393136] lowmem_reserve[]: 0 489 489 489
>> Jul 10 13:14:17 node01 kernel: [ 477.393138] Node 0 DMA32 free:1020kB
>> min:2784kB low:3480kB high:4176kB active_anon:112592kB
>> inactive_anon:114948kB active_file:50748kB inactive_file:92820kB
>> unevictable:18188kB isolated(anon):0kB isolated(file):0kB
>> present:500896kB mlocked:18188kB dirty:1520kB writeback:78460kB
>> mapped:14600kB shmem:608kB slab_reclaimable:8908kB
>> slab_unreclaimable:27636kB kernel_stack:1032kB pagetables:3944kB
>> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:448
>> all_unreclaimable? no
>
> The amount of writeback data (78460kB) looks like quite a lot to me - seems
> the backend storage cannot keep up. Might be due to this being a dump
> from a VM though.

You might be right, I have already switched back to write-through iomode.
Maybe the raid5 sets are too slow to handle the traffic? During heavy load
the iowait in the vmstat output doesn't drop below 80 and the load rises
to 27 (on a quad core system).
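
To put numbers on the dump quoted above, here is a rough Python sketch. It is only an illustration: the DMA32 zone line is abridged and pasted in by hand, and the helper names parse_zone and below_min_watermark are made up for this mail. It checks whether the zone's free memory is under its "min" watermark and how large the writeback backlog is compared to free memory. A zone that low on free pages would at least be consistent with an order:0 allocation failing in interrupt context (the trace shows <IRQ>).

```python
import re

# DMA32 zone line from the dump quoted above, abridged by hand to the
# fields this sketch uses.
ZONE_LINE = (
    "Node 0 DMA32 free:1020kB min:2784kB low:3480kB high:4176kB "
    "dirty:1520kB writeback:78460kB"
)


def parse_zone(line):
    """Return every 'name:<n>kB' field of a zone line as {name: kB}."""
    return {key: int(val) for key, val in re.findall(r"([a-z_]+):(\d+)kB", line)}


def below_min_watermark(zone):
    """True when the zone's free memory is under its 'min' watermark."""
    return zone["free"] < zone["min"]


zone = parse_zone(ZONE_LINE)
print("free below min watermark:", below_min_watermark(zone))
print("writeback / free ratio:", round(zone["writeback"] / zone["free"], 1))
```

The same check can be pointed at any other zone line from the syslog by replacing ZONE_LINE.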

Kind regards,
Caspar

>
>> Jul 10 13:14:17 node01 kernel: [ 477.393148] lowmem_reserve[]: 0 0 0 0
>> Jul 10 13:14:17 node01 kernel: [ 477.393150] Node 0 DMA: 2*4kB 1*8kB
>> 1*16kB 1*32kB 0*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB
>> 0*4096kB = 1984kB
>> Jul 10 13:14:17 node01 kernel: [ 477.393156] Node 0 DMA32: 187*4kB
>> 0*8kB 1*16kB 0*32kB 2*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
>> 0*4096kB = 1020kB
>> Jul 10 13:14:17 node01 kernel: [ 477.393162] 41122 total pagecache pages
>> Jul 10 13:14:17 node01 kernel: [ 477.393163] 39 pages in swap cache
>> Jul 10 13:14:17 node01 kernel: [ 477.393165] Swap cache stats: add
>> 39, delete 0, find 36/36
>> Jul 10 13:14:17 node01 kernel: [ 477.393166] Free swap = 2128416kB
>> Jul 10 13:14:17 node01 kernel: [ 477.393167] Total swap = 2128572kB
>> Jul 10 13:14:17 node01 kernel: [ 477.395112] 131056 pages RAM
>> Jul 10 13:14:17 node01 kernel: [ 477.395114] 3844 pages reserved
>> Jul 10 13:14:17 node01 kernel: [ 477.395115] 51112 pages shared
>> Jul 10 13:14:17 node01 kernel: [ 477.395116] 90961 pages non-shared
>> Jul 10 13:14:17 node01 kernel: [ 477.395119] SLUB: Unable to allocate
>> memory on node -1 (gfp=0x20)
>> Jul 10 13:14:17 node01 kernel: [ 477.395121] cache: kmalloc-1024,
>> object size: 1024, buffer size: 1024, default order: 1, min order: 0
>> Jul 10 13:14:17 node01 kernel: [ 477.395124] node 0: slabs: 90,
>> objs: 720, free: 0
>>
>> I see messages about the e1000 driver and iscsi_trgt so that's why I
>> sent this to these mailing lists.
>>
>> Sometimes I get swapper: page allocation failure. order:0, mode:0x402
>>
>> It only arises when there is heavy network IO to the system.
>>
>> What is causing these freezes? Is it some kind of memory leak because
>> it only happens after a while and not instantly?
>> Is this a known problem?
>>
>> I tried the following:
>>
>> 1) Put in more RAM, the system had 4GB RAM and I upgraded to 16GB on
>> both nodes. This doesn't seem to have any effect.
>> Ps.
the log above is from a virtual (vmware) instance of the same
>> OS image and the same issue arises in the virtual machine (512MB
>> memory).
>>
>> 2) I reverted to the lenny stable kernel linux-image-2.6.26-2-amd64
>> kernel and then there are no freezes, but with this kernel the
>> performance is much lower
>> than with the 2.6.32 kernel, and the system load seems to get much
>> higher on heavy load. I'd like to know the cause of this and keep
>> using the 2.6.32 kernel.
>>
>> 3) I tried using a different NIC (Intel Pro/1000 PT Quad port server
>> adapter) using the e1000e driver and I have the same issue with that
>> NIC.
>>
>> 4) I will try using the latest stable linux intel e1000e drivers
>> (v1.3.17) from the intel site, I already compiled them but didn't have
>> enough time to get results (freezes).
>>
>> 5) Could it be possible that ethernet flow control could solve this
>> issue? Maybe the storage can't handle the IO's and now I don't have
>> flow control enabled on the switch. I am just guessing here.
>>
>> 6) Googling around I found that other users have these issues with
>> different NIC drivers/kernels and that it is not only on Debian.
>>
>> Here's a link to an Ubuntu bug report which doesn't provide a solution
>> or cause: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/164018
>>
>> If more info is needed I'm willing to provide.
>>
>> Maybe other mailing lists need to be informed?
>>
>> Kind regards,
>> Caspar Smit
>>
>> ------------------------------------------------------------------------------
>> All of the data generated in your IT infrastructure is seriously valuable.
>> Why? It contains a definitive record of application performance, security
>> threats, fraudulent activity, and more. Splunk takes this data and makes
>> sense of it. IT sense. And common sense.
>> http://p.sf.net/sfu/splunk-d2d-c2
>> _______________________________________________
>> Iscsitarget-devel mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/iscsitarget-devel
>>
>
_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel
