2011/7/10 Caspar Smit <[email protected]>:
> Hi all,
>
> I'm having stability problems with my setup.
>
> Here's my setup
>
> Two 24-bay Supermicro servers (X8DTH-6F), each fully populated with
> 24x Western Digital 2TB SATA RE4-GP disks.
>
> The OS on both nodes is Debian Lenny with the backports
> linux-image-2.6.32-bpo.5-amd64 kernel; the OS resides on an SSD.
>
> I created 4x 8TB software RAID5 sets (md0/1/2/3) containing 5 disks
> each. The four remaining disks are hot spares.
>
> Then I created 4x DRBD volumes from the RAID sets with the other server.
> Using Pacemaker and the iSCSITarget/iSCSILogicalUnit RAs I created 4
> targets/LUNs/failover IPs on top of the 4 DRBD volumes.
>
> Each DRBD target/LUN has its own subnet and a dedicated 1Gbit ethernet link.
>
> I compiled the stable IETD release, v1.4.20.2, from the tarball on
> the project website.
>
> After a while in operation (sometimes a few days, sometimes a week)
> the system freezes and I have to hard-reset it to get it back
> online.
> Pacemaker then fails over to the second node, which also freezes
> after a while - sometimes instantly, sometimes it takes longer.
>
> I see these call traces in the syslog:

Caspar,

Can you post more details about your IET configuration?
Are there any error messages in the syslog other than the ones below?
In any case, this trace does not appear to involve IET, but it looks
very similar to this one:
http://lists.debian.org/debian-backports/2011/01/msg00037.html

That suggests the problem may not lie with IET itself.
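
For reference, output like the following would help (file locations
assumed from a default IET install; adjust paths if yours differ):

```shell
# IET configuration and runtime state
cat /etc/iet/ietd.conf        # or /etc/ietd.conf on older installs
cat /proc/net/iet/volume
cat /proc/net/iet/session

# Memory/slab state around the time of the failure
grep -E 'kmalloc-(512|1024|2048)' /proc/slabinfo
cat /proc/meminfo
```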

> Jul 10 13:14:15 node01 kernel: [  477.389011] istd1: page allocation
> failure. order:0, mode:0x4020
> Jul 10 13:14:17 node01 kernel: [  477.389242] Pid: 4213, comm: istd1
> Not tainted 2.6.32-bpo.5-amd64 #1
> Jul 10 13:14:17 node01 kernel: [  477.389486] Call Trace:
> Jul 10 13:14:17 node01 kernel: [  477.389675]  <IRQ>
> [<ffffffff810ba4b1>] ? __alloc_pages_nodemask+0x592/0x5f5
> Jul 10 13:14:17 node01 kernel: [  477.390014]  [<ffffffff810e67c2>] ?
> new_slab+0x5b/0x1ca

<snip>

> Jul 10 13:14:17 node01 kernel: [  477.393109] Mem-Info:
> Jul 10 13:14:17 node01 kernel: [  477.393111] Node 0 DMA per-cpu:
> Jul 10 13:14:17 node01 kernel: [  477.393113] CPU    0: hi:    0,
> btch:   1 usd:   0
> Jul 10 13:14:17 node01 kernel: [  477.393114] Node 0 DMA32 per-cpu:
> Jul 10 13:14:17 node01 kernel: [  477.393116] CPU    0: hi:  186,
> btch:  31 usd:  30
> Jul 10 13:14:17 node01 kernel: [  477.393121] active_anon:28148
> inactive_anon:28740 isolated_anon:0
> Jul 10 13:14:17 node01 kernel: [  477.393122]  active_file:14658
> inactive_file:24493 isolated_file:0
> Jul 10 13:14:17 node01 kernel: [  477.393123]  unevictable:4547
> dirty:382 writeback:19817 unstable:0
> Jul 10 13:14:17 node01 kernel: [  477.393124]  free:751
> slab_reclaimable:2325 slab_unreclaimable:6997
> Jul 10 13:14:17 node01 kernel: [  477.393125]  mapped:3650 shmem:152
> pagetables:986 bounce:0
> Jul 10 13:14:17 node01 kernel: [  477.393126] Node 0 DMA free:1984kB
> min:84kB low:104kB high:124kB active_anon:0kB inactive_anon:12kB
> active_file:7884kB inactive_file:5152kB unevictable:0kB
> isolated(anon):0kB isolated(file):0kB present:15312kB mlocked:0kB
> dirty:8kB writeback:808kB mapped:0kB shmem:0kB slab_reclaimable:392kB
> slab_unreclaimable:352kB kernel_stack:0kB pagetables:0kB unstable:0kB
> bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> Jul 10 13:14:17 node01 kernel: [  477.393136] lowmem_reserve[]: 0 489 489 489
> Jul 10 13:14:17 node01 kernel: [  477.393138] Node 0 DMA32 free:1020kB
> min:2784kB low:3480kB high:4176kB active_anon:112592kB
> inactive_anon:114948kB active_file:50748kB inactive_file:92820kB
> unevictable:18188kB isolated(anon):0kB isolated(file):0kB
> present:500896kB mlocked:18188kB dirty:1520kB writeback:78460kB
> mapped:14600kB shmem:608kB slab_reclaimable:8908kB
> slab_unreclaimable:27636kB kernel_stack:1032kB pagetables:3944kB
> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:448
> all_unreclaimable? no

The amount of data under writeback (78460kB) looks quite high to me -
it seems the backend storage cannot keep up. That might partly be
because this dump is from a VM, though.
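
If the writeback backlog is indeed the trigger, lowering the dirty-page
thresholds sometimes helps keep it bounded. A sketch - the values below
are only a starting point, not tuned for your hardware:

```shell
# Force earlier, smaller writeback bursts so dirty pages don't pile up.
# Defaults on 2.6.32 are dirty_ratio=20, dirty_background_ratio=10.
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10

# Persist across reboots
cat >> /etc/sysctl.conf <<'EOF'
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
EOF
```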

> Jul 10 13:14:17 node01 kernel: [  477.393148] lowmem_reserve[]: 0 0 0 0
> Jul 10 13:14:17 node01 kernel: [  477.393150] Node 0 DMA: 2*4kB 1*8kB
> 1*16kB 1*32kB 0*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB
> 0*4096kB = 1984kB
> Jul 10 13:14:17 node01 kernel: [  477.393156] Node 0 DMA32: 187*4kB
> 0*8kB 1*16kB 0*32kB 2*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
> 0*4096kB = 1020kB
> Jul 10 13:14:17 node01 kernel: [  477.393162] 41122 total pagecache pages
> Jul 10 13:14:17 node01 kernel: [  477.393163] 39 pages in swap cache
> Jul 10 13:14:17 node01 kernel: [  477.393165] Swap cache stats: add
> 39, delete 0, find 36/36
> Jul 10 13:14:17 node01 kernel: [  477.393166] Free swap  = 2128416kB
> Jul 10 13:14:17 node01 kernel: [  477.393167] Total swap = 2128572kB
> Jul 10 13:14:17 node01 kernel: [  477.395112] 131056 pages RAM
> Jul 10 13:14:17 node01 kernel: [  477.395114] 3844 pages reserved
> Jul 10 13:14:17 node01 kernel: [  477.395115] 51112 pages shared
> Jul 10 13:14:17 node01 kernel: [  477.395116] 90961 pages non-shared
> Jul 10 13:14:17 node01 kernel: [  477.395119] SLUB: Unable to allocate
> memory on node -1 (gfp=0x20)
> Jul 10 13:14:17 node01 kernel: [  477.395121]   cache: kmalloc-1024,
> object size: 1024, buffer size: 1024, default order: 1, min order: 0
> Jul 10 13:14:17 node01 kernel: [  477.395124]   node 0: slabs: 90,
> objs: 720, free: 0
>
> I see messages about the e1000 driver and iscsi_trgt, which is why I
> sent this to both mailing lists.
>
> Sometimes I get "swapper: page allocation failure. order:0, mode:0x402".
>
> It only arises when there is heavy network IO to the system.
>
> What is causing these freezes? Is it some kind of memory leak because
> it only happens after a while and not instantly?
> Is this a known problem?
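
The "page allocation failure. order:0" messages with those modes are
atomic (GFP_ATOMIC) allocations failing in interrupt context - typically
the NIC driver refilling its RX ring under load - rather than a leak as
such. Reserving a larger emergency pool sometimes helps; a sketch (the
value is only a guess for your 16GB boxes, it would be far too large for
the 512MB VM):

```shell
# Keep a larger reserve for atomic allocations. The default is computed
# from RAM size; 65536 (64MB) is a common value to try on a large box.
sysctl -w vm.min_free_kbytes=65536
echo 'vm.min_free_kbytes = 65536' >> /etc/sysctl.conf
```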
>
> I tried the following:
>
> 1) Put in more RAM: the systems had 4GB and I upgraded both nodes to
> 16GB. This doesn't seem to have any effect.
>    PS: the log above is from a virtual (VMware) instance of the same
> OS image; the same issue arises in the virtual machine (512MB of
> memory).
>
> 2) I reverted to the Lenny stable kernel (linux-image-2.6.26-2-amd64)
> and then there are no freezes, but with this kernel the performance is
> much lower than with the 2.6.32 kernel, and the system load gets much
> higher under heavy load. I'd like to find the cause and keep using the
> 2.6.32 kernel.
>
> 3) I tried a different NIC (Intel PRO/1000 PT quad-port server
> adapter) using the e1000e driver and I have the same issue with that
> NIC.
>
> 4) I will try the latest stable Intel e1000e driver (v1.3.17) from
> the Intel site; I already compiled it but didn't have enough time to
> get results (freezes).
>
> 5) Could ethernet flow control solve this issue? Maybe the storage
> can't handle the IOs, and flow control is currently not enabled on
> the switch. I am just guessing here.
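
Flow control is worth a try and costs nothing; on the e1000/e1000e NICs
it can be toggled with ethtool (interface name eth0 assumed here):

```shell
# Show current pause-frame settings
ethtool -a eth0
# Enable RX/TX flow control (the switch port must also negotiate it)
ethtool -A eth0 autoneg on rx on tx on

# Larger RX rings also reduce drops and atomic-refill pressure in bursts
ethtool -g eth0            # show current/maximum ring sizes
ethtool -G eth0 rx 4096    # e.g. raise RX ring toward the e1000e maximum
```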
>
> 6) Googling around, I found that other users see these issues with
> different NIC drivers/kernels, and not only on Debian.
>
> Here's a link to an Ubuntu bug report which provides neither a
> solution nor a cause: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/164018
>
> If more info is needed, I'm happy to provide it.
>
> Should other mailing lists be informed as well?
>
> Kind regards,
> Caspar Smit
>
> ------------------------------------------------------------------------------
> All of the data generated in your IT infrastructure is seriously valuable.
> Why? It contains a definitive record of application performance, security
> threats, fraudulent activity, and more. Splunk takes this data and makes
> sense of it. IT sense. And common sense.
> http://p.sf.net/sfu/splunk-d2d-c2
> _______________________________________________
> Iscsitarget-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/iscsitarget-devel
>

_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit
http://communities.intel.com/community/wired
