2011/7/11 Arne Redlich <[email protected]>:
> 2011/7/10 Caspar Smit <[email protected]>:
>> Hi all,
>>
>> I'm having stability problems with my setup.
>>
>> Here's my setup
>>
>> Two 24 bay supermicro servers (x8dth-6f) full with 24x Western Digital
>> 2TB SATA RE4-GP disks.
>>
>> The OS on both nodes is Debian Lenny with the backports
>> linux-image-2.6.32-bpo.5-amd64 kernel, the OS resides on a SSD.
>>
>> I created 4x 8TB software raid5 sets (md0/1/2/3) containing 5 disks
>> each. The four remaining disks are hotspares
>>
>> Then i created 4x DRBD volumes of the raid sets with the other server.
>> Using pacemaker and the iSCSITarget/iSCSILogicalUnit RA's I created 4
>> targets/lun/failover ips on top of the 4 DRBD volumes.
>>
>> Each DRBD target/lun has it's own subnet and a dedicated 1gbit ethernet link.
>>
>> I compiled the stable version v1.4.20.2 of IETD from the tarball from
>> the stable ietd website.
>>
>> After a while of operation (sometimes a few days, sometimes a week)
>> the system freezes and I have to hard reset the system to get it back
>> online.
>> Pacemaker does a failover to the second node and then that node
>> freezes also after a while, sometimes instantly, sometimes it takes
>> longer.
>>
>> I see these call traces in the syslog:
>
> Caspar,
>
> Can you post more details about your IET configuration?
> Are there any other error messages except the ones below in the syslog?
> Anyway, this one seems not to involve IET, but looks similar:
> http://lists.debian.org/debian-backports/2011/01/msg00037.html
>
> So that suggests that the problem might not lie with IET.

Yes, I saw that post too. Unfortunately, there were no replies.

I was using fileio with wb but have already switched back to wt mode,
because I think wb is not really a good idea in cluster (failover)
situations.
Am I right? Would blockio make any difference? As I recall, blockio
doesn't use the local page cache, so maybe the system then wouldn't
have to allocate local memory?
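For reference, here's roughly what a blockio LUN definition would look like in ietd.conf (the IQN and device path below are placeholders, not my actual config):

```
Target iqn.2011-07.com.example:storage.disk1
    # Type=blockio bypasses the local page cache;
    # the current setup uses Type=fileio,IOMode=wt instead
    Lun 0 Path=/dev/drbd0,Type=blockio
```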

>
>> Jul 10 13:14:15 node01 kernel: [  477.389011] istd1: page allocation
>> failure. order:0, mode:0x4020
>> Jul 10 13:14:17 node01 kernel: [  477.389242] Pid: 4213, comm: istd1
>> Not tainted 2.6.32-bpo.5-amd64 #1
>> Jul 10 13:14:17 node01 kernel: [  477.389486] Call Trace:
>> Jul 10 13:14:17 node01 kernel: [  477.389675]  <IRQ>
>> [<ffffffff810ba4b1>] ? __alloc_pages_nodemask+0x592/0x5f5
>> Jul 10 13:14:17 node01 kernel: [  477.390014]  [<ffffffff810e67c2>] ?
>> new_slab+0x5b/0x1ca
>
> <snip>
>
>> Jul 10 13:14:17 node01 kernel: [  477.393109] Mem-Info:
>> Jul 10 13:14:17 node01 kernel: [  477.393111] Node 0 DMA per-cpu:
>> Jul 10 13:14:17 node01 kernel: [  477.393113] CPU    0: hi:    0,
>> btch:   1 usd:   0
>> Jul 10 13:14:17 node01 kernel: [  477.393114] Node 0 DMA32 per-cpu:
>> Jul 10 13:14:17 node01 kernel: [  477.393116] CPU    0: hi:  186,
>> btch:  31 usd:  30
>> Jul 10 13:14:17 node01 kernel: [  477.393121] active_anon:28148
>> inactive_anon:28740 isolated_anon:0
>> Jul 10 13:14:17 node01 kernel: [  477.393122]  active_file:14658
>> inactive_file:24493 isolated_file:0
>> Jul 10 13:14:17 node01 kernel: [  477.393123]  unevictable:4547
>> dirty:382 writeback:19817 unstable:0
>> Jul 10 13:14:17 node01 kernel: [  477.393124]  free:751
>> slab_reclaimable:2325 slab_unreclaimable:6997
>> Jul 10 13:14:17 node01 kernel: [  477.393125]  mapped:3650 shmem:152
>> pagetables:986 bounce:0
>> Jul 10 13:14:17 node01 kernel: [  477.393126] Node 0 DMA free:1984kB
>> min:84kB low:104kB high:124kB active_anon:0kB inactive_anon:12kB
>> active_file:7884kB inactive_file:5152kB unevictable:0kB
>> isolated(anon):0kB isolated(file):0kB present:15312kB mlocked:0kB
>> dirty:8kB writeback:808kB mapped:0kB shmem:0kB slab_reclaimable:392kB
>> slab_unreclaimable:352kB kernel_stack:0kB pagetables:0kB unstable:0kB
>> bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
>> Jul 10 13:14:17 node01 kernel: [  477.393136] lowmem_reserve[]: 0 489 489 489
>> Jul 10 13:14:17 node01 kernel: [  477.393138] Node 0 DMA32 free:1020kB
>> min:2784kB low:3480kB high:4176kB active_anon:112592kB
>> inactive_anon:114948kB active_file:50748kB inactive_file:92820kB
>> unevictable:18188kB isolated(anon):0kB isolated(file):0kB
>> present:500896kB mlocked:18188kB dirty:1520kB writeback:78460kB
>> mapped:14600kB shmem:608kB slab_reclaimable:8908kB
>> slab_unreclaimable:27636kB kernel_stack:1032kB pagetables:3944kB
>> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:448
>> all_unreclaimable? no
>
> The amount of writeback data (78460kB) looks pretty much to me - seems
> the backend storage cannot keep up. Might be due to this being a dump
> from a VM though.

You might be right; I already switched back to write-through iomode.
Maybe the raid5 sets are too slow to handle the traffic?
During heavy load, iowait doesn't drop below 80 in the vmstat output,
and the load average rises to 27 (on a quad-core system).
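As a quick way to sample iowait without watching vmstat, something like this works (a minimal sketch reading /proc/stat on Linux; the helper name is made up):

```python
import time

def iowait_pct(interval=1.0):
    """Return the share of CPU time spent in iowait over `interval` seconds,
    computed from two snapshots of the aggregate 'cpu' line in /proc/stat."""
    def snap():
        with open("/proc/stat") as f:
            parts = f.readline().split()  # cpu user nice system idle iowait irq ...
        return [int(p) for p in parts[1:]]
    a = snap()
    time.sleep(interval)
    b = snap()
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas) or 1
    return 100.0 * deltas[4] / total  # 0-based field 4 is iowait

if __name__ == "__main__":
    print("iowait: %.1f%%" % iowait_pct())
```

On the boxes here, this hovers at 80+ under heavy iSCSI load, matching the vmstat "wa" column.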

Kind regards,
Caspar
>
>> Jul 10 13:14:17 node01 kernel: [  477.393148] lowmem_reserve[]: 0 0 0 0
>> Jul 10 13:14:17 node01 kernel: [  477.393150] Node 0 DMA: 2*4kB 1*8kB
>> 1*16kB 1*32kB 0*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB
>> 0*4096kB = 1984kB
>> Jul 10 13:14:17 node01 kernel: [  477.393156] Node 0 DMA32: 187*4kB
>> 0*8kB 1*16kB 0*32kB 2*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
>> 0*4096kB = 1020kB
>> Jul 10 13:14:17 node01 kernel: [  477.393162] 41122 total pagecache pages
>> Jul 10 13:14:17 node01 kernel: [  477.393163] 39 pages in swap cache
>> Jul 10 13:14:17 node01 kernel: [  477.393165] Swap cache stats: add
>> 39, delete 0, find 36/36
>> Jul 10 13:14:17 node01 kernel: [  477.393166] Free swap  = 2128416kB
>> Jul 10 13:14:17 node01 kernel: [  477.393167] Total swap = 2128572kB
>> Jul 10 13:14:17 node01 kernel: [  477.395112] 131056 pages RAM
>> Jul 10 13:14:17 node01 kernel: [  477.395114] 3844 pages reserved
>> Jul 10 13:14:17 node01 kernel: [  477.395115] 51112 pages shared
>> Jul 10 13:14:17 node01 kernel: [  477.395116] 90961 pages non-shared
>> Jul 10 13:14:17 node01 kernel: [  477.395119] SLUB: Unable to allocate
>> memory on node -1 (gfp=0x20)
>> Jul 10 13:14:17 node01 kernel: [  477.395121]   cache: kmalloc-1024,
>> object size: 1024, buffer size: 1024, default order: 1, min order: 0
>> Jul 10 13:14:17 node01 kernel: [  477.395124]   node 0: slabs: 90,
>> objs: 720, free: 0
>>
>> I see messages about the e1000 driver and iscsi_trgt so that's why i
>> sent this to these mailing lists.
>>
>> Sometimes I get  swapper: page allocation failure. order:0, mode:0x402
>>
>> It only arises when there is heavy network IO to the system.
>>
>> What is causing these freezes? Is it some kind of memory leak because
>> it only happens after a while and not instantly?
>> Is this a known problem?
>>
>> I tried the following:
>>
>> 1) Put in more RAM, the system had 4GB RAM and I upgraded to 16GB on
>> both nodes. This doesn't seem to have any effect.
>>    Ps. the log above is from a virtual (vmware) instance of the same
>> OS image and the same issue arises in the virtual machine (512MB
>> memory).
>>
>> 2) I reverted to the lenny stable kernel linux-image-2.6.26-2-amd64
>> kernel and then there are no freezes, but with this kernel the
>> performance is much lower
>>   then with the 2.6.32 kernel, and the system load seem to get much
>> higher on heavy load. I'd like to know the cause of this and keep
>> using the 2.6.32 kernel.
>>
>> 3) I tried using a different NIC (Intel Pro/1000 PT Quad port server
>> adpater) using the e1000e driver and I have the same issue with that
>> NIC.
>>
>> 4) I will try using the latest stable linux intel e1000e drivers
>> (v1.3.17) from the intel site, i already compiled them but didn't have
>> enough time to get results (freezes).
>>
>> 5) Could it be possible that ethernet flow control could solve this
>> issue? Maybe the storage can't handle the IO's and now I don't have
>> flow control enabled on the switch. I am just guessing here.
>>
>> 6) Googling around I found that other users have these issues with
>> different NIC drivers/kernels and that it is not only on debian.
>>
>> Here's a link to a ubuntu bug report which doesn't provide a solution
>> or cause: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/164018
>>
>> If more info is needed I'm willing to provide.
>>
>> Maybe other mailing lists need to be informed?
>>
>> Kind regards,
>> Caspar Smit
>>
>> ------------------------------------------------------------------------------
>> All of the data generated in your IT infrastructure is seriously valuable.
>> Why? It contains a definitive record of application performance, security
>> threats, fraudulent activity, and more. Splunk takes this data and makes
>> sense of it. IT sense. And common sense.
>> http://p.sf.net/sfu/splunk-d2d-c2
>> _______________________________________________
>> Iscsitarget-devel mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/iscsitarget-devel
>>
>

_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit
http://communities.intel.com/community/wired