Hi there!

My freshly setup homeserver has serious stability problem, which make the 
the machine absolutely unuseable as home server. 

The system is an amd64 (50...@2600mhz, no overclocking) with 4GB Ram and 
three 1TB SATA disks bundled to a raid5 (kernel md driver). 
The kernels I've tried are gentoo-2.6.28-r, gentoo-2.6.29-r5 and 
vanilla-2.6.29.4.

The problem is, when I copy large amount of data via nfs to the server, then 
the logs get filled with entries like this:

--- Start extraction of /var/log/everything/current

[13010.203643] kswapd0: page allocation failure. order:1, mode:0x4020
Jun  2 21:40:43 [kernel] [13010.203653] Pid: 346, comm: kswapd0 Tainted: G      
  
W  2.6.29.4 #2
Jun  2 21:40:43 [kernel] [13010.203657] Call Trace:
Jun  2 21:40:43 [kernel] [13010.203662]  <IRQ>  [<ffffffff810c887e>] 
__alloc_pages_internal+0x3ce/0x4f0
Jun  2 21:40:43 [kernel] [13010.203690]  [<ffffffff810f2d62>] ? 
unfreeze_slab+0x92/0xf0
Jun  2 21:40:43 [kernel] [13010.203697]  [<ffffffff810f3bc6>] 
__slab_alloc+0x226/0x570
Jun  2 21:40:43 [kernel] [13010.203706]  [<ffffffff814018df>] ? 
__netdev_alloc_skb+0x1f/0x40
Jun  2 21:40:43 [kernel] [13010.203714]  [<ffffffff810f4f4f>] 
__kmalloc_track_caller+0x10f/0x150
Jun  2 21:40:43 [kernel] [13010.203720]  [<ffffffff814018df>] ? 
__netdev_alloc_skb+0x1f/0x40
Jun  2 21:40:43 [kernel] [13010.203726]  [<ffffffff8140158e>] 
__alloc_skb+0x6e/0x140
Jun  2 21:40:43 [kernel] [13010.203732]  [<ffffffff814018df>] 
__netdev_alloc_skb+0x1f/0x40
Jun  2 21:40:43 [kernel] [13010.203743]  [<ffffffff812f6c32>] 
rtl8169_rx_fill+0xc2/0x1f0
Jun  2 21:40:43 [kernel] [13010.203750]  [<ffffffff812f6f80>] 
rtl8169_rx_interrupt+0x220/0x4e0
Jun  2 21:40:43 [kernel] [13010.203757]  [<ffffffff812f73c0>] 
rtl8169_poll+0x40/0x220
Jun  2 21:40:43 [kernel] [13010.203765]  [<ffffffff81407c99>] 
net_rx_action+0xa9/0x150
Jun  2 21:40:43 [kernel] [13010.203775]  [<ffffffff8106d95c>] 
__do_softirq+0x9c/0x160
Jun  2 21:40:43 [kernel] [13010.203783]  [<ffffffff8102b87c>] 
call_softirq+0x1c/0x30
Jun  2 21:40:43 [kernel] [13010.203789]  [<ffffffff8102cbb5>] 
do_softirq+0x65/0xb0
Jun  2 21:40:43 [kernel] [13010.203796]  [<ffffffff8106d595>] 
irq_exit+0x85/0xb0
Jun  2 21:40:43 [kernel] [13010.203801]  [<ffffffff8102ce6a>] 
do_IRQ+0xba/0x1b0
Jun  2 21:40:43 [kernel] [13010.203808]  [<ffffffff8102b0d3>] 
ret_from_intr+0x0/0xf
Jun  2 21:40:43 [kernel] [13010.203811]  <EOI>  [<ffffffff815144db>] ? 
_spin_unlock_irq+0x2b/0x40
Jun  2 21:40:43 [kernel] [13010.203825]  [<ffffffff815144e0>] ? 
_spin_unlock_irq+0x30/0x40
Jun  2 21:40:43 [kernel] [13010.203834]  [<ffffffff810ceba0>] ? 
__remove_mapping+0xd0/0x100
Jun  2 21:40:43 [kernel] [13010.203841]  [<ffffffff810cf560>] ? 
shrink_page_list+0x380/0x7f0
Jun  2 21:40:43 [kernel] [13010.203848]  [<ffffffff810cfc30>] ? 
shrink_list+0x260/0x650
Jun  2 21:40:43 [kernel] [13010.203856]  [<ffffffff810d0293>] ? 
shrink_zone+0x273/0x380
Jun  2 21:40:43 [kernel] [13010.203863]  [<ffffffff810d04e7>] ? 
shrink_slab+0x147/0x180
Jun  2 21:40:43 [kernel] [13010.203870]  [<ffffffff810d0c71>] ? 
kswapd+0x751/0x7c0
Jun  2 21:40:43 [kernel] [13010.203877]  [<ffffffff810ce080>] ? 
isolate_pages_global+0x0/0x280
Jun  2 21:40:43 [kernel] [13010.203885]  [<ffffffff8107f970>] ? 
autoremove_wake_function+0x0/0x40
Jun  2 21:40:43 [kernel] [13010.203896]  [<ffffffff810919bd>] ? 
trace_hardirqs_on+0xd/0x10
Jun  2 21:40:43 [kernel] [13010.203903]  [<ffffffff810d0520>] ? 
kswapd+0x0/0x7c0
Jun  2 21:40:43 [kernel] [13010.203909]  [<ffffffff8107f559>] ? 
kthread+0x49/0x80
Jun  2 21:40:43 [kernel] [13010.203915]  [<ffffffff8102b77a>] ? 
child_rip+0xa/0x20
Jun  2 21:40:43 [kernel] [13010.203920]  [<ffffffff8102b17c>] ? 
restore_args+0x0/0x30
Jun  2 21:40:43 [kernel] [13010.203926]  [<ffffffff8107f510>] ? 
kthread+0x0/0x80
Jun  2 21:40:43 [kernel] [13010.203932]  [<ffffffff8102b770>] ? 
child_rip+0x0/0x20
Jun  2 21:40:43 [kernel] [13010.203936] Mem-Info:
Jun  2 21:40:43 [kernel] [13010.203939] DMA per-cpu:
Jun  2 21:40:43 [kernel] [13010.203943] CPU    0: hi:    0, btch:   1 usd:   
0
Jun  2 21:40:43 [kernel] [13010.203947] CPU    1: hi:    0, btch:   1 usd:   
0
Jun  2 21:40:43 [kernel] [13010.203950] DMA32 per-cpu:
Jun  2 21:40:43 [kernel] [13010.203954] CPU    0: hi:  186, btch:  31 usd: 
180
Jun  2 21:40:43 [kernel] [13010.203958] CPU    1: hi:  186, btch:  31 usd: 
182
Jun  2 21:40:43 [kernel] [13010.203961] Normal per-cpu:
Jun  2 21:40:43 [kernel] [13010.203964] CPU    0: hi:  186, btch:  31 usd: 
155
Jun  2 21:40:43 [kernel] [13010.203968] CPU    1: hi:  186, btch:  31 usd: 
167
Jun  2 21:40:43 [kernel] [13010.203974] Active_anon:270 active_file:11782 
inactive_anon:2024
Jun  2 21:40:43 [kernel] [13010.203976]  inactive_file:905625 unevictable:0 
dirty:28068 writeback:0 unstable:0
Jun  2 21:40:43 [kernel] [13010.203978]  free:15792 slab:50324 mapped:697 
pagetables:413 bounce:0
Jun  2 21:40:43 [kernel] [13010.203986] DMA free:1960kB min:28kB low:32kB 
high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB 
inactive_file:0kB unevictable:0kB present:15152kB pages_scanned:0 
all_unreclaimable? yes
Jun  2 21:40:43 [kernel] [13010.203992] lowmem_reserve[]: 0 3480 3949 3949
Jun  2 21:40:43 [kernel] [13010.204004] DMA32 free:60844kB min:7068kB 
low:8832kB high:10600kB active_anon:580kB inactive_anon:384kB 
active_file:30744kB inactive_file:3213088kB unevictable:0kB 
present:3563552kB pages_scanned:32 all_unreclaimable? no
Jun  2 21:40:43 [kernel] [13010.204011] lowmem_reserve[]: 0 0 469 469
Jun  2 21:40:43 [kernel] [13010.204013] Normal free:364kB min:952kB 
low:1188kB high:1428kB active_anon:500kB inactive_anon:7712kB 
active_file:16384kB inactive_file:409412kB unevictable:0kB present:480960kB 
pages_scanned:0 all_unreclaimable? no
Jun  2 21:40:43 [kernel] [13010.204013] lowmem_reserve[]: 0 0 0 0
Jun  2 21:40:43 [kernel] [13010.204013] 917432 total pagecache pages
Jun  2 21:40:43 [kernel] [13010.204013] 0 pages in swap cache
Jun  2 21:40:43 [kernel] [13010.204013] Swap cache stats: add 0, delete 0, 
find 0/0
Jun  2 21:40:43 [kernel] [13010.204013] Free swap  = 5879760kB
Jun  2 21:40:43 [kernel] [13010.204013] Total swap = 5879760kB
Jun  2 21:40:43 [kernel] [13010.204013] 1171440 pages RAM
Jun  2 21:40:43 [kernel] [13010.204013] 178617 pages reserved
Jun  2 21:40:43 [kernel] [13010.204013] 905214 pages shared
Jun  2 21:40:43 [kernel] [13010.204013] 73436 pages non-shared

--- End extraction of /var/log/everything/current

Sometimes its kswap0, nfsd, swapper, and a lot of other progs causing it. 

However, the systems runs without any problem until it has enough physical 
memory. If I copy larger amounts of data (e.g. typical dvd-iso images of 
~4GB), then the log gets filled. Today's output was >65MB :-(
So far, I think these messages are just annoying but do no further harm. 

The second problem is that the network controler (rtl8169) seems to timeout 
on TX, which is very bad if it happens during renewing your dhcp-lease :-( 
the message looks like

Jun  2 18:25:23 [kernel] [ 1290.701314] ------------[ cut here ]------------
Jun  2 18:25:23 [kernel] [ 1290.701321] WARNING: at 
net/sched/sch_generic.c:226 dev_watchdog+0x245/0x260()
Jun  2 18:25:23 [kernel] [ 1290.701326] Hardware name: System Product Name
Jun  2 18:25:23 [kernel] [ 1290.701330] NETDEV WATCHDOG: eth0 (r8169): 
transmit timed out
Jun  2 18:25:23 [kernel] [ 1290.701334] Modules linked in: psmouse k8temp 
pcspkr hwmon
Jun  2 18:25:23 [kernel] [ 1290.701348] Pid: 0, comm: swapper Not tainted 
2.6.29.4 #2
Jun  2 18:25:23 [kernel] [ 1290.701352] Call Trace:
Jun  2 18:25:23 [kernel] [ 1290.701356]  <IRQ>  [<ffffffff810676c0>] 
warn_slowpath+0xd0/0x130
Jun  2 18:25:23 [kernel] [ 1290.701375]  [<ffffffff8108559b>] ? 
sched_clock_cpu+0x13b/0x180
Jun  2 18:25:23 [kernel] [ 1290.701385]  [<ffffffff8126cf19>] ? 
strlcpy+0x49/0x60
Jun  2 18:25:23 [kernel] [ 1290.701391]  [<ffffffff8141ca25>] 
dev_watchdog+0x245/0x260
Jun  2 18:25:23 [kernel] [ 1290.701400]  [<ffffffff815144db>] ? 
_spin_unlock_irq+0x2b/0x40
Jun  2 18:25:23 [kernel] [ 1290.701410]  [<ffffffff81091876>] ? 
trace_hardirqs_on_caller+0x66/0x1a0
Jun  2 18:25:23 [kernel] [ 1290.701415]  [<ffffffff8141c7e0>] ? 
dev_watchdog+0x0/0x260
Jun  2 18:25:23 [kernel] [ 1290.701423]  [<ffffffff810726c0>] 
run_timer_softirq+0x170/0x250
Jun  2 18:25:23 [kernel] [ 1290.701431]  [<ffffffff8106d95c>] 
__do_softirq+0x9c/0x160
Jun  2 18:25:23 [kernel] [ 1290.701439]  [<ffffffff8102b87c>] 
call_softirq+0x1c/0x30
Jun  2 18:25:23 [kernel] [ 1290.701445]  [<ffffffff8102cbb5>] 
do_softirq+0x65/0xb0
Jun  2 18:25:23 [kernel] [ 1290.701451]  [<ffffffff8106d595>] 
irq_exit+0x85/0xb0
Jun  2 18:25:23 [kernel] [ 1290.701457]  [<ffffffff8102ce6a>] 
do_IRQ+0xba/0x1b0
Jun  2 18:25:23 [kernel] [ 1290.701463]  [<ffffffff8102b0d3>] 
ret_from_intr+0x0/0xf
Jun  2 18:25:23 [kernel] [ 1290.701467]  <EOI>  [<ffffffff81031acf>] ? 
default_idle+0x4f/0x60
Jun  2 18:25:23 [kernel] [ 1290.701480]  [<ffffffff81031acd>] ? 
default_idle+0x4d/0x60
Jun  2 18:25:23 [kernel] [ 1290.701487]  [<ffffffff81031b96>] ? 
c1e_idle+0xa6/0x110
Jun  2 18:25:23 [kernel] [ 1290.701495]  [<ffffffff81517b01>] ? 
atomic_notifier_call_chain+0x11/0x20
Jun  2 18:25:23 [kernel] [ 1290.701504]  [<ffffffff81028f27>] ? 
cpu_idle+0x67/0xc0
Jun  2 18:25:23 [kernel] [ 1290.701513]  [<ffffffff8150d068>] ? 
start_secondary+0x163/0x1bb
Jun  2 18:25:23 [kernel] [ 1290.701518] ---[ end trace 301c1d6a9ee969de ]---




I don't know whether the first issue has anything todo with the second one.

The worst thing however is an occasional freeze of the whole system, which 
happens very likely when there is high network load over nfs. If that 
happens, I have a blank console, no network, no keyboard, nothing. Not even 
sysrq seems to work, which makes it pretty hard to tell what has happened. 
After resetting the machine, there is nothing suspicious in the logs (it 
just ends).

Does anyone have suggestions?

Thanks in advance
        Alex

Reply via email to