Re: OOM Condition on SLES11 running WAS - Tuning problems?

Marcy Cortes Mon, 26 Jul 2010 09:08:47 -0700

First of all, you've run out of memory on that server (Swap: 35764956k total, 
35764956k used).
It ate all of the 10G of RAM and all of the 35G of swap.
How many JVMs are running, and what are their min/max heap sizes?
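
One rough way to answer the heap-size question is to read the -Xms/-Xmx flags off each running java command line. The sketch below is only an illustration: it assumes the heap limits are passed as -Xms/-Xmx arguments and are visible in /proc/<pid>/cmdline, and the helper names heap_flags and scan_proc are made up for this example.

```python
import re
from pathlib import Path

# Matches JVM heap flags such as -Xms256m or -Xmx1g (unit suffix optional).
HEAP_FLAG = re.compile(r"^-(Xms|Xmx)(\d+)([kKmMgG]?)$")
UNIT = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}

def heap_flags(argv):
    """Return {'Xms': bytes, 'Xmx': bytes} for the heap flags present in argv."""
    found = {}
    for arg in argv:
        m = HEAP_FLAG.match(arg)
        if m:
            name, number, unit = m.groups()
            found[name] = int(number) * UNIT[unit.lower()]
    return found

def scan_proc(proc_root="/proc"):
    """Yield (pid, heap_flags) for every process whose argv[0] ends in 'java'."""
    for cmdline in Path(proc_root).glob("[0-9]*/cmdline"):
        try:
            argv = cmdline.read_bytes().decode(errors="replace").split("\0")
        except OSError:
            continue  # process exited or is not readable
        if argv and argv[0].endswith("java"):
            yield int(cmdline.parent.name), heap_flags(argv)

if __name__ == "__main__":
    total_max = 0
    for pid, flags in scan_proc():
        print(pid, flags)
        total_max += flags.get("Xmx", 0)
    print("sum of -Xmx:", total_max // 1024**2, "MiB")
```

If the sum of the maximum heaps is well above the 10G of RAM, the guest will swap heavily once the heaps fill, whatever else is tuned. Native memory used by each JVM comes on top of the heap and is not counted here.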




Marcy 



-----Original Message-----
From: Linux on 390 Port [mailto:[email protected]] On Behalf Of Daniel Tate
Sent: Monday, July 26, 2010 8:24 AM
To: [email protected]
Subject: [LINUX-390] OOM Condition on SLES11 running WAS - Tuning problems?

We're running WebSphere on a z9 under z/VM; 4 systems out of 8 are live.  It
is running apps that consume around 16GB of memory on a Windows machine.  On
this system, we have allocated 10G of real storage (RAM) and around 35GB of
swap.  When WebSphere starts, it eventually consumes all the memory and
halts the system, but does not panic it.  We are running 64-bit.  I'm a z/VM
novice, so I don't know what to do.

Here is some information from our WAS Admin:
"We are running WebSphere 6.1.0.25 with FP EJB3.0,Webservices and Web 2.0
installed.  There are two nodes running 14 application servers each.  There
are currently 32 applications installed but not currently running.  No
security has been enabled for WebSphere at this time."


At this point I see two problems:

1) Why is the OOM killer not functioning properly?
2) Why is WebSphere performance so awful?

and I have two questions:

1) Does anyone have any PRACTICAL experience/tips for optimizing SLES11 on
z/VM?  So far we've been using dated case studies and Redbooks that seem to
be filled with inaccuracies or outdated information.
2) Is there any way to force a core dump via CP, like you can with the
magic SysRq?

All systems are running the same release and patch level:

[root] bwzld001:~# lsb_release -a
LSB Version:    core-2.0-noarch:core-3.2-noarch:core-4.0-noarch:core-2.0-s390x:core-3.2-s390x:core-4.0-s390x:desktop-4.0-noarch:desktop-4.0-s390:desktop-4.0-s390x:graphics-2.0-noarch:graphics-2.0-s390:graphics-2.0-s390x:graphics-3.2-noarch:graphics-3.2-s390:graphics-3.2-s390x:graphics-4.0-noarch:graphics-4.0-s390:graphics-4.0-s390x
Distributor ID:    SUSE LINUX
Description:    SUSE Linux Enterprise Server 11 (s390x)
Release:    11
Codename:    n/a


Here is a partial top shortly before system death:

top - 08:13:14 up 2 days, 16:08,  2 users,  load average: 51.47, 22.20, 10.25
Tasks: 129 total,   4 running, 125 sleeping,   0 stopped,   0 zombie
Cpu(s): 16.7%us, 81.5%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.3%si,  1.2%st
Mem:  10268344k total, 10220568k used,    47776k free,      548k buffers
Swap: 35764956k total, 35764956k used,        0k free,    56340k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
26850 wasadmin  20   0 1506m 253m 2860 S   18  2.5  16:06.28 java
29870 wasadmin  20   0 1497m 279m 2560 S   15  2.8  15:41.13 java
24607 wasadmin  20   0 1502m 223m 2760 S   13  2.2  16:15.14 java
24641 wasadmin  20   0 7229m 1.3g 3172 S   13 13.1 196:35.52 java
26606 wasadmin  20   0 1438m 272m 6212 S   12  2.7  16:02.77 java
27600 wasadmin  20   0 1553m 258m 2920 S   12  2.6  15:46.57 java
24638 wasadmin  20   0 7368m 1.3g  24m S   10 13.7 206:02.05 java
25609 wasadmin  20   0 1528m 219m 2540 S    9  2.2  16:07.33 java
30258 wasadmin  20   0 1515m 249m 2592 S    7  2.5  15:49.79 java
25780 wasadmin  20   0 1604m 277m 2332 S    6  2.8  16:31.41 java
27106 wasadmin  20   0 1458m 273m 2472 S    6  2.7  15:59.13 java
27336 wasadmin  20   0 1528m 238m 2540 S    5  2.4  15:38.82 java
29164 wasadmin  20   0 1527m 224m 2608 S    5  2.2  16:02.56 java
31400 wasadmin  20   0 1509m 259m 2468 S    5  2.6  15:26.38 java
25244 wasadmin  20   0 1509m 290m 2624 S    5  2.9  16:16.07 java
24769 wasadmin  20   0 1409m 259m 2308 S    5  2.6  16:08.12 java
28796 wasadmin  20   0 1338m 263m 3076 S    4  2.6  15:47.72 java
26185 wasadmin  20   0 1493m 274m 2304 S    2  2.7  16:01.97 java
25968 wasadmin  20   0 1427m 257m 2532 S    1  2.6  15:51.50 java
29495 wasadmin  20   0 1466m 259m 2260 S    1  2.6  15:31.82 java
25080 wasadmin  20   0 1445m 236m 2472 S    0  2.4  15:53.19 java
26410 wasadmin  20   0 1475m 271m 2540 S    0  2.7  15:52.48 java
31027 wasadmin  20   0 1413m 238m 2492 S    0  2.4  15:29.78 java
 3695 wasadmin  20   0  9968 1352 1352 S    0  0.0   0:00.13 bash
24474 wasadmin  20   0 1468m 205m 2472 S    0  2.0  16:03.63 java
24920 wasadmin  20   0 1522m 263m 2616 S    0  2.6  16:06.29 java
25422 wasadmin  20   0 1584m 229m 2284 S    0  2.3  16:02.18 java
27892 wasadmin  20   0 1414m 263m 2648 S    0  2.6  15:45.96 java
28184 wasadmin  20   0 1523m 241m 2320 S    0  2.4  15:42.21 java
28486 wasadmin  20   0 1450m 231m 2288 S    0  2.3  15:46.53 java
30625 wasadmin  20   0 1477m 251m 3024 S    0  2.5  15:44.80 java
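
To compare what the JVMs have mapped with what is actually resident, the VIRT and RES columns of a listing like the one above can be totalled. This is a minimal sketch, assuming each process row sits on a single line in top's default column order and that an unsuffixed value is in KiB; the function names to_kib and sum_columns are invented for the example, and only three rows are included as sample data.

```python
def to_kib(field):
    """Convert a top memory field ('9968', '1506m', '1.3g') to KiB."""
    scale = {"k": 1, "m": 1024, "g": 1024 * 1024}
    suffix = field[-1].lower()
    if suffix in scale:
        return float(field[:-1]) * scale[suffix]
    return float(field)  # no suffix: assumed to be KiB

def sum_columns(rows, command="java"):
    """Sum VIRT and RES (in KiB) over rows whose COMMAND column matches."""
    virt = res = 0.0
    for row in rows:
        cols = row.split()
        # Columns: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        if len(cols) >= 12 and cols[11] == command:
            virt += to_kib(cols[4])
            res += to_kib(cols[5])
    return virt, res

# Three rows copied from the listing; the bash row is skipped by the filter.
sample = [
    "26850 wasadmin  20   0 1506m 253m 2860 S   18  2.5  16:06.28 java",
    "24641 wasadmin  20   0 7229m 1.3g 3172 S   13 13.1 196:35.52 java",
    " 3695 wasadmin  20   0  9968 1352 1352 S    0  0.0   0:00.13 bash",
]
virt_kib, res_kib = sum_columns(sample)
print(f"VIRT {virt_kib / 1024:.0f} MiB, RES {res_kib / 1024:.0f} MiB")
```

RES counts only pages currently in RAM, so pages already pushed out to swap do not show up in it, and VIRT includes address space that may never be touched. Neither total is an exact measure of demand, but a VIRT total far above RAM plus swap is a warning sign.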

-----------------


Here are a few screen grabs from the 3270 console session:

Unless you get a _continuous_flood_ of these messages it means
everything is working fine. Allocations from irqs cannot be
perfectly reliable and the kernel is designed to handle that.
java: page allocation failure. order:0, mode:0x20, alloc_flags:0x7, pflags:0x400040
CPU: 1 Not tainted 2.6.27.45-0.1-default #1
Process java (pid: 28831, task: 00000001ab64c638, ksp: 0000000215bbb5e0)
0000000000000000 000000027fbcf7b0 0000000000000002 0000000000000000
       000000027fbcf850 000000027fbcf7c8 000000027fbcf7c8 00000000003b6696
       00000000014a4e88 0000000000000007 0000000000634e00 0000000000000000
       000000000000000d 0000000000000000 000000027fbcf818 000000000000000e
       00000000003cdc00 000000000010521a 000000027fbcf7b0 000000027fbcf7f8
Call Trace:
( 0000000000105174>  show_trace+0x130/0x134)
  000000000019890a>  __alloc_pages_internal+0x406/0x55c
  00000000001c7056>  cache_grow+0x382/0x458
  00000000001c7440>  cache_alloc_refill+0x314/0x36c
  00000000001c6c12>  kmem_cache_alloc+0x82/0x144
  00000000003228f2>  __alloc_skb+0x82/0x208
  000000000032378e>  dev_alloc_skb+0x36/0x64
  000003e0001a030e>  qeth_core_get_next_skb+0x31e/0x704  [qeth]
  000003e0000d5f8c>  qeth_l3_process_inbound_buffer+0x9c/0x598  [qeth_l3]
  000003e0000d6574>  qeth_l3_qdio_input_handler+0xec/0x268  [qeth_l3]
  000003e0000ebc44>  qdio_kick_inbound_handler+0xbc/0x178  [qdio]
  000003e0000ee58c>  __tiqdio_inbound_processing+0x394/0xdf4  [qdio]
  000000000013a800>  tasklet_action+0x10c/0x1e4
  000000000013b908>  __do_softirq+0xe0/0x1c8
  0000000000110252>  do_softirq+0xaa/0xb0
  000000000013b772>  irq_exit+0xc2/0xcc
  00000000002f6586>  do_IRQ+0x132/0x1c8
  0000000000114148>  io_return+0x0/0x8
  00000000002b850e>  _raw_spin_lock_wait+0x86/0xa4
( 000003e047d6fa00>  0x3e047d6fa00)
  000000000019eb9c>  shrink_page_list+0x1a0/0x584
  000000000019f184>  shrink_inactive_list+0x204/0x5b0
  000000000019f620>  shrink_zone+0xf0/0x1d0
  000000000019f882>  shrink_zones+0xae/0x184
  00000000001a02be>  do_try_to_free_pages+0x96/0x3fc
  00000000001a072c>  try_to_free_pages+0x74/0x7c
  0000000000198730>  __alloc_pages_internal+0x22c/0x55c
  000000000019b5a2>  __do_page_cache_readahead+0x10a/0x2ac
  000000000019b7cc>  do_page_cache_readahead+0x88/0xa8
  000000000019170e>  filemap_fault+0x33a/0x448
  00000000001a55bc>  __do_fault+0x78/0x580
  00000000001a962e>  handle_mm_fault+0x1e6/0x4c0
  00000000003b9e1e>  do_dat_exception+0x29e/0x388
  0000000000113c0c>  sysc_return+0x0/0x8
  0000020000214bde>  0x20000214bde
Mem-Info:
DMA per-cpu:
CPU    0: hi:  186, btch:  31 usd:   0
CPU    1: hi:  186, btch:  31 usd:   0
Normal per-cpu:
CPU    0: hi:  186, btch:  31 usd:   0
CPU    1: hi:  186, btch:  31 usd:   0
Active:1355277 inactive:1132712 dirty:0 writeback:0 unstable:0
 free:9269 slab:17875 mapped:765 pagetables:24402 bounce:0
DMA free:33220kB min:2568kB low:3208kB high:3852kB active:1092112kB inactive:926924kB present:2064384kB pages_scanned:21132286 all_unreclaimable? no
lowmem_reserve[]: 0 8064 8064
Normal free:3856kB min:10276kB low:12844kB high:15412kB active:4328996kB inactive:3603924kB present:8257536kB pages_scanned:44557906 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0
DMA: 101*4kB 32*8kB 473*16kB 195*32kB 49*64kB 30*128kB 8*256kB 3*512kB 8*1024kB = 33220kB
Normal: 0*4kB 0*8kB 1*16kB 0*32kB 0*64kB 0*128kB 1*256kB 1*512kB 3*1024kB = 3856kB
9283 total pagecache pages
0 pages in swap cache
Swap cache stats: add 34513958, delete 34513958, find 6612011/8393146
Free swap  = 0kB
Total swap = 35764956kB
2621440 pages RAM
54354 pages reserved
22356 pages shared
2538214 pages non-shared
The following is only an harmless informational message.
Unless you get a _continuous_flood_ of these messages it means
everything is working fine. Allocations from irqs cannot be
perfectly reliable and the kernel is designed to handle that.
java: page allocation failure. order:0, mode:0x20, alloc_flags:0x7, pflags:0x400040
CPU: 1 Not tainted 2.6.27.45-0.1-default #1
Process java (pid: 28831, task: 00000001ab64c638, ksp: 0000000215bbb5e0)
0000000000000000 000000027fbcf7b0 0000000000000002 0000000000000000
       000000027fbcf850 000000027fbcf7c8 000000027fbcf7c8 00000000003b6696
       00000000014a5dd3 0000000000000007 0000000000634e00 0000000000000000
       000000000000000d 0000000000000000 000000027fbcf818 000000000000000e
       00000000003cdc00 000000000010521a 000000027fbcf7b0 000000027fbcf7f8
Call Trace:
( 0000000000105174>  show_trace+0x130/0x134)
  000000000019890a>  __alloc_pages_internal+0x406/0x55c
  00000000001c7056>  cache_grow+0x382/0x458
  00000000001c7440>  cache_alloc_refill+0x314/0x36c
  00000000001c6c12>  kmem_cache_alloc+0x82/0x144
  00000000003228f2>  __alloc_skb+0x82/0x208
  000000000032378e>  dev_alloc_skb+0x36/0x64
  000003e0001a030e>  qeth_core_get_next_skb+0x31e/0x704  [qeth]
  000003e0000d5f8c>  qeth_l3_process_inbound_buffer+0x9c/0x598  [qeth_l3]
  000003e0000d6574>  qeth_l3_qdio_input_handler+0xec/0x268  [qeth_l3]
  000003e0000ebc44>  qdio_kick_inbound_handler+0xbc/0x178  [qdio]
  000003e0000ee58c>  __tiqdio_inbound_processing+0x394/0xdf4  [qdio]
  000000000013a800>  tasklet_action+0x10c/0x1e4
  000000000013b908>  __do_softirq+0xe0/0x1c8
  0000000000110252>  do_softirq+0xaa/0xb0
  000000000013b772>  irq_exit+0xc2/0xcc
  00000000002f6586>  do_IRQ+0x132/0x1c8
  0000000000114148>  io_return+0x0/0x8
  00000000002b850e>  _raw_spin_lock_wait+0x86/0xa4
( 000003e047d6fa00>  0x3e047d6fa00)
  000000000019eb9c>  shrink_page_list+0x1a0/0x584
  000000000019f184>  shrink_inactive_list+0x204/0x5b0
  000000000019f620>  shrink_zone+0xf0/0x1d0
  000000000019f882>  shrink_zones+0xae/0x184
  00000000001a02be>  do_try_to_free_pages+0x96/0x3fc
  00000000001a072c>  try_to_free_pages+0x74/0x7c
  0000000000198730>  __alloc_pages_internal+0x22c/0x55c
  000000000019b5a2>  __do_page_cache_readahead+0x10a/0x2ac
  000000000019b7cc>  do_page_cache_readahead+0x88/0xa8
  000000000019170e>  filemap_fault+0x33a/0x448
  00000000001a55bc>  __do_fault+0x78/0x580
  00000000001a962e>  handle_mm_fault+0x1e6/0x4c0
  00000000003b9e1e>  do_dat_exception+0x29e/0x388
  0000000000113c0c>  sysc_return+0x0/0x8
  0000020000214bde>  0x20000214bde
Mem-Info:
DMA per-cpu:
CPU    0: hi:  186, btch:  31 usd:   0
CPU    1: hi:  186, btch:  31 usd:   0
Normal per-cpu:
CPU    0: hi:  186, btch:  31 usd:   0
CPU    1: hi:  186, btch:  31 usd:   0
Active:1355277 inactive:1132712 dirty:0 writeback:0 unstable:0
 free:9269 slab:17875 mapped:765 pagetables:24402 bounce:0
DMA free:33220kB min:2568kB low:3208kB high:3852kB active:1092112kB inactive:926924kB present:2064384kB pages_scanned:21132286 all_unreclaimable? no
lowmem_reserve[]: 0 8064 8064
Normal free:3856kB min:10276kB low:12844kB high:15412kB active:4328996kB inactive:3603924kB present:8257536kB pages_scanned:44557906 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0
DMA: 101*4kB 32*8kB 473*16kB 195*32kB 49*64kB 30*128kB 8*256kB 3*512kB 8*1024kB = 33220kB
Normal: 0*4kB 0*8kB 1*16kB 0*32kB 0*64kB 0*128kB 1*256kB 1*512kB 3*1024kB = 3856kB
9283 total pagecache pages
0 pages in swap cache
Swap cache stats: add 34513958, delete 34513958, find 6612011/8393146
Free swap  = 0kB
Total swap = 35764956kB
2621440 pages RAM
54354 pages reserved
22356 pages shared
2538214 pages non-shared
__ratelimit: 4 callbacks suppressed
The following is only an harmless informational message.
Unless you get a _continuous_flood_ of these messages it means
everything is working fine. Allocations from irqs cannot be
perfectly reliable and the kernel is designed to handle that.
java: page allocation failure. order:0, mode:0x20, alloc_flags:0x7, pflags:0x400040
CPU: 1 Not tainted 2.6.27.45-0.1-default #1
Process java (pid: 28831, task: 00000001ab64c638, ksp: 0000000215bbb5e0)
0000000000000000 000000027fbcf7b0 0000000000000002 0000000000000000
       000000027fbcf850 000000027fbcf7c8 000000027fbcf7c8 00000000003b6696
*etc., etc., for HUNDREDS of pages...*
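
The Mem-Info figures in the dumps above can be cross-checked with a little arithmetic. The sketch below assumes a 4 kB page size (matching the smallest block size in the free lists) and uses a made-up helper name, buddy_total_kib; the input strings are the DMA and Normal free-list lines from the log with their line wraps joined.

```python
import re

def buddy_total_kib(line):
    """Sum the 'count*sizekB' terms of a per-zone free-list line."""
    terms = re.findall(r"(\d+)\*(\d+)kB", line)
    return sum(int(count) * int(size) for count, size in terms)

dma = ("DMA: 101*4kB 32*8kB 473*16kB 195*32kB 49*64kB "
       "30*128kB 8*256kB 3*512kB 8*1024kB")
normal = ("Normal: 0*4kB 0*8kB 1*16kB 0*32kB 0*64kB "
          "0*128kB 1*256kB 1*512kB 3*1024kB")

PAGE_KIB = 4  # assumption: 4 kB pages

# Each zone's free list adds up to the total the kernel printed for it.
assert buddy_total_kib(dma) == 33220
assert buddy_total_kib(normal) == 3856

# 'free:9269' pages equals the two zones' free kB combined.
assert 9269 * PAGE_KIB == 33220 + 3856

# '2621440 pages RAM' corresponds to the 10G allocated to the guest.
print(2621440 * PAGE_KIB // (1024 * 1024), "GiB of RAM pages")
```

The Normal zone is the telling one: only 3856 kB free against a min watermark of 10276 kB, with free swap at 0 kB, so reclaim has nowhere left to put anonymous pages.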

----------------------------------------------------------------------
For LINUX-390 subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO LINUX-390 or visit
http://www.marist.edu/htbin/wlvindex?LINUX-390
----------------------------------------------------------------------
For more information on Linux on System z, visit
http://wiki.linuxvm.org/
