Re: OOM Condition on SLES11 running WAS - Tuning problems?

Marcy Cortes Mon, 26 Jul 2010 14:06:24 -0700

I was going to suggest taking a dump and opening a ticket with Novell, but it
looks like you aren't on SP1, and so you are unsupported.
Any way you could apply SP1 and try again?



Marcy

“This message may contain confidential and/or privileged information. If you 
are not the addressee or authorized to receive this for the addressee, you must 
not use, copy, disclose, or take any action based on this message or any 
information herein. If you have received this message in error, please advise 
the sender immediately by reply e-mail and delete this message. Thank you for 
your cooperation."


-----Original Message-----
From: Linux on 390 Port [mailto:[email protected]] On Behalf Of Daniel Tate
Sent: Monday, July 26, 2010 11:28 AM
To: [email protected]
Subject: Re: [LINUX-390] OOM Condition on SLES11 running WAS - Tuning problems?

Yeah, I saw that. The problem is that these same apps run in 16GB of memory on a
Windows box.

We have 28 JVMs, and their min/max heap sizes are set to 50/256.
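A quick back-of-the-envelope check of those heap figures, as a sketch that assumes "50/256" means -Xms50m/-Xmx256m per JVM (the thread does not state the units). Java heap is only part of a JVM's footprint; native memory, thread stacks and JIT code come on top, so the heap total is a lower bound on what the JVMs can grow to.

```python
# Assumption: "50/256" are the per-JVM min/max heap sizes in MB (-Xms50m / -Xmx256m).
JVM_COUNT = 28
MIN_HEAP_MB = 50
MAX_HEAP_MB = 256
REAL_STORAGE_MB = 10 * 1024  # the 10G of real storage given to the guest


def aggregate_heap_mb(count: int, per_jvm_mb: int) -> int:
    """Total Java heap if every JVM sits at per_jvm_mb (heap only, no native overhead)."""
    return count * per_jvm_mb


min_total = aggregate_heap_mb(JVM_COUNT, MIN_HEAP_MB)  # 1400 MB
max_total = aggregate_heap_mb(JVM_COUNT, MAX_HEAP_MB)  # 7168 MB
print(f"heap at -Xms: {min_total} MB, at -Xmx: {max_total} MB "
      f"({max_total / REAL_STORAGE_MB:.0%} of {REAL_STORAGE_MB} MB real storage)")
```

On heap alone the 28 JVMs fit in 10G, so whatever pushed the guest into 35G of swap has to show up in the per-process resident sizes rather than in the -Xmx settings.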

On Mon, Jul 26, 2010 at 11:07 AM, Marcy Cortes <[email protected]> wrote:

> First of all, you've run out of memory on that server (Swap: 35764956k
> total, 35764956k used).
> It ate all of the 10G of RAM and all of the 35G of swap.
> How many JVMs are running, and what are their min/max heap sizes?
>
>
>
> Marcy
>
>
> -----Original Message-----
> From: Linux on 390 Port [mailto:[email protected]] On Behalf Of Daniel Tate
> Sent: Monday, July 26, 2010 8:24 AM
> To: [email protected]
> Subject: [LINUX-390] OOM Condition on SLES11 running WAS - Tuning problems?
>
> We're running WebSphere on a z9 under z/VM; 4 of the 8 systems are live.
> It runs apps that consume around 16GB of memory on a Windows machine. On
> this system we have allocated 10G of real storage (RAM) and around 35GB of
> swap. When WebSphere starts, it eventually consumes all the memory and halts
> the system, but does not panic it. We are running 64-bit. I'm a z/VM
> novice, so I don't know much about what to do.
>
> Here is some information from our WAS Admin:
> "We are running WebSphere 6.1.0.25 with the EJB 3.0, Web Services and Web 2.0
> Feature Packs installed. There are two nodes running 14 application servers
> each. There are currently 32 applications installed but not currently
> running. No security has been enabled for WebSphere at this time."
>
>
> At this point I see two problems:
>
> 1) Why is the OOM killer not functioning properly?
> 2) Why is WebSphere performance so awful?
>
> I also have two questions:
>
> 1) Does anyone have any PRACTICAL experience/tips to optimize SLES11 on
> z/VM? So far we've been using dated case studies and Redbooks that seem to
> be filled with inaccuracies or outdated information.
> 2) Is there any way to force a core dump from CP, as you can with the
> magic SysRq key?
>
> All systems are running the same release and patch level:
>
> [root] bwzld001:~# lsb_release -a
> LSB Version:    core-2.0-noarch:core-3.2-noarch:core-4.0-noarch:core-2.0-s390x:core-3.2-s390x:core-4.0-s390x:desktop-4.0-noarch:desktop-4.0-s390:desktop-4.0-s390x:graphics-2.0-noarch:graphics-2.0-s390:graphics-2.0-s390x:graphics-3.2-noarch:graphics-3.2-s390:graphics-3.2-s390x:graphics-4.0-noarch:graphics-4.0-s390:graphics-4.0-s390x
> Distributor ID:    SUSE LINUX
> Description:    SUSE Linux Enterprise Server 11 (s390x)
> Release:    11
> Codename:    n/a
>
>
> Here is partial top output from shortly before the system died:
>
> top - 08:13:14 up 2 days, 16:08,  2 users,  load average: 51.47, 22.20, 10.25
> Tasks: 129 total,   4 running, 125 sleeping,   0 stopped,   0 zombie
> Cpu(s): 16.7%us, 81.5%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.3%si,  1.2%st
> Mem:  10268344k total, 10220568k used,    47776k free,      548k buffers
> Swap: 35764956k total, 35764956k used,        0k free,    56340k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 26850 wasadmin  20   0 1506m 253m 2860 S   18  2.5  16:06.28 java
> 29870 wasadmin  20   0 1497m 279m 2560 S   15  2.8  15:41.13 java
> 24607 wasadmin  20   0 1502m 223m 2760 S   13  2.2  16:15.14 java
> 24641 wasadmin  20   0 7229m 1.3g 3172 S   13 13.1 196:35.52 java
> 26606 wasadmin  20   0 1438m 272m 6212 S   12  2.7  16:02.77 java
> 27600 wasadmin  20   0 1553m 258m 2920 S   12  2.6  15:46.57 java
> 24638 wasadmin  20   0 7368m 1.3g  24m S   10 13.7 206:02.05 java
> 25609 wasadmin  20   0 1528m 219m 2540 S    9  2.2  16:07.33 java
> 30258 wasadmin  20   0 1515m 249m 2592 S    7  2.5  15:49.79 java
> 25780 wasadmin  20   0 1604m 277m 2332 S    6  2.8  16:31.41 java
> 27106 wasadmin  20   0 1458m 273m 2472 S    6  2.7  15:59.13 java
> 27336 wasadmin  20   0 1528m 238m 2540 S    5  2.4  15:38.82 java
> 29164 wasadmin  20   0 1527m 224m 2608 S    5  2.2  16:02.56 java
> 31400 wasadmin  20   0 1509m 259m 2468 S    5  2.6  15:26.38 java
> 25244 wasadmin  20   0 1509m 290m 2624 S    5  2.9  16:16.07 java
> 24769 wasadmin  20   0 1409m 259m 2308 S    5  2.6  16:08.12 java
> 28796 wasadmin  20   0 1338m 263m 3076 S    4  2.6  15:47.72 java
> 26185 wasadmin  20   0 1493m 274m 2304 S    2  2.7  16:01.97 java
> 25968 wasadmin  20   0 1427m 257m 2532 S    1  2.6  15:51.50 java
> 29495 wasadmin  20   0 1466m 259m 2260 S    1  2.6  15:31.82 java
> 25080 wasadmin  20   0 1445m 236m 2472 S    0  2.4  15:53.19 java
> 26410 wasadmin  20   0 1475m 271m 2540 S    0  2.7  15:52.48 java
> 31027 wasadmin  20   0 1413m 238m 2492 S    0  2.4  15:29.78 java
>  3695 wasadmin  20   0  9968 1352 1352 S    0  0.0   0:00.13 bash
> 24474 wasadmin  20   0 1468m 205m 2472 S    0  2.0  16:03.63 java
> 24920 wasadmin  20   0 1522m 263m 2616 S    0  2.6  16:06.29 java
> 25422 wasadmin  20   0 1584m 229m 2284 S    0  2.3  16:02.18 java
> 27892 wasadmin  20   0 1414m 263m 2648 S    0  2.6  15:45.96 java
> 28184 wasadmin  20   0 1523m 241m 2320 S    0  2.4  15:42.21 java
> 28486 wasadmin  20   0 1450m 231m 2288 S    0  2.3  15:46.53 java
> 30625 wasadmin  20   0 1477m 251m 3024 S    0  2.5  15:44.80 java
>
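To see how much of the 10G the java processes themselves hold, the RES column of the listing above can be totalled. This is a sketch that assumes the column order shown in this top output (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND) and treats unsuffixed values as KiB with `m`/`g` suffixes as MiB/GiB; `to_kb` and `total_res_kb` are illustrative helper names.

```python
# Sum the RES column of `top` rows for one command name.
# Assumed layout (as in the listing above): PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
UNITS_KB = {"": 1, "k": 1, "m": 1024, "g": 1024 * 1024}


def to_kb(field: str) -> float:
    """Convert a top memory field such as '253m', '1.3g' or '1352' (KiB) to KiB."""
    suffix = field[-1].lower() if field[-1].isalpha() else ""
    number = field[:-1] if suffix else field
    return float(number) * UNITS_KB[suffix]


def total_res_kb(rows, command="java"):
    """Sum RES (6th column) over rows whose COMMAND column equals `command`."""
    total = 0.0
    for row in rows:
        cols = row.split()
        if len(cols) >= 12 and cols[11] == command:
            total += to_kb(cols[5])
    return total


rows = [
    "26850 wasadmin  20   0 1506m 253m 2860 S   18  2.5  16:06.28 java",
    "24641 wasadmin  20   0 7229m 1.3g 3172 S   13 13.1 196:35.52 java",
    " 3695 wasadmin  20   0  9968 1352 1352 S    0  0.0   0:00.13 bash",
]
print(f"java RES: {total_res_kb(rows) / 1024:.1f} MiB")
```

Feeding it all 30 java rows gives the aggregate resident size; because top rounds values such as 1.3g, treat the result as approximate.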
> -----------------
>
>
> Here are a few screen grabs from the 3270 console session:
>
> Unless you get a _continuous_flood_ of these messages it means
> everything is working fine. Allocations from irqs cannot be
> perfectly reliable and the kernel is designed to handle that.
> java: page allocation failure. order:0, mode:0x20, alloc_flags:0x7, pflags:0x400040
> CPU: 1 Not tainted 2.6.27.45-0.1-default #1
> Process java (pid: 28831, task: 00000001ab64c638, ksp: 0000000215bbb5e0)
> 0000000000000000 000000027fbcf7b0 0000000000000002 0000000000000000
>       000000027fbcf850 000000027fbcf7c8 000000027fbcf7c8 00000000003b6696
>       00000000014a4e88 0000000000000007 0000000000634e00 0000000000000000
>       000000000000000d 0000000000000000 000000027fbcf818 000000000000000e
>       00000000003cdc00 000000000010521a 000000027fbcf7b0 000000027fbcf7f8
> Call Trace:
> ( 0000000000105174>  show_trace+0x130/0x134)
>  000000000019890a>  __alloc_pages_internal+0x406/0x55c
>  00000000001c7056>  cache_grow+0x382/0x458
>  00000000001c7440>  cache_alloc_refill+0x314/0x36c
>  00000000001c6c12>  kmem_cache_alloc+0x82/0x144
>  00000000003228f2>  __alloc_skb+0x82/0x208
>  000000000032378e>  dev_alloc_skb+0x36/0x64
>  000003e0001a030e>  qeth_core_get_next_skb+0x31e/0x704  eth
>  000003e0000d5f8c>  qeth_l3_process_inbound_buffer+0x9c/0x598  eth_l3
>  000003e0000d6574>  qeth_l3_qdio_input_handler+0xec/0x268  eth_l3
>  000003e0000ebc44>  qdio_kick_inbound_handler+0xbc/0x178  dio
>  000003e0000ee58c>  __tiqdio_inbound_processing+0x394/0xdf4  dio
>  000000000013a800>  tasklet_action+0x10c/0x1e4
>  000000000013b908>  __do_softirq+0xe0/0x1c8
>  0000000000110252>  do_softirq+0xaa/0xb0
>  000000000013b772>  irq_exit+0xc2/0xcc
>  00000000002f6586>  do_IRQ+0x132/0x1c8
>  0000000000114148>  io_return+0x0/0x8
>  00000000002b850e>  _raw_spin_lock_wait+0x86/0xa4
> ( 000003e047d6fa00>  0x3e047d6fa00)
>  000000000019eb9c>  shrink_page_list+0x1a0/0x584
>  000000000019f184>  shrink_inactive_list+0x204/0x5b0
>  000000000019f620>  shrink_zone+0xf0/0x1d0
>  000000000019f882>  shrink_zones+0xae/0x184
>  00000000001a02be>  do_try_to_free_pages+0x96/0x3fc
>  00000000001a072c>  try_to_free_pages+0x74/0x7c
>  0000000000198730>  __alloc_pages_internal+0x22c/0x55c
>  000000000019b5a2>  __do_page_cache_readahead+0x10a/0x2ac
>  000000000019b7cc>  do_page_cache_readahead+0x88/0xa8
>  000000000019170e>  filemap_fault+0x33a/0x448
>  00000000001a55bc>  __do_fault+0x78/0x580
>  00000000001a962e>  handle_mm_fault+0x1e6/0x4c0
>  00000000003b9e1e>  do_dat_exception+0x29e/0x388
>  0000000000113c0c>  sysc_return+0x0/0x8
>  0000020000214bde>  0x20000214bde
> Mem-Info:
> DMA per-cpu:
> CPU    0: hi:  186, btch:  31 usd:   0
> CPU    1: hi:  186, btch:  31 usd:   0
> Normal per-cpu:
> CPU    0: hi:  186, btch:  31 usd:   0
> CPU    1: hi:  186, btch:  31 usd:   0
> Active:1355277 inactive:1132712 dirty:0 writeback:0 unstable:0
>  free:9269 slab:17875 mapped:765 pagetables:24402 bounce:0
> DMA free:33220kB min:2568kB low:3208kB high:3852kB active:1092112kB inactive:926924kB present:2064384kB pages_scanned:21132286 all_unreclaimable? no
> lowmem_reserve[]: 0 8064 8064
> Normal free:3856kB min:10276kB low:12844kB high:15412kB active:4328996kB inactive:3603924kB present:8257536kB pages_scanned:44557906 all_unreclaimable? yes
> lowmem_reserve[]: 0 0 0
> DMA: 101*4kB 32*8kB 473*16kB 195*32kB 49*64kB 30*128kB 8*256kB 3*512kB 8*1024kB = 33220kB
> Normal: 0*4kB 0*8kB 1*16kB 0*32kB 0*64kB 0*128kB 1*256kB 1*512kB 3*1024kB = 3856kB
> 9283 total pagecache pages
> 0 pages in swap cache
> Swap cache stats: add 34513958, delete 34513958, find 6612011/8393146
> Free swap  = 0kB
> Total swap = 35764956kB
> 2621440 pages RAM
> 54354 pages reserved
> 22356 pages shared
> 2538214 pages non-shared
> The following is only an harmless informational message.
> Unless you get a _continuous_flood_ of these messages it means
> everything is working fine. Allocations from irqs cannot be
> perfectly reliable and the kernel is designed to handle that.
> java: page allocation failure. order:0, mode:0x20, alloc_flags:0x7, pflags:0x400040
> CPU: 1 Not tainted 2.6.27.45-0.1-default #1
> Process java (pid: 28831, task: 00000001ab64c638, ksp: 0000000215bbb5e0)
> 0000000000000000 000000027fbcf7b0 0000000000000002 0000000000000000
>       000000027fbcf850 000000027fbcf7c8 000000027fbcf7c8 00000000003b6696
>       00000000014a5dd3 0000000000000007 0000000000634e00 0000000000000000
>       000000000000000d 0000000000000000 000000027fbcf818 000000000000000e
>       00000000003cdc00 000000000010521a 000000027fbcf7b0 000000027fbcf7f8
> Call Trace:
> ( 0000000000105174>  show_trace+0x130/0x134)
>  000000000019890a>  __alloc_pages_internal+0x406/0x55c
>  00000000001c7056>  cache_grow+0x382/0x458
>  00000000001c7440>  cache_alloc_refill+0x314/0x36c
>  00000000001c6c12>  kmem_cache_alloc+0x82/0x144
>  00000000003228f2>  __alloc_skb+0x82/0x208
>  000000000032378e>  dev_alloc_skb+0x36/0x64
>  000003e0001a030e>  qeth_core_get_next_skb+0x31e/0x704  eth
>  000003e0000d5f8c>  qeth_l3_process_inbound_buffer+0x9c/0x598  eth_l3
>  000003e0000d6574>  qeth_l3_qdio_input_handler+0xec/0x268  eth_l3
>  000003e0000ebc44>  qdio_kick_inbound_handler+0xbc/0x178  dio
>  000003e0000ee58c>  __tiqdio_inbound_processing+0x394/0xdf4  dio
>  000000000013a800>  tasklet_action+0x10c/0x1e4
>  000000000013b908>  __do_softirq+0xe0/0x1c8
>  0000000000110252>  do_softirq+0xaa/0xb0
>  000000000013b772>  irq_exit+0xc2/0xcc
>  00000000002f6586>  do_IRQ+0x132/0x1c8
>  0000000000114148>  io_return+0x0/0x8
>  00000000002b850e>  _raw_spin_lock_wait+0x86/0xa4
> ( 000003e047d6fa00>  0x3e047d6fa00)
>  000000000019eb9c>  shrink_page_list+0x1a0/0x584
>  000000000019f184>  shrink_inactive_list+0x204/0x5b0
>  000000000019f620>  shrink_zone+0xf0/0x1d0
>  000000000019f882>  shrink_zones+0xae/0x184
>  00000000001a02be>  do_try_to_free_pages+0x96/0x3fc
>  00000000001a072c>  try_to_free_pages+0x74/0x7c
>  0000000000198730>  __alloc_pages_internal+0x22c/0x55c
>  000000000019b5a2>  __do_page_cache_readahead+0x10a/0x2ac
>  000000000019b7cc>  do_page_cache_readahead+0x88/0xa8
>  000000000019170e>  filemap_fault+0x33a/0x448
>  00000000001a55bc>  __do_fault+0x78/0x580
>  00000000001a962e>  handle_mm_fault+0x1e6/0x4c0
>  00000000003b9e1e>  do_dat_exception+0x29e/0x388
>  0000000000113c0c>  sysc_return+0x0/0x8
>  0000020000214bde>  0x20000214bde
> Mem-Info:
> DMA per-cpu:
> CPU    0: hi:  186, btch:  31 usd:   0
> CPU    1: hi:  186, btch:  31 usd:   0
> Normal per-cpu:
> CPU    0: hi:  186, btch:  31 usd:   0
> CPU    1: hi:  186, btch:  31 usd:   0
> Active:1355277 inactive:1132712 dirty:0 writeback:0 unstable:0
>  free:9269 slab:17875 mapped:765 pagetables:24402 bounce:0
> DMA free:33220kB min:2568kB low:3208kB high:3852kB active:1092112kB inactive:926924kB present:2064384kB pages_scanned:21132286 all_unreclaimable? no
> lowmem_reserve[]: 0 8064 8064
> Normal free:3856kB min:10276kB low:12844kB high:15412kB active:4328996kB inactive:3603924kB present:8257536kB pages_scanned:44557906 all_unreclaimable? yes
> lowmem_reserve[]: 0 0 0
> DMA: 101*4kB 32*8kB 473*16kB 195*32kB 49*64kB 30*128kB 8*256kB 3*512kB 8*1024kB = 33220kB
> Normal: 0*4kB 0*8kB 1*16kB 0*32kB 0*64kB 0*128kB 1*256kB 1*512kB 3*1024kB = 3856kB
> 9283 total pagecache pages
> 0 pages in swap cache
> Swap cache stats: add 34513958, delete 34513958, find 6612011/8393146
> Free swap  = 0kB
> Total swap = 35764956kB
> 2621440 pages RAM
> 54354 pages reserved
> 22356 pages shared
> 2538214 pages non-shared
> __ratelimit: 4 callbacks suppressed
> The following is only an harmless informational message.
> Unless you get a _continuous_flood_ of these messages it means
> everything is working fine. Allocations from irqs cannot be
> perfectly reliable and the kernel is designed to handle that.
> java: page allocation failure. order:0, mode:0x20, alloc_flags:0x7, pflags:0x400040
> CPU: 1 Not tainted 2.6.27.45-0.1-default #1
> Process java (pid: 28831, task: 00000001ab64c638, ksp: 0000000215bbb5e0)
> 0000000000000000 000000027fbcf7b0 0000000000000002 0000000000000000
>       000000027fbcf850 000000027fbcf7c8 000000027fbcf7c8 00000000003b6696
> *etc., etc., for HUNDREDS of pages...*
>
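The Mem-Info blocks in the console log can be cross-checked mechanically. The sketch below uses parsing rules inferred only from the lines quoted here and assumes 4 KiB pages on s390x; `buddy_free_kb`, `zone_fields`, `below_min` and `usable_kb` are illustrative helper names, not kernel interfaces. It sums a buddy free-list line, compares a zone's free memory with its min watermark, and converts the page counts to kB. The Normal zone sits well below min (3856kB free vs. 10276kB), which fits the order-0 `mode:0x20` (GFP_ATOMIC) failures in the qeth receive path, and (2621440 - 54354) pages x 4kB matches the "Mem: 10268344k total" line from top.

```python
# Cross-check figures from the Mem-Info dump quoted above.
# Parsing rules are inferred only from these log lines; helper names are illustrative.
import re

PAGE_KB = 4  # s390x uses 4 KiB pages


def buddy_free_kb(line: str) -> int:
    """Sum a buddy free-list line, e.g. 'DMA: 101*4kB ... 8*1024kB = 33220kB'."""
    body = line.split(":", 1)[1].split("=")[0]  # drop the zone name and the printed total
    return sum(int(count) * int(size) for count, size in re.findall(r"(\d+)\*(\d+)kB", body))


def zone_fields(line: str) -> dict:
    """Extract free/min/low/high (kB) from a zone line such as 'Normal free:3856kB min:...'."""
    return {name: int(kb) for name, kb in re.findall(r"\b(free|min|low|high):(\d+)kB", line)}


def below_min(zone: dict) -> bool:
    # Simplified check: the kernel's real zone_watermark_ok() also applies lowmem_reserve
    # and lets high-priority/atomic allocations dip somewhat below min.
    return zone["free"] < zone["min"]


def usable_kb(pages_ram: int, pages_reserved: int) -> int:
    """Pages not reserved by the kernel, converted to kB."""
    return (pages_ram - pages_reserved) * PAGE_KB


dma = "DMA: 101*4kB 32*8kB 473*16kB 195*32kB 49*64kB 30*128kB 8*256kB 3*512kB 8*1024kB = 33220kB"
normal = "Normal free:3856kB min:10276kB low:12844kB high:15412kB active:4328996kB"

print(buddy_free_kb(dma))              # 33220, matches the total the kernel printed
print(below_min(zone_fields(normal)))  # True: 3856kB free vs. a 10276kB min watermark
print(usable_kb(2621440, 54354))       # 10268344, the "Mem: 10268344k total" in top
```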
> ----------------------------------------------------------------------
> For LINUX-390 subscribe / signoff / archive access instructions,
> send email to [email protected] with the message: INFO LINUX-390 or
> visit
> http://www.marist.edu/htbin/wlvindex?LINUX-390
> ----------------------------------------------------------------------
> For more information on Linux on System z, visit
> http://wiki.linuxvm.org/
>

