Re: OOM Condition on SLES11 running WAS - Tuning problems?

Mrohs, Ray Mon, 26 Jul 2010 12:49:38 -0700

Set swappiness to 0. Can you just start 1 node as a test?

Ray
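
A minimal sketch of checking and changing the setting, assuming the usual /proc/sys/vm/swappiness interface and a 0..100 value range for a kernel of this generation. The helper names are made up for illustration, and writing the real file needs root:

```python
from pathlib import Path

# Standard procfs location of the tunable; the helpers below are hypothetical.
SWAPPINESS_PATH = Path("/proc/sys/vm/swappiness")


def read_swappiness(path: Path = SWAPPINESS_PATH) -> int:
    """Return the current vm.swappiness value as an integer."""
    return int(path.read_text().strip())


def write_swappiness(value: int, path: Path = SWAPPINESS_PATH) -> None:
    """Write a new vm.swappiness value (assumed valid range: 0..100)."""
    if not 0 <= value <= 100:
        raise ValueError(f"swappiness out of range: {value}")
    path.write_text(f"{value}\n")


if __name__ == "__main__":
    # Only report the current value; changing it is left to the operator.
    if SWAPPINESS_PATH.exists():
        print("current vm.swappiness:", read_swappiness())
```

The same change can be made from a root shell with `sysctl -w vm.swappiness=0`, and kept across reboots with a `vm.swappiness = 0` line in /etc/sysctl.conf.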


> -----Original Message-----
> From: Linux on 390 Port [mailto:[email protected]] On Behalf Of Daniel Tate
> Sent: Monday, July 26, 2010 2:28 PM
> To: [email protected]
> Subject: Re: OOM Condition on SLES11 running WAS - Tuning problems?
>
> Yeah, I saw that. The problem is that these same apps run on 16GB of
> memory on a Windows box.
>
> We have 28 JVMs, and heap sizes are set to 50/256.
>
> On Mon, Jul 26, 2010 at 11:07 AM, Marcy Cortes <
> [email protected]> wrote:
>
> > First of all, you've run out of memory on that server
> > (Swap: 35764956k total, 35764956k used).
> > It ate all of the 10G and all of the 35G of swap.
> > How many JVMs are running and what are their min/max heap sizes?
> >
> >
> >
> > Marcy
> >
> >
> > -----Original Message-----
> > From: Linux on 390 Port [mailto:[email protected]] On Behalf Of Daniel Tate
> > Sent: Monday, July 26, 2010 8:24 AM
> > To: [email protected]
> > Subject: [LINUX-390] OOM Condition on SLES11 running WAS - Tuning problems?
> >
> > We're running WebSphere on a z9 under z/VM. 4 systems are live out of 8.
> > It is running apps that consume around 16GB of memory on a Windows
> > machine. On this system we have allocated 10G of real storage (RAM) and
> > around 35GB of swap. When WebSphere starts, it eventually consumes all
> > the memory and halts, but does not panic, the system. We are running
> > 64-bit. I'm a z/VM novice, so I don't know what to do.
> >
> > Here is some information from our WAS Admin:
> > "We are running WebSphere 6.1.0.25 with FP EJB3.0, Webservices and
> > Web 2.0 installed. There are two nodes running 14 application servers
> > each. There are currently 32 applications installed but not currently
> > running. No security has been enabled for WebSphere at this time."
> >
> >
> > At this point I see two problems:
> >
> > 1) Why is the OOM killer not functioning properly?
> > 2) Why is WebSphere performance so awful?
> >
> > And I have two questions:
> >
> > 1) Does anyone have any PRACTICAL experience/tips for optimizing SLES11
> > on z/VM? So far we've been using dated case studies and Redbooks that
> > seem to be filled with inaccuracies or outdated information.
> > 2) Is there any way to force a core dump via CP, like you can with the
> > magic SysRq?
> >
> > All systems are running the same release and patch level:
> >
> > [root] bwzld001:~# lsb_release -a
> > LSB Version:    core-2.0-noarch:core-3.2-noarch:core-4.0-noarch:core-2.0-s390x:core-3.2-s390x:core-4.0-s390x:desktop-4.0-noarch:desktop-4.0-s390:desktop-4.0-s390x:graphics-2.0-noarch:graphics-2.0-s390:graphics-2.0-s390x:graphics-3.2-noarch:graphics-3.2-s390:graphics-3.2-s390x:graphics-4.0-noarch:graphics-4.0-s390:graphics-4.0-s390x
> > Distributor ID:    SUSE LINUX
> > Description:    SUSE Linux Enterprise Server 11 (s390x)
> > Release:    11
> > Codename:    n/a
> >
> >
> > Here is a partial top shortly before system death:
> >
> > top - 08:13:14 up 2 days, 16:08,  2 users,  load average: 51.47, 22.20, 10.25
> > Tasks: 129 total,   4 running, 125 sleeping,   0 stopped,   0 zombie
> > Cpu(s): 16.7%us, 81.5%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.3%si,  1.2%st
> > Mem:  10268344k total, 10220568k used,    47776k free,      548k buffers
> > Swap: 35764956k total, 35764956k used,        0k free,    56340k cached
> >
> >   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> > 26850 wasadmin  20   0 1506m 253m 2860 S   18  2.5  16:06.28 java
> > 29870 wasadmin  20   0 1497m 279m 2560 S   15  2.8  15:41.13 java
> > 24607 wasadmin  20   0 1502m 223m 2760 S   13  2.2  16:15.14 java
> > 24641 wasadmin  20   0 7229m 1.3g 3172 S   13 13.1 196:35.52 java
> > 26606 wasadmin  20   0 1438m 272m 6212 S   12  2.7  16:02.77 java
> > 27600 wasadmin  20   0 1553m 258m 2920 S   12  2.6  15:46.57 java
> > 24638 wasadmin  20   0 7368m 1.3g  24m S   10 13.7 206:02.05 java
> > 25609 wasadmin  20   0 1528m 219m 2540 S    9  2.2  16:07.33 java
> > 30258 wasadmin  20   0 1515m 249m 2592 S    7  2.5  15:49.79 java
> > 25780 wasadmin  20   0 1604m 277m 2332 S    6  2.8  16:31.41 java
> > 27106 wasadmin  20   0 1458m 273m 2472 S    6  2.7  15:59.13 java
> > 27336 wasadmin  20   0 1528m 238m 2540 S    5  2.4  15:38.82 java
> > 29164 wasadmin  20   0 1527m 224m 2608 S    5  2.2  16:02.56 java
> > 31400 wasadmin  20   0 1509m 259m 2468 S    5  2.6  15:26.38 java
> > 25244 wasadmin  20   0 1509m 290m 2624 S    5  2.9  16:16.07 java
> > 24769 wasadmin  20   0 1409m 259m 2308 S    5  2.6  16:08.12 java
> > 28796 wasadmin  20   0 1338m 263m 3076 S    4  2.6  15:47.72 java
> > 26185 wasadmin  20   0 1493m 274m 2304 S    2  2.7  16:01.97 java
> > 25968 wasadmin  20   0 1427m 257m 2532 S    1  2.6  15:51.50 java
> > 29495 wasadmin  20   0 1466m 259m 2260 S    1  2.6  15:31.82 java
> > 25080 wasadmin  20   0 1445m 236m 2472 S    0  2.4  15:53.19 java
> > 26410 wasadmin  20   0 1475m 271m 2540 S    0  2.7  15:52.48 java
> > 31027 wasadmin  20   0 1413m 238m 2492 S    0  2.4  15:29.78 java
> >  3695 wasadmin  20   0  9968 1352 1352 S    0  0.0   0:00.13 bash
> > 24474 wasadmin  20   0 1468m 205m 2472 S    0  2.0  16:03.63 java
> > 24920 wasadmin  20   0 1522m 263m 2616 S    0  2.6  16:06.29 java
> > 25422 wasadmin  20   0 1584m 229m 2284 S    0  2.3  16:02.18 java
> > 27892 wasadmin  20   0 1414m 263m 2648 S    0  2.6  15:45.96 java
> > 28184 wasadmin  20   0 1523m 241m 2320 S    0  2.4  15:42.21 java
> > 28486 wasadmin  20   0 1450m 231m 2288 S    0  2.3  15:46.53 java
> > 30625 wasadmin  20   0 1477m 251m 3024 S    0  2.5  15:44.80 java
> >
> > -----------------
> >
> >
> > Here are a few screen grabs from the 3270 console session:
> >
> > Unless you get a _continuous_flood_ of these messages it means
> > everything is working fine. Allocations from irqs cannot be
> > perfectly reliable and the kernel is designed to handle that.
> > java: page allocation failure. order:0, mode:0x20, alloc_flags:0x7, pflags:0x400040
> > CPU: 1 Not tainted 2.6.27.45-0.1-default #1
> > Process java (pid: 28831, task: 00000001ab64c638, ksp: 0000000215bbb5e0)
> > 0000000000000000 000000027fbcf7b0 0000000000000002 0000000000000000
> >       000000027fbcf850 000000027fbcf7c8 000000027fbcf7c8 00000000003b6696
> >       00000000014a4e88 0000000000000007 0000000000634e00 0000000000000000
> >       000000000000000d 0000000000000000 000000027fbcf818 000000000000000e
> >       00000000003cdc00 000000000010521a 000000027fbcf7b0 000000027fbcf7f8
> > Call Trace:
> > ([<0000000000105174>] show_trace+0x130/0x134)
> >  [<000000000019890a>] __alloc_pages_internal+0x406/0x55c
> >  [<00000000001c7056>] cache_grow+0x382/0x458
> >  [<00000000001c7440>] cache_alloc_refill+0x314/0x36c
> >  [<00000000001c6c12>] kmem_cache_alloc+0x82/0x144
> >  [<00000000003228f2>] __alloc_skb+0x82/0x208
> >  [<000000000032378e>] dev_alloc_skb+0x36/0x64
> >  [<000003e0001a030e>] qeth_core_get_next_skb+0x31e/0x704 [qeth]
> >  [<000003e0000d5f8c>] qeth_l3_process_inbound_buffer+0x9c/0x598 [qeth_l3]
> >  [<000003e0000d6574>] qeth_l3_qdio_input_handler+0xec/0x268 [qeth_l3]
> >  [<000003e0000ebc44>] qdio_kick_inbound_handler+0xbc/0x178 [qdio]
> >  [<000003e0000ee58c>] __tiqdio_inbound_processing+0x394/0xdf4 [qdio]
> >  [<000000000013a800>] tasklet_action+0x10c/0x1e4
> >  [<000000000013b908>] __do_softirq+0xe0/0x1c8
> >  [<0000000000110252>] do_softirq+0xaa/0xb0
> >  [<000000000013b772>] irq_exit+0xc2/0xcc
> >  [<00000000002f6586>] do_IRQ+0x132/0x1c8
> >  [<0000000000114148>] io_return+0x0/0x8
> >  [<00000000002b850e>] _raw_spin_lock_wait+0x86/0xa4
> > ([<000003e047d6fa00>] 0x3e047d6fa00)
> >  [<000000000019eb9c>] shrink_page_list+0x1a0/0x584
> >  [<000000000019f184>] shrink_inactive_list+0x204/0x5b0
> >  [<000000000019f620>] shrink_zone+0xf0/0x1d0
> >  [<000000000019f882>] shrink_zones+0xae/0x184
> >  [<00000000001a02be>] do_try_to_free_pages+0x96/0x3fc
> >  [<00000000001a072c>] try_to_free_pages+0x74/0x7c
> >  [<0000000000198730>] __alloc_pages_internal+0x22c/0x55c
> >  [<000000000019b5a2>] __do_page_cache_readahead+0x10a/0x2ac
> >  [<000000000019b7cc>] do_page_cache_readahead+0x88/0xa8
> >  [<000000000019170e>] filemap_fault+0x33a/0x448
> >  [<00000000001a55bc>] __do_fault+0x78/0x580
> >  [<00000000001a962e>] handle_mm_fault+0x1e6/0x4c0
> >  [<00000000003b9e1e>] do_dat_exception+0x29e/0x388
> >  [<0000000000113c0c>] sysc_return+0x0/0x8
> >  [<0000020000214bde>] 0x20000214bde
> > Mem-Info:
> > DMA per-cpu:
> > CPU    0: hi:  186, btch:  31 usd:   0
> > CPU    1: hi:  186, btch:  31 usd:   0
> > Normal per-cpu:
> > CPU    0: hi:  186, btch:  31 usd:   0
> > CPU    1: hi:  186, btch:  31 usd:   0
> > Active:1355277 inactive:1132712 dirty:0 writeback:0 unstable:0
> >  free:9269 slab:17875 mapped:765 pagetables:24402 bounce:0
> > DMA free:33220kB min:2568kB low:3208kB high:3852kB active:1092112kB inactive:926924kB present:2064384kB pages_scanned:21132286 all_unreclaimable? no
> > lowmem_reserve[]: 0 8064 8064
> > Normal free:3856kB min:10276kB low:12844kB high:15412kB active:4328996kB inactive:3603924kB present:8257536kB pages_scanned:44557906 all_unreclaimable? yes
> > lowmem_reserve[]: 0 0 0
> > DMA: 101*4kB 32*8kB 473*16kB 195*32kB 49*64kB 30*128kB 8*256kB 3*512kB 8*1024kB = 33220kB
> > Normal: 0*4kB 0*8kB 1*16kB 0*32kB 0*64kB 0*128kB 1*256kB 1*512kB 3*1024kB = 3856kB
> > 9283 total pagecache pages
> > 0 pages in swap cache
> > Swap cache stats: add 34513958, delete 34513958, find 6612011/8393146
> > Free swap  = 0kB
> > Total swap = 35764956kB
> > 2621440 pages RAM
> > 54354 pages reserved
> > 22356 pages shared
> > 2538214 pages non-shared
> > __ratelimit: 4 callbacks suppressed
> > The following is only an harmless informational message.
> > Unless you get a _continuous_flood_ of these messages it means
> > everything is working fine. Allocations from irqs cannot be
> > perfectly reliable and the kernel is designed to handle that.
> > java: page allocation failure. order:0, mode:0x20, alloc_flags:0x7, pflags:0x400040
> > CPU: 1 Not tainted 2.6.27.45-0.1-default #1
> > Process java (pid: 28831, task: 00000001ab64c638, ksp: 0000000215bbb5e0)
> > 0000000000000000 000000027fbcf7b0 0000000000000002 0000000000000000
> >       000000027fbcf850 000000027fbcf7c8 000000027fbcf7c8 00000000003b6696
> > *etc., etc. for HUNDREDS of pages...*
> >
> >
> > ----------------------------------------------------------------------
> > For LINUX-390 subscribe / signoff / archive access instructions,
> > send email to [email protected] with the message: INFO LINUX-390 or
> > visit
> > http://www.marist.edu/htbin/wlvindex?LINUX-390
> > ----------------------------------------------------------------------
> > For more information on Linux on System z, visit
> > http://wiki.linuxvm.org/
> >
>
>
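
For reference, the figures quoted above can be put side by side with some quick arithmetic. This is only a sketch: it assumes "50/256" means minimum/maximum Java heap in MB per JVM, and it takes the RAM and swap totals straight from the quoted top output. The function names are made up for illustration.

```python
KIB_PER_GIB = 1024 * 1024


def heap_ceiling_mib(jvm_count: int, max_heap_mib: int) -> int:
    """Upper bound on Java heap across all JVMs, in MiB."""
    return jvm_count * max_heap_mib


def kib_to_gib(kib: int) -> float:
    """Convert a kB figure as printed by top into GiB."""
    return kib / KIB_PER_GIB


jvm_count = 28        # "We have 28 JVMs"
max_heap_mib = 256    # assumption: the "256" in "50/256" is max heap in MB
ram_kib = 10268344    # "Mem:  10268344k total"
swap_kib = 35764956   # "Swap: 35764956k total"

total_heap_gib = heap_ceiling_mib(jvm_count, max_heap_mib) / 1024
ram_gib = kib_to_gib(ram_kib)
swap_gib = kib_to_gib(swap_kib)

print(f"max heap, all JVMs: {total_heap_gib:.1f} GiB")
print(f"RAM:                {ram_gib:.1f} GiB")
print(f"swap:               {swap_gib:.1f} GiB")
```

If that reading of the heap settings is right, the combined maximum heap is smaller than real storage alone, so the configured heap by itself would not seem to account for using up all of RAM plus swap. Memory outside the Java heap would be the next thing to look at.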
