Thanks a lot for this Martin.

Here is what I got (using version 5.7):

___________ Thu Oct 23 01:15:10 EDT 2014 ___________
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     588196              2297    7%
ZFS File Data              480826              1878    6%
Anon                       820560              3205   10%
Exec and libs               49531               193    1%
Page cache                5125284             20020   61%
Free (cachelist)          1009189              3942   12%
Free (freelist)            314893              1230


The main culprit is a process used by a vendor product:

   PID USERNAME  SIZE   RSS STATE   PRI NICE      TIME  CPU PROCESS/NLWP
 10592 geneva     37G   20G cpu5     20    0   0:10:32  11% newaga/1


This machine has 32GB of RAM, so at first glance the options are to either
add memory or ask the vendor for guidance on how to limit the memory used
by that process.

However, I am wondering whether "Page cache" should really be alarming. According
to Oracle ( https://blogs.oracle.com/rmc/entry/the_vm_system_formally_known ): "The
cachelist operates as part of the freelist. When the freelist is depleted,
allocations are made from the oldest pages in the cachelist. *This allows
the file system page cache to grow to consume all available memory and to
dynamically shrink as memory is required for other purposes.*"

In this case the newaga command is part of a replication script which
copies an in-memory database from a remote server to the local machine. This
in-memory database works with memory segments that are replicated on disk and
loaded as needed. The system can even work with 16GB of RAM; we increased it
only because we were getting too many alerts from monit. On Solaris 10 (with
the previous version of the same software) we had no memory alerts from monit
with 16GB of RAM and the same database. It is not an exact comparison, of
course, because we changed both the OS and the version of the app.

Bottom line: I am now trying to understand whether monit should report
memory usage differently on Solaris 11, whether the vendor should use
memory differently, or whether Solaris should be tuned to stop the
alerts.
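
To make the question concrete, here is a rough sketch that recomputes the usage percentage from the alert-time ::memstat snapshot above under two accounting conventions. It is not how monit calculates memory usage; the helper name memstat_usage is made up for illustration, and treating "Page cache" as reclaimable is an assumption that is exactly what is in doubt here. Because that snapshot has no Total row, the sketch falls back to the sum of the rows.

```shell
#!/bin/sh
# memstat_usage (hypothetical helper, not part of monit or Solaris):
# reads "::memstat" text on stdin and prints usage under two conventions.
memstat_usage() {
  awk '
    NF < 2 { next }                         # skip blank lines
    {
      # The MB column is the last field, or the one before a trailing "NN%".
      mb = ($NF ~ /%$/) ? $(NF - 1) : $NF
      if (mb !~ /^[0-9]+$/) next            # skip headers, rulers, date banners
      mb += 0
      if ($1 == "Total") { total = mb; next }
      sum += mb
      if ($0 ~ /^Free \(/)    free += mb    # cachelist + freelist
      if ($0 ~ /^Page cache/) pagecache = mb
    }
    END {
      if (total == 0) total = sum           # no Total row: fall back to the row sum
      if (total == 0) { print "no memstat rows found"; exit 1 }
      used = total - free
      printf "total_mb=%d used_excl_free=%.1f%% used_excl_free_and_pagecache=%.1f%%\n",
             total, 100 * used / total, 100 * (used - pagecache) / total
    }'
}

# Alert-time snapshot copied from this message.
memstat_usage <<'EOF'
Kernel                     588196              2297    7%
ZFS File Data              480826              1878    6%
Anon                       820560              3205   10%
Exec and libs               49531               193    1%
Page cache                5125284             20020   61%
Free (cachelist)          1009189              3942   12%
Free (freelist)            314893              1230
EOF
# prints: total_mb=32765 used_excl_free=84.2% used_excl_free_and_pagecache=23.1%
```

If only the two free lists count as available, this snapshot is above the 80% threshold; if the mapped page cache is also considered reclaimable, it is far below it. Which of the two is the right reading is the open question.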


Under normal operation BTW this is what we get:

> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     585743              2288    7%
ZFS File Data              861077              3363   10%
Anon                       793486              3099    9%
Exec and libs               45752               178    1%
Page cache                 259302              1012    3%
Free (cachelist)          4301112             16801   51%
Free (freelist)           1542007              6023   18%
Total                     8388479             32767


Thanks again for your help with this!

- Nestor

On Thu, Oct 23, 2014 at 5:24 AM, Martin Pala <[email protected]> wrote:

> You can use the prstat exec action too, just remove the "-s rss" option to
> let it sort the output by CPU usage (default)
>
> Regards,
> Martin
>
>
> On 22 Oct 2014, at 18:58, Nestor Urquiza <[email protected]> wrote:
>
> Thanks for this Martin,
>
> I will keep you posted now that I installed 5.7 and put the command in
> monitrc as recommended.
>
> We are also getting some alerts for CPU usage spikes. Do you have a
> recommendation for the command to run when getting those as well?
>
> Thanks!
> - Nestor
>
> On Wed, Oct 22, 2014 at 3:33 AM, Martin Pala <[email protected]>
> wrote:
>
>> Hi Nestor,
>>
>> you can use something like this to get the distribution (will record the
>> memstat output + user space distribution ... processes by RSS):
>>
>>         if memory usage > 80% then exec "/bin/sh -c 'exec >>
>> /tmp/memstat.$$; echo ___________ `date` ___________; echo ::memstat | sudo
>> mdb -k; prstat -c -s rss 1 10'"
>>
>>
>> There was a fix for the Solaris memory usage report in Monit 5.7 ... can you
>> please upgrade to Monit 5.9? If the problem persists: is the system where
>> Monit is running 32-bit or 64-bit? Is it a Solaris zone?
>>
>>
>> Regards,
>> Martin
>>
>>
>> > On 20 Oct 2014, at 22:04, Nestor Urquiza <[email protected]>
>> wrote:
>> >
>> > Hi Martin,
>> >
>> > Is there a way to put monit in debug mode so we get more information
>> about the memory distribution at the moment of the alert?
>> >
>> > One thing we have noticed is that regardless of how many cycles we wait
>> before alerting, the "succeeded" message arrives in the very next cycle
>> after the alert, which is really weird.
>> >
>> > Thanks,
>> >
>> > - Nestor
>> >
>> > On Sun, Oct 19, 2014 at 12:32 PM, Nestor Urquiza <
>> [email protected]> wrote:
>> > I am sorry about the examples but yes we do get memory utilization
>> spikes:
>> >
>> > "mem usage of 82.6% matches resource limit [mem usage>80.0%],"
>> >
>> > It is difficult to get that information at the time of the alert
>> though. Is there a way to put monit in debug mode or something similar to
>> get the exact memory utilization distribution?
>> >
>> > Right now everything is alright:
>> >
>> > $ sudo monit status
>> > ...
>> > System 'server'
>> >   status                            Running
>> >   monitoring status                 Monitored
>> >   load average                      [0.13] [0.12] [0.11]
>> >   cpu                               0.3%us 1.4%sy 0.0%wa
>> >   memory usage                      11822268 kB [35.2%]
>> >   swap usage                        0 kB [0.0%]
>> >   data collected                    Sun, 19 Oct 2014 12:23:47
>> > ...
>> >
>> > $ echo ::memstat | sudo mdb -k
>> > Page Summary                Pages                MB  %Tot
>> > ------------     ----------------  ----------------  ----
>> > Kernel                     591587              2310    7%
>> > ZFS File Data             1089502              4255   13%
>> > Anon                       999345              3903   12%
>> > Exec and libs               50239               196    1%
>> > Page cache                 249081               972    3%
>> > Free (cachelist)          3821104             14926   46%
>> > Free (freelist)           1587621              6201   19%
>> > Total                     8388479             32767
>> >
>> > Thanks,
>> >
>> > - Nestor
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Sat, Oct 18, 2014 at 4:22 PM, Martin Pala <[email protected]>
>> wrote:
>> > Hi,
>> >
>> > the attached error message ("cpu system usage ...") is for the CPU test ...
>> not related to memory usage. High "cpu system" usage may, for example, be a
>> sign of heavy disk I/O activity and/or swapping (memory shortage) - check
>> the vmstat output for details.
>> >
>> > If the memory usage report is the problem, can you please provide the
>> output of "echo ::memstat | mdb -k" and "monit status" (just the System
>> service part is sufficient)?
>> >
>> >
>> > Regards,
>> > Martin
>> >
>> >
>> >
>> > > On 16 Oct 2014, at 16:41, Nestor Urquiza <[email protected]>
>> wrote:
>> > >
>> > > Hi guys,
>> > >
>> > > Since we went from Solaris 10 to 11 we have seen an increase in monit
>> alerts related to memory resource utilization. We used to get no alerts
>> even when we set the memory threshold really low, for example:
>> > >
>> > > "...cpu system usage of 45.8% matches resource limit [cpu system
>> usage>40.0%]"
>> > >
>> > >
>> > > We have raised the threshold to 90% but we still get alerts.
>> > >
>> > > Could it be that the way monit decides what is free memory in Solaris
>> is incorrect when using ZFS? See
>> http://serverfault.com/questions/378392/how-should-i-monitor-memory-usage-performance-in-sunos-solaris
>> > >
>> > > We are running monit version 5.5 BTW which has been working fine for
>> ages.
>> > >
>> > > Perhaps version 5.9 has changed something in that regard; I read in the
>> release notes ( http://mmonit.com/monit/changes/ ) that it allows
>> monitoring generic device strings (not really related, but worth asking).
>> > >
>> > > Thanks!
>> > >
>> > > - Nestor
>> > >
>> > > --
>> > > To unsubscribe:
>> > > https://lists.nongnu.org/mailman/listinfo/monit-general
