Re: CPU usage in Monit

Lawrence, Wayne Thu, 08 Dec 2011 07:36:28 -0800

Hi Martin,

Have tested that version and it seems to be working perfectly as i am
unable to replicate the error.


Many thanks for your help.

Kind regards

Wayne

On 8 December 2011 15:09, Martin Pala <[email protected]> wrote:

>  I'm sorry, the hint won't work, as when the manual action is done, its
> handling is prioritized and the rest of services are handled when the
> action completed.
>
> The fix for the problem is available here, please can you try it?:
> http://www.mmonit.com/tmp/monit-5.3.1p3.tar.gz
>
> Regards,
> Martin
>
>
>
>  On Dec 8, 2011, at 3:09 PM, Lawrence, Wayne wrote:
>
>  Hi Martin,
>
> Actually the system check is just after the web server setup and before
> the apache checks, I have basically modified the default monitrc file and
> have not changed the order of the checks. So my check order is as follows.
>
> check system
> check apache_bin
> check httpd
> check postfix
> check other services
>
> if there is a change to the config i can make to remedy this i will be
> happy to try it and report the results.
>
> Regards
>
> Wayne
>
>
>
> On 8 December 2011 13:50, Martin Pala <[email protected]> wrote:
>
>>  Thanks, the root cause is clear now.
>>
>> It seems that your configuration (most probably the apache check) uses
>> the pattern based process check and the system check is most probably
>> defined behinf the apache in monitrc. When you do restart of such service,
>> monit waits for it to start and refreshes the process list to see whether
>> it started. The process list refresh also refreshes the cpu usage
>> statistics - this happens every 5 milliseconds until the process starts or
>> the action times out.
>>
>> The CPU usage reported by monit (for example system 50%) is thus true but
>> the value comes from very short timeframe (cpu usage from last 5
>> milliseconds) instead of full cycle (for example cpu usage from last 5
>> seconds) . If the system check will be defined first (in front of apache
>> check), this won't happen, as it will take the initial values (from the
>> cycle start) before the apache action occurred.
>>
>> => it is bug limited to specific configuration:
>> 1.) "check process myproc matching …" is used
>> 2.) "check system" is defined after the myproc check
>> 3.) the myproc service is restarted
>>
>> Workaround:
>> move the "check system" ahead of other services in your monit
>> configuration file
>>
>>
>> We'll fix the problem.
>>
>> Thanks for help with the testing and data :)
>>
>> Regards,
>> Martin
>>
>>
>>
>>  On Dec 8, 2011, at 2:03 PM, Lawrence, Wayne wrote:
>>
>>  Hi Martin,
>>
>> did as you instructed here is the output.
>> From my untrained eye there is some serious miscalculation in the 4th
>> CPUDEBUG statement not a clue how it arrives at that figure.
>>
>> CPUDEBUG: used_system_memory_sysdep: time=1323349102: cpu_user=293199
>> (-1.00%), cpu_nice=547, cpu_syst=194433 (-1.00%), cpu_idle=58991209,
>> cpu_wait=31605 (-1.00%), cpu_irq=153, cpu_softirq=1990, cpu_total=59512589
>> -- old_cpu_user=0, old_cpu_syst=0, old_cpu_wait=0, old_cpu_total=0
>> CPUDEBUG: check_system: time=1323349102:
>> systeminfo.total_cpu_user_percent=-1.00%,
>> systeminfo.total_cpu_syst_percent=-1.00%,
>> systeminfo.total_cpu_wait_percent=-1.00%
>> CPUDEBUG: used_system_memory_sysdep: time=1323349142: cpu_user=293227
>> (0.70%), cpu_nice=547, cpu_syst=194469 (0.90%), cpu_idle=58995131,
>> cpu_wait=31606 (0.00%), cpu_irq=153, cpu_softirq=1990, cpu_total=59516576
>> -- old_cpu_user=293199, old_cpu_syst=194433, old_cpu_wait=31605,
>> old_cpu_total=59512589
>> CPUDEBUG: used_system_memory_sysdep: time=1323349142: cpu_user=293227
>> (-214748364.80%), cpu_nice=547, cpu_syst=194469 (-214748364.80%),
>> cpu_idle=58995131, cpu_wait=31606 (-214748364.80%), cpu_irq=153,
>> cpu_softirq=1990, cpu_total=59516576 -- old_cpu_user=293227,
>> old_cpu_syst=194469, old_cpu_wait=31606, old_cpu_total=59516576
>> CPUDEBUG: used_system_memory_sysdep: time=1323349142: cpu_user=293229
>> (0.00%), cpu_nice=547, cpu_syst=194473 (100.00%), cpu_idle=58995132,
>> cpu_wait=31606 (0.00%), cpu_irq=153, cpu_softirq=1990, cpu_total=59516583
>> -- old_cpu_user=293229, old_cpu_syst=194472, old_cpu_wait=31606,
>> old_cpu_total=59516582
>> CPUDEBUG: check_system: time=1323349142:
>> systeminfo.total_cpu_user_percent=0.00%,
>> systeminfo.total_cpu_syst_percent=100.00%,
>> systeminfo.total_cpu_wait_percent=0.00%
>> CPUDEBUG: used_system_memory_sysdep: time=1323349202: cpu_user=293307
>> (0.90%), cpu_nice=547, cpu_syst=194542 (0.90%), cpu_idle=59001021,
>> cpu_wait=31610 (0.00%), cpu_irq=153, cpu_softirq=1990, cpu_total=59522623
>> -- old_cpu_user=293252, old_cpu_syst=194483, old_cpu_wait=31606,
>> old_cpu_total=59516616
>> CPUDEBUG: check_system: time=1323349202:
>> systeminfo.total_cpu_user_percent=0.90%,
>> systeminfo.total_cpu_syst_percent=0.90%,
>> systeminfo.total_cpu_wait_percent=0.00%
>>
>> Regards
>>
>> Wayne
>>
>> On 8 December 2011 12:50, Martin Pala <[email protected]> wrote:
>>
>>>  Hi,
>>>
>>> thanks for update. I have prepared the debug version, which logs the
>>> values computed based on /proc/stat right when they are ready and once
>>> again before the values are checked, so we can see whether the values were
>>> read+computed correctly and whether no memory corruption occurred before
>>> they were compared by the validation engine => there are two "CPUDEBUG" log
>>> entries per cycle.
>>>
>>> You can get it here: http://www.mmonit.com/tmp/monit-5.3.1p2.tar.gz
>>>
>>> To compile:
>>> tar -xzf monit-5.3.1p2.tar.gz
>>> cd monit-5.3.1p2
>>> ./configure
>>> make
>>>
>>> Then stop existing monit instance and run new monit binary:
>>> ./monit -vI  2>&1 | grep CPUDEBUG
>>>
>>> after you'll replicate the problem, terminate monit with ^C and send the
>>> whole CPUDEBUG output since monit start
>>>
>>> Regards,,
>>> Martin
>>>
>>>
>>>
>>>  On Dec 8, 2011, at 11:39 AM, Lawrence, Wayne wrote:
>>>
>>>  Hi Martin just as a side note here i disabled the cpu ssystem test and
>>> tried again and it seems that the issue is present with all the cpu
>>> monitoring/
>>>
>>> I used the restarting of httpd as i knew it would trigger and alert and
>>> these were the results.
>>>
>>>  Date:        Thu, 08 Dec 2011 10:27:59
>>>       Action:      alert
>>>       Host:        <hostname removed>
>>>       Description: cpu user usage of 100.0% matches resource limit [cpu
>>> user usage>70.0%]
>>>
>>>
>>> I ran vmstat 1 10 at the same time as you can see its the 4th line.
>>>
>>>
>>> procs -----------memory---------- ---swap-- -----io---- --system--
>>> -----cpu-----
>>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy
>>> id wa st
>>>  0  0      0 739220 142536 973532    0    0     4     7   10    6  0  0
>>> 99  0  0
>>>  0  0      0 739088 142536 973532    0    0     0     0  114  160  0  1
>>> 99  0  0
>>>  3  0      0 739088 142536 973536    0    0     0     0  126  169  1  2
>>> 97  0  0
>>>  0  0      0 737336 142536 973544    0    0     0   168  721  796 35 14
>>> 50  1  0
>>>  1  0      0 736964 142536 973544    0    0     0     0  109  160  1  1
>>> 98  0  0
>>>
>>> and just to make it a little simpler i ran sar 1 10 as well as it is
>>> more human readable.
>>>
>>> 10:27:55        CPU     %user     %nice   %system   %iowait
>>> %steal     %idle
>>> 10:27:56        all      1.01      0.00      1.01      0.00
>>> 0.00     97.98
>>> 10:27:57        all      0.00      0.00      1.00      0.00
>>> 0.00     99.00
>>> 10:27:58        all      3.96      0.00      3.96      0.00
>>> 0.00     92.08
>>> 10:27:59        all     32.00      0.00     12.00      1.00
>>> 0.00     55.00
>>>
>>> Something struck me as odd while testing this yesterdays results
>>> reporting 50% system usage from 15.84% actual means the reported usage is
>>> 3.2 times the actual. todays reported user usage of 100% is 3.2 times the
>>> actual 32%. so it seems just need to work out why it is multiplying the
>>> results.
>>>
>>> Regards
>>>
>>> Wayne
>>>
>>> On 7 December 2011 11:43, Lawrence, Wayne 
>>> <[email protected]>wrote:
>>>
>>>> Hi Martin,
>>>>
>>>> I downloaded the source from the Monit website and compiled it on the
>>>> server.
>>>> I have started monit in verbose mode and this is the
>>>> relevant information it outputs when the event occurs.
>>>>
>>>>   cpu system usage of 50.0% matches resource limit [cpu system
>>>> usage>30.0%]
>>>>
>>>> -------------------------------------------------------------------------------
>>>>    ../tools/bin/monit() [0x41a533]
>>>>  ../tools/bin/monit(LogError+0x9f) [0x41ad2f]
>>>>    ../tools/bin/monit(Event_post+0x328) [0x417ba8]
>>>>     ..t/tools/bin/monit() [0x428071]
>>>>     ../tools/bin/monit(check_system+0x2b) [0x4285bb]
>>>>     ../tools/bin/monit(validate+0x226) [0x42ad16]
>>>>    ../tools/bin/monit() [0x41422d]
>>>>     ../tools/bin/monit(main+0x511) [0x4149e1]
>>>>     /lib64/libc.so.6(__libc_start_main+0xfd) [0x3592c1ecdd]
>>>>     ../tools/bin/monit() [0x40b179]
>>>>
>>>> -------------------------------------------------------------------------------
>>>> Unfortunately remote access is not an option but I will happily run a
>>>> debug version to try and track down this problem as I really would like to
>>>> use Monit for my current build.
>>>>
>>>> Regards
>>>>
>>>> Wayne
>>>>  On 7 December 2011 11:17, Martin Pala <[email protected]> wrote:
>>>>
>>>>>   Thanks for data.
>>>>>
>>>>> The /proc/stat format is this:
>>>>>
>>>>>     cpu <user> <nice> <system> <idle> <wait> <irq> <softirq>
>>>>>
>>>>> The values count the cpu cycles, so if we subtract the corresponding
>>>>> values from your output, we get this:
>>>>>
>>>>>                    user   nice   system   idle   wait   irq   softirq
>>>>>   |   total
>>>>> 09:57:35    1         0        1              99     0       0      0
>>>>>          |    101
>>>>> 09:57:36    1         0        0              98     0       0
>>>>> 0          |    99
>>>>> 09:57:37    25       0        16           59     1       0      0
>>>>>      |    101
>>>>>  09:57:38    1         0        2              98     0       0
>>>>> 0          |    101
>>>>>
>>>>> => at  09:57:37 the cpu usage was:
>>>>>
>>>>> user      = 24.75%
>>>>> system =  15.84%
>>>>> wait      =   0.99%
>>>>>
>>>>> This corresponds to the previous vmstat output. Monit counts the cpu
>>>>> usage the same way as above and doesn't modify these values => your monit
>>>>> really reports strange cpu usage (reported 50% vs. real ~ 16%).
>>>>>
>>>>> What's the origin of your monit binary? Did you compile it from
>>>>> original source code or some 3rd party source code distibution? (such as
>>>>> RHEL or Fedora repository). Or do you use the pre-compiled binaries from
>>>>> www.mmonit.com? Or some 3rd party binary, patches or source code from
>>>>> other site?
>>>>>
>>>>> Please can you try to run monit in verbose mode and provide full
>>>>> output?:
>>>>>
>>>>>    1.) stop monit
>>>>>    2.) run monit in foreground with verbose mode enabled:
>>>>>        ./monit -vI
>>>>>    3.) after the problem happens, stop monit with "^C" and send output
>>>>>
>>>>> I can also prepare debug version which will dump the cpu usage related
>>>>> informations or if you can provide remote access to the system, i can
>>>>> troubleshoot the problem remotely.
>>>>>
>>>>>
>>>>> Regards,
>>>>> Martin
>>>>>
>>>>>
>>>>>
>>>>>   On Dec 7, 2011, at 11:07 AM, Lawrence, Wayne wrote:
>>>>>
>>>>>  Hi Martin,
>>>>>
>>>>> this is the output of the commands you requested.
>>>>>
>>>>> 1.) uname -m
>>>>>
>>>>> x86_64
>>>>>
>>>>>  2.) file `which monit`
>>>>>
>>>>>  ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), dynamically
>>>>> linked (uses shared libs), for GNU/Linux 2.6.18, not stripped
>>>>> I ran the command you supplied to get the cup usage directly as well
>>>>> while restarting the httpd service as i know this will generate an alert.
>>>>>
>>>>>
>>>>>
>>>>>       Date:        Wed, 07 Dec 2011 09:57:37
>>>>>       Action:      exec
>>>>>       Host:        <hostname removed>
>>>>>       Description: cpu system usage of 50.0% matches resource limit
>>>>> [cpu system usage>30.0%]
>>>>>
>>>>> Wed Dec  7 09:57:34 GMT 2011
>>>>> cpu  207060 501 103542 49452254 25303 83 1569 0 0
>>>>> Wed Dec  7 09:57:35 GMT 2011
>>>>> cpu  207061 501 103543 49452353 25303 83 1569 0 0
>>>>> Wed Dec  7 09:57:36 GMT 2011
>>>>> cpu  207062 501 103543 49452451 25303 83 1569 0 0
>>>>> Wed Dec  7 09:57:37 GMT 2011
>>>>> cpu  207087 501 103559 49452510 25304 83 1569 0 0
>>>>> Wed Dec  7 09:57:38 GMT 2011
>>>>> cpu  207088 501 103561 49452608 25304 83 1569 0 0
>>>>> Wed Dec  7 09:57:40 GMT 2011
>>>>> If my understanding of /proc/stat is coreect this still doesnt make
>>>>> any sense but i may be wrong.
>>>>>
>>>>> Regards
>>>>>
>>>>> Wayne
>>>>>
>>>>>
>>>>>
>>>>> On 7 December 2011 09:37, Martin Pala <[email protected]> wrote:
>>>>>
>>>>>>  Please can you check that your monit binary matches the system
>>>>>> architecture? (i.e. for example 64-bit monit binary on 64-bit system - 
>>>>>> not
>>>>>> 32-bit monit on 64-bit system)
>>>>>>
>>>>>> To verify provide please the output of following commands:
>>>>>> 1.) uname -m
>>>>>> 2.) file `which monit`
>>>>>>
>>>>>> Monit takes the statistics from the /proc/stat kernel interface. You
>>>>>> can collect the statistics manually like this - for example to fetch the
>>>>>> state in 1 second intervals (30 samples):
>>>>>>
>>>>>>  $ for ((i=0; i<30; i++)); do date; grep "cpu " /proc/stat; sleep 1;
>>>>>> done
>>>>>>
>>>>>> Note: monit takes the first /proc/stat line ("cpu") which contains
>>>>>> the overall cpu usage in the system (summary of all cpus). The /proc/stat
>>>>>> also contains per-cpu statistics if you want to collect all the 
>>>>>> statistics,
>>>>>> replace the "grep 'cpu '" simply with "cat".
>>>>>>
>>>>>> Regards,
>>>>>> Martin
>>>>>>
>>>>>>
>>>>>>  On Dec 7, 2011, at 10:04 AM, Lawrence, Wayne wrote:
>>>>>>
>>>>>>  Hi Martin,
>>>>>>
>>>>>> I have tried various methods to dientify the cause of this and took
>>>>>> your advice and used vmstat. I simply restarted the httpd process from 
>>>>>> the
>>>>>> monit web interface while the comand was running and got the following
>>>>>> warning.
>>>>>>
>>>>>>        Description: cpu system usage of 50.0% matches resource limit
>>>>>> [cpu system usage>30.0%]
>>>>>>
>>>>>> But vmstat doesnt show that level of usage at the point of alert. As
>>>>>> you can see there is some usage in the 3rd line of the output when i
>>>>>> restarted the httpd service but it doesnt seem enough to trigger an 
>>>>>> alert.
>>>>>>
>>>>>> vmstat 1 10
>>>>>> procs -----------memory---------- ---swap-- -----io---- --system--
>>>>>> -----cpu-----
>>>>>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us
>>>>>> sy id wa st
>>>>>>  0  0      0 859596 114684 856908    0    0     4     6   81   77
>>>>>> 0  0 99  0  0
>>>>>>  0  0      0 859448 114684 856916    0    0     0     0  100   94  1
>>>>>> 0 99  0  0
>>>>>>  0  0      0 898352 114692 815600    0    0     0   168  555  605 23
>>>>>> 15 61  1  0
>>>>>>
>>>>>> Not sure if there are any other tests i can run to narrow this down a
>>>>>> bit further as it still isn't making sense.
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Wayne
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 7 December 2011 08:27, Martin Pala <[email protected]> wrote:
>>>>>>
>>>>>>>  Hi Lawrence,
>>>>>>>
>>>>>>> the test which triggers the alert is "system" cpu => it's the time
>>>>>>> the system spend in kernel mode. The cpu usage could be triggered by 
>>>>>>> some
>>>>>>> background kernel task, to verify the monit report matches the system 
>>>>>>> cpu
>>>>>>> usage, you should use either "vmstat" or "top" instead of "ps".
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Martin
>>>>>>>
>>>>>>>
>>>>>>>  On Dec 6, 2011, at 1:19 PM, Lawrence, Wayne wrote:
>>>>>>>
>>>>>>>  Hi Igor,
>>>>>>>
>>>>>>> the operating system is RHEL6 and monit version is 5.3.1
>>>>>>>
>>>>>>> this is what i have in my config
>>>>>>>
>>>>>>>     if cpu usage (user) > 70% then alert
>>>>>>>     if cpu usage (system) > 30% then alert
>>>>>>>     if cpu usage (wait) > 20% then alert
>>>>>>>
>>>>>>> this is one of the errors
>>>>>>> Description: cpu system usage of 50.0% matches resource limit [cpu
>>>>>>> system usage>30.0%]
>>>>>>>
>>>>>>> this is what i get in /var/log/messages
>>>>>>> Dec  6 12:01:29 <hostname-removed> monit[864]: <hostname-removed>
>>>>>>> cpu system usage of 50.0% matches resource limit [cpu system 
>>>>>>> usage>30.0%]
>>>>>>> Dec  6 12:02:29 <hostname-removed> monit[864]:
>>>>>>> <hostname-removed><hostname-removed>' cpu system usage check succeeded
>>>>>>> [current cpu system usage=0.9%]
>>>>>>>
>>>>>>> this is the output of ps --no-headers -A -o "%*cpu* sz ucomm" |
>>>>>>> sort -k1nr | head -20
>>>>>>>
>>>>>>>  12:01:29 up 4 days, 20:24,  2 users,  load average: 0.04, 0.01, 0.00
>>>>>>>              total       used       free     shared    buffers
>>>>>>> cached
>>>>>>> Mem:       2055108    1092176     962932          0      53156
>>>>>>> 811864
>>>>>>> -/+ buffers/cache:     227156    1827952
>>>>>>> Swap:      4128760          0    4128760
>>>>>>>  1.2 44308 perl
>>>>>>>  0.0     0 aio/0
>>>>>>>  0.0     0 async/mgr
>>>>>>>  0.0     0 ata/0
>>>>>>>  0.0     0 ata_aux
>>>>>>>  0.0     0 bdi-default
>>>>>>>  0.0     0 cpuset
>>>>>>>  0.0     0 crypto/0
>>>>>>>  0.0     0 events/0
>>>>>>>  0.0     0 ext4-dio-unwrit
>>>>>>>  0.0     0 flush-253:0
>>>>>>>  0.0     0 jbd2/dm-0-8
>>>>>>>  0.0     0 kacpi_hotplug
>>>>>>>  0.0     0 kacpi_notify
>>>>>>>  0.0     0 kacpid
>>>>>>>  0.0     0 kauditd
>>>>>>>  0.0     0 kblockd/0
>>>>>>>  0.0     0 kdmflush
>>>>>>>  0.0     0 khelper
>>>>>>>  0.0     0 khubd
>>>>>>>
>>>>>>> Have to say i am at a total loss as there is no way the usage
>>>>>>> figures are accurate.
>>>>>>> If there is any other info i can supply that will be useful please
>>>>>>> let me know.
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Wayne
>>>>>>>
>>>>>>>
>>>>>>> On 6 December 2011 12:03, Igor Homyakov <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Lawrence,
>>>>>>>>
>>>>>>>> Could you be a little bit more specific ?  Please provide
>>>>>>>> information
>>>>>>>> about you operation system, monit version on which the problem
>>>>>>>> occurred and so on.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Igor Homyakov
>>>>>>>>
>>>>>>>> On Tue, Dec 6, 2011 at 15:35, Lawrence, Wayne
>>>>>>>> <[email protected]> wrote:
>>>>>>>> > Hi,
>>>>>>>> >
>>>>>>>> > I have a few CPU usage checks in my monitrc but it seems monit is
>>>>>>>> > misreporting the usage.
>>>>>>>> >
>>>>>>>> > I have run several tests and it seems that monit is multiplying
>>>>>>>> the actual
>>>>>>>> > usage by 10.
>>>>>>>> >
>>>>>>>> > I ran a process with top running in another shell and CPU usage
>>>>>>>> for the user
>>>>>>>> > was never above 10% yet monit informed me that there was 100% cpu
>>>>>>>> usage.
>>>>>>>> >
>>>>>>>> > I have tried various configurations including the one that came
>>>>>>>> with the
>>>>>>>> > default config for system cpu monitoring and all seem to
>>>>>>>> demonstrate the
>>>>>>>> > same issue.
>>>>>>>> >
>>>>>>>> > Any advice welcomed on this
>>>>>>>> >
>>>>>>>> > Regards
>>>>>>>> >
>>>>>>>> > Wayne Lawrence
>>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> To unsubscribe:
>>>>> https://lists.nongnu.org/mailman/listinfo/monit-general
>>>>>
>>>>
>>>>
>>> --
>>> To unsubscribe:
>>> https://lists.nongnu.org/mailman/listinfo/monit-general
>>>
>>>
>>>
>>> --
>>> To unsubscribe:
>>> https://lists.nongnu.org/mailman/listinfo/monit-general
>>>
>>
>> --
>> To unsubscribe:
>> https://lists.nongnu.org/mailman/listinfo/monit-general
>>
>>
>>
>> --
>> To unsubscribe:
>> https://lists.nongnu.org/mailman/listinfo/monit-general
>>
>
> --
> To unsubscribe:
> https://lists.nongnu.org/mailman/listinfo/monit-general
>
>
>
> --
> To unsubscribe:
> https://lists.nongnu.org/mailman/listinfo/monit-general
>

--
To unsubscribe:
https://lists.nongnu.org/mailman/listinfo/monit-general

Re: CPU usage in Monit

Reply via email to