Hi Martin, Actually the system check is just after the web server setup and before the apache checks, I have basically modified the default monitrc file and have not changed the order of the checks. So my check order is as follows.
check system check apache_bin check httpd check postfix check other services if there is a change to the config i can make to remedy this i will be happy to try it and report the results. Regards Wayne On 8 December 2011 13:50, Martin Pala <[email protected]> wrote: > Thanks, the root cause is clear now. > > It seems that your configuration (most probably the apache check) uses the > pattern based process check and the system check is most probably defined > behinf the apache in monitrc. When you do restart of such service, monit > waits for it to start and refreshes the process list to see whether it > started. The process list refresh also refreshes the cpu usage statistics - > this happens every 5 milliseconds until the process starts or the action > times out. > > The CPU usage reported by monit (for example system 50%) is thus true but > the value comes from very short timeframe (cpu usage from last 5 > milliseconds) instead of full cycle (for example cpu usage from last 5 > seconds) . If the system check will be defined first (in front of apache > check), this won't happen, as it will take the initial values (from the > cycle start) before the apache action occurred. > > => it is bug limited to specific configuration: > 1.) "check process myproc matching …" is used > 2.) "check system" is defined after the myproc check > 3.) the myproc service is restarted > > Workaround: > move the "check system" ahead of other services in your monit > configuration file > > > We'll fix the problem. > > Thanks for help with the testing and data :) > > Regards, > Martin > > > > On Dec 8, 2011, at 2:03 PM, Lawrence, Wayne wrote: > > Hi Martin, > > did as you instructed here is the output. > From my untrained eye there is some serious miscalculation in the 4th > CPUDEBUG statement not a clue how it arrives at that figure. > > CPUDEBUG: used_system_memory_sysdep: time=1323349102: cpu_user=293199 > (-1.00%), cpu_nice=547, cpu_syst=194433 (-1.00%), cpu_idle=58991209, > cpu_wait=31605 (-1.00%), cpu_irq=153, cpu_softirq=1990, cpu_total=59512589 > -- old_cpu_user=0, old_cpu_syst=0, old_cpu_wait=0, old_cpu_total=0 > CPUDEBUG: check_system: time=1323349102: > systeminfo.total_cpu_user_percent=-1.00%, > systeminfo.total_cpu_syst_percent=-1.00%, > systeminfo.total_cpu_wait_percent=-1.00% > CPUDEBUG: used_system_memory_sysdep: time=1323349142: cpu_user=293227 > (0.70%), cpu_nice=547, cpu_syst=194469 (0.90%), cpu_idle=58995131, > cpu_wait=31606 (0.00%), cpu_irq=153, cpu_softirq=1990, cpu_total=59516576 > -- old_cpu_user=293199, old_cpu_syst=194433, old_cpu_wait=31605, > old_cpu_total=59512589 > CPUDEBUG: used_system_memory_sysdep: time=1323349142: cpu_user=293227 > (-214748364.80%), cpu_nice=547, cpu_syst=194469 (-214748364.80%), > cpu_idle=58995131, cpu_wait=31606 (-214748364.80%), cpu_irq=153, > cpu_softirq=1990, cpu_total=59516576 -- old_cpu_user=293227, > old_cpu_syst=194469, old_cpu_wait=31606, old_cpu_total=59516576 > CPUDEBUG: used_system_memory_sysdep: time=1323349142: cpu_user=293229 > (0.00%), cpu_nice=547, cpu_syst=194473 (100.00%), cpu_idle=58995132, > cpu_wait=31606 (0.00%), cpu_irq=153, cpu_softirq=1990, cpu_total=59516583 > -- old_cpu_user=293229, old_cpu_syst=194472, old_cpu_wait=31606, > old_cpu_total=59516582 > CPUDEBUG: check_system: time=1323349142: > systeminfo.total_cpu_user_percent=0.00%, > systeminfo.total_cpu_syst_percent=100.00%, > systeminfo.total_cpu_wait_percent=0.00% > CPUDEBUG: used_system_memory_sysdep: time=1323349202: cpu_user=293307 > (0.90%), cpu_nice=547, cpu_syst=194542 (0.90%), cpu_idle=59001021, > cpu_wait=31610 (0.00%), cpu_irq=153, cpu_softirq=1990, cpu_total=59522623 > -- old_cpu_user=293252, old_cpu_syst=194483, old_cpu_wait=31606, > old_cpu_total=59516616 > CPUDEBUG: check_system: time=1323349202: > systeminfo.total_cpu_user_percent=0.90%, > systeminfo.total_cpu_syst_percent=0.90%, > systeminfo.total_cpu_wait_percent=0.00% > > Regards > > Wayne > > On 8 December 2011 12:50, Martin Pala <[email protected]> wrote: > >> Hi, >> >> thanks for update. I have prepared the debug version, which logs the >> values computed based on /proc/stat right when they are ready and once >> again before the values are checked, so we can see whether the values were >> read+computed correctly and whether no memory corruption occurred before >> they were compared by the validation engine => there are two "CPUDEBUG" log >> entries per cycle. >> >> You can get it here: http://www.mmonit.com/tmp/monit-5.3.1p2.tar.gz >> >> To compile: >> tar -xzf monit-5.3.1p2.tar.gz >> cd monit-5.3.1p2 >> ./configure >> make >> >> Then stop existing monit instance and run new monit binary: >> ./monit -vI 2>&1 | grep CPUDEBUG >> >> after you'll replicate the problem, terminate monit with ^C and send the >> whole CPUDEBUG output since monit start >> >> Regards,, >> Martin >> >> >> >> On Dec 8, 2011, at 11:39 AM, Lawrence, Wayne wrote: >> >> Hi Martin just as a side note here i disabled the cpu ssystem test and >> tried again and it seems that the issue is present with all the cpu >> monitoring/ >> >> I used the restarting of httpd as i knew it would trigger and alert and >> these were the results. >> >> Date: Thu, 08 Dec 2011 10:27:59 >> Action: alert >> Host: <hostname removed> >> Description: cpu user usage of 100.0% matches resource limit [cpu >> user usage>70.0%] >> >> >> I ran vmstat 1 10 at the same time as you can see its the 4th line. >> >> >> procs -----------memory---------- ---swap-- -----io---- --system-- >> -----cpu----- >> r b swpd free buff cache si so bi bo in cs us sy >> id wa st >> 0 0 0 739220 142536 973532 0 0 4 7 10 6 0 0 >> 99 0 0 >> 0 0 0 739088 142536 973532 0 0 0 0 114 160 0 1 >> 99 0 0 >> 3 0 0 739088 142536 973536 0 0 0 0 126 169 1 2 >> 97 0 0 >> 0 0 0 737336 142536 973544 0 0 0 168 721 796 35 14 >> 50 1 0 >> 1 0 0 736964 142536 973544 0 0 0 0 109 160 1 1 >> 98 0 0 >> >> and just to make it a little simpler i ran sar 1 10 as well as it is more >> human readable. >> >> 10:27:55 CPU %user %nice %system %iowait %steal >> %idle >> 10:27:56 all 1.01 0.00 1.01 0.00 0.00 >> 97.98 >> 10:27:57 all 0.00 0.00 1.00 0.00 0.00 >> 99.00 >> 10:27:58 all 3.96 0.00 3.96 0.00 0.00 >> 92.08 >> 10:27:59 all 32.00 0.00 12.00 1.00 0.00 >> 55.00 >> >> Something struck me as odd while testing this yesterdays results >> reporting 50% system usage from 15.84% actual means the reported usage is >> 3.2 times the actual. todays reported user usage of 100% is 3.2 times the >> actual 32%. so it seems just need to work out why it is multiplying the >> results. >> >> Regards >> >> Wayne >> >> On 7 December 2011 11:43, Lawrence, Wayne <[email protected]>wrote: >> >>> Hi Martin, >>> >>> I downloaded the source from the Monit website and compiled it on the >>> server. >>> I have started monit in verbose mode and this is the >>> relevant information it outputs when the event occurs. >>> >>> cpu system usage of 50.0% matches resource limit [cpu system >>> usage>30.0%] >>> >>> ------------------------------------------------------------------------------- >>> ../tools/bin/monit() [0x41a533] >>> ../tools/bin/monit(LogError+0x9f) [0x41ad2f] >>> ../tools/bin/monit(Event_post+0x328) [0x417ba8] >>> ..t/tools/bin/monit() [0x428071] >>> ../tools/bin/monit(check_system+0x2b) [0x4285bb] >>> ../tools/bin/monit(validate+0x226) [0x42ad16] >>> ../tools/bin/monit() [0x41422d] >>> ../tools/bin/monit(main+0x511) [0x4149e1] >>> /lib64/libc.so.6(__libc_start_main+0xfd) [0x3592c1ecdd] >>> ../tools/bin/monit() [0x40b179] >>> >>> ------------------------------------------------------------------------------- >>> Unfortunately remote access is not an option but I will happily run a >>> debug version to try and track down this problem as I really would like to >>> use Monit for my current build. >>> >>> Regards >>> >>> Wayne >>> On 7 December 2011 11:17, Martin Pala <[email protected]> wrote: >>> >>>> Thanks for data. >>>> >>>> The /proc/stat format is this: >>>> >>>> cpu <user> <nice> <system> <idle> <wait> <irq> <softirq> >>>> >>>> The values count the cpu cycles, so if we subtract the corresponding >>>> values from your output, we get this: >>>> >>>> user nice system idle wait irq softirq >>>> | total >>>> 09:57:35 1 0 1 99 0 0 0 >>>> | 101 >>>> 09:57:36 1 0 0 98 0 0 0 >>>> | 99 >>>> 09:57:37 25 0 16 59 1 0 0 >>>> | 101 >>>> 09:57:38 1 0 2 98 0 0 >>>> 0 | 101 >>>> >>>> => at 09:57:37 the cpu usage was: >>>> >>>> user = 24.75% >>>> system = 15.84% >>>> wait = 0.99% >>>> >>>> This corresponds to the previous vmstat output. Monit counts the cpu >>>> usage the same way as above and doesn't modify these values => your monit >>>> really reports strange cpu usage (reported 50% vs. real ~ 16%). >>>> >>>> What's the origin of your monit binary? Did you compile it from >>>> original source code or some 3rd party source code distibution? (such as >>>> RHEL or Fedora repository). Or do you use the pre-compiled binaries from >>>> www.mmonit.com? Or some 3rd party binary, patches or source code from >>>> other site? >>>> >>>> Please can you try to run monit in verbose mode and provide full >>>> output?: >>>> >>>> 1.) stop monit >>>> 2.) run monit in foreground with verbose mode enabled: >>>> ./monit -vI >>>> 3.) after the problem happens, stop monit with "^C" and send output >>>> >>>> I can also prepare debug version which will dump the cpu usage related >>>> informations or if you can provide remote access to the system, i can >>>> troubleshoot the problem remotely. >>>> >>>> >>>> Regards, >>>> Martin >>>> >>>> >>>> >>>> On Dec 7, 2011, at 11:07 AM, Lawrence, Wayne wrote: >>>> >>>> Hi Martin, >>>> >>>> this is the output of the commands you requested. >>>> >>>> 1.) uname -m >>>> >>>> x86_64 >>>> >>>> 2.) file `which monit` >>>> >>>> ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), dynamically >>>> linked (uses shared libs), for GNU/Linux 2.6.18, not stripped >>>> I ran the command you supplied to get the cup usage directly as well >>>> while restarting the httpd service as i know this will generate an alert. >>>> >>>> >>>> >>>> Date: Wed, 07 Dec 2011 09:57:37 >>>> Action: exec >>>> Host: <hostname removed> >>>> Description: cpu system usage of 50.0% matches resource limit >>>> [cpu system usage>30.0%] >>>> >>>> Wed Dec 7 09:57:34 GMT 2011 >>>> cpu 207060 501 103542 49452254 25303 83 1569 0 0 >>>> Wed Dec 7 09:57:35 GMT 2011 >>>> cpu 207061 501 103543 49452353 25303 83 1569 0 0 >>>> Wed Dec 7 09:57:36 GMT 2011 >>>> cpu 207062 501 103543 49452451 25303 83 1569 0 0 >>>> Wed Dec 7 09:57:37 GMT 2011 >>>> cpu 207087 501 103559 49452510 25304 83 1569 0 0 >>>> Wed Dec 7 09:57:38 GMT 2011 >>>> cpu 207088 501 103561 49452608 25304 83 1569 0 0 >>>> Wed Dec 7 09:57:40 GMT 2011 >>>> If my understanding of /proc/stat is coreect this still doesnt make any >>>> sense but i may be wrong. >>>> >>>> Regards >>>> >>>> Wayne >>>> >>>> >>>> >>>> On 7 December 2011 09:37, Martin Pala <[email protected]> wrote: >>>> >>>>> Please can you check that your monit binary matches the system >>>>> architecture? (i.e. for example 64-bit monit binary on 64-bit system - not >>>>> 32-bit monit on 64-bit system) >>>>> >>>>> To verify provide please the output of following commands: >>>>> 1.) uname -m >>>>> 2.) file `which monit` >>>>> >>>>> Monit takes the statistics from the /proc/stat kernel interface. You >>>>> can collect the statistics manually like this - for example to fetch the >>>>> state in 1 second intervals (30 samples): >>>>> >>>>> $ for ((i=0; i<30; i++)); do date; grep "cpu " /proc/stat; sleep 1; >>>>> done >>>>> >>>>> Note: monit takes the first /proc/stat line ("cpu") which contains the >>>>> overall cpu usage in the system (summary of all cpus). The /proc/stat also >>>>> contains per-cpu statistics if you want to collect all the statistics, >>>>> replace the "grep 'cpu '" simply with "cat". >>>>> >>>>> Regards, >>>>> Martin >>>>> >>>>> >>>>> On Dec 7, 2011, at 10:04 AM, Lawrence, Wayne wrote: >>>>> >>>>> Hi Martin, >>>>> >>>>> I have tried various methods to dientify the cause of this and took >>>>> your advice and used vmstat. I simply restarted the httpd process from the >>>>> monit web interface while the comand was running and got the following >>>>> warning. >>>>> >>>>> Description: cpu system usage of 50.0% matches resource limit >>>>> [cpu system usage>30.0%] >>>>> >>>>> But vmstat doesnt show that level of usage at the point of alert. As >>>>> you can see there is some usage in the 3rd line of the output when i >>>>> restarted the httpd service but it doesnt seem enough to trigger an alert. >>>>> >>>>> vmstat 1 10 >>>>> procs -----------memory---------- ---swap-- -----io---- --system-- >>>>> -----cpu----- >>>>> r b swpd free buff cache si so bi bo in cs us >>>>> sy id wa st >>>>> 0 0 0 859596 114684 856908 0 0 4 6 81 77 0 >>>>> 0 99 0 0 >>>>> 0 0 0 859448 114684 856916 0 0 0 0 100 94 1 >>>>> 0 99 0 0 >>>>> 0 0 0 898352 114692 815600 0 0 0 168 555 605 23 >>>>> 15 61 1 0 >>>>> >>>>> Not sure if there are any other tests i can run to narrow this down a >>>>> bit further as it still isn't making sense. >>>>> >>>>> Regards >>>>> >>>>> Wayne >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On 7 December 2011 08:27, Martin Pala <[email protected]> wrote: >>>>> >>>>>> Hi Lawrence, >>>>>> >>>>>> the test which triggers the alert is "system" cpu => it's the time >>>>>> the system spend in kernel mode. The cpu usage could be triggered by some >>>>>> background kernel task, to verify the monit report matches the system cpu >>>>>> usage, you should use either "vmstat" or "top" instead of "ps". >>>>>> >>>>>> Best regards, >>>>>> Martin >>>>>> >>>>>> >>>>>> On Dec 6, 2011, at 1:19 PM, Lawrence, Wayne wrote: >>>>>> >>>>>> Hi Igor, >>>>>> >>>>>> the operating system is RHEL6 and monit version is 5.3.1 >>>>>> >>>>>> this is what i have in my config >>>>>> >>>>>> if cpu usage (user) > 70% then alert >>>>>> if cpu usage (system) > 30% then alert >>>>>> if cpu usage (wait) > 20% then alert >>>>>> >>>>>> this is one of the errors >>>>>> Description: cpu system usage of 50.0% matches resource limit [cpu >>>>>> system usage>30.0%] >>>>>> >>>>>> this is what i get in /var/log/messages >>>>>> Dec 6 12:01:29 <hostname-removed> monit[864]: <hostname-removed> cpu >>>>>> system usage of 50.0% matches resource limit [cpu system usage>30.0%] >>>>>> Dec 6 12:02:29 <hostname-removed> monit[864]: >>>>>> <hostname-removed><hostname-removed>' cpu system usage check succeeded >>>>>> [current cpu system usage=0.9%] >>>>>> >>>>>> this is the output of ps --no-headers -A -o "%*cpu* sz ucomm" | sort >>>>>> -k1nr | head -20 >>>>>> >>>>>> 12:01:29 up 4 days, 20:24, 2 users, load average: 0.04, 0.01, 0.00 >>>>>> total used free shared buffers >>>>>> cached >>>>>> Mem: 2055108 1092176 962932 0 53156 >>>>>> 811864 >>>>>> -/+ buffers/cache: 227156 1827952 >>>>>> Swap: 4128760 0 4128760 >>>>>> 1.2 44308 perl >>>>>> 0.0 0 aio/0 >>>>>> 0.0 0 async/mgr >>>>>> 0.0 0 ata/0 >>>>>> 0.0 0 ata_aux >>>>>> 0.0 0 bdi-default >>>>>> 0.0 0 cpuset >>>>>> 0.0 0 crypto/0 >>>>>> 0.0 0 events/0 >>>>>> 0.0 0 ext4-dio-unwrit >>>>>> 0.0 0 flush-253:0 >>>>>> 0.0 0 jbd2/dm-0-8 >>>>>> 0.0 0 kacpi_hotplug >>>>>> 0.0 0 kacpi_notify >>>>>> 0.0 0 kacpid >>>>>> 0.0 0 kauditd >>>>>> 0.0 0 kblockd/0 >>>>>> 0.0 0 kdmflush >>>>>> 0.0 0 khelper >>>>>> 0.0 0 khubd >>>>>> >>>>>> Have to say i am at a total loss as there is no way the usage figures >>>>>> are accurate. >>>>>> If there is any other info i can supply that will be useful please >>>>>> let me know. >>>>>> >>>>>> Regards >>>>>> >>>>>> Wayne >>>>>> >>>>>> >>>>>> On 6 December 2011 12:03, Igor Homyakov < >>>>>> [email protected]> wrote: >>>>>> >>>>>>> Hi Lawrence, >>>>>>> >>>>>>> Could you be a little bit more specific ? Please provide information >>>>>>> about you operation system, monit version on which the problem >>>>>>> occurred and so on. >>>>>>> >>>>>>> Regards >>>>>>> Igor Homyakov >>>>>>> >>>>>>> On Tue, Dec 6, 2011 at 15:35, Lawrence, Wayne >>>>>>> <[email protected]> wrote: >>>>>>> > Hi, >>>>>>> > >>>>>>> > I have a few CPU usage checks in my monitrc but it seems monit is >>>>>>> > misreporting the usage. >>>>>>> > >>>>>>> > I have run several tests and it seems that monit is multiplying >>>>>>> the actual >>>>>>> > usage by 10. >>>>>>> > >>>>>>> > I ran a process with top running in another shell and CPU usage >>>>>>> for the user >>>>>>> > was never above 10% yet monit informed me that there was 100% cpu >>>>>>> usage. >>>>>>> > >>>>>>> > I have tried various configurations including the one that came >>>>>>> with the >>>>>>> > default config for system cpu monitoring and all seem to >>>>>>> demonstrate the >>>>>>> > same issue. >>>>>>> > >>>>>>> > Any advice welcomed on this >>>>>>> > >>>>>>> > Regards >>>>>>> > >>>>>>> > Wayne Lawrence >>>>>>> >>>>>> >>>> >>>> -- >>>> To unsubscribe: >>>> https://lists.nongnu.org/mailman/listinfo/monit-general >>>> >>> >>> >> -- >> To unsubscribe: >> https://lists.nongnu.org/mailman/listinfo/monit-general >> >> >> >> -- >> To unsubscribe: >> https://lists.nongnu.org/mailman/listinfo/monit-general >> > > -- > To unsubscribe: > https://lists.nongnu.org/mailman/listinfo/monit-general > > > > -- > To unsubscribe: > https://lists.nongnu.org/mailman/listinfo/monit-general >
-- To unsubscribe: https://lists.nongnu.org/mailman/listinfo/monit-general
