Hi Martin,

I have tested that version and it seems to be working perfectly, as I am
unable to replicate the error.

Many thanks for your help.

Kind regards
Wayne

On 8 December 2011 15:09, Martin Pala wrote:

> I'm sorry, the hint won't work: when a manual action is performed, its
> handling is prioritized and the rest of the services are handled only
> after the action has completed.
>
> The fix for the problem is available here, please can you try it?:
> http://www.mmonit.com/tmp/monit-5.3.1p3.tar.gz
>
> Regards,
> Martin
>
> On Dec 8, 2011, at 3:09 PM, Lawrence, Wayne wrote:
>
> Hi Martin,
>
> Actually the system check is just after the web server setup and before
> the apache checks. I have basically modified the default monitrc file and
> have not changed the order of the checks. So my check order is as follows:
>
> check system
> check apache_bin
> check httpd
> check postfix
> check other services
>
> If there is a change to the config I can make to remedy this, I will be
> happy to try it and report the results.
>
> Regards
>
> Wayne
>
> On 8 December 2011 13:50, Martin Pala wrote:
>
>> Thanks, the root cause is clear now.
>>
>> It seems that your configuration (most probably the apache check) uses
>> the pattern based process check and the system check is most probably
>> defined behind the apache check in monitrc. When you restart such a
>> service, monit waits for it to start and refreshes the process list to
>> see whether it started. The process list refresh also refreshes the cpu
>> usage statistics - this happens every 5 milliseconds until the process
>> starts or the action times out.
>>
>> The CPU usage reported by monit (for example system 50%) is thus true but
>> the value comes from a very short timeframe (cpu usage from the last 5
>> milliseconds) instead of the full cycle (for example cpu usage from the
>> last 5 seconds). If the system check is defined first (in front of the
>> apache check), this won't happen, as it will take the initial values
>> (from the cycle start) before the apache action occurred.
>>
>> => it is a bug limited to a specific configuration:
>> 1.) "check process myproc matching …" is used
>> 2.) "check system" is defined after the myproc check
>> 3.) the myproc service is restarted
>>
>> Workaround:
>> move the "check system" ahead of other services in your monit
>> configuration file
>>
>> We'll fix the problem.
>>
>> Thanks for help with the testing and data :)
>>
>> Regards,
>> Martin
>>
>> On Dec 8, 2011, at 2:03 PM, Lawrence, Wayne wrote:
>>
>> Hi Martin,
>>
>> I did as you instructed; here is the output.
>> To my untrained eye there is some serious miscalculation in the 4th
>> CPUDEBUG statement; I have no clue how it arrives at that figure.
>>
>> CPUDEBUG: used_system_memory_sysdep: time=1323349102: cpu_user=293199 (-1.00%), cpu_nice=547, cpu_syst=194433 (-1.00%), cpu_idle=58991209, cpu_wait=31605 (-1.00%), cpu_irq=153, cpu_softirq=1990, cpu_total=59512589 -- old_cpu_user=0, old_cpu_syst=0, old_cpu_wait=0, old_cpu_total=0
>> CPUDEBUG: check_system: time=1323349102: systeminfo.total_cpu_user_percent=-1.00%, systeminfo.total_cpu_syst_percent=-1.00%, systeminfo.total_cpu_wait_percent=-1.00%
>> CPUDEBUG: used_system_memory_sysdep: time=1323349142: cpu_user=293227 (0.70%), cpu_nice=547, cpu_syst=194469 (0.90%), cpu_idle=58995131, cpu_wait=31606 (0.00%), cpu_irq=153, cpu_softirq=1990, cpu_total=59516576 -- old_cpu_user=293199, old_cpu_syst=194433, old_cpu_wait=31605, old_cpu_total=59512589
>> CPUDEBUG: used_system_memory_sysdep: time=1323349142: cpu_user=293227 (-214748364.80%), cpu_nice=547, cpu_syst=194469 (-214748364.80%), cpu_idle=58995131, cpu_wait=31606 (-214748364.80%), cpu_irq=153, cpu_softirq=1990, cpu_total=59516576 -- old_cpu_user=293227, old_cpu_syst=194469, old_cpu_wait=31606, old_cpu_total=59516576
>> CPUDEBUG: used_system_memory_sysdep: time=1323349142: cpu_user=293229 (0.00%), cpu_nice=547, cpu_syst=194473 (100.00%), cpu_idle=58995132, cpu_wait=31606 (0.00%), cpu_irq=153, cpu_softirq=1990, cpu_total=59516583 -- old_cpu_user=293229, old_cpu_syst=194472, old_cpu_wait=31606, old_cpu_total=59516582
>> CPUDEBUG: check_system: time=1323349142: systeminfo.total_cpu_user_percent=0.00%, systeminfo.total_cpu_syst_percent=100.00%, systeminfo.total_cpu_wait_percent=0.00%
>> CPUDEBUG: used_system_memory_sysdep: time=1323349202: cpu_user=293307 (0.90%), cpu_nice=547, cpu_syst=194542 (0.90%), cpu_idle=59001021, cpu_wait=31610 (0.00%), cpu_irq=153, cpu_softirq=1990, cpu_total=59522623 -- old_cpu_user=293252, old_cpu_syst=194483, old_cpu_wait=31606, old_cpu_total=59516616
>> CPUDEBUG: check_system: time=1323349202: systeminfo.total_cpu_user_percent=0.90%, systeminfo.total_cpu_syst_percent=0.90%, systeminfo.total_cpu_wait_percent=0.00%
>>
>> Regards
>>
>> Wayne
>>
>> On 8 December 2011 12:50, Martin Pala wrote:
>>
>>> Hi,
>>>
>>> thanks for the update. I have prepared the debug version, which logs the
>>> values computed based on /proc/stat right when they are ready and once
>>> again before the values are checked, so we can see whether the values
>>> were read+computed correctly and whether no memory corruption occurred
>>> before they were compared by the validation engine => there are two
>>> "CPUDEBUG" log entries per cycle.
>>>
>>> You can get it here: http://www.mmonit.com/tmp/monit-5.3.1p2.tar.gz
>>>
>>> To compile:
>>> tar -xzf monit-5.3.1p2.tar.gz
>>> cd monit-5.3.1p2
>>> ./configure
>>> make
>>>
>>> Then stop the existing monit instance and run the new monit binary:
>>> ./monit -vI 2>&1 | grep CPUDEBUG
>>>
>>> After you replicate the problem, terminate monit with ^C and send the
>>> whole CPUDEBUG output since monit start.
>>>
>>> Regards,
>>> Martin
>>>
>>> On Dec 8, 2011, at 11:39 AM, Lawrence, Wayne wrote:
>>>
>>> Hi Martin, just as a side note here, I disabled the cpu system test and
>>> tried again and it seems that the issue is present with all the cpu
>>> monitoring.
>>>
>>> I used the restarting of httpd as I knew it would trigger an alert and
>>> these were the results.
>>>
>>> Date:        Thu, 08 Dec 2011 10:27:59
>>> Action:      alert
>>> Host:        <hostname removed>
>>> Description: cpu user usage of 100.0% matches resource limit [cpu user usage>70.0%]
>>>
>>> I ran vmstat 1 10 at the same time; as you can see, it's the 4th line.
>>>
>>> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
>>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>>>  0  0      0 739220 142536 973532    0    0     4     7   10    6  0  0 99  0  0
>>>  0  0      0 739088 142536 973532    0    0     0     0  114  160  0  1 99  0  0
>>>  3  0      0 739088 142536 973536    0    0     0     0  126  169  1  2 97  0  0
>>>  0  0      0 737336 142536 973544    0    0     0   168  721  796 35 14 50  1  0
>>>  1  0      0 736964 142536 973544    0    0     0     0  109  160  1  1 98  0  0
>>>
>>> And just to make it a little simpler, I ran sar 1 10 as well, as it is
>>> more human readable.
>>>
>>> 10:27:55   CPU   %user   %nice   %system   %iowait   %steal   %idle
>>> 10:27:56   all    1.01    0.00      1.01      0.00     0.00   97.98
>>> 10:27:57   all    0.00    0.00      1.00      0.00     0.00   99.00
>>> 10:27:58   all    3.96    0.00      3.96      0.00     0.00   92.08
>>> 10:27:59   all   32.00    0.00     12.00      1.00     0.00   55.00
>>>
>>> Something struck me as odd while testing this. Yesterday's result,
>>> reporting 50% system usage from 15.84% actual, means the reported usage
>>> is 3.2 times the actual. Today's reported user usage of 100% is 3.2 times
>>> the actual 32%. So it seems we just need to work out why it is
>>> multiplying the results.
>>>
>>> Regards
>>>
>>> Wayne
>>>
>>> On 7 December 2011 11:43, Lawrence, Wayne wrote:
>>>
>>>> Hi Martin,
>>>>
>>>> I downloaded the source from the Monit website and compiled it on the
>>>> server.
>>>> I have started monit in verbose mode and this is the relevant
>>>> information it outputs when the event occurs.
>>>>
>>>> cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]
>>>>
>>>> -------------------------------------------------------------------------------
>>>> ../tools/bin/monit() [0x41a533]
>>>> ../tools/bin/monit(LogError+0x9f) [0x41ad2f]
>>>> ../tools/bin/monit(Event_post+0x328) [0x417ba8]
>>>> ..t/tools/bin/monit() [0x428071]
>>>> ../tools/bin/monit(check_system+0x2b) [0x4285bb]
>>>> ../tools/bin/monit(validate+0x226) [0x42ad16]
>>>> ../tools/bin/monit() [0x41422d]
>>>> ../tools/bin/monit(main+0x511) [0x4149e1]
>>>> /lib64/libc.so.6(__libc_start_main+0xfd) [0x3592c1ecdd]
>>>> ../tools/bin/monit() [0x40b179]
>>>> -------------------------------------------------------------------------------
>>>>
>>>> Unfortunately remote access is not an option but I will happily run a
>>>> debug version to try and track down this problem as I really would like
>>>> to use Monit for my current build.
>>>>
>>>> Regards
>>>>
>>>> Wayne
>>>>
>>>> On 7 December 2011 11:17, Martin Pala wrote:
>>>>
>>>>> Thanks for data.
>>>>>
>>>>> The /proc/stat format is this:
>>>>>
>>>>> cpu <user> <nice> <system> <idle> <wait> <irq> <softirq>
>>>>>
>>>>> The values are cumulative cpu time counters (clock ticks), so if we
>>>>> subtract the corresponding values from your output, we get this:
>>>>>
>>>>>           user  nice  system  idle  wait  irq  softirq | total
>>>>> 09:57:35     1     0       1    99     0    0        0 |   101
>>>>> 09:57:36     1     0       0    98     0    0        0 |    99
>>>>> 09:57:37    25     0      16    59     1    0        0 |   101
>>>>> 09:57:38     1     0       2    98     0    0        0 |   101
>>>>>
>>>>> => at 09:57:37 the cpu usage was:
>>>>>
>>>>> user   = 24.75%
>>>>> system = 15.84%
>>>>> wait   =  0.99%
>>>>>
>>>>> This corresponds to the previous vmstat output. Monit computes the cpu
>>>>> usage the same way as above and doesn't modify these values => your
>>>>> monit really reports strange cpu usage (reported 50% vs. real ~ 16%).
>>>>>
>>>>> What's the origin of your monit binary? Did you compile it from the
>>>>> original source code or from some 3rd party source code distribution
>>>>> (such as the RHEL or Fedora repository)? Or do you use the pre-compiled
>>>>> binaries from www.mmonit.com? Or some 3rd party binary, patches or
>>>>> source code from another site?
>>>>>
>>>>> Please can you try to run monit in verbose mode and provide full
>>>>> output?:
>>>>>
>>>>> 1.) stop monit
>>>>> 2.) run monit in foreground with verbose mode enabled:
>>>>>     ./monit -vI
>>>>> 3.) after the problem happens, stop monit with "^C" and send the output
>>>>>
>>>>> I can also prepare a debug version which will dump the cpu usage
>>>>> related information, or if you can provide remote access to the system,
>>>>> I can troubleshoot the problem remotely.
>>>>>
>>>>> Regards,
>>>>> Martin
>>>>>
>>>>> On Dec 7, 2011, at 11:07 AM, Lawrence, Wayne wrote:
>>>>>
>>>>> Hi Martin,
>>>>>
>>>>> this is the output of the commands you requested.
>>>>>
>>>>> 1.) uname -m
>>>>>
>>>>> x86_64
>>>>>
>>>>> 2.) file `which monit`
>>>>>
>>>>> ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), dynamically
>>>>> linked (uses shared libs), for GNU/Linux 2.6.18, not stripped
>>>>>
>>>>> I ran the command you supplied to get the cpu usage directly as well
>>>>> while restarting the httpd service, as I know this will generate an
>>>>> alert.
>>>>>
>>>>> Date:        Wed, 07 Dec 2011 09:57:37
>>>>> Action:      exec
>>>>> Host:        <hostname removed>
>>>>> Description: cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]
>>>>>
>>>>> Wed Dec 7 09:57:34 GMT 2011
>>>>> cpu 207060 501 103542 49452254 25303 83 1569 0 0
>>>>> Wed Dec 7 09:57:35 GMT 2011
>>>>> cpu 207061 501 103543 49452353 25303 83 1569 0 0
>>>>> Wed Dec 7 09:57:36 GMT 2011
>>>>> cpu 207062 501 103543 49452451 25303 83 1569 0 0
>>>>> Wed Dec 7 09:57:37 GMT 2011
>>>>> cpu 207087 501 103559 49452510 25304 83 1569 0 0
>>>>> Wed Dec 7 09:57:38 GMT 2011
>>>>> cpu 207088 501 103561 49452608 25304 83 1569 0 0
>>>>> Wed Dec 7 09:57:40 GMT 2011
>>>>>
>>>>> If my understanding of /proc/stat is correct this still doesn't make
>>>>> any sense, but I may be wrong.
>>>>>
>>>>> Regards
>>>>>
>>>>> Wayne
>>>>>
>>>>> On 7 December 2011 09:37, Martin Pala wrote:
>>>>>
>>>>>> Please can you check that your monit binary matches the system
>>>>>> architecture? (i.e. for example 64-bit monit binary on 64-bit system -
>>>>>> not 32-bit monit on 64-bit system)
>>>>>>
>>>>>> To verify, please provide the output of the following commands:
>>>>>> 1.) uname -m
>>>>>> 2.) file `which monit`
>>>>>>
>>>>>> Monit takes the statistics from the /proc/stat kernel interface. You
>>>>>> can collect the statistics manually like this - for example to fetch
>>>>>> the state in 1 second intervals (30 samples):
>>>>>>
>>>>>> $ for ((i=0; i<30; i++)); do date; grep "cpu " /proc/stat; sleep 1; done
>>>>>>
>>>>>> Note: monit takes the first /proc/stat line ("cpu") which contains
>>>>>> the overall cpu usage in the system (summary of all cpus). The
>>>>>> /proc/stat also contains per-cpu statistics; if you want to collect
>>>>>> all the statistics, replace the "grep 'cpu '" simply with "cat".
>>>>>>
>>>>>> Regards,
>>>>>> Martin
>>>>>>
>>>>>> On Dec 7, 2011, at 10:04 AM, Lawrence, Wayne wrote:
>>>>>>
>>>>>> Hi Martin,
>>>>>>
>>>>>> I have tried various methods to identify the cause of this and took
>>>>>> your advice and used vmstat. I simply restarted the httpd process from
>>>>>> the monit web interface while the command was running and got the
>>>>>> following warning.
>>>>>>
>>>>>> Description: cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]
>>>>>>
>>>>>> But vmstat doesn't show that level of usage at the point of the alert.
>>>>>> As you can see there is some usage in the 3rd line of the output when
>>>>>> I restarted the httpd service but it doesn't seem enough to trigger an
>>>>>> alert.
>>>>>>
>>>>>> vmstat 1 10
>>>>>> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
>>>>>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>>>>>>  0  0      0 859596 114684 856908    0    0     4     6   81   77  0  0 99  0  0
>>>>>>  0  0      0 859448 114684 856916    0    0     0     0  100   94  1  0 99  0  0
>>>>>>  0  0      0 898352 114692 815600    0    0     0   168  555  605 23 15 61  1  0
>>>>>>
>>>>>> Not sure if there are any other tests I can run to narrow this down a
>>>>>> bit further as it still isn't making sense.
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Wayne
>>>>>>
>>>>>> On 7 December 2011 08:27, Martin Pala wrote:
>>>>>>
>>>>>>> Hi Lawrence,
>>>>>>>
>>>>>>> the test which triggers the alert is "system" cpu => it's the time
>>>>>>> the system spends in kernel mode. The cpu usage could be triggered by
>>>>>>> some background kernel task; to verify that the monit report matches
>>>>>>> the system cpu usage, you should use either "vmstat" or "top" instead
>>>>>>> of "ps".
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Martin
>>>>>>>
>>>>>>> On Dec 6, 2011, at 1:19 PM, Lawrence, Wayne wrote:
>>>>>>>
>>>>>>> Hi Igor,
>>>>>>>
>>>>>>> the operating system is RHEL6 and the monit version is 5.3.1
>>>>>>>
>>>>>>> this is what I have in my config
>>>>>>>
>>>>>>> if cpu usage (user) > 70% then alert
>>>>>>> if cpu usage (system) > 30% then alert
>>>>>>> if cpu usage (wait) > 20% then alert
>>>>>>>
>>>>>>> this is one of the errors
>>>>>>> Description: cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]
>>>>>>>
>>>>>>> this is what I get in /var/log/messages
>>>>>>> Dec 6 12:01:29 <hostname-removed> monit[864]: <hostname-removed> cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]
>>>>>>> Dec 6 12:02:29 <hostname-removed> monit[864]: <hostname-removed><hostname-removed>' cpu system usage check succeeded [current cpu system usage=0.9%]
>>>>>>>
>>>>>>> this is the output of ps --no-headers -A -o "%cpu sz ucomm" | sort -k1nr | head -20
>>>>>>>
>>>>>>> 12:01:29 up 4 days, 20:24, 2 users, load average: 0.04, 0.01, 0.00
>>>>>>>              total       used       free     shared    buffers     cached
>>>>>>> Mem:       2055108    1092176     962932          0      53156     811864
>>>>>>> -/+ buffers/cache:     227156    1827952
>>>>>>> Swap:      4128760          0    4128760
>>>>>>> 1.2 44308 perl
>>>>>>> 0.0 0 aio/0
>>>>>>> 0.0 0 async/mgr
>>>>>>> 0.0 0 ata/0
>>>>>>> 0.0 0 ata_aux
>>>>>>> 0.0 0 bdi-default
>>>>>>> 0.0 0 cpuset
>>>>>>> 0.0 0 crypto/0
>>>>>>> 0.0 0 events/0
>>>>>>> 0.0 0 ext4-dio-unwrit
>>>>>>> 0.0 0 flush-253:0
>>>>>>> 0.0 0 jbd2/dm-0-8
>>>>>>> 0.0 0 kacpi_hotplug
>>>>>>> 0.0 0 kacpi_notify
>>>>>>> 0.0 0 kacpid
>>>>>>> 0.0 0 kauditd
>>>>>>> 0.0 0 kblockd/0
>>>>>>> 0.0 0 kdmflush
>>>>>>> 0.0 0 khelper
>>>>>>> 0.0 0 khubd
>>>>>>>
>>>>>>> Have to say I am at a total loss as there is no way the usage
>>>>>>> figures are accurate.
>>>>>>> If there is any other info I can supply that will be useful please
>>>>>>> let me know.
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Wayne
>>>>>>>
>>>>>>> On 6 December 2011 12:03, Igor Homyakov wrote:
>>>>>>>
>>>>>>>> Hi Lawrence,
>>>>>>>>
>>>>>>>> Could you be a little bit more specific? Please provide information
>>>>>>>> about your operating system, the monit version on which the problem
>>>>>>>> occurred and so on.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Igor Homyakov
>>>>>>>>
>>>>>>>> On Tue, Dec 6, 2011 at 15:35, Lawrence, Wayne wrote:
>>>>>>>> > Hi,
>>>>>>>> >
>>>>>>>> > I have a few CPU usage checks in my monitrc but it seems monit is
>>>>>>>> > misreporting the usage.
>>>>>>>> >
>>>>>>>> > I have run several tests and it seems that monit is multiplying
>>>>>>>> > the actual usage by 10.
>>>>>>>> >
>>>>>>>> > I ran a process with top running in another shell and CPU usage
>>>>>>>> > for the user was never above 10% yet monit informed me that there
>>>>>>>> > was 100% cpu usage.
>>>>>>>> >
>>>>>>>> > I have tried various configurations including the one that came
>>>>>>>> > with the default config for system cpu monitoring and all seem to
>>>>>>>> > demonstrate the same issue.
>>>>>>>> >
>>>>>>>> > Any advice welcomed on this
>>>>>>>> >
>>>>>>>> > Regards
>>>>>>>> >
>>>>>>>> > Wayne Lawrence
>>>>>>>>
>>>>> --
>>>>> To unsubscribe:
>>>>> https://lists.nongnu.org/mailman/listinfo/monit-general
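For readers skimming the thread, the ordering Martin's workaround refers to would look roughly like the monitrc fragment below. This is only a sketch: the host name `myhost`, the `httpd` pattern and the init script paths are placeholders and do not come from the thread, while the thresholds are the ones Wayne quoted. Note that, per Martin's later reply, this ordering does not help when the restart is a manual action (for example from the web interface); in that case the patched build he linked was the actual fix.

```
# System-wide check first, so its CPU figures are taken at the start of the cycle
check system myhost
    if cpu usage (user) > 70% then alert
    if cpu usage (system) > 30% then alert
    if cpu usage (wait) > 20% then alert

# Pattern-based process checks afterwards (placeholder service definition)
check process httpd matching "httpd"
    start program = "/etc/init.d/httpd start"
    stop program = "/etc/init.d/httpd stop"
```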
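For reference, the /proc/stat arithmetic Martin walks through in the thread can be sketched in a few lines of Python. This is only an illustration of that calculation, not monit's actual code: the names `CpuSample`, `parse_cpu_line` and `usage_percent` are made up for this example, only the first seven counters are used (matching the format Martin lists), and the guard for a zero total delta is an assumption prompted by the fourth CPUDEBUG entry, where the old and new totals are identical.

```python
from typing import Dict, NamedTuple, Optional


class CpuSample(NamedTuple):
    """First seven counters of the aggregate "cpu" line in /proc/stat."""
    user: int
    nice: int
    system: int
    idle: int
    wait: int
    irq: int
    softirq: int

    def total(self) -> int:
        # Sum of all seven counters, as in the "total" column in the thread.
        return sum(self)


def parse_cpu_line(line: str) -> CpuSample:
    """Parse a line such as 'cpu 207062 501 103543 49452451 25303 83 1569 0 0'."""
    fields = line.split()
    if len(fields) < 8 or fields[0] != "cpu":
        raise ValueError(f"not an aggregate cpu line: {line!r}")
    # Any columns after softirq are ignored in this sketch.
    return CpuSample(*(int(value) for value in fields[1:8]))


def usage_percent(old: CpuSample, new: CpuSample) -> Optional[Dict[str, float]]:
    """Return user/system/wait usage in percent between two samples.

    Returns None when the total did not advance (an assumed guard, not
    something confirmed by the thread), because the ratio is undefined
    for an empty interval.
    """
    delta_total = new.total() - old.total()
    if delta_total <= 0:
        return None
    return {
        name: 100.0 * (getattr(new, name) - getattr(old, name)) / delta_total
        for name in ("user", "system", "wait")
    }


if __name__ == "__main__":
    # Samples taken from Wayne's output at 09:57:36 and 09:57:37.
    before = parse_cpu_line("cpu 207062 501 103543 49452451 25303 83 1569 0 0")
    after = parse_cpu_line("cpu 207087 501 103559 49452510 25304 83 1569 0 0")
    for name, value in usage_percent(before, after).items():
        print(f"{name} = {value:.2f}%")
```

Run against the two samples above, this reproduces the figures Martin gives for 09:57:37 (user 24.75%, system 15.84%, wait 0.99%). It also shows why the length of the sampling window matters: with a total delta of only one or two ticks, a single tick of system time already reads as 50% or 100%.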
