Hi Martin, did as you instructed here is the output. >From my untrained eye there is some serious miscalculation in the 4th CPUDEBUG statement not a clue how it arrives at that figure.
CPUDEBUG: used_system_memory_sysdep: time=1323349102: cpu_user=293199 (-1.00%), cpu_nice=547, cpu_syst=194433 (-1.00%), cpu_idle=58991209, cpu_wait=31605 (-1.00%), cpu_irq=153, cpu_softirq=1990, cpu_total=59512589 -- old_cpu_user=0, old_cpu_syst=0, old_cpu_wait=0, old_cpu_total=0 CPUDEBUG: check_system: time=1323349102: systeminfo.total_cpu_user_percent=-1.00%, systeminfo.total_cpu_syst_percent=-1.00%, systeminfo.total_cpu_wait_percent=-1.00% CPUDEBUG: used_system_memory_sysdep: time=1323349142: cpu_user=293227 (0.70%), cpu_nice=547, cpu_syst=194469 (0.90%), cpu_idle=58995131, cpu_wait=31606 (0.00%), cpu_irq=153, cpu_softirq=1990, cpu_total=59516576 -- old_cpu_user=293199, old_cpu_syst=194433, old_cpu_wait=31605, old_cpu_total=59512589 CPUDEBUG: used_system_memory_sysdep: time=1323349142: cpu_user=293227 (-214748364.80%), cpu_nice=547, cpu_syst=194469 (-214748364.80%), cpu_idle=58995131, cpu_wait=31606 (-214748364.80%), cpu_irq=153, cpu_softirq=1990, cpu_total=59516576 -- old_cpu_user=293227, old_cpu_syst=194469, old_cpu_wait=31606, old_cpu_total=59516576 CPUDEBUG: used_system_memory_sysdep: time=1323349142: cpu_user=293229 (0.00%), cpu_nice=547, cpu_syst=194473 (100.00%), cpu_idle=58995132, cpu_wait=31606 (0.00%), cpu_irq=153, cpu_softirq=1990, cpu_total=59516583 -- old_cpu_user=293229, old_cpu_syst=194472, old_cpu_wait=31606, old_cpu_total=59516582 CPUDEBUG: check_system: time=1323349142: systeminfo.total_cpu_user_percent=0.00%, systeminfo.total_cpu_syst_percent=100.00%, systeminfo.total_cpu_wait_percent=0.00% CPUDEBUG: used_system_memory_sysdep: time=1323349202: cpu_user=293307 (0.90%), cpu_nice=547, cpu_syst=194542 (0.90%), cpu_idle=59001021, cpu_wait=31610 (0.00%), cpu_irq=153, cpu_softirq=1990, cpu_total=59522623 -- old_cpu_user=293252, old_cpu_syst=194483, old_cpu_wait=31606, old_cpu_total=59516616 CPUDEBUG: check_system: time=1323349202: systeminfo.total_cpu_user_percent=0.90%, systeminfo.total_cpu_syst_percent=0.90%, systeminfo.total_cpu_wait_percent=0.00% Regards Wayne On 8 December 2011 12:50, Martin Pala <[email protected]> wrote: > Hi, > > thanks for update. I have prepared the debug version, which logs the > values computed based on /proc/stat right when they are ready and once > again before the values are checked, so we can see whether the values were > read+computed correctly and whether no memory corruption occurred before > they were compared by the validation engine => there are two "CPUDEBUG" log > entries per cycle. > > You can get it here: http://www.mmonit.com/tmp/monit-5.3.1p2.tar.gz > > To compile: > tar -xzf monit-5.3.1p2.tar.gz > cd monit-5.3.1p2 > ./configure > make > > Then stop existing monit instance and run new monit binary: > ./monit -vI 2>&1 | grep CPUDEBUG > > after you'll replicate the problem, terminate monit with ^C and send the > whole CPUDEBUG output since monit start > > Regards,, > Martin > > > > On Dec 8, 2011, at 11:39 AM, Lawrence, Wayne wrote: > > Hi Martin just as a side note here i disabled the cpu ssystem test and > tried again and it seems that the issue is present with all the cpu > monitoring/ > > I used the restarting of httpd as i knew it would trigger and alert and > these were the results. > > Date: Thu, 08 Dec 2011 10:27:59 > Action: alert > Host: <hostname removed> > Description: cpu user usage of 100.0% matches resource limit [cpu > user usage>70.0%] > > > I ran vmstat 1 10 at the same time as you can see its the 4th line. > > > procs -----------memory---------- ---swap-- -----io---- --system-- > -----cpu----- > r b swpd free buff cache si so bi bo in cs us sy id > wa st > 0 0 0 739220 142536 973532 0 0 4 7 10 6 0 0 > 99 0 0 > 0 0 0 739088 142536 973532 0 0 0 0 114 160 0 1 > 99 0 0 > 3 0 0 739088 142536 973536 0 0 0 0 126 169 1 2 > 97 0 0 > 0 0 0 737336 142536 973544 0 0 0 168 721 796 35 14 > 50 1 0 > 1 0 0 736964 142536 973544 0 0 0 0 109 160 1 1 > 98 0 0 > > and just to make it a little simpler i ran sar 1 10 as well as it is more > human readable. > > 10:27:55 CPU %user %nice %system %iowait %steal > %idle > 10:27:56 all 1.01 0.00 1.01 0.00 0.00 > 97.98 > 10:27:57 all 0.00 0.00 1.00 0.00 0.00 > 99.00 > 10:27:58 all 3.96 0.00 3.96 0.00 0.00 > 92.08 > 10:27:59 all 32.00 0.00 12.00 1.00 0.00 > 55.00 > > Something struck me as odd while testing this yesterdays results reporting > 50% system usage from 15.84% actual means the reported usage is 3.2 times > the actual. todays reported user usage of 100% is 3.2 times the actual 32%. > so it seems just need to work out why it is multiplying the results. > > Regards > > Wayne > > On 7 December 2011 11:43, Lawrence, Wayne <[email protected]>wrote: > >> Hi Martin, >> >> I downloaded the source from the Monit website and compiled it on the >> server. >> I have started monit in verbose mode and this is the relevant information >> it outputs when the event occurs. >> >> cpu system usage of 50.0% matches resource limit [cpu system >> usage>30.0%] >> >> ------------------------------------------------------------------------------- >> ../tools/bin/monit() [0x41a533] >> ../tools/bin/monit(LogError+0x9f) [0x41ad2f] >> ../tools/bin/monit(Event_post+0x328) [0x417ba8] >> ..t/tools/bin/monit() [0x428071] >> ../tools/bin/monit(check_system+0x2b) [0x4285bb] >> ../tools/bin/monit(validate+0x226) [0x42ad16] >> ../tools/bin/monit() [0x41422d] >> ../tools/bin/monit(main+0x511) [0x4149e1] >> /lib64/libc.so.6(__libc_start_main+0xfd) [0x3592c1ecdd] >> ../tools/bin/monit() [0x40b179] >> >> ------------------------------------------------------------------------------- >> Unfortunately remote access is not an option but I will happily run a >> debug version to try and track down this problem as I really would like to >> use Monit for my current build. >> >> Regards >> >> Wayne >> On 7 December 2011 11:17, Martin Pala <[email protected]> wrote: >> >>> Thanks for data. >>> >>> The /proc/stat format is this: >>> >>> cpu <user> <nice> <system> <idle> <wait> <irq> <softirq> >>> >>> The values count the cpu cycles, so if we subtract the corresponding >>> values from your output, we get this: >>> >>> user nice system idle wait irq softirq >>> | total >>> 09:57:35 1 0 1 99 0 0 0 >>> | 101 >>> 09:57:36 1 0 0 98 0 0 0 >>> | 99 >>> 09:57:37 25 0 16 59 1 0 0 >>> | 101 >>> 09:57:38 1 0 2 98 0 0 >>> 0 | 101 >>> >>> => at 09:57:37 the cpu usage was: >>> >>> user = 24.75% >>> system = 15.84% >>> wait = 0.99% >>> >>> This corresponds to the previous vmstat output. Monit counts the cpu >>> usage the same way as above and doesn't modify these values => your monit >>> really reports strange cpu usage (reported 50% vs. real ~ 16%). >>> >>> What's the origin of your monit binary? Did you compile it from original >>> source code or some 3rd party source code distibution? (such as RHEL or >>> Fedora repository). Or do you use the pre-compiled binaries from >>> www.mmonit.com? Or some 3rd party binary, patches or source code from >>> other site? >>> >>> Please can you try to run monit in verbose mode and provide full output?: >>> >>> 1.) stop monit >>> 2.) run monit in foreground with verbose mode enabled: >>> ./monit -vI >>> 3.) after the problem happens, stop monit with "^C" and send output >>> >>> I can also prepare debug version which will dump the cpu usage related >>> informations or if you can provide remote access to the system, i can >>> troubleshoot the problem remotely. >>> >>> >>> Regards, >>> Martin >>> >>> >>> >>> On Dec 7, 2011, at 11:07 AM, Lawrence, Wayne wrote: >>> >>> Hi Martin, >>> >>> this is the output of the commands you requested. >>> >>> 1.) uname -m >>> >>> x86_64 >>> >>> 2.) file `which monit` >>> >>> ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), dynamically >>> linked (uses shared libs), for GNU/Linux 2.6.18, not stripped >>> I ran the command you supplied to get the cup usage directly as well >>> while restarting the httpd service as i know this will generate an alert. >>> >>> >>> >>> Date: Wed, 07 Dec 2011 09:57:37 >>> Action: exec >>> Host: <hostname removed> >>> Description: cpu system usage of 50.0% matches resource limit >>> [cpu system usage>30.0%] >>> >>> Wed Dec 7 09:57:34 GMT 2011 >>> cpu 207060 501 103542 49452254 25303 83 1569 0 0 >>> Wed Dec 7 09:57:35 GMT 2011 >>> cpu 207061 501 103543 49452353 25303 83 1569 0 0 >>> Wed Dec 7 09:57:36 GMT 2011 >>> cpu 207062 501 103543 49452451 25303 83 1569 0 0 >>> Wed Dec 7 09:57:37 GMT 2011 >>> cpu 207087 501 103559 49452510 25304 83 1569 0 0 >>> Wed Dec 7 09:57:38 GMT 2011 >>> cpu 207088 501 103561 49452608 25304 83 1569 0 0 >>> Wed Dec 7 09:57:40 GMT 2011 >>> If my understanding of /proc/stat is coreect this still doesnt make any >>> sense but i may be wrong. >>> >>> Regards >>> >>> Wayne >>> >>> >>> >>> On 7 December 2011 09:37, Martin Pala <[email protected]> wrote: >>> >>>> Please can you check that your monit binary matches the system >>>> architecture? (i.e. for example 64-bit monit binary on 64-bit system - not >>>> 32-bit monit on 64-bit system) >>>> >>>> To verify provide please the output of following commands: >>>> 1.) uname -m >>>> 2.) file `which monit` >>>> >>>> Monit takes the statistics from the /proc/stat kernel interface. You >>>> can collect the statistics manually like this - for example to fetch the >>>> state in 1 second intervals (30 samples): >>>> >>>> $ for ((i=0; i<30; i++)); do date; grep "cpu " /proc/stat; sleep 1; >>>> done >>>> >>>> Note: monit takes the first /proc/stat line ("cpu") which contains the >>>> overall cpu usage in the system (summary of all cpus). The /proc/stat also >>>> contains per-cpu statistics if you want to collect all the statistics, >>>> replace the "grep 'cpu '" simply with "cat". >>>> >>>> Regards, >>>> Martin >>>> >>>> >>>> On Dec 7, 2011, at 10:04 AM, Lawrence, Wayne wrote: >>>> >>>> Hi Martin, >>>> >>>> I have tried various methods to dientify the cause of this and took >>>> your advice and used vmstat. I simply restarted the httpd process from the >>>> monit web interface while the comand was running and got the following >>>> warning. >>>> >>>> Description: cpu system usage of 50.0% matches resource limit >>>> [cpu system usage>30.0%] >>>> >>>> But vmstat doesnt show that level of usage at the point of alert. As >>>> you can see there is some usage in the 3rd line of the output when i >>>> restarted the httpd service but it doesnt seem enough to trigger an alert. >>>> >>>> vmstat 1 10 >>>> procs -----------memory---------- ---swap-- -----io---- --system-- >>>> -----cpu----- >>>> r b swpd free buff cache si so bi bo in cs us sy >>>> id wa st >>>> 0 0 0 859596 114684 856908 0 0 4 6 81 77 0 >>>> 0 99 0 0 >>>> 0 0 0 859448 114684 856916 0 0 0 0 100 94 1 0 >>>> 99 0 0 >>>> 0 0 0 898352 114692 815600 0 0 0 168 555 605 23 15 >>>> 61 1 0 >>>> >>>> Not sure if there are any other tests i can run to narrow this down a >>>> bit further as it still isn't making sense. >>>> >>>> Regards >>>> >>>> Wayne >>>> >>>> >>>> >>>> >>>> >>>> On 7 December 2011 08:27, Martin Pala <[email protected]> wrote: >>>> >>>>> Hi Lawrence, >>>>> >>>>> the test which triggers the alert is "system" cpu => it's the time the >>>>> system spend in kernel mode. The cpu usage could be triggered by some >>>>> background kernel task, to verify the monit report matches the system cpu >>>>> usage, you should use either "vmstat" or "top" instead of "ps". >>>>> >>>>> Best regards, >>>>> Martin >>>>> >>>>> >>>>> On Dec 6, 2011, at 1:19 PM, Lawrence, Wayne wrote: >>>>> >>>>> Hi Igor, >>>>> >>>>> the operating system is RHEL6 and monit version is 5.3.1 >>>>> >>>>> this is what i have in my config >>>>> >>>>> if cpu usage (user) > 70% then alert >>>>> if cpu usage (system) > 30% then alert >>>>> if cpu usage (wait) > 20% then alert >>>>> >>>>> this is one of the errors >>>>> Description: cpu system usage of 50.0% matches resource limit [cpu >>>>> system usage>30.0%] >>>>> >>>>> this is what i get in /var/log/messages >>>>> Dec 6 12:01:29 <hostname-removed> monit[864]: <hostname-removed> cpu >>>>> system usage of 50.0% matches resource limit [cpu system usage>30.0%] >>>>> Dec 6 12:02:29 <hostname-removed> monit[864]: >>>>> <hostname-removed><hostname-removed>' cpu system usage check succeeded >>>>> [current cpu system usage=0.9%] >>>>> >>>>> this is the output of ps --no-headers -A -o "%*cpu* sz ucomm" | sort >>>>> -k1nr | head -20 >>>>> >>>>> 12:01:29 up 4 days, 20:24, 2 users, load average: 0.04, 0.01, 0.00 >>>>> total used free shared buffers >>>>> cached >>>>> Mem: 2055108 1092176 962932 0 53156 >>>>> 811864 >>>>> -/+ buffers/cache: 227156 1827952 >>>>> Swap: 4128760 0 4128760 >>>>> 1.2 44308 perl >>>>> 0.0 0 aio/0 >>>>> 0.0 0 async/mgr >>>>> 0.0 0 ata/0 >>>>> 0.0 0 ata_aux >>>>> 0.0 0 bdi-default >>>>> 0.0 0 cpuset >>>>> 0.0 0 crypto/0 >>>>> 0.0 0 events/0 >>>>> 0.0 0 ext4-dio-unwrit >>>>> 0.0 0 flush-253:0 >>>>> 0.0 0 jbd2/dm-0-8 >>>>> 0.0 0 kacpi_hotplug >>>>> 0.0 0 kacpi_notify >>>>> 0.0 0 kacpid >>>>> 0.0 0 kauditd >>>>> 0.0 0 kblockd/0 >>>>> 0.0 0 kdmflush >>>>> 0.0 0 khelper >>>>> 0.0 0 khubd >>>>> >>>>> Have to say i am at a total loss as there is no way the usage figures >>>>> are accurate. >>>>> If there is any other info i can supply that will be useful please let >>>>> me know. >>>>> >>>>> Regards >>>>> >>>>> Wayne >>>>> >>>>> >>>>> On 6 December 2011 12:03, Igor Homyakov < >>>>> [email protected]> wrote: >>>>> >>>>>> Hi Lawrence, >>>>>> >>>>>> Could you be a little bit more specific ? Please provide information >>>>>> about you operation system, monit version on which the problem >>>>>> occurred and so on. >>>>>> >>>>>> Regards >>>>>> Igor Homyakov >>>>>> >>>>>> On Tue, Dec 6, 2011 at 15:35, Lawrence, Wayne >>>>>> <[email protected]> wrote: >>>>>> > Hi, >>>>>> > >>>>>> > I have a few CPU usage checks in my monitrc but it seems monit is >>>>>> > misreporting the usage. >>>>>> > >>>>>> > I have run several tests and it seems that monit is multiplying the >>>>>> actual >>>>>> > usage by 10. >>>>>> > >>>>>> > I ran a process with top running in another shell and CPU usage for >>>>>> the user >>>>>> > was never above 10% yet monit informed me that there was 100% cpu >>>>>> usage. >>>>>> > >>>>>> > I have tried various configurations including the one that came >>>>>> with the >>>>>> > default config for system cpu monitoring and all seem to >>>>>> demonstrate the >>>>>> > same issue. >>>>>> > >>>>>> > Any advice welcomed on this >>>>>> > >>>>>> > Regards >>>>>> > >>>>>> > Wayne Lawrence >>>>>> >>>>> >>> >>> -- >>> To unsubscribe: >>> https://lists.nongnu.org/mailman/listinfo/monit-general >>> >> >> > -- > To unsubscribe: > https://lists.nongnu.org/mailman/listinfo/monit-general > > > > -- > To unsubscribe: > https://lists.nongnu.org/mailman/listinfo/monit-general >
-- To unsubscribe: https://lists.nongnu.org/mailman/listinfo/monit-general
