Hope this helps someone else. I came in today and discovered that one of our check_mk Windows agents was giving 'tcp connection refused'. The check_mk_agent service in CMK itself was showing CRIT, and all other related checks on that Windows host were stale. No notifications had been sent out - still have to dig into why.
Here is my rough troubleshooting flow:

- Run manual check
- Restart Windows agent
- Telnet host 6556 from my workstation (should normally work)
  - it works
- Check the port from the OMD server
  - a few ways to test
  - port is responsive
- We are running 1.11 (updated earlier this week for cmk BI features), so perhaps the agent needs updating
- Updated agent from 1.24p2 to 1.24p3, no change
- Noticed the service is not stopping correctly, have to kill the process
- Double-checked the configuration, cleaned out extraneous stuff
- Lots of googling later...
- Ran netstat -anb | find /i "6556" on the troubled Windows box
- I see 'CLOSE_WAIT' a number of times
- Restart the service (kill, start), see LISTENING
- Run manual check, still timing out
- CLOSE_WAIT showing up again
- Rebooted the Windows server (cuz, ya know)
- No change
- Started the check_mk_agent service, then from cmd: check_mk_agent.exe test
- It was hanging on a particular check
- Commented out the related checks
- Everything returned to normal

The check was a 'cscript script.vbs' that normally outputs appropriate Nagios-readable service data. The host runs about 30 of these checks, and all of them work fine except for a select few. Those few were getting 'server does not exist' errors because a VPN tunnel had crashed and not come back up.

Still not certain if this is a cmk agent bug, or if we just need to put better error handling into our vbs code. (It's vbs because legacy and time.)

Chris
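
P.S. On the 'better error handling' question: one possible approach is to run each check through a wrapper with a hard timeout, so that a script stuck on an unreachable server yields an UNKNOWN result instead of hanging the agent. Below is a minimal Python sketch of that idea, not a tested fix. The function name run_check, the service name Legacy_VBS_Check and the 10 second default are made up for illustration, and it assumes the checks are Check_MK local checks that print one 'status name perfdata text' line (status 3 = UNKNOWN); it would need adjusting if they run through MRPE instead.

```python
import subprocess
import sys


def run_check(command, service_name, timeout_s=10.0):
    """Run one check command and always return a status line, even if it hangs.

    Assumption: the command prints a Check_MK local-check line on stdout.
    """
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child process before raising this.
        return f"3 {service_name} - UNKNOWN: check timed out after {timeout_s:g}s"
    except OSError as exc:
        # Raised when the command itself cannot be started (e.g. not found).
        return f"3 {service_name} - UNKNOWN: could not start check: {exc}"

    output = result.stdout.strip()
    if not output:
        return (
            f"3 {service_name} - UNKNOWN: check produced no output "
            f"(exit code {result.returncode})"
        )
    return output


if __name__ == "__main__":
    # Hypothetical script and service names, for illustration only.
    print(run_check(["cscript", "//NoLogo", "script.vbs"], "Legacy_VBS_Check"))
    sys.stdout.flush()
```

cscript also has its own //T:nn option to cap a script's run time in seconds, which may be enough on its own if a wrapper feels like overkill.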
