I will fix it :P

On Sat, Mar 31, 2012 at 10:41 AM, Daniel Zahn <[email protected]> wrote:
> Hi,
>
> you may have noticed that Labs Nagios reports "Puppet freshness" as
> CRIT for all instances.
>
> First a little background on that; the one problem that is still
> left is at the end, if you want to skip the technical details.
>
> In production these checks are implemented as "passive checks" (stuff
> gets reported TO Nagios instead of Nagios asking remote hosts).
>
> Passive checks, while more complex to set up, have the general
> advantage that Nagios just needs to passively sit there and receive
> results from hosts instead of opening connections
> to all hosts all the time. This can be implemented e.g. via the NSCA
> (Nagios Service Check Acceptor) daemon or via SNMP.
>
> While we currently use both methods in production, the puppet
> freshness check is implemented via SNMP traps.
>
> On the client (instance) side we have this, puppetized in base.pp:
>
> exec { "puppet snmp trap":
> .. command => "snmptrap -v 1 ..etc...
>
> This lets all puppet agents execute snmptrap after a puppet run.
> snmptrap takes arguments including the SNMP community string "public",
> an SNMP OID, and the Nagios hostname, and actively sends the trap to
> Nagios.
>
> One of the reasons for this to fail was the hostname being hardcoded
> to "nagios.wikimedia.org".
>
> So in base.pp I added an 'if $realm == "labs"' and turned the hostname
> into ${nagios_host} to set it to just "nagios" for labs. After that I
> could see incoming traps on the Nagios host, using "tcpdump port 162".
> (gerrit change 3988)
>
> On the server / Nagios side there are (snmpd), snmptrapd and snmptt. The
> configs we use for this are in /etc/snmp/ (/files/snmp in puppet).
> snmptrapd is the one listening for the incoming traps; it is configured
> to then call "snmptt" as the "traphandle default".
> snmptt then uses "EXEC
> /usr/local/nagios/libexec/eventhandlers/submit_check_result".
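
A side note on the submit_check_result step, since it is the last hop before Nagios: as far as I understand it, what Nagios finally receives is a single line in its external command format, PROCESS_SERVICE_CHECK_RESULT. Below is a minimal Python sketch of that line format only. The function names, the example host "venus", the plugin output "puppet ran" and the command_file argument are my assumptions for illustration; this is not the actual script.

```python
import time


def format_passive_result(host, service, return_code, output, now=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT line for the Nagios command file.

    return_code follows the usual plugin convention:
    0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
    """
    timestamp = int(time.time() if now is None else now)
    return "[{0}] PROCESS_SERVICE_CHECK_RESULT;{1};{2};{3};{4}\n".format(
        timestamp, host, service, return_code, output
    )


def submit(line, command_file):
    """Write the line to the command file (a named pipe that Nagios reads).

    command_file is a hypothetical parameter here; pass whatever path the
    local Nagios install is configured to use.
    """
    with open(command_file, "w") as pipe:
        pipe.write(line)


if __name__ == "__main__":
    # Only prints the line; nothing is written to a real command file.
    print(format_passive_result("venus", "Puppet freshness", 0, "puppet ran"), end="")
```

The host field in that line has to match a host name Nagios knows, which is worth keeping in mind when reading Nagios warnings about unknown hosts.
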
>
> submit_check_result is a Nagios command that "fakes" a check result on
> the Nagios host itself; it finally writes to the "nagios.cmd" command
> file, which is a named pipe.
> Once Nagios sees this coming in you can see "PASSIVE SERVICE CHECK"
> result lines in tail -f nagios.log.
>
> The next problem was that the path to this Nagios command file differed
> from production. In ./eventhandlers/submit_check_result I changed the
> CommandFile path (/var/log/nagios in prod vs. /var/lib/nagios3 in
> labs). I would like it if we could use the same paths as in
> production for the Nagios configs to avoid these manual fixes.
>
> But this wasn't it yet, so I compared the running snmp* processes to
> production. Though snmptt was running fine, it turned out snmptrapd was
> either not running or running with different options; I am not 100%
> sure anymore. Anyway, once I started it as seen on spence:
> /usr/sbin/snmptrapd -On -Lsd -p /var/run/snmptrapd.pid
> I could finally see incoming check results in nagios.log.
> (Petan, thanks for setting those up, but maybe you want to check
> those options. I just started that _manually_, so we should test how
> it looks after a reboot.)
>
> Now there is just a tiny problem left :P The hostnames mismatch. So
> Nagios gets all the results, but in nagios.log you will see these:
>
> Warning: Passive check result was received for service 'Puppet
> freshness' on host 'i-000000f8', but the host could not be found
>
> This is why: The full command the instances use to send out the traps
> is: command => "snmptrap -v 1 -c public ${nagios_host}
> .1.3.6.1.4.1.33298 `hostname` 6 1004 `uptime | awk '{
> split(\$3,a,\":\"); print (a[1]*60+a[2])*60 }'`",
>
>
> See how `hostname` is being used in there. This simply works in
> production because production hosts return the same string for
> hostname that Nagios uses to define the hosts it knows about. On labs
> though, hostname returns the resource name (e.g. i-000000f8), while
> Nagios uses the "nice" instance names (e.g. "venus", "wikistats-01"
> etc.)
>
> So the options were: give me a command that returns the instance name
> on an instance itself (as opposed to asking the controller) OR change
> Nagios to use the resource names as hostnames. Since I don't think we
> really want Nagios to report that "i-000000f8 is DOWN", I tried adding
> the other name as an alias to a Nagios host definition. This didn't
> work either though; Nagios does not appear to match against the host
> aliases here.
>
> So when trying to find out if it is even possible for an instance to
> know its own instance name with a local command, Andrew Bogott
> pointed me to this (thanks! :):
>
> http://aws.amazon.com/code/1825 (EC2 Instance Metadata Query Tool),
> quoting Andrew: "labs runs on openstack which is theoretically
> API-compatible with Amazon's EC2. Hence that being an amazon page."
>
> That looked really promising so I tested it on an instance, and indeed
> it does work and can return all kinds of info.
> Try "./ec2-metadata --all" after just wget'ing it and making it executable.
>
> Among these are:
>
> instance-id: i-000000ea
> local-hostname: i-000000ea
> public-hostname: i-000000ea
> public-ipv4: 208.80.153.223
>
> but unfortunately I still don't see the "hostname" we want. :/
>
> So if you have an idea how to get that nice hostname from the
> instance itself, please tell me about it, or feel free to just add the
> final fix in:
>
> base.pp (test branch) lines 93 - 100. It needs to keep using
> `hostname` in production, replaced by _something_ else if $realm is
> labs.
>
> Regards,
>
> --
> --
> Daniel Zahn <[email protected]>
>
> _______________________________________________
> Labs-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/labs-l
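
One more note on the uptime part of the quoted trap command, since the awk one-liner is easy to misread: it takes the third whitespace-separated field of the uptime output, splits it on ":", and prints (hours*60 + minutes)*60, i.e. seconds. Here is a rough Python sketch of the same arithmetic, assuming awk's usual rule of using the leading digits of a string as its number. The function names and the sample uptime lines in the usage notes are made up for illustration.

```python
import re


def awk_number(text):
    """Mimic awk's string-to-number conversion for plain integers:
    use the leading digits, and treat anything else as 0."""
    match = re.match(r"\s*(\d+)", text)
    return int(match.group(1)) if match else 0


def trap_uptime_value(uptime_output):
    """Reproduce: awk '{ split($3,a,":"); print (a[1]*60+a[2])*60 }'"""
    fields = uptime_output.split()
    third = fields[2] if len(fields) > 2 else ""   # awk's $3
    parts = third.split(":")
    hours = awk_number(parts[0])                   # a[1]
    minutes = awk_number(parts[1]) if len(parts) > 1 else 0  # a[2]
    return (hours * 60 + minutes) * 60


if __name__ == "__main__":
    # Hypothetical uptime line, not taken from a real instance.
    print(trap_uptime_value(" 10:41:02 up  3:25,  2 users,  load average: 0.00, 0.01, 0.05"))
```

With a made-up line like "10:41:02 up 3:25, 2 users, ..." this gives 12300, which is 3h25m in seconds. If the host has been up for days ("up 5 days, 3:25, ..."), the third field is just "5" and the result is 18000, so the value is only a real uptime in seconds when that field has the H:MM form.
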
