Various reasons; mainly, we have a lot of Nagios experience and infrastructure. The questions with all monitoring (in this case monitoring meaning "if an application/service dies, take action") are often more about escalation-path policy and documentation than about the mechanics of how you check that things are alive, i.e. "What do we do if we get an SMS at 3am about a delayed job failure?", assuming the sysadmin who receives the message has had very limited exposure to this server or to delayed job.

We have had a lot of activity with systems such as SMS gateway servers, and we tend to lean towards end-to-end tests. In the case of SMS gateways, we send an SMS to and from a GSM device that is physically plugged into our infrastructure. Note that in BAU (business as usual) we are not sending messages out via the GSM device; it's just for testing. All we do is check for the SMS message on the GSM device every few minutes. If a message has not arrived within 10 minutes, we know there is a problem and pagers start beeping. This could be anything from the SMSC connection having died, to a power cut, to the GSM device having fried itself (all valid reasons for a sysadmin to get involved).

We haven't used monit, but it looks to me like more of a solution that will try to "solve" the failure, i.e. restart apache/postgres if the system goes away. This is fine, but what if _this_ process (i.e. the monit script) fails? How do you notice? At what stage, and by what means, should an actual person get involved?

It's certainly horses for courses with this sort of stuff. The main thing is that you understand your own approach and that it makes sense to all involved.

On 30/08/11 13:32, Samuel Richardson wrote:
> Any reason you didn't go with monit to keep delayed job running?
>
> Samuel Richardson
> www.richardson.co.nz | 0405 472 748
>
>
> On Tue, Aug 30, 2011 at 1:25 PM, Andrew Boag
> <[email protected]> wrote:
>
> We use delayed job and it's a very powerful bit of kit. We also
> moved from a cron-based system of dealing with regular tasks.
>
> The one thing to bear in mind is, as with all daemons, you need to
> make sure that delayed_job itself is still running (it might
> explode). You can use God for this but there is always the chance
> that this itself will fail (which has happened to us).
>
> Our approach was to regularly put a job (use cron for this) in the
> delayed_job queue that touches a file somewhere in /var/run or
> /tmp ... this way you can easily set up a nagios check on the
> machine to test if the file is getting created (and hence, delayed
> job is processing the jobs in the queue).
>
> Just something that we discovered in our travels ...
>
> On 30/08/11 10:14, Michael Pearson wrote:
>> Hi,
>>
>> I just posted this to serverfault[0] and was hoping that somebody
>> here had solved the problem with a nice drop-in Ruby-friendly
>> solution:
>>
>> We're using cron to manage our backups and other jobs in
>> multiple locations. Using chef to populate files in
>> cron.daily, cron.hourly, etc has worked pretty well for us so
>> far, but with some issues:
>>
>> o I don't want to have to manage a mailserver on the
>>   system just to receive cron output
>> o I want to be able to put output in my cron jobs
>>   without receiving email about them if nothing went wrong
>> o I don't want to have to check /var/log/messages to
>>   see if jobs failed without output
>> o I don't want to have to log in to the system to find
>>   that the backup job is still running
>>
>> Optimally, I'd like a web-based frontend that I can use to
>> see this information, either as an extension on cron or a
>> complete replacement.
>>
>> I can solve the above problems myself with a bit of
>> scripting, but I'm sure that this is a problem that others
>> have solved already.
>>
>> Note that I acknowledge that this is a completely separate
>> issue from verifying the backups after they've been completed.
>>
>> [0]
>> http://serverfault.com/questions/306249/what-alternatives-are-there-to-cron-for-running-scheduled-jobs
>>
>> --
>> Michael Pearson
>> The Bon Scotts; http://www.thebonscotts.com
>>
>> --
>> You received this message because you are subscribed to the
>> Google Groups "Ruby or Rails Oceania" group.
>> To post to this group, send email to
>> [email protected].
>> To unsubscribe from this group, send email to
>> [email protected].
>> For more options, visit this group at
>> http://groups.google.com/group/rails-oceania?hl=en.
>
>
> --
>
> ----
>
> Andrew Boag - Director
> Catalyst IT
> [email protected]
>
> mob: +61 421 528 125
> ddi: +61 2 8002 1758
>
> www.catalyst-au.net

--

----

Andrew Boag - Director
Catalyst IT
[email protected]

mob: +61 421 528 125
ddi: +61 2 8002 1758

www.catalyst-au.net
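
A minimal sketch of the heartbeat-file check described in the quoted message (a cron-queued delayed_job touches a file, and a Nagios check alerts when that file goes stale). It is written in Python purely for illustration and is not the setup used by anyone in this thread: the file path, the thresholds and the function names are all hypothetical. The only convention it relies on is the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import os
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical defaults; neither the path nor the thresholds come from the thread.
HEARTBEAT_PATH = "/tmp/delayed_job_heartbeat"
WARN_SECONDS = 600    # warn if the file has not been touched for 10 minutes
CRIT_SECONDS = 1200   # go critical after 20 minutes


def check_heartbeat(path, warn, crit, now=None):
    """Return (exit_code, message) based on how long ago `path` was last touched."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        # A missing file means the heartbeat job has never run (or the file was removed).
        return CRITICAL, f"CRITICAL: heartbeat file {path} is missing or unreadable"
    if age >= crit:
        return CRITICAL, f"CRITICAL: heartbeat is {age:.0f}s old (limit {crit}s)"
    if age >= warn:
        return WARNING, f"WARNING: heartbeat is {age:.0f}s old (limit {warn}s)"
    return OK, f"OK: heartbeat is {age:.0f}s old"


def main(argv):
    """Print the one-line status Nagios expects and return the exit code."""
    path = argv[1] if len(argv) > 1 else HEARTBEAT_PATH
    code, message = check_heartbeat(path, WARN_SECONDS, CRIT_SECONDS)
    print(message)
    return code
```

A wrapper script would finish with `sys.exit(main(sys.argv))` so that Nagios sees the exit code. Because the check only looks at the file's age, it says nothing about who touched the file; it is only meaningful if the file is touched exclusively by a job that went through the delayed_job queue.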
