I like the suggestions Matthias makes; they have worked well for us.
RRD updates are very expensive. I am pretty sure, without knowing anything more about your system, that the RRD writes are causing most of the I/O load.

Our current largest Nagios-based system has around 7500 hosts and around 40k active services spread across four pollers. The pollers send perfdata to two report servers that do nothing but host the databases for SNMPTT traps from the pollers, the RRD files / PNP web UI, and the server side of our client/server notification system. The SNMPTT DBs and the notification-server DBs are replicated master-master between the two hosts. Even with rrdcached and RAID 10, these hosts regularly show 3-10% I/O wait. We hope to lower that number a bit by moving the DBs onto separate dedicated DB hosts.
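Before you order hardware, it's worth confirming where the writes actually come from rather than guessing. A rough sketch, assuming the sysstat tools are installed (the per-process numbers also need task I/O accounting in the kernel, which older SUSE kernels may not have):

    # Per-device view: is the device under the Nagios filesystem the one
    # that's queuing? Watch the await and %util columns.
    iostat -x 5

    # Per-process write rates, where the kernel supports it: nagios itself
    # accounts for status.dat and the checkresults spool,
    # npcd/process_perfdata.pl for the RRD updates, and nsca for the
    # inbound spool writes.
    pidstat -d 5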
- Max

On 9/25/10, Matthias Flacke <[email protected]> wrote:
> On 9/25/10 2:30 PM, Frost, Mark {PBC} wrote:
>> Greetings, listers,
>>
>> We've got an ongoing issue with I/O contention. There's the obvious
>> problem that we've got a whole lot of things all writing to the same
>> partition. In this case there's just one big chunk of RAID 5 disk on a
>> single controller, so I don't believe that making more partitions is
>> going to help.
>>
>> On this same partition we have:
>>
>> 1) Nagios 3.2.1 running as the central/reporting server for a couple of
>> other Nagios nodes that are sending check results via NSCA.
>> Approximately 6-7K checks.
>>
>> 2) pnp4nagios 0.6.2 (with rrd 1.4.2) writing graph data.
>>
>> There's a second server configured identically to the first that acts
>> as a "hot spare", so it also receives check data from the two
>> distributed nodes and writes its own copy of the graph data locally.
>>
>> At the moment I'm concerned about the graph data, but because I can
>> only see I/O utilization as an aggregate, I can't tell what the worst
>> component on that filesystem is -- status.dat updates? Graph data?
>> Writes to the var/spool directory? We're also expecting continued
>> growth, so this is only going to get worse.
>>
>> These systems are quite lightly loaded from a CPU (2 dual-core CPUs)
>> and memory (4GB) perspective, but the I/O to the Nagios filesystem is
>> queuing now.
>>
>> We're about to order new hardware for these servers and I want to make
>> a reasonable choice. I'd like to make some reasonable changes without
>> requiring too exotic a setup. I believe these servers are currently
>> Dell 2950s and they're all running SUSE Linux 10.3 SP2.
>>
>> My first thought was to move the graphs to a NAS share, which would
>> shift that I/O to the network. I don't know how that would work,
>> though, and it would ultimately be an experiment.
>>
>> What experiences do people out there have handling this kind of I/O,
>> and what have you done to ease it?
>
> You didn't say how many of your checks create perfdata, but I assume
> that most of your disk I/O is related to RRD updates.
> rrdcached (see http://docs.pnp4nagios.org/pnp-0.6/rrdcached for the PNP
> integration) is a good means to collect multiple RRD updates and
> burst-write the RRD files.
>
> status.dat and the checkresults directory are always good candidates
> for a ramdisk, especially since they're volatile data. As a side note:
> status.dat on a ramdisk is a pure boost for the CGIs :).
> I know people who also store nagios.log on a ramdisk and regularly
> save it via rsync onto a hard disk.
>
> My own systems with ~4000 checks and ~20,000 performance-relevant data
> sets went down from 30% to less than 2% I/O wait with rrdcached and
> ramdisk use.
>
> Cheers,
> -Matthias
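P.S. Since this comes up regularly: the rrdcached and ramdisk setup
Matthias describes looks roughly like the sketch below on our report
servers. Paths, sizes and flush intervals are ours and purely
illustrative; the PNP integration page he links has the authoritative
settings.

    # Start rrdcached with a journal; updates are collected in memory and
    # each RRD is flushed at most every 30 minutes (tune -w/-z to taste).
    # You may also need -s/-m to make the socket writable by the PNP user.
    rrdcached -l unix:/var/run/rrdcached.sock \
              -j /var/cache/rrdcached/journal \
              -w 1800 -z 900 -p /var/run/rrdcached.pid

    # etc/process_perfdata.cfg -- tell PNP to write through the daemon
    RRD_DAEMON_OPTS = unix:/var/run/rrdcached.sock

For the ramdisk, a tmpfs mount plus two nagios.cfg changes is all it
takes (mount point and size are examples):

    # /etc/fstab
    tmpfs  /var/nagios/ramdisk  tmpfs  size=256m,mode=0755  0 0

    # nagios.cfg
    status_file=/var/nagios/ramdisk/status.dat
    check_result_path=/var/nagios/ramdisk/checkresults

If you put nagios.log up there as well, a cron entry along the lines of

    */5 * * * * rsync -a /var/nagios/ramdisk/nagios.log /usr/local/nagios/var/

keeps the loss window to a few minutes if the box reboots.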
