Hi everyone.

I'm running collectd 5.1.0 in a client/server setup, with a central monitoring 
server reading from the network plugin, writing out via write_graphite, and 
also exposing UnixSock for nagios.  Everything works fine until the graphite 
server gets overloaded and becomes unresponsive.  Several seconds after that, 
the collectd server starts dropping data points from the cache, causing nagios 
to emit a ton of spurious pages.
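
For anyone who wants to watch the values disappear without going through 
nagios, here is a rough sketch that talks to the unixsock socket directly.  It 
assumes the plain-text LISTVAL reply is a status line ("N Values found") 
followed by one "timestamp identifier" line per value, and that the socket 
lives at /var/run/collectd-unixsock (adjust to your SocketFile setting).  The 
function names and the age threshold are made up for illustration.

```python
import socket


def parse_listval(response):
    """Parse a LISTVAL reply into (last_update, identifier) pairs."""
    lines = response.splitlines()
    if not lines:
        raise ValueError("empty response")
    # The status line starts with the number of values; negative means error.
    count = int(lines[0].split(" ", 1)[0])
    if count < 0:
        raise ValueError("collectd reported an error: " + lines[0])
    entries = []
    for line in lines[1:1 + count]:
        timestamp, identifier = line.split(" ", 1)
        entries.append((float(timestamp), identifier))
    return entries


def stale_identifiers(entries, now, max_age):
    """Return identifiers whose last update is older than max_age seconds."""
    return [ident for ts, ident in entries if now - ts > max_age]


def query_listval(socket_path="/var/run/collectd-unixsock"):
    """Send LISTVAL to the unixsock plugin and return the raw reply text.

    The default path is an assumption; use whatever SocketFile points to.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(b"LISTVAL\n")
        reader = sock.makefile("r")
        status = reader.readline()
        count = int(status.split(" ", 1)[0])
        body = "".join(reader.readline() for _ in range(max(count, 0)))
    return status + body
```

Running stale_identifiers(parse_listval(query_listval()), time.time(), 30) in a 
loop (after importing time) while carbon is stopped should show which 
identifiers stop being updated, and roughly when.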

Here's a sample of the log output from when I forced it to fail by stopping the 
carbon server:
   collectd[11038]: write_graphite plugin: send failed with status -1 (Broken pipe)
   collectd[11038]: write_graphite plugin: error with wg_send_message
   collectd[11038]: write_graphite plugin: Connecting to graphite.xxxxx.xxx:2003 failed. The last error was: Connection refused

My admittedly weak understanding is that the cache insert happens before the 
write plugins (based on 
https://collectd.org/wiki/index.php/Chains#Pre-_and_post-cache_chains), so 
failing to write shouldn't stop values from being stored in the cache.  I've 
tried a number of tricks to get it to keep the values, like switching back to 
the old python plugin or writing a "null" plugin that always returns 
successfully and runs alongside write_graphite.  I'm starting to go down the 
road of terrible hacks to work around this, but there's probably something 
fundamental I'm getting wrong.

My entire collectd.conf contains:

    Hostname "monitoring.xxxxx.xxx"
    FQDNLookup true
    BaseDir "/var/lib/collectd"
    PluginDir "/usr/lib/collectd"
    TypesDB "/usr/share/collectd/types.db", "/usr/share/collectd/firefall.types.db"
    Interval 10
    ReadThreads 5

    Include "/etc/collectd/plugins/*.conf"
    Include "/etc/collectd/thresholds.conf"

The config for write_graphite has:

    LoadPlugin write_graphite

    <Plugin "write_graphite">
      <Carbon>
        Host "graphite.xxxxx.xxx"
        Port 2003
        Storerates true
      </Carbon>
    </Plugin>

There are other config files for various read plugins, but I doubt they're 
relevant.  I haven't upgraded to a more recent version yet, mostly because 
nothing in the changelogs seemed related to this.  I was hoping this behavior 
is something that's been seen before and that there's a known solution.

Any ideas?  I'm happy to supply more information if needed.