On Mon, 22 Mar 2010 15:52:19 +0100 Schmurfy <[email protected]> wrote:
> Hello,

Hello Schmurfy :)

> [...]
> My problem is that alerts are triggered really often because either
> load or df plugin data are missing for data received (it is not
> always the same server and when it happens it tends to do it more
> than once), I tried different things to solve the problems:

There are a number of possible reasons for your problem; the most
common one is LAN latency. Since you are checking for missing pings, I
would guess there were earlier problems in the network, right? ;)

> My last attempt at solving this was simply to check what was going on
> the networks by putting a network sniffer on the central server
> (ngrep), results are that every 10s the collectd servers really send
> the data as it should BUT not all of these packets contains the load
> or df value (it may also happen with other fields but i did not
> checked every one of them), the load and df fields can sometimes be
> included in 1/3 packets meaning it is sent every 30s instead of 10s
> but most of the time it is simply sent half of the time so every 20s.

I assume that all clients are reporting at the same interval. Do you
see any other odd messages in the collectd log? In some situations
(when a plugin fails) data collection can stop for a while, but in
those cases you should see a corresponding message in the collectd log.

Also be careful with the date and time on the nodes. I had some
problems in the past with non-synchronized hosts; in those cases RRD
often fails before the thresholds do, and the errors in the log are
very verbose.

> The strange things is that the ping plugin never raise a "data are
> missing" alert and only triggers an alert when it should, I really
> feel lost on that :\

This is not really unusual, because the ping plugin is dispatched by
the server and the problem appears to be in the communication with the
clients, so I think that we can discard network related problems... :/

On the other hand, you have an easy workaround for thresholds (of
course this is not a solution). I recently posted a patch to change the
timeout used in thresholds, so you can use:

    Timeout 3

to increase the checking time for missing values to 30s (3 intervals).

Regards,
Andres

_______________________________________________
collectd mailing list
[email protected]
http://mailman.verplant.org/listinfo/collectd
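
The effect of the `Timeout 3` workaround can be illustrated with a small Python sketch. It is not collectd code: the function `is_missing`, its parameters, and the strict "greater than" comparison are assumptions made for illustration only. It shows why values that arrive every 20s or 30s would no longer count as missing once the check window is 3 intervals of 10s.

```python
def is_missing(last_update: float, now: float, interval: float, timeout: int) -> bool:
    """Hypothetical check: a value counts as missing when no update
    arrived within `timeout` reporting intervals (assumed semantics)."""
    return (now - last_update) > interval * timeout


INTERVAL = 10  # seconds, the reporting interval mentioned in the thread

# Gaps between load/df updates as observed with ngrep: 10s, 20s or 30s.
for gap in (10, 20, 30):
    # A window of 1 interval is a hypothetical comparison value,
    # 3 is the value suggested in the reply.
    print(gap, is_missing(0, gap, INTERVAL, 1), is_missing(0, gap, INTERVAL, 3))
```

Under these assumptions, a 30s window tolerates the 20s and 30s gaps seen in the capture; it only hides the symptom and does not explain why the values are sent less often.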
