I checked the logs of the collectd instances with problems but there is not much in them, only a warning about a configuration block for a disabled plugin, which shows I have some cleaning to do after my tests :p For now I will follow your suggestion and increase the number of misses required before an alert is triggered, but as you said it is only a workaround. I still do not have any clue why this happens: the servers are not under heavy load and the network between them is fine (they are all hosted in the same datacenter, and when sniffing the network I can see the packets every 10s, it is just that some counters are missing from them...).
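In case it is useful to anyone following the thread, this is roughly what I plan to try once I have the patch applied. The Timeout option below is the one from Andrés's patch (so it is not in a stock threshold block), I am only guessing that it goes inside the <Type> block, and the warning/failure values are just placeholders from my own config:

    <Threshold>
      <Plugin "load">
        <Type "load">
          # From the patch: only report a value as missing after
          # 3 * Interval, i.e. 30s with my 10s interval.
          Timeout 3
          # Placeholder thresholds, not the point of the example.
          WarningMax 4.0
          FailureMax 8.0
        </Type>
      </Plugin>
    </Threshold>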
Interval is set to 10s on every node, and none of our monitored data requires action within the minute, so getting an alert about a missing machine only after it has been gone for 60s is acceptable. It still puzzles me and I really want to find a proper solution, but at least I can use alerts now ^^

I also forgot to mention in my first email that all servers are synchronised with an external time server (several servers in fact, in case of failure).

Thanks for your answer.

Julien A.

On 22 March 2010 22:02, Andrés J. Díaz <[email protected]> wrote:
> On Mon, 22 Mar 2010 15:52:19 +0100
> Schmurfy <[email protected]> wrote:
>
> > Hello,
>
> Hello Schmurfy :)
>
> > [...]
> > My problem is that alerts are triggered really often because either
> > the load or df plugin data are missing from the data received (it is
> > not always the same server, and when it happens it tends to happen
> > more than once). I tried different things to solve the problem:
>
> There are a number of reasons for your problem, the most common
> one being LAN latency. As you are checking for missing pings,
> I would think that there were previous problems in the network,
> innit? ;)
>
> > My last attempt at solving this was simply to check what was going on
> > on the network by putting a network sniffer on the central server
> > (ngrep). The result is that every 10s the collectd servers really do
> > send the data as they should, BUT not all of these packets contain
> > the load or df values (it may also happen with other fields but I did
> > not check every one of them). The load and df fields are sometimes
> > included in only one packet out of three, meaning they are sent every
> > 30s instead of 10s, but most of the time they are simply sent half of
> > the time, so every 20s.
>
> I assume that all clients are reporting at the same interval. Do you
> have any other weird messages in the collectd log? In some situations
> (when a plugin fails) the collection of data can be stopped for a
> while, but in those cases you will see the corresponding message in
> the collectd log.
>
> Also be careful with the date and time on the nodes. I had some
> problems in the past with non-synchronised hosts; in those cases RRD
> often fails before the thresholds do, and the log errors are very
> verbose.
>
> > The strange thing is that the ping plugin never raises a "data is
> > missing" alert and only triggers an alert when it should. I really
> > feel lost on that :\
>
> This is actually not that unusual, because the ping plugin is
> dispatched by the server while the problem appears to be in the
> communication with the clients, so I think we can discard network
> related problems... :/
>
> On the other hand you have an easy workaround for thresholds (of
> course this is not a solution). I recently posted a patch to change
> the timeout used in thresholds, so you can use:
>
>     Timeout 3
>
> to increase the checking time for missing values to 30s (3 intervals).
>
> Regards,
> Andres
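P.S. For completeness, in case someone wants to reproduce this, the node side of my setup is basically the stock network plugin with a 10s interval, something along these lines (the server address here is made up, and the plugin list is trimmed to the plugins mentioned above):

    Interval 10

    LoadPlugin load
    LoadPlugin df
    LoadPlugin network

    <Plugin "network">
      # Hypothetical address of the central collectd server, default port.
      Server "192.0.2.10" "25826"
    </Plugin>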
