Hi, in our network we see ~60 failed updates per client each day.
Logs from collectd (running on "logger3") which is running the network plugin as server: > Dec 2 08:16:49 logger3 collectd[7186]: rrdcached plugin: rrdc_update > (/var/lib/collectd/rrd/web4.example.org/interface-internal0/if_packets.rrd, > [1385968599:63072:373977], 1) failed with status -1. Logs from strace attached to rrdcached (also running on "logger3"; I am just including the parts containing "if_packets.rrd" from "web4"): > 7203 08:16:49.168136 read(3, "update > /var/lib/collectd/rrd/web4.example.org/interface-internal0/if_packets.rrd > 1385968599:63072:373977\n", 8192) = 111 > 7203 08:16:49.168194 write(3, "-1 illegal attempt to update using time > 1385968599.000000 when last update time is 1385968609.000000 (minimum one > second step)\n", 127) = 127 Going backwards in logs (reversed order!): > 7203 08:16:49.164592 write(8, "date > /var/lib/collectd/rrd/web4.example.org/ethstat-internal0/derive-tx_flow_control_xoff.rrd > 1385968609:0\nwrote > /var/lib/collectd/rrd/web4.example.org/ethstat-internal0/derive-tx_restart_queue.rrd\nupdate > /var/lib/collectd/rrd/web4.example.org/ethstat-internal0/derive-rx_smbus.rrd > 1385968609:103\nupdate > /var/lib/collectd/rrd/web4.example.org/ethstat-internal0/derive-rx_csum_offload_errors.rrd > 1385968609:0\nupdate > /var/lib/collectd/rrd/web4.example.org/ethstat-internal0/derive-tx_smbus.rrd > 1385968609:0\nupdate > /var/lib/collectd/rrd/web4.example.org/ethstat-internal0/derive-corr_ecc_errors.rrd > 1385968609:0\nupdate > /var/lib/collectd/rrd/web4.example.org/ethstat-internal0/derive-uncorr_ecc_errors.rrd > 1385968609:0\nupdate > /var/lib/collectd/rrd/web4.example.org/ethstat-internal0/derive-rx_dma_failed.rrd > 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/load/load.rrd > 1385968609:0.000000:0.010000:0.050000\nupdate /var/lib/collectd/rrd/web4.examp le.org/iptables-filter-net2fw/ipt_bytes-Blacklist.rrd 1385968609:6768\nupdate /var/lib/collectd/rrd/web4.example.org/iptables-filter-net2fw/ipt_packets-Blacklist.rrd 1385968609:130\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-internal0/derive-dropped_smbus.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_packets.rrd 1385968609:59694\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_packets.rrd 1385968609:28090\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_bytes.rrd 1385968609:4673356\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_bytes.rrd 1385968609:4341247\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_broadcast.rrd 1385968609:31269\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_broadcast.rrd 1385968609:1\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_multicast.rrd 1 385968609:12\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_multicast.rrd 1385968609:7\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_errors.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_errors.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_dropped.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-multicast.rrd 1385968609:12\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-collisions.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_length_errors.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_over_errors.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/interface-internal0/if_octets.rrd 1385968609:61081416:457187844\nupdate /var/lib/collectd/rrd/web4.example.org/interface-interna l0/if_errors.rrd 1385968609:0:0\nupdate /var/lib/collectd/rrd/web4.example.org/interface-external0/if_octets.rrd 1385968609:4673356:4341247\nupdate /var/lib/collectd/rrd/web4.example.org/interface-external0/if_errors.rrd 1385968609:0:0\nupdate /var/lib/collectd/rrd/web4.example.org/interface-internal0/if_packets.rrd 1385968609:63073:373985\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_no_buffer_count.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_crc_errors.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/interface-external0/if_packets.rrd 1385968609:59694:28090\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_aborted_errors.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_frame_errors.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_c", 4096) = 4096 > 7203 08:16:49.163859 write(3, "0 errors, enqueued 1 value(s).\n", 31) = 31 > 7203 08:16:49.163806 read(3, "update > /var/lib/collectd/rrd/web4.example.org/interface-internal0/if_packets.rrd > 1385968609:63073:373985\n", 8192) = 111 > 22310 08:16:29.171935 write(8, "nal0/derive-tx_heartbeat_errors.rrd > 1385968589:0\nupdate > /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_abort_late_coll.rrd > 1385968589:0\nupdate > /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_single_coll_ok.rrd > 1385968589:0\nupdate > /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_multi_coll_ok.rrd > 1385968589:0\nupdate > /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_long_length_errors.rrd > 1385968589:0\nupdate > /var/lib/collectd/rrd/web4.example.org/ethstat-internal0/derive-rx_short_length_errors.rrd > 1385968589:0\nupdate > /var/lib/collectd/rrd/web4.example.org/ethstat-internal0/derive-rx_flow_control_xon.rrd > 1385968589:0\nupdate > /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_restart_queue.rrd > 1385968589:0\nupdate > /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_short_length_errors.rrd > 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example. org/ethstat-external0/derive-tx_timeout_count.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-internal0/derive-rx_errors.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_align_errors.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_flow_control_xoff.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_tcp_seg_failed.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_flow_control_xon.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_tcp_seg_good.rrd 1385968589:66\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_flow_control_xon.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_csum_offload_good.rrd 1385968589:19609\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_c sum_offload_errors.rrd 1385968589:16\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_flow_control_xoff.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_header_split.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-alloc_rx_buff_failed.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-dropped_smbus.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_dma_failed.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_dma_failed.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_smbus.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_smbus.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-uncorr_ecc_errors.rrd 1385968589:0\nupdate /var/lib/collect d/rrd/web4.example.org/ethstat-external0/derive-rx_hwtstamp_cleared.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/iptables-filter-net2fw/ipt_packets-Blacklist.rrd 1385968589:130\nupdate /var/lib/collectd/rrd/web4.example.org/iptables-filter-net2fw/ipt_bytes-Blacklist.rrd 1385968589:6768\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-corr_ecc_errors.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/interface-internal0/if_packets.rrd 1385968589:63071:373969\nupdate /var/lib/collectd/rrd/web8.example.org/swap/swap-used.rrd 1385968579:0.000000\nupdate /var/lib/collectd/rrd/web8.example.org/swap/swap-free.rrd 1385968579:4294963200.000000\nupdate /var/lib/collectd/rrd/web8.example.org/swap/swap-cached.rrd 1385968579:0.000000\nupdate /var/lib/collectd/rrd/web8.example.org/swap/swap_io-in.rrd 1385968579:0\nwrote /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-corr_ecc_errors", 4096) = 4096 > 7203 08:16:29.171206 write(3, "0 errors, enqueued 1 value(s).\n", 31) = 31 > 7203 08:16:29.171154 read(3, "update > /var/lib/collectd/rrd/web4.example.org/interface-internal0/if_packets.rrd > 1385968589:63071:373969\n", 8192) = 111 Looks like rrdcached didn't received data for "if_packets.rrd" for "web4" on "08:16:39" but received two updates for this file on "08:16:49" (the first update (08:16:49.163806) seems to be the "regular" expected update; the seconds update (08:16:49.168136) seems to be the missing data from "08:16:39"). The question is "WHY"? Is it a time problem? ntp logs from logger3 (which is also the only time source for web4): > Dec 2 05:55:17 logger3 ntpd[9910]: adjusting local clock by 0.000272s > [...] > Dec 2 08:08:00 logger3 ntpd[9911]: reply from 192.53.103.103: offset > 0.000466 delay 0.024575, next query 1526s > Dec 2 08:16:10 logger3 ntpd[9911]: reply from 131.188.3.222: offset 0.000921 > delay 0.020624, next query 1594s > Dec 2 08:23:27 logger3 ntpd[9911]: reply from 195.145.119.188: offset > -0.003621 delay 0.006547, next query 161 > [...] > Dec 2 08:57:20 logger3 ntpd[9910]: adjusting local clock by 0.000502s ntp logs from web4: > Dec 2 07:23:26 web4 ntpd[11310]: adjusting local clock by 0.000297s > Dec 2 07:50:20 web4 ntpd[11311]: reply from logger3: offset -0.000051 delay > 0.000207, next query 1566s > Dec 2 08:16:26 web4 ntpd[11311]: reply from logger3: offset -0.000069 delay > 0.000172, next query 1540s > Dec 2 08:42:06 web4 ntpd[11311]: reply from logger3: offset -0.000077 delay > 0.000185, next query 1508s > Dec 2 09:07:14 web4 ntpd[11311]: reply from logger3: offset 0.000417 delay > 0.000179, next query 1534s > Dec 2 09:32:48 web4 ntpd[11311]: reply from logger3: offset 0.000419 delay > 0.000227, next query 1560s > Dec 2 09:32:48 web4 ntpd[11310]: adjusting local clock by -0.000046s No changes to the clock are logged, offset before and after 08:16 looks fine, so time should be OK, no? web4 is just an example. We see this for *every* client. Also, the plugins/files will vary. Well yes, 86400 seconds / 10 seconds interval * 8 plugins = 69120 updates per 24h per client So if only 60 updates per client and day will fail this isn't very much I guess. But I want to know if this is "normal" or not. Also: Would you recommend to run the TimeDiff plugin on the "server" to prevent this? We are running collectd v5.4.0 on server and client. The collectd server (which is acting as time source for the clients) currently receives data from 10 clients. All servers are located in the same rack, they are connected through an internal HP ProCurve Switch 2824. Any feedback/comment is appreciated. Thanks. -Thomas _______________________________________________ collectd mailing list [email protected] http://mailman.verplant.org/listinfo/collectd
