I've been messing about with an "sFlow to RRD" utility that takes interface counter samples from sFlow PDUs and shoves them into an RRD (links to sFlow and RRD below). I face the age-old "He with more than one clock does not know the time" dilemma - I have the time the sFlow PDU arrived at my collector (obtained by asking the kernel/stack for message arrival timestamping via SO_TIMESTAMP, not by calling gettimeofday() after a recvfrom()). There is also the "sysUpTime" field of the sFlow PDU header, which is milliseconds since the sFlow agent started.
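In case it matters, the collector grabs that arrival time roughly like this - a minimal sketch, not the actual utility, with error handling mostly stripped out and an already-created-and-bound UDP socket "fd" assumed:

/* Ask the kernel to timestamp incoming datagrams, and pull that
   timestamp back out of the recvmsg() ancillary data. */
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Done once, right after the socket is created and bound. */
static void enable_rx_timestamps(int fd)
{
    int on = 1;
    setsockopt(fd, SOL_SOCKET, SO_TIMESTAMP, &on, sizeof(on));
}

/* Like recvfrom(), but also returns the kernel's receive time in *stamp. */
static ssize_t recv_with_timestamp(int fd, void *buf, size_t len,
                                   struct timeval *stamp)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    char ctrl[CMSG_SPACE(sizeof(struct timeval))];
    struct msghdr msg;
    struct cmsghdr *cm;
    ssize_t n;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);

    n = recvmsg(fd, &msg, 0);
    if (n < 0)
        return n;

    /* The arrival time rides along as SCM_TIMESTAMP ancillary data. */
    for (cm = CMSG_FIRSTHDR(&msg); cm != NULL; cm = CMSG_NXTHDR(&msg, cm))
        if (cm->cmsg_level == SOL_SOCKET && cm->cmsg_type == SCM_TIMESTAMP)
            memcpy(stamp, CMSG_DATA(cm), sizeof(*stamp));

    return n;
}

That struct timeval is what I am calling the "stack timestamp" below.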
Presently, I am using the stack timestamp on the collector system (which is syncing time via NTP) and handwaving away the network delays, treating them as a more or less constant skew error. However, I may not always have that luxury - I may have to consume sFlow PDUs which have passed through the guts of other applications - which has gotten me interested in the stability and accuracy of the sysUpTime field.

So, I took two switches (not necessarily those of my employer; I try to have a broad view), configured them to send me sFlow counter samples, and then over 24 to 72 hours captured the sFlow PDUs via tcpdump. I did not enable time synchronization on either switch - I wanted to see just how bad it might be. I took the tcpdump timestamp (this is on my NTP-synced collector, so presumably that is advancing "accurately" over the long term) and sysUpTime from the first PDU, then looked at those from the last PDU in the trace, and found that sysUpTime moved away from time on my system by one part in not quite 200000 for one switch and one part in 404 for the other.

"No worries," I thought, "all that means is I need to tell the switches to sync time." So I did - I told them to sync time with my collector system. Now, presumably, the clocks on the switches should not drift all that far from that of my collector over the long term, perhaps oscillate back and forth. Rather than run tcpdump again, I hacked my sFlow to RRD utility to keep sFlow agent state and report on the difference between how far time had advanced on my collector vs how far time had advanced on the switches, and after not quite 24 hours I am seeing:

agent switch1 subagent 1
  cum_pdu_time delta 70799652322 (usec)
  cum_uptime delta   70800000000 (usec)
  elapsed diff       -347678 (usec)
  seqno delta 1

agent switch2 subagent 1
  cum_pdu_time delta 70809506020 (usec)
  cum_uptime delta   70985000000 (usec)
  elapsed diff       -175493980 (usec)
  seqno delta 1

"cum_pdu_time" is time as seen on my collector; "cum_uptime" is from the sysUpTime field of the sFlow PDUs. My switch1 still seems to differ by one part in ~200K, and switch2 by one part in ~400. Both switches claim they are syncing their time with my collector system. If I accept that at face value, about the only thing I can think of is that the sFlow agents are not basing their "time" on the NTP-synced time on the switches, but on something else - perhaps assuming their timer is firing every N units of time and adding N to sysUpTime, while the timer is really firing every M units of time. Any thoughts among the chrono-gods as to what I might do to verify that hypothesis? (A rough sketch of the bookkeeping that produces the numbers above is in the P.S. at the end of this message.)

rick jones
http://www.sflow.org/
http://oss.oetiker.ch/rrdtool/

I can run an ntpq against switch1:

raj@tardy:~$ ntpq -p switch1
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*collector       secthrobsurty    2 -   36   64  377    3.664    0.433   0.000

but switch2 isn't running a "full" NTP - it may just be doing an SNTP/ntpdate kind of thing.

--
It is not a question of half full or empty - the glass has a leak.
The real question is "Can it be patched?"
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
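P.S. In case the report format is not self-explanatory, here is roughly the sort of per-agent bookkeeping the hacked-up utility does - a minimal sketch with made-up names (agent_state, note_pdu), not the actual code. For each (agent, subagent) pair it accumulates how far the collector clock (the SO_TIMESTAMP arrival time) and the PDU sysUpTime have each advanced, and prints the difference:

#include <stdint.h>
#include <stdio.h>

struct agent_state {
    int      seen;                 /* have we seen a first PDU yet?          */
    uint64_t last_pdu_time_usec;   /* collector stamp of the previous PDU    */
    uint64_t last_uptime_usec;     /* sysUpTime of the previous PDU, in usec */
    uint32_t last_seqno;           /* sequence number of the previous PDU    */
    uint64_t cum_pdu_time_usec;    /* sum of collector-side deltas           */
    uint64_t cum_uptime_usec;      /* sum of sysUpTime deltas                */
};

static void note_pdu(struct agent_state *st, const char *agent, int subagent,
                     uint64_t pdu_time_usec, uint32_t uptime_msec,
                     uint32_t seqno)
{
    uint64_t uptime_usec = (uint64_t)uptime_msec * 1000;

    if (st->seen) {
        st->cum_pdu_time_usec += pdu_time_usec - st->last_pdu_time_usec;
        st->cum_uptime_usec   += uptime_usec   - st->last_uptime_usec;

        printf("agent %s subagent %d\n"
               "  cum_pdu_time delta %llu (usec)\n"
               "  cum_uptime delta   %llu (usec)\n"
               "  elapsed diff       %lld (usec)\n"
               "  seqno delta %u\n",
               agent, subagent,
               (unsigned long long)st->cum_pdu_time_usec,
               (unsigned long long)st->cum_uptime_usec,
               (long long)st->cum_pdu_time_usec -
               (long long)st->cum_uptime_usec,
               seqno - st->last_seqno);
    }
    st->seen               = 1;
    st->last_pdu_time_usec = pdu_time_usec;
    st->last_uptime_usec   = uptime_usec;
    st->last_seqno         = seqno;
}

With the numbers quoted above, the ratios work out to 347678 / 70800000000, i.e. roughly one part in 204000 for switch1, and 175493980 / 70809506020, roughly one part in 403 for switch2.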
