On Mon, Feb 22, 2021 at 3:48 AM Marcelo Magallón <[email protected]> wrote:
> Thanks Ben.
>
> On Thu, Feb 18, 2021 at 1:37 PM Ben Kochie <[email protected]> wrote:
>
>> The problem with what you're proposing is you're getting an invalid
>> picture of data over time. This is the problem with the original smokeping
>> program that the smokeping prober is trying to solve.
>>
>> The original smokeping software does exactly what you're talking about.
>> It sends out a burst of 10 packets at the configured interval (in your
>> example, 1 minute). The problem is this does not give you a real picture,
>> because the packets are not evenly spaced.
>>
>> This is why I made the smokeping_prober work the way it does. It sends a
>> regular stream, but captures the data in a smarter way, as a histogram.
>>
>> From the histogram data you can only collect the metrics every minute,
>> and generate the same "min / max / avg / dev / loss" values that you're
>> looking for. But the actual values are much more statistically valid, as
>> it's measuring evenly over time.
>>
>
> That's fair. I do understand the argument for preferring continuous
> observations.
>
> The problem I have with the histogram approach (and this is partly due to
> the current way histograms work in Prometheus) is that I don't know the
> distribution a priori.
>
> I let smokeping_prober run for a few days against several IP addresses.
> For a particular one, after 250+ thousand observations, it's telling me
> that the round trip time is somewhere between 51.2 ms and 102.4 ms. Using
> the sum and the count from histogram data I can derive an average (not
> mean) over a short window and it's giving me ~ 60 ms. I happen to know
> (from the individual observations) that the 95th percentile is also ~ 60
> ms, and that's pretty much the 50th percentile (the spread of the
> observations is very small). The actual min/max/avg from observations is
> something like 59.1 / 59.7 / 59.4 ms. If I use the data from the histogram
> the 50th percentile comes out as ~ 77 ms and the 95th percentile as ~ 100
> ms. I must be missing something, because I don't see how I would extract
> the min / max / dev from the available data. I do understand that the
> standard deviation for this data is unusually small (compared to what you'd
> expect to see in the wild), but still...
>

The default histogram buckets in the smokeping_prober cover latency
durations from localhost to the moon and back. It's relatively easy to
adjust the buckets, and easy enough to get within a reasonable range for
your network expectations.

Without knowing exactly what queries you're running, it's hard to say what
you're doing. If you're using the histogram count/sum, this will give you
the mean value.

There is one known issue with the smokeping_prober right now that I need to
fix: the ping library's handling of sequence numbers is broken and doesn't
wrap correctly.

> I also have to think of the data size. For 1 ICMP packet every 1 second,
> I'm at (order of magnitude) 100 MB of data per target per month. Reducing
> this to 5 packets every 60 seconds I'm down to 10 MB (order of magnitude).
> This doesn't sound like much for a single target but it does add up.
>

Yes, this is going to be an issue no matter what you do. I don't see how
this relates to any mode of operation.

> As a side note, I noticed that smokeping_prober resolves the IP address
> once. With BBE this happens every time the probe runs, so I don't have to do
> anything if I'm monitoring a host where IP addresses might change every now
> and then.
>

Yes, this is currently intentional, but re-resolving is something I'm
planning to do eventually.

> Thanks again,
>
> Marcelo
>

--
You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
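
A note on the ~77 ms and ~100 ms estimates discussed in the message: they are what linear interpolation inside a single wide bucket would produce. The following is a minimal sketch, not the Prometheus implementation. It assumes quantiles are estimated by interpolating linearly within the bucket that holds the target rank, and it assumes the default buckets are exponential with factor 2 starting at 50 microseconds, which is consistent with the 51.2 ms and 102.4 ms bounds quoted above. The function names, the synthetic observations, and the narrower 0.5 ms buckets are all hypothetical.

```python
import math


def exponential_buckets(start, factor, count):
    """Return `count` upper bounds: start, start*factor, start*factor**2, ..."""
    return [start * factor ** i for i in range(count)]


def cumulative_counts(observations, upper_bounds):
    """Count observations <= each upper bound (cumulative, like `le` buckets)."""
    return [sum(1 for o in observations if o <= ub) for ub in upper_bounds]


def estimate_quantile(q, upper_bounds, counts):
    """Estimate quantile q by linear interpolation inside the target bucket.

    Simplified: there is no +Inf bucket, and the first bucket starts at 0.
    """
    total = counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in zip(upper_bounds, counts):
        # First non-empty bucket whose cumulative count reaches the rank.
        if count >= rank and count > prev_count:
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return upper_bounds[-1]


# Synthetic round-trip times in seconds, spread over 59.1 ms .. 59.7 ms.
observations = [0.0591 + 0.0001 * (i % 7) for i in range(1000)]

# Assumed default layout: 20 exponential buckets, 50 microsecond start, factor 2.
wide = exponential_buckets(0.00005, 2, 20)
wide_counts = cumulative_counts(observations, wide)
wide_p50 = estimate_quantile(0.50, wide, wide_counts)
wide_p95 = estimate_quantile(0.95, wide, wide_counts)

# The mean needs only the sum and the count, so bucket width does not affect it.
mean = sum(observations) / len(observations)

# Hypothetical narrower layout: 55 ms .. 65 ms in 0.5 ms steps.
narrow = [0.055 + 0.0005 * i for i in range(21)]
narrow_counts = cumulative_counts(observations, narrow)
narrow_p50 = estimate_quantile(0.50, narrow, narrow_counts)

print(f"wide buckets:   p50 = {wide_p50 * 1000:.2f} ms, p95 = {wide_p95 * 1000:.2f} ms")
print(f"sum / count:    mean = {mean * 1000:.2f} ms")
print(f"narrow buckets: p50 = {narrow_p50 * 1000:.2f} ms")
```

With every observation inside the 51.2 ms to 102.4 ms bucket, the interpolated median is 51.2 + 0.5 * 51.2 = 76.8 ms and the 95th percentile is 51.2 + 0.95 * 51.2 = 99.84 ms, no matter where the samples sit within the bucket, while sum / count still lands near 59.4 ms. Buckets sized for the expected range keep the estimate within one bucket width of the observed values.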

