Apologies for the delay in responding. There have been lots of fires to
fight recently, none of them major, but they add up.
Hal Murray <[email protected]>: > > I implemented a direct mode. It writes out each batch of slots as soon as it > gets them. Any sort options are ignored. There will be duplicates of any > slots that get updated after they are retrieved. I think the filtering stuff > should still work but I didn't try it. I can't find any commit that looks like it involvs a 'direct' flag. Did you push this, or is it a private set of changes? If the latter, I'd like to see the patch and play with it. > The code and UI need more work, but as a proof of concept it managed to > capture everything from a busy server. > > I think collecting data from a busy server will always be "interesting". I > know about 2 issues. > > The first is the race between collecting data and having slots get moved or > recycled while you are collecting. This is obviously easier if you can run > on the same system as the server so there are no network delays. Right, and it's a *fundamental* problem because the overhead of assembling the batches means pumping them out intrinsically happens slower than logging new requests. I don't think there's anything to be done about this other than document it as a known problem with monitoring heavily loaded servers. Pushing their traffic high enough will reliably push them into this lagging state. The only question is whether this happens in your normal traffic regime. Nothing we do on the client side can prevent this, though a slow client could make it worse if local computation or I/O stalls its network reads. I think stalling due to processor lag is unlikely to happen. Even a low-power ARM has lots of headroom with respect to network speeds these days. On the other hand, if ntpmon's screen I/O happened between spans rather than after the sort and reassembly, that could be pretty bad. > If we can't go fast enough, we should be able to get some of the data and/or > some estimates of how much we are missing. Some of the data, yes. 
As the Mode 6 protocol is designed, I don't see how to get good
estimates. On the other hand, I can imagine an inexpensive protocol
extension that would help a lot: adding a tag to the front of each span
that reports the MRU-list size at the time of transmission. If your
client sees this number rising rather than falling during a span
sequence, then you can at least be warned that you're probably in a
losing race.

> We can probably test that by
> running over a network. (That will also test the lost packet code.) We need
> to be sure to debug this case/mode so we will have useful tools when the next
> big burst of traffic hits the pool.

I'm in favor of *that*...

> The other issue is memory and CPU on the system collecting the data. I don't
> know which limit will kick in first. It takes a lot of CPU, but that's not a
> problem as long as you can keep up with the server. I think that translates
> into a threshold for how busy a server you can grab complete data from.

Yes, that matches my own analysis.

> I think memory will be a serious issue. I saw troubles before switching to
> direct mode but it should work on a system with more memory or less traffic.
> Direct mode doesn't use much memory so this probably won't be a problem.

I can't easily see it being a big problem in the normal mode either,
frankly. By definition the client's memory usage has to be linearly
related to the memory usage on the server, and even in Python I don't
think the constant of proportionality can be very large; I'd guess
around 2x-3x. If I turn out to be wrong about that, there is recourse:
changing the representation of spans from a list of objects to a list
of tuples, for example, would drop about 40 bytes per item.

> Any suggestions for a UI/CLI?

Not before seeing the patch, no.
--
<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>
_______________________________________________
devel mailing list
[email protected]
http://lists.ntpsec.org/mailman/listinfo/devel
