Robert G. Brown
Thu, 25 Sep 2008 12:04:00 -0700
On Wed, 24 Sep 2008, Donald Becker wrote:
>> xmlsysd is -- I think -- very nicely hierarchically organized. It achieves adequate efficiency for many uses a different way -- it is only called on (client side) demand, so the network isn't cluttered with unwanted or unneeded casts (uni, multi, broad). It is "throttleable" -- the client controls the daemon on each host, and can basically tell it [...]

> The very first implementation of what later became BeoStat did this. There was only a single client, a display GUI, and it told the nodes how and how often to report. That turned out to be a bad design. No other tools could rely upon the reporting stream contents, and it couldn't be used for liveness indication. Once we redesigned to a send-only system that reported once per second it became generally useful.
I opted to split off a library that facilitates the building of UIs, as well as a tool that basically unpacks certain displays (predefined clusters of reported stats) and dumps them tablewise to stdout. So anybody can get to the stream, parse it, and use it, even if they only know how to use split in perl and are clueless about the XML tools that exist for pretty much any possible programming environment. In C you can just clone the non-curses part of wulfstat and hack it to fit, and it is mostly just a set of library calls. I wouldn't say my split is very good yet -- I've learned the hard way with dieharder (and libdieharder, which does all the work) that getting the library "just right" to support multiple UIs is not easy and not likely to be ideal the first couple of tries. But on the next rewrite it should be PRETTY good.

The liveness issue is most definitely a problem with xmlsysd/wulfstat, because frankly TCP sucks for this specific purpose. I'd love to have what amounts to ping built into the UI, but it is a restricted socket command and I don't want to make the UI suid root.

It would be trivial to recode xmlsysd to work the other way -- in fact, not too difficult to make it work BOTH ways: allow an initial straight-up TCP connection, configure the daemon to your desired set of statistics (all existing features), and then add two new commands to tell it to close the connection and begin to unicast back to host X, with frequency Y. Or better yet, leave the TCP connection open as a control interface WHILE it casts back at Y, permitting the controlling host to send synchronization feedback and drive its gathering of messages towards "simultaneity" in some narrowly confined sub-interval of time period Y, or to reconfigure the message on the fly, or to request out-of-band a "snapshot" of information that is accessible to the daemon but not in the regular message.
I really like this idea -- the live TCP connection enables lots of things, all controlled from the client/master node side, while STILL obtaining the benefit of unicast and UDP. That would partly resolve the ping issue and might let me make the identification of downed hosts, and reconnection to them as they come back up, a bit more robust. Those are not issues in your design, but they are in mine, where I have to cope with TCP timeouts and so on to decide when something is down, which can lead to poor performance on the client UI side (the nodes don't care).
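A minimal sketch of the send side of the hybrid idea above: the daemon unicasts a small XML report over UDP to host X once per interval Y. This is not xmlsysd code. The function names, the message layout, and the `seq` attribute (a counter the receiver could use to notice missed rounds) are all assumptions made for illustration, and the TCP control channel is left out.

```python
import socket
import time
import xml.etree.ElementTree as ET


def build_report(host, seq, stats):
    """Pack one round of stats into a small XML message (hypothetical layout)."""
    root = ET.Element("report", host=host, seq=str(seq))
    for name, value in stats.items():
        ET.SubElement(root, "stat", name=name).text = str(value)
    return ET.tostring(root)  # bytes, ready to put in a datagram


def cast_reports(sock, target, host, read_stats, interval, rounds):
    """Unicast `rounds` reports to `target` (host X), one per `interval` seconds (Y)."""
    for seq in range(rounds):
        sock.sendto(build_report(host, seq, read_stats()), target)
        if seq + 1 < rounds:
            time.sleep(interval)


def demo():
    """Send three reports over loopback and return the sequence numbers received."""
    # Loopback stands in for the monitoring host; port 0 lets the OS pick a port.
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))
    receiver.settimeout(2.0)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        cast_reports(sender, receiver.getsockname(), "node01",
                     lambda: {"load1": 0.42}, interval=0.01, rounds=3)
        seqs = []
        for _ in range(3):
            data, _addr = receiver.recvfrom(65535)
            seqs.append(int(ET.fromstring(data).get("seq")))
        return seqs
    finally:
        sender.close()
        receiver.close()
```

In a real daemon `read_stats` would parse the enabled parts of /proc, and the target address and interval would arrive over the control connection rather than as function arguments.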
> For a concrete example: the GUI might not care about CPU utilization percentages and not gather the numbers. But when we run an MPI process mapper (a mapper is a one-off, at-this-instant layout scheduler) we don't want to wait while those numbers are requested and reported. We want to support both continuously-running (GUI display tools) and start-exit (extract data, analyze, report) usage.
Sure. But with a "permanent" direct connection, you just tell the daemon when you want only one small constellation of outputs (small but predefined -- this isn't about infinite user choice or a lack of design discipline), and it stops polling the parts of /proc you don't care about. If you suddenly need something -- a snapshot of running non-root processes, a complete picture of meminfo, the clock and cache size and architecture of the CPU (from different parts of /proc; in the latter case something you will need only VERY infrequently and on user demand) -- you just say e.g. "on cpuinfo" and either "send" to get an immediate reply via TCP or wait until the next unicast to get the new stuff added to the regular stream.

Yes, talking to the socket is "bad" as it interferes with synchronicity, but then you don't do it all the time, and it is much cheaper than having to actively (re)connect to the node to make a base configuration change and restart everything. In the meantime you quietly accumulate cycle savings by NOT parsing all the process IDs unless you really need to, and by NOT parsing /proc/cpuinfo or even /proc/stat unless the user wants to look at it. (Well, truthfully xmlsysd reads cpuinfo just once at the beginning and then just RETURNS it if requested anyway, so mostly you save a bit of bw and packet size, but people only look at this sort of information for a few seconds anyway; no need to really poll it.)
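The throttling described above can be sketched as a tiny command handler that reads only the stat groups the client has switched on. This is a toy model, not xmlsysd's actual parser: the class name, the reply strings, and the "off" command are assumptions, and the readers are stand-in callables rather than real /proc parsers.

```python
class ThrottledDaemon:
    """Toy command handler: polls only the stat groups the client enabled."""

    def __init__(self, readers, always_on=()):
        self.readers = dict(readers)    # group name -> zero-argument reader
        self.enabled = set(always_on)   # groups included in every snapshot

    def handle(self, line):
        """Process one control-channel line and return the reply text."""
        words = line.split()
        if len(words) == 2 and words[0] in ("on", "off"):
            group = words[1]
            if group not in self.readers:
                # A fixed palette of groups, not unlimited user choice.
                return "error unknown group " + group
            if words[0] == "on":
                self.enabled.add(group)
            else:
                self.enabled.discard(group)
            return "ok"
        if words == ["send"]:
            return self.snapshot()
        return "error unknown command"

    def snapshot(self):
        """Call the readers of enabled groups only; disabled groups cost nothing."""
        parts = []
        for group in sorted(self.enabled):
            parts.append("%s: %s" % (group, self.readers[group]()))
        return "\n".join(parts)
```

For example, a daemon built with `always_on=["meminfo"]` never calls its cpuinfo reader until a client sends `on cpuinfo`, which is where the cycle savings come from.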
>> Well, both of them have to be sent by the network. One can choose UDP or TCP for either one, and each has advantages and disadvantages (with -- gulp -- the same tradeoffs between reliability and robustness and speed).

> There are trade-offs between UDP and TCP. But UDP is the right model for stats. I pretty much don't care about old numbers. I'll go further -- I want to forget old numbers. If there is a problem, communication or processing, that blocks updates for a minute, I want the numbers to look stale. And I would rather get the new stats right away than process and post a series of old messages.
I'm not sure that this latter is an intrinsic difference/feature between UDP and TCP; wulfstat doesn't display stale stats either. That's really a UI choice in what it does when EITHER kind of message fails to get through. With UDP you either catch the message or you don't, and with large numbers of hosts replying in a DELIBERATELY small window, I'm guessing you drop a lot of the messages and hosts blink in and out (unless you cache them long enough to mask at least a round or two of missing info). With TCP, dealing with per-host random delays without blocking, and detecting host crashes, is most definitely a pain, but not impossible. I keep telling myself, anyway...:-)

And as I said above, it seems as though one could have the best of both -- it isn't really necessary to choose "only" one; both could even be accessible simultaneously within the same running daemon.

You've inspired me, in the best of open source traditions, and soon I will have to Write More Code. This will let me "fix" a number of things that have annoyed me about xmlsysd (generally functional as it is). Just as soon as I have time, since I have only six ongoing projects plus two classes and two more independent study students, and dieharder is taking most of my elective time. Humans have their own scheduling woes and I've been thrashing for a decade in spite of modest upgrades in capacity...:-)

There you've got an advantage in addition to your natural good looks, I guess, with people who will actually pay you to make changes and improvements to your product. I just do it out of a mix of love and for my own use. I'd rather have the money -- or perhaps would rather ALSO have the money...;-)
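To make the caching remark above concrete (masking a round or two of missed UDP reports so hosts don't blink in and out), here is a minimal client-side sketch. It keeps only the newest report per host and grades each host by the age of that report. The class name, the three status labels, and the 1.5-interval slack are assumptions chosen for illustration; neither wulfstat nor BeoStat is claimed to work this way.

```python
class HostTable:
    """Latest report per host, with a grace period before a host is shown down."""

    def __init__(self, interval, grace_rounds=2):
        self.interval = interval          # expected reporting period, in seconds
        self.grace_rounds = grace_rounds  # missed rounds tolerated before "down"
        self.last = {}                    # host -> (arrival time, stats)

    def update(self, host, stats, now):
        """Record a report; older numbers for that host are simply forgotten."""
        self.last[host] = (now, stats)

    def status(self, host, now):
        """Return 'up', 'stale', or 'down' from the age of the last report."""
        if host not in self.last:
            return "down"
        age = now - self.last[host][0]
        if age <= 1.5 * self.interval:
            return "up"       # at most one report is merely late
        if age <= (self.grace_rounds + 1.5) * self.interval:
            return "stale"    # rounds were missed; show the old numbers as old
        return "down"
```

With a one-second interval and the default two grace rounds, a host whose last report is three seconds old is displayed as stale and one whose last report is four seconds old is displayed as down. Passing `now` in explicitly keeps the logic independent of the transport, so the same table could sit behind a UDP listener or a TCP poller.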
>> Actually, from here on down -- with the exceptions of choosing to use xml to encapsulate the return, TCP instead of UDP, and allowing the client side to control and throttle the daemon so that one can tune the impact of the monitoring to the demands of the cluster and task -- the two things sound very similar -- as they should be, given that they're both N>3 generation tools.

> The bottom line for all of this isn't "mine is better than yours". I would like to see a common cluster state/status/statistics reporting system. It doesn't have to look exactly like BeoStat, but I expect a good one wouldn't be too far from the current BeoStat design.
The interesting thing is that our independently arrived at designs are remarkably SIMILAR -- much more like one another than either one is like ganglia, for example. I'm guessing that our proc parsing code is quite similar on the back end; we both seem to report similar constellations from proc without reporting EVERYTHING from proc or necessarily letting a user muck around with what the tool can deliver. Outside of the functional core, you chose one way to deliver messages and configure (or not) the tool -- efficient but hard to change, debug, or read as a human -- where I chose the other, relatively inefficient but much easier to debug or read as a human, and controllable from a small palette of choices. Yours is tightly integrated; mine isn't really "integrated" at all.

They are also "intended" to be used in different kinds of environments. xmlsysd is a standalone object -- drop it onto any linux system and it should just work, providing a connection-oriented, relatively lightweight, remote-client-controllable window into the local /proc and systems information space. beostat sounds (correct me if I'm wrong) much more like a fully integrated component of an all-or-nothing package. You wouldn't, maybe couldn't, install it on a plain old workstation and use it as part of a straight sysadmin package to keep an eye on a LAN as easily as a cluster. Sometimes I think wulfstat is MORE useful to LAN admins than it is to a "real cluster" administrator with stringent scaling requirements -- it certainly is designed for use on small to midsize LAN-ish clusters with stock kernels more than for 2048 node superclusters.

What we really ought to do is exchange our data views (dictionary and encapsulation), kick them around, and arrive at a not-too-horrible consensus where at least one of our data views is an actual subset of the other.
I "think" it would be pretty easy to add a command to xmlsysd such as "on beostat" that caused it to simply do what it does now but pack the result into beostat-compatible UDP packets. If I DID leave in the "out of band" TCP control channel -- something that the overall scyld package probably accomplishes an entirely different way -- one could perhaps get the best of both worlds: something with the operational leanness and scalability advantages of beostat but ALSO with the ease of use and debuggability of xmlsysd. EVEN on a Scyld type cluster, there might be times when it is useful to be able to just telnet into a node's xmlsysd port and ask it in human readable form "just what do you think you are doing?", and on a LAN client that might well be the dominant mode until 11 pm, when you reboot (or not reboot, merely "repurpose") the LAN client into being part of a beostat-monitored-and-MPI-fronted cluster overnight.

It would probably be simpler -- and more philosophically acceptable, since xmlsysd is already a "gorpier" and more general purpose tool -- to teach xmlsysd beostatish than to teach beostat to speak xmlish, but of course you are welcome to copy, grab, etc. xmlsysd's GPL code and make it your own, or to otherwise steal the idea of a throttleable/remote controllable command interface if you don't already have one.

The point is that if we COULD agree on data content and encapsulation -- or even offer a limited menu of choices of same, as xmlsysd already endeavors to do -- then it would be very simple to make UI and application tools that were interoperable and portable from LAN to cluster, supported by a co-provided library and API. Maybe even make it easy to build a semi-portable load balancer, scheduler, job distribution system etc., or to just build access to this block of information right into applications.

Just a thought.

   rgb

--
Robert G. Brown    Phone(cell): 1-919-280-8443
Duke University Physics Dept, Box 90305
Durham, N.C.
27708-0305
Web: http://www.phy.duke.edu/~rgb
Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org