Re: [Ganglia-developers] How do we deal with very large clusters in the webui
Hi,

On Thu, Mar 3, 2011 at 11:11 PM, Jim Greene jim.gre...@gmail.com wrote:
> -Don't show any individual hosts, only the aggregate and the load/network/etc levels for the whole cluster

We did this on the main page for grids by adding one line of PHP that excluded the bulk of our computing grid. We also added a regexp parameter that you could pass in GET, and everybody used predefined views without ever hitting the main grid page. So for example you'd have

http://ganglia.organization.tld/ ?g...x...regexp=mysql.*

which would display only the mysql servers. Of course this means you rely on a naming rule that might not hold in your environment.

> What are your thoughts on how we can accomplish this?

Probably best to look into the new frontend, which is being built exactly to address this sort of limitation.

https://github.com/vvuksan/ganglia-misc/tree/master/ganglia-web

--
Behind every great man there's a great backpack - B.

--
What You Don't Know About Data Connectivity CAN Hurt You
This paper provides an overview of data connectivity, details its effect on application quality, and explores various alternative solutions. http://p.sf.net/sfu/progress-d2d
___
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers
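The regexp filter described above boils down to matching each host name against a user-supplied pattern and keeping only the hits. A minimal sketch of that idea, using POSIX regex.h rather than the PHP change mentioned in the message; the function name `filter_hosts` and the host names used with it are made up for illustration:

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Copy pointers to the host names matching `pattern` into `out`.
 * Returns the number of matches, or -1 if the pattern does not compile.
 * `out` must have room for at least `n` entries. */
int filter_hosts(const char *pattern, const char **hosts, size_t n,
                 const char **out)
{
    regex_t re;
    int matched = 0;

    assert(pattern != NULL && hosts != NULL && out != NULL);

    /* REG_NOSUB: we only need a yes/no answer, not match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    for (size_t i = 0; i < n; i++) {
        if (regexec(&re, hosts[i], 0, NULL, 0) == 0)
            out[matched++] = hosts[i];
    }

    regfree(&re);
    return matched;
}
```

With hosts named mysql01, web01, mysql02 and cache01, the pattern `mysql.*` keeps the two mysql entries. As the message says, this only works as long as the naming rule actually holds.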
Re: [Ganglia-developers] gmetad and rrdtool scalability
On Sun, Dec 20, 2009 at 7:35 PM, Vladimir Vuksan vl...@vuksan.com wrote:
> If you lose a day or two or even a week of trending data that is not gonna be disaster as long as that data is present somewhere else.

Sure, but where? How would the ganglia frontend tell?

> Thus I proposed a simple solution where even if one of the gmetads (gmetad1) fails you can either
> a. Get all the rrds (rsync) from gmetad2 before you restart gmetad1

which, unless you have a small amount of data or a fast network between the two nodes, won't complete before the next write is initiated, meaning they won't be identical.

> b. Simply start up gmetad1 and don't worry about the lost data

Sure.

> As far as which data is going to be displayed you can do either
> 1. Proxy traffic to Ganglia with most up to date data

How do you tell which one has the most up to date data?

> 2. Change DNS record to point to Ganglia with most up to date data

Same question: which one has the most up to date data? If you really mean most recent then both would, because both would have fetched the last reading assuming they are both functional, but gmetad1 would have a hole in its graphs. To me that does not really count as up to date. Up to date would be the one with the most complete data set, which you have no way to identify programmatically. Also, assume now gmetad2 fails and both have holes: which one is the most up to date?

> To your last point there are chances that both gmetads fail in quick succession however I would think that would be a highly unlikely event.

It doesn't have to be in quick succession to find yourself in a condition where you have holes in your data and no way to go back, which is my main point: as much as you can say that no-data-loss requirements aren't really a major concern for most people, the fact remains that with the current codebase you can't avoid that situation, which imho isn't right.

> If you had requirements for such flawless performance you should be able to invest resources to resolve it.

I'm sorry, but I don't see it. Even with plenty of resources you'd have to either put some heavy restrictions in place, like centralized data on a SAN, which is not really something you'd want in a distributed setup, or add plenty of hacks to, say, replay the content of the rrds to some other place, but even in this case it's pretty quirky.

> Makes sense ?

I guess it does if I look at it from your perspective, which if I understood it correctly implies that:

* some data loss doesn't matter
* manual interaction to fix things is ok

But that isn't my perspective. Scalable (distributed) applications should be able to guarantee by design no data loss in as many cases as possible, and not force you into centralized designs or hackery in order to do so. There are ways to make this possible without changes to the current gmetad code, by adding a helper webservice that proxies the access to rrd. This way it's perfectly fine to have different locations with different data, and the webservice will take care of interrogating one or more gmetads/backends to retrieve the full set and present it to the user. Fully distributed, no data loss. This could of course be built into gmetad by making something like port 8652 access the rrds, but to me that's the wrong path: it makes gmetad's code more complicated, and it's potentially a functionality that has nothing to do with ganglia and is backend dependent.

Thoughts?
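The merge step such a helper webservice would perform can be sketched as follows. This is only an illustration under stated assumptions: both series are already fetched, cover the same time range at the same step, and mark unknown slots as NaN (which is how rrdtool fetch reports them). The name `merge_series` is hypothetical, and nothing here talks to gmetad or rrdtool.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Merge two time-aligned series of length n into `out`.
 * A slot is taken from `a` if known, else from `b`; if neither
 * backend has it, the slot stays NaN.
 * Returns the number of slots still unknown after the merge. */
size_t merge_series(const double *a, const double *b, double *out, size_t n)
{
    size_t gaps = 0;

    assert(a != NULL && b != NULL && out != NULL);

    for (size_t i = 0; i < n; i++) {
        if (!isnan(a[i])) {
            out[i] = a[i];
        } else if (!isnan(b[i])) {
            out[i] = b[i];
        } else {
            out[i] = NAN;   /* hole on both sides: nothing to recover */
            gaps++;
        }
    }
    return gaps;
}
```

A client only sees a gap where every backend has one, which is the property argued for above. Deciding which value wins when both backends have a reading that differs is left out of this sketch.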
Re: [Ganglia-developers] gmetad and rrdtool scalability
On Mon, Dec 14, 2009 at 2:00 AM, Vladimir Vuksan vli...@veus.hr wrote:
> I think you guys are complicating much :-). Can't you simply have multiple gmetads in different sites poll a single gmond. That way if one gmetad fails data is still available and updated on the other gmetads. That is what we used to do.

Would you mind explaining to me why having multiple gmetads in different colos pulling from the same gmond is simpler than the infrastructure I presented in my post? Furthermore, could you please show me how your simpler solution addresses the problem of bringing back up the gmetad that failed such that both gmetads would have the same data? And if that's not what you had in mind, what's your strategy? Which data is going to be displayed to the user? And what if the first gmetad, the one that didn't fail, now fails while the restored one continues working?

Thanks for your clarifications.
Re: [Ganglia-developers] gmetad and rrdtool scalability
On Mon, Dec 14, 2009 at 10:28 AM, Carlo Marcelo Arenas Belon care...@sajinet.com.pe wrote:
>> a) you are only concerned with redundancy and not looking for scalability - when I say scalability, I refer to the idea of maybe 3 or more gmetads running in parallel collecting data from huge numbers of agents
>
> what is the bottleneck here?, CPUs for polling or IO?, if IO using memory would be most likely all you really need (especially considering RAM is really cheap and RRDs are very small), if CPUs then there might be some things we can do to help with that, but vertical scalability is what gmetad has, and for that usually means going to a bigger box if you hit the limit on the current one.

In my experience CPU isn't really a problem; the big load is I/O, and indeed moving the rrds to a ramdisk is the most common solution, with pretty decent results.

>> b) you can afford to have duplicate storage - if your storage requirements are huge (retaining a lot of historic data or lots of data at short polling intervals), you may not want to duplicate everything
>
> if you are planning to store a lot of historic data then you should be using instead some sort of database, not RRDs and so I think this shouldn't be an issue unless you explode the RRAs and try to abuse the RRDs as a RDBMS

I think there's a middle ground here that'd be interesting to explore, although that's a different thread, but for kicks this is the gist: the common pattern for rrd storage is hour/day/month/year and I've always found it bogus. In many cases I've needed higher resolution (down to the second) for the last 5-20 minutes, then intervals of an hour to a couple of hours, then a day to three days, then a week to 3 weeks, and so on. That increases your storage requirements, but is imho not an abuse of rrd and still retains the many advantages of rrd over having to maintain an RDBMS.

> Carlo
> PS. I like the ideas on this thread, don't get me wrong, just that I agree with Vladimir that gmetad and RRDtool are probably not the sweet spot (cost wise) for scalability work even if I also agree that the vertical scalability of gmetad is suboptimal to say the least.

Sort of. If you're looking at where your resources go to compute and deal with large amounts of data, I agree. If you look at what it costs you, or whether it's even possible, to create a fully scalable and resilient ganglia based monitoring infrastructure, I disagree.
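To get a feel for how much the finer-grained retention scheme described above costs, here is a rough sizing sketch. It assumes each stored value is an 8-byte double, ignores the RRD header and per-archive bookkeeping, and uses tier boundaries picked purely for illustration; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One retention tier: keep one value every step_secs for span_secs. */
struct rra_tier {
    long step_secs;
    long span_secs;
};

/* Number of rows a tier needs. */
long rra_rows(struct rra_tier t)
{
    assert(t.step_secs > 0);
    return t.span_secs / t.step_secs;
}

/* Approximate bytes of value storage for one data source across all
 * tiers, at 8 bytes per stored value (header overhead not counted). */
long rra_bytes(const struct rra_tier *tiers, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += rra_rows(tiers[i]) * 8;
    return total;
}
```

For example, 1 s resolution for 20 minutes, 60 s for 2 hours, 300 s for 3 days and 3600 s for 3 weeks comes to 1200 + 120 + 864 + 504 = 2688 rows, or about 21 kB of values per data source. That is more than a coarser scheme would need, but still small.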
Re: [Ganglia-developers] gmetad and rrdtool scalability
On Wed, Nov 25, 2009 at 4:20 PM, Daniel Pocock dan...@pocock.com.au wrote:
> One problem I've been wondering about recently is the scalability of gmetad/rrdtool.
> [cut]
> In a particularly large organisation, moving around the RRD files as clusters grow could become quite a chore. Is anyone putting their RRD files on shared storage and/or making other arrangements to load balance between multiple gmetad servers, either for efficiency or fault tolerance?

We do. We run 8 gmetad servers, 2 in each colo x 3 colos + 2 centrals, and rrds are stored on a ram disk on each node. Nodes are set up with unicast and data is sent to both heads in the same colo for fault tolerance/redundancy. This is all good until you have a gmetad failure or need to perform maintenance on one of the nodes, because at that point data stops flowing in, and once you're done you have to rsync back from the other head. It doesn't matter how you do it (live rsync, or stopping the other head during the sync process): you will lose data.

That said, it could easily be argued that you have no guarantee that both heads have the same data to start with, because messages are udp and either node may have lost some data the other hasn't. Of course there is a noticeable difference between a random message loss and a, say, 15 window blackout during maintenance, but then if your partitions are small enough a live rsync could possibly incur a small enough loss... it really depends.

As to shared storage, we haven't tried, but my personal experience is that, given how a local filesystem can't manage that many small writes and seeks, using any kind of remote FS isn't going to work.

I see two possible solutions:

1. client caching
2. built-in sync feature

In 1. gmond would cache data locally if it could not contact the remote end. This imho is the best solution because it helps not only with head failures and maintenance, but possibly addresses a whole bunch of other failure modes too.

2. would instead make gmetad aware of when it last got data, and able to ask another gmetad for its missing data and keep fetching until the delta (data loss) is small enough (user configured) that it can again receive data from clients. This is probably harder to implement and still would not guarantee no data loss, but I don't think that's a goal. The interesting property of this approach is that it'd open the door for a realtime merge of data from multiple gmetads, so that as long as at least one node has received a message a client wouldn't ever see a gap, effectively providing no data loss. I'm toying with this solution in a personal non-ganglia related project, as it's applicable to anything with data stored in rrd over multiple locations.

Thanks
Re: [Ganglia-developers] Feeble attempt at gmond aliasing
On Fri, Oct 2, 2009 at 9:59 PM, Jesse Becker haw...@gmail.com wrote:
> On Fri, Oct 2, 2009 at 10:35, Brad Nicholes bnicho...@novell.com wrote:
>> How well does this fit into the previous discussions of using a GUID to identify a box rather than an IP or FQDN? Are aliasing and GUID identifiers related or are they two separate issues?
>
> I think that is a separate, but related, discussion. Perhaps I'm wrong, but there doesn't seem to be a clear consensus about using GUIDs vs. FQDN vs. IPs vs. something else (again, someone correct me if I'm wrong). Maybe we should open that discussion again?

Why a separate discussion? You're adding a config option which you're free to set to whatever you like, and that to me covers all cases: you could set it to the hostname, an ip or a GUID. Personally I find that in large infrastructures naming machines meaningfully is a lost game; the host itself is more or less irrelevant and what matters is the service associated with it, so I'd assign a GUID myself and maintain the association with the service somewhere else, maybe as a metric itself. On the other hand, for the small shop host names are a pretty decent approach to map your infrastructure, so they would probably want to use that as an identifier. Either way, having it as an option is a safe way of handling it and avoids surprises at the gmetad end (I don't like the fact that the receiver resolves the ip of the sender to decide its name).
Re: [Ganglia-developers] Feeble attempt at gmond aliasing
On Fri, Oct 9, 2009 at 9:48 PM, Jesse Becker haw...@gmail.com wrote:
> The GUID discussion I referred to was if gmond/gmetad should be rewritten, top-to-bottom, to use GUIDs instead of relying on DNS/IP addresses. My understanding is that everything would have to use them, including the .rrd files underneath. That is, IMO, a big overhaul. Adding aliasing is theoretically a smaller change, that I think works within the existing code. This is what I'm proposing to add--something simple, and inexpensive to implement, but hopefully useful to many people. Thus, I see it as separate, but perhaps complementary/related.

I see, makes sense. Well, I think that until rrd comes up with a way to store arbitrary text/info inside an rrd file [1] we're better off naming the rrd files in a user-defined/expected way, otherwise manual interaction with the rrd files becomes impossible. Anyway, that's indeed another discussion, and personally I'm all for this alias patch.

As to Rick's comments, I believe they are only valid if we assume that the string representing a host should be its ip or the fqdn resolving to it, which I think is one of the many problems this alias patch is meant to solve (instances on EC2 or with multiple interfaces are a pita if things rely on ips/PTR for identification).

What do we need next? People compiling gmond with this patch and testing?

[1] I've seen that discussion come up several times on the rrd ML and never go anywhere, because of some big change that apparently would be necessary to implement that feature correctly.
Re: [Ganglia-developers] Another interface for Ganglia stats
On Tue, Sep 22, 2009 at 9:05 AM, Vladimir Vuksan vli...@veus.hr wrote:
> I guess a lot of the conversation depends on what you want and expect Ganglia to be used for. For example there are a lot of people out there that are using Ganglia for performance monitoring and using Nagios NRPE to get user level stats from the host. To me that is redundant.

Indeed, this is one of the many flaws with the monitoring/alerting setups we have today. It's almost like the people collecting metrics and those making checks didn't like each other and never talked, but have to meet in secret in the sysadmin's bedroom...

> Thus if you decide you are gonna use Ganglia for providing metric to e.g. Nagios you will have to go the route of parsing the Gmond XML. I checked on my cluster and each host uses about 15 kBytes (average) of XML to define metrics. This works well in small to mid size clusters however as soon as you get over certain threshold it breaks down. Let's say 200 hosts * 15 kB = 3 MB if I wanted to keep track of one metric that would be about 600 MBytes of traffic per minute or 10 Mbytes/sec just to fetch the whole XML tree. More metrics that need to be checked ie. swap_free and you may be doing quite a bit of network traffic. This is just to serve the XML and it doesn't take into account overhead processing and parsing data. You'll say wait a minute :-) if I was doing such a thing I would cache the data etc. I hear some people are doing just that ie.

/me raises hand

> storing XML on local storage. I have couple ideas myself but the point is that such a set up requires yet another thing to setup, monitor and maintain.

Indeed, not to mention that your data has to be cached for longer than it would if there was less of it to exchange each time (on large setups you need caching no matter what).

> Also perhaps REST API is not really the way to go but a simple HTTP interface would suffice. I hope this makes sense :-).

It did, except that last bit... How is a simple HTTP interface the way to go but a REST API perhaps not? Given the pretty simple and easy to represent data model, I don't see how structuring your HTTP calls so that they are RESTful is not the way to go. If you had said that an http interface is too much and a simpler TCP one would suffice, I'd have disagreed but understood, while I'm instead lost on the simple HTTP vs REST API distinction.

Cheers
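The traffic estimate quoted above can be reproduced with a few lines of arithmetic. The sketch assumes, and this is an interpretation rather than something stated in the thread, that each of the 200 hosts has one check per minute and that every check fetches the whole XML tree. Units are decimal kB, and the function names are hypothetical.

```c
#include <assert.h>

/* Size of the full XML tree in kB: every host contributes kb_per_host. */
long tree_size_kb(long hosts, long kb_per_host)
{
    assert(hosts >= 0 && kb_per_host >= 0);
    return hosts * kb_per_host;
}

/* kB transferred per minute when checks_per_min checks each fetch
 * the whole tree. */
long traffic_kb_per_min(long hosts, long kb_per_host, long checks_per_min)
{
    return tree_size_kb(hosts, kb_per_host) * checks_per_min;
}
```

With 200 hosts at 15 kB each the tree is 3000 kB (3 MB); 200 checks a minute move 600000 kB (600 MB) per minute, which is 10000 kB (10 MB) per second. Checking a second metric the same way doubles that, which is why caching or a narrower query interface comes up in the discussion.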
Re: [Ganglia-developers] Fwd: [Ganglia-general] Another interface for Ganglia stats
On Fri, Sep 18, 2009 at 8:32 AM, Bernard Li bern...@vanhpc.org wrote:
> Forwarding this to ganglia-developers since this is a more -devel related discussion. Also can get spike's opinions in ;-)

Remember that you asked for it :P

> On Wed, Sep 16, 2009 at 11:49 AM, Vladimir Vuksan vli...@veus.hr wrote:
>> There have been some tweets that someone was working on a REST interface for Ganglia.

I would have loved to see something more than a tweet about that (which I haven't seen either, I was just told about it). Do you have any more info? What kind of REST interface? It can mean a lot of things and nothing.

>> At first I thought it wasn't such a big deal

Care to share why that is? Personally I'd find it a great addition and a basic requirement to make extensibility and interoperability with other applications possible (of course it can be argued that given the user base and scope there is no interest in doing so).

>> but I think that adding a simplistic interface to Ganglia would be a nice addition ie. something like
>>
>> telnet ganglia 8653
>> METRIC web1 load_one
>>
>> Which would echo out the current value for load_one. That way you can avoid parsing out the XML to get those values. I think for large sites it makes a lot of sense. Granted there are workarounds that could be implemented and people have.

As one of those people, I wonder what a new interface like that changes. As you say, the only difference would be making client-side xml parsing unnecessary, which imho is not the problem here. What I'd like to see is a way to access *all* the data gmetad knows about, which means both what's in memory and what's inside the rrds, and being able to do so for multiple nodes at the same time (I sent a patch for multiple node requests a while ago that maybe I should try to push for again). The same interface, with obviously only in-memory values available, should exist for gmond. Also, I wouldn't make up another port for it, but rather use 8652 and extend the already supported control parameters. So for example you'd use the interface like this:

telnet ganglia 8652
/grid/cluster/host1/metric1/time[interval];/grid/cluster/host2/metric1;...?format=text
lastupdated time host1 metric1 value[s]
lastupdated time host2 metric1 value

If you don't specify a time it's assumed you want the most recent reading and it's fetched from memory, otherwise you get it from the rrd. The ?format=text parameter selects plain text instead of the classic xml output (the default if format isn't specified), and it could be extended to support json. Something like that would start to make a lot more sense to me, but it's still not a REST api to which you can speak http and use known methods to do useful things like caching results.

Let's keep this discussion going.

Spike
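To make the proposed request syntax concrete, here is a sketch of how such a line could be split into individual queries. It follows the format written in the message (semicolon-separated `/grid/cluster/host/metric[/time]` paths with an optional `?format=` suffix); this is only a proposal in the thread, not something gmetad is known to implement, and the struct and function names are hypothetical. The time component is kept as an opaque string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_FIELD 64

struct metric_query {
    char grid[MAX_FIELD];
    char cluster[MAX_FIELD];
    char host[MAX_FIELD];
    char metric[MAX_FIELD];
    char time[MAX_FIELD];   /* empty: most recent reading, from memory */
};

/* Parse one "/grid/cluster/host/metric[/time]" path.
 * Returns 0 on success, -1 if fewer than four components are present. */
int parse_metric_path(const char *path, struct metric_query *q)
{
    int n;

    memset(q, 0, sizeof *q);
    n = sscanf(path, "/%63[^/]/%63[^/]/%63[^/]/%63[^/]/%63[^/]",
               q->grid, q->cluster, q->host, q->metric, q->time);
    return (n == 4 || n == 5) ? 0 : -1;
}

/* Split "path;path;...?format=xxx" into at most `max` queries.
 * `format` receives the requested format, defaulting to "xml".
 * Returns the number of queries parsed, or -1 on malformed input. */
int parse_request(const char *request, struct metric_query *out, int max,
                  char *format, size_t format_len)
{
    char buf[1024];
    char *qs;
    int count = 0;

    if (strlen(request) >= sizeof buf)
        return -1;
    strcpy(buf, request);

    snprintf(format, format_len, "xml");
    qs = strchr(buf, '?');
    if (qs != NULL) {
        *qs++ = '\0';                      /* cut the query string off */
        if (strncmp(qs, "format=", 7) == 0)
            snprintf(format, format_len, "%s", qs + 7);
    }

    for (char *p = strtok(buf, ";"); p != NULL; p = strtok(NULL, ";")) {
        if (count == max || parse_metric_path(p, &out[count]) != 0)
            return -1;
        count++;
    }
    return count;
}
```

A query with an empty `time` field would be answered from memory, one with a time from the rrd, matching the behaviour described in the message.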
[Ganglia-developers] RRD_update illegal attempt to update using time 1252671437 when last update time is 1252671437 (minimum one second step)
Hi,

our gmetad boxes (2 of them) with 12 data sources, 6 of which are gmetads and 6 gmonds, are spamming syslog like mad with the following message:

Sep 6 06:33:32 localhost.localdomain /usr/sbin/gmetad[2526]: RRD_update (/var/lib/ganglia/rrds/...metric.rrd): illegal attempt to update using time 1252244010 when last update time is 1252244010 (minimum one second step)

This happens for both metrics and summary graphs. Looking at the hosts, everything appears to be fine to me, and ntp is running everywhere and in sync. Looking at the code instead, both gmetad/gmetad.c and gmetad/data_thread.c have a possibly suspicious call to sleep:

in gmetad.c:417

sleep_time = 10 + ((30-10)*1.0) * rand()/(RAND_MAX + 1.0);
sleep(sleep_time);

in data_thread.c:193

sleep_time = (d->step - 5) + (10 * (rand()/(float)RAND_MAX)) - (end.tv_sec - start.tv_sec);
if( sleep_time > 0 ) sleep(sleep_time);

Two observations:

- based on man 3 sleep, if any signal is sent to gmetad, the sleep interval can be 0
- end.tv_sec - start.tv_sec could compute to a considerably high number that, along with a short step, could result in a sleep_time <= 0.

Thoughts?

Thanks
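A sketch of how the two observations above could be guarded against: clamp the computed interval to a minimum, and resume sleeping when a signal cuts sleep() short. This is not a patch against gmetad; the function names and the one-second minimum are assumptions made for illustration, and the thread does not establish that these sleeps are the actual cause of the RRD_update message.

```c
#include <assert.h>
#include <unistd.h>

#define MIN_SLEEP_SECS 1

/* Same formula as in data_thread.c, with the random part passed in as
 * `jitter` (0.0 .. 1.0) so the result is deterministic, and with the
 * result clamped so it never drops below MIN_SLEEP_SECS. */
int next_sleep_time(int step, double jitter, long elapsed)
{
    double t;

    assert(jitter >= 0.0 && jitter <= 1.0);
    t = (step - 5) + 10.0 * jitter - (double)elapsed;
    return t < MIN_SLEEP_SECS ? MIN_SLEEP_SECS : (int)t;
}

/* Sleep for the whole interval even if signals interrupt it:
 * sleep() returns the number of seconds left unslept. */
void sleep_fully(unsigned int secs)
{
    while (secs > 0)
        secs = sleep(secs);
}
```

With a 15 second step and a poll that took 30 seconds, the original expression goes negative and the thread polls again immediately; the clamped version waits one second instead.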
[Ganglia-developers] gmetad spamming logs with unable to write root epilog
Hi,

recently we added better monitoring for our ganglia infrastructure, and one of the checks for gmetad contacts it on port 8651, looks for some XML string and exits (receiving 20+ MBs of xml every time we run the check isn't an option). The 'exits' part means sending a RST before gmetad has sent all data, which causes root_report_end() to fail, with the subsequent message 'server_thread() %d unable to write root epilog' being logged.

Is it really necessary to log an error message if the client goes away early? After all it's not ganglia/gmetad malfunctioning or anything, and we could still keep that for debug mode. If that makes sense to you, the one line patch is below.

Thanks

Index: server.c
===================================================================
--- server.c    (revision 2058)
+++ server.c    (working copy)
@@ -639,7 +639,7 @@
    if(root_report_end(&client))
       {
-         err_msg("server_thread() %d unable to write root epilog", pthread_self() );
+         debug_msg("server_thread() %d unable to write root epilog", pthread_self() );
       }

    close(client.fd);
Re: [Ganglia-developers] metric loss and send channel failures in a multi-channel setup
On Mon, Aug 17, 2009 at 7:56 PM, Spike Spiegel fsm...@gmail.com wrote:
> thanks for your input,

I've given this a go and there's a patch attached to this email that I'd like to hear comments about. I've never used apr before, but based on the documentation [1] apr_array_push will allocate new space for the new element, so what I've done is pre-allocate only one element and then let apr_array_push do the work. I realize this means we're doing dynamic allocation inside the loop, but given the small number of items I guess the overhead is negligible. The patch is against trunk, but it looks like it'll work fine on the 3.0 branch too.

[1] http://apr.apache.org/docs/apr/0.9/group__apr__tables.html#gc08267b32905197dd023314d9603
I'm linking 0.9 but 1.3 is the same for this function.

Attachment: libgmond-trunk-metrics-loss.diff
[Ganglia-developers] metric loss and send channel failures in a multi-channel setup
Hi,

we have a setup with 2 unicast channels, and we recently ran across an issue where we lost a bunch of metrics submitted with gmetric due to a dns problem that made one of the two channels unreachable. I traced this back to libgmond.c and Ganglia_udp_send_channels_create(...), where the code calls exit(1) as soon as it fails to create a socket (lines 323:344). I'm not sure if this is intended or not, but it certainly damages redundant setups like ours, where we'd definitely prefer to have only some of the channels getting data rather than all or nothing.

I'd like to propose that the behavior be changed so that the error_msg() + exit() is replaced with a debug_msg() call, and then outside of the loop and before the return we check whether any channel has been created at all, and fail there if none has. I would have gone ahead and attached a patch, but I'm not familiar with the apr API and was unsure what the best approach was to deal with the send_channels array, especially given that the code seems to preallocate space for num_udp_send_channels (line 291).

Thanks for your input,
Spike
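The proposed behaviour (skip channels that cannot be created, fail only if none could) can be sketched independently of apr. Everything below is illustrative: `create_send_channels`, the `open_fn` callback and `fake_open` are hypothetical stand-ins, not libgmond functions, and a real version would resolve the host and create a UDP socket where the callback is invoked.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Opens a channel to `host`; returns a descriptor >= 0, or -1 on failure. */
typedef int (*open_fn)(const char *host);

/* Try every configured channel, keep the ones that work, and report how
 * many were created. The caller treats 0 as fatal instead of exiting on
 * the first failure. `fds` must have room for n entries. */
size_t create_send_channels(const char **hosts, size_t n,
                            open_fn open_channel, int *fds)
{
    size_t created = 0;

    assert(hosts != NULL && open_channel != NULL && fds != NULL);

    for (size_t i = 0; i < n; i++) {
        int fd = open_channel(hosts[i]);
        if (fd < 0) {
            /* Log and move on: one bad channel must not kill the rest. */
            fprintf(stderr, "skipping unreachable channel %s\n", hosts[i]);
            continue;
        }
        fds[created++] = fd;
    }
    return created;
}

/* Stand-in opener for demonstration: any host name starting with "bad"
 * behaves like a name that fails to resolve. */
int fake_open(const char *host)
{
    return strncmp(host, "bad", 3) == 0 ? -1 : 3;
}
```

With one resolvable and one unresolvable head the function returns 1 and metrics still reach the healthy head; only a return value of 0 would justify giving up.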
Re: [Ganglia-developers] Thoughts on host spoofing
On Fri, Feb 6, 2009 at 2:52 PM, Rick Cobb rc...@quantcast.com wrote:
> My thought is that the fewer underlying services a monitoring system needs to work, the more likely it is to work.

Absolutely, but dns itself is actually a good example of how introducing a dependency was necessary to make a service usable. The problem here is that if you don't have context, most information is meaningless or possibly misleading, and an ip imho doesn't qualify as context. By the time you do the lookup from the frontend the ip might have moved, and this is actually not that far-fetched depending on your infrastructure and how long you retain data for. Obviously if you maintain these associations elsewhere you're good, but otherwise being able to store webXX is pretty useful (and the reason I want more control over it).
Re: [Ganglia-developers] gmond python module interface
Hi,

provided that I haven't had the time to look at this part of the code yet, and that I agree it would be much nicer to have a gmetric-like behavior,

On Sun, Feb 1, 2009 at 12:21 AM, David Stainton dstainton...@gmail.com wrote:
> I like using gmetric to monitor... so I wrote gmetric-daemon which is my attempt at a forking standalone daemon which runs Python metric modules and calls gmetric for each metric...

in a previous email you call for a most scalable, most correct and most reliable/highly available design, which is certainly a valuable goal, but one that I don't see met by this proposal. A gmetric-daemon, as far as I understand gmetric, would defeat caching and directives like threshold and timeout, which are very important at least as far as scalability goes. Furthermore, as long as there are built-in plugins with collection groups and so on, a third party daemon sounds like the wrong approach to me. So, as much easier as it might be at first, I believe that the most scalable, most correct and most reliable design is the one Brad proposes, with the caveat that figuring it all out will take more time.

> I wanted a slightly different multithreaded approach to monitoring... but it turns out that Python threads really suck.

Care to share in which way python threads really suck?

> So I made this a forking daemon. One process per module. Not very memory efficient. But then I don't expect to need many modules...

*I* don't? What if somebody else does? What if you do tomorrow, or at another job? I don't see how you'd fix something like that at a later stage without having to throw everything away. And how does this meet the most scalable design goal?

Don't get me wrong, I'm sure everybody agrees on the problems and appreciates the effort. I'm merely pointing out that from my perspective this proposal doesn't meet the design goals and is unlikely to get traction upstream or in the HPC community, even though it might be just perfect for you and other people. And just in case: I have no affiliation with ganglia and these are my own opinions, maybe upstream folks have completely different thoughts.

Time and skills permitting, I'd be happy to help out with improving the python interface, especially since it's something we'd like to heavily leverage at work.

Thanks
Re: [Ganglia-developers] gmetad protocol and propagating errors back to the client
On Thu, Jan 22, 2009 at 6:55 PM, Carlo Marcelo Arenas Belon care...@sajinet.com.pe wrote:
> the interactive port was designed to mimic the behaviour from the original gmetad port which always returns the whole tree.

why's that? if I wanted the whole tree I'd query the non interactive port, instead I'm asking for specific metrics so I should get them or nothing (or an error). Falling back to whole tree doesn't sound correct to me.

> if your concern is about returning too much data and the request was missing, it might be better then to return no tree information (which should be also valid)

I'm not sure what you mean here with no tree information. Would the DTD + grid tag count as such? I see 2 cases:

1) bad request
2) some/all of the items do not exist

1) happens before root_report_start is run, so we could easily return nothing, or call root_report_start and root_report_end before closing the fd.
2) happens after root_report_start has run, so we could add each found metric and nothing for the non-existing ones, and then call root_report_end.

doing that, in both cases you get valid xml with at worst a GRID tag that doesn't contain anything, or one that contains multiple cluster tags for each requested metric with the non-existing ones missing, which should be enough of a hint to the client that they don't exist. would that do?

-- Behind every great man there's a great backpack - B.
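The two cases in this message can be sketched in Python. This only illustrates the proposed response shape and is not gmetad's C code: the nested-dict `TREE`, `parse_path` and `build_response` are hypothetical names, and the CLUSTER/HOST/METRIC elements carry far fewer attributes than real gmetad output would.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory tree: cluster -> host -> metric -> value.
TREE = {"WWW": {"www1": {"apache_procs_busy": "12", "load_one": "0.4"}}}


def parse_path(path):
    """Return [cluster, host, metric] or None if the request is malformed."""
    parts = path.strip("/").split("/")
    if not path.startswith("/") or len(parts) != 3 or not all(parts):
        return None
    return parts


def build_response(tree, paths):
    """Always return well-formed XML; never fall back to the whole tree."""
    grid = ET.Element("GRID", NAME="sketch")  # stands in for root_report_start
    parsed = [parse_path(p) for p in paths]
    if not parsed or any(p is None for p in parsed):
        # Case 1: bad request, reply with an empty GRID.
        return ET.tostring(grid, encoding="unicode")
    for cluster, host, metric in parsed:
        value = tree.get(cluster, {}).get(host, {}).get(metric)
        if value is None:
            continue  # Case 2: missing items are simply left out.
        c = ET.SubElement(grid, "CLUSTER", NAME=cluster)
        h = ET.SubElement(c, "HOST", NAME=host)
        ET.SubElement(h, "METRIC", NAME=metric, VAL=str(value))
    return ET.tostring(grid, encoding="unicode")
```

A client can then tell which requested metrics were missing by comparing the METRIC names it got back with the ones it asked for.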
Re: [Ganglia-developers] CVE
On Fri, Jan 23, 2009 at 11:52 PM, Brad Nicholes bnicho...@novell.com wrote:
> * http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2009-0242 Ganglia 3.1.1 allows remote attackers to cause a denial of service via a request to the gmetad service with a path does not exist, which causes Ganglia to (1) perform excessive CPU computation and (2) send the entire tree, which consumes network bandwidth.
>
> this one is IMHO invalid as the CPU and bandwidth costs for this in the current code are constant and the wording quoted was most likely taken out of context as it referred originally to a contribution proposal which has not been yet committed.

agreed, all the advisories I've seen around have misquoted my original report and missed the link to the feature proposal. As it stands this CVE is invalid.

> Are we finished hashing this whole patch out yet? Are we ready to apply the current patch to 3.1.2 and release or is there still more discussion going on?

as far as I'm concerned #223 is resolved and good to go. thanks everybody.

-- Behind every great man there's a great backpack - B.
Re: [Ganglia-developers] Possible REST interface to the interactive port?
On Wed, Jan 21, 2009 at 2:52 AM, Brad Nicholes bnicho...@novell.com wrote:
> Yep, I was also thinking that a RESTful output module for gmetad-python would probably be the easiest solution

I haven't used gmetad-python yet so one concern would be performance and how it'd behave having to aggregate and serve a lot of data/requests. And another question is how different/harder/easier it would be to scale a RESTful service in gmetad versus say a standalone django/pylons app.

Plus it would be nice if you could request a time range or range of values instead of just current, which would require some kind of storage and leads me to what I was playing with: use memcache to store the last n values using hash(hostname+metric) as key and take advantage of expiration to clean up old stuff. At this point you can easily put together a fairly standard web service that can return last or even last-n values without adding complexity to ganglia.

You could make it even smarter and make it rrd aware so that if you want older data it can be fetched from there, and you could add support for a freshness check so it pings gmetad to request the last reading's timestamp and uses that to validate data read from memcache, but anyway let's keep it simple for now.

-- Behind every great man there's a great backpack - B.
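A minimal sketch of the last-n idea from this message, assuming a plain dict in place of a real memcache client so that it runs without a server. `LastNStore`, its parameters, the `/` separator and the SHA-1 key are illustrative choices and not part of ganglia or of any memcache library.

```python
import hashlib
import time
from collections import deque


class LastNStore:
    """Dict-backed stand-in for memcache: last n readings per key, with a TTL."""

    def __init__(self, n=5, ttl=300.0, clock=time.time):
        self.n, self.ttl, self.clock = n, ttl, clock
        self._data = {}  # key -> (expires_at, deque of (timestamp, value))

    @staticmethod
    def key(hostname, metric):
        # hash(hostname+metric); the separator avoids "ab"+"c" == "a"+"bc".
        return hashlib.sha1(f"{hostname}/{metric}".encode()).hexdigest()

    def put(self, hostname, metric, value):
        now = self.clock()
        k = self.key(hostname, metric)
        expires, readings = self._data.get(k, (0.0, None))
        if readings is None or expires <= now:
            readings = deque(maxlen=self.n)  # expired or new: start over
        readings.append((now, value))        # deque drops the oldest beyond n
        self._data[k] = (now + self.ttl, readings)

    def last(self, hostname, metric, count=1):
        """Return up to `count` most recent values, oldest first."""
        k = self.key(hostname, metric)
        expires, readings = self._data.get(k, (0.0, None))
        if readings is None or expires <= self.clock():
            self._data.pop(k, None)          # expiration cleans up old stuff
            return []
        return [value for _, value in list(readings)[-count:]]
```

With a real memcache the whole list would be stored under one key and rewritten on every `put`, letting the server-side expiry play the role of the `ttl` check here.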
[Ganglia-developers] gmetad protocol and propagating errors back to the client
Hi, right now when gmetad fails an error is logged and in some cases the connection to the client is interrupted, returning invalid XML, or in other cases (item not found or broken request) the entire tree is returned. This imho is bad behavior and code should be added to inform the client of the error, but before that's possible it needs to be agreed how this communication should happen. I'm not really fond of XML or ganglia's code, but I'd guess adding an ERROR element to the DTD is possibly a solution. At that point whenever there's an error root_report_start() should be called at the very least and an error element added inside. This should also work nicely for the multi-item per request patch I proposed elsewhere [1] as you'd have an error per requested element. If anybody is willing to lend a hand to kickstart the XML definition (or whatever approach is best) I'd be glad to work on the rest. thanks

-- Behind every great man there's a great backpack - B.
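To make the proposal concrete, here is a rough Python sketch of what a reply carrying one ERROR element per failed item could look like. Nothing here was agreed in the thread: the `REQUEST` attribute, the message text inside the element and the `error_report` helper are all assumptions, and the real DTD would have to be extended before any client could rely on such output.

```python
import xml.etree.ElementTree as ET


def error_report(errors, grid_name="unspecified"):
    """Build a GRID reply holding one hypothetical ERROR element per failure.

    `errors` is a list of (request_path, message) pairs.
    """
    grid = ET.Element("GRID", NAME=grid_name)  # what root_report_start would open
    for request, message in errors:
        err = ET.SubElement(grid, "ERROR", REQUEST=request)  # attribute is assumed
        err.text = message
    return ET.tostring(grid, encoding="unicode")


print(error_report([("/WWW/www1/nope", "metric not found")]))
```

Because the reply stays well-formed, a client can parse it as usual and simply look for ERROR children before trusting the rest.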
Re: [Ganglia-developers] patches for: [Sec] Gmetad server BoF and network overload + [Feature] multiple requests per conn on interactive port
On Sun, Jan 18, 2009 at 7:35 PM, Carlo Marcelo Arenas Belon care...@sajinet.com.pe wrote:
>> other than that looks good to me.
>
> could you check the simplified one?, this problem was introduced in 2003 and therefore affects all versions of ganglia since then (including 2.5.7 which is not supported anymore and that will need to be patched by the users of it which include Debian/Ubuntu, Novell/OpenSuSE and probably others).

apologies but I lost you there, what do you mean with the simplified one?

>> Two things: 1) How has this been tested? I did some myself and got to wonder how you guys did it, do you have any standardized approach?
>
> sadly there is no test suite associated with ganglia code and therefore there is no standardized approach other than applying the patch and banging the resulting binary to see if it works reliably.

alright, I was thinking of a couple scripts to generate traffic and then do the queries, I think Jesse mentioned something like that on irc based on gmetric. I believe something like that would be useful, and either python or perl could be enough to write something threaded to generate enough load for testing I guess. Is that what you meant when you said banging the resulting binary?

-- Behind every great man there's a great backpack - B.
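As a rough idea of the "something threaded" mentioned above, the following Python sketch fires a batch of requests at a TCP port from a thread pool and records how many bytes each reply contained. It assumes the server answers one newline-terminated request per connection and then closes it; `query`, `run_load` and their parameters are hypothetical names, and generating the metric traffic itself (for instance with gmetric) is not covered.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def query(host, port, request, timeout=5.0):
    """Send one request line and return the number of bytes received."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request.encode("ascii") + b"\n")
        received = 0
        while True:
            chunk = sock.recv(65536)
            if not chunk:  # server closed the connection: reply is complete
                break
            received += len(chunk)
    return received


def run_load(host, port, requests, workers=8):
    """Issue all requests concurrently; sizes come back in request order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: query(host, port, r), requests))
```

Something like `run_load("127.0.0.1", 8652, ["/"] * 200, workers=20)` would then hammer a local interactive port; comparing reply sizes across runs is a crude but quick way to spot truncated output.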
Re: [Ganglia-developers] patches for: [Sec] Gmetad server BoF and network overload + [Feature] multiple requests per conn on interactive port
On Mon, Jan 19, 2009 at 5:44 AM, Carlo Marcelo Arenas Belon care...@sajinet.com.pe wrote:
> agree, but that is to be done in the context of getting multi-patch committed and backported, but not in fixing this buffer overflow in the interactive port, which is what BUG223 is about.

ok, guess I'll start a different thread about this later on once we've worked out #223

> from what I check while trying some fuzzing we have still a problem (probably introduced with the buffer overflow patch) when the request is too long (over 2048 bytes) as shown by:
>
> $ echo /`python -c "print '%s/%s/%s' % ('a'*1700,'b'*300,'c'*48)"` | netcat 127.0.0.1 8652

what problem are you seeing? trunk (r1950) does not reflect what we're talking about as it includes my original return 1 if element is not found which leads to the truncated xml output. Reverting to 1233 and applying the latest patch from #223 works fine for me and I get back the entire tree as there's no a*1700 grid.

-- Behind every great man there's a great backpack - B.
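For anyone who wants to repeat the over-long request without shell quoting trouble, here is a small Python equivalent of the one-liner quoted above. The 2048 figure is simply the limit mentioned in the thread, and `build_long_request` and `send_request` are hypothetical helper names; the host and port defaults mirror the quoted command.

```python
import socket

LIMIT = 2048  # request size limit as quoted in the thread, not read from gmetad


def build_long_request(a=1700, b=300, c=48):
    """Return '/aaa.../bbb.../ccc...' with the given segment lengths."""
    return "/" + "/".join(("a" * a, "b" * b, "c" * c))


def send_request(request, host="127.0.0.1", port=8652, timeout=5.0):
    """Send the request plus newline and return everything the server replies."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request.encode("ascii") + b"\n")
        chunks = []
        while True:
            chunk = sock.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)


request = build_long_request()
print(len(request), len(request) > LIMIT)
```

With the default segment lengths the request is 2051 characters before the trailing newline, so it crosses the quoted limit by a few bytes, which is what makes it a useful boundary test.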
Re: [Ganglia-developers] Possible REST interface to the interactive port?
Hi,

On Sat, Jan 17, 2009 at 5:04 AM, john allspaw jalls...@yahoo.com wrote:
> Hey all - Wondering if there's ever been any talk about serving up the interactive port info via REST?

I am kinda working on this already although not in the form of a ganglia patch, but as an external application that pulls data out of ganglia. The reason for this being that I don't want to be dependent on ganglia and that it's easier to aggregate other sources of information not to mention development time since I can use python, but this is more of a personal choice since I'm not fluent in C.

> http://gmetad.hostname:8652/WWW/www1.flickr.mud.yahoo.com/apache_procs_busy/ (and all of the other stuff you can get from the interactive port) I'd bet that all of the requests to bolt-on alerting mechanisms would go away if other alerting/escalation tools could get the real stuff out of ganglia, too. :)

this is the reason why I offered that multi-item patch so that I could write smarter monitoring checks able to account for complex scenarios (depending on environment apache_proc_busy itself is much less relevant than apache_proc_busy + incoming_connections + database_connections)

> Thoughts?

my main worry is ganglia getting too complicated and offering something that is not entirely related. This code would end up in gmetad making the server more complex and prone to errors and possibly harming data aggregation since I guess it'd be running in another thread. I haven't thought this through, but one idea I considered was to employ another host to run gmetad-python which would allow an easier creation for a rest interface or even a different backend engine to say store data into a database which then you would build your REST service on top of. That said I appreciate the benefits of a built-in interface, the speed benefits and the reduced number of dependencies on other components.

thanks for bringing this up, definitely interesting topic

-- Behind every great man there's a great backpack - B.
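The kind of "smarter" check mentioned in this message, where apache_proc_busy only means something next to incoming_connections and database_connections, could look roughly like the Python sketch below. The thresholds, the state names and the `classify` function are invented for illustration; only the three metric names come from the message.

```python
def classify(metrics, busy_max=100, incoming_max=500, db_max=50):
    """Combine three readings into one state instead of alerting on each alone.

    All thresholds are hypothetical and would depend on the environment.
    """
    busy = metrics["apache_proc_busy"] >= busy_max
    incoming = metrics["incoming_connections"] >= incoming_max
    db = metrics["database_connections"] >= db_max
    if not busy:
        return "ok"        # idle workers: nothing to report
    if db:
        return "db_bound"  # busy workers and a saturated database
    if incoming:
        return "traffic"   # busy workers explained by incoming load
    return "stuck"         # busy workers with no matching load


print(classify({"apache_proc_busy": 120,
                "incoming_connections": 40,
                "database_connections": 80}))
```

Fetching all three values in one request is what the multi-item patch was meant to allow, so that the readings being compared come from the same moment.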
Re: [Ganglia-developers] patches for: [Sec] Gmetad server BoF and network overload + [Feature] multiple requests per conn on interactive port
On Fri, Jan 16, 2009 at 7:04 AM, Kostas Georgiou k.georg...@imperial.ac.uk wrote:
> On Thu, Jan 15, 2009 at 01:41:53PM -0700, Brad Nicholes wrote:
>> On 1/15/2009 at 8:56 AM, in message 496efa2a02ac0003a...@lucius.provo.novell.com, Brad Nicholes bnicho...@novell.com wrote:
>> After taking a little closer look at the patch, I think we are OK as far as the recursive call to process_path() is concerned since this case is an error condition and should stop processing rather than continuing in the recursive loop.

indeed, this should work just fine.

>> The other two concerns are still there however. I still think that we are off-by-one in the malloc call. It should be len+1 and I still think that we should limit the malloc to 256 rather than allowing it to be unlimited.
>
> I agree about the off-by-one

argh, my bad sorry, double dumb since I even considered the case. len+1 it is and the comment should go, thanks.

> but I am not too worried about a malloc limit, from what I can tell it can only get as high as REQUESTLEN.

I agree with Kostas, as I wrote in my initial email I didn't worry about that because of the REQUESTLEN boundary which is enforced in readline. as to limiting the path to 256 I actually did that in my first implementation, but eventually converted to a malloc solution because I was reminded that 640 KB ought to be enough for everybody and I could see no downsides.

>> The malloc call needs to be checked for NULL and the comment that The recursive structure doesn't require any memory allocations is false now if malloc replaces the stack allocation.

correct

thanks everybody