Re: [Ganglia-developers] How do we deal with very large clusters in the webui

2011-03-07 Thread Spike Spiegel
Hi,

On Thu, Mar 3, 2011 at 11:11 PM, Jim Greene jim.gre...@gmail.com wrote:
 -Don't show any individual hosts, only the aggregate and the
 load/network/etc levels for the whole cluster

we did this on the main page for grids by adding one line of php that
excluded the bulk of our computing grid.

We also added a regexp parameter that you could pass in the GET
request, and everybody used predefined views without ever hitting the
main grid page.

So for example you'd have http://ganglia.organization.tld/
?g...x...regexp=mysql.* which would display only the mysql servers.
Of course this means you rely on a naming convention that might not
hold in your environment.
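
To illustrate the idea, here is a minimal Python sketch of such a filter. It is not the actual PHP change: filter_hosts and the sample host names are made up for the example, and it assumes the pattern is matched from the start of the host name.

```python
import re


def filter_hosts(hosts, pattern):
    """Return the hosts whose name matches the regexp, anchored at the start."""
    rx = re.compile(pattern)
    return [h for h in hosts if rx.match(h)]


# Hypothetical host names; only the ones following the naming convention match.
hosts = ["mysql01", "mysql02", "web01", "backup-mysql"]
print(filter_hosts(hosts, r"mysql.*"))
```

Note how "backup-mysql" is left out: a host that doesn't follow the naming convention silently drops off the view, which is exactly the weakness mentioned above.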

 What are your thoughts on how we can accomplish this?

Probably best to look into the new frontend, which is being built
precisely to address this sort of limitation.

https://github.com/vvuksan/ganglia-misc/tree/master/ganglia-web

-- 
Behind every great man there's a great backpack - B.

--
What You Don't Know About Data Connectivity CAN Hurt You
This paper provides an overview of data connectivity, details
its effect on application quality, and explores various alternative
solutions. http://p.sf.net/sfu/progress-d2d
___
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers


Re: [Ganglia-developers] gmetad and rrdtool scalability

2009-12-21 Thread Spike Spiegel
On Sun, Dec 20, 2009 at 7:35 PM, Vladimir Vuksan vl...@vuksan.com wrote:
 If you lose a day or
 two or even a week of trending data that is not gonna be disaster as long
 as that data is present somewhere else.

sure, but where? how would the ganglia frontend tell?

 Thus I proposed a simple solution
 where even if one of the gmetads (gmetad1) fails you can either

 a. Get all the rrds (rsync) from gmetad2 before you restart gmetad1

which, unless you have a small amount of data or a fast network
between the two nodes, won't complete before the next write is
initiated, meaning they won't be identical.

 b. Simply start up gmetad1 and don't worry about the lost data

sure

 As far as which data is going to be displayed you can do either

 1. Proxy traffic to Ganglia with most up to date data

how do you tell which one has most up to date data?

 2. Change DNS record to point to Ganglia with most up to date data

same question, which one has most up to date data?

if you really mean most recent then both would, because both would
have fetched the last reading assuming they are both functional, but
gmetad1 would have a hole in its graphs. To me that does not really
count as up to date. Up to date would be the one with the most
complete data set which you have no way to identify programmatically.

Also, assume now gmetad2 fails and both have holes, which one is the
most up to date?

 To your last point there are chances that both gmetads fail in quick
 succession however I would think that would be a highly unlikely event.

it doesn't have to be in quick succession for you to end up with
holes in your data and no way to go back, which is my main point: as
much as you can say that no-data-loss requirements aren't a major
concern for most people, the fact remains that with the current
codebase you can't avoid that situation, which imho isn't right.

 If you had requirements for such flawless performance you should be able to
 invest resources to resolve it.

I'm sorry, but I don't see it. Even with plenty of resources you'd
have to either put some heavy restrictions in place, like centralized
data on a SAN, which is not really something you'd want in a
distributed setup, or add plenty of hacks to, say, replay the content
of the rrds to some other place, but even in that case it's pretty
quirky.

 Makes sense ?

I guess it does if I look at it from your perspective which if I
understood it correctly implies that:
* some data loss doesn't matter
* manual interaction to fix things is ok

But that isn't my perspective. Scalable (distributed) applications
should be able to guarantee by design no data loss in as many cases as
possible and not force you to centralized designs or hackery in order
to do so.

There are ways to make this possible without changes to the current
gmetad code by adding a helper webservice that proxies the access to
rrd. This way it's perfectly fine to have different locations with
different data and the webservice will take care of interrogating one
or more gmetads/backends to retrieve the full set and present it to
the user. Fully distributed, no data loss. This could be of course
built into gmetad by making something like port 8652 access the rrds,
but to me that's the wrong path, makes gmetad's code more complicated
and it's potentially a functionality that has nothing to do with
ganglia and is backend dependent.
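
A rough sketch of what such a helper would do once it has fetched the same metric from each backend. This is Python purely for illustration; merge_series and the sample data are invented, and a real service would read the values out of the rrds rather than from dicts.

```python
def merge_series(*series):
    """Merge {timestamp: value} maps; None marks a hole in a backend's data.

    For every timestamp the first backend that has a value wins.
    """
    merged = {}
    for s in series:
        for ts, value in s.items():
            if merged.get(ts) is None:
                merged[ts] = value
    return dict(sorted(merged.items()))


# gmetad1 was down for two polls, gmetad2 missed the last one.
gmetad1 = {0: 1.0, 15: None, 30: None, 45: 1.3}
gmetad2 = {0: 1.0, 15: 1.1, 30: 1.2, 45: None}
print(merge_series(gmetad1, gmetad2))
```

Neither backend has the full set, but the merged view has no holes as long as at least one of them recorded each reading.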

thoughts?

-- 
Behind every great man there's a great backpack - B.



Re: [Ganglia-developers] gmetad and rrdtool scalability

2009-12-20 Thread Spike Spiegel
On Mon, Dec 14, 2009 at 2:00 AM, Vladimir Vuksan vli...@veus.hr wrote:
 I think you guys are complicating much :-). Can't you simply have multiple
 gmetads in different sites poll a single gmond. That way if one gmetad fails
 data is still available and updated on the other gmetads. That is what we
 used to do.

Would you mind explaining to me why having multiple gmetads in
different colos pulling from the same gmond is simpler than the
infrastructure I presented in my post? Furthermore, could you please
show me how your simpler solution addresses the problem of bringing
the failed gmetad back up such that both gmetads have the same data?
And if that's not what you had in mind, what's your strategy? Which
data is going to be displayed to the user? And what if the first
gmetad, the one that didn't fail, now fails while the restored one
continues working?

thanks for your clarifications.

-- 
Behind every great man there's a great backpack - B.



Re: [Ganglia-developers] gmetad and rrdtool scalability

2009-12-20 Thread Spike Spiegel
On Mon, Dec 14, 2009 at 10:28 AM, Carlo Marcelo Arenas Belon
care...@sajinet.com.pe wrote:
 a) you are only concerned with redundancy and not looking for
 scalability - when I say scalability, I refer to the idea of maybe 3 or
 more gmetads running in parallel collecting data from huge numbers of agents

 what is the bottleneck here? CPUs for polling or IO? if IO, using memory
 would most likely be all you really need (especially considering RAM is really
 cheap and RRDs are very small); if CPUs, then there might be some things we
 can do to help with that, but vertical scalability is what gmetad has, and
 that usually means going to a bigger box if you hit the limit on the
 current one.

IME cpu isn't really a problem; the big load is I/O, and indeed moving
the rrds to a ramdisk is the most common solution, with pretty decent
results.


 b) you can afford to have duplicate storage - if your storage
 requirements are huge (retaining a lot of historic data or lots of data
 at short polling intervals), you may not want to duplicate everything

 if you are planning to store a lot of historic data then you should be
 using some sort of database instead, not RRDs, so I think this shouldn't
 be an issue unless you explode the RRAs and try to abuse the RRDs as an RDBMS

I think there's a middle ground here that'd be interesting to explore,
altho that's a different thread, but for kicks this is the gist: the
common pattern for rrd storage is hour/day/month/year and I've always
found it bogus. In many cases I've needed higher resolution (down to
the second) for the last 5-20 minutes, then intervals of an hr to a
couple hrs, then a day to three days and then a week to 3 weeks etc
etc, which increases your storage requirements, but is imho not an
abuse of rrd and still retains the many advantages of rrd over having
to maintain an RDBMS.
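
To give an idea of the cost, here is a small Python sketch that turns such a layout into rrdtool RRA definitions (RRA:CF:xff:steps:rows) and counts the rows. Everything numeric here is an assumption: a 1 second base step, and the resolutions I picked for the tiers after the first one. The size estimate assumes roughly 8 bytes per stored value and ignores file headers.

```python
STEP = 1  # base step in seconds (assumed)

# (resolution in seconds, retention in seconds).
# Only the first tier's resolution comes from the text; the rest are assumed.
TIERS = [
    (1, 20 * 60),        # 1 s resolution for the last 20 minutes
    (60, 2 * 3600),      # 1 min resolution for a couple of hours
    (300, 3 * 86400),    # 5 min resolution for three days
    (3600, 21 * 86400),  # 1 h resolution for three weeks
]


def rra_defs(tiers, step=STEP, cf="AVERAGE", xff=0.5):
    """Build one RRA definition string per tier."""
    defs = []
    for resolution, retention in tiers:
        steps = resolution // step     # base steps consolidated into one row
        rows = retention // resolution  # rows needed to cover the retention
        defs.append(f"RRA:{cf}:{xff}:{steps}:{rows}")
    return defs


defs = rra_defs(TIERS)
total_rows = sum(int(d.rsplit(":", 1)[1]) for d in defs)
for d in defs:
    print(d)
print(total_rows, "rows, roughly", total_rows * 8, "bytes per data source")
```

Under these assumptions the whole layout stays in the tens of kilobytes per metric, which is more than the stock layout but nowhere near database territory.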

 Carlo

 PS. I like the ideas on this thread, don't get me wrong, just that I agree
    with Vladimir that gmetad and RRDtool are probably not the sweet spot
    (cost wise) for scalability work even if I also agree that the vertical
    scalability of gmetad is suboptimal to say the least.

sort of. If you're looking at where your resources go to compute and
deal with large amounts of data, I agree. If you look at what it costs
you, or whether it's even possible, to create a fully scalable and
resilient ganglia-based monitoring infrastructure, I disagree.

-- 
Behind every great man there's a great backpack - B.



Re: [Ganglia-developers] gmetad and rrdtool scalability

2009-12-05 Thread Spike Spiegel
On Wed, Nov 25, 2009 at 4:20 PM, Daniel Pocock dan...@pocock.com.au wrote:
 One problem I've been wondering about recently is the scalability of
 gmetad/rrdtool.

[cut]

 In a particularly large organisation, moving around the RRD files as
 clusters grow could become quite a chore.  Is anyone putting their RRD
 files on shared storage and/or making other arrangements to load balance
 between multiple gmetad servers, either for efficiency or fault tolerance?

We do. We run 8 gmetad servers, 2 in each colo x 3 colos + 2 centrals,
and rrds are stored in a ram disk on each node. Nodes are set up with
unicast and data is sent to both heads in the same colo for fault
tolerance/redundancy. This is all good until you have a gmetad failure
or need to perform maintenance on one of the nodes, because at that
point data stops flowing in and once you're done you will have to
rsync back from the other head, and no matter how you do it (live
rsync or stopping the other head during the sync process) you will
lose data. That said, it could easily be argued that you have no
guarantee that both heads have the same data to start with, because
messages are udp and either node may have lost some data the other
hasn't. Of course there is a noticeable difference between a random
message loss and a, say, 15 window blackout during maintenance, but
then if your partitions are small enough a live rsync could possibly
incur a small enough loss... it really depends.

As to shared storage, we haven't tried it, but my personal experience
is that, given how a local filesystem can't manage that many small
writes and seeks, using any kind of remote FS isn't going to work.

I see two possible solutions:
1. client caching
2. built-in sync feature

In 1. gmond would cache data locally if it could not contact the
remote end. This imho is the best solution because it helps not only
with head failures and maintenance, but possibly addresses a whole
bunch of other failure modes too.
2. instead would make gmetad aware of when it got data last and be
able to ask another gmetad for its missing data and keep fetching
until the delta (data loss) is small enough (user configured) that it
can again receive data from clients. This is probably harder to
implement and still would not guarantee no data loss, but I don't
think that's a goal. The interesting property of this approach is that
it'd open the door for realtime merge of data from multiple gmetads so
that as long as at least one node has received a message a client
wouldn't ever see a gap, effectively providing no data loss. I'm toying
with this solution in a personal non-ganglia related project as it's
applicable to anything with data stored in rrd over multiple
locations.
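
The catch-up loop in option 2 could be sketched like this. It is Python for illustration only, nothing like it exists in gmetad; fetch_from_peer is a hypothetical callable and make_fake_peer just simulates a peer that hands out data in batches while time keeps moving.

```python
def catch_up(last_local, now, fetch_from_peer, max_delta):
    """Pull missing data from a peer until the gap is <= max_delta seconds.

    fetch_from_peer(since) is a hypothetical callable returning
    (points, newest_timestamp, current_time).
    """
    fetched = []
    while now - last_local > max_delta:
        points, newest, now = fetch_from_peer(last_local)
        if newest <= last_local:
            break  # the peer has nothing newer, stop rather than spin
        fetched.extend(points)
        last_local = newest
    return fetched, last_local


def make_fake_peer(start_now, step=15, batch=40, fetch_cost=30):
    """Simulated peer: at most `batch` points per call, each call costs time."""
    clock = {"now": start_now}

    def fetch(since):
        clock["now"] += fetch_cost  # the transfer itself takes time
        available = clock["now"] - clock["now"] % step
        newest = min(since + batch * step, available)
        points = list(range(since + step, newest + 1, step))
        return points, newest, clock["now"]

    return fetch


fetched, last_local = catch_up(90, 1000, make_fake_peer(1000), max_delta=60)
print(len(fetched), "points fetched, local data now current up to", last_local)
```

Because new data keeps arriving while the transfer runs, the loop never reaches a zero gap; it only stops once the remaining delta is under the configured threshold, which is why this approach still cannot guarantee no data loss.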

thanks

-- 
Behind every great man there's a great backpack - B.



Re: [Ganglia-developers] Feeble attempt at gmond aliasing

2009-10-09 Thread Spike Spiegel
On Fri, Oct 2, 2009 at 9:59 PM, Jesse Becker haw...@gmail.com wrote:
 On Fri, Oct 2, 2009 at 10:35, Brad Nicholes bnicho...@novell.com wrote:
 How well does this fit into the previous discussions of using a GUID to 
 identify a box rather than an IP or FQDN?  Are aliasing and GUID identifiers 
 related or are they two separate issues?

 I think that is a separate, but related, discussion.  Perhaps I'm
 wrong, but there doesn't seem to be a clear consensus about using
 GUIDs vs. FQDN vs. IPs vs. something else (again, someone correct me
 if I'm wrong).  Maybe we should open that discussion again?

why a separate discussion? You're adding a config option which you're
free to set to whatever you think and that to me covers all cases, you
could set it to the hostname, an ip or a GUID. Personally I find that
in large infrastructure naming machines meaningfully is a lost game,
the host itself is more or less irrelevant and what matters is the
service associated to it, so I'd assign a GUID myself and maintain the
association with the service somewhere else, maybe as a metric itself.
On the other hand for the small shop host names are a pretty decent
approach to map your infrastructure so they would prolly want to use
that as an identifier. Either way having it as an option is a safe way
of handling it and avoids surprises at the gmetad end (I don't like
the fact that the receiver resolves the ip of the sender to decide
its name).

-- 
Behind every great man there's a great backpack - B.



Re: [Ganglia-developers] Feeble attempt at gmond aliasing

2009-10-09 Thread Spike Spiegel
On Fri, Oct 9, 2009 at 9:48 PM, Jesse Becker haw...@gmail.com wrote:
 The GUID discussion I referred to was whether gmond/gmetad should be
 rewritten, top-to-bottom, to use GUIDs instead of relying on DNS/IP
 addresses.  My understanding is that everything would have to use them,
 including the .rrd files underneath.  That is, IMO, a big overhaul.

 Adding aliasing is theoretically a smaller change, that I think works
 within the existing code.  This is what I'm proposing to
 add--something simple, and inexpensive to implement, but hopefully
 useful to many people.

 Thus, I see it as separate, but perhaps complementary/related.

I see, makes sense. well, I think that until rrd comes up with a way
to store arbitrary text/info inside an rrd file[1] we're better off
naming the rrd files in a user-defined/expected way, otherwise manual
interaction with the rrd files becomes impossible. Anyway, that's
indeed another discussion and personally I'm all for this alias patch.
As to Rick's comments I believe they are only valid if we assume that
the string representing a host should be its ip or the fqdn resolving
to it, which I think is one of the many problems this alias patch is
meant to solve (instances on EC2 or with multiple interfaces are a
pita if things rely on ips/PTR for identification).

what do we need next? people compiling gmond with this patch and testing?


[1] I've seen that discussion coming up in several instances on the
rrd ML and never go anywhere because of some big change that
apparently would be necessary to implement that feature correctly.

-- 
Behind every great man there's a great backpack - B.



Re: [Ganglia-developers] Another interface for Ganglia stats

2009-09-26 Thread Spike Spiegel
On Tue, Sep 22, 2009 at 9:05 AM, Vladimir Vuksan vli...@veus.hr wrote:
 I guess a lot of the conversation depends on what you want and expect
 Ganglia to be used for. For example there are a lot of people out there
 that are using Ganglia for performance monitoring and using Nagios NRPE
 to get user level stats from the host. To me that is redundant.

indeed, this is one of the many flaws with the monitoring/alerting
setups we have today, it's almost like the people collecting metrics
and those making checks didn't like each other and never talked, but
have to meet in secret in the sysadmin's bedroom...

 Thus if
 you decide you are gonna use Ganglia for providing metric to e.g. Nagios
 you will have to go the route of parsing the Gmond XML. I checked on my
 cluster and each host uses about 15 kBytes (average) of XML to define
 metrics. This works well in small to mid size clusters however as soon
 as you get over certain threshold it breaks down. Let's say

 200 hosts * 15 kB = 3 MB

 if I wanted to keep track of one metric that would be about 600 MBytes
 of traffic per minute or 10 Mbytes/sec just to fetch the whole XML tree.
 More metrics that need to be checked ie. swap_free and you may be doing
 quite a bit of network traffic. This is just to serve the XML and it
 doesn't take into account overhead processing and parsing data.
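
The numbers in the quoted estimate work out as follows, if I read it right. The one assumption not spelled out in the quote is that each of the 200 hosts gets its own check once a minute and every check fetches the whole XML tree.

```python
hosts = 200
xml_per_host_kb = 15

tree_mb = hosts * xml_per_host_kb / 1000      # size of one full XML fetch
fetches_per_minute = hosts                    # assumed: one check per host per minute
per_minute_mb = tree_mb * fetches_per_minute  # traffic for one metric
per_second_mb = per_minute_mb / 60

print(f"full tree: {tree_mb} MB")
print(f"one metric: {per_minute_mb} MB/min = {per_second_mb} MB/s")
```

Every additional metric checked the same way adds another multiple of that traffic, unless the XML is fetched once and cached.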

 You'll say wait a minute :-) if I was doing such a thing I would cache
 the data etc. I hear some people are doing just that ie.

/me raises hand

 storing XML on
 local storage. I have couple ideas myself but the point is that such a
 set up requires yet another thing to setup, monitor and maintain.

indeed, not to mention your data has to be cached for longer than it
would need to be if there were less of it to exchange each time (on a
large setup you need caching no matter what)

 Also perhaps REST API is not really the way to go but a simple HTTP
 interface would suffice.

 I hope this makes sense :-).

It did, except that last bit... how is a simple HTTP interface the way
to go but a REST API perhaps not? Given the pretty simple and easy to
represent data model I don't see how structuring your HTTP calls so
that they are RESTful is not the way to go. If you had said that an
http interface is too much and a simpler TCP one would suffice, I'd
have disagreed but understood; instead I'm lost on the simple HTTP
vs. REST API distinction.

cheers

-- 
Behind every great man there's a great backpack - B.



Re: [Ganglia-developers] Fwd: [Ganglia-general] Another interface for Ganglia stats

2009-09-17 Thread Spike Spiegel
On Fri, Sep 18, 2009 at 8:32 AM, Bernard Li bern...@vanhpc.org wrote:
 Forwarding this to ganglia-developers since this is a more -devel
 related discussion.  Also can get spike's opinions in ;-)

remember that you asked for it :P

 On Wed, Sep 16, 2009 at 11:49 AM, Vladimir Vuksan vli...@veus.hr wrote:
 There have been some tweets that someone was working on a REST interface
 for Ganglia.

I would have loved to see something more than a tweet about that
(which I haven't seen either, I was just told about it). do you have
any more info? what kind of REST interface? it can mean a lot of
things, or nothing.

 At first I thought it wasn't such a big deal

Care to share why that is? Personally I'd find it a great addition and
a basic requirement to make extensibility and interoperability with
other applications possible (of course it can be argued that given the
user base and scope there is no interest in doing so).

 but I think that
 adding a simplistic interface to Ganglia would be a nice addition ie.
 something like

 telnet ganglia 8653
 METRIC web1 load_one

 Which would echo out the current value for load_one. That way you can
 avoid parsing out the XML to get those values. I think for large sites it
 makes a lot of sense. Granted there are workarounds that could be
 implemented and people have.

as one of those people I wonder what a new interface like that
changes; as you say, the only difference would be making client-side
xml parsing unnecessary, which imho is not the problem here.

What I'd like to see is a way to access *all* the data gmetad knows
about, which means both what's in memory and inside the rrds, and
being able to do so for multiple nodes at the same time (I sent a
patch for multiple nodes request a while ago that maybe I should try
to push for again). The same interface, with obviously only in-memory
values available, should exist for gmond.

Also, I wouldn't make up another port for it, but rather use 8652 and
extend the already supported control parameters. So for example you'd
use the interface like this:
telnet ganglia 8652
/grid/cluster/host1/metric1/time[interval];/grid/cluster/host2/metric1;...?format=text
lastupdated time host1 metric1 value[s]
lastupdated time host2 metric1 value

if you don't specify a time it's assumed you want the most recent
reading and it's fetched from memory, otherwise you get it from the
rrd. The ?format=text parameter selects plain-text output instead of
the classic xml (the default if format isn't specified), and it could
be extended to support json.

something like that to me would start to make a lot more sense, but
it's still not a REST api to which you can speak http and use known
methods to do useful things like caching results.
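
To make the request syntax above a bit more concrete, here is a Python sketch of how such a line could be parsed. The syntax itself is only my proposal, nothing gmetad supports today, and parse_request plus the field names are invented for the example; the time component is kept as an opaque string.

```python
def parse_request(line):
    """Split a proposed request line into per-metric targets and a format."""
    paths, _, query = line.strip().partition("?")

    fmt = "xml"  # classic output unless told otherwise
    for param in filter(None, query.split("&")):
        key, _, value = param.partition("=")
        if key == "format":
            fmt = value

    targets = []
    for path in filter(None, paths.split(";")):
        parts = path.strip("/").split("/")
        if len(parts) not in (4, 5):
            raise ValueError(f"bad path: {path!r}")
        grid, cluster, host, metric = parts[:4]
        time_spec = parts[4] if len(parts) == 5 else None
        targets.append({
            "grid": grid,
            "cluster": cluster,
            "host": host,
            "metric": metric,
            "time": time_spec,
            # no time means latest in-memory value, otherwise read the rrd
            "source": "rrd" if time_spec else "memory",
        })
    return targets, fmt


targets, fmt = parse_request(
    "/grid/cluster/host1/metric1/1252671437;/grid/cluster/host2/metric1?format=text"
)
for t in targets:
    print(t["host"], t["metric"], t["source"])
print("format:", fmt)
```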

let's keep this discussion going.

Spike

-- 
Behind every great man there's a great backpack - B.



[Ganglia-developers] RRD_update illegal attempt to update using time 1252671437 when last update time is 1252671437 (minimum one second step)

2009-09-11 Thread Spike Spiegel
Hi,

our gmetad boxes (2 of them) with 12 data sources, 6 of which are
gmetad and 6 gmonds, are spamming syslog like mad with the following
message:

Sep  6 06:33:32 localhost.localdomain /usr/sbin/gmetad[2526]:
RRD_update (/var/lib/ganglia/rrds/...metric.rrd): illegal attempt to
update using time 1252244010 when last update time is 1252244010
(minimum one second step)

This happens for both metrics and summary graphs.

Looking at the hosts, everything appears to be fine to me, and ntp is
running everywhere and in sync.

Looking at the code instead both gmetad/gmetad.c and
gmetad/data_thread.c have a possibly suspicious call to sleep:

in gmetad.c:417
 sleep_time = 10 + ((30-10)*1.0) * rand()/(RAND_MAX + 1.0);
 sleep(sleep_time);

in data_thread.c:193
 sleep_time = (d->step - 5) + (10 * (rand()/(float)RAND_MAX))
              - (end.tv_sec - start.tv_sec);
 if( sleep_time > 0 )
    sleep(sleep_time);

two observations:
- based on man 3 sleep, if any signal is delivered to gmetad, sleep()
returns early, so the effective sleep interval can be close to 0
- end.tv_sec - start.tv_sec could come out considerably high, which
along with a short step could result in a sleep_time <= 0.
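
To put numbers on the second observation, here is the same expression in Python. The rnd argument stands in for rand()/(float)RAND_MAX, and the 15 second step is an assumed value for the example, not something taken from our config.

```python
def sleep_time(step, elapsed, rnd):
    """Same arithmetic as the data_thread.c expression; rnd is in [0, 1]."""
    return (step - 5) + (10 * rnd) - elapsed


step = 15  # assumed polling step in seconds
for elapsed in (2, 12, 20, 30):
    shortest = sleep_time(step, elapsed, 0.0)
    longest = sleep_time(step, elapsed, 1.0)
    print(f"elapsed={elapsed:>2}s  sleep_time between {shortest:.0f} and {longest:.0f}")
```

As soon as polling a source takes longer than step - 5 seconds the result can go to zero or below, the sleep is skipped and the next poll starts right away. Whether that is what leads to two updates carrying the same timestamp is a guess on my part.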

thoughts?

thanks

-- 
Behind every great man there's a great backpack - B.



[Ganglia-developers] gmetad spamming logs with unable to write root epilog

2009-09-11 Thread Spike Spiegel
Hi,

recently we added better monitoring for our ganglia infrastructure and
one of the checks for gmetad contacts it on port 8651, looks for some
XML string and exits (receiving 20+ MBs of xml every time we run the
check isn't an option). The 'exits' part means sending a RST before
gmetad has sent all its data, which causes root_report_end() to fail
and the message 'server_thread() %d unable to write root epilog' to
be logged. Is it really necessary to log an error message if the
client goes away early? After all it's not ganglia/gmetad
malfunctioning or anything, and we could still keep the message for
debug mode. If that makes sense to you the one-line patch is below.

thanks

Index: server.c
===================================================================
--- server.c    (revision 2058)
+++ server.c    (working copy)
@@ -639,7 +639,7 @@

       if(root_report_end(client))
          {
-            err_msg("server_thread() %d unable to write root epilog", pthread_self() );
+            debug_msg("server_thread() %d unable to write root epilog", pthread_self() );
          }

       close(client.fd);
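
For context, the kind of check that triggers this looks roughly like the Python sketch below. It is not our actual check: the function names, the buffer size and the b"<GRID" marker are placeholders. The point is that it stops reading as soon as the string is found instead of pulling the whole dump.

```python
import socket


def stream_contains(chunks, needle):
    """Return True as soon as needle shows up, even if split across chunks."""
    tail = b""
    for chunk in chunks:
        window = tail + chunk
        if needle in window:
            return True
        # keep just enough bytes to catch a match straddling two chunks
        tail = window[-(len(needle) - 1):] if len(needle) > 1 else b""
    return False


def recv_chunks(host, port=8651, size=65536, timeout=10):
    """Yield data from gmetad's XML port until the peer closes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(size)
            if not data:
                return
            yield data


# Usage (hypothetical host name):
#   ok = stream_contains(recv_chunks("gmetad.example.org"), b"<GRID")
```

When stream_contains returns early the connection is closed with data still unread, which is what makes gmetad's final write fail.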

-- 
Behind every great man there's a great backpack - B.



Re: [Ganglia-developers] metric loss and send channel failures in a multi-channel setup

2009-08-22 Thread Spike Spiegel
On Mon, Aug 17, 2009 at 7:56 PM, Spike Spiegel fsm...@gmail.com wrote:

 thanks for your input,

I've given this a go and there's a patch attached to this email that
I'd like to hear comments about. I've never used apr before, but based
on the documentation [1] apr_array_push will allocate new space for
the new element, so what I've done is pre-allocate only one element
and then let apr_array_push do the work. I realize this means we're
doing dynamic allocation inside the loop, but given the small number
of items I guess the overhead is negligible.

The patch is against trunk, but looks like it'll work fine on 3.0 branch too.

[1] 
http://apr.apache.org/docs/apr/0.9/group__apr__tables.html#gc08267b32905197dd023314d9603
I'm linking 0.9 but 1.3 is the same for this function

-- 
Behind every great man there's a great backpack - B.


libgmond-trunk-metrics-loss.diff
Description: Binary data


[Ganglia-developers] metric loss and send channel failures in a multi-channel setup

2009-08-17 Thread Spike Spiegel
Hi,

we have a setup with 2 unicast channels and we recently ran across an
issue where we lost a bunch of metrics submitted with gmetric due to a
dns problem that made one of the two channels unreachable. I traced
this back to libgmond.c and Ganglia_udp_send_channels_create(...),
where the code calls exit(1) as soon as it fails to create a socket
(lines 323:344). I'm not sure if this is intended or not, but it
certainly hurts redundant setups like ours, where we'd definitely
prefer to have only some of the channels getting data rather than all
or nothing. I'd like to propose that the behavior be changed so that
the error_msg() + exit() is replaced with a debug_msg() call, and then
outside of the loop and before the return we check whether any channel
has been created at all and fail there if none has. I would have gone
ahead and attached a patch, but I'm not familiar with the apr API and
was unsure what the best approach was to deal with the send_channels
array, especially given that the code seems to preallocate space for
num_udp_send_channels (line 291).
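
In pseudo-ish Python, the behavior I have in mind is the following. This is only a sketch of the control flow, not the libgmond code; create_send_channels and its resolve parameter are invented for the example.

```python
import socket


def create_send_channels(targets, resolve=socket.getaddrinfo):
    """Open one UDP channel per (host, port), skipping the ones that fail.

    Only raises if no channel at all could be created.
    """
    channels, failures = [], []
    for host, port in targets:
        try:
            family, socktype, proto, _, addr = resolve(
                host, port, type=socket.SOCK_DGRAM
            )[0]
            sock = socket.socket(family, socktype, proto)
            sock.connect(addr)  # for UDP this only sets the default destination
            channels.append(sock)
        except OSError as exc:  # covers dns failures (socket.gaierror) too
            failures.append((host, port, exc))  # log at debug level, keep going
    if not channels:
        raise RuntimeError(f"no send channel could be created: {failures}")
    return channels, failures
```

With two channels configured and one of them unresolvable, this returns the working channel plus a list of failures to log, instead of taking down the sender.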

thanks for your input,

Spike

-- 
Behind every great man there's a great backpack - B.



Re: [Ganglia-developers] Thoughts on host spoofing

2009-02-06 Thread Spike Spiegel
On Fri, Feb 6, 2009 at 2:52 PM, Rick Cobb rc...@quantcast.com wrote:
 My thought
 is that the fewer underlying services a monitoring system needs to work, the
 more likely it is to work.

Absolutely, but dns itself is actually a good example of how
introducing a dependency was necessary to make a service usable. The
problem here is that without context most information is meaningless
or possibly misleading, and an ip imho doesn't qualify as context.
By the time you do the lookup from the frontend the ip might have
moved, which is not so far-fetched depending on your infrastructure
and how long you retain data for. Obviously if you maintain these
associations elsewhere you're good, but otherwise being able to store
webXX is pretty useful (and the reason I want more control over it).

-- 
Behind every great man there's a great backpack - B.



Re: [Ganglia-developers] gmond python module interface

2009-01-31 Thread Spike Spiegel
Hi,

provided that I haven't had the time to look at this part of the code
yet and that I agree it would be much nicer to have a gmetric-like
behavior,

On Sun, Feb 1, 2009 at 12:21 AM, David Stainton dstainton...@gmail.com wrote:

 I like using gmetric to monitor... so I wrote gmetric-daemon which
 is my attempt at a forking standalone daemon
 which runs Python metric modules and calls gmetric for each metric...

in a previous email you call for the most scalable, most correct and
most reliable/highly available design, which is certainly a valuable
goal, but one that I don't see met by this proposal. As far as I
understand gmetric, a gmetric-daemon would defeat caching and directives
like threshold and timeout, which are very important at least as far as
scalability goes. Furthermore, as long as there are built-in plugins with
collection groups and so on, a third party daemon sounds like the wrong
approach to me. So as much easier as it might be at first, I believe
that the most scalable, most correct and most reliable design is the
one Brad proposes, with the caveat that figuring it all out will take
more time.
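
To illustrate why bypassing the threshold directives matters for
scalability, here is a minimal sketch of the kind of send decision that
gmond's value_threshold and time_threshold settings imply. It is not
gmond's code: the class name, method names and exact comparison rules
are assumptions made for illustration.

```python
import time


class SendDecision:
    """Decide whether a metric sample should be announced.

    Loosely mirrors the idea behind the value_threshold and
    time_threshold directives; names and rules here are hypothetical.
    """

    def __init__(self, value_threshold, time_threshold):
        self.value_threshold = value_threshold  # minimum change worth sending
        self.time_threshold = time_threshold    # max seconds between sends
        self.last_value = None
        self.last_sent = None

    def should_send(self, value, now=None):
        now = time.time() if now is None else now
        if self.last_sent is None:
            send = True                          # first sample always goes out
        elif now - self.last_sent >= self.time_threshold:
            send = True                          # too long since the last send
        else:
            send = abs(value - self.last_value) >= self.value_threshold
        if send:
            self.last_value, self.last_sent = value, now
        return send
```

A daemon that shells out to gmetric for every sample effectively answers
"send" each time, so the network traffic grows with the sampling rate
rather than with how often values actually change.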

 I wanted a slightly different multithreaded approach to monitoring...
 but it turns out
 that Python threads really suck.

care to share in which way python threads really suck?

 So I made this a forking daemon.
 One process per module. Not very memory efficient. But then I don't
 expect to need many modules...

*I* don't? what if somebody else does? what if you do tomorrow/at
another job? I don't see how you'd fix something like that at a later
stage without having to throw everything away. And how does this meet
the most scalable design goal?

Don't get me wrong, I'm sure everybody agrees on the problems and
appreciates the effort. I'm merely pointing out that from my
perspective this proposal doesn't meet the design goals and is
unlikely to get traction upstream or in the HPC community, even though it
might be just perfect for you and other people. And just in case, I've
no affiliation with ganglia and these are my own opinions, maybe
upstream folks have completely different thoughts.

time and skills permitting I'd be happy to help out with improving the
python interface especially since it's something we'd like to heavily
leverage at work.

thanks

-- 
Behind every great man there's a great backpack - B.



Re: [Ganglia-developers] gmetad protocol and propagating errors back to the client

2009-01-23 Thread Spike Spiegel
On Thu, Jan 22, 2009 at 6:55 PM, Carlo Marcelo Arenas Belon
care...@sajinet.com.pe wrote:
 the interactive port was designed to mimic the behaviour from the
 original gmetad port which always returns the whole tree.

why's that? if I wanted the whole tree I'd query the non interactive
port, instead I'm asking for specific metrics so I should get them or
nothing (or an error). Falling back to whole tree doesn't sound
correct to me.

 if your concern is about returning too much data and the request was
 missing, it might be better then to return no tree information (which
 should be also valid)

I'm not sure what you mean here with no tree information. Would the
DTD + grid tag count as such?

I see 2 cases:
1) bad request
2) some/all of the items do not exist

1) happens before root_report_start is run, so we could easily return
nothing or call root_report_start and root_report_end before closing the fd
2) happens after root_report_start has run, so we could add each
found metric and nothing for the non-existing ones, and then call
root_report_end

doing that in both cases you get valid xml with at worst a GRID tag
that doesn't contain anything or contains multiple cluster tags for
each requested metric and the non-existing ones missing, which should
be enough of a hint to the client that they don't exist.
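
A minimal Python sketch of the two cases above, assuming a simplified
request format of exactly cluster/host/metric and an in-memory dict in
place of gmetad's real data structures. The function name and the data
layout are hypothetical; only the GRID, CLUSTER, HOST and METRIC tag
names come from ganglia's XML.

```python
from xml.sax.saxutils import quoteattr


def build_response(paths, tree):
    """Return well-formed XML for the requested paths.

    `tree` maps cluster -> {host -> {metric -> value}} (hypothetical
    layout). Bad requests and unknown items simply add nothing.
    """
    out = ['<GANGLIA_XML>', '<GRID NAME="sketch">']    # "root_report_start"
    for path in paths:
        parts = [p for p in path.split('/') if p]
        if len(parts) != 3:
            continue                    # case 1: bad request, emit nothing
        cluster, host, metric = parts
        value = tree.get(cluster, {}).get(host, {}).get(metric)
        if value is None:
            continue                    # case 2: item does not exist, skip it
        out.append(
            '<CLUSTER NAME=%s><HOST NAME=%s><METRIC NAME=%s VAL=%s/>'
            '</HOST></CLUSTER>' % (quoteattr(cluster), quoteattr(host),
                                   quoteattr(metric), quoteattr(str(value))))
    out += ['</GRID>', '</GANGLIA_XML>']               # "root_report_end"
    return '\n'.join(out)
```

Either way the client can parse the reply, and an absent CLUSTER entry
is the only signal that a requested item was not found.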

would that do?

-- 
Behind every great man there's a great backpack - B.



Re: [Ganglia-developers] CVE

2009-01-23 Thread Spike Spiegel
On Fri, Jan 23, 2009 at 11:52 PM, Brad Nicholes bnicho...@novell.com wrote:

  * http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2009-0242

 Ganglia 3.1.1 allows remote attackers to cause a denial of service via
 a request to the gmetad service with a path that does not exist, which causes
 Ganglia to (1) perform excessive CPU computation and (2) send the entire
 tree, which consumes network bandwidth.

 this one is IMHO invalid as the CPU and bandwidth costs for this in the
 current code are constant and the wording quoted was most likely taken
 out of context as it referred originally to a contribution proposal
 which has not been yet committed.


agreed, all the advisories I've seen around have misquoted my original
report and missed the link to the feature proposal. As it stands this
CVE is invalid.


 Are we finished hashing this whole patch out yet?  Are we ready to apply the 
 current patch to 3.1.2 and release or is there still more discussion going on?

as far as I'm concerned #223 is resolved and good to go.

thanks everybody.

-- 
Behind every great man there's a great backpack - B.



Re: [Ganglia-developers] Possible REST interface to the interactive port?

2009-01-21 Thread Spike Spiegel
On Wed, Jan 21, 2009 at 2:52 AM, Brad Nicholes bnicho...@novell.com wrote:
 Yep, I was also thinking that a RESTful output module for gmetad-python would 
 probably be the easiest solution

I haven't used gmetad-python yet so one concern would be performance
and how it'd behave having to aggregate and serve a lot of
data/requests. Another question is how different/harder/easier
it would be to scale a RESTful service in gmetad versus say a
standalone django/pylons app. Plus it would be nice if you could
request a time range or range of values instead of just current, which
would require some kind of storage and leads me to what I was playing
with: use memcache to store the last n values using
hash(hostname+metric) as key and take advantage of expiration to clean
up old stuff. At this point you can easily put together a fairly
standard web service that can return last or even last-n values
without adding complexity to ganglia. You could make it even smarter
and make it rrd aware so that if you want older data it can be fetched
from there, and you could add support for a freshness check so it
pings gmetad to request last reading's timestamp and use that to
validate data read from memcache, but anyway let's keep it simple for
now.
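
A rough sketch of the last-n idea, with a plain dict standing in for
memcache so it runs without a server. Everything here (class name,
method names, the "/" separator in the key, the default sizes) is an
assumption for illustration; with a real memcache client the expiry
would be handled by the cache itself.

```python
import hashlib
import time
from collections import deque


class LastNStore:
    """Keep the last n readings per (hostname, metric), with expiry."""

    def __init__(self, n=10, ttl=3600):
        self.n, self.ttl = n, ttl
        self._data = {}                      # key -> (expires_at, deque)

    @staticmethod
    def key(hostname, metric):
        # The separator avoids "a"+"bc" colliding with "ab"+"c".
        raw = (hostname + "/" + metric).encode()
        return hashlib.md5(raw).hexdigest()

    def add(self, hostname, metric, value, now=None):
        now = time.time() if now is None else now
        k = self.key(hostname, metric)
        entry = self._data.get(k)
        if entry is None or now >= entry[0]:
            values = deque(maxlen=self.n)    # missing or expired: start over
        else:
            values = entry[1]
        values.append((now, value))          # deque drops the oldest reading
        self._data[k] = (now + self.ttl, values)

    def last(self, hostname, metric, count=1, now=None):
        now = time.time() if now is None else now
        entry = self._data.get(self.key(hostname, metric))
        if entry is None or now >= entry[0]:
            return []                        # never stored, or expired
        return list(entry[1])[-count:]
```

A web service on top of this only needs to call last() with the
requested count; anything older than n readings would have to come from
the rrds.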

-- 
Behind every great man there's a great backpack - B.



[Ganglia-developers] gmetad protocol and propagating errors back to the client

2009-01-21 Thread Spike Spiegel
Hi,

right now when gmetad fails an error is logged and in some cases the
connection to the client is interrupted, returning invalid XML, while in
other cases (item not found or broken request) the entire tree is returned.
This imho is bad behavior and code should be added to inform the
client of the error, but before that's possible it needs to be agreed
how this communication should happen. I'm not really fond of XML or
ganglia's code, but I'd guess adding an ERROR element to the DTD is
possibly a solution. At that point whenever there's an error
root_report_start() should be called at the very least and an error
element added inside. This should also work nicely for the multi-item
per request patch I proposed elsewhere [1] as you'd have an error per
requested element.
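
As a purely illustrative sketch of what such a reply could look like,
the snippet below builds a report with one ERROR child per failed item.
The ERROR element and its PATH and REASON attributes are hypothetical:
nothing like them is defined in the ganglia DTD, and the function name
is made up.

```python
import xml.etree.ElementTree as ET


def error_report(errors, grid_name="unspecified"):
    """Build a report whose GRID carries one ERROR child per failed item.

    `errors` is a list of (path, reason) pairs. The ERROR element and
    its attributes are assumptions, not part of the ganglia DTD.
    """
    root = ET.Element("GANGLIA_XML")
    grid = ET.SubElement(root, "GRID", NAME=grid_name)
    for path, reason in errors:
        ET.SubElement(grid, "ERROR", PATH=path, REASON=reason)
    return ET.tostring(root, encoding="unicode")
```

With one ERROR per requested item, a client asking for several items in
one request can tell exactly which ones failed and why.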

If anybody is willing to lend a hand to kickstart the XML definition
(or whatever approach is best) I'd be glad to work on the rest.

thanks

-- 
Behind every great man there's a great backpack - B.



Re: [Ganglia-developers] patches for: [Sec] Gmetad server BoF and network overload + [Feature] multiple requests per conn on interactive port

2009-01-18 Thread Spike Spiegel
On Sun, Jan 18, 2009 at 7:35 PM, Carlo Marcelo Arenas Belon
care...@sajinet.com.pe wrote:
 other than that looks good to me.

 could you check the simplified one?, this problem was introduced in
 2003 and therefore affects all versions of ganglia since then (including
 2.5.7 which is not supported anymore and that will need to be patched by
 the users of it which include Debian/Ubuntu, Novell/OpenSuSE and
 probably others).

apologies but I lost you there, what do you mean with the simplified one?


 Two things:
 1) How has this been tested? I did some myself and got to wonder how
 you guys did it, do you have any standardized approach?

 sadly there is no test suite associated with ganglia code and therefore
 there is no standardized approach other than applying the patch and
 banging the resulting binary to see if it works reliably.

alright, I was thinking of a couple of scripts to generate traffic and
then do the queries, I think Jesse mentioned something like that on
irc based on gmetric. I believe something like that would be useful,
and either python or perl could be enough to write something threaded
to generate enough load for testing I guess. Is that what you meant
when you said banging the resulting binary?
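
A minimal sketch of the query side of such a script, assuming the
service accepts one request line per connection and closes the socket
when it is done replying. The function names, worker counts and
timeout are hypothetical, and this only covers the queries, not the
gmetric traffic generation.

```python
import socket
import threading


def query(host, port, path, timeout=5.0):
    """Send one request line and return the number of bytes read back."""
    total = 0
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(path.encode() + b"\n")
        while True:
            chunk = conn.recv(65536)
            if not chunk:                    # server closed the connection
                break
            total += len(chunk)
    return total


def hammer(host, port, paths, workers=8, rounds=10):
    """Run `workers` threads, each issuing every path `rounds` times."""
    results, lock = [], threading.Lock()

    def worker():
        for _ in range(rounds):
            for path in paths:
                try:
                    size = query(host, port, path)
                except OSError:
                    size = -1                # record failures, keep going
                with lock:
                    results.append((path, size))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Something like hammer("127.0.0.1", 8652, ["/", "/nonexistent"]) would
then give a list of reply sizes to compare between a patched and an
unpatched binary.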

-- 
Behind every great man there's a great backpack - B.



Re: [Ganglia-developers] patches for: [Sec] Gmetad server BoF and network overload + [Feature] multiple requests per conn on interactive port

2009-01-18 Thread Spike Spiegel
On Mon, Jan 19, 2009 at 5:44 AM, Carlo Marcelo Arenas Belon
care...@sajinet.com.pe wrote:

 agree, but that is to be done in the context of getting multi-patch
 committed and backported, but not in fixing this buffer overflow in the
 interactive port, which is what BUG223 is about.

ok, guess I'll start a different thread about this later on once we've
worked out #223

 from what I check while trying
 some fuzzing we have still a problem (probably introduced with the
 buffer overflow patch) when the request is too long (over 2048 bytes) as
 shown by :

  $ echo "/$(python -c "print('%s/%s/%s' % ('a'*1700,'b'*300,'c'*48))")" | netcat 127.0.0.1 8652

what problem are you seeing? trunk (r1950) does not reflect what we're
talking about as it includes my original return 1 if element is not
found which leads to the truncated xml output. Reverting to 1233 and
applying the latest patch from #223 works fine for me and I get back
the entire tree as there's no a*1700 grid.

-- 
Behind every great man there's a great backpack - B.



Re: [Ganglia-developers] Possible REST interface to the interactive port?

2009-01-17 Thread Spike Spiegel
Hi,

On Sat, Jan 17, 2009 at 5:04 AM, john allspaw jalls...@yahoo.com wrote:

 Hey all -

 Wondering if there's ever been any talk about serving up the interactive port 
 info via REST?

I am kinda working on this already although not in the form of a
ganglia patch, but as an external application that pulls data out of
ganglia. The reason for this being that I don't want to be dependent
on ganglia and that it's easier to aggregate other sources of
information not to mention development time since I can use python,
but this is more of a personal choice since I'm not fluent in C.

 http://gmetad.hostname:8652/WWW/www1.flickr.mud.yahoo.com/apache_procs_busy/

 (and all of the other stuff you can get from the interactive port)

 I'd bet that all of the requests to bolt-on alerting mechanisms would go away 
 if other alerting/escalation tools could get the real stuff out of ganglia, 
 too. :)

this is the reason why I offered that multi-item patch, so that I could
write smarter monitoring checks able to account for complex scenarios
(depending on environment apache_proc_busy itself is much less
relevant than apache_proc_busy + incoming_connections +
database_connections)
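
For example, a check that looks at several metrics together might be
sketched as follows. The metric names follow the message above; the
thresholds, the function name and the OK/WARNING/CRITICAL policy are
made up for illustration.

```python
def check_apache(metrics, busy_max=200, conn_max=1000, db_max=150):
    """Return (status, names) from several metrics read together.

    `metrics` maps metric name -> current value. All limits and the
    escalation policy are hypothetical.
    """
    limits = {
        "apache_proc_busy": busy_max,
        "incoming_connections": conn_max,
        "database_connections": db_max,
    }
    missing = [name for name in limits if name not in metrics]
    if missing:
        return "UNKNOWN", missing            # cannot judge without all inputs
    over = [name for name, limit in limits.items() if metrics[name] > limit]
    if len(over) == len(limits):
        return "CRITICAL", over              # everything is over its limit
    if over:
        return "WARNING", over               # only part of the picture is bad
    return "OK", []
```

Fetching the three values in one request keeps them from the same
reporting interval, which is what makes a combined decision like this
meaningful.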

 Thoughts?

my main worry is ganglia getting too complicated and offering
something that is not entirely related. This code would end up in
gmetad making the server more complex and prone to errors and possibly
harming data aggregation since I guess it'd be running in another
thread. I haven't thought this through, but one idea I considered was
to employ another host to run gmetad-python, which would make it
easier to create a REST interface or even a different backend
engine to, say, store data into a database which you would then build
your REST service on top of. That said I appreciate the benefits of a
built-in interface, the speed benefits and the reduced number of
dependencies on other components.

thanks for bringing this up, definitely interesting topic

-- 
Behind every great man there's a great backpack - B.



Re: [Ganglia-developers] patches for: [Sec] Gmetad server BoF and network overload + [Feature] multiple requests per conn on interactive port

2009-01-15 Thread Spike Spiegel
On Fri, Jan 16, 2009 at 7:04 AM, Kostas Georgiou
k.georg...@imperial.ac.uk wrote:
 On Thu, Jan 15, 2009 at 01:41:53PM -0700, Brad Nicholes wrote:

  On 1/15/2009 at 8:56 AM, in message
 496efa2a02ac0003a...@lucius.provo.novell.com, Brad Nicholes
 bnicho...@novell.com wrote:

 After taking a little closer look at the patch, I think we are OK as
 far as the recursive call to process_path() is concerned since this
 case is an error condition and should stop processing rather than
 continuing in the recursive loop.

indeed, this should work just fine.

  The other two concerns are still
 there however.  I still think that we are off-by-one in the malloc
 call.  It should be len+1 and I still think that we should limit the
 malloc to 256 rather than allowing it to be unlimited.

 I agree about the off-by-one

argh, my bad sorry, double dumb since I even considered the case.
len+1 it is and the comment should go, thanks.

 but I am not too worried about a malloc
 limit, from what I can tell it can only get as high as REQUESTLEN.

I agree with Kostas, as I wrote in my initial email I didn't worry
about that because of the REQUESTLEN boundary which is enforced in
readline.

as to limiting the path to 256 I actually did that in my first
implementation, but eventually converted to a malloc solution because
I was reminded that 640 KB ought to be enough for everybody and I
could see no downsides.


 The malloc call needs to be checked for NULL and the comment that
 "The recursive structure doesn't require any memory allocations" is
 false now if malloc replaces the stack allocation.

correct

thanks everybody
