>>> On 11/18/2009 at 8:19 AM, in message
<20091118151950.ga13...@porcupine.cita.utoronto.ca>, Robin Humble
<robin.humble+gang...@anu.edu.au> wrote:
> Hi Brad,
>
> I appreciate you taking the time to look at the patch.
>
> On Tue, Nov 17, 2009 at 09:54:11AM -0700, Brad Nicholes wrote:
>> On 11/7/2009 at 12:06 AM, in message
>> <20091107070643.ga20...@porcupine.cita.utoronto.ca>, Robin Humble
>> <robin.humble+gang...@anu.edu.au> wrote:
>>> turns out that there's a SPOOF_HOST EXTRA_ELEMENT attached to each
>>> spoof'd metric, and when 100's of hosts (>40 or so should trigger it)
>>> have spoof'd entries, then those add up and then corrupt the summary
>>> Metric structure enough to destroy the .type and stop the rrd being
>>> generated.
>>> I'm guessing it's the same as the MAX_EXTRA_ELEMENTS problem, except
>>> for the summary table instead of the host table.
>>
>>I took a look at this patch and since I am not able to reproduce the
>>problem, it makes it a little unclear as to what is happening. I can't
>>really figure out how this patch fixes a problem with the hash table.
>>According to the source code, whenever an extra element is parsed, the
>>code inserts the extra element into a list of extra data on a per
>>metric basis. This means that only one extra element for a spoof host
>>is ever stored for a metric.
>
> yes, it's the summary table that's the problem, not the host table.
>
>> Then when the code moves into the summary
>>data portion, it specifically checks to make sure that it is not
>>duplicating an extra element value before it inserts it into the
>>summary node (check the for loop at around line #827 in the 3.1.2
>>version of the source code). If it detects a duplicate value, then it
>>skips the insert and just updates the rest of the summary node in the
>>hash table.
>
> in this loop ->
>
>   for (i = 0; i < sum_metric.ednameslen; i++) {
>       char *chk_name = getfield(sum_metric.strings, sum_metric.ednames[i]);
>       char *chk_value = getfield(sum_metric.strings, sum_metric.edvalues[i]);
>
>       if (!strcasecmp(chk_name, new_name) && !strcasecmp(chk_value, new_value)) {
>           found = TRUE;
>           break;
>       }
>   }
>
> here's an example of what happens for a spoof'd metric ->
>
> (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.1.30:v30 new_value 10.1.1.37:v37
> (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.1.31:v31 new_value 10.1.1.37:v37
> (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.1.32:v32 new_value 10.1.1.37:v37
> (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.1.33:v33 new_value 10.1.1.37:v37
> (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.1.34:v34 new_value 10.1.1.37:v37
> (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.1.35:v35 new_value 10.1.1.37:v37
> (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.2.80:v176 new_value 10.1.1.37:v37
> (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.1.36:v36 new_value 10.1.1.37:v37
> (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.2.81:v177 new_value 10.1.1.37:v37
> ...
>
> you can see that every EXTRA_ELEMENT "name" field matches, but as
> each spoof'd entry comes from a different host, then every "value" is
> different, so 'found' is always FALSE.
>
> so a new EXTRA_ELEMENT is always inserted for every spoof'd host.
> ie. for one spoof'd metric on N hosts then there would be N
> EXTRA_ELEMENT's stored next to it in the summary table.
>
> when the number of spoofed hosts is > few * MAX_EXTRA_ELEMENTS, then
> corruption occurs in the summary hash. the upshot of which is that the
> summary table gets corrupted and the checks in gmetad.c mean that
> (unless you get very lucky) the __SummaryInfo__/* rrd file for the
> spoof'd metric is never written.
>
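
For anyone following along, the accumulation described above can be reproduced outside gmetad with a small stand-alone sketch. This is not the gmetad code and not the patch under discussion: demo_summary, demo_add_extra, demo_feed_hosts, DEMO_MAX_EXTRA and DEMO_STRLEN are hypothetical names made up for illustration, and the capacity check is the sketch's own guard rather than a claim about what gmetad or the patch does. It only assumes the duplicate test quoted above (name and value both compared with strcasecmp) and host values shaped like the ones in the debug output.

```c
#include <assert.h>
#include <stdio.h>
#include <strings.h>   /* strcasecmp (POSIX) */

#define DEMO_MAX_EXTRA 8    /* hypothetical capacity, not gmetad's MAX_EXTRA_ELEMENTS */
#define DEMO_STRLEN    32

/* Hypothetical stand-in for the extra-element storage of one summary metric. */
struct demo_summary {
    int  len;
    char names[DEMO_MAX_EXTRA][DEMO_STRLEN];
    char values[DEMO_MAX_EXTRA][DEMO_STRLEN];
};

/*
 * Same duplicate test as the quoted loop: an entry counts as "found" only
 * when BOTH name and value match. Returns 1 if inserted, 0 if it was a
 * duplicate, -1 if the fixed-size storage is already full.
 */
static int demo_add_extra(struct demo_summary *s, const char *name, const char *value)
{
    int i;

    assert(s != NULL);
    for (i = 0; i < s->len; i++) {
        if (!strcasecmp(s->names[i], name) && !strcasecmp(s->values[i], value))
            return 0;                       /* found: skip the insert */
    }
    if (s->len >= DEMO_MAX_EXTRA)
        return -1;                          /* sketch-only guard against overflow */

    snprintf(s->names[s->len], DEMO_STRLEN, "%s", name);
    snprintf(s->values[s->len], DEMO_STRLEN, "%s", value);
    s->len++;
    return 1;
}

/*
 * Feed one spoofed metric reported by nhosts different hosts. Every host
 * uses the name SPOOF_HOST but a distinct value, so each new host costs
 * one slot. Returns how many inserts were refused because storage was full.
 */
static int demo_feed_hosts(struct demo_summary *s, int nhosts)
{
    int  h, refused = 0;
    char value[DEMO_STRLEN];

    for (h = 0; h < nhosts; h++) {
        snprintf(value, sizeof value, "10.1.1.%d:v%d", h, h);
        if (demo_add_extra(s, "SPOOF_HOST", value) < 0)
            refused++;
    }
    return refused;
}
```

Starting from a zeroed struct demo_summary, demo_feed_hosts(&s, 5) leaves s.len at 5: one slot per host, even though it is a single metric. Without a guard like the one in demo_add_extra, the inserts past the capacity would write beyond the arrays, which is the kind of overrun the explanation above blames for the damaged summary entry.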
Now I get it. I'll take a look at it from that angle.

Brad