Hi Brad,

I appreciate you taking the time to look at the patch.

On Tue, Nov 17, 2009 at 09:54:11AM -0700, Brad Nicholes wrote:
> On 11/7/2009 at 12:06 AM, in message 
> <20091107070643.ga20...@porcupine.cita.utoronto.ca>, Robin Humble 
> <robin.humble+gang...@anu.edu.au> wrote:
>> turns out that there's a SPOOF_HOST EXTRA_ELEMENT attached to each
>> spoof'd metric, and when 100's of hosts (>40 or so should trigger it)
>> have spoof'd entries, then those add up and then corrupt the summary
>> Metric structure enough to destroy the .type and stop the rrd being
>> generated.
>> I'm guessing it's the same as the MAX_EXTRA_ELEMENTS problem, except
>> for the summary table instead of the host table.
>I took a look at this patch and since I am not able to reproduce the
>problem, it makes it a little unclear as to what is happening.  I can't
>really figure out how this patch fixes a problem with the hash table. 
>According to the source code, whenever an extra element is parsed, the
>code inserts the extra element into a list of extra data on a per
>metric basis.  This means that only one extra element for a spoof host
>is ever stored for a metric.

yes, it's the summary table that's the problem, not the host table.

> Then when the code moves into the summary
>data portion, it specifically checks to make sure that it is not
>duplicating an extra element value before it inserts it into the
>summary node (check the for loop at around line #827 in the 3.1.2
>version of the source code).  If it detects a duplicate value, then it
>skips the insert and just updates the rest of the summary node in the
>hash table. 

in this loop ->

  for (i = 0; i < sum_metric.ednameslen; i++) {
      char *chk_name = getfield(sum_metric.strings, sum_metric.ednames[i]);
      char *chk_value = getfield(sum_metric.strings, sum_metric.edvalues[i]);
      
      if (!strcasecmp(chk_name, new_name) && !strcasecmp(chk_value, new_value)) {
          found = TRUE;
          break;
      }
  }

here's an example of what happens for a spoof'd metric ->

  (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.1.30:v30 new_value 10.1.1.37:v37
  (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.1.31:v31 new_value 10.1.1.37:v37
  (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.1.32:v32 new_value 10.1.1.37:v37
  (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.1.33:v33 new_value 10.1.1.37:v37
  (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.1.34:v34 new_value 10.1.1.37:v37
  (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.1.35:v35 new_value 10.1.1.37:v37
  (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.2.80:v176 new_value 10.1.1.37:v37
  (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.1.36:v36 new_value 10.1.1.37:v37
  (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.2.81:v177 new_value 10.1.1.37:v37
  ...

you can see that every EXTRA_ELEMENT "name" field matches, but because
each spoof'd entry comes from a different host, every "value" is
different, so 'found' is always FALSE.

so a new EXTRA_ELEMENT is inserted for every spoof'd host.
i.e. for one spoof'd metric on N hosts there would be N
EXTRA_ELEMENTs stored next to it in the summary table.

when the number of spoof'd hosts exceeds a few times
MAX_EXTRA_ELEMENTS, the summary hash gets corrupted, and the checks in
gmetad.c mean that (unless you get very lucky) the __SummaryInfo__/*
rrd file for the spoof'd metric is never written.
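
to make the failure mode concrete, here is a minimal standalone sketch. it is not the gmetad code and not the patch itself: the struct, the sketch_* names and the capacity of 32 are all made up for illustration. it uses the same name-AND-value duplicate test as the loop above, then adds a capacity check before storing, which is one possible way to stop distinct SPOOF_HOST values running past the end of a fixed-size array.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Assumed capacity; stands in for MAX_EXTRA_ELEMENTS, real value not implied. */
#define SKETCH_MAX_EXTRA 32

/* Hypothetical, simplified stand-in for a summary metric's extra data. */
struct sketch_summary {
    const char *names[SKETCH_MAX_EXTRA];
    const char *values[SKETCH_MAX_EXTRA];
    size_t len;       /* elements currently stored */
    size_t dropped;   /* inserts refused because the arrays were full */
};

/* Same test as the quoted loop: a duplicate needs matching name AND value. */
static int sketch_is_duplicate(const struct sketch_summary *s,
                               const char *name, const char *value)
{
    assert(s != NULL);
    for (size_t i = 0; i < s->len; i++) {
        if (!strcasecmp(s->names[i], name) && !strcasecmp(s->values[i], value))
            return 1;
    }
    return 0;
}

/* Store the element unless it is a duplicate; refuse (and count) when full. */
static void sketch_add_extra(struct sketch_summary *s,
                             const char *name, const char *value)
{
    if (sketch_is_duplicate(s, name, value))
        return;
    if (s->len >= SKETCH_MAX_EXTRA) {   /* without this guard we'd write past the end */
        s->dropped++;
        return;
    }
    s->names[s->len] = name;
    s->values[s->len] = value;
    s->len++;
}
```

feeding this 40 SPOOF_HOST elements with 40 different values fills all 32 slots and refuses the remaining 8. take the capacity check out and those 8 would be written past the end of the arrays, which is the kind of overrun i believe is clobbering the summary Metric.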

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility

>Since I am not able to duplicate the problem, could you
>step further through the original source code to make sure that the
>check for a duplicate value is actually happening and that the code is
>not taking some other path that could be causing the problem.

>You might also want to check in the source code at the point where the
>summary table is actually written to see if there is some clue there
>why your summary rrd files are not being created or updated.
>
>Brad
>
>

_______________________________________________
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers
