>>> On 11/18/2009 at 8:19 AM, in message
<20091118151950.ga13...@porcupine.cita.utoronto.ca>, Robin Humble
<robin.humble+gang...@anu.edu.au> wrote:
> Hi Brad,
> 
> I appreciate you taking the time to look at the patch.
> 
> On Tue, Nov 17, 2009 at 09:54:11AM -0700, Brad Nicholes wrote:
>> On 11/7/2009 at 12:06 AM, in message
>> <20091107070643.ga20...@porcupine.cita.utoronto.ca>, Robin Humble
>> <robin.humble+gang...@anu.edu.au> wrote:
>>> it turns out that there's a SPOOF_HOST EXTRA_ELEMENT attached to each
>>> spoof'd metric. when 100's of hosts (>40 or so should trigger it)
>>> have spoof'd entries, those elements add up and corrupt the summary
>>> Metric structure enough to destroy its .type field and stop the rrd
>>> from being generated.
>>> I'm guessing it's the same as the MAX_EXTRA_ELEMENTS problem, except
>>> in the summary table instead of the host table.
>>I took a look at this patch, but since I am not able to reproduce the
>>problem, it is a little unclear to me what is happening.  I can't
>>really figure out how this patch fixes a problem with the hash table.
>>According to the source code, whenever an extra element is parsed, the
>>code inserts it into a per-metric list of extra data.  This means that
>>only one extra element for a spoof host is ever stored for a metric.
> 
> yes, it's the summary table that's the problem, not the host table.
> 
>> Then when the code moves into the summary
>>data portion, it specifically checks to make sure that it is not
>>duplicating an extra element value before it inserts it into the
>>summary node (check the for loop at around line #827 in the 3.1.2
>>version of the source code).  If it detects a duplicate value, then it
>>skips the insert and just updates the rest of the summary node in the
>>hash table. 
> 
> in this loop ->
> 
>   for (i = 0; i < sum_metric.ednameslen; i++) {
>       char *chk_name = getfield(sum_metric.strings, sum_metric.ednames[i]);
>       char *chk_value = getfield(sum_metric.strings, sum_metric.edvalues[i]);
>
>       if (!strcasecmp(chk_name, new_name) && !strcasecmp(chk_value, new_value)) {
>           found = TRUE;
>           break;
>       }
>   }
> 
> here's an example of what happens for a spoof'd metric ->
> 
>   (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.1.30:v30 new_value 10.1.1.37:v37
>   (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.1.31:v31 new_value 10.1.1.37:v37
>   (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.1.32:v32 new_value 10.1.1.37:v37
>   (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.1.33:v33 new_value 10.1.1.37:v37
>   (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.1.34:v34 new_value 10.1.1.37:v37
>   (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.1.35:v35 new_value 10.1.1.37:v37
>   (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.2.80:v176 new_value 10.1.1.37:v37
>   (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.1.36:v36 new_value 10.1.1.37:v37
>   (chk_name == new_name) 1 && (chk_value == new_value) 0 ==> 0 - chk_name SPOOF_HOST new_name SPOOF_HOST chk_value 10.1.2.81:v177 new_value 10.1.1.37:v37
>   ...
> 
> you can see that every EXTRA_ELEMENT "name" field matches, but because
> each spoof'd entry comes from a different host, every "value" is
> different, so 'found' is always FALSE.
> 
> so a new EXTRA_ELEMENT is always inserted for every spoof'd host,
> ie. one spoof'd metric on N hosts ends up with N EXTRA_ELEMENTs
> stored next to it in the summary table.
> 
> when the number of spoofed hosts exceeds a few times
> MAX_EXTRA_ELEMENTS, the summary hash entry gets corrupted, and the
> checks in gmetad.c then mean that (unless you get very lucky) the
> __SummaryInfo__/* rrd file for the spoof'd metric is never written.
> 

Now I get it.  I'll take a look at it from that angle.

Brad

_______________________________________________
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers
