The TN value simply indicates the time offset from the reported timestamp since the metric was last received from the managed node; in other words, it is the age of the metric. A large number indicates that the metric value has not been updated for a long period of time. This might be because the reporting interval has been set to a very large value, or it may have something to do with multicast packets being lost in your network.

I would suggest that you run gmond in debug mode ("-d 10") on both the managed node and the collecting node, and then try to correlate when the reporting node sends one of your custom metrics with when (or whether) the collecting node receives it. The most obvious thing this would tell you is whether your multicast packets are being lost or blocked by a router in your network. If the collecting node is actually receiving the packets in a timely manner but the TN value is still large, then we would have to look at a possible bug in gmond. The fact that you seem to be losing only the python metrics suggests that this is either a configuration error or a problem with the metric definition of the custom python metric. Do you have the same problem with any of the standard python metrics that ship with ganglia?
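To make the "age" interpretation concrete, here is a minimal sketch (my own illustration, not gmond's actual implementation) of how a TN value can be read as seconds since the last received update:

```python
import time

def metric_age(last_received_ts, now=None):
    """Seconds since a metric update was last received: a rough model of
    what TN reports. Hypothetical helper, not actual gmond code."""
    if now is None:
        now = time.time()
    return int(now - last_received_ts)

# A metric with TN="112225" (as in the XML quoted below) has not been
# updated in over 31 hours:
print(metric_age(0, now=112225))   # 112225
print(112225 // 3600)              # 31 (whole hours)
```

Against a TMAX of 60, an age of 112225 seconds means roughly 1870 consecutive reporting intervals went missing, which is why a bulk loss (rather than occasional drops) is the thing to look for in the debug output.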
Brad

>>> On 6/24/2009 at 9:39 AM, in message <bay140-w76585dd65f08f17a35d1bb3...@phx.gbl>, liangfan <xfanli...@hotmail.com> wrote:
> I'm trying to figure out a very puzzling issue in our ganglia system.
> We are using ganglia 3.1.1. We are seeing a strange issue where some
> metrics always have a very large TN.
>
> The system is configured as follows:
> - We have gmond deployed on 16 nodes (A-001 --> A-008, B-001 --> B-008).
> - gmond is configured to use multicast mode; each node has all metrics.
>
> The issues are:
> - TN on some nodes is OK, while on others it is wrong.
> - Some metrics of one host have a TN that is too large, while other
>   metrics of the same node are OK.
> We suspect the kernel may be dropping these packets. You can see the
> detailed analysis at the end.
>
> I found a thread on the mailing list that may relate to this:
> http://www.mail-archive.com/ganglia-develop...@lists.sourceforge.net/msg02942.html
>
> MonAMI also has a page that might relate to this:
> http://monami.sourceforge.net/tutorial/ar01s06.html
>
> On preventing metric-update loss, it says:
> "The current Ganglia architecture requires each metric update be sent
> as an individual metric-update message. On a moderate-to-heavily
> loaded machine, there is a chance that gmond may not be scheduled to
> run as the messages arrive. If this happens, the incoming messages
> will be placed within the network buffer. Once the buffer is full,
> any subsequent metric-update messages will be lost. This places a
> limit on how many metric-update messages can be sent in one go. For
> 2.4-series Linux kernels the limit is around 220 metric-update
> messages; for 2.6-series kernels, the limit is around 400."
>
> However, we are still confused by the symptoms:
> - We do not see much buffering in port 8649's Recv-Q, and our nodes
>   are not heavily loaded.
> - Why are all the core metrics received and updated to now, while
>   almost all the custom python metrics are lost and TN gets too large?
> - Why do some nodes always get outdated custom python metrics, while
>   other nodes are OK?
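Regarding the buffer limit quoted above: those per-kernel figures can be sanity-checked with a back-of-the-envelope calculation. The buffer and message sizes below are illustrative assumptions, not measured gmond values:

```python
def messages_before_overflow(rcvbuf_bytes, avg_msg_bytes):
    """Rough upper bound on how many metric-update messages the kernel
    can queue before the UDP receive buffer overflows and subsequent
    packets are silently dropped. Sizes are illustrative assumptions."""
    return rcvbuf_bytes // avg_msg_bytes

# Assuming a 64 KiB socket receive buffer and ~300-byte updates:
print(messages_before_overflow(64 * 1024, 300))   # 218
```

That lands close to the ~220 figure quoted for 2.4-series kernels, which suggests the limit is simply the receive-buffer size divided by the typical message size, so a larger buffer or smaller bursts both raise it.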
> I've been scratching my head over this for almost a week now; I've
> searched the ganglia mailing list archives, but cannot find more info.
>
> Any help/suggestions/advice would be very much appreciated -- it's
> really very frustrating!
>
> Below is the detailed analysis.
>
> ----- Detailed analysis -----
> Here is part of the XML output from A-001 (telnet localhost 8649):
>
> <HOST NAME="B-002" IP="X.X.X.119" REPORTED="1245822864" TN="9" TMAX="20" DMAX="0" LOCATION="" GMOND_STARTED="1245710345">
> <METRIC NAME="proc_run" VAL="0" TYPE="uint32" UNITS=" " TN="45" TMAX="950" DMAX="0" SLOPE="both">
> <EXTRA_DATA>
> <EXTRA_ELEMENT NAME="GROUP" VAL="process"/>
> </EXTRA_DATA>
> </METRIC>
> <METRIC NAME="load_five" VAL="1.13" TYPE="float" UNITS=" " TN="7" TMAX="325" DMAX="0" SLOPE="both">
> <EXTRA_DATA>
> <EXTRA_ELEMENT NAME="GROUP" VAL="load"/>
> </EXTRA_DATA>
> </METRIC>
> ....
> <METRIC NAME="WritesPerSec" VAL="0.00" TYPE="float" UNITS="" TN="112225" TMAX="60" DMAX="0" SLOPE="both">
> <EXTRA_DATA>
> <EXTRA_ELEMENT NAME="GROUP" VAL="Status"/>
> </EXTRA_DATA>
> </METRIC>
> <METRIC NAME="db_used" VAL="20233" TYPE="uint32" UNITS="" TN="112225" TMAX="60" DMAX="0" SLOPE="both">
> <EXTRA_DATA>
> <EXTRA_ELEMENT NAME="GROUP" VAL="Status"/>
> </EXTRA_DATA>
> </METRIC>
> ....
>
> From the XML, we can see that gmond gets heartbeat info from B-002.
> The TN of all the core metrics collected by gmond (e.g. proc_run,
> load_five) is OK, while the TN of most metrics collected by our
> python module extension (e.g. WritesPerSec, db_used) is large
> (TN="112225").
>
> We ran tcpdump on B-002 and found that B-002 sends out all the
> metrics to the multicast address (X.X.X.119 --> 239.X.X.110:8649).
> On A-001, we found that A-001 receives all the multicast messages
> accordingly (we see the same X.X.X.119 --> 239.X.X.110:8649 messages
> in tcpdump).
> This means the multicast messages reach A-001.
>
> We then used strace to trace gmond and found:
> On B-002, gmond sends out all the core metric values and receives
> them back accordingly.
> Something like this:
> 17190 13:35:45.188240 write(6,
>   "\0\0\0\204\0\0\0\5B-002\0\0\0\0\0\0\10location\0\0\0"..., 48) = 48
> 17190 13:35:45.188764 recvfrom(4,
>   "\0\0\0\205\0\0\0\5B-002\0\0\0\0\0\0\10location\0\0\0\0"..., 1472, 0,
>   {sa_family=AF_INET, sin_port=htons(56717),
>   sin_addr=inet_addr("X.X.X119")}, [16]) = 48
> However, we lose almost all the python custom metrics.
> We see gmond send a metric like this:
> 17910 13:36:44.051205 write(6,
>   "\0\0\0\200\0\0\0\5B-002\0\0\0\0\0\0\fWritesPerSec"..., 200) = 200
> ...
> but it does not receive them back. Sometimes it receives only about 2
> custom metrics, and the following 20 custom metrics are not received.
> Looking at the gmond code, we see that write() is called by
> Ganglia_udp_send_message(), and lsof -p <gmond-pid> shows fd 6 is:
> gmond 17190 root 6u IPv4 UDP B-002:56717->239.X.X.110:8649
> recvfrom() is called by process_udp_recv_channel(), and fd 4 is:
> gmond 17190 root 4u IPv4 UDP 239.X.X.110:8649
> On node A-001 we find that A-001 receives all the core metric values
> but loses all the python custom metrics.
>
> Sometimes on A-001, tcpdump also reports that some packets were
> dropped by the kernel.
>
> We used the same method on B-001 and it runs well; all the metric TNs
> are OK.
>
> Thanks a lot for your time and input.
>
> Regards,
> fan

------------------------------------------------------------------------------
_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
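P.S. If the kernel drops seen in tcpdump do turn out to be the culprit, one generic mitigation on Linux (independent of gmond version; the values below are illustrative and should be tuned per site) is to raise the kernel's UDP receive-buffer limits so the socket can absorb larger bursts of metric updates:

```
# /etc/sysctl.conf -- illustrative values, tune for your environment
net.core.rmem_default = 262144
net.core.rmem_max = 524288
```

Apply with "sysctl -p", restart gmond, and re-check whether the custom metrics' TN stays low.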