I'm trying to figure out a very puzzling issue in our Ganglia system.
We are using Ganglia 3.1.1. The strange thing is that some metrics always have
a TN that is far too large.
The system is configured as follows:
- We have gmond deployed on 16 nodes (A-001 --> A-008, B-001 --> B-008).
- gmond is configured to use multicast mode, so each node has all metrics.
The issues are:
- TN is OK on some nodes, while other nodes show the problem.
- For a given host, some metrics have a TN that is too large, while other
metrics of the same host are OK.
We suspect the kernel may be dropping these packets. You can see the detailed
analysis at the end.
I found a thread on the mailing list that may relate to this:
http://www.mail-archive.com/[email protected]/msg02942.html
MonAMI also has a page that might relate to this:
http://monami.sourceforge.net/tutorial/ar01s06.html
In the section on preventing metric-update loss, it says:
The current Ganglia architecture requires each metric update be sent as an
individual metric-update message. On a moderate-to-heavily loaded machine,
there is a chance that gmond may not be scheduled to run as the messages
arrive. If this happens, the incoming messages will be placed within the
network buffer. Once the buffer is full, any subsequent metric-update messages
will be lost. This places a limit on how many metric update messages can be sent
in one go. For 2.4-series Linux kernels the limit is around 220 metric-update
messages; for 2.6-series kernels, the limit is around 400.
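If that is what is happening here, the kernel's global UDP drop counters should
be climbing while gmond receives. Below is a minimal sketch (not part of
Ganglia, just a helper I would use) that polls /proc/net/snmp every few
seconds; note the RcvbufErrors field only exists on newer kernels, and on older
ones buffer drops are folded into InErrors.

# check_udp_drops.py - watch the kernel's global UDP counters; a rising
# InErrors / RcvbufErrors delta while gmond is receiving points at
# receive-buffer overflow.
import time

def udp_counters():
    with open("/proc/net/snmp") as f:
        rows = [line.split() for line in f if line.startswith("Udp:")]
    # The first "Udp:" row is the header, the second row holds the values.
    return dict(zip(rows[0][1:], [int(v) for v in rows[1][1:]]))

if __name__ == "__main__":
    prev = udp_counters()
    while True:
        time.sleep(5)
        cur = udp_counters()
        print("delta InDatagrams=%d InErrors=%d RcvbufErrors=%d" % (
            cur.get("InDatagrams", 0) - prev.get("InDatagrams", 0),
            cur.get("InErrors", 0) - prev.get("InErrors", 0),
            cur.get("RcvbufErrors", 0) - prev.get("RcvbufErrors", 0)))
        prev = cur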
However, we are still confused by the symptoms:
- We do not see much data queued in the Recv-Q of port 8649 (see the sketch
after this list), and our nodes are not heavily loaded.
- Why are all the core metrics received and up to date, while almost all the
custom Python metrics are lost and their TN grows too large?
- Why do some nodes always have outdated custom Python metrics, while other
nodes are OK?
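A one-off netstat can easily miss a short burst, so the rough sketch below
samples the per-socket rx_queue of every UDP socket bound to 8649 straight from
/proc/net/udp (assuming the usual Linux layout; the trailing drops column only
exists on newer 2.6 kernels).

# watch_recvq.py - sample rx_queue (and, where available, the drops column)
# for every UDP socket bound to the gmond port 8649.
import time

PORT = 8649

def sockets_on_port(port):
    rows = []
    with open("/proc/net/udp") as f:
        f.readline()                                   # skip the header line
        for line in f:
            fields = line.split()
            local = fields[1]                          # e.g. "00000000:21C9"
            if int(local.split(":")[1], 16) != port:
                continue
            rx_queue = int(fields[4].split(":")[1], 16)  # "tx_queue:rx_queue"
            drops = fields[12] if len(fields) > 12 else "n/a"
            rows.append((local, rx_queue, drops))
    return rows

if __name__ == "__main__":
    while True:
        for local, rx_queue, drops in sockets_on_port(PORT):
            print("local=%s rx_queue=%d drops=%s" % (local, rx_queue, drops))
        time.sleep(2)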
I've been scratching my head over this for almost a week now; I've searched the
Ganglia mailing list archives but cannot find more information.
Any help/suggestions/advice would be very much appreciated -- it's really very
frustrating!
Below is the detailed analysis
-----Detailed analysis----
Here is part of the XML output from A-001 (telnet localhost 8649):
<HOST NAME="B-002" IP="X.X.X.119" REPORTED="1245822864" TN="9" TMAX="20"
DMAX="0" LOCATION="" GMOND_STARTED="1245710345">
<METRIC NAME="proc_run" VAL="0" TYPE="uint32" UNITS=" " TN="45" TMAX="950"
DMAX="0" SLOPE="both">
<EXTRA_DATA>
<EXTRA_ELEMENT NAME="GROUP" VAL="process"/>
</EXTRA_DATA>
</METRIC>
<METRIC NAME="load_five" VAL="1.13" TYPE="float" UNITS=" " TN="7" TMAX="325"
DMAX="0" SLOPE="both">
<EXTRA_DATA>
<EXTRA_ELEMENT NAME="GROUP" VAL="load"/>
</EXTRA_DATA>
</METRIC>
....
<METRIC NAME="WritesPerSec" VAL="0.00" TYPE="float" UNITS="" TN="112225"
TMAX="60" DMAX="0" SLOPE="both">
<EXTRA_ELEMENT NAME="GROUP" VAL="Status"/>
</EXTRA_DATA>
</METRIC>
<METRIC NAME="db_used" VAL="20233" TYPE="uint32" UNITS="" TN="112225" TMAX="60"
DMAX="0" SLOPE="both">
<EXTRA_DATA>
<EXTRA_ELEMENT NAME="GROUP" VAL="Status"/>
</EXTRA_DATA>
</METRIC>
....
From the XML, we can see that gmond gets heartbeat info from B-002. The TN of
all the core metrics collected by gmond itself (e.g. proc_run, load_five) is
OK, while the TN of most metrics collected by our Python module extension
(e.g. WritesPerSec, db_used) is large (TN="112225").
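As a side note, a small script along these lines (just a sketch, assuming the
stock gmond XML on port 8649 and nothing beyond the Python standard library)
makes it easier to list every metric whose TN already exceeds its TMAX, i.e.
the metrics that have gone stale, across all hosts at once.

# stale_metrics.py - fetch the gmond XML from port 8649 and dump every metric
# whose TN exceeds its TMAX.
import socket
import xml.etree.ElementTree as ET

HOST, PORT = "localhost", 8649

def fetch_xml(host, port):
    sock = socket.create_connection((host, port))
    chunks = []
    while True:
        data = sock.recv(8192)
        if not data:
            break
        chunks.append(data)
    sock.close()
    return b"".join(chunks)

if __name__ == "__main__":
    root = ET.fromstring(fetch_xml(HOST, PORT))
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            tn, tmax = int(metric.get("TN")), int(metric.get("TMAX"))
            if tn > tmax:
                print("%s %s TN=%d TMAX=%d" % (
                    host.get("NAME"), metric.get("NAME"), tn, tmax))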
We ran tcpdump on B-002 and found that B-002 sends out all the metrics to the
multicast address (X.X.X.119 --> 239.X.X.110:8649).
On A-001, we found that A-001 receives all the multicast messages accordingly
(we see the same X.X.X.119 --> 239.X.X.110:8649 packets in tcpdump).
This means the multicast messages do reach A-001.
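To separate gmond's own socket handling from the network path, one option is to
join the same multicast group from an independent test socket on A-001 and
count what actually reaches userspace. A rough sketch follows; the group and
interface addresses are placeholders for the anonymized 239.X.X.110 group, and
sharing port 8649 only works if gmond's socket also allows address reuse
(otherwise run it with gmond stopped).

# mcast_listen.py - join the gmond multicast group on an extra socket and
# count the datagrams that reach userspace, independently of gmond.
import socket
import struct

MCAST_GROUP = "239.2.11.110"   # placeholder for the real 239.X.X.110 group
MCAST_PORT = 8649
IFACE_ADDR = "0.0.0.0"         # or the receiving node's own IP address

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", MCAST_PORT))

# Join the multicast group on the chosen interface (struct ip_mreq).
mreq = struct.pack("4s4s",
                   socket.inet_aton(MCAST_GROUP),
                   socket.inet_aton(IFACE_ADDR))
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

count = 0
while True:
    data, addr = sock.recvfrom(1472)
    count += 1
    print("#%d  %d bytes from %s" % (count, len(data), addr[0]))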
We then used strace to trace gmond and found the following.
On B-002, gmond sends out all the core metric values and receives them back
accordingly, something like this:
17190 13:35:45.188240 write(6,
"\0\0\0\204\0\0\0\5B-002\0\0\0\0\0\0\10location\0\0\0"..., 48) = 48
17190 13:35:45.188764 recvfrom(4,
"\0\0\0\205\0\0\0\5B-002\0\0\0\0\0\0\10location\0\0\0\0"..., 1472, 0,
{sa_family=AF_INET, sin_port=htons(56717), sin_addr=inet_addr("X.X.X.119")},
[16]) = 48
However, almost all the Python custom metrics are lost.
We see gmond send them like this:
17910 13:36:44.051205 write(6,
"\0\0\0\200\0\0\0\5B-002\0\0\0\0\0\0\fWritesPerSec"..., 200) = 200...
but they are not received back. Sometimes only about 2 custom metrics are
received and the following 20 custom metrics never arrive.
Looking at the gmond code, we see that write() is called by
Ganglia_udp_send_message(), and lsof -p <gmond-pid> shows that fd 6 is:
gmond 17190 root 6u IPv4 UDP B-002:56717->239.X.X.110:8649
recvfrom() is called by process_udp_recv_channel(), and fd 4 is:
gmond 17190 root 4u IPv4 UDP 239.X.X.110:8649
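Since the strace output shows the raw XDR payload, decoding the first few
fields of each captured datagram tells you exactly which metric names reach the
receive socket. Below is a sketch using Python's xdrlib, assuming the Ganglia
3.1 wire format in which each packet starts with a packet-type integer followed
by the host name and metric name strings (which matches the bytes visible in
the strace output above).

# decode_gmond_packet.py - decode the leading fields of a Ganglia 3.1 XDR
# packet: packet type, host name, metric name. Feed it raw UDP payloads,
# e.g. captured with the listener sketch above or extracted from tcpdump -w.
import xdrlib

def decode_header(payload):
    u = xdrlib.Unpacker(payload)
    packet_type = u.unpack_uint()   # 128 = metadata, 128+n = value packets
    host = u.unpack_string()
    metric = u.unpack_string()
    return packet_type, host, metric

if __name__ == "__main__":
    # Build an illustrative payload only; a real capture would be used here.
    p = xdrlib.Packer()
    p.pack_uint(133)                # example type (the strace shows \205 = 133)
    p.pack_string(b"B-002")
    p.pack_string(b"WritesPerSec")
    print(decode_header(p.get_buffer()))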
On node A-001, we find that A-001 receives all the core metric values but loses
all the Python custom metrics.
Sometimes tcpdump on A-001 also reports that some packets were dropped by the
kernel.
We used the same method on B-001 and it runs well; the TN of all its metrics is
OK.
Thanks a lot for your time and input.
Regards,
fan