I'm trying to figure out a very puzzling issue in our Ganglia system.
We are using Ganglia 3.1.1. The strange thing is that some metrics always have
a TN that is far too large.
The system is configured as follows:
- We have gmond deployed on 16 nodes (A-001 --> A-008, B-001 --> B-008).
- gmond is configured to use multicast mode, so each node has all metrics.
The issues are:
- TN is OK on some nodes, while other nodes show the problem.
- For a given host, some metrics have a TN that is too large, while other
metrics of the same host are OK.
We suspect the kernel may be dropping these packets. You can see the detailed
analysis at the end.
I found a thread on the mailing list that may relate to this:
http://www.mail-archive.com/[email protected]/msg02942.html
MonAMI also has a page that might relate to this:
http://monami.sourceforge.net/tutorial/ar01s06.html
In the section on preventing metric-update loss, it says:
The current Ganglia architecture requires each metric update be sent as an
individual metric-update message. On a moderate-to-heavily loaded machine,
there is a chance that gmond may not be scheduled to run as the messages
arrive. If this happens, the incoming messages will be placed within the
network buffer. Once the buffer is full, any subsequent metric-update messages
will be lost. This places a limit on how many metric update messages can be sent
in one go. For 2.4-series Linux kernels the limit is around 220 metric-update
messages; for 2.6-series kernels, the limit is around 400.
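If that is what is happening here, the kernel's global UDP drop counters should
be climbing while gmond receives. Below is a minimal sketch (not part of
Ganglia, just a helper I would use) that polls /proc/net/snmp every few
seconds; note the RcvbufErrors field only exists on newer kernels, and on older
ones buffer drops are folded into InErrors.

# check_udp_drops.py - watch the kernel's global UDP counters; a rising
# InErrors / RcvbufErrors delta while gmond is receiving points at
# receive-buffer overflow.
import time

def udp_counters():
    with open("/proc/net/snmp") as f:
        rows = [line.split() for line in f if line.startswith("Udp:")]
    # The first "Udp:" row is the header, the second row holds the values.
    return dict(zip(rows[0][1:], [int(v) for v in rows[1][1:]]))

if __name__ == "__main__":
    prev = udp_counters()
    while True:
        time.sleep(5)
        cur = udp_counters()
        print("delta InDatagrams=%d InErrors=%d RcvbufErrors=%d" % (
            cur.get("InDatagrams", 0) - prev.get("InDatagrams", 0),
            cur.get("InErrors", 0) - prev.get("InErrors", 0),
            cur.get("RcvbufErrors", 0) - prev.get("RcvbufErrors", 0)))
        prev = cur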
However, we are still confused by the symptoms:
- We do not see much data queued in the Recv-Q of port 8649 (see the sketch
after this list), and our nodes are not heavily loaded.
- Why are all the core metrics received and up to date, while almost all the
custom Python metrics are lost and their TN grows too large?
- Why do some nodes always have outdated custom Python metrics, while other
nodes are OK?
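A one-off netstat can easily miss a short burst, so the rough sketch below
samples the per-socket rx_queue of every UDP socket bound to 8649 straight from
/proc/net/udp (assuming the usual Linux layout; the trailing drops column only
exists on newer 2.6 kernels).

# watch_recvq.py - sample rx_queue (and, where available, the drops column)
# for every UDP socket bound to the gmond port 8649.
import time

PORT = 8649

def sockets_on_port(port):
    rows = []
    with open("/proc/net/udp") as f:
        f.readline()                                   # skip the header line
        for line in f:
            fields = line.split()
            local = fields[1]                          # e.g. "00000000:21C9"
            if int(local.split(":")[1], 16) != port:
                continue
            rx_queue = int(fields[4].split(":")[1], 16)  # "tx_queue:rx_queue"
            drops = fields[12] if len(fields) > 12 else "n/a"
            rows.append((local, rx_queue, drops))
    return rows

if __name__ == "__main__":
    while True:
        for local, rx_queue, drops in sockets_on_port(PORT):
            print("local=%s rx_queue=%d drops=%s" % (local, rx_queue, drops))
        time.sleep(2)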
I've been scratching my head over this for almost a week now; I've searched the
Ganglia mailing list archives but cannot find more information.
Any help/suggestions/advice would be very much appreciated -- it's really very
frustrating!
Below is the detailed analysis
-----Detailed analysis----
Here is part of the XML output from A-001 (telnet localhost 8649):
<HOST NAME="B-002" IP="X.X.X.119" REPORTED="1245822864" TN="9" TMAX="20"
DMAX="0" LOCATION="" GMOND_STARTED="1245710345">
<METRIC NAME="proc_run" VAL="0" TYPE="uint32" UNITS=" " TN="45" TMAX="950"
DMAX="0" SLOPE="both">
<EXTRA_DATA>
<EXTRA_ELEMENT NAME="GROUP" VAL="process"/>
</EXTRA_DATA>
</METRIC>
<METRIC NAME="load_five" VAL="1.13" TYPE="float" UNITS=" " TN="7" TMAX="325"
DMAX="0" SLOPE="both">
<EXTRA_DATA>
<EXTRA_ELEMENT NAME="GROUP" VAL="load"/>
</EXTRA_DATA>
</METRIC>
....
<METRIC NAME="WritesPerSec" VAL="0.00" TYPE="float" UNITS="" TN="112225"
TMAX="60" DMAX="0" SLOPE="both">
<EXTRA_ELEMENT NAME="GROUP" VAL="Status"/>
</EXTRA_DATA>
</METRIC>
<METRIC NAME="db_used" VAL="20233" TYPE="uint32" UNITS="" TN="112225" TMAX="60"
DMAX="0" SLOPE="both">
<EXTRA_DATA>
<EXTRA_ELEMENT NAME="GROUP" VAL="Status"/>
</EXTRA_DATA>
</METRIC>
....
From the XML, we can see that gmond gets heartbeat info from B-002. The TN of
all the core metrics collected by gmond itself (e.g. proc_run, load_five) is
OK, while the TN of most metrics collected by our Python module extension
(e.g. WritesPerSec, db_used) is large (TN="112225").
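As a side note, a small script along these lines (just a sketch, assuming the
stock gmond XML on port 8649 and nothing beyond the Python standard library)
makes it easier to list every metric whose TN already exceeds its TMAX, i.e.
the metrics that have gone stale, across all hosts at once.

# stale_metrics.py - fetch the gmond XML from port 8649 and dump every metric
# whose TN exceeds its TMAX.
import socket
import xml.etree.ElementTree as ET

HOST, PORT = "localhost", 8649

def fetch_xml(host, port):
    sock = socket.create_connection((host, port))
    chunks = []
    while True:
        data = sock.recv(8192)
        if not data:
            break
        chunks.append(data)
    sock.close()
    return b"".join(chunks)

if __name__ == "__main__":
    root = ET.fromstring(fetch_xml(HOST, PORT))
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            tn, tmax = int(metric.get("TN")), int(metric.get("TMAX"))
            if tn > tmax:
                print("%s %s TN=%d TMAX=%d" % (
                    host.get("NAME"), metric.get("NAME"), tn, tmax))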
We ran tcpdump on B-002 and found that B-002 sends out all the metrics to the
multicast address (X.X.X.119 --> 239.X.X.110:8649).
On A-001, we found that A-001 receives all the multicast messages accordingly
(we see the same X.X.X.119 --> 239.X.X.110:8649 packets in tcpdump).
This means the multicast messages do reach A-001.
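To separate gmond's own socket handling from the network path, one option is to
join the same multicast group from an independent test socket on A-001 and
count what actually reaches userspace. A rough sketch follows; the group and
interface addresses are placeholders for the anonymized 239.X.X.110 group, and
sharing port 8649 only works if gmond's socket also allows address reuse
(otherwise run it with gmond stopped).

# mcast_listen.py - join the gmond multicast group on an extra socket and
# count the datagrams that reach userspace, independently of gmond.
import socket
import struct

MCAST_GROUP = "239.2.11.110"   # placeholder for the real 239.X.X.110 group
MCAST_PORT = 8649
IFACE_ADDR = "0.0.0.0"         # or the receiving node's own IP address

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", MCAST_PORT))

# Join the multicast group on the chosen interface (struct ip_mreq).
mreq = struct.pack("4s4s",
                   socket.inet_aton(MCAST_GROUP),
                   socket.inet_aton(IFACE_ADDR))
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

count = 0
while True:
    data, addr = sock.recvfrom(1472)
    count += 1
    print("#%d  %d bytes from %s" % (count, len(data), addr[0]))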
We then used strace to trace gmond and found the following.
On B-002, gmond sends out all the core metric values and receives them back
accordingly, something like this:
17190 13:35:45.188240 write(6,
"\0\0\0\204\0\0\0\5B-002\0\0\0\0\0\0\10location\0\0\0"..., 48) = 48
17190 13:35:45.188764 recvfrom(4,
"\0\0\0\205\0\0\0\5B-002\0\0\0\0\0\0\10location\0\0\0\0"..., 1472, 0,
{sa_family=AF_INET, sin_port=htons(56717), sin_addr=inet_addr("X.X.X.119")},
[16]) = 48
However, almost all the Python custom metrics are lost.
We see gmond send them like this:
17910 13:36:44.051205 write(6,
"\0\0\0\200\0\0\0\5B-002\0\0\0\0\0\0\fWritesPerSec"..., 200) = 200...
but they are not received back. Sometimes only about 2 custom metrics are
received and the following 20 custom metrics never arrive.
Looking at the gmond code, we see that write() is called by
Ganglia_udp_send_message(), and lsof -p <gmond-pid> shows that fd 6 is:
gmond 17190 root 6u IPv4 UDP B-002:56717->239.X.X.110:8649
recvfrom() is called by process_udp_recv_channel(), and fd 4 is:
gmond 17190 root 4u IPv4 UDP 239.X.X.110:8649
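Since the strace output shows the raw XDR payload, decoding the first few
fields of each captured datagram tells you exactly which metric names reach the
receive socket. Below is a sketch using Python's xdrlib, assuming the Ganglia
3.1 wire format in which each packet starts with a packet-type integer followed
by the host name and metric name strings (which matches the bytes visible in
the strace output above).

# decode_gmond_packet.py - decode the leading fields of a Ganglia 3.1 XDR
# packet: packet type, host name, metric name. Feed it raw UDP payloads,
# e.g. captured with the listener sketch above or extracted from tcpdump -w.
import xdrlib

def decode_header(payload):
    u = xdrlib.Unpacker(payload)
    packet_type = u.unpack_uint()   # 128 = metadata, 128+n = value packets
    host = u.unpack_string()
    metric = u.unpack_string()
    return packet_type, host, metric

if __name__ == "__main__":
    # Build an illustrative payload only; a real capture would be used here.
    p = xdrlib.Packer()
    p.pack_uint(133)                # example type (the strace shows \205 = 133)
    p.pack_string(b"B-002")
    p.pack_string(b"WritesPerSec")
    print(decode_header(p.get_buffer()))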
On node A-001, we find that A-001 receives all the core metric values but loses
all the Python custom metrics.
Sometimes tcpdump on A-001 also reports that some packets were dropped by the
kernel.
We used the same method on B-001 and it runs well; the TN of all its metrics is
OK.
Thanks a lot for your time and input.
Regards,
fan