The TN value simply indicates the time offset from the reported timestamp since the metric was last received from the managed node; in other words, it is the age of the metric. A large number indicates that the metric value has not been updated for a long period of time. This might be because the reporting interval has been set to a very large value, or it may have something to do with multicast packets being lost in your network.

I would suggest that you run gmond in debug mode ("-d 10") on both the managed node and the collecting node, and then try to correlate when the reporting node sends one of your custom metrics with when (or whether) the collecting node receives it. The most obvious thing this would tell you is whether your multicast packets are being lost or blocked by a router in your network. If the collecting node is actually receiving the packets in a timely manner but the TN value is still large, then we would have to look at a possible bug in gmond. The fact that you seem to be losing only the python metrics suggests that this is either a configuration error or a problem with the metric definition of the custom python metric. Do you have the same problem with any of the standard python metrics that ship with ganglia?
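To make the "age" interpretation concrete, here is a minimal sketch (my own illustration, not gmond's actual implementation) of how a TN value can be read as seconds since the last received update:

```python
import time

def metric_age(last_received_ts, now=None):
    """Seconds since a metric update was last received: a rough model of
    what TN reports. Hypothetical helper, not actual gmond code."""
    if now is None:
        now = time.time()
    return int(now - last_received_ts)

# A metric with TN="112225" (as in the XML quoted below) has not been
# updated in over 31 hours:
print(metric_age(0, now=112225))   # 112225
print(112225 // 3600)              # 31 (whole hours)
```

Against a TMAX of 60, an age of 112225 seconds means roughly 1870 consecutive reporting intervals went missing, which is why a bulk loss (rather than occasional drops) is the thing to look for in the debug output.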
Brad

>>> On 6/24/2009 at 9:39 AM, in message <bay140-w76585dd65f08f17a35d1bb3...@phx.gbl>, liangfan <xfanli...@hotmail.com> wrote:
> I'm trying to figure out a very puzzling issue in our ganglia system.
> We are using ganglia 3.1.1. We are seeing a strange issue where some
> metrics always have a very large TN.
>
> The system is configured as follows:
> - We have gmond deployed on 16 nodes (A-001 --> A-008, B-001 --> B-008).
> - gmond is configured to use multicast mode; each node has all metrics.
>
> The issues are:
> - TN on some nodes is OK, while on others it is wrong.
> - Some metrics of one host have a TN that is too large, while other
>   metrics of the same node are OK.
> We suspect the kernel may be dropping these packets. You can see the
> detailed analysis at the end.
>
> I found a thread on the mailing list that may relate to this:
> http://www.mail-archive.com/ganglia-develop...@lists.sourceforge.net/msg02942.html
>
> MonAMI also has a page that might relate to this:
> http://monami.sourceforge.net/tutorial/ar01s06.html
>
> On preventing metric-update loss, it says:
> "The current Ganglia architecture requires each metric update be sent
> as an individual metric-update message. On a moderate-to-heavily
> loaded machine, there is a chance that gmond may not be scheduled to
> run as the messages arrive. If this happens, the incoming messages
> will be placed within the network buffer. Once the buffer is full,
> any subsequent metric-update messages will be lost. This places a
> limit on how many metric-update messages can be sent in one go. For
> 2.4-series Linux kernels the limit is around 220 metric-update
> messages; for 2.6-series kernels, the limit is around 400."
>
> However, we are still confused by the symptoms:
> - We do not see much buffering in port 8649's Recv-Q, and our nodes
>   are not heavily loaded.
> - Why are all the core metrics received and updated to now, while
>   almost all the custom python metrics are lost and TN gets too large?
> - Why do some nodes always get outdated custom python metrics, while
>   other nodes are OK?
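Regarding the buffer limit quoted above: those per-kernel figures can be sanity-checked with a back-of-the-envelope calculation. The buffer and message sizes below are illustrative assumptions, not measured gmond values:

```python
def messages_before_overflow(rcvbuf_bytes, avg_msg_bytes):
    """Rough upper bound on how many metric-update messages the kernel
    can queue before the UDP receive buffer overflows and subsequent
    packets are silently dropped. Sizes are illustrative assumptions."""
    return rcvbuf_bytes // avg_msg_bytes

# Assuming a 64 KiB socket receive buffer and ~300-byte updates:
print(messages_before_overflow(64 * 1024, 300))   # 218
```

That lands close to the ~220 figure quoted for 2.4-series kernels, which suggests the limit is simply the receive-buffer size divided by the typical message size, so a larger buffer or smaller bursts both raise it.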
> I've been scratching my head over this for almost a week now; I've
> searched the ganglia mailing list archives, but cannot find more info.
>
> Any help/suggestions/advice would be very much appreciated -- it's
> really very frustrating!
>
> Below is the detailed analysis.
>
> ----- Detailed analysis -----
> Here is part of the XML output from A-001 (telnet localhost 8649):
>
> <HOST NAME="B-002" IP="X.X.X.119" REPORTED="1245822864" TN="9" TMAX="20" DMAX="0" LOCATION="" GMOND_STARTED="1245710345">
> <METRIC NAME="proc_run" VAL="0" TYPE="uint32" UNITS=" " TN="45" TMAX="950" DMAX="0" SLOPE="both">
> <EXTRA_DATA>
> <EXTRA_ELEMENT NAME="GROUP" VAL="process"/>
> </EXTRA_DATA>
> </METRIC>
> <METRIC NAME="load_five" VAL="1.13" TYPE="float" UNITS=" " TN="7" TMAX="325" DMAX="0" SLOPE="both">
> <EXTRA_DATA>
> <EXTRA_ELEMENT NAME="GROUP" VAL="load"/>
> </EXTRA_DATA>
> </METRIC>
> ....
> <METRIC NAME="WritesPerSec" VAL="0.00" TYPE="float" UNITS="" TN="112225" TMAX="60" DMAX="0" SLOPE="both">
> <EXTRA_DATA>
> <EXTRA_ELEMENT NAME="GROUP" VAL="Status"/>
> </EXTRA_DATA>
> </METRIC>
> <METRIC NAME="db_used" VAL="20233" TYPE="uint32" UNITS="" TN="112225" TMAX="60" DMAX="0" SLOPE="both">
> <EXTRA_DATA>
> <EXTRA_ELEMENT NAME="GROUP" VAL="Status"/>
> </EXTRA_DATA>
> </METRIC>
> ....
>
> From the XML, we can see that gmond gets heartbeat info from B-002.
> The TN of all the core metrics collected by gmond (e.g. proc_run,
> load_five) is OK, while the TN of most metrics collected by our
> python module extension (e.g. WritesPerSec, db_used) is large
> (TN="112225").
>
> We ran tcpdump on B-002 and found that B-002 sends out all the
> metrics to the multicast address (X.X.X.119 --> 239.X.X.110:8649).
> On A-001, we found that A-001 receives all the multicast messages
> accordingly (we see the same X.X.X.119 --> 239.X.X.110:8649 messages
> in tcpdump).
> This means the multicast messages reach A-001.
>
> We then used strace to trace gmond and found:
> On B-002, gmond sends out all the core metric values and receives
> them back accordingly.
> Something like this:
> 17190 13:35:45.188240 write(6,
>   "\0\0\0\204\0\0\0\5B-002\0\0\0\0\0\0\10location\0\0\0"..., 48) = 48
> 17190 13:35:45.188764 recvfrom(4,
>   "\0\0\0\205\0\0\0\5B-002\0\0\0\0\0\0\10location\0\0\0\0"..., 1472, 0,
>   {sa_family=AF_INET, sin_port=htons(56717),
>   sin_addr=inet_addr("X.X.X119")}, [16]) = 48
> However, we lose almost all the python custom metrics.
> We see gmond send a metric like this:
> 17910 13:36:44.051205 write(6,
>   "\0\0\0\200\0\0\0\5B-002\0\0\0\0\0\0\fWritesPerSec"..., 200) = 200
> ...
> but it does not receive them back. Sometimes it receives only about 2
> custom metrics, and the following 20 custom metrics are not received.
> Looking at the gmond code, we see that write() is called by
> Ganglia_udp_send_message(), and lsof -p <gmond-pid> shows fd 6 is:
> gmond 17190 root 6u IPv4 UDP B-002:56717->239.X.X.110:8649
> recvfrom() is called by process_udp_recv_channel(), and fd 4 is:
> gmond 17190 root 4u IPv4 UDP 239.X.X.110:8649
> On node A-001 we find that A-001 receives all the core metric values
> but loses all the python custom metrics.
>
> Sometimes on A-001, tcpdump also reports that some packets were
> dropped by the kernel.
>
> We used the same method on B-001 and it runs well; all the metric TNs
> are OK.
>
> Thanks a lot for your time and input.
>
> Regards,
> fan

------------------------------------------------------------------------------
_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
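P.S. If the kernel drops seen in tcpdump do turn out to be the culprit, one generic mitigation on Linux (independent of gmond version; the values below are illustrative and should be tuned per site) is to raise the kernel's UDP receive-buffer limits so the socket can absorb larger bursts of metric updates:

```
# /etc/sysctl.conf -- illustrative values, tune for your environment
net.core.rmem_default = 262144
net.core.rmem_max = 524288
```

Apply with "sysctl -p", restart gmond, and re-check whether the custom metrics' TN stays low.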