A few thoughts. There are a lot of interesting things going on in the software.
We have several queues and thread pools. It makes sense to put gauges (http://metrics.dropwizard.io/3.1.0/getting-started/#gauges) around those. This will give us visibility into how close those are to 0 at any given time.

We now have per-node data:
https://issues.apache.org/jira/browse/GOSSIP-21
https://issues.apache.org/jira/browse/GOSSIP-25
It makes sense to use gauges to record the size of these. We should also use meters to count how many operations/sec are caused by users adding data as well as by the internode process replicating data.

For PassiveGossipThread I could see us counting messages received as a meter. We could count corrupt messages separately as a meter. We could also capture this data per host:

gossipfrom.node1.goodmessages
gossipfrom.node1.badmessages

As well as globally:

gossipfrom.badmessages
gossipfrom.goodmessages

For ActiveGossip we could use histograms to track the time to process:

sendSharedData
sendPerNodeData
sendMembership

We could use a gauge to track the size of

this.scheduledExecutorService = Executors.newScheduledThreadPool(2);

and other executors to make sure that the queue is not backing up/blocked. Again, you can track this per host and globally.

I am an ex-system administrator, so I am generally OK with as many metrics as possible as long as we do not clutter the code. There are ways to do aspect/annotation-driven counters as well, so we can always look to refactor around those things if we want to. If you see something that seems like a point of possible contention, or something that you believe is important to track, I would capture that.

In the long run there is something to consider about tracking metrics from 1k-node clusters, but we are not there yet, and metrics are generally lighter than the code anyway.

Thanks for taking the time to look at this.

Edward

On Tue, Oct 11, 2016 at 2:04 PM, chandresh pancholi <[email protected]> wrote:

> Hi,
>
> I wanted to know where to begin working on this issue.
> Someone please help me out with where to start and how to proceed with it.
>
> For Histogram i see ActiveThreadGroup and PassiveThreadGroup are doing
> inter-node operation.
>
> Where are we tracking success and failure request so generate meter
> metrics?
>
> Any kind of help is appreciable.
>
> --
> Chandresh Pancholi
> Senior Software Engineer
> Flipkart.com
> Email-id:[email protected]
> Contact:08951803660
>
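
P.S. To make the per-host and global naming scheme above concrete, here is a rough sketch of how good/bad message counts and an executor queue-size reading could be recorded. It uses plain JDK classes as a stand-in for the Dropwizard Metrics types so that it runs without extra dependencies, and it only counts events (a real meter would also track rates). The class name GossipCounters and its methods are hypothetical and are not part of the gossip code base.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

// Hypothetical helper for illustration only; not part of the gossip code base.
class GossipCounters {
    // One thread-safe counter per metric name.
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    // Count one received message both for the sending host and globally,
    // e.g. gossipfrom.node1.goodmessages and gossipfrom.goodmessages.
    void recordMessage(String host, boolean valid) {
        String suffix = valid ? "goodmessages" : "badmessages";
        increment("gossipfrom." + host + "." + suffix);
        increment("gossipfrom." + suffix);
    }

    private void increment(String name) {
        counters.computeIfAbsent(name, k -> new LongAdder()).increment();
    }

    // Current value of a counter; 0 if nothing was recorded under that name.
    long count(String name) {
        LongAdder adder = counters.get(name);
        return adder == null ? 0L : adder.sum();
    }

    // Gauge-style reading: evaluated on demand, reports how many tasks
    // are waiting in the executor's queue at that moment.
    static Supplier<Integer> queueSizeGauge(ScheduledThreadPoolExecutor executor) {
        return () -> executor.getQueue().size();
    }
}
```

With Dropwizard Metrics itself, the same names would presumably be passed to MetricRegistry.meter(name), and the queue-size reading would be registered as a Gauge<Integer>. Keeping the recording in one small helper like this is one way to avoid cluttering the gossip threads with metric code.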
