New question #240699 on Graphite:
https://answers.launchpad.net/graphite/+question/240699

In our Graphite (version 0.9.9) setup there are two carbon-relays and a total of
11 carbon-cache instances. There are three carbon-cache instances behind one of
the relays and two carbon-cache instances behind the other one. All the relays
and caches are running on a single box. This seemingly weird setup came into
existence over an extended period of time; initially one carbon-cache instance
was dedicated to each team to send their metrics. Then a few teams started
sending a lot of metrics, so the number of carbon-cache instances was increased
for them, putting these behind a relay and using consistent hashing to
distribute metrics across carbon-cache instances.
Due to the increased number of metrics, our Graphite box became bound by disk
I/O. After some searching we concluded that we would have to move to Flashcache.
During that search we also came across
https://answers.launchpad.net/graphite/+question/178969
What caught our attention there was the batch update of whisper files. We
felt that this would improve disk performance. The box has plenty of memory,
so we could afford to keep metrics in memory.
So we modified writer.py to make carbon-cache call whisper.update_many() only
for those metrics which have more than a configurable threshold number of
datapoints in the cache (the threshold is set by a line we added to conf.py).
The threshold is lifted when the cache size grows above 80% of MAX_CACHE_SIZE,
and is applied again when the cache size drops. The result has been quite
encouraging: before the change, disk utilization very frequently reached 100%
and stayed there for some time. After the change, the average disk utilization
is almost half; it still reaches 100%, but much less frequently and for much
less time.
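The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the actual writer.py change: the names MIN_BATCH_POINTS, HIGH_WATER_FRACTION, select_metrics_to_write and flush are hypothetical, the values are arbitrary, and the whisper call is replaced by a callback so the sketch runs on its own.

```python
MAX_CACHE_SIZE = 1000        # total datapoints allowed in the cache (illustrative value)
MIN_BATCH_POINTS = 5         # hypothetical name for the configurable threshold
HIGH_WATER_FRACTION = 0.8    # the threshold is lifted above 80% of MAX_CACHE_SIZE


def select_metrics_to_write(cache):
    """Return the metric names whose cached datapoints should be written now.

    `cache` maps a metric name to a list of (timestamp, value) tuples.
    """
    total_points = sum(len(points) for points in cache.values())
    if total_points > HIGH_WATER_FRACTION * MAX_CACHE_SIZE:
        # Cache is nearly full: ignore the threshold and write everything.
        return [name for name, points in cache.items() if points]
    # Normal case: write only metrics holding more than the threshold.
    return [name for name, points in cache.items()
            if len(points) > MIN_BATCH_POINTS]


def flush(cache, write_batch):
    """Write each selected metric in one batched call and drop it from the cache."""
    for name in select_metrics_to_write(cache):
        points = cache.pop(name)
        # write_batch stands in for whisper.update_many(path, points).
        write_batch(name, points)
```

Metrics below the threshold simply stay in memory until they accumulate enough datapoints or the cache crosses the 80% mark, which is why the webapp has to reach the cache to show recent data.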
But now there are a lot of metrics in the cache. We have listed all the
carbon-cache instances in CARBONLINK_HOSTS, but we get the most recent
datapoints for very few metrics. For most metrics the graph shows only the
datapoints that have already been written to the whisper files. We figured out
the reason for this as well:
The Graphite webapp builds a hash-ring from the instances specified in
CARBONLINK_HOSTS. For each metric in a query, a hash of the metric name is
computed, which determines which carbon-cache instance is queried for that
metric. In our setup, a few carbon-cache instances are behind one relay, a few
are behind the other relay and the rest are not behind any relay. So an incoming
metric reaches a carbon-cache either directly or via a hash-ring formed by a
subset of the carbon-cache instances. At query time, on the other hand, the
hash-ring is made up of all the carbon-cache instances, so the instance chosen
for a query is often not the one that holds the metric in its cache.
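The mismatch between the two rings can be reproduced with a small sketch. The HashRing class below is a simplified consistent-hash ring written for illustration only; it is not Graphite's own implementation, and the instance names (cache-a and so on) and the misrouted helper are hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Simplified consistent-hash ring (illustrative, not Graphite's own code)."""

    def __init__(self, nodes, replicas=100):
        entries = []
        for node in nodes:
            for i in range(replicas):
                entries.append((self._position(f"{node}:{i}"), node))
        entries.sort()
        self._positions = [pos for pos, _ in entries]
        self._nodes = [node for _, node in entries]

    @staticmethod
    def _position(key):
        # Map a string to a point on the ring using the first 32 bits of its MD5.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest()[:8], 16)

    def get_node(self, key):
        # The owner is the first ring entry at or after the key's position,
        # wrapping around to the start of the ring.
        index = bisect.bisect_left(self._positions, self._position(key))
        return self._nodes[index % len(self._nodes)]


# Hypothetical instance names: three caches behind one relay, eleven in total.
RELAY_CACHES = ["cache-a", "cache-b", "cache-c"]
ALL_CACHES = RELAY_CACHES + [f"cache-{c}" for c in "defghijk"]

relay_ring = HashRing(RELAY_CACHES)   # decides where a datapoint is stored
webapp_ring = HashRing(ALL_CACHES)    # decides which cache the webapp asks


def misrouted(metrics):
    """Return metrics whose write-side and query-side instances differ."""
    return [m for m in metrics
            if relay_ring.get_node(m) != webapp_ring.get_node(m)]
```

Any metric that the eleven-node ring assigns to an instance outside the relay's three caches ends up in this list, and for those metrics the webapp asks an instance that never received the data.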
For graphs to be useful to the respective teams, it is important for them to
show the most recent datapoints. So for now, we have modified the webapp to
query each carbon-cache instance in CARBONLINK_HOSTS for each metric (in the
same order as given in CARBONLINK_HOSTS). But this increases the query time for
each metric. Is there a better way to do it?
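The cost of this workaround is easy to see in a sketch. The function below is an assumption about the shape of the change, not the actual webapp patch: query_all_instances and the query_cache callback are hypothetical names, and stopping at the first instance that returns datapoints is also an assumption.

```python
def query_all_instances(metric, hosts, query_cache):
    """Ask each carbon-cache instance in order until one returns datapoints.

    `hosts` is the list from CARBONLINK_HOSTS, in its configured order.
    `query_cache(host, metric)` is a hypothetical callback that returns the
    cached (timestamp, value) tuples held by that instance, or an empty list.
    """
    for host in hosts:
        datapoints = query_cache(host, metric)
        if datapoints:
            return datapoints
    # No instance had the metric in its cache.
    return []
```

In the worst case every metric in a graph costs one round trip per configured instance, which is where the extra query time comes from.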
Secondly, given the way the webapp determines the carbon-cache instance to
query, how would the correct instance be queried if relay-rules are used
instead of consistent-hashing?
  

