New question #287525 on Graphite:
https://answers.launchpad.net/graphite/+question/287525

While this looks similar to 
https://answers.launchpad.net/graphite/+question/285063 - I'm reaching out here 
to see if anyone has seen similar behavior or any advice/tips.

Relevant information:

0.9.15 graphite-web/carbon/whisper
1 TB SSD storage for whisper files
8-core machine, 32 GB memory

3 relays running carbon-c-relay use forward to send to 1 server running two 
carbon-cache instances.  I initially used carbon-relay on the storage server, 
then switched to carbon-c-relay with carbon_ch, and am now using any_of to 
split metrics between the instances (routing sketched below).

~2 million metrics/minute
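
For reference, the routing on the storage server looks roughly like this 
(ports are illustrative, not my exact config):

    cluster caches
        any_of
            127.0.0.1:2103
            127.0.0.1:2203
        ;

    match *
        send to caches
        stop
        ;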

About 2 weeks ago an influx of metrics arrived all at once, and ever since 
then (even after I turned off whatever was shipping that influx a few days 
later), carbon-cache-a's cache size has kept growing, eventually causing the 
machine to start swapping.  I've since put in a MAX_CACHE_SIZE limit, which 
the instance hits within a few minutes.  I've attempted tuning dirty pages, 
changing the disk scheduler from deadline to noop, and turning on 
WHISPER_AUTOFLUSH, all without seeing any improvement.
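
Concretely, the knobs I tried look roughly like this (the values shown are 
examples of what I experimented with, not what I'd recommend):

    # dirty page tuning
    sysctl -w vm.dirty_background_ratio=5
    sysctl -w vm.dirty_ratio=10

    # disk scheduler on the whisper device (sda is illustrative)
    echo noop > /sys/block/sda/queue/scheduler

    # carbon.conf, per cache instance
    WHISPER_AUTOFLUSH = True
    MAX_CACHE_SIZE = 10000000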

Disk utilization was about 50-90% and CPU usage was not maxed.  carbon-cache-b 
had no issues whatsoever.  Checking carbon-agent's committed points, it looks 
like carbon-cache-a is committing about 1/4 as many points as carbon-cache-b, 
even though the server is not overloaded.  Looking at carbon-c-relay's carbon 
metrics, each carbon-cache instance is receiving the same number of metrics.
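
For what it's worth, this is how I'm comparing committedPoints between the 
instances (paths assume a default whisper location and instance names a/b; 
<host> is a placeholder):

    whisper-fetch.py --from=$(date -d '1 hour ago' +%s) \
        /opt/graphite/storage/whisper/carbon/agents/<host>-a/committedPoints.wsp
    whisper-fetch.py --from=$(date -d '1 hour ago' +%s) \
        /opt/graphite/storage/whisper/carbon/agents/<host>-b/committedPoints.wsp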

Fast forward to yesterday.  I've now clustered the metrics out to two servers 
and am running four carbon-cache instances on each (I was running 2 and saw 
the same issue; the new server was hitting 100% CPU on the caches, so I bumped 
both servers to 4).

The 3 relays use carbon_ch to ship to the two servers, and each server uses 
any_of to ship to its 4 carbon-cache instances.  Each server now has 1 million 
metrics/minute going to it, and after increasing the number of carbon-cache 
instances on the initial server, carbon-cache-a and carbon-cache-c shoot up to 
MAX_CACHE_SIZE.  Turning off either a or c causes b or d to grow in 
cache.size until I start a/c up again, at which point the cache.size of b/d 
decreases as a/c increases again (I'm watching this with the query below).
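
The per-instance cache sizes come from a graphite-web render query along these 
lines (hostname omitted):

    /render?target=carbon.agents.*.cache.size&from=-2h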

CPU use is ~20% for each instance and disk utilization is 75-90%.  
carbon-c-relay shows the same number of metrics going to each carbon-cache.

The second new server (with newer/better SSDs) looks healthy: 100% CPU for 
each carbon-cache (I'm looking at expanding that), 10-30% disk utilization, 
and the cache.size of each cache isn't constantly growing.

Are there any performance tips I can look into further for my initial server?  
Changing MAX_UPDATES_PER_SECOND to anywhere between 2500 and 50000 doesn't 
seem to make a difference, and the changes mentioned above didn't improve the 
issue.  I'm down to thinking that the upgrade to 0.9.15 introduced a problem 
at some point, that the SSDs are too old (the server is about 2-3 years old 
now), that I should switch carbon-c-relay/carbon-cache on the storage servers 
to UDP to reduce overhead, or that some other sysctl tuning is in order.
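
For context, the relevant per-instance carbon.conf settings currently look 
roughly like this (exact values keep changing as I experiment):

    [cache:a]
    MAX_CACHE_SIZE = 10000000
    MAX_UPDATES_PER_SECOND = 10000
    WHISPER_AUTOFLUSH = True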

Happy to post any graphs/configs that may help with this - the issue is odd in 
that the server had been running fine for months before this with about the 
same number of metrics.
