New question #240674 on Graphite:
https://answers.launchpad.net/graphite/+question/240674

I am trying to set up a Graphite cluster capable of handling 500K metric
datapoints every 10 seconds (roughly 50K per second) as a starting point. After
reading through some of the answers on this site, blog posts and other
documentation, I have set up the following configuration:

- 2 machines with 8 cores, 32 GB of memory and 3 TB of storage each
- On each machine:
     - 5 carbon-relays
     - 9 carbon-aggregators
     - 9 carbon-caches

In total, there are 10 relays, 18 aggregators and 18 caches in the cluster. 
Each aggregator communicates with a single cache - it's 1-to-1. The webapps are 
configured to speak to their corresponding host's caches. An haproxy load 
balancer receives all the metric traffic and distributes the load among the 10 
relays. The 18 aggregators are listed as the destinations of every relay in
the configuration file. The relays use the aggregated-consistent-hashing relay
method, so that all metrics feeding the same aggregate (per the aggregation
rules) are routed to the same aggregator and therefore end up in the same cache.
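To make the routing idea concrete, here is a minimal Python sketch of hashing on the aggregated name rather than the raw metric name. It is a simplified illustration, not carbon's actual implementation: the rule (app.<host>.requests feeding app.all.requests), the destination names and the ring parameters are all hypothetical.

```python
import bisect
import hashlib
import re

# Hypothetical rule: every "app.<host>.requests" metric feeds "app.all.requests".
RULE_PATTERN = re.compile(r"^app\.[^.]+\.requests$")
RULE_OUTPUT = "app.all.requests"

# Hypothetical names standing in for the 18 aggregators.
DESTINATIONS = [f"aggregator-{i}" for i in range(18)]


def _position(key):
    """Map a string to a ring position using the first 4 hex digits of its MD5."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest()[:4], 16)


def build_ring(nodes, replicas=100):
    """Place each node on the ring several times to spread the load."""
    return sorted((_position(f"{node}:{r}"), node)
                  for node in nodes for r in range(replicas))


def routing_key(metric):
    """Hash on the aggregate's name when a rule matches, else on the metric itself."""
    return RULE_OUTPUT if RULE_PATTERN.match(metric) else metric


def pick_destination(ring, metric):
    """Return the first node at or after the key's position, wrapping around."""
    pos = _position(routing_key(metric))
    index = bisect.bisect_left(ring, (pos, "")) % len(ring)
    return ring[index][1]


ring = build_ring(DESTINATIONS)
# Both inputs of the same aggregate land on the same aggregator.
same = pick_destination(ring, "app.web01.requests") == pick_destination(ring, "app.web02.requests")
```

Because the key is the aggregate's name, every input of a given aggregate should reach one aggregator, which is the property the sum relies on.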

This setup behaves well. I have been able to run stress tests on the cluster, 
publishing larger sets of metrics incrementally to monitor the cluster health 
at every point. However, I have noticed that there are issues with the 
aggregated metrics.

For example, in the screenshot linked below, the graph on the right shows the 
raw values received. The graph on the left shows the aggregated values computed 
from the raw values. In this case, this metric's aggregation is defined as a 
sum in the aggregation rules configuration file.

http://bit.ly/1hO6bBQ
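As a reference for what a "sum" aggregation is expected to produce, here is a small Python sketch that buckets raw datapoints into fixed intervals and sums each bucket. The 10-second interval and the sample values are hypothetical, chosen only so the total comes out at 750, and the bucketing is an assumption about how the aggregator groups datapoints rather than a copy of its code.

```python
from collections import defaultdict


def sum_by_interval(datapoints, interval=10):
    """Sum (timestamp, value) pairs per interval; returns {bucket_start: total}."""
    buckets = defaultdict(float)
    for timestamp, value in datapoints:
        # Align each timestamp to the start of its interval.
        bucket_start = timestamp - (timestamp % interval)
        buckets[bucket_start] += value
    return dict(buckets)


# Hypothetical raw values from three sources within the same 10-second window.
raw = [(1000, 250), (1003, 240), (1007, 260)]
totals = sum_by_interval(raw)  # {1000: 750.0}
```

If the inputs of one aggregate were split across several aggregators, each would compute only a partial sum for the bucket, which is one way the stored value could end up below the by-hand total.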

If I do the sum by hand, the result is around 750, which is clearly not what
Graphite is computing. This happens for *all* aggregated metrics in my cluster.
While investigating this issue, I also noticed something strange when comparing 
the number of metrics received by the relays against the number of metrics sent 
by the relays to the aggregators. In the screenshot below, the graph on the 
right shows that the relays received around 280K metrics. However, only around 
140K of those are sent to the aggregators.

http://bit.ly/1f91Q85
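The gap between the two counts can be put in numbers with a short Python sketch. It uses only the approximate figures quoted above (280K received, 140K sent); the function names are hypothetical.

```python
def forwarded_fraction(received, sent):
    """Share of received metrics that the relays actually forwarded."""
    if received <= 0:
        raise ValueError("received must be positive")
    return sent / received


def missing_metrics(received, sent):
    """Metrics that were received but never forwarded."""
    return received - sent


RECEIVED = 280_000  # approximate, read from the graph
SENT = 140_000      # approximate, read from the graph

fraction = forwarded_fraction(RECEIVED, SENT)  # 0.5
missing = missing_metrics(RECEIVED, SENT)      # 140000
```

In other words, roughly half of what the relays receive never reaches the aggregators.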

If I enable whitelists and reduce the number of metrics processed by the
cluster, the aggregations work properly again and the relays' received and
sent metric counts match again. See the screenshots linked below:

http://bit.ly/1aXJyST
http://bit.ly/18EP28u

Questions:

- Any insight into why the aggregated metrics are "spiky" while the
corresponding raw values look correct?
- Is there a scenario in which a relay will send fewer metrics than it receives?
- Does my setup make sense? Is there a better way to scale a Graphite cluster?

I would greatly appreciate any help.

Thanks!

-- 
You received this question notification because you are a member of
graphite-dev, which is an answer contact for Graphite.

