Question #178969 on Graphite changed: https://answers.launchpad.net/graphite/+question/178969
chrismd proposed the following answer:

Sorry for the delayed response, I have been on hiatus for the past 2 weeks and just read your write-up. First off, thanks for being so thorough and detailed. Second, you officially win the trophy for Biggest Graphite System Ever (would make for a good t-shirt, I think); 3M metrics/min on one machine is very impressive. Third, I think there are some ways we can better utilize your vast system resources, so I'm psyched to see how far you might be able to push this system if you were interested in doing a benchmark once we've optimized everything :).

So down to the details. Your observation that rapid small writes can hamper performance is quite correct, and that is exactly the motivation behind the MAX_UPDATES_PER_SECOND setting in carbon.conf. Its default value (1,000) is too high; I swear I had already fixed this by lowering it to 500, but I'm looking at trunk and it's 1,000. Sorry about that, I've just committed it at 500 now. Either way, when you've got N carbon-caches you need to divide the total value you're after by N. A system with as many disks as yours can probably handle 1,000 updates/sec handily, but 10,000/sec would certainly be excessive. This approach should result in a constant rate of write operations, where the number of datapoints written to disk is proportional to that rate and to the cache size. It is also a good way to limit how hard the backend works the disks (the cost is in seeks; the writes themselves are negligibly small), leaving a relatively fixed amount of disk-utilization headroom for frontend requests or other processes on the system. If there's contention, the cache simply grows, and as long as it doesn't max out there is generally no visible impact.

Some good news: you might be able to do away with all your relays. The next release, 0.9.10, is going to be based on this branch: https://code.launchpad.net/~chrismd/+junk/graphite-megacarbon. The 'megacarbon' part refers to the fact that all of carbon's functionality has been unified in a single carbon-daemon.py with a configurable processing pipeline, so any instance can aggregate, relay, rename, cache & write, or any combination thereof. I'll suggest some ideas in a moment for how you could leverage this.

The carbon-relay daemon (or equivalently, a carbon-daemon instance that just relays) has been somewhat obsoleted by the new carbon-client.py script, which is basically a client-side relay. It has all of the same functionality as the standard relay (it uses the same carbon libraries); the only difference is that it reads metrics from its stdin, so it's suited for client-side use. If you don't want to deploy Twisted & carbon to all 1,000 client machines, that's understandable, and that's a case where you'd still want a relaying daemon; it centralizes configuration & maintenance burden as well as processing load (you win some, you lose some).

If you use carbon-clients you can still separate out the 5 metrics that need to get aggregated by using a second carbon-client: one for aggregated metrics that connects to the aggregator(s), and one for non-aggregated metrics that connects directly to the carbon-caches / carbon-daemons that write. The aggregator daemons can forward the 5 original metrics on to the writers. Technically the two separate carbon-clients (which is basically the same idea as your top/middle relay split) would be unnecessary if the relay code could mix the use of relay rules and consistent hashing, but currently that isn't supported.

Come to think of it, that wouldn't be hard to implement: just add an option to the relay-rule configurations to use consistent hashing to choose among the rule's own destinations rather than the global destination list. That's a really good idea actually, thanks for pointing it out :) (filed as Bug #899543). Implementing it would remove the need for your mid-tier relays without needing to change anything else.
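To make that idea a bit more concrete, here is a rough sketch of what a relay-rules.conf entry might look like once such an option exists. The 'hashing' key is purely hypothetical (it is not in any released carbon), and the pattern, hostnames and ports are made up for illustration:

    # relay-rules.conf -- hypothetical illustration of Bug #899543
    [needs-aggregation]
    # the handful of metrics that must pass through the aggregators
    pattern = ^collectors\.\d+\.summary\.
    destinations = aggregator-a:2023, aggregator-b:2023
    # proposed option: consistent-hash across this rule's destinations
    # instead of the global destination list
    hashing = consistent

    [default]
    default = true
    # everything else goes straight to the writing caches/daemons
    destinations = cache-01:2004, cache-02:2004, cache-03:2004
    hashing = consistent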
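And to tie the MAX_UPDATES_PER_SECOND point above to actual config, here is a minimal carbon.conf sketch for a box running four cache instances where you want roughly 1,000 whisper updates/sec in total; the instance name, port numbers and the 4-way split are just an example, so pick values that match your hardware:

    # carbon.conf (sketch): ~1,000 updates/sec total across 4 caches,
    # i.e. 250 per instance. Settings in [cache] apply to every
    # instance unless overridden in a [cache:<name>] section.
    [cache]
    MAX_UPDATES_PER_SECOND = 250

    [cache:b]
    # per-instance overrides (each instance needs its own ports)
    LINE_RECEIVER_PORT = 2103
    PICKLE_RECEIVER_PORT = 2104
    CACHE_QUERY_PORT = 7102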
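On the client side of all this, whatever feeds carbon-client.py's stdin (or talks to the haproxy / a relay directly) just emits the usual plaintext protocol, one "metric-path value unix-timestamp" line per datapoint. A minimal Python sketch, with made-up metric names and a made-up relay hostname:

    import socket
    import time

    # Hypothetical destination: your haproxy / relay line receiver
    CARBON_HOST = "graphite-relay.example.com"
    CARBON_PORT = 2003  # default plaintext line-receiver port

    def send_metrics(metrics):
        # metrics is a list of (metric_path, value) pairs; each becomes
        # one "<path> <value> <timestamp>" line of the plaintext protocol
        now = int(time.time())
        payload = "".join("%s %s %d\n" % (path, value, now)
                          for path, value in metrics)
        sock = socket.create_connection((CARBON_HOST, CARBON_PORT))
        try:
            sock.sendall(payload.encode("ascii"))
        finally:
            sock.close()

    send_metrics([
        ("servers.web01.cpu.user", 3.2),          # example names only
        ("servers.web01.requests.count", 1842),
    ])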
If you want to try out the new carbon-daemon.py, feel free; it is already working and tested on the megacarbon branch. There are just a few webapp changes in that branch that are mid-flight before I merge it all to trunk, so don't use the webapp code from that branch yet.

Given the constraints of carbon 0.9.9, though, you solved the problem quite well. The only tweak I can think of that involves no graphite code changes would be to have your clients send the 5 aggregated metrics directly to the aggregator and all their other metrics to the haproxy; that would eliminate the need for the mid-tier relays.

Thanks again for the excellent write-up.
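P.S. In case it's useful when wiring up the aggregators for those 5 metrics, carbon's aggregation-rules.conf entries take the form "output_template (frequency) = method input_pattern". The metric names below are made up, but the syntax is the real one:

    # aggregation-rules.conf (example names, real syntax)
    # Sum each host's per-minute request count into one cluster-wide series
    collectors.all.requests (60) = sum collectors.<host>.requests
    # Average a latency metric across hosts
    collectors.all.latency (60) = avg collectors.<host>.latency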

