Oof. It's never fun to hear about folks having problems using your software, and it's even worse when the headaches are enough to make you want to roll back your deployment. Even so, I appreciate you taking the time to write up this post-mortem; we're very motivated to resolve the issues that have been causing you pain. More on the details in-line below:
On 03/17/2015 11:36 AM, Matt Bostock wrote:
*## Memory leak from dashboard plugin* We initially saw Heka use a significant amount of memory over time, more than 3GB in some cases. You can see this in the attached graph, which shows Heka's resident memory usage over the past 6 weeks across approximately 100 servers. The drops show when we either restarted Heka or the OOM killer killed the process.
Okay, we'll take a look and will resolve this leak. I've opened an issue for tracking: https://github.com/mozilla-services/heka/issues/1422.
*## Idle packs; logs and metrics not received for some machines* While metrics and logs are successfully received by our Graphite and Elasticsearch machines for our application servers, *logs and metrics are not being received for certain high-traffic servers in our stack* (i.e. cache/router machines). This is the issue that has caused us the most pain. We see this warning for idle packs recurring in our logs: https://gist.github.com/mattbostock/241ae690a4c09acd8145 In a bid to alleviate some of the pressure on the plugins, we tried increasing the plugin max chan size from 30 to 300 (as you can see in the configuration linked to above). Just for reference, this did not have any significant impact on Heka's memory usage.
No, it wouldn't. You're changing the size of the input channels for all of the plugins, which doesn't really change memory consumption. Changing the pool size would make a difference, since that changes the number of message structs that are allocated.
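To make the distinction concrete, here's a sketch of the two global settings in question. I'm writing the key names from memory (`poolsize` and `plugin_chansize` under `[hekad]`), so double-check the spelling against the docs for your Heka version:

```toml
# Sketch of hekad's global config section; key names per the Heka docs,
# please verify against your installed version.
[hekad]
# Number of message packs allocated up front. This is what actually
# bounds memory spent on in-flight messages.
poolsize = 100

# Buffer size of each plugin's input channel. Raising this does not
# allocate more packs; it only changes how many can queue per plugin.
plugin_chansize = 30
```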
This is the output from Heka's reporting mechanism (`sudo kill -SIGUSR1 $(pgrep hekad)`) on one of our high-traffic cache nodes and on the central logging box which receives the logs and pushes them into Elasticsearch: https://gist.github.com/mattbostock/ed5213b0607effc6813d The Heka logs on the central logging node are otherwise mostly empty, apart from the occasional Elasticsearch connection error (i.e. once a day or less). As you can see from the reports linked, inputRecycleChan and injectRecycleChan are both at full capacity (100). Could this be due to a failure in one of the plugins?
The "capacity" of a channel is the number of slots the channel has; that remains the same for the entire life of the Heka process. The "length" of a channel is how many items are in the channel at a given moment, and that's constantly changing.

So when I look at your report from the cache node, I can see that the inputRecycleChan is in fact completely empty. Looking a bit further down, pretty much all of the packs are sitting on the nginx_access_stat filter's input channel, which has a length of 99. Clearly the StatFilter is what's gumming up the works.

You've got a deadlock happening. Because your pool_size is the default of 100, you have 100 packs (i.e. messages) for all of the inputs to share, but you've cranked your plugin channel size all the way up to 300. Currently 99 of your 100 packs are sitting on the StatFilter's InChan. The other pack is inside the StatFilter itself, being processed: the StatFilter is extracting data from it and trying to hand that data to the StatAccumInput to be tallied up. The StatAccumInput, however, is trying to flush the last interval of aggregated data. To do that, it needs a pack from the inputRecycleChan. Which is empty. Because all of the packs are tied up in the StatFilter's input channel. Which can't move until the StatFilter can give a stat to the StatAccumInput. Rinse, lather, repeat.

So you're probably seeing by now that cranking the chan size up to 300 wasn't the right choice. You don't want a situation where it's easy for all of your packs to be tied up in a single plugin's channel like that.

There are two ways I can see to resolve this. The first is trivially easy, but it might not be sufficient; the second takes a bit more work, but I can help if it comes to that and you're willing to give it another go. The easy solution is to remove the plugin_chansize change and bump your global pool_size setting up to about 130.
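If it helps, here's a tiny self-contained Go sketch of the capacity-vs-length distinction and of how a pack pool gets exhausted. The types and names are illustrative stand-ins, not Heka's actual internals, and the pool is shrunk to 5 so the numbers are easy to read:

```go
package main

import "fmt"

// A toy "message pack" and a recycle pool, loosely modeled on Heka's
// inputRecycleChan. Names here are illustrative, not Heka's real types.
type pack struct{ payload string }

func main() {
	const poolSize = 5 // stand-in for Heka's default pool_size of 100
	recycle := make(chan *pack, poolSize)
	for i := 0; i < poolSize; i++ {
		recycle <- &pack{}
	}

	// cap() is fixed for the life of the channel; len() varies.
	fmt.Println(cap(recycle), len(recycle)) // 5 5

	// An oversized plugin input channel hoards every pack, as in the
	// deadlock described above:
	filterInChan := make(chan *pack, 300)
	for len(recycle) > 0 {
		filterInChan <- <-recycle
	}
	fmt.Println(cap(recycle), len(recycle)) // 5 0

	// An input needing a fresh pack would now block forever; a
	// non-blocking receive shows the pool is exhausted.
	select {
	case p := <-recycle:
		fmt.Println("got a pack", p)
	default:
		fmt.Println("pool exhausted: input would block")
	}
}
```

The point being: the channel's capacity never moved, but once every pack migrated onto one plugin's input channel, the recycle pool's length hit zero and anything that needs a pack stalls.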
This will mean that even if the match chan and the in chan for your StatFilter and your TcpOutput are completely full, there will still be some packs left over for the StatAccumInput to use. If you try this and data continues to flow without getting gummed up, great.

Unfortunately, it might be the case that you're producing data more quickly than the StatFilter can process it. The StatFilter works, but it's trying to handle the general case and it's not very efficient, alas. If it can't handle the volume of messages it's receiving, then there's gonna be backpressure, and tweaking the pool and channel sizes isn't going to help.

If that's the case, the other option is to stop using the StatFilter and StatAccumInput altogether. Instead, we can write some Lua in a SandboxFilter that will catch the same messages the StatFilter is currently catching, extract only the data you care about from those messages much more efficiently, and periodically emit the accumulated tallies in the same format the StatAccumInput currently generates. This would probably be about 30 minutes of work; if you're interested in trying it out, I'd be happy to whip something up for you.

Anyway, sorry for the headaches, and thanks again for letting us know about them.

Cheers!

-r

_______________________________________________
Heka mailing list
[email protected]
https://mail.mozilla.org/listinfo/heka

