On 19 May 2015 at 20:50, Rob Miller <[email protected]> wrote:
> On 05/19/2015 03:16 AM, bruno binet wrote:
>> On 18 May 2015 at 20:10, Rob Miller <[email protected]> wrote:
>>> - The `injectRecycleChan` is completely empty. That means that all
>>> 100 of those packs are in use, and any filter that is trying to
>>> inject a message will be blocked waiting for one of the packs in
>>> the injection pool to be freed up.
>>
>> Ok. For my information, what is the difference between
>> `inputRecycleChan` and `injectRecycleChan`?
>
> Input plugins get their unpopulated message structs from the
> inputRecycleChan. Filter plugins get their unpopulated message
> structs from the injectRecycleChan. When a message is recycled enough
> times that its reference count hits zero, its data will be cleared
> out and the pack will be returned to the recycle channel from which
> it originated.
>
>>> - Both the input and matcher channel for the
>>> `MainRegexDispatchFilter` are completely full, which means that
>>> this is the filter that is blocking everything. Probably this
>>> filter is blocked trying to inject a message, which it can't do
>>> because the injectRecycleChan is empty. Both channels being full
>>> accounts for 60 messages being tied up, which I'm guessing are 60
>>> out of the 100 in our injection pool.
>>
>> For the MainRegexDispatchFilter, what is the difference between
>> InChanCapacity and MatchChanCapacity?
>
> Every message matcher has a channel onto which the router places each
> message, incrementing the message's reference count. The matchers
> pull from their channels and check the message to see if there's a
> match. If no, then the message is recycled, which decrements the ref
> count. If yes, the message is placed on the corresponding filter or
> output plugin's channel.
>
> So the MatchChan* values are related to the channel leading up to the
> message matcher, while the InChan* values are related to the channel
> that is between the matcher and the plugin.
Thanks for the explanations, they help me better understand Heka's
internals and how to fine-tune the hekad configuration. Maybe it would
be useful to others to copy what you've just described about
MatchChan*, InChan*, inputRecycleChan, and injectRecycleChan into the
Heka internal-state monitoring documentation here:
http://hekad.readthedocs.org/en/v0.9.2/monitoring/index.html

>>> This doesn't get us all the way there, though. I'm guessing that
>>> the log output about '38 packs have been idle more than 120
>>> seconds' is a clue about where most of the other 40 packs are. I'd
>>> need to see the code in those filters, and the config that's wiring
>>> it all together, to dig further.
>>
>> Here is the repository for the filters and configuration in use:
>> https://github.com/bbinet/heka-hl-sandboxes
>>
>> And here is the MainRegexDispatchFilter code and config:
>> https://github.com/bbinet/heka-hl-sandboxes/blob/master/filters/regex_dispatch_metric.lua
>> https://github.com/bbinet/heka-hl-sandboxes/blob/master/toml/filter_dispatch.toml#L33-74
>>
>> Do you see anything wrong here?
>
> A superficial look is all I have time for, but at first glance I
> don't see anything wrong, per se. But I do see some stuff that
> supports the deadlock hypothesis. It looks like you have some filters
> emitting messages that other filters are then consuming.

Yes.

> Also, you have filters that are potentially emitting several output
> messages for every single input message.

Hmm, I don't remember any filter that would emit several output
messages for every single input message: which ones are you referring
to?

> Those are absolutely reasonable things to be doing, but if the pool
> size is too small they can lead to the situation I described. One
> filter might get a big slice of scheduler time, during which it emits
> a bunch of messages to the router. Then the router and the matchers
> get time, so the router and match channel fill up.
> Then finally the second filter gets time, and while doing its job it
> calls inject_message ten times, which means it's trying to pull ten
> messages off the injectRecycleChan. If there aren't ten messages
> available, then it blocks, waiting for more messages to become
> available. But they won't become available, because most of them are
> sitting on the match chan and input chan for that very filter. Hence
> the deadlock.
>
> The solution is to tweak the pool size and/or chan size so the number
> of available messages is high enough that your use case won't consume
> them all and get stuck.

Yes, that makes sense.

>>>> Why does it work for ~40 min before it starts to fail?
>>>
>>> There are a couple of possibilities. If it's packs not being
>>> recycled that's the problem, then this would mean there's a slow
>>> drip. First you lose one pack, then another, then another, until
>>> after 40 minutes enough of them are unavailable that pack
>>> exhaustion becomes a problem and you hit the deadlock.
>>
>> I don't think that's the case here, because it can sometimes fail
>> much sooner or much later than 40 minutes.
>>
>>> Another possibility is that things are working fine until the Go
>>> scheduler happens to land you in an unlucky place. Imagine that you
>>> have a filter (filter A) that is injecting a lot of messages, all
>>> of which will be picked up by another filter (filter B) that will
>>> in turn inject more messages. If both filters are given about equal
>>> time by the scheduler then everything will flow smoothly. But if
>>> the first filter gets a longer chunk of time, then it might consume
>>> all of the messages in the injection pool, most of which will be
>>> backed up in filter B's input channels. Then the scheduler switches
>>> to filter B, but it's now blocked because it needs a pack from the
>>> injection pool, and there aren't any.
>>>
>>> My guess is that second scenario is what you're hitting.
>>> If I'm right, then your problem might go away if you just increase
>>> the poolsize setting (see
>>> https://hekad.readthedocs.org/en/latest/config/index.html#global-configuration-options).
>>> Try cranking it up to 200 and see if that has any impact, and let
>>> us know how it goes.
>>
>> Yes, it seems I'm facing this second scenario. I increased the
>> poolsize to 200 yesterday evening, and Heka has now been running
>> smoothly for more than 17 hours.
>
> Great!
>
>> One more question: if I don't want to increase the poolsize value
>> (to minimize the memory that Heka consumes), would it be equivalent
>> to decrease the `plugin_chansize` value down to 15?
>
> Roughly, yes. The one thing to pay attention to is that reducing the
> chan size too much will cause more scheduler churn, which might
> impact performance. In my testing things seem to flow smoothly all
> the way down to a chan size of about 25, but below that you can start
> to see throughput drop a bit, since context switching happens more
> often. I'd try tweaking the pool size and chan size settings until
> you find what works best for you.

I see, I'll run some performance tests and pick the best values for my
use case.

Thank you very much for taking the time to answer all my questions!

Bruno
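For anyone following along, both knobs discussed above live in the global [hekad] section of the config file. A sketch (the option names are from the global configuration docs linked above; the values here are just the ones tried in this thread, not recommendations):

```toml
# hekad.toml -- global section
[hekad]
poolsize = 200        # default is 100; raising it leaves headroom for
                      # filter-to-filter injection chains
plugin_chansize = 30  # default; per the discussion above, dropping much
                      # below ~25 may cost throughput via extra context
                      # switching
```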
_______________________________________________
Heka mailing list
[email protected]
https://mail.mozilla.org/listinfo/heka

