On 05/19/2015 03:16 AM, bruno binet wrote:
On 18 May 2015 at 20:10, Rob Miller <[email protected]> wrote:

    - The `injectRecycleChan` is completely empty. That means that all
    100 of those packs are in use, and any filter that is trying to
    inject a message will be blocked waiting for one of the packs in the
    injection pool to be freed up.


Ok. For my information, what is the difference between
`inputRecycleChan` and `injectRecycleChan`?
Input plugins get their unpopulated message structs from the inputRecycleChan. 
Filter plugins get their unpopulated message structs from the 
injectRecycleChan. When a message has been recycled enough times that its 
reference count hits zero, its data is cleared and the pack is returned to 
the recycle channel from which it originated.
    - Both the input and matcher channel for the
    `MainRegexDispatchFilter` are completely full, which means that this
    is the filter that is blocking everything. Probably this filter is
    blocked trying to inject a message, which it can't do because the
    injectRecycleChan is empty. Both channels being full accounts for 60
    messages being tied up, which I'm guessing are 60 out of the 100 in
    our injection pool.


For the MainRegexDispatchFilter, what is the difference between
InChanCapacity and MatchChanCapacity?
Every message matcher has a channel onto which the router places each message, 
incrementing the message's reference count. The matchers pull from their 
channels and check the message to see if there's a match. If no, then the 
message is recycled, which decrements the ref count. If yes, the message is 
placed on the corresponding filter or output plugin's channel. So the 
MatchChan* values size the channel leading into the message matcher, while 
the InChan* values size the channel between the matcher and the plugin itself.
    This doesn't get us all the way there, though. I'm guessing that the
    log output about '38 packs have been idle more than 120 seconds' is
    a clue about where most of the other 40 packs are. I'd need to see
    the code in those filters, and the config that's wiring it all
    together, to dig further.


Here is the repository for the filters and configuration in use:
https://github.com/bbinet/heka-hl-sandboxes

And here is the MainRegexDispatchFilter code and config:
https://github.com/bbinet/heka-hl-sandboxes/blob/master/filters/regex_dispatch_metric.lua
https://github.com/bbinet/heka-hl-sandboxes/blob/master/toml/filter_dispatch.toml#L33-74

Do you see anything wrong here?
A superficial look is all I have time for, but at first glance I don't see 
anything wrong, per se. But I do see some stuff that supports the deadlock 
hypothesis. It looks like you have some filters emitting messages that other 
filters are then consuming. Also, you have filters that are potentially 
emitting several output messages for every single input message.

Those are absolutely reasonable things to be doing, but if the pool size is too 
small they can lead to the situation I described. One filter might get a big 
slice of scheduler time, during which it emits a bunch of messages to the 
router. Then the router and the matchers get time, so the router and match 
channel fill up. Then finally the second filter gets time, and while doing its 
job it calls inject_message ten times, which means it's trying to pull ten 
messages off the injectRecycleChan. If there aren't ten messages available, 
then it blocks, waiting for more messages to become available. But they won't 
become available, because most of them are sitting on the match chan and input 
chan for that very filter. Hence the deadlock.

The solution is to tweak the pool size and/or chan size so the number of 
available messages is high enough that your use case won't consume them all and 
get stuck.
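
Concretely, that tuning lives in hekad's global config section. A sketch of what that might look like (check the global-configuration docs linked below for the exact option names and defaults on your version):

```toml
[hekad]
# Raised from the default of 100 so concurrent injection by multiple
# filters can't exhaust the pack pool.
poolsize = 200
# Alternatively, shrink the per-plugin channel size to bound how many
# packs can pile up on any one channel; values much below ~25 may cost
# throughput (see discussion further down in this thread).
plugin_chansize = 30
```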
        Why does it work for ~40 min before it starts to fail?

    There are a couple of possibilities. If it's packs not being
    recycled that's the problem, then this would mean there's a slow
    drip. First you lose one pack, then another, then another, until
    after 40 minutes enough of them are unavailable that pack exhaustion
    becomes a problem and you hit the deadlock.


I don't think that's the case here, because it can sometimes fail much sooner
or much later than 40 min.

    Another possibility is that things are working fine until the Go
    scheduler happens to land you in an unlucky place. Imagine that you
    have a filter (filter A) that is injecting a lot of messages, all of
    which will be picked up by another filter (filter B) that will in
    turn inject more messages. If both filters are given about equal
    time by the scheduler then everything will flow smoothly. But if the
    first filter gets a longer chunk of time, then it might consume all
    of the messages in the injection pool, most of which will be backed
    up in filter B's input channels. Then the scheduler switches to
    filter B, but it's now blocked because it needs a pack from the
    injection pool, and there aren't any.

    My guess is that second scenario is what you're hitting. If I'm
    right, then your problem might go away if you just increase the
    poolsize setting (see
    
https://hekad.readthedocs.org/en/latest/config/index.html#global-configuration-options).
    Try cranking it up to 200 and see if that has any impact, and let us
    know how it goes.


Yes, it seems I'm hitting this second scenario. I increased the poolsize to
200 yesterday evening, and Heka has now been running smoothly for more than
17 hours.
Great!
One more question: if I don't want to increase the poolsize value (to keep
Heka's memory consumption down), would decreasing the `plugin_chansize` value
to 15 be equivalent?
Roughly, yes. The one thing to pay attention to is that reducing the chan size 
too much will cause more scheduler churn, which might impact performance. In my 
testing things seem to flow smoothly all the way down to a chan size of about 
25, but below that you can start to see throughput drop a bit since context 
switching happens more often. I'd try tweaking the pool size and chan size 
settings until you find what works best for you.

Thanks a lot for your help!
Bruno
You're welcome!

-r

_______________________________________________
Heka mailing list
[email protected]
https://mail.mozilla.org/listinfo/heka
