On 11/01/2014 03:23 PM, David Birdsong wrote:
I have a heka instance that's upstream of another heka instance that's
queued ~60GB of output. So the moment the upstream heka starts, it gets
pounded with a heavy workload.
The upstream heka runs for exactly 20 mins before halting all progress
and printing the idle packs error message. As always, the error lists
every plugin and attributes the same count to each of them:
https://gist.github.com/davidbirdsong/f5d1bf140c865f9b2d72
Sheesh. Just took a look at our pack tracking code and found a bug. We're
adding the tracking stamps to the packs when they get handed to the message
matchers
(https://github.com/mozilla-services/heka/blob/dev/pipeline/router.go#L169)
when we *should* be adding the stamps only when the message matcher actually
returns a match
(https://github.com/mozilla-services/heka/blob/dev/pipeline/router.go#L311).
This will still have the limitation that it will show you all of the plugins
that are processing the pack, but that's still an improvement over including
every filter and output plugin that you have.
Luckily this is an easy fix; I've opened an issue on it:
https://github.com/mozilla-services/heka/issues/1167
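
To make the intended behavior concrete, here is a minimal sketch of stamping
a pack only when a matcher returns a match. It is not the actual router.go
code: the Pack and Matcher types, the Stamps field, and the route function
are all hypothetical stand-ins invented for this illustration.

```go
package main

import "fmt"

// Pack is a hypothetical stand-in for a Heka pipeline pack.
type Pack struct {
	Payload string
	Stamps  []string // names of plugins recorded as holding this pack
}

// Matcher is a hypothetical, simplified message matcher for one plugin.
type Matcher struct {
	Plugin string
	Match  func(p *Pack) bool
}

// route stamps the pack only for matchers that actually match, so a later
// idle-pack report would name just the plugins that received the pack.
func route(p *Pack, matchers []Matcher) {
	for _, m := range matchers {
		if m.Match(p) {
			p.Stamps = append(p.Stamps, m.Plugin)
		}
	}
}

func main() {
	matchers := []Matcher{
		{Plugin: "NginxFilter", Match: func(p *Pack) bool { return p.Payload == "nginx" }},
		{Plugin: "DebugOutput", Match: func(p *Pack) bool { return false }},
	}
	p := &Pack{Payload: "nginx"}
	route(p, matchers)
	fmt.Println(p.Stamps)
}
```

With the stamp moved inside the match check, a plugin whose matcher rejects
the pack never appears in that pack's list.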
Prior to the error, the http dashboard indicates that the decoder
plugin channels are usually full or close to full, but are still moving
messages through them, at roughly 8-11k/sec decode rates depending on
whether there are 1 or 2 downstream hekas connected. After the error,
message matchers and plugin channels all show 0 messages.
You might know this already, but it's worth noting that after Heka is wedged
you can no longer trust the HTTP dashboard, since it uses Heka's own message
routing mechanism to receive the information that it's showing. If messages
aren't flowing, the dashboard won't be updated. Don't despair, however: if you
send Heka a SIGUSR1 signal it will generate all of the report data and dump it
to stdout, bypassing the message routing altogether. This will give you an
accurate picture of what's going on when you're wedged.
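
For anyone curious how a signal-triggered report can sidestep the router, here
is a rough sketch of the general pattern in Go. It is not Heka's report code:
formatReport and the per-plugin counters are hypothetical, and the program
signals itself only so that the example runs to completion on a Unix system.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"sort"
	"syscall"
)

// formatReport renders one line per plugin, sorted by name. It reads the
// counters directly and never touches a message router.
// The counters map is a hypothetical stand-in for real plugin state.
func formatReport(inFlight map[string]int) string {
	names := make([]string, 0, len(inFlight))
	for name := range inFlight {
		names = append(names, name)
	}
	sort.Strings(names)
	out := ""
	for _, name := range names {
		out += fmt.Sprintf("%s: %d\n", name, inFlight[name])
	}
	return out
}

func main() {
	inFlight := map[string]int{"ProtobufDecoder": 50, "TcpOutput": 0}

	// Register for SIGUSR1 before any signal can arrive.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)

	// Signal ourselves so the sketch terminates; in practice the signal
	// would come from outside the process.
	if err := syscall.Kill(os.Getpid(), syscall.SIGUSR1); err != nil {
		fmt.Println("kill failed:", err)
		return
	}

	<-sigs
	fmt.Print(formatReport(inFlight)) // written straight to stdout
}
```

Against a running process the signal would normally be sent from a shell with
something like kill -USR1 followed by the process ID.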
What's a good course of action for rooting out and fixing the cause?
The first two steps would be what I mentioned above: fix the bug so that the
number of plugins shown to be holding the packs will be narrowed down at least
a little bit, and use SIGUSR1 to make sure you have an accurate picture of the
state of the system once the wedging happens. Obviously the latter step is
easier than the former; hopefully it points you in the right direction w/o
having to actually touch the code.
Since this heka is upstream and is well positioned to apply back
pressure, I'm trying out configuring the plugin channel to 0. It seems
that's the best way to cut down on messages lost on restart, but I don't
get why it would impact back pressure since a full channel is the same
as a blocking channel (right?).
Yup. Plus when your channels are 0 length you'll end up w/ a lot more
blocking and overall performance will suffer.
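
A small, generic Go sketch (nothing Heka-specific; sendsBeforeBlocking is a
name made up for this example) shows why channel size doesn't change whether
back pressure happens, only how many items can sit in the channel before the
sender blocks:

```go
package main

import "fmt"

// sendsBeforeBlocking counts how many sends succeed on a channel of the
// given capacity when nothing is receiving from it.
func sendsBeforeBlocking(capacity int) int {
	ch := make(chan int, capacity)
	n := 0
	for {
		select {
		case ch <- n:
			n++
		default:
			return n // the next send would block
		}
	}
}

func main() {
	for _, c := range []int{0, 1, 30} {
		fmt.Printf("capacity %d: %d sends before blocking\n", c, sendsBeforeBlocking(c))
	}
}
```

An unbuffered channel blocks on the very first send, while a buffered one
blocks as soon as it is full. Either way the sender stalls until a receiver
catches up; the buffer only decides how much is queued at that point, which is
also what can be lost on restart.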
All hail Heka son of Atum.
Praise Bob!
-r