Thank you, Rob.

You gave very thorough answers to all of the questions. It was interesting and 
reassuring to learn about your team's plans for Heka and its reliability. We 
chose Heka for its effectiveness: it does exactly what it's intended to do 
without being monstrous.
For now, we've decided to use Heka with file parsing, which effectively acts as 
disk buffering on input. In practice, we are decreasing the probability of 
losing messages by increasing the probability of duplicating them. It may be 
more effective to find and eliminate duplicates (or simply keep storing them) 
than to fight for total reliability.


---- On Fri, 17 Oct 2014 00:57:17 +0700 Rob Miller <[email protected]> 
wrote ---- 

On 10/15/2014 11:58 PM, Denis Shashkov wrote: 
> 
> (Sorry, I hadn't noticed these issues when I browsed through them all.) 
> 
> Thank you, that's great news about buffering! 
 
Glad you think so. 
 
> Have you considered making the Heka pipeline synchronous or transactional? 
 
We've never considered making the pipeline synchronous or transactional, 
no. We have considered taking other methods to improve reliability, such 
as writing data to a disk queue in the input layer, and tracking what 
messages have been successfully processed by the entire pipeline (for 
some definition of "successfully processed"), so that we'd be able to 
re-process any messages not marked as such upon restart after a crash. 
We've also considered adding disk queues to more places in Heka's 
pipeline, so that any place that's processing messages has its own queue 
that could be reprocessed rather than thinking of the whole pipeline as 
a single entity. But, alas, our resources are limited, and tackling this 
particular issue isn't on our short list, beyond the output buffering 
that I already mentioned. 
 
> I mean something like that: 
> - if an output plugin cannot write a message batch (more than one message) 
> to its destination, or cannot finish writing, it just blocks 
 
Currently our output disk buffer will continue to grow indefinitely. We 
have an issue open to improve this: 
 
https://github.com/mozilla-services/heka/issues/1110 
 
An obvious question this brings up is "what happens when the buffer 
grows to the maximum size?" I see three choices: shut down Heka, drop 
data on the floor, or stop pulling from the input channel which will 
cause back pressure to be applied to the rest of the pipeline. The third 
choice is what you're describing, and once the hooks for that are in 
place it could be used whether or not a disk buffer was actually in 
play, just think of it as the max buffer size set to 0. 
 
This would need to be configurable, obviously, b/c in many cases 
blocking the rest of Heka will not be desired behavior, but having this 
as a possibility is on our radar. 
 
> - all tied modules (encoders, filters, pipeline, decoders, inputs) also 
> eventually block 
 
It's a bit tricky to say exactly what defines a "tied module" here. One 
of the reasons Heka is so versatile is that the inputs, filters, and 
outputs are all loosely coupled, with the router and the 
message_matchers being the glue that ties things together. Currently, if 
any of the outputs or filters block, then that back pressure will flow 
to the router, which will in turn block *all* of the inputs, so the 
whole pipeline stops. This is already true, it's just not very obvious 
b/c hardly any of the outputs ever block, except when there are bugs, so 
the behavior doesn't show up very often. 
 
> - while an input plugin is blocked, it doesn't acknowledge input data (e.g. 
> LogstreamerInput doesn't write its position to the journal, and 
> HttpListenInput doesn't respond). 
 
This is already the case in most cases. If the router is backed up, the 
input (or the decoder that an input is using) will eventually block on 
dropping messages on the router's channel, which will prevent the input 
from continuing to process incoming data. LogstreamerInput will stop 
reading from the input files, TcpInput will stop accepting data, etc. I 
haven't looked into what HttpListenInput will do; it's possible that it 
will continue to accept incoming HTTP requests, accumulating a growing 
set of goroutines, each blocked on the stuck router. I'd say this is a 
bug, and each input should be looked at on a case by case basis to make 
sure that when Heka is backed up the failure modes are reasonable. 
 
> This mode could prevent message loss entirely. (But it might decrease 
> performance.) 
> Right now Heka isn't protected from crashes or similar failures. If the OOM 
> killer kills Heka, I'll lose 3 messages at minimum (because of the channels 
> in the pipeline plus the input and output plugins). 
 
You'll lose many more than 3 in the default configuration. Every decoder 
has an in channel, the router has an in channel, every message matcher 
has an in channel, as well as every filter and output. By default each 
of these channels is 50 deep, so a busy Heka that crashes could actually 
be losing hundreds of messages. This channel size is configurable, you 
could even set it to 0 if you want unbuffered channels. We could 
probably stand to lower the default a bit. But once you get to a channel 
size of less than about 20, you might see a slight performance drop, and 
when you get down to very low numbers (less than 3), you'll see 
considerable loss of throughput as blocking increases significantly. 
 
We've been very clear from the beginning that a) Heka isn't making any 
guarantees w.r.t. message delivery and b) if you absolutely *can't* 
afford to lose any data, you should use Heka in connection w/ additional 
tools and processes to make sure you're not relying on Heka itself to 
never drop anything, or you should maybe not use Heka at all. We'd love 
to support a much higher level of reliability, we have ideas about how 
to do so, and some of them we're planning to implement, but getting it 
to a rock-solid "we promise we won't ever lose a message" is not our 
highest priority, unfortunately. If any individuals or companies out 
there are interested in supporting such an undertaking, I'd be thrilled 
to work with them providing guidance to help make it happen, but that's 
all I can offer. 
 
That being said, our experience actually using Heka is that it's 
generally pretty reliable, and message loss hasn't been a huge issue. 
And there are tons of use cases (most of them?) where a bit of loss is 
perfectly acceptable. We have Heka aggregators processing around 500 
million messages per day; at that volume, and with what we're doing with 
our data, losing a few hundred messages here and there isn't a big deal. 
Every case is different, though, and we do plan on continuing to 
incrementally work towards becoming more and more reliable, being up 
front about our limitations along the way. 
 
Hope this helps clarify, 
 
-r 
 
> 
> ---- On Wed, 15 Oct 2014 22:58:07 +0700 Rob Miller 
> <[email protected]> wrote ---- 
> 
> Nimi is right that the TcpOutput actually does buffer messages to disk. 
> Originally that functionality was built directly in to the TcpOutput, 
> but we later abstracted it out so it could be used by other output 
> plugins. 
> 
> What hasn't been mentioned so far is that we have plans to change the 
> interaction btn encoders and outputs and, as part of that, we plan on 
> making the output buffering automatically available as a configuration 
> option for *every* output. There are already a couple of issues open to 
> capture this: 
> 
> https://github.com/mozilla-services/heka/issues/930 
> 
> and: 
> 
> https://github.com/mozilla-services/heka/issues/1103 
> 
> which contains this relevant comment: 
> 
> https://github.com/mozilla-services/heka/issues/1103#issuecomment-58548339 
> 
> 
> This will be a fair amount of work. It is all intended to land 
> before we 
> release Heka 1.0, which is targeted for January 2015. 
> 
> -r 
> 
> On 10/15/2014 03:48 AM, Denis Shashkov wrote: 
> > 
> > Hello! 
> > 
> > AFAIK, Heka currently doesn't guarantee message delivery when an output 
> > plugin can't write a message: 
> > - output plugins don't buffer or retry write operations, 
> > - there is no interior buffer between the pipeline and output plugins (I 
> > know about channels, but they have finite length and are located in 
> > memory), 
> > - you cannot write a buffering filter (because you cannot write all 
> > messages back into the pipeline). 
> > 
> > (Please correct me if I'm wrong.) 
> > 
> > I've thought a lot about how not to lose messages when my storage (e.g. an 
> > HTTP server) is unavailable. I decided it would be great if some output 
> > plugin were special: if the other outputs cannot write messages, this 
> > fallback output would write them instead. 
> > 
> > Can I do this without touching the pipeline code? 
> > 
> 
> 
> 
> 
> 
> _______________________________________________ 
> Heka mailing list 
> [email protected] 
> https://mail.mozilla.org/listinfo/heka 
> 



