2015-08-21 12:19 GMT+02:00 Otis Gospodnetić <[email protected]>:
> Hi,
>
> This sounds like something that should be om-specific. What Radu is
> suggesting would definitely help with ES, but may not be relevant for
> other output targets.
>
> What I think is overlooked here is the ES side, more specifically the
> searches that ES has to handle. If we don't care about maxing out ES and
> just push data into it as fast as it arrives, then how
> rsyslog/omelasticsearch works today makes sense. But this approach is
> focused on ingestion and ignores how it can hurt ES's ability to handle
> queries in a timely manner. Exposing the controls Radu suggested would
> help people avoid this problem. I know David would like to see numbers :)
> I love numbers, too, but I'm not sure we'll have the time to provide
> them :( That said, we work with ES 24/7 and have been doing so for years
> (many hundreds of ES deployments under our belt by now), so I am hoping
> somebody will trust us that this option would be great to have in
> omelasticsearch. :)
Not reading the full thread, I, too, think this makes sense. It would need
to go into the queue engine, as that is the only place where it can
decently be done. Done properly, it should not hurt performance for other
cases. But it needs careful implementation. I suggest opening a GitHub
issue to track this, so that I remember it when I have time later this
year (probably November+).

Rainer

> Thanks,
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
> On Fri, Aug 21, 2015 at 2:24 AM, David Lang <[email protected]> wrote:
>
>> On Fri, 21 Aug 2015, Radu Gheorghe wrote:
>>
>>> Hello rsyslog users :)
>>>
>>> We've seen a problem that is similar to the one reported here:
>>> http://www.gossamer-threads.com/lists/rsyslog/users/17550
>>> While that looks like a bug, ours seems like a design issue.
>>>
>>> Basically we see bulks of one document all over the place. I'm not
>>> 100% sure what the root cause is, but I'm thinking: if you have many
>>> machines with rsyslog installed that send logs to Elasticsearch, but
>>> most of them send few logs, they would never get enough messages in
>>> the queue to push in large batches. Unless you add a slowdown, in
>>> which case you restrict rsyslog's ability to push data when it's
>>> under load.
>>
>> If you have all your systems send to a central aggregation point,
>> rather than into ES directly, that aggregation point is going to have
>> the combined traffic, and is much more likely to have data available
>> to send.
>>
>>> If you have 10K docs/s coming in 1-doc batches (say, from 10K
>>> machines), there's a lot of unnecessary load on ES. Sure, if ES is
>>> overloaded things will get better (as documents add up in queues,
>>> resulting in bigger batches), but even then I'd imagine things will
>>> look quite inefficient. Plus, I'd like to avoid ES being overloaded
>>> in the first place.
>>>
>>> The solution, in my mind, was to add two options:
>>> - one that says "if you don't have at least N items in the bulk, wait
>>>   a bit until you have"
>>> - one that overrides it, saying "if M seconds have passed since the
>>>   last bulk, send the bulk anyway"
>>
>> This sort of logic tends to be rather fragile (setting timers, checking
>> how long it's been, etc. ends up really hurting you when you are under
>> load). It's also the sort of thing that is routinely misconfigured in
>> ways that really hurt you.
>>
>> The approach that rsyslog takes is to send something as soon as it's
>> available, let things queue up while that's being processed, and then
>> send what's queued up (with a max limit).
>>
>> This has the advantage of simplicity and performance. There are no
>> timers to set up, no timestamps to check, and the latency in message
>> delivery is the minimum possible.
>>
>> As a result, the sort of change you are looking for will almost
>> certainly not go into the core. I believe that the ES module has its
>> own buffer of messages that it's sending, so it could go there (IIRC
>> the omelasticsearch module was contributed).
>>
>> Now, where does this help?
>>
>> When traffic is really slow, this won't help; everything will still be
>> singletons.
>>
>> When traffic is heavy (just above the minimum batch size), this won't
>> help either; everything will be sent the same way with either set of
>> logic.
>>
>> There is a middle ground where fewer, but larger, batches are being
>> sent, and things will flow more efficiently. How much of a difference
>> does this make?
>>
>> I don't think it will make much difference, but I can be convinced by
>> numbers.
>>
>> Let's investigate what the best-case situation is (I don't have the
>> numbers for this, so we'll have to do some research).
>>
>> The best case is where, without this setting, rsyslog would send
>> singleton messages, but with this setting it would batch up exactly
>> minbatch messages and send them.
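The two options proposed above (a minimum bulk size, overridden by a
maximum wait time) can be sketched roughly as below. `Batcher`,
`min_size`, and `max_wait` are illustrative names for this sketch, not
actual rsyslog or omelasticsearch parameters:

```python
import time

class Batcher:
    """Collect items and release them as one batch when either min_size
    items have accumulated, or max_wait seconds have passed since the
    first buffered item arrived (the time-based override)."""

    def __init__(self, min_size, max_wait, clock=time.monotonic):
        self.min_size = min_size
        self.max_wait = max_wait
        self.clock = clock          # injectable for testing
        self.items = []
        self.first_at = None        # arrival time of oldest buffered item

    def add(self, item):
        """Buffer one item; return a batch if one is now due, else None."""
        if not self.items:
            self.first_at = self.clock()
        self.items.append(item)
        return self._maybe_flush()

    def poll(self):
        """Call periodically to enforce the max_wait override."""
        return self._maybe_flush()

    def _maybe_flush(self):
        if not self.items:
            return None
        expired = self.clock() - self.first_at >= self.max_wait
        if len(self.items) >= self.min_size or expired:
            batch, self.items, self.first_at = self.items, [], None
            return batch
        return None
```

Note that `poll()` is exactly the timer-driven piece David objects to:
someone has to wake up and check the clock even when no messages arrive,
which is where the fragility under load comes from.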
>>
>> What sort of setting are you thinking of for your 'minimum size' batch?
>>
>> On the sender side, each batch sent has a fairly small overhead; the
>> message being sent doesn't have much overhead beyond the messages being
>> inserted. There is going to be some additional RAM used to hold on to
>> these logs, but the system is idle, so it really shouldn't hurt. What I
>> think is more likely to hurt is that when things go wrong, more data
>> will be lost.
>>
>> On the receiver's side, how much of a performance benefit is there?
>> (This depends on the internals of ES.)
>>
>> Batch mode was created in rsyslog because my testing was showing that
>> on low-end hardware I could insert ~1000 records into postgres as a
>> batch in the same time that it took to insert two records individually.
>>
>> Can we get someone who has an ES setup to run a test? Force the batch
>> size to 1 and hammer it until you reach the max rate, then set the
>> batch size really large and keep increasing the dequeue delay time
>> until the total rate of inserts drops back to the same, and report how
>> large the delay time needs to be for them to even out. Also report the
>> load on the ES server under 'many small' vs. 'few large' (vmstat and
>> iostat output, and possibly /proc/meminfo, so that we can see disk,
>> RAM, and CPU utilization).
>>
>> The recent request to add compression to the ES transaction will matter
>> here as well; a larger batch will compress better.
>>
>> For example, if you are talking 1 vs. 5 messages/batch, will that
>> really make a difference on the ES server? If so, how big a difference?
>> If it's a 500% improvement, but the 'bad' situation is only using 20%
>> of a CPU on ES, do we care? If ES does the insert into a data structure
>> in RAM, and then pushes it out to disk and updates its indexes to make
>> things visible only every several seconds, then it may be that there is
>> no noticeable difference between the two modes for quite a while.
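The batch-size comparison asked for above could be driven by a small
harness along these lines. `throughput` and `send` are hypothetical names
for this sketch; `send` stands in for whatever performs the actual insert
(e.g. an HTTP POST to Elasticsearch's _bulk endpoint):

```python
import time

def throughput(send, docs, batch_size):
    """Measure documents/second for a given batch size.

    Splits `docs` into consecutive batches of `batch_size` and calls
    `send` once per batch; returns the overall docs/second rate.
    """
    start = time.monotonic()
    for i in range(0, len(docs), batch_size):
        send(docs[i:i + batch_size])
    elapsed = time.monotonic() - start
    return len(docs) / elapsed if elapsed > 0 else float("inf")
```

Running this with `batch_size=1` and then a large value against a real
ES instance, while watching vmstat/iostat on the server, would produce
the numbers the thread is asking for.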
>> If we push the single-item rate until the server can't keep up, we
>> will be hitting some limit.
>>
>> There's also the question of the value of what's being saved.
>> Depending on which resource on the ES server ends up being the
>> bottleneck that is saved by using larger batches, it may be that it's
>> not something that would really make the ES server noticeably better
>> if it wasn't being used. It's also possible that we will find that
>> it's a really critical resource and would make a huge difference.
>>
>> Also, should the minimum queue size be based on the number of
>> messages, or on the size of the data being sent?
>>
>> You could also test this by having a program that inserts into ES,
>> reads from stdin, and is set up via omprog. It would cache everything
>> up until the minimum batch size, possibly with a signal that forces it
>> to flush its cache _now_, so you can experiment with timing by
>> changing the rate at which you send signals to it from an external
>> script, and the sending code doesn't need any of the clock logic in
>> it.
>>
>> We know the cost to rsyslog of doing something like this, but we don't
>> know the benefits.
>>
>>> Now the big questions:
>>> - is this possible? where would one apply such a change?
>>> - would it have a significant impact on the performance of outputs
>>>   that work well with the current design? Like omfwd, where the
>>>   receiving end wouldn't care how many docs it receives, I imagine
>>> - if it does have a significant impact, can we restrict such a change
>>>   to omelasticsearch, or does it have to go into rsyslog's core (in
>>>   the way it handles queues)?
>>> - do you see better solutions?
>>
>> I think the answer is that it would hurt in the general case and be
>> very invasive and not the right thing for many outputs, but it may be
>> the right thing for some outputs, so let's test and see.
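The omprog test helper described above (cache stdin until a minimum
batch size, flush early on a signal) could be sketched like this. The
flush body is a stub, and `MIN_BATCH` is an illustrative value; a real
helper would bulk-insert the cached lines into Elasticsearch:

```python
#!/usr/bin/env python3
"""Sketch of an omprog-driven test helper: read log lines from stdin,
cache them until MIN_BATCH lines accumulate, and flush early whenever
SIGUSR1 arrives, so an external script controls the flush timing."""
import signal
import sys

MIN_BATCH = 500   # illustrative; the thread asks what a good value is
buffer = []

def flush():
    """Send and clear whatever is cached (stubbed out here)."""
    if buffer:
        sys.stderr.write("flushing %d lines\n" % len(buffer))
        # a real helper would POST `buffer` to ES's _bulk endpoint here
        buffer.clear()

def main():
    # the external script paces flushes by sending SIGUSR1, so this
    # program needs no clock logic of its own
    signal.signal(signal.SIGUSR1, lambda signum, frame: flush())
    for line in sys.stdin:
        buffer.append(line)
        if len(buffer) >= MIN_BATCH:
            flush()
    flush()  # drain whatever remains at EOF

if __name__ == "__main__":
    main()
```

This matches the design point in the thread: the sender has no timers
at all; any time-based behavior lives in the external script that sends
the signals.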
>>
>> David Lang
>>
>> _______________________________________________
>> rsyslog mailing list
>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>> http://www.rsyslog.com/professional-services/
>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
>> myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST
>> if you DON'T LIKE THAT.

