Re: [rsyslog] Can we have a minimum bulk size for omelasticsearch?

Rainer Gerhards Sun, 23 Aug 2015 23:29:15 -0700

2015-08-24 7:42 GMT+02:00 Radu Gheorghe <[email protected]>:
> On Sat, Aug 22, 2015 at 6:26 AM, David Lang <[email protected]> wrote:
>
>> On Fri, 21 Aug 2015, Otis Gospodnetić wrote:
>>
>>> Hi,
>>>
>>> This sounds like something that should be om-specific.  What Radu is
>>> suggesting would definitely help with ES, but may not be relevant for
>>> other output targets.
>>> What I think is overlooked here is the ES side - more specifically ES and
>>> searches that ES has to handle.  If we don't care about maxing out ES and
>>> just pushing data into it as fast as it arrives, then how
>>> rsyslog/omelasticsearch works today makes sense.  But this approach is
>>> focused on ingestion and ignores how this can hurt ES's ability to handle
>>> queries in a timely manner.  Exposing controls Radu suggested would help
>>> people avoid this problem.  I know David would like to see numbers :)  I
>>> love numbers, too, but I'm not sure if we'll have the time to provide them
>>> :(  That said, we work with ES 24/7 and have been doing that for years
>>> (many hundreds of ES deployments under our belt by now), so I am hoping
>>> somebody will trust us this option would be great to have in
>>> omelasticsearch. :)
>>>
>>
>> I think that this really should be addressed on the ElasticSearch side of
>> things.
>>
>> This really shouldn't be a numerical limit thing.
>>
>> What is ideal is that if ES is lightly loaded, things get pushed into ES
>> with the minimum latency. But if ES is more heavily loaded, batch things up.
>>
>> The right way to do this (as I said in another discussion) is for ES to
>> have a way to prioritize searches over inputting new data. That way as the
>> load climbs, the rate of processing new inserts will slow and inserts will
>> get batched more.
>
>
> While that would be an option (and I guess it can be done by tuning sizes
> and priorities of threadpools - I don't see another way), I don't agree
> that it's the right way to do this. In my experience, you'd want to avoid
> putting load on ES in the first place. ES does lots of things besides
> actually indexing and searching. Cluster management, for instance, where
> nodes are pinging each other and gathering statistics of what each node is
> doing and each shard hosted on said node. Re-opening searchers to make
> newly indexed data available for searches, warming up caches, backing up
> data and so on. There's a semi-complete list of thread pools here:
> https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html
> and obviously a single threadpool doesn't only do one job. Ideally, you'd
> want all these tasks to be snappy: you don't want a node to drop out
> of the cluster because it didn't reply to requests in a timely manner.
>
> As a result, I wouldn't put load on the indexing end just because I can
> (i.e. I'm not generating "enough load" to justify batching). Plus,
> forwarding data "immediately" (as opposed to every second or every 5
> seconds...) isn't necessarily helping the user, either. Elasticsearch is
> "near realtime" in the sense that by default, it "refreshes" the view on
> the index periodically to expose newly indexed data. This trades off some
> "realtime-ness" for "cache-ability" (both internally and at the OS level).
> Normally, users would make this refresh interval as long as possible
> without impacting the user experience too much, in order to reduce the load
> and increase the indexing throughput (some numbers here:
> http://blog.sematext.com/2013/07/08/elasticsearch-refresh-interval-vs-indexing-performance/).
> Because of this, in the logging case a refresh interval of 5-10 seconds or
> even more is common, especially when you do lots of indexing. That's why
> I'm saying it doesn't really matter if rsyslog sends data immediately or
> waits a second or two for batches to be larger.


I am mostly with Radu on this topic. I think there are some use cases
where it really would be advantageous to submit a larger batch, even
if this means waiting. True, these use cases were very rare in the
early days of rsyslog and may still be, but I think it's something one
might validly want.
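
To make the idea concrete, here is a rough sketch in Python of what
"wait for a minimum bulk size, but not forever" could look like. This
is not rsyslog code: the class name MinBulkBuffer and the parameters
min_bulk_size and max_wait_s are made up for illustration and are not
existing omelasticsearch settings. The clock is injectable only so the
behavior can be tested without sleeping.

```python
import time
from typing import Callable, List, Optional


class MinBulkBuffer:
    """Hold messages until a minimum batch size or a maximum wait is reached.

    Hypothetical illustration only; not how omelasticsearch is implemented.
    """

    def __init__(self, min_bulk_size: int, max_wait_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_bulk_size = min_bulk_size
        self.max_wait_s = max_wait_s
        self._clock = clock
        self._pending: List[str] = []
        # Arrival time of the oldest message still waiting in the buffer.
        self._oldest: Optional[float] = None

    def add(self, message: str) -> Optional[List[str]]:
        """Queue one message; return a batch if it is now ready to send."""
        if not self._pending:
            self._oldest = self._clock()
        self._pending.append(message)
        return self.poll()

    def poll(self) -> Optional[List[str]]:
        """Return the pending batch if it is big enough or old enough."""
        if not self._pending:
            return None
        waited = self._clock() - self._oldest
        if len(self._pending) >= self.min_bulk_size or waited >= self.max_wait_s:
            batch, self._pending, self._oldest = self._pending, [], None
            return batch
        return None
```

The caller would send whatever add() or poll() returns as one bulk
request, and call poll() periodically so that a slow trickle of
messages still gets flushed once max_wait_s has passed.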

Just my 2cts...

Rainer
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.