On Sat, Aug 22, 2015 at 6:26 AM, David Lang <[email protected]> wrote:

> On Fri, 21 Aug 2015, Otis Gospodnetić wrote:
>
>> Hi,
>>
>> This sounds like something that should be om-specific.  What Radu is
>> suggesting would definitely help with ES, but may not be relevant for
>> other output targets.
>> What I think is overlooked here is the ES side - more specifically ES and
>> searches that ES has to handle.  If we don't care about maxing out ES and
>> just pushing data into it as fast as it arrives, then how
>> rsyslog/omelasticsearch works today makes sense.  But this approach is
>> focused on ingestion and ignores how it can hurt ES's ability to handle
>> queries in a timely manner.  Exposing the controls Radu suggested would
>> help
>> people avoid this problem.  I know David would like to see numbers :)  I
>> love numbers, too, but I'm not sure if we'll have the time to provide them
>> :(  That said, we work with ES 24/7 and have been doing that for years
>> (many hundreds of ES deployments under our belt by now), so I am hoping
>> somebody will trust us this option would be great to have in
>> omelasticsearch. :)
>>
>
> I think that this really should be addressed on the Elasticsearch side of
> things.
>
> This really shouldn't be a numerical limit thing.
>
> What is ideal is that if ES is lightly loaded, things get pushed into ES
> with the minimum latency. But if ES is more heavily loaded, batch things up.
>
> The right way to do this (as I said in another discussion) is for ES to
> have a way to prioritize searches over inputting new data. That way as the
> load climbs, the rate of processing new inserts will slow and inserts will
> get batched more.


While that would be an option (and I guess it can be done by tuning the
sizes and priorities of thread pools - I don't see another way), I don't
agree that it's the right way to do this. In my experience, you'd want to
avoid putting load on ES in the first place. ES does lots of things besides
actually indexing and searching: cluster management, for instance, where
nodes ping each other and gather statistics about what each node is doing
and about each shard hosted on that node; re-opening searchers to make
newly indexed data available for searches; warming up caches; backing up
data; and so on. There's a semi-complete list of thread pools here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html
and obviously a single thread pool doesn't do only one job. Ideally, you'd
want all these tasks to be snappy; you don't want a node to drop out of the
cluster because it didn't reply to requests in a timely manner.
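
To make this concrete, here is roughly what that tuning could look like.
This is a sketch only, not a recommendation: in the 1.x/2.x line, thread
pool sizes could be updated dynamically through the cluster settings API
(later versions changed or removed some of these settings), and the
numbers below are made up for the example.

    # Hypothetical tuning: shrink the bulk pool and its queue so indexing
    # backs off under pressure, leaving more threads for search. Setting
    # names follow the ES 1.x/2.x threadpool module; adjust names and
    # numbers for your version and hardware.
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": {
        "threadpool.bulk.size": 2,
        "threadpool.bulk.queue_size": 50,
        "threadpool.search.size": 24
      }
    }'

But this only shifts the pain around inside ES; it doesn't reduce the total
amount of work the cluster has to do.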

As a result, I wouldn't put load on the indexing end just because I can
(i.e. just because I'm not generating "enough load" to justify batching).
Plus, forwarding data "immediately" (as opposed to every second or every 5
seconds...) doesn't necessarily help the user, either. Elasticsearch is
"near realtime" in the sense that, by default, it "refreshes" its view of
the index periodically to expose newly indexed data. This trades off some
"realtime-ness" for "cache-ability" (both internally and at the OS level).
Normally, users would make this refresh interval as long as possible
without hurting the user experience too much, in order to reduce load and
increase indexing throughput (some numbers here:
http://blog.sematext.com/2013/07/08/elasticsearch-refresh-interval-vs-indexing-performance/).
Because of this, a refresh interval of 5-10 seconds or even more is common
in the logging use case, especially when you do lots of indexing. That's
why I'm saying it doesn't really matter whether rsyslog sends data
immediately or waits a second or two for batches to be larger.
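
To make the refresh part concrete, this is the usual way to relax it on a
logging index (a sketch; the index name and the 10s value are just
examples):

    # Hypothetical example: set a 10-second refresh interval on a daily
    # logging index through the index settings API.
    curl -XPUT 'http://localhost:9200/logs-2015.08.22/_settings' -d '{
      "index": { "refresh_interval": "10s" }
    }'

On the rsyslog side, the counterpart would be enabling bulk indexing in
omelasticsearch (bulkmode="on") and letting the action queue's dequeue
batch size decide how many messages go into each bulk request.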

Best regards,
Radu