I have another reason to support this idea.

Currently, we can use `queue.dequeueslowdown` to force a somewhat larger
bulk size for the omelasticsearch queue. But when this runs into a DA
(disk-assisted) queue, the consumer of the DA queue is throttled by the
`queue.dequeueslowdown` option as well.

So I saw my DA queue size decrease by only 8 messages per second when I
set `queue.dequeueslowdown=900000` to get a larger bulk size...

BTW, why doesn't the DA queue read more than 8 lines and then dequeue
them? In the other mail, davidlang told me rsyslog needs to send msgs as
fast as it can, but the msgs in the DA queue are already later than they
need to be, so maybe the DA queue should read a larger block for sending?
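For reference, here is a sketch of the kind of setup I mean; the
parameter names are standard rsyslog/omelasticsearch ones, but the
server name and all values are only illustrative:

```
# Illustrative action with a disk-assisted queue feeding omelasticsearch.
# queue.dequeueslowdown delays each dequeue (in microseconds) so more
# messages accumulate and each bulk request is larger -- but it also
# throttles draining the DA queue itself.
action(type="omelasticsearch"
       server="localhost"
       bulkmode="on"                       # send bulk requests to ES
       queue.type="LinkedList"
       queue.filename="es_queue"           # filename makes the queue DA
       queue.dequeueBatchSize="2048"       # upper bound on batch size
       queue.dequeueslowdown="900000")     # ~0.9s pause per dequeue call
```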

2015-08-24 14:28 GMT+08:00 Rainer Gerhards <[email protected]>:

> 2015-08-24 7:42 GMT+02:00 Radu Gheorghe <[email protected]>:
> > On Sat, Aug 22, 2015 at 6:26 AM, David Lang <[email protected]> wrote:
> >
> >> On Fri, 21 Aug 2015, Otis Gospodnetić wrote:
> >>
> >>> Hi,
> >>>
> >>> This sounds like something that should be om-specific.  What Radu is
> >>> suggesting would definitely help with ES, but may not be relevant for
> >>> other output targets.
> >>> What I think is overlooked here is the ES side - more specifically ES
> >>> and searches that ES has to handle.  If we don't care about maxing out
> >>> ES and just pushing data in it as fast as it arrives, then how
> >>> rsyslog/omelasticsearch works today makes sense.  But this approach is
> >>> focused on ingestion and ignores how this can hurt ES's ability to
> >>> handle queries in a timely manner.  Exposing controls Radu suggested
> >>> would help people avoid this problem.  I know David would like to see
> >>> numbers :) I love numbers, too, but I'm not sure if we'll have the
> >>> time to provide them :(  That said, we work with ES 24/7 and have been
> >>> doing that for years (many hundreds of ES deployments under our belt
> >>> by now), so I am hoping somebody will trust us this option would be
> >>> great to have in omelasticsearch. :)
> >>>
> >>
> >> I think that this really should be addressed on the ElasticSearch side
> >> of things.
> >>
> >> This really shouldn't be a numerical limit thing.
> >>
> >> What is ideal is that if ES is lightly loaded, things get pushed into
> >> ES with the minimum latency. But if ES is more heavily loaded, batch
> >> things up.
> >>
> >> The right way to do this (as I said in another discussion) is for ES
> >> to have a way to prioritize searches over inputting new data. That way
> >> as the load climbs, the rate of processing new inserts will slow and
> >> inserts will get batched more.
> >
> >
> > While that would be an option (and I guess it can be done by tuning
> > sizes and priorities of threadpools - I don't see another way), I don't
> > agree that it's the right way to do this. In my experience, you'd want
> > to avoid putting load on ES in the first place. ES does lots of things
> > besides actually indexing and searching. Cluster management, for
> > instance, where nodes are pinging each other and gathering statistics
> > of what each node is doing and each shard hosted on said node.
> > Re-opening searchers to make newly indexed data available for searches,
> > warming up caches, backing up data and so on. There's a semi-complete
> > list of thread pools here:
> > https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html
> > and obviously a single threadpool doesn't only do one job. And ideally,
> > you'd want all these tasks to be snappy; you don't want a node to drop
> > out of the cluster because it didn't reply to requests in a timely
> > manner.
> >
> > As a result, I wouldn't put load on the indexing end just because I can
> > (i.e. I'm not generating "enough load" to justify batching). Plus,
> > forwarding data "immediately" (as opposed to every second or every 5
> > seconds...) isn't necessarily helping the user, either. Elasticsearch
> > is "near realtime" in the sense that by default, it "refreshes" the
> > view on the index periodically to expose newly indexed data. This
> > trades off some "realtime-ness" for "cache-ability" (both internally
> > and at the OS level). Normally, users would make this refresh interval
> > as long as possible without impacting the user experience too much, in
> > order to reduce the load and increase the indexing throughput (some
> > numbers here:
> > http://blog.sematext.com/2013/07/08/elasticsearch-refresh-interval-vs-indexing-performance/
> > ). Because of this, in the logging case a refresh interval of 5-10
> > seconds or even more is common, especially when you do lots of
> > indexing. That's why I'm saying it doesn't really matter if rsyslog
> > sends data immediately or waits a second or two for batches to be
> > larger.
>
> I am mostly with Radu on this topic. I think there are some use cases
> where it really would be advantageous to submit a larger batch, even
> if this means waiting. True, these use cases were very rare in the
> early days of rsyslog and may still be, but I think it's something one
> might validly want.
>
> Just my 2cts...
>
> Rainer
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com/professional-services/
> What's up with rsyslog? Follow https://twitter.com/rgerhards
> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
> DON'T LIKE THAT.
>