Re: [rsyslog] [PERFORM] performance for high-volume log insertion

Rainer Gerhards Tue, 21 Apr 2009 23:09:56 -0700

Hi Ken,

glad to have you here. I am a bit silent at the moment, because I am not a
real database guy and so I am primarily listening to any information that is
incoming. If you have a couple of minutes, it would be useful to review this
thread here:


http://lists.adiscon.net/pipermail/rsyslog/2009-April/002003.html

...one comment inline below...

> -----Original Message-----
> From: [email protected] [mailto:rsyslog-
> [email protected]] On Behalf Of Kenneth Marshall
> Sent: Tuesday, April 21, 2009 7:52 PM
> To: [email protected]
> Cc: rsyslog-users
> Subject: Re: [rsyslog] [PERFORM] performance for high-volume log
> insertion
> 
> David,
> 
> Okay, I am now subscribed to the mailing list. We are currently using
> rsyslog-3.20.x. As far as implementing a prototype of the prepared
> statement, I was sidetracked by other duties and have not had a chance
> to do anything but an initial evaluation. As far as the rsyslog
> internal
> escaping, it looked simplest to create another template type like the
> current SQL and STDSQL that indicated that escaping was not needed
> and/or that prepared statements should be used.

To disable escaping, simply do not use SQL or STDSQL. However, the db outputs
currently require this option (easy to disable), because I cannot see how it
will work (with the existing code) without escaping.

Any idea is most welcome.

Rainer
> 
> Regards,
> Ken
> 
> On Tue, Apr 21, 2009 at 08:51:37AM -0700, [email protected] wrote:
> > On Tue, 21 Apr 2009, Kenneth Marshall wrote:
> >
> >> On Tue, Apr 21, 2009 at 08:37:54AM -0700, [email protected] wrote:
> >>> Kenneth,
> >>>   could you join the discussion on the rsyslog mailing list?
> >>> rsyslog-users <[email protected]>
> >>>
> >>> I'm surprised to hear you say that rsyslog can already do batch
> inserts
> >>> and
> >>> am interested in how you did that.
> >>>
> >>> what sort of insert rate did you mange to get?
> >>>
> >>> David Lang
> >>>
> >> David,
> >>
> >> I would be happy to join the discussion. I did not mean to say
> >> that rsyslog currently supported batch inserts, just that the
> >> pieces that provide "stand-by queuing" could be used to manage
> >> batching inserts.
> >
> > I've changed the to list to the rsyslog users list.
> >
> > currently the stand-by queuing still handles messages one at a time.
> > however a sponser has been found to pay to changing the rsyslog
> internals
> > to allow for multiple messages to be handled at once, which is what
> > triggered some of this discussion.
> >
> > which version of rsyslog are you working with?
> >
> > when you modified rsyslog to do prepared statement (to avoid the
> escaping
> > and parsing) did you hard-code the prepared statement? what other
> changes
> > did you make?
> >
> > David Lang
> >
> >> Cheers,
> >> Ken
> >>
> >>> On Tue, 21 Apr 2009, Kenneth Marshall wrote:
> >>>
> >>>> Date: Tue, 21 Apr 2009 08:33:30 -0500
> >>>> From: Kenneth Marshall <[email protected]>
> >>>> To: Richard Huxton <[email protected]>
> >>>> Cc: [email protected], Stephen Frost <[email protected]>,
> >>>>     Greg Smith <[email protected]>, pgsql-
> [email protected]
> >>>> Subject: Re: [PERFORM] performance for high-volume log insertion
> >>>> Hi,
> >>>>
> >>>> I just finished reading this thread. We are currently working on
> >>>> setting up a central log system using rsyslog and PostgreSQL. It
> >>>> works well once we patched the memory leak. We also looked at what
> >>>> could be done to improve the efficiency of the DB interface. On
> the
> >>>> rsyslog side, moving to prepared queries allows you to remove the
> >>>> escaping that needs to be done currently before attempting to
> >>>> insert the data into the SQL backend as well as removing the
> parsing
> >>>> and planning time from the insert. This is a big win for high
> insert
> >>>> rates, which is what we are talking about. The escaping process is
> >>>> also a big CPU user in rsyslog which then hands the escaped string
> >>>> to the backend which then has to undo everything that had been
> done
> >>>> and parse/plan the resulting query. This can use a surprising
> amount
> >>>> of additional CPU. Even if you cannot support a general prepared
> >>>> query interface, by specifying what the query should look like you
> >>>> can handle much of the low-hanging fruit query-wise.
> >>>>
> >>>> We are currently using a date based trigger to use a new partition
> >>>> each day and keep 2 months of logs currently. This can be usefully
> >>>> managed on the backend database, but if rsyslog supported changing
> >>>> the insert to the new table on a time basis, the CPU used by the
> >>>> trigger to support this on the backend could be reclaimed. This
> >>>> would be a win for any DB backend. As you move to the new
> partition,
> >>>> issuing a truncate to clear the table would simplify the DB
> interfaces.
> >>>>
> >>>> Another performance enhancement already mentioned, would be to
> >>>> allow certain extra fields in the DB to be automatically populated
> >>>> as a function of the log messages. For example, logging the mail
> queue
> >>>> id for messages from mail systems would make it much easier to
> locate
> >>>> particular mail transactions in large amounts of data.
> >>>>
> >>>> To sum up, eliminating the escaping in rsyslog through the use of
> >>>> prepared queries would reduce the CPU load on the DB backend.
> Batching
> >>>> the inserts will also net you a big performance increase. Some DB-
> based
> >>>> applications allow for the specification of several types of
> queries,
> >>>> one for single inserts and then a second to support multiple
> inserts
> >>>> (copy). Rsyslog already supports the queuing pieces to allow you
> to
> >>>> batch inserts. Just some ideas.
> >>>>
> >>>> Regards,
> >>>> Ken
> >>>>
> >>>>
> >>>> On Tue, Apr 21, 2009 at 09:56:23AM +0100, Richard Huxton wrote:
> >>>>> [email protected] wrote:
> >>>>>> On Tue, 21 Apr 2009, Stephen Frost wrote:
> >>>>>>> * [email protected] ([email protected]) wrote:
> >>>>>>>> while I fully understand the 'benchmark your situation' need,
> this
> >>>>>>>> isn't
> >>>>>>>> that simple.
> >>>>>>>
> >>>>>>> It really is.  You know your application, you know it's primary
> use
> >>>>>>> cases, and probably have some data to play with.  You're
> certainly in
> >>>>>>> a
> >>>>>>> much better situation to at least *try* and benchmark it than
> we are.
> >>>>>> rsyslog is a syslog server. it replaces (or for debian and
> fedora, has
> >>>>>> replaced) your standard syslog daemon. it recieves log messages
> from
> >>>>>> every
> >>>>>> app on your system (and possibly others), filters, maniulates
> them,
> >>>>>> and
> >>>>>> then stores them somewhere. among the places that it can store
> the
> >>>>>> logs
> >>>>>> are database servers (native support for MySQL, PostgreSQL, and
> >>>>>> Oracle.
> >>>>>> plus libdbi for others)
> >>>>>
> >>>>> Well, from a performance standpoint the obvious things to do are:
> >>>>> 1. Keep a connection open, do NOT reconnect for each log-
> statement
> >>>>> 2. Batch log statements together where possible
> >>>>> 3. Use prepared statements
> >>>>> 4. Partition the tables by day/week/month/year (configurable I
> suppose)
> >>>>>
> >>>>> The first two are vital, the third takes you a step further. The
> fourth
> >>>>> is
> >>>>> a long-term admin thing.
> >>>>>
> >>>>> And possibly
> >>>>> 5. Have two connections, one for fatal/error etc and one for
> info/debug
> >>>>> level log statements (configurable split?). Then you can use the
> >>>>> synchronous_commit setting on the less important ones. Might buy
> you
> >>>>> some
> >>>>> performance on a busy system.
> >>>>>
> >>>>> http://www.postgresql.org/docs/8.3/interactive/runtime-config-
> wal.html#RUNTIME-CONFIG-WAL-SETTINGS
> >>>>>
> >>>>>> other apps then search and report on the data after it is
> stored. what
> >>>>>> apps?, I don't know either. pick your favorite reporting tool
> and
> >>>>>> you'll
> >>>>>> be a step ahead of me (I don't know a really good reporting
> tool)
> >>>>>> as for sample data, you have syslog messages, just like I do. so
> you
> >>>>>> have
> >>>>>> the same access to data that I have.
> >>>>>> how would you want to query them? how would people far less
> >>>>>> experianced
> >>>>>> that you want to query them?
> >>>>>> I can speculate that some people would do two columns (time,
> >>>>>> everything
> >>>>>> else), others will do three (time, server, everything else), and
> >>>>>> others
> >>>>>> will go further (I know some who would like to extract IP
> addresses
> >>>>>> embedded in a message into their own column). some people will
> index
> >>>>>> on
> >>>>>> the time and host, others will want to do full-text searches of
> >>>>>> everything.
> >>>>>
> >>>>> Well, assuming it looks much like traditional syslog, I would do
> >>>>> something
> >>>>> like: (timestamp, host, facility, priority, message). It's easy
> enough
> >>>>> to
> >>>>> stitch back together if people want that.
> >>>>>
> >>>>> PostgreSQL's full-text indexing is quite well suited to logfiles
> I'd
> >>>>> have
> >>>>> thought, since it knows about filenames, urls etc already.
> >>>>>
> >>>>> If you want to get fancy, add a msg_type column and one
> subsidiary
> >>>>> table
> >>>>> for each msg_type. So - you might have smtp_connect_from
> (hostname,
> >>>>> ip_addr). A set of perl regexps can match and extract the fields
> for
> >>>>> these
> >>>>> extra tables, or you could do it with triggers inside the
> database. I
> >>>>> think
> >>>>> it makes sense to do it in the application. Easier for users to
> >>>>> contribute
> >>>>> new patterns/extractions. Meanwhile, the core table is untouched
> so you
> >>>>> don't *need* to know about these extra tables.
> >>>>>
> >>>>> If you have subsidiary tables, you'll want to partition those too
> and
> >>>>> perhaps stick them in their own schema (logs200901, logs200902
> etc).
> >>>>>
> >>>>> --
> >>>>>   Richard Huxton
> >>>>>   Archonet Ltd
> >>>>>
> >>>>> --
> >>>>> Sent via pgsql-performance mailing list
> >>>>> ([email protected])
> >>>>> To make changes to your subscription:
> >>>>> http://www.postgresql.org/mailpref/pgsql-performance
> >>>>>
> >>>>
> >>>
> >>> --
> >>> Sent via pgsql-performance mailing list
> >>> ([email protected])
> >>> To make changes to your subscription:
> >>> http://www.postgresql.org/mailpref/pgsql-performance
> >>>
> >>
> >
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

Re: [rsyslog] [PERFORM] performance for high-volume log insertion

Reply via email to