Re: [rsyslog] [PERFORM] performance for high-volume log insertion

david Tue, 21 Apr 2009 23:27:10 -0700

On Wed, 22 Apr 2009, Rainer Gerhards wrote:

> Hi Ken,
>
> glad to have you here. I am a bit silent at the moment, because I am not a
> real database guy and so I am primarily listening to any information that is
> incoming. If you have a couple of minutes, it would be useful to review this
> thread here:
>
> http://lists.adiscon.net/pipermail/rsyslog/2009-April/002003.html
>
> ...one comment inline below...
>
>> -----Original Message-----
>> From: [email protected] [mailto:rsyslog-
>> [email protected]] On Behalf Of Kenneth Marshall
>>
>> David,
>>
>> Okay, I am now subscribed to the mailing list. We are currently using
>> rsyslog-3.20.x. As far as implementing a prototype of the prepared
>> statement, I was sidetracked by other duties and have not had a chance
>> to do anything but an initial evaluation. As far as the rsyslog
>> internal
>> escaping, it looked simplest to create another template type like the
>> current SQL and STDSQL that indicated that escaping was not needed
>> and/or that prepared statements should be used.
>
> To disable escaping, simply do not use SQL or STDSQL. However, the db outputs
> currently require this option (easy to disable), because I cannot see how it
> will work (with the existing code) without escaping.
>
> Any idea is most welcome.


when using prepared statement escaping is not needed. according to Ken's 
message below he found that the overhead of doing the escaping was 
significant. I don't see why this should be the case, but if it requires 
making an extra copy of the string I guess it's possible

I plan to get a setup togeather in the next couple of days that will let 
me do some testing of the options on the database side.

David Lang

> Rainer
>>
>> Regards,
>> Ken
>>
>> On Tue, Apr 21, 2009 at 08:51:37AM -0700, [email protected] wrote:
>>> On Tue, 21 Apr 2009, Kenneth Marshall wrote:
>>>
>>>> On Tue, Apr 21, 2009 at 08:37:54AM -0700, [email protected] wrote:
>>>>> Kenneth,
>>>>>   could you join the discussion on the rsyslog mailing list?
>>>>> rsyslog-users <[email protected]>
>>>>>
>>>>> I'm surprised to hear you say that rsyslog can already do batch
>> inserts
>>>>> and
>>>>> am interested in how you did that.
>>>>>
>>>>> what sort of insert rate did you mange to get?
>>>>>
>>>>> David Lang
>>>>>
>>>> David,
>>>>
>>>> I would be happy to join the discussion. I did not mean to say
>>>> that rsyslog currently supported batch inserts, just that the
>>>> pieces that provide "stand-by queuing" could be used to manage
>>>> batching inserts.
>>>
>>> I've changed the to list to the rsyslog users list.
>>>
>>> currently the stand-by queuing still handles messages one at a time.
>>> however a sponser has been found to pay to changing the rsyslog
>> internals
>>> to allow for multiple messages to be handled at once, which is what
>>> triggered some of this discussion.
>>>
>>> which version of rsyslog are you working with?
>>>
>>> when you modified rsyslog to do prepared statement (to avoid the
>> escaping
>>> and parsing) did you hard-code the prepared statement? what other
>> changes
>>> did you make?
>>>
>>> David Lang
>>>
>>>> Cheers,
>>>> Ken
>>>>
>>>>> On Tue, 21 Apr 2009, Kenneth Marshall wrote:
>>>>>
>>>>>> Date: Tue, 21 Apr 2009 08:33:30 -0500
>>>>>> From: Kenneth Marshall <[email protected]>
>>>>>> To: Richard Huxton <[email protected]>
>>>>>> Cc: [email protected], Stephen Frost <[email protected]>,
>>>>>>     Greg Smith <[email protected]>, pgsql-
>> [email protected]
>>>>>> Subject: Re: [PERFORM] performance for high-volume log insertion
>>>>>> Hi,
>>>>>>
>>>>>> I just finished reading this thread. We are currently working on
>>>>>> setting up a central log system using rsyslog and PostgreSQL. It
>>>>>> works well once we patched the memory leak. We also looked at what
>>>>>> could be done to improve the efficiency of the DB interface. On
>> the
>>>>>> rsyslog side, moving to prepared queries allows you to remove the
>>>>>> escaping that needs to be done currently before attempting to
>>>>>> insert the data into the SQL backend as well as removing the
>> parsing
>>>>>> and planning time from the insert. This is a big win for high
>> insert
>>>>>> rates, which is what we are talking about. The escaping process is
>>>>>> also a big CPU user in rsyslog which then hands the escaped string
>>>>>> to the backend which then has to undo everything that had been
>> done
>>>>>> and parse/plan the resulting query. This can use a surprising
>> amount
>>>>>> of additional CPU. Even if you cannot support a general prepared
>>>>>> query interface, by specifying what the query should look like you
>>>>>> can handle much of the low-hanging fruit query-wise.
>>>>>>
>>>>>> We are currently using a date based trigger to use a new partition
>>>>>> each day and keep 2 months of logs currently. This can be usefully
>>>>>> managed on the backend database, but if rsyslog supported changing
>>>>>> the insert to the new table on a time basis, the CPU used by the
>>>>>> trigger to support this on the backend could be reclaimed. This
>>>>>> would be a win for any DB backend. As you move to the new
>> partition,
>>>>>> issuing a truncate to clear the table would simplify the DB
>> interfaces.
>>>>>>
>>>>>> Another performance enhancement already mentioned, would be to
>>>>>> allow certain extra fields in the DB to be automatically populated
>>>>>> as a function of the log messages. For example, logging the mail
>> queue
>>>>>> id for messages from mail systems would make it much easier to
>> locate
>>>>>> particular mail transactions in large amounts of data.
>>>>>>
>>>>>> To sum up, eliminating the escaping in rsyslog through the use of
>>>>>> prepared queries would reduce the CPU load on the DB backend.
>> Batching
>>>>>> the inserts will also net you a big performance increase. Some DB-
>> based
>>>>>> applications allow for the specification of several types of
>> queries,
>>>>>> one for single inserts and then a second to support multiple
>> inserts
>>>>>> (copy). Rsyslog already supports the queuing pieces to allow you
>> to
>>>>>> batch inserts. Just some ideas.
>>>>>>
>>>>>> Regards,
>>>>>> Ken
>>>>>>
>>>>>>
>>>>>> On Tue, Apr 21, 2009 at 09:56:23AM +0100, Richard Huxton wrote:
>>>>>>> [email protected] wrote:
>>>>>>>> On Tue, 21 Apr 2009, Stephen Frost wrote:
>>>>>>>>> * [email protected] ([email protected]) wrote:
>>>>>>>>>> while I fully understand the 'benchmark your situation' need,
>> this
>>>>>>>>>> isn't
>>>>>>>>>> that simple.
>>>>>>>>>
>>>>>>>>> It really is.  You know your application, you know it's primary
>> use
>>>>>>>>> cases, and probably have some data to play with.  You're
>> certainly in
>>>>>>>>> a
>>>>>>>>> much better situation to at least *try* and benchmark it than
>> we are.
>>>>>>>> rsyslog is a syslog server. it replaces (or for debian and
>> fedora, has
>>>>>>>> replaced) your standard syslog daemon. it recieves log messages
>> from
>>>>>>>> every
>>>>>>>> app on your system (and possibly others), filters, maniulates
>> them,
>>>>>>>> and
>>>>>>>> then stores them somewhere. among the places that it can store
>> the
>>>>>>>> logs
>>>>>>>> are database servers (native support for MySQL, PostgreSQL, and
>>>>>>>> Oracle.
>>>>>>>> plus libdbi for others)
>>>>>>>
>>>>>>> Well, from a performance standpoint the obvious things to do are:
>>>>>>> 1. Keep a connection open, do NOT reconnect for each log-
>> statement
>>>>>>> 2. Batch log statements together where possible
>>>>>>> 3. Use prepared statements
>>>>>>> 4. Partition the tables by day/week/month/year (configurable I
>> suppose)
>>>>>>>
>>>>>>> The first two are vital, the third takes you a step further. The
>> fourth
>>>>>>> is
>>>>>>> a long-term admin thing.
>>>>>>>
>>>>>>> And possibly
>>>>>>> 5. Have two connections, one for fatal/error etc and one for
>> info/debug
>>>>>>> level log statements (configurable split?). Then you can use the
>>>>>>> synchronous_commit setting on the less important ones. Might buy
>> you
>>>>>>> some
>>>>>>> performance on a busy system.
>>>>>>>
>>>>>>> http://www.postgresql.org/docs/8.3/interactive/runtime-config-
>> wal.html#RUNTIME-CONFIG-WAL-SETTINGS
>>>>>>>
>>>>>>>> other apps then search and report on the data after it is
>> stored. what
>>>>>>>> apps?, I don't know either. pick your favorite reporting tool
>> and
>>>>>>>> you'll
>>>>>>>> be a step ahead of me (I don't know a really good reporting
>> tool)
>>>>>>>> as for sample data, you have syslog messages, just like I do. so
>> you
>>>>>>>> have
>>>>>>>> the same access to data that I have.
>>>>>>>> how would you want to query them? how would people far less
>>>>>>>> experianced
>>>>>>>> that you want to query them?
>>>>>>>> I can speculate that some people would do two columns (time,
>>>>>>>> everything
>>>>>>>> else), others will do three (time, server, everything else), and
>>>>>>>> others
>>>>>>>> will go further (I know some who would like to extract IP
>> addresses
>>>>>>>> embedded in a message into their own column). some people will
>> index
>>>>>>>> on
>>>>>>>> the time and host, others will want to do full-text searches of
>>>>>>>> everything.
>>>>>>>
>>>>>>> Well, assuming it looks much like traditional syslog, I would do
>>>>>>> something
>>>>>>> like: (timestamp, host, facility, priority, message). It's easy
>> enough
>>>>>>> to
>>>>>>> stitch back together if people want that.
>>>>>>>
>>>>>>> PostgreSQL's full-text indexing is quite well suited to logfiles
>> I'd
>>>>>>> have
>>>>>>> thought, since it knows about filenames, urls etc already.
>>>>>>>
>>>>>>> If you want to get fancy, add a msg_type column and one
>> subsidiary
>>>>>>> table
>>>>>>> for each msg_type. So - you might have smtp_connect_from
>> (hostname,
>>>>>>> ip_addr). A set of perl regexps can match and extract the fields
>> for
>>>>>>> these
>>>>>>> extra tables, or you could do it with triggers inside the
>> database. I
>>>>>>> think
>>>>>>> it makes sense to do it in the application. Easier for users to
>>>>>>> contribute
>>>>>>> new patterns/extractions. Meanwhile, the core table is untouched
>> so you
>>>>>>> don't *need* to know about these extra tables.
>>>>>>>
>>>>>>> If you have subsidiary tables, you'll want to partition those too
>> and
>>>>>>> perhaps stick them in their own schema (logs200901, logs200902
>> etc).
>>>>>>>
>>>>>>> --
>>>>>>>   Richard Huxton
>>>>>>>   Archonet Ltd
>>>>>>>
>>>>>>> --
>>>>>>> Sent via pgsql-performance mailing list
>>>>>>> ([email protected])
>>>>>>> To make changes to your subscription:
>>>>>>> http://www.postgresql.org/mailpref/pgsql-performance
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Sent via pgsql-performance mailing list
>>>>> ([email protected])
>>>>> To make changes to your subscription:
>>>>> http://www.postgresql.org/mailpref/pgsql-performance
>>>>>
>>>>
>>>
>> _______________________________________________
>> rsyslog mailing list
>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>> http://www.rsyslog.com
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com
>
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

Re: [rsyslog] [PERFORM] performance for high-volume log insertion

Reply via email to