Re: [rsyslog] [PERFORM] performance for high-volume log insertion

Kenneth Marshall Tue, 21 Apr 2009 10:52:55 -0700

David,

Okay, I am now subscribed to the mailing list. We are currently using
rsyslog-3.20.x. As far as implementing a prototype of the prepared
statement, I was sidetracked by other duties and have not had a chance
to do anything but an initial evaluation. As far as the rsyslog internal
escaping, it looked simplest to create another template type like the
current SQL and STDSQL that indicated that escaping was not needed
and/or that prepared statements should be used.


Regards,
Ken

On Tue, Apr 21, 2009 at 08:51:37AM -0700, [email protected] wrote:
> On Tue, 21 Apr 2009, Kenneth Marshall wrote:
>
>> On Tue, Apr 21, 2009 at 08:37:54AM -0700, [email protected] wrote:
>>> Kenneth,
>>>   could you join the discussion on the rsyslog mailing list?
>>> rsyslog-users <[email protected]>
>>>
>>> I'm surprised to hear you say that rsyslog can already do batch inserts 
>>> and
>>> am interested in how you did that.
>>>
>>> what sort of insert rate did you mange to get?
>>>
>>> David Lang
>>>
>> David,
>>
>> I would be happy to join the discussion. I did not mean to say
>> that rsyslog currently supported batch inserts, just that the
>> pieces that provide "stand-by queuing" could be used to manage
>> batching inserts.
>
> I've changed the to list to the rsyslog users list.
>
> currently the stand-by queuing still handles messages one at a time. 
> however a sponser has been found to pay to changing the rsyslog internals 
> to allow for multiple messages to be handled at once, which is what 
> triggered some of this discussion.
>
> which version of rsyslog are you working with?
>
> when you modified rsyslog to do prepared statement (to avoid the escaping 
> and parsing) did you hard-code the prepared statement? what other changes 
> did you make?
>
> David Lang
>
>> Cheers,
>> Ken
>>
>>> On Tue, 21 Apr 2009, Kenneth Marshall wrote:
>>>
>>>> Date: Tue, 21 Apr 2009 08:33:30 -0500
>>>> From: Kenneth Marshall <[email protected]>
>>>> To: Richard Huxton <[email protected]>
>>>> Cc: [email protected], Stephen Frost <[email protected]>,
>>>>     Greg Smith <[email protected]>, [email protected]
>>>> Subject: Re: [PERFORM] performance for high-volume log insertion
>>>> Hi,
>>>>
>>>> I just finished reading this thread. We are currently working on
>>>> setting up a central log system using rsyslog and PostgreSQL. It
>>>> works well once we patched the memory leak. We also looked at what
>>>> could be done to improve the efficiency of the DB interface. On the
>>>> rsyslog side, moving to prepared queries allows you to remove the
>>>> escaping that needs to be done currently before attempting to
>>>> insert the data into the SQL backend as well as removing the parsing
>>>> and planning time from the insert. This is a big win for high insert
>>>> rates, which is what we are talking about. The escaping process is
>>>> also a big CPU user in rsyslog which then hands the escaped string
>>>> to the backend which then has to undo everything that had been done
>>>> and parse/plan the resulting query. This can use a surprising amount
>>>> of additional CPU. Even if you cannot support a general prepared
>>>> query interface, by specifying what the query should look like you
>>>> can handle much of the low-hanging fruit query-wise.
>>>>
>>>> We are currently using a date based trigger to use a new partition
>>>> each day and keep 2 months of logs currently. This can be usefully
>>>> managed on the backend database, but if rsyslog supported changing
>>>> the insert to the new table on a time basis, the CPU used by the
>>>> trigger to support this on the backend could be reclaimed. This
>>>> would be a win for any DB backend. As you move to the new partition,
>>>> issuing a truncate to clear the table would simplify the DB interfaces.
>>>>
>>>> Another performance enhancement already mentioned, would be to
>>>> allow certain extra fields in the DB to be automatically populated
>>>> as a function of the log messages. For example, logging the mail queue
>>>> id for messages from mail systems would make it much easier to locate
>>>> particular mail transactions in large amounts of data.
>>>>
>>>> To sum up, eliminating the escaping in rsyslog through the use of
>>>> prepared queries would reduce the CPU load on the DB backend. Batching
>>>> the inserts will also net you a big performance increase. Some DB-based
>>>> applications allow for the specification of several types of queries,
>>>> one for single inserts and then a second to support multiple inserts
>>>> (copy). Rsyslog already supports the queuing pieces to allow you to
>>>> batch inserts. Just some ideas.
>>>>
>>>> Regards,
>>>> Ken
>>>>
>>>>
>>>> On Tue, Apr 21, 2009 at 09:56:23AM +0100, Richard Huxton wrote:
>>>>> [email protected] wrote:
>>>>>> On Tue, 21 Apr 2009, Stephen Frost wrote:
>>>>>>> * [email protected] ([email protected]) wrote:
>>>>>>>> while I fully understand the 'benchmark your situation' need, this
>>>>>>>> isn't
>>>>>>>> that simple.
>>>>>>>
>>>>>>> It really is.  You know your application, you know it's primary use
>>>>>>> cases, and probably have some data to play with.  You're certainly in 
>>>>>>> a
>>>>>>> much better situation to at least *try* and benchmark it than we are.
>>>>>> rsyslog is a syslog server. it replaces (or for debian and fedora, has
>>>>>> replaced) your standard syslog daemon. it recieves log messages from
>>>>>> every
>>>>>> app on your system (and possibly others), filters, maniulates them, 
>>>>>> and
>>>>>> then stores them somewhere. among the places that it can store the 
>>>>>> logs
>>>>>> are database servers (native support for MySQL, PostgreSQL, and 
>>>>>> Oracle.
>>>>>> plus libdbi for others)
>>>>>
>>>>> Well, from a performance standpoint the obvious things to do are:
>>>>> 1. Keep a connection open, do NOT reconnect for each log-statement
>>>>> 2. Batch log statements together where possible
>>>>> 3. Use prepared statements
>>>>> 4. Partition the tables by day/week/month/year (configurable I suppose)
>>>>>
>>>>> The first two are vital, the third takes you a step further. The fourth
>>>>> is
>>>>> a long-term admin thing.
>>>>>
>>>>> And possibly
>>>>> 5. Have two connections, one for fatal/error etc and one for info/debug
>>>>> level log statements (configurable split?). Then you can use the
>>>>> synchronous_commit setting on the less important ones. Might buy you 
>>>>> some
>>>>> performance on a busy system.
>>>>>
>>>>> http://www.postgresql.org/docs/8.3/interactive/runtime-config-wal.html#RUNTIME-CONFIG-WAL-SETTINGS
>>>>>
>>>>>> other apps then search and report on the data after it is stored. what
>>>>>> apps?, I don't know either. pick your favorite reporting tool and 
>>>>>> you'll
>>>>>> be a step ahead of me (I don't know a really good reporting tool)
>>>>>> as for sample data, you have syslog messages, just like I do. so you
>>>>>> have
>>>>>> the same access to data that I have.
>>>>>> how would you want to query them? how would people far less 
>>>>>> experianced
>>>>>> that you want to query them?
>>>>>> I can speculate that some people would do two columns (time, 
>>>>>> everything
>>>>>> else), others will do three (time, server, everything else), and 
>>>>>> others
>>>>>> will go further (I know some who would like to extract IP addresses
>>>>>> embedded in a message into their own column). some people will index 
>>>>>> on
>>>>>> the time and host, others will want to do full-text searches of
>>>>>> everything.
>>>>>
>>>>> Well, assuming it looks much like traditional syslog, I would do
>>>>> something
>>>>> like: (timestamp, host, facility, priority, message). It's easy enough 
>>>>> to
>>>>> stitch back together if people want that.
>>>>>
>>>>> PostgreSQL's full-text indexing is quite well suited to logfiles I'd 
>>>>> have
>>>>> thought, since it knows about filenames, urls etc already.
>>>>>
>>>>> If you want to get fancy, add a msg_type column and one subsidiary 
>>>>> table
>>>>> for each msg_type. So - you might have smtp_connect_from (hostname,
>>>>> ip_addr). A set of perl regexps can match and extract the fields for
>>>>> these
>>>>> extra tables, or you could do it with triggers inside the database. I
>>>>> think
>>>>> it makes sense to do it in the application. Easier for users to
>>>>> contribute
>>>>> new patterns/extractions. Meanwhile, the core table is untouched so you
>>>>> don't *need* to know about these extra tables.
>>>>>
>>>>> If you have subsidiary tables, you'll want to partition those too and
>>>>> perhaps stick them in their own schema (logs200901, logs200902 etc).
>>>>>
>>>>> --
>>>>>   Richard Huxton
>>>>>   Archonet Ltd
>>>>>
>>>>> --
>>>>> Sent via pgsql-performance mailing list
>>>>> ([email protected])
>>>>> To make changes to your subscription:
>>>>> http://www.postgresql.org/mailpref/pgsql-performance
>>>>>
>>>>
>>>
>>> --
>>> Sent via pgsql-performance mailing list 
>>> ([email protected])
>>> To make changes to your subscription:
>>> http://www.postgresql.org/mailpref/pgsql-performance
>>>
>>
>
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

Re: [rsyslog] [PERFORM] performance for high-volume log insertion

Reply via email to