Re: [rsyslog] [PERFORM] performance for high-volume log insertion

david Tue, 21 Apr 2009 23:52:44 -0700

On Wed, 22 Apr 2009, Rainer Gerhards wrote:

>> -----Original Message----
>> From: [email protected] [mailto:rsyslog-
>> [email protected]] On Behalf Of [email protected]
>> Sent: Wednesday, April 22, 2009 8:26 AM
>> To: rsyslog-users
>> Subject: Re: [rsyslog] [PERFORM] performance for high-volume log
>> insertion
>>
>> On Wed, 22 Apr 2009, Rainer Gerhards wrote:
>>
>>> Hi Ken,
>>>
>>> glad to have you here. I am a bit silent at the moment, because I am
>> not a
>>> real database guy and so I am primarily listening to any information
>> that is
>>> incoming. If you have a couple of minutes, it would be useful to
>> review this
>>> thread here:
>>>
>>> http://lists.adiscon.net/pipermail/rsyslog/2009-April/002003.html
>>>
>>> ...one comment inline below...
>>>
>>>> -----Original Message-----
>>>> From: [email protected] [mailto:rsyslog-
>>>> [email protected]] On Behalf Of Kenneth Marshall
>>>>
>>>> David,
>>>>
>>>> Okay, I am now subscribed to the mailing list. We are currently
>> using
>>>> rsyslog-3.20.x. As far as implementing a prototype of the prepared
>>>> statement, I was sidetracked by other duties and have not had a
>> chance
>>>> to do anything but an initial evaluation. As far as the rsyslog
>>>> internal
>>>> escaping, it looked simplest to create another template type like
>> the
>>>> current SQL and STDSQL that indicated that escaping was not needed
>>>> and/or that prepared statements should be used.
>>>
>>> To disable escaping, simply do not use SQL or STDSQL. However, the db
>> outputs
>>> currently require this option (easy to disable), because I cannot see
>> how it
>>> will work (with the existing code) without escaping.
>>>
>>> Any idea is most welcome.
>>
>> when using prepared statement escaping is not needed.
>
> ... but that can only be on the non-text API level (e.g. by using libpq). On
> the SQL text level, I have no idea how to tell the sql engine to insert ' -
> if I don't say '''' but ''' how does the engine know what I say? With the
> C-level API, I bind a parameter and specify a buffer and then put my
> character into that buffer - no escaping needed for sure. But, again, I do
> not see how this would work on the text level...


correct.

>> according to
>> Ken's
>> message below he found that the overhead of doing the escaping was
>> significant. I don't see why this should be the case, but if it
>> requires
>> making an extra copy of the string I guess it's possible
>>
>
> I am surprised, too. Even if a copy is made (I think it is), this is a quick
> in-memory operation. Given the rest of the picture, I would expect that to
> have very low impact on the overall cost (just think about the need to copy
> buffers between different contexts, down to different layers, etc - so I'd
> expect to see ample copy operations before the data finally hits the disk).
> Anyhow, I may be totally wrong...
>
>> I plan to get a setup togeather in the next couple of days that will
>> let
>> me do some testing of the options on the database side.
>>
>
> That would be great. I, for now, intend to look at the queue first. I have
> begun to thought about steps on how to tackle the beast. So I will probably
> not do much more on the database level than throw in some thoughts (but not
> do any testing or coding).

sounds good.

David Lang

> Rainer
>
>> David Lang
>>
>>> Rainer
>>>>
>>>> Regards,
>>>> Ken
>>>>
>>>> On Tue, Apr 21, 2009 at 08:51:37AM -0700, [email protected] wrote:
>>>>> On Tue, 21 Apr 2009, Kenneth Marshall wrote:
>>>>>
>>>>>> On Tue, Apr 21, 2009 at 08:37:54AM -0700, [email protected] wrote:
>>>>>>> Kenneth,
>>>>>>>   could you join the discussion on the rsyslog mailing list?
>>>>>>> rsyslog-users <[email protected]>
>>>>>>>
>>>>>>> I'm surprised to hear you say that rsyslog can already do batch
>>>> inserts
>>>>>>> and
>>>>>>> am interested in how you did that.
>>>>>>>
>>>>>>> what sort of insert rate did you mange to get?
>>>>>>>
>>>>>>> David Lang
>>>>>>>
>>>>>> David,
>>>>>>
>>>>>> I would be happy to join the discussion. I did not mean to say
>>>>>> that rsyslog currently supported batch inserts, just that the
>>>>>> pieces that provide "stand-by queuing" could be used to manage
>>>>>> batching inserts.
>>>>>
>>>>> I've changed the to list to the rsyslog users list.
>>>>>
>>>>> currently the stand-by queuing still handles messages one at a
>> time.
>>>>> however a sponser has been found to pay to changing the rsyslog
>>>> internals
>>>>> to allow for multiple messages to be handled at once, which is what
>>>>> triggered some of this discussion.
>>>>>
>>>>> which version of rsyslog are you working with?
>>>>>
>>>>> when you modified rsyslog to do prepared statement (to avoid the
>>>> escaping
>>>>> and parsing) did you hard-code the prepared statement? what other
>>>> changes
>>>>> did you make?
>>>>>
>>>>> David Lang
>>>>>
>>>>>> Cheers,
>>>>>> Ken
>>>>>>
>>>>>>> On Tue, 21 Apr 2009, Kenneth Marshall wrote:
>>>>>>>
>>>>>>>> Date: Tue, 21 Apr 2009 08:33:30 -0500
>>>>>>>> From: Kenneth Marshall <[email protected]>
>>>>>>>> To: Richard Huxton <[email protected]>
>>>>>>>> Cc: [email protected], Stephen Frost <[email protected]>,
>>>>>>>>     Greg Smith <[email protected]>, pgsql-
>>>> [email protected]
>>>>>>>> Subject: Re: [PERFORM] performance for high-volume log insertion
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I just finished reading this thread. We are currently working on
>>>>>>>> setting up a central log system using rsyslog and PostgreSQL. It
>>>>>>>> works well once we patched the memory leak. We also looked at
>> what
>>>>>>>> could be done to improve the efficiency of the DB interface. On
>>>> the
>>>>>>>> rsyslog side, moving to prepared queries allows you to remove
>> the
>>>>>>>> escaping that needs to be done currently before attempting to
>>>>>>>> insert the data into the SQL backend as well as removing the
>>>> parsing
>>>>>>>> and planning time from the insert. This is a big win for high
>>>> insert
>>>>>>>> rates, which is what we are talking about. The escaping process
>> is
>>>>>>>> also a big CPU user in rsyslog which then hands the escaped
>> string
>>>>>>>> to the backend which then has to undo everything that had been
>>>> done
>>>>>>>> and parse/plan the resulting query. This can use a surprising
>>>> amount
>>>>>>>> of additional CPU. Even if you cannot support a general prepared
>>>>>>>> query interface, by specifying what the query should look like
>> you
>>>>>>>> can handle much of the low-hanging fruit query-wise.
>>>>>>>>
>>>>>>>> We are currently using a date based trigger to use a new
>> partition
>>>>>>>> each day and keep 2 months of logs currently. This can be
>> usefully
>>>>>>>> managed on the backend database, but if rsyslog supported
>> changing
>>>>>>>> the insert to the new table on a time basis, the CPU used by the
>>>>>>>> trigger to support this on the backend could be reclaimed. This
>>>>>>>> would be a win for any DB backend. As you move to the new
>>>> partition,
>>>>>>>> issuing a truncate to clear the table would simplify the DB
>>>> interfaces.
>>>>>>>>
>>>>>>>> Another performance enhancement already mentioned, would be to
>>>>>>>> allow certain extra fields in the DB to be automatically
>> populated
>>>>>>>> as a function of the log messages. For example, logging the mail
>>>> queue
>>>>>>>> id for messages from mail systems would make it much easier to
>>>> locate
>>>>>>>> particular mail transactions in large amounts of data.
>>>>>>>>
>>>>>>>> To sum up, eliminating the escaping in rsyslog through the use
>> of
>>>>>>>> prepared queries would reduce the CPU load on the DB backend.
>>>> Batching
>>>>>>>> the inserts will also net you a big performance increase. Some
>> DB-
>>>> based
>>>>>>>> applications allow for the specification of several types of
>>>> queries,
>>>>>>>> one for single inserts and then a second to support multiple
>>>> inserts
>>>>>>>> (copy). Rsyslog already supports the queuing pieces to allow you
>>>> to
>>>>>>>> batch inserts. Just some ideas.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Ken
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Apr 21, 2009 at 09:56:23AM +0100, Richard Huxton wrote:
>>>>>>>>> [email protected] wrote:
>>>>>>>>>> On Tue, 21 Apr 2009, Stephen Frost wrote:
>>>>>>>>>>> * [email protected] ([email protected]) wrote:
>>>>>>>>>>>> while I fully understand the 'benchmark your situation'
>> need,
>>>> this
>>>>>>>>>>>> isn't
>>>>>>>>>>>> that simple.
>>>>>>>>>>>
>>>>>>>>>>> It really is.  You know your application, you know it's
>> primary
>>>> use
>>>>>>>>>>> cases, and probably have some data to play with.  You're
>>>> certainly in
>>>>>>>>>>> a
>>>>>>>>>>> much better situation to at least *try* and benchmark it than
>>>> we are.
>>>>>>>>>> rsyslog is a syslog server. it replaces (or for debian and
>>>> fedora, has
>>>>>>>>>> replaced) your standard syslog daemon. it recieves log
>> messages
>>>> from
>>>>>>>>>> every
>>>>>>>>>> app on your system (and possibly others), filters, maniulates
>>>> them,
>>>>>>>>>> and
>>>>>>>>>> then stores them somewhere. among the places that it can store
>>>> the
>>>>>>>>>> logs
>>>>>>>>>> are database servers (native support for MySQL, PostgreSQL,
>> and
>>>>>>>>>> Oracle.
>>>>>>>>>> plus libdbi for others)
>>>>>>>>>
>>>>>>>>> Well, from a performance standpoint the obvious things to do
>> are:
>>>>>>>>> 1. Keep a connection open, do NOT reconnect for each log-
>>>> statement
>>>>>>>>> 2. Batch log statements together where possible
>>>>>>>>> 3. Use prepared statements
>>>>>>>>> 4. Partition the tables by day/week/month/year (configurable I
>>>> suppose)
>>>>>>>>>
>>>>>>>>> The first two are vital, the third takes you a step further.
>> The
>>>> fourth
>>>>>>>>> is
>>>>>>>>> a long-term admin thing.
>>>>>>>>>
>>>>>>>>> And possibly
>>>>>>>>> 5. Have two connections, one for fatal/error etc and one for
>>>> info/debug
>>>>>>>>> level log statements (configurable split?). Then you can use
>> the
>>>>>>>>> synchronous_commit setting on the less important ones. Might
>> buy
>>>> you
>>>>>>>>> some
>>>>>>>>> performance on a busy system.
>>>>>>>>>
>>>>>>>>> http://www.postgresql.org/docs/8.3/interactive/runtime-config-
>>>> wal.html#RUNTIME-CONFIG-WAL-SETTINGS
>>>>>>>>>
>>>>>>>>>> other apps then search and report on the data after it is
>>>> stored. what
>>>>>>>>>> apps?, I don't know either. pick your favorite reporting tool
>>>> and
>>>>>>>>>> you'll
>>>>>>>>>> be a step ahead of me (I don't know a really good reporting
>>>> tool)
>>>>>>>>>> as for sample data, you have syslog messages, just like I do.
>> so
>>>> you
>>>>>>>>>> have
>>>>>>>>>> the same access to data that I have.
>>>>>>>>>> how would you want to query them? how would people far less
>>>>>>>>>> experianced
>>>>>>>>>> that you want to query them?
>>>>>>>>>> I can speculate that some people would do two columns (time,
>>>>>>>>>> everything
>>>>>>>>>> else), others will do three (time, server, everything else),
>> and
>>>>>>>>>> others
>>>>>>>>>> will go further (I know some who would like to extract IP
>>>> addresses
>>>>>>>>>> embedded in a message into their own column). some people will
>>>> index
>>>>>>>>>> on
>>>>>>>>>> the time and host, others will want to do full-text searches
>> of
>>>>>>>>>> everything.
>>>>>>>>>
>>>>>>>>> Well, assuming it looks much like traditional syslog, I would
>> do
>>>>>>>>> something
>>>>>>>>> like: (timestamp, host, facility, priority, message). It's easy
>>>> enough
>>>>>>>>> to
>>>>>>>>> stitch back together if people want that.
>>>>>>>>>
>>>>>>>>> PostgreSQL's full-text indexing is quite well suited to
>> logfiles
>>>> I'd
>>>>>>>>> have
>>>>>>>>> thought, since it knows about filenames, urls etc already.
>>>>>>>>>
>>>>>>>>> If you want to get fancy, add a msg_type column and one
>>>> subsidiary
>>>>>>>>> table
>>>>>>>>> for each msg_type. So - you might have smtp_connect_from
>>>> (hostname,
>>>>>>>>> ip_addr). A set of perl regexps can match and extract the
>> fields
>>>> for
>>>>>>>>> these
>>>>>>>>> extra tables, or you could do it with triggers inside the
>>>> database. I
>>>>>>>>> think
>>>>>>>>> it makes sense to do it in the application. Easier for users to
>>>>>>>>> contribute
>>>>>>>>> new patterns/extractions. Meanwhile, the core table is
>> untouched
>>>> so you
>>>>>>>>> don't *need* to know about these extra tables.
>>>>>>>>>
>>>>>>>>> If you have subsidiary tables, you'll want to partition those
>> too
>>>> and
>>>>>>>>> perhaps stick them in their own schema (logs200901, logs200902
>>>> etc).
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>   Richard Huxton
>>>>>>>>>   Archonet Ltd
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Sent via pgsql-performance mailing list
>>>>>>>>> ([email protected])
>>>>>>>>> To make changes to your subscription:
>>>>>>>>> http://www.postgresql.org/mailpref/pgsql-performance
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Sent via pgsql-performance mailing list
>>>>>>> ([email protected])
>>>>>>> To make changes to your subscription:
>>>>>>> http://www.postgresql.org/mailpref/pgsql-performance
>>>>>>>
>>>>>>
>>>>>
>>>> _______________________________________________
>>>> rsyslog mailing list
>>>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>>>> http://www.rsyslog.com
>>> _______________________________________________
>>> rsyslog mailing list
>>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>>> http://www.rsyslog.com
>>>
>> _______________________________________________
>> rsyslog mailing list
>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>> http://www.rsyslog.com
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com
>
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

Re: [rsyslog] [PERFORM] performance for high-volume log insertion

Reply via email to