On Wed, 22 Apr 2009, Rainer Gerhards wrote: >> -----Original Message---- >> From: [email protected] [mailto:rsyslog- >> [email protected]] On Behalf Of [email protected] >> Sent: Wednesday, April 22, 2009 8:26 AM >> To: rsyslog-users >> Subject: Re: [rsyslog] [PERFORM] performance for high-volume log >> insertion >> >> On Wed, 22 Apr 2009, Rainer Gerhards wrote: >> >>> Hi Ken, >>> >>> glad to have you here. I am a bit silent at the moment, because I am >> not a >>> real database guy and so I am primarily listening to any information >> that is >>> incoming. If you have a couple of minutes, it would be useful to >> review this >>> thread here: >>> >>> http://lists.adiscon.net/pipermail/rsyslog/2009-April/002003.html >>> >>> ...one comment inline below... >>> >>>> -----Original Message----- >>>> From: [email protected] [mailto:rsyslog- >>>> [email protected]] On Behalf Of Kenneth Marshall >>>> >>>> David, >>>> >>>> Okay, I am now subscribed to the mailing list. We are currently >> using >>>> rsyslog-3.20.x. As far as implementing a prototype of the prepared >>>> statement, I was sidetracked by other duties and have not had a >> chance >>>> to do anything but an initial evaluation. As far as the rsyslog >>>> internal >>>> escaping, it looked simplest to create another template type like >> the >>>> current SQL and STDSQL that indicated that escaping was not needed >>>> and/or that prepared statements should be used. >>> >>> To disable escaping, simply do not use SQL or STDSQL. However, the db >> outputs >>> currently require this option (easy to disable), because I cannot see >> how it >>> will work (with the existing code) without escaping. >>> >>> Any idea is most welcome. >> >> when using prepared statement escaping is not needed. > > ... but that can only be on the non-text API level (e.g. by using libpq). On > the SQL text level, I have no idea how to tell the sql engine to insert ' - > if I don't say '''' but ''' how does the engine know what I say? With the > C-level API, I bind a parameter and specify a buffer and then put my > character into that buffer - no escaping needed for sure. But, again, I do > not see how this would work on the text level...
correct. >> according to >> Ken's >> message below he found that the overhead of doing the escaping was >> significant. I don't see why this should be the case, but if it >> requires >> making an extra copy of the string I guess it's possible >> > > I am surprised, too. Even if a copy is made (I think it is), this is a quick > in-memory operation. Given the rest of the picture, I would expect that to > have very low impact on the overall cost (just think about the need to copy > buffers between different contexts, down to different layers, etc - so I'd > expect to see ample copy operations before the data finally hits the disk). > Anyhow, I may be totally wrong... > >> I plan to get a setup togeather in the next couple of days that will >> let >> me do some testing of the options on the database side. >> > > That would be great. I, for now, intend to look at the queue first. I have > begun to thought about steps on how to tackle the beast. So I will probably > not do much more on the database level than throw in some thoughts (but not > do any testing or coding). sounds good. David Lang > Rainer > >> David Lang >> >>> Rainer >>>> >>>> Regards, >>>> Ken >>>> >>>> On Tue, Apr 21, 2009 at 08:51:37AM -0700, [email protected] wrote: >>>>> On Tue, 21 Apr 2009, Kenneth Marshall wrote: >>>>> >>>>>> On Tue, Apr 21, 2009 at 08:37:54AM -0700, [email protected] wrote: >>>>>>> Kenneth, >>>>>>> could you join the discussion on the rsyslog mailing list? >>>>>>> rsyslog-users <[email protected]> >>>>>>> >>>>>>> I'm surprised to hear you say that rsyslog can already do batch >>>> inserts >>>>>>> and >>>>>>> am interested in how you did that. >>>>>>> >>>>>>> what sort of insert rate did you mange to get? >>>>>>> >>>>>>> David Lang >>>>>>> >>>>>> David, >>>>>> >>>>>> I would be happy to join the discussion. I did not mean to say >>>>>> that rsyslog currently supported batch inserts, just that the >>>>>> pieces that provide "stand-by queuing" could be used to manage >>>>>> batching inserts. >>>>> >>>>> I've changed the to list to the rsyslog users list. >>>>> >>>>> currently the stand-by queuing still handles messages one at a >> time. >>>>> however a sponser has been found to pay to changing the rsyslog >>>> internals >>>>> to allow for multiple messages to be handled at once, which is what >>>>> triggered some of this discussion. >>>>> >>>>> which version of rsyslog are you working with? >>>>> >>>>> when you modified rsyslog to do prepared statement (to avoid the >>>> escaping >>>>> and parsing) did you hard-code the prepared statement? what other >>>> changes >>>>> did you make? >>>>> >>>>> David Lang >>>>> >>>>>> Cheers, >>>>>> Ken >>>>>> >>>>>>> On Tue, 21 Apr 2009, Kenneth Marshall wrote: >>>>>>> >>>>>>>> Date: Tue, 21 Apr 2009 08:33:30 -0500 >>>>>>>> From: Kenneth Marshall <[email protected]> >>>>>>>> To: Richard Huxton <[email protected]> >>>>>>>> Cc: [email protected], Stephen Frost <[email protected]>, >>>>>>>> Greg Smith <[email protected]>, pgsql- >>>> [email protected] >>>>>>>> Subject: Re: [PERFORM] performance for high-volume log insertion >>>>>>>> Hi, >>>>>>>> >>>>>>>> I just finished reading this thread. We are currently working on >>>>>>>> setting up a central log system using rsyslog and PostgreSQL. It >>>>>>>> works well once we patched the memory leak. We also looked at >> what >>>>>>>> could be done to improve the efficiency of the DB interface. On >>>> the >>>>>>>> rsyslog side, moving to prepared queries allows you to remove >> the >>>>>>>> escaping that needs to be done currently before attempting to >>>>>>>> insert the data into the SQL backend as well as removing the >>>> parsing >>>>>>>> and planning time from the insert. This is a big win for high >>>> insert >>>>>>>> rates, which is what we are talking about. The escaping process >> is >>>>>>>> also a big CPU user in rsyslog which then hands the escaped >> string >>>>>>>> to the backend which then has to undo everything that had been >>>> done >>>>>>>> and parse/plan the resulting query. This can use a surprising >>>> amount >>>>>>>> of additional CPU. Even if you cannot support a general prepared >>>>>>>> query interface, by specifying what the query should look like >> you >>>>>>>> can handle much of the low-hanging fruit query-wise. >>>>>>>> >>>>>>>> We are currently using a date based trigger to use a new >> partition >>>>>>>> each day and keep 2 months of logs currently. This can be >> usefully >>>>>>>> managed on the backend database, but if rsyslog supported >> changing >>>>>>>> the insert to the new table on a time basis, the CPU used by the >>>>>>>> trigger to support this on the backend could be reclaimed. This >>>>>>>> would be a win for any DB backend. As you move to the new >>>> partition, >>>>>>>> issuing a truncate to clear the table would simplify the DB >>>> interfaces. >>>>>>>> >>>>>>>> Another performance enhancement already mentioned, would be to >>>>>>>> allow certain extra fields in the DB to be automatically >> populated >>>>>>>> as a function of the log messages. For example, logging the mail >>>> queue >>>>>>>> id for messages from mail systems would make it much easier to >>>> locate >>>>>>>> particular mail transactions in large amounts of data. >>>>>>>> >>>>>>>> To sum up, eliminating the escaping in rsyslog through the use >> of >>>>>>>> prepared queries would reduce the CPU load on the DB backend. >>>> Batching >>>>>>>> the inserts will also net you a big performance increase. Some >> DB- >>>> based >>>>>>>> applications allow for the specification of several types of >>>> queries, >>>>>>>> one for single inserts and then a second to support multiple >>>> inserts >>>>>>>> (copy). Rsyslog already supports the queuing pieces to allow you >>>> to >>>>>>>> batch inserts. Just some ideas. >>>>>>>> >>>>>>>> Regards, >>>>>>>> Ken >>>>>>>> >>>>>>>> >>>>>>>> On Tue, Apr 21, 2009 at 09:56:23AM +0100, Richard Huxton wrote: >>>>>>>>> [email protected] wrote: >>>>>>>>>> On Tue, 21 Apr 2009, Stephen Frost wrote: >>>>>>>>>>> * [email protected] ([email protected]) wrote: >>>>>>>>>>>> while I fully understand the 'benchmark your situation' >> need, >>>> this >>>>>>>>>>>> isn't >>>>>>>>>>>> that simple. >>>>>>>>>>> >>>>>>>>>>> It really is. You know your application, you know it's >> primary >>>> use >>>>>>>>>>> cases, and probably have some data to play with. You're >>>> certainly in >>>>>>>>>>> a >>>>>>>>>>> much better situation to at least *try* and benchmark it than >>>> we are. >>>>>>>>>> rsyslog is a syslog server. it replaces (or for debian and >>>> fedora, has >>>>>>>>>> replaced) your standard syslog daemon. it recieves log >> messages >>>> from >>>>>>>>>> every >>>>>>>>>> app on your system (and possibly others), filters, maniulates >>>> them, >>>>>>>>>> and >>>>>>>>>> then stores them somewhere. among the places that it can store >>>> the >>>>>>>>>> logs >>>>>>>>>> are database servers (native support for MySQL, PostgreSQL, >> and >>>>>>>>>> Oracle. >>>>>>>>>> plus libdbi for others) >>>>>>>>> >>>>>>>>> Well, from a performance standpoint the obvious things to do >> are: >>>>>>>>> 1. Keep a connection open, do NOT reconnect for each log- >>>> statement >>>>>>>>> 2. Batch log statements together where possible >>>>>>>>> 3. Use prepared statements >>>>>>>>> 4. Partition the tables by day/week/month/year (configurable I >>>> suppose) >>>>>>>>> >>>>>>>>> The first two are vital, the third takes you a step further. >> The >>>> fourth >>>>>>>>> is >>>>>>>>> a long-term admin thing. >>>>>>>>> >>>>>>>>> And possibly >>>>>>>>> 5. Have two connections, one for fatal/error etc and one for >>>> info/debug >>>>>>>>> level log statements (configurable split?). Then you can use >> the >>>>>>>>> synchronous_commit setting on the less important ones. Might >> buy >>>> you >>>>>>>>> some >>>>>>>>> performance on a busy system. >>>>>>>>> >>>>>>>>> http://www.postgresql.org/docs/8.3/interactive/runtime-config- >>>> wal.html#RUNTIME-CONFIG-WAL-SETTINGS >>>>>>>>> >>>>>>>>>> other apps then search and report on the data after it is >>>> stored. what >>>>>>>>>> apps?, I don't know either. pick your favorite reporting tool >>>> and >>>>>>>>>> you'll >>>>>>>>>> be a step ahead of me (I don't know a really good reporting >>>> tool) >>>>>>>>>> as for sample data, you have syslog messages, just like I do. >> so >>>> you >>>>>>>>>> have >>>>>>>>>> the same access to data that I have. >>>>>>>>>> how would you want to query them? how would people far less >>>>>>>>>> experianced >>>>>>>>>> that you want to query them? >>>>>>>>>> I can speculate that some people would do two columns (time, >>>>>>>>>> everything >>>>>>>>>> else), others will do three (time, server, everything else), >> and >>>>>>>>>> others >>>>>>>>>> will go further (I know some who would like to extract IP >>>> addresses >>>>>>>>>> embedded in a message into their own column). some people will >>>> index >>>>>>>>>> on >>>>>>>>>> the time and host, others will want to do full-text searches >> of >>>>>>>>>> everything. >>>>>>>>> >>>>>>>>> Well, assuming it looks much like traditional syslog, I would >> do >>>>>>>>> something >>>>>>>>> like: (timestamp, host, facility, priority, message). It's easy >>>> enough >>>>>>>>> to >>>>>>>>> stitch back together if people want that. >>>>>>>>> >>>>>>>>> PostgreSQL's full-text indexing is quite well suited to >> logfiles >>>> I'd >>>>>>>>> have >>>>>>>>> thought, since it knows about filenames, urls etc already. >>>>>>>>> >>>>>>>>> If you want to get fancy, add a msg_type column and one >>>> subsidiary >>>>>>>>> table >>>>>>>>> for each msg_type. So - you might have smtp_connect_from >>>> (hostname, >>>>>>>>> ip_addr). A set of perl regexps can match and extract the >> fields >>>> for >>>>>>>>> these >>>>>>>>> extra tables, or you could do it with triggers inside the >>>> database. I >>>>>>>>> think >>>>>>>>> it makes sense to do it in the application. Easier for users to >>>>>>>>> contribute >>>>>>>>> new patterns/extractions. Meanwhile, the core table is >> untouched >>>> so you >>>>>>>>> don't *need* to know about these extra tables. >>>>>>>>> >>>>>>>>> If you have subsidiary tables, you'll want to partition those >> too >>>> and >>>>>>>>> perhaps stick them in their own schema (logs200901, logs200902 >>>> etc). >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Richard Huxton >>>>>>>>> Archonet Ltd >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Sent via pgsql-performance mailing list >>>>>>>>> ([email protected]) >>>>>>>>> To make changes to your subscription: >>>>>>>>> http://www.postgresql.org/mailpref/pgsql-performance >>>>>>>>> >>>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Sent via pgsql-performance mailing list >>>>>>> ([email protected]) >>>>>>> To make changes to your subscription: >>>>>>> http://www.postgresql.org/mailpref/pgsql-performance >>>>>>> >>>>>> >>>>> >>>> _______________________________________________ >>>> rsyslog mailing list >>>> http://lists.adiscon.net/mailman/listinfo/rsyslog >>>> http://www.rsyslog.com >>> _______________________________________________ >>> rsyslog mailing list >>> http://lists.adiscon.net/mailman/listinfo/rsyslog >>> http://www.rsyslog.com >>> >> _______________________________________________ >> rsyslog mailing list >> http://lists.adiscon.net/mailman/listinfo/rsyslog >> http://www.rsyslog.com > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com > _______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com

