David, thanks again for the good feedback. I have a couple of points:
First of all: I agree that a "one size fits all" approach does not exist. Hence the idea to craft a library, which can then be used in various places: e.g. inside message parsers, output modules, or even processing runs outside of rsyslog.

Secondly, an efficient normalizing parser tree need not be much slower than the regular parser. I think that the parser overhead will be very acceptable for average messages. The normalized data that has been gathered is another story, however. In short, it is extra data, so copy overhead is much higher everywhere. Also, accessing the properties takes some time. I guess that's the primary problem in a real-time solution, even if very efficient lookup methods are used.

On the normalized properties: I think it is really worth the effort to try to define as broad a set of normalized properties as possible. But that does not imply it all needs to be done at once. It can be an evolutionary approach. First of all, one would look for pretty obvious things like traffic flow data (one may even reuse existing data models like the one from IPFIX!) and user login/logoff activity. The key point IMHO is that we would just need to gather what people have used so far and try to get folks to re-use the fields that already exist.

If we have success with this approach, I think we will have a huge benefit in the reporting and analysis area, where programs could then work on standard property sets (of normalized syslog data) rather than on the raw data itself. So you do not need to write an analyzer for each of the myriad vendor/device/version formats, but rather only once for the normalized data, and then create a one-line (!) parse template for each of the vendor/device/version formats (of course, one for each message there, so it would be multiple lines, but very easy and intuitive to do).

However, all of that does not really work without community involvement.
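To make the "one-line parse template" idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `%field%` template syntax, the property names (`user`, `src_ip`, `src_port`), the two sample message formats, and the helper names are hypothetical and not an existing rsyslog feature. It only shows how two different message formats could end up in the same normalized property set without any regex.

```python
# Hypothetical template syntax: literal text with %field% placeholders.
# Assumption: two fields are always separated by some literal text.

def compile_template(template):
    """Split a template into ('lit', text) and ('field', name) steps."""
    steps = []
    for i, part in enumerate(template.split("%")):
        if not part:
            continue
        kind = "field" if i % 2 else "lit"  # odd chunks sit between % signs
        if kind == "field" and steps and steps[-1][0] == "field":
            raise ValueError("adjacent fields need a literal separator")
        steps.append((kind, part))
    return steps


def parse(steps, message):
    """Return a dict of normalized properties, or None if no match."""
    pos = 0
    props = {}
    for idx, (kind, value) in enumerate(steps):
        if kind == "lit":
            if not message.startswith(value, pos):
                return None
            pos += len(value)
        else:
            # A field runs up to the next literal, or to the end of the message.
            if idx + 1 < len(steps):
                end = message.find(steps[idx + 1][1], pos)
                if end == -1:
                    return None
            else:
                end = len(message)
            props[value] = message[pos:end]
            pos = end
    return props if pos == len(message) else None


# One line per vendor/device message format, all mapping to the same names.
TEMPLATES = [
    "Accepted password for %user% from %src_ip% port %src_port% ssh2",
    "Login succeeded: user=%user% ip=%src_ip%",
]
COMPILED = [compile_template(t) for t in TEMPLATES]


def normalize(message):
    """Try each template in turn; return the first match or None."""
    for steps in COMPILED:
        props = parse(steps, message)
        if props is not None:
            return props
    return None


if __name__ == "__main__":
    print(normalize("Accepted password for alice from 10.0.0.5 port 2222 ssh2"))
    print(normalize("Login succeeded: user=bob ip=192.168.1.9"))
```

An analyzer written against `user` and `src_ip` would then work for both formats. Note that this sketch tries the templates one after another; a real implementation would presumably merge them into a single parser tree so that the cost does not grow with the number of templates.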
No one (except large commercial entities) can create the necessary mass of parse templates alone.

Rainer

> -----Original Message-----
> From: [email protected] [mailto:rsyslog-
> [email protected]] On Behalf Of [email protected]
> Sent: Friday, February 26, 2010 10:33 PM
> To: rsyslog-users
> Subject: Re: [rsyslog] Log Normalization effort
>
> On Fri, 26 Feb 2010, Rainer Gerhards wrote:
>
> > Hi all,
> >
> > I have blogged about my quest for log normalization. I think there is some
> > good information on the upcoming GPLed Adiscon LogAnalyzer and future
> > directions for rsyslog in the blog post. So I thought I share the link:
> >
> > http://blog.gerhards.net/2010/02/syslog-normalization.html
> >
> > Please note that part of the effort requires community involvement. I would
> > be very interested to learn if you think we could win enough support to make
> > this a useful effort. I am asking for your feedback, because it will help me
> > streamline my priorities for future rsyslog work.
>
> a few comments (but remember that I am usually dealing with high data
> rates, so my concerns are biased in that direction)
>
> log analysis is usually done in batches as opposed to in real-time.
> some of this is due to the difficulty in doing it in real time, but a lot
> of it is the processing overhead (you don't want to take so long to
> process an individual request that you miss the next one to arrive)
>
> at low volumes the idea of name-value pairs in the logs makes a lot of
> sense, but there is significantly more overhead in parsing a log with
> name-value pairs in arbitrary orders than there is in using a tree parsing
> approach to analyze known log formats in a fixed order. The message size
> can also increase significantly. As a result, at high traffic volumes this
> starts to be a bad (or at least questionable) idea.
>
> I would love to see rsyslog gain the ability to efficiently do tree-based
> parsing instead of regex parsing.
> regex parsing is easy to understand and tinker with, but very expensive
> to implement. it may be that having something that 'compiles' a list of
> regex parsers into a tree parser is the right answer for usability. I
> would save several hours of processing a day if I could easily (and
> efficiently) make rsyslog write different logs to different files (at
> high data rates and with a few hundred conditions based variations in
> the syslog tag)
>
> While there are some common events across different types of logs (logins
> for example) they almost always contain slightly different data in them. I
> also have no faith at all that anyone is going to make much effort to
> clean up their logs to make them nicely parseable, and if they do I see
> even less chance that they will end up using the same terms for the same
> thing. As such I see more value in trying to get samples of logs and what
> they mean than in trying to define a normalized version to shoehorn the
> logs into. It is worth doing this for some events (logins, failed logins
> for example), but I think it's a mistake to think that this will end up
> covering all, or even the majority of log messages.
>
> There's also a problem in that the ideal format for the output depends on
> what you are doing with the output.
>
> If I could wave a magic wand and get the result I would look for something
> like this
>
> the parser starts at the beginning of the message (at the priority) and
> can branch on priority/facility, timestamp, host, syslogtag, message and
> indicate if the message should be parsed into name-value pairs, or split
> based on a character (or character sequence like the perl split command
> allows) into individually addressable elements (defaulting to whitespace
> separated elements), then the format (and if needed dynafile path/file
> components) could be constructed from these variables.
> At any point in the parsing it should be possible to jump to another
> parser tree (so that you could say that sm-mta, sendmail, Sendmail, etc
> as syslog tags all end up using the same parser for the message without
> having to redefine the rules for each one)
>
> With this capability, people could start writing parser 'branches' to
> understand a specific log type and output a 'standard' format (as such a
> format can start to be defined).
>
> This can be done in rsyslog today, but it is fairly difficult to define,
> and as I understand it, inefficient enough that it's not practical to do
> in real-time under heavy load.
>
> If this is fast enough, then the next step would be to add the ability to
> have the format/action be 'increment a counter for log type X' and a
> signal to rsyslog could generate a report on these counters. Although at
> some point it becomes better to feed the message into another opensource
> tool (SEC, Simple Event Correlator for example) instead of trying to do
> everything in rsyslog.
>
> parsing the file to know what to do with it, and be able to re-format log
> messages is very definitely something that can fit into the rsyslog model
> of receiving, formatting, and delivering logs. Alerting on specific log
> entries, counting the number of times one thing shows up in logs, and
> this sort of thing start pushing beyond the core of rsyslog, and it may
> be better to feed other tools instead.
>
> David Lang
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com

