Re: [rsyslog] Log Normalization effort

david Fri, 26 Feb 2010 13:33:36 -0800

On Fri, 26 Feb 2010, Rainer Gerhards wrote:

> Hi all,
>
> I have blogged about my quest for log normalization. I think there is some
> good information on the upcoming GPLed Adiscon LogAnalyzer and future
> directions for rsyslog in the blog post. So I thought I share the link:
>
> http://blog.gerhards.net/2010/02/syslog-normalization.html
>
> Please note that part of the effort requires community involvement. I would
> be very interested to learn if you think we could win enough support to make
> this a useful effort. I am asking for your feedback, because it will help me
> streamline my priorities for future rsyslog work.


a few comments (but remember that I am usually dealing with high data 
rates, so my concerns are biased in that direction)

log analysis is usually done in batches as opposed to in real-time. 
some of this is due to the difficulty in doing it in real time, but a lot 
of it is the processing overhead (you don't want to take so long to 
process an individual request that you miss the next one to arrive)

at low volumes the idea of name-value pairs in the logs makes a lot of 
sense, but there is significantly more overhead in parsing a log with 
name-value pairs in arbitrary orders than there is in using a tree parsing 
approach to analyze known log formats in a fixed order. The message size 
can also increase significantly. As a result, at high traffic volumes this 
starts to be a bad (or at least questionable) idea.
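
To make the trade-off concrete, here is a minimal Python sketch of the same event encoded both ways. The field names (user, src, result) and the sample messages are invented for illustration; they do not come from any real log format.

```python
# Hypothetical fixed field order for a "login" event.
FIELDS = ("user", "src", "result")


def parse_positional(msg):
    """Fixed order: one split, fields are assigned by position."""
    return dict(zip(FIELDS, msg.split(" ")))


def parse_namevalue(msg):
    """Arbitrary order: every element must also be split into name and value."""
    return dict(item.split("=", 1) for item in msg.split(" "))


positional = "alice 10.0.0.5 ok"
namevalue = "result=ok src=10.0.0.5 user=alice"

# Both carry the same information, but the name-value form is longer
# and needs an extra split per field.
print(parse_positional(positional) == parse_namevalue(namevalue))
print(len(positional), len(namevalue))
```

The extra per-field work and the larger message are small for one event, but they are paid on every message.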

I would love to see rsyslog gain the ability to do efficient tree-based 
parsing instead of regex parsing. Regex parsing is easy to understand and 
tinker with, but very expensive to run. It may be that having 
something that 'compiles' a list of regex parsers into a tree parser is 
the right answer for usability. I would save several hours of processing 
a day if I could easily (and efficiently) make rsyslog write different 
logs to different files (at high data rates and with a few hundred 
conditions based on variations in the syslog tag).
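
As a rough illustration of the tree idea, the sketch below routes messages to files with a prefix tree keyed on the syslog tag. It is Python, not rsyslog code; the class name TagTrie, the file paths, and the tags are all assumptions made up for the example.

```python
class TagTrie:
    """Prefix tree mapping syslog tag prefixes to output file names."""

    def __init__(self):
        self.children = {}
        self.target = None

    def add(self, prefix, target):
        """Register an output target for every tag starting with prefix."""
        node = self
        for ch in prefix:
            node = node.children.setdefault(ch, TagTrie())
        node.target = target

    def lookup(self, tag, default=None):
        """Return the target of the longest registered prefix of tag."""
        node = self
        best = default
        for ch in tag:
            node = node.children.get(ch)
            if node is None:
                break
            if node.target is not None:
                best = node.target
        return best


# Hypothetical routing table; a real one would hold a few hundred entries.
routes = TagTrie()
routes.add("sshd", "/var/log/split/sshd.log")
routes.add("sendmail", "/var/log/split/mail.log")
routes.add("sm-mta", "/var/log/split/mail.log")

print(routes.lookup("sshd[1234]:", "/var/log/split/other.log"))
```

The point of the design is that a lookup walks at most one node per character of the tag, so its cost does not grow with the number of rules, whereas a list of regexes is tried one after another.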


While there are some common events across different types of logs (logins 
for example) they almost always contain slightly different data in them. I 
also have no faith at all that anyone is going to make much effort to 
clean up their logs to make them nicely parseable, and if they do I see 
even less chance that they will end up using the same terms for the same 
thing. As such I see more value in trying to get samples of logs and what 
they mean than in trying to define a normalized version to shoehorn the 
logs into. It is worth doing this for some events (logins, failed logins 
for example), but I think it's a mistake to think that this will end up 
covering all, or even the majority of log messages.

There's also a problem in that the ideal format for the output depends on 
what you are doing with the output.


If I could wave a magic wand and get the result, I would look for something 
like this:

The parser starts at the beginning of the message (at the priority) and 
can branch on priority/facility, timestamp, host, syslogtag, and message. 
Each branch indicates whether the message should be parsed into name-value 
pairs or split on a character (or a character sequence, as the perl split 
command allows) into individually addressable elements, defaulting to 
whitespace-separated elements. The format (and, if needed, dynafile 
path/file components) could then be constructed from these variables. At 
any point in the parsing it should be possible to jump to another parser 
tree, so that sm-mta, sendmail, Sendmail, etc. as syslog tags all end up 
using the same parser for the message without having to redefine the rules 
for each one.
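
A minimal Python sketch of that wish, under stated assumptions: the rule table PARSERS, its "mode"/"sep" keys, the ("jump", tag) convention, and the sample messages are all hypothetical, and the header parsing assumes a classic "<pri>Mmm dd hh:mm:ss host tag: msg" layout. It is not how rsyslog is configured.

```python
# Hypothetical rule table: a tag maps either to a parser definition or to
# ("jump", other_tag), which reuses another branch instead of redefining it.
PARSERS = {
    "sendmail": {"mode": "namevalue", "sep": ", "},
    "sm-mta": ("jump", "sendmail"),
    "Sendmail": ("jump", "sendmail"),
    "default": {"mode": "split", "sep": None},  # None = split on whitespace
}


def parse_header(line):
    """Walk the fixed-order header: priority, timestamp, host, tag."""
    pri_end = line.index(">")
    pri = int(line[1:pri_end])
    rest = line[pri_end + 1:]
    timestamp, rest = rest[:15], rest[16:]  # "Mmm dd hh:mm:ss" is 15 chars
    host, rest = rest.split(" ", 1)
    tag, msg = rest.split(": ", 1)
    tag = tag.split("[", 1)[0]  # drop a trailing "[pid]" if present
    fields = {"facility": pri // 8, "severity": pri % 8,
              "timestamp": timestamp, "host": host, "tag": tag}
    return fields, msg


def resolve(tag):
    """Find the parser for a tag, following jumps to shared branches."""
    rule = PARSERS.get(tag, PARSERS["default"])
    while isinstance(rule, tuple):
        rule = PARSERS[rule[1]]
    return rule


def parse(line):
    fields, msg = parse_header(line)
    rule = resolve(fields["tag"])
    if rule["mode"] == "namevalue":
        for item in msg.split(rule["sep"]):
            if "=" in item:
                name, value = item.split("=", 1)
                fields[name] = value
    else:
        fields["elements"] = msg.split(rule["sep"])
    return fields


mail = parse("<22>Feb 26 13:33:36 mx1 sm-mta[4321]: "
             "from=<a@example.com>, size=1024, nrcpts=1")
# Build a dynafile-style path from the parsed variables.
print("/var/log/{host}/{tag}.log".format_map(mail))
print(mail["size"])
```

Here sm-mta and Sendmail both resolve to the sendmail branch, and any unknown tag falls back to whitespace-separated elements.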

With this capability, people could start writing parser 'branches' to 
understand a specific log type and output a 'standard' format (as such a 
format can start to be defined).

This can be done in rsyslog today, but it is fairly difficult to define, 
and as I understand it, inefficient enough that it's not practical to do 
in real-time under heavy load.


If this is fast enough, then the next step would be to add the ability to 
have the format/action be 'increment a counter for log type X', so that a 
signal to rsyslog could generate a report on these counters. At some 
point, though, it becomes better to feed the message into another open 
source tool (SEC, the Simple Event Correlator, for example) instead of 
trying to do everything in rsyslog.
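
The counter idea can be sketched in a few lines of Python. This is only an illustration of the mechanism, not an existing rsyslog feature; the function names and the choice of SIGUSR1 as the reporting signal are assumptions.

```python
import signal
import sys
from collections import Counter

counters = Counter()


def count_message(log_type):
    """The 'action' for a rule: just increment a counter for this log type."""
    counters[log_type] += 1


def format_report():
    """Return one 'type<TAB>count' line per log type, sorted by type."""
    return [f"{log_type}\t{n}" for log_type, n in sorted(counters.items())]


def report(signum=None, frame=None):
    """Signal handler: dump the current counters to stderr."""
    for line in format_report():
        print(line, file=sys.stderr)


if __name__ == "__main__":
    # SIGUSR1 is an arbitrary choice and only exists on Unix-like systems.
    if hasattr(signal, "SIGUSR1"):
        signal.signal(signal.SIGUSR1, report)
    for log_type in ("login", "login_failed", "login"):
        count_message(log_type)
    report()
```

Sending the chosen signal to the process (for example with kill -USR1) would then print the counters without interrupting message processing for long.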

Parsing the log to know what to do with it, and being able to re-format log 
messages, is very definitely something that can fit into the rsyslog model 
of receiving, formatting, and delivering logs. Alerting on specific log 
entries, counting the number of times one thing shows up in the logs, and 
that sort of thing start pushing beyond the core of rsyslog, and it may be 
better to feed other tools instead.

David Lang
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com
