Thanks a lot, David! This clears up a lot of stuff. I'll start using mmnormalize then, and I'll bug you guys again if I bump into issues :)
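For reference, the parse-tree idea David describes in the quoted thread below can be sketched roughly as follows. This is only a toy illustration in Python, not how liblognorm is actually implemented: the rule format (lists of literal strings and `("field", name)` tuples), the `Node` class and the `add_rule` / `normalize` helpers are all made up for the example. The point it tries to show is that the line is walked once, character by character, no matter how many rules are loaded.

```python
class Node:
    """One node of a toy parse tree built from all rules at load time."""
    def __init__(self):
        self.children = {}  # literal character -> Node
        self.field = None   # (field_name, Node) for a "capture one word" edge
        self.tag = None     # set when some rule ends at this node

def add_rule(root, tag, parts):
    """Merge one rule into the tree; rules with a common prefix share nodes."""
    node = root
    for part in parts:
        if isinstance(part, tuple):  # hypothetical ("field", name) placeholder
            if node.field is None:   # toy simplification: first name wins
                node.field = (part[1], Node())
            node = node.field[1]
        else:
            for ch in part:
                node = node.children.setdefault(ch, Node())
    node.tag = tag

def normalize(root, line):
    """Walk the line once; the cost depends on its length, not on the rule count."""
    node, i, fields = root, 0, {}
    while i < len(line):
        nxt = node.children.get(line[i])
        if nxt is not None:              # a literal edge matches this character
            node, i = nxt, i + 1
        elif node.field is not None:     # otherwise capture up to the next space
            name, node = node.field
            end = line.find(" ", i)
            if end == -1:
                end = len(line)
            if end == i:                 # empty word: treat as no match
                return None
            fields[name] = line[i:end]
            i = end
        else:
            return None                  # no rule covers this line
    return (node.tag, fields) if node.tag else None

root = Node()
add_rule(root, "login", ["user ", ("field", "user"), " logged in from ", ("field", "ip")])
add_rule(root, "logout", ["user ", ("field", "user"), " logged out"])

print(normalize(root, "user radu logged in from 10.0.0.1"))
print(normalize(root, "user bob logged out"))
```

With a list of regexes, each extra rule is one more pattern to try per line. Here the two rules share the `user <word> logged ` prefix, so adding rules grows the tree but not the number of steps taken for a given line. The sketch does no backtracking and only knows one field type, so it is far simpler than a real rulebase.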
2013/12/4 David Lang

> On Wed, 4 Dec 2013, Radu Gheorghe wrote:
>
>> Hi David,
>>
>> Thanks a lot for your reply! I will add my comments inline.
>>
>> 2013/12/4 David Lang
>>
>>> On Wed, 4 Dec 2013, Radu Gheorghe wrote:
>>>
>>>> Hi list :)
>>>>
>>>> I'm trying to understand if mmnormalize is a good fit for parsing a
>>>> high volume of logs, given the fact that events are really
>>>> heterogeneous (think log4j logs, apache logs, whatever logs are
>>>> commonly produced).
>>>>
>>>> My only frame of reference is Logstash's grok filter
>>>> <http://logstash.net/docs/1.2.2/filters/grok>, which allows you to
>>>> tag regular expressions in a dictionary, and then use those tags to
>>>> match fields from logs, and put them in a structured event. Much
>>>> like how you'd build a liblognorm rulebase.
>>>>
>>>> If I got it right, the advantage of mmnormalize seems to be
>>>> performance, because it avoids using regular expressions. Not sure
>>>> how this actually works, though. Practically, it sounds like this
>>>> comes at the expense of flexibility: if I need to add a new
>>>> "pattern" in liblognorm (say, a new date format) I'd have to patch
>>>> the library itself, no?
>>>
>>> For a completely new type of data you would have to modify the
>>> library, but you seldom need to do that because when you are
>>> processing the logs, all you really care about is that this string of
>>> characters is the date. You aren't parsing the date so that you can
>>> do calculations on it.
>>
>> So you're basically saying that if I just want to "copy-paste" a new
>> date, I can simply say "word" or "char-to" and it should work. If I
>> need to parse an SQL date and send it over, for example as an ISO date,
>> I need a new type and therefore liblognorm needs patching. Right?
>
> remember that everything is just a string until it's interpreted.
>
> I believe that if you set a variable to the date and then use that
> variable in a template with a timestamp formatting option, it will get
> interpreted at that point (and if not, sponsoring that feature will be
> far more valuable than another parsing type in liblognorm :-)
>
>> If so, this means that I can either make do with the field types that
>> exist, or patch liblognorm. That was my initial assumption, which
>> leaves me a bit undecided. On one hand, the current set of field types
>> looks like it would suit 99.9999999999999% of the logs out there. On
>> the other hand, you don't really know until you try. I tried to use
>> mmnormalize a few months ago in my setup and I failed because it didn't
>> have something to match the string until the end of the line. Now it
>> has, so I'm going to give it a second shot. But God knows what will be
>> coming up next. So it would be nice to have an easy way to define new
>> field types.
>>
>> I'm guessing this is a design thing. You need to have those "specific"
>> types if you want to have the awesome performance. Right?
>
> I believe so. I guess it's possible to introduce a language that could
> be compiled down to something efficient at ruleset load time, but that
> would be adding a lot of complexity, and unless someone can show a need
> for it, it's unlikely to happen.
>
>>> As long as you can say 'this string of characters is what I care
>>> about, and I'm going to label it "date"' you are in good shape.
>>>
>>> mmnormalize is far better than regex engines for a couple of reasons.
>>>
>>> 1. full regex support requires supporting some very expensive types
>>> of expressions, even if you don't plan to use them. This costs.
>>>
>>> 2. regex engines almost always go down the list, does regex1 match, if
>>> not does regex2 match, if not does regex3 match, ....
>>>
>>> mmnormalize in comparison compiles your config into a parse tree, so
>>> it can walk down the log message a character at a time, looking that
>>> character up in the parse tree, and when it comes to the end of the
>>> line it knows it has the correct match. So instead of being O(N) in
>>> the number of rules, it's O(1) in the number of rules and depends
>>> only on the (relatively) short length of the lines.
>>
>> Thanks for the explanation. This makes a lot of sense. So it should
>> really be A LOT faster, which would make a lot of difference at scale.
>
> when you are using 'hello world' type examples you aren't going to see a
> difference, but if you load up hundreds to thousands of rules, you will
> see a huge difference.
>
>>>> Speaking of scope, can liblognorm be enhanced to support parsing
>>>> multiline messages? This seems to be possible in grok:
>>>> https://logstash.jira.com/browse/LOGSTASH-692
>>>
>>> multiline logs cause all sorts of problems. In general you should
>>> avoid them or collapse the multiline logs into a single line when you
>>> get them into your logging system; too many things will break a
>>> multiline log into multiple logs. In some cases you can carefully
>>> configure everything to handle multiline logs, but it's very fragile
>>> and prevents you from using many tools and transport mechanisms.
>>
>> Yeah, I know these tend to be a pain. But I have to deal with them.
>> Collapsing sounds like a hack to me because I need to be aware of what
>> I'm doing down the pipeline. For example, something else that works
>> with the log, like a UI, would need to know that the strange character
>> is actually a newline. I'll probably also have to escape it... The
>> whole thing sounds more complicated (and hackier) than dealing with the
>> newline itself. Especially since, right now at least, from rsyslog my
>> events go to Elasticsearch (probably something else in future, like
>> HDFS) and then Kibana and some other UI.
>> All these have no problem handling multi-line events, so if rsyslog
>> works with them, too, I'll be good.
>
> well, my suggestion is to escape them the same way that other control
> characters are escaped (#xxx)
>
> I've dealt with this issue a bit in the past, and I've found that just
> leaving things escaped actually works well enough to not be worth the
> effort to swap things back.
>
> David Lang
>
>>>> For me, it's important to understand whether I should put effort
>>>> into working with mmnormalize and sponsor needed enhancements, or
>>>> whether sponsoring a new "mmgrok" module would be a better idea for
>>>> my use-case. Because it looks like grok is available as a C library
>>>> as well: https://github.com/jordansissel/grok
>>>
>>> It's not clear what enhancements you are thinking that you need (other
>>> than the multiline support, which as I say is problematic)
>>
>> To be honest, it's not clear to me either, because I haven't started
>> working with it yet. It should be clear in less than a month, though.
>> Expect the list to be spammed with mmnormalize questions :)
>>
>> My question for now is basically "what's the scope of mmnormalize?". Is
>> it very hard to add a new type? If such additions should be rare and
>> take lots of time, maybe mmgrok makes more sense. Is it very hard or
>> unacceptable to add multi-line support? These are more about the design
>> than about the current functionality, and I need to understand if
>> enhancing mmnormalize is the way to go for a scenario like mine or I
>> should go for something like mmgrok.
>>
>> Lots of people send logs from rsyslog to Elasticsearch or Solr via
>> stuff like Logstash or Flume because of grok. I'm thinking that if I
>> had grok-like capabilities in rsyslog, I'd be able to skip a step and
>> have an easier and faster setup. If mmnormalize can do that, it sounds
>> like it would be MUCH faster.
>>
>> This is not to say that grok is the only reason one would use
>> Logstash/Flume. With Logstash, for example, you have lots of stuff to
>> modify your events (like a geoip filter
>> <http://logstash.net/docs/1.2.2/filters/geoip>), and it's trivial to
>> add new ones (I've recently committed a Solr output plugin
>> <https://github.com/logstash/logstash/pull/675> and I'm a noob at
>> Ruby). I don't think rsyslog can (or should?) have all these features.
>> But if you can do the bulk of processing in rsyslog I can bet there
>> will be much more interest in it when it comes to large-scale log
>> processing, because of how fast it is. In my mind, that should draw
>> more testing, more contributions, more sponsoring and hopefully make
>> everyone happy.
>>
>> Best regards,
>> Radu
>> _______________________________________________
>> rsyslog mailing list
>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>> http://www.rsyslog.com/professional-services/
>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
>> DON'T LIKE THAT.

