Thanks a lot, David! This clears up a lot of stuff.

I'll start using mmnormalize then, and I'll bug you guys again if I bump
into issues :)


2013/12/4 David Lang <[email protected]>

> On Wed, 4 Dec 2013, Radu Gheorghe wrote:
>
>> Hi David,
>>
>> Thanks a lot for your reply! I will add my comments inline.
>>
>> 2013/12/4 David Lang <[email protected]>
>>
>>  On Wed, 4 Dec 2013, Radu Gheorghe wrote:
>>>
>>>> Hi list :)
>>>>
>>>> I'm trying to understand if mmnormalize is a good fit for parsing a high
>>>> traffic of logs, given the fact that events are really heterogeneous
>>>> (think
>>>> log4j logs, apache logs, whatever logs are commonly produced).
>>>>
>>>> My only frame of reference is Logstash's grok filter
>>>> <http://logstash.net/docs/1.2.2/filters/grok>, which allows you to
>>>> tag regular expressions in a dictionary, and then use those tags to
>>>> match fields from logs and put them in a structured event. Much like
>>>> how you'd build a liblognorm rulebase.
>>>>
>>>> If I got it right, the advantage of mmnormalize seems to be performance,
>>>> because it avoids using regular expressions. I'm not sure how this
>>>> actually works, though. Practically, it sounds like this comes at the
>>>> expense of flexibility: if I need to add a new "pattern" in liblognorm
>>>> (say, a new date format) I'd have to patch the library itself, no?
>>>>
>>>>
>>> For a completely new type of data you would have to modify the
>>> library, but you seldom need to do that, because when you are
>>> processing the logs all you really care about is that this string of
>>> characters is the date; you aren't parsing the date so that you can do
>>> calculations on it.
>>>
>>>
>> So you're basically saying that if I just want to "copy-paste" a new date,
>> I can simply say "word" or "char-to" and it should work. If I need to
>> parse
>> an SQL date and send it over, for example as an ISO date, I need a new
>> type
>> and therefore liblognorm needs patching. Right?
>>
>
> remember that everything is just a string until it's interpreted.
>
> I believe that if you set a variable to the date and then use that
> variable in a template with a timestamp formatting option, it will get
> interpreted at that point (and if not, sponsoring that feature will be far
> more valuable than another parsing type in liblognorm :-)
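
For what it's worth, treating the date as an opaque string looks simple enough in a rulebase. A sketch in liblognorm rule syntax (untested, and the field names are made up) for a line like "2013-12-04 10:02:33 INFO something happened":

```
rule=:%date:word% %time:word% %severity:word% %msg:rest%
```

The %msg:rest% field at the end is the match-to-end-of-line type; date and time come out as plain strings that a template could reformat later.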
>
>
>> If so, this means that I can either make do with the field types that
>> exist, or patch liblognorm. That was my initial assumption, which
>> leaves me a bit undecided. On one hand, the current set of field types
>> looks like it would suit 99.9999999999999% of the logs out there. On
>> the other hand, you don't really know until you try. I tried to use
>> mmnormalize a few months ago in my setup and I failed because it didn't
>> have anything to match the string until the end of the line. Now it
>> has, so I'm going to give it a second shot. But God knows what will
>> come up next. So it would be nice to have an easy way to define new
>> field types.
>>
>> I'm guessing this is a design thing. You need to have those "specific"
>> types if you want to have the awesome performance. Right?
>>
>
> I believe so. I guess it's possible to introduce a language that could be
> compiled down to something efficient at ruleset load time, but that would
> be adding a lot of complexity, and unless someone can show a need for it,
> it's unlikely to happen.
>
>
>>> As long as you can say 'this string of characters is what I care
>>> about, and I'm going to label it "date"' you are in good shape.
>>>
>>> mmnormalize is far better than regex engines for a couple of reasons.
>>>
>>> 1. full regex support requires supporting some very expensive types of
>>> expressions, even if you don't plan to use them. This costs.
>>>
>>> 2. regex engines almost always go down the list: does regex1 match? If
>>> not, does regex2 match? If not, does regex3 match? ...
>>>
>>> mmnormalize in comparison compiles your config into a parse tree, so it
>>> can walk down the log message a character at a time, looking that
>>> character
>>> up in the parse tree, and when it comes to the end of the line it
>>> knows it has the correct match. So instead of being O(N) in the number
>>> of rules, it's O(1): the cost depends only on the (relatively) short
>>> length of the lines.
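
The difference can be sketched in a toy Python example (an illustration of the idea only, not liblognorm's actual implementation; the rule names and prefixes are made up):

```python
import re

# Sequential regex matching: try each rule in turn.
# Cost grows with the number of rules (O(N) in rules).
RULES = [re.compile(p) for p in (r"foo (\d+)$", r"bar (\d+)$", r"baz (\d+)$")]

def match_sequential(line):
    for rx in RULES:
        m = rx.match(line)
        if m:
            return m.group(1)
    return None

# Parse-tree matching: all rule prefixes share one tree, so we walk the
# line once, character by character, no matter how many rules exist.
# (Real liblognorm nodes also include typed fields like %date:word%.)
def build_tree(prefixes):
    root = {}
    for prefix, name in prefixes:
        node = root
        for ch in prefix:
            node = node.setdefault(ch, {})
        node["$rule"] = name  # marks: the rest of the line is the payload
    return root

TREE = build_tree([("foo ", "foo-rule"), ("bar ", "bar-rule"), ("baz ", "baz-rule")])

def match_tree(line):
    node = TREE
    for i, ch in enumerate(line):
        if "$rule" in node:           # a rule's prefix fully matched
            return node["$rule"], line[i:]
        if ch not in node:            # no rule can match this line
            return None
        node = node[ch]
    return (node["$rule"], "") if "$rule" in node else None
```

With three rules the two behave the same; with thousands of rules the sequential version does thousands of regex attempts per line, while the tree walk still touches each character only once.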
>>>
>>
>>
>> Thanks for the explanation. This makes a lot of sense. So it should really
>> be A LOT faster, which would make a lot of difference at scale.
>>
>
> when you are using 'hello world' type examples you aren't going to see a
> difference, but if you load up hundreds to thousands of rules, you will see
> a huge difference.
>
>
>
>>
>>>
>>>> Speaking of scope, can liblognorm be enhanced to support parsing
>>>> multiline messages? This seems to be possible in grok:
>>>> https://logstash.jira.com/browse/LOGSTASH-692
>>>>
>>>>
>>> multiline logs cause all sorts of problems; in general you should
>>> avoid them, or collapse the multiline logs into a single line when you
>>> get them into your logging system. Too many things will break a
>>> multiline log into multiple logs. In some cases you can carefully
>>> configure everything to handle multiline logs, but it's very fragile
>>> and prevents you from using many tools and transport mechanisms.
>>>
>>
>>
>> Yeah, I know these tend to be a pain. But I have to deal with them.
>> Collapsing sounds like a hack to me, because I need to be aware of what
>> I'm doing down the pipeline. For example, something else that works
>> with the log, like a UI, would need to know that the strange character
>> is actually a newline. I'll probably also have to escape it... The
>> whole thing sounds more complicated (and hackier) than dealing with the
>> newline itself. Especially since, right now at least, my events go from
>> rsyslog to Elasticsearch (probably something else in the future, like
>> HDFS) and then Kibana and some other UI. All these have no problem
>> handling multi-line events, so if rsyslog works with them too, I'll be
>> good.
>>
>
> well, my suggestion is to escape them the same way that other control
> characters are escaped (#xxx)
>
> I've dealt with this issue a bit in the past, and I've found that just
> leaving things escaped actually works well enough to not be worth the
> effort to swap things back.
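
A toy Python sketch of that escaping (assuming the #ooo octal form rsyslog uses for control characters; note that a literal "#012" already present in the message would be ambiguous on the way back):

```python
import re

# Collapse a multiline log into one line using "#ooo" octal escapes for
# control characters (e.g. "\n" -> "#012", "\t" -> "#011").
def escape_ctrl(s):
    return "".join("#%03o" % ord(c) if ord(c) < 0x20 else c for c in s)

# Reverse the escaping, e.g. for a UI that wants real newlines back.
def unescape_ctrl(s):
    return re.sub(r"#([0-7]{3})", lambda m: chr(int(m.group(1), 8)), s)
```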
>
> David Lang
>
>
>>
>>>
>>>> For me, it's important to understand whether I should put effort into
>>>> working with mmnormalize and sponsor needed enhancements, or whether
>>>> sponsoring a new "mmgrok" module would be a better idea for my
>>>> use-case. Because it looks like grok is available as a C library as
>>>> well: https://github.com/jordansissel/grok
>>>>
>>>>
>>> It's not clear what enhancements you are thinking that you need (other
>>> than the multiline support, which as I say is problematic)
>>>
>>
>>
>> To be honest, it's not clear to me either, because I haven't started
>> working with it yet. It should be clear in less than a month, though.
>> Expect the list to be spammed with mmnormalize questions :)
>>
>> My question for now is basically "what's the scope of mmnormalize?". Is
>> it very hard to add a new type? If such additions should be rare and
>> take lots of time, maybe mmgrok makes more sense. Is it very hard or
>> unacceptable to add multi-line support? These are more about the design
>> than about the current functionality, and I need to understand whether
>> enhancing mmnormalize is the way to go for a scenario like mine, or
>> whether I should go for something like mmgrok.
>>
>> Lots of people send logs from rsyslog to Elasticsearch or Solr via
>> stuff like Logstash or Flume because of grok. I'm thinking that if I
>> had grok-like capabilities in rsyslog, I'd be able to skip a step and
>> have an easier and faster setup. If mmnormalize can do that, it sounds
>> like it would be MUCH faster.
>>
>> This is not to say that grok is the only reason one would use
>> Logstash/Flume. With Logstash, for example, you have lots of stuff to
>> modify your events (like a geoip filter
>> <http://logstash.net/docs/1.2.2/filters/geoip>), and it's trivial to
>> add new ones (I've recently committed a Solr output plugin
>> <https://github.com/logstash/logstash/pull/675>, and I'm a noob at
>> Ruby). I don't think rsyslog can (or should?) have all these features.
>> But if you can do the bulk of processing in rsyslog, I bet there will
>> be much more interest in it when it comes to large-scale log
>> processing, because of how fast it is. In my mind, that should draw
>> more testing, more contributions, more sponsoring, and hopefully make
>> everyone happy.
>>
>> Best regards,
>> Radu
>> _______________________________________________
>> rsyslog mailing list
>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>> http://www.rsyslog.com/professional-services/
>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
>> DON'T LIKE THAT.
>>
