I'm assuming you mean "what's missing from mmnormalize"; if that's not what you 
meant, I apologize in advance.  What is "missing" is the deep level of regexp 
support I need.  The existing regexp support is not nearly robust enough for 
what I've been trying to do.

Here's a sample log line of what I am parsing with logstash:

May 30 17:52:22 mediacast5 rg_events: 10.12.247.179 - - [30/May/2013:17:52:22 
+0000] "GET 
/events?rg_type=2.4.5:revenue&rg_player_type=standard&rg_publisher=HGTV&rg_publisher_id=1248&rg_domain_category_id=&rg_domain_id=d5e19628176ce4f2ae05a06a4bd9a2f1&rg_page_host_url=Scripting%20Error%20TypeError:%20Cannot%20read%20property%20'width'%20of%20null&rg_ad_domain_id=null&rg_player_uuid=24355d48-cda7-43e4-aabc-76e402764bea&rg_video_provider_id=603&rg_video_catalog_id=562&rg_video_index_id=26&rg_guid=68664b27-3510-48f4-a1be-d0d0b64d3115&rg_session=6305b824c217985075259a92b90542c1&rg_counter=1&rg_event=jwplayerReady&rg_category=Player&rg_action=Impression&rg_ads_version=rg_all_13.140.12.19_flex_sdk_3.6.0.16995_DOUBLECLICK&rg_comscore_version=13.140.12.19_flex_sdk_3.6.0.16995&rg_coordinates=null&rg_visible=null&rg_size=null
 HTTP/1.1" 200 0 "http://<url elided>" "Mozilla/5.0 (Windows NT 6.1; WOW64) 
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36"

I can easily use mmnormalize to parse out most of the large chunks of stuff.  
However, when I get to this part:

GET 
/events?rg_type=2.4.5:revenue&rg_player_type=standard&rg_publisher=HGTV&rg_publisher_id=1248&rg_domain_category_id=&rg_domain_id=d5e19628176ce4f2ae05a06a4bd9a2f1&rg_page_host_url=Scripting%20Error%20TypeError:%20Cannot%20read%20property%20'width'%20of%20null&rg_ad_domain_id=null&rg_player_uuid=24355d48-cda7-43e4-aabc-76e402764bea&rg_video_provider_id=603&rg_video_catalog_id=562&rg_video_index_id=26

(the rest snipped for brevity) I then need to break that up into discrete K/V 
pairs.  I want to end up with a structure that looks vaguely like this:

{"rg_type": "2.4.5:revenue",
 "rg_player_type": "standard",
 "rg_publisher": "HGTV",
 "rg_action": "Impression", …}

You get the idea… I basically need to split up all the params on the request 
line into K/V pairs.  Now, these values are in an arbitrary order.  They are 
also not always all there.  The set of pairs is dynamic (we add and remove 
various beacons as we conduct more experiments).  Some of the fields are URL 
encoded and need to be decoded as well.
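
For illustration only, here is a minimal Python sketch of that split-and-decode 
step.  It is not the logstash filter described in this message; the function 
name parse_request_params and the use of Python's urllib are assumptions made 
for the example.  It handles pairs in any order, tolerates missing or empty 
fields, and URL-decodes keys and values.

```python
import json
from urllib.parse import parse_qsl


def parse_request_params(request_line):
    """Split the query string of an HTTP request line into decoded K/V pairs.

    Expects something like: GET /events?k1=v1&k2=v2 HTTP/1.1
    (hypothetical helper, named for this example only).
    """
    parts = request_line.split()
    if len(parts) < 2:
        return {}
    # Everything after the first "?" in the request target is the query string.
    _, _, query = parts[1].partition("?")
    # parse_qsl percent-decodes keys and values; keep_blank_values=True keeps
    # empty fields such as "rg_domain_category_id=" instead of dropping them.
    # If a key repeats, dict() keeps the last value seen.
    return dict(parse_qsl(query, keep_blank_values=True))


if __name__ == "__main__":
    sample = (
        "GET /events?rg_type=2.4.5:revenue&rg_player_type=standard"
        "&rg_publisher=HGTV&rg_domain_category_id="
        "&rg_page_host_url=Scripting%20Error%20TypeError:%20Cannot%20read"
        "%20property%20'width'%20of%20null&rg_action=Impression HTTP/1.1"
    )
    print(json.dumps(parse_request_params(sample), indent=1))
```

Because the result is a plain dict, fields that are added or removed later 
need no change to the parsing code.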

All this needs to be converted into a JSON-ish structure and inserted into 
elastic search (I also route them onto a zeromq bus in CEE format).  We need to 
be able to drive counters, filters and actions based upon different parameters 
within the request on the fly.  We need them indexed in ES so we can look for 
trends and perform analytics on them.  We use them to track revenue events as 
well as player events, errors, etc.

I have a moderately complex logstash grok filter to do this.  Logstash has very 
deep regexp support for this sort of thing, and it also has built in plugins to 
do K/V splitting, on the fly conversions, etc.  In this case, I use rsyslog to 
route the message to logstash, let logstash filter and parse it and then insert 
it into the appropriate (daily) indexes in elastic search.  I also have various 
routing rules in rsyslog to trigger actions and counters based upon regexps it 
finds.  For instance, if it finds a stream progress event that looks like 
"rg_event=Stream%20Progress <int> <int> 0" that is what we call a "stream 
start" and I tick a redis counter for that directly from rsyslog.  However, 
even after quite a bit of work I've been unable to make mmnormalize come 
anywhere close to parsing out the stuff I need into appropriate structures like 
I can with logstash.
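
To make the "stream start" rule concrete, here is a hedged Python sketch of 
that kind of match.  The exact encoding of the event is an assumption: the 
pattern below accepts either a literal space or %20 between the fields, and a 
collections.Counter stands in for the redis counter.  The names STREAM_START 
and tick_if_stream_start are hypothetical.

```python
import re
from collections import Counter

# Separator between fields: assumed to be either "%20" or a literal space.
SEP = r"(?:%20| )"

# "Stream Progress <int> <int> 0", with the final 0 ending the value
# (followed by "&", whitespace, or end of string).
STREAM_START = re.compile(
    r"rg_event=Stream" + SEP + r"Progress"
    + SEP + r"(\d+)" + SEP + r"(\d+)" + SEP + r"0(?=&|\s|$)"
)


def tick_if_stream_start(message, counters):
    """Increment counters["stream_start"] if the message is a stream start."""
    if STREAM_START.search(message):
        counters["stream_start"] += 1
        return True
    return False


if __name__ == "__main__":
    counters = Counter()  # stand-in for the real counter store
    tick_if_stream_start(
        "GET /events?rg_event=Stream%20Progress%2012%2034%200&rg_category=Player HTTP/1.1",
        counters,
    )
    print(counters)
```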

To get similar functionality without rsyslog I'd probably have to use splunk as 
my main router and indexer.  However, splunk is WAY less powerful in the high 
speed routing arena and is way more suited for handling logs themselves and not 
really "events" (an event is NOT a log entry IMO, an event is something that is 
happening right now and is actionable while a log entry is a record of 
something that already happened).  The above use case is just a small piece of 
what I'm using rsyslog for; I'm also using it as a message bus to route events 
to an event driven architecture (CEE messages get generated by clients and 
injected through the syslog stream; rsyslog has strong support for complex 
routing, so I route those messages to specific clients to trigger actions, much 
as you would with AMQP or the like but with a lot less overhead).  Not 
everything I do gets routed to logstash, just the stuff that needs really 
complex parsing prior to indexing it.  The intention is that I will do away 
with the godawful legacy format above and move everything to a common preparsed 
CEE-style message packet prior to injection into the system, at which point I 
more than likely won't need logstash anymore.

Rsyslog routes and handles WAY more traffic.  It allows me to designate 
specific destinations for these streams and do fancy routing tricks.  It 
handles backlogs way more gracefully.  It allows me to drive a lot of stuff 
directly with very few moving parts.  It's a great piece of software, but it's 
not both a floor wax AND a furniture polish :)  At least not yet anyway.

Now, how do I make this work?  I use omtcp and set a rebind interval to stream 
to a load balancer.  The heart of the system (rsyslog) can handle about 350,000 
messages per sec as I currently have it configured and tuned.  This is WAY 
above my current traffic and well within capacity for my projected traffic to 
come.  Behind the load balancer, I have various logstash instances running 
(with multiple filter worker threads on logstash).  Since right now each 
logstash worker thread can handle approximately 750 events per second and I am 
streaming approximately 2000 events per sec currently, I have to have at least 
3 worker threads to handle that traffic.  I am steadily ramping up on my event 
rate and am expecting a two orders of magnitude increase by October (we were at 
about 400 events per second 2 months ago).  I can run 3-4 worker thread 
instances on each logstash machine and with a little math can easily project 
how many logstash instances I need behind the load balancer to handle various 
levels of load.  I have the addition of new parsers fairly automated and am 
working towards having rsyslog trigger automated upscaling and downscaling 
events: by watching message traffic and queue sizes in the impstats output, it 
would fire (via a ZMQ message) an automated action to bring more parsers 
online or take some offline when they are no longer needed.  I 
don't have the very last step done yet (the automatic trigger) but am finally 
done with all the groundwork I needed to accomplish in order to be able to have 
that happen.
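
The "little math" can be sketched as follows in Python, using the figures from 
this message (roughly 750 events per second per worker thread, 3-4 worker 
threads per logstash machine).  The function names are made up for the 
example, the 200,000 events per second figure is simply two orders of 
magnitude above the current 2000, and no headroom is included.

```python
import math


def workers_needed(events_per_sec, per_worker_rate=750):
    """Smallest number of worker threads that can absorb the event rate."""
    return math.ceil(events_per_sec / per_worker_rate)


def instances_needed(events_per_sec, workers_per_instance, per_worker_rate=750):
    """Smallest number of logstash machines, given worker threads per machine."""
    workers = workers_needed(events_per_sec, per_worker_rate)
    return math.ceil(workers / workers_per_instance)


if __name__ == "__main__":
    for rate in (2000, 200_000):
        print(
            rate,
            workers_needed(rate),
            instances_needed(rate, workers_per_instance=4),
            instances_needed(rate, workers_per_instance=3),
        )
```

At 2000 events per second this gives 3 worker threads, matching the figure 
above; at 200,000 it gives 267 worker threads, or 67 to 89 machines depending 
on whether each runs 4 or 3 threads.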

In essence, I'm using rsyslog for much more than just a centralized high speed 
log router.  It's the heart of my event architecture here.  TBH, I would 
probably have to build a WHOLE lot more stuff and worry about a lot of 
different failure points if I didn't have rsyslog at the heart.

-- Gary F.

On May 30, 2013, at 3:51 AM, Rainer Gerhards <[email protected]> wrote:

> On Thu, May 30, 2013 at 1:07 AM, Gary Foster <[email protected]> wrote:
> 
>> Well, you have to do one or the other… either adjust your rsyslog output
>> template to match the template kibana uses on your output or tweak kibana
>> to expect the template you do use.  I think the first option is the most
>> sensible.
>> 
>> As for logstash, yeah if you don't have to do a lot of parsing going
>> straight from rsyslog to elastic search is probably a better solution.  I
>> don't (currently) have that option but I'm working towards it.
>> 
> Let me hijack that thread to ask what's actually missing. Please pardon me
> if we came across it and my memory fade away... ;)
> 
> The reason I ask is that I think it would be good if we could get up some
> guide on using Kibana with rsyslog (and patches to rsyslog if needed and
> doable).
> 
> Rainer
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com/professional-services/
> What's up with rsyslog? Follow https://twitter.com/rgerhards
> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
> sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T 
> LIKE THAT.
