I'm assuming you mean "what's missing from mmnormalize"; if that's not what you
meant, I apologize in advance. What is "missing" is the deep level of
regexp support I need. The current regexp support is not nearly robust enough to
support what I've been trying to do.
Here's a sample log line of what I am parsing with logstash:
May 30 17:52:22 mediacast5 rg_events: 10.12.247.179 - - [30/May/2013:17:52:22
+0000] "GET
/events?rg_type=2.4.5:revenue&rg_player_type=standard&rg_publisher=HGTV&rg_publisher_id=1248&rg_domain_category_id=&rg_domain_id=d5e19628176ce4f2ae05a06a4bd9a2f1&rg_page_host_url=Scripting%20Error%20TypeError:%20Cannot%20read%20property%20'width'%20of%20null&rg_ad_domain_id=null&rg_player_uuid=24355d48-cda7-43e4-aabc-76e402764bea&rg_video_provider_id=603&rg_video_catalog_id=562&rg_video_index_id=26&rg_guid=68664b27-3510-48f4-a1be-d0d0b64d3115&rg_session=6305b824c217985075259a92b90542c1&rg_counter=1&rg_event=jwplayerReady&rg_category=Player&rg_action=Impression&rg_ads_version=rg_all_13.140.12.19_flex_sdk_3.6.0.16995_DOUBLECLICK&rg_comscore_version=13.140.12.19_flex_sdk_3.6.0.16995&rg_coordinates=null&rg_visible=null&rg_size=null
HTTP/1.1" 200 0 "http://<url elided>" "Mozilla/5.0 (Windows NT 6.1; WOW64)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36"
I can easily use mmnormalize to parse out most of the large chunks of stuff.
However, when I get to this part:
GET
/events?rg_type=2.4.5:revenue&rg_player_type=standard&rg_publisher=HGTV&rg_publisher_id=1248&rg_domain_category_id=&rg_domain_id=d5e19628176ce4f2ae05a06a4bd9a2f1&rg_page_host_url=Scripting%20Error%20TypeError:%20Cannot%20read%20property%20'width'%20of%20null&rg_ad_domain_id=null&rg_player_uuid=24355d48-cda7-43e4-aabc-76e402764bea&rg_video_provider_id=603&rg_video_catalog_id=562&rg_video_index_id=26
(the rest snipped for brevity) I then need to break that up into discrete K/V
pairs. I want to end up with a structure that looks vaguely like this:
{"rg_type": "2.4.5:revenue",
"rg_player_type": "standard",
"rg_publisher": "HGTV",
"rg_action": "Impression", …}
You get the idea… I basically need to split up all the params on the request
line into K/V pairs. Now, these values are in an arbitrary order. They are
also not always all there. The set of pairs is dynamic (we add and remove
various beacons as we conduct more experiments). Some of the fields are URL
encoded and need to be decoded as well.
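As a rough illustration of the splitting and decoding described above, here is a minimal Python sketch. It is not the logstash filter itself; the function name `split_request_params` is hypothetical, and the sample is a shortened version of the request line shown earlier. It relies only on the standard library's `urllib.parse.parse_qsl`, which handles arbitrary ordering, missing fields, and percent-decoding.

```python
import json
from urllib.parse import parse_qsl


def split_request_params(request_target):
    """Return the query parameters of a request target as a dict.

    Hypothetical helper for illustration only.
    """
    # Everything after the first '?' is the query string.
    query = request_target.partition("?")[2]
    # keep_blank_values=True keeps empty fields such as rg_domain_category_id=
    # parse_qsl also percent-decodes keys and values (%20 becomes a space).
    return dict(parse_qsl(query, keep_blank_values=True))


# Shortened sample taken from the log line above.
sample = (
    "/events?rg_type=2.4.5:revenue&rg_player_type=standard"
    "&rg_publisher=HGTV&rg_domain_category_id="
    "&rg_page_host_url=Scripting%20Error%20TypeError:%20Cannot%20read"
    "%20property%20'width'%20of%20null&rg_action=Impression"
)

params = split_request_params(sample)
print(json.dumps(params, indent=2))
```

Because the result is an ordinary dict, fields that are absent from a given request simply do not appear, and new beacons show up without any change to the parser.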
All this needs to be converted into a JSON-ish structure and inserted into
Elasticsearch (I also route them onto a zeromq bus in CEE format). We need to
be able to drive counters, filters and actions based upon different parameters
within the request on the fly. We need them indexed in ES so we can look for
trends and perform analytics on them. We use them to track revenue events as
well as player events, errors, etc.
I have a moderately complex logstash grok filter to do this. Logstash has very
deep regexp support for this sort of thing, and it also has built-in plugins to
do K/V splitting, on-the-fly conversions, etc. In this case, I use rsyslog to
route the message to logstash, let logstash filter and parse it, and then insert
it into the appropriate (daily) indexes in Elasticsearch. I also have various
routing rules in rsyslog to trigger actions and counters based upon regexps it
finds. For instance, if it finds a stream progress event that looks like
"rg_event=Stream%20Progress <int> <int> 0" that is what we call a "stream
start" and I tick a redis counter for that directly from rsyslog. However,
even after quite a bit of work I've been unable to make mmnormalize come
anywhere close to parsing out the stuff I need into appropriate structures like
I can with logstash.
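The "stream start" check mentioned above can be sketched in Python as follows. This is only an illustration under assumptions: it assumes the rg_event value has already been URL-decoded and has the shape "Stream Progress <int> <int> 0", and it uses an in-memory `Counter` as a stand-in for the redis counter. The names `STREAM_START` and `tick_if_stream_start` are hypothetical and do not come from the rsyslog configuration.

```python
import re
from collections import Counter

# Assumed shape of a decoded "stream start" event value:
# "Stream Progress <int> <int> 0"
STREAM_START = re.compile(r"Stream Progress \d+ \d+ 0")

# In-memory stand-in for the redis counter (assumption for this sketch).
counters = Counter()


def tick_if_stream_start(rg_event):
    """Increment the stream_start counter if rg_event is a stream start."""
    if STREAM_START.fullmatch(rg_event):
        counters["stream_start"] += 1
        return True
    return False


for event in ["Stream Progress 12 300 0", "Stream Progress 12 300 5", "jwplayerReady"]:
    tick_if_stream_start(event)

print(counters["stream_start"])
```

Only the first of the three sample values ends in 0, so only that one ticks the counter.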
To get similar functionality without rsyslog, I'd probably have to use Splunk as
my main router and indexer. However, Splunk is WAY less powerful in the
high-speed routing arena and is way more suited for handling logs themselves and not
really "events" (an event is NOT a log entry IMO, an event is something that is
happening right now and is actionable while a log entry is a record of
something that already happened). The above use case is just a small piece of
what I'm using rsyslog for; I'm also using it as a message bus to route events
to an event driven architecture (CEE messages get generated by clients and
injected through the syslog stream; rsyslog has strong support for complex routing,
so I route those messages to specific clients to trigger actions, sort of like
what you would use AMQP or the like for, but with a lot less overhead). Not
everything I do gets routed to logstash, just the stuff that needs really
complex parsing prior to indexing it. The intention is that I will do away
with the godawful legacy format above and move everything to a common preparsed
CEE-style message packet prior to injection into the system, at which point I
more than likely won't need logstash anymore.
Rsyslog routes and handles WAY more traffic. It allows me to designate
specific destinations for these streams and do fancy routing tricks. It
handles backlogs way more gracefully. It allows me to drive a lot of stuff
directly with very few moving parts. It's a great piece of software, but it's
not both a floor wax AND a furniture polish :) At least not yet anyway.
Now, how do I make this work? I use omfwd over TCP and set a rebind interval to stream
to a load balancer. The heart of the system (rsyslog) can handle about 350,000
messages per sec as I currently have it configured and tuned. This is WAY
above my current traffic and well within capacity for my projected traffic to
come. Behind the load balancer, I have various logstash instances running
(with multiple filter worker threads on logstash). Since right now each
logstash worker thread can handle approximately 750 events per second and I am
streaming approximately 2000 events per sec currently, I have to have at least
3 worker threads to handle that traffic. I am steadily ramping up on my event
rate and am expecting a two orders of magnitude increase by October (we were at
about 400 events per second 2 months ago). I can run 3-4 worker thread
instances on each logstash machine and with a little math can easily project
how many logstash instances I need behind the load balancer to handle various
levels of load. I have the addition of new parsers fairly automated and am
working towards having rsyslog trigger automated upscaling and downscaling:
it would monitor the impstats output for message traffic and queue sizes and
trigger (via a ZMQ message) an automated action to bring more parsers online
or take some offline when they are no longer needed. I
don't have the very last step done yet (the automatic trigger) but am finally
done with all the groundwork I needed to accomplish in order to be able to have
that happen.
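The "little math" for sizing can be sketched as below. The rates are the ones quoted above (about 750 events per second per worker thread, about 2000 events per second now, a projected hundredfold increase, 3-4 worker threads per machine); the function names are hypothetical, and the sketch deliberately ignores headroom and failover capacity.

```python
import math


def workers_needed(events_per_sec, per_worker_rate=750):
    """Minimum number of logstash worker threads for a given event rate."""
    return math.ceil(events_per_sec / per_worker_rate)


def machines_needed(events_per_sec, workers_per_machine, per_worker_rate=750):
    """Minimum number of logstash machines behind the load balancer."""
    workers = workers_needed(events_per_sec, per_worker_rate)
    return math.ceil(workers / workers_per_machine)


current_rate = 2000                  # events per second today
projected_rate = current_rate * 100  # two orders of magnitude more

print(workers_needed(current_rate))        # worker threads needed today
print(workers_needed(projected_rate))      # worker threads at projected load
print(machines_needed(projected_rate, 4))  # machines at 4 workers each
print(machines_needed(projected_rate, 3))  # machines at 3 workers each
```

With these inputs the current load needs 3 worker threads, and the projected load needs 267 worker threads, or between 67 and 89 machines depending on whether each machine runs 4 or 3 workers.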
In essence, I'm using rsyslog for much more than just a centralized high speed
log router. It's the heart of my event architecture here. TBH, I would
probably have to build a WHOLE lot more stuff and worry about a lot of
different failure points if I didn't have rsyslog at the heart.
-- Gary F.
On May 30, 2013, at 3:51 AM, Rainer Gerhards <[email protected]> wrote:
> On Thu, May 30, 2013 at 1:07 AM, Gary Foster <[email protected]> wrote:
>
>> Well, you have to do one or the other… either adjust your rsyslog output
>> template to match the template kibana uses on your output or tweak kibana
>> to expect the template you do use. I think the first option is the most
>> sensible.
>>
>> As for logstash, yeah if you don't have to do a lot of parsing going
>> straight from rsyslog to elastic search is probably a better solution. I
>> don't (currently) have that option but I'm working towards it.
>>
> Let me hijack that thread to ask what's actually missing. Please pardon me
> if we came across it and my memory faded away... ;)
>
> The reason I ask is that I think it would be good if we could get up some
> guide on using Kibana with rsyslog (and patches to rsyslog if needed and
> doable).
>
> Rainer
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com/professional-services/
> What's up with rsyslog? Follow https://twitter.com/rgerhards
> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of
> sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T
> LIKE THAT.