2013/10/8 David Lang <[email protected]>

> On Tue, 8 Oct 2013, Radu Gheorghe wrote:
>
>  2013/10/8 David Lang <[email protected]>
>>
>>  The big problem with this idea is that in general messages do not stay in
>>> the queues for very long. They are there only until they can get
>>> delivered
>>> to their destinations.
>>>
>>> If you were to have rsyslog stop delivering the messages to a destination
>>> and then hang on to them all, how would rsyslog ever know if all the
>>> destinations had completed their polling?
>>>
>>>
>> Ah, I think there's a misunderstanding here. I'm not suggesting to give
>> access to any action's queue. I'm suggesting to have a separate queue for
>> the REST API. For example, rsyslog is listening to /dev/log, and instead
>> of
>> forwarding those messages via TCP, put them in the action queue and wait
>> for something to pull those messages from the queue. When one asks for a
>> batch of messages, they get "consumed".
>>
>>
>>
>>> It is easy enough to have rsyslog deliver messages to a queue of your
>>> choice (may I suggest that you look at either Zero MQ or Rabbit MQ which
>>> are already supported by rsyslog), and then you can deal with them at
>>> your
>>> convenience.
>>>
>>>
>> Right, thanks for mentioning. This will work as well. Although having the
>> REST API directly in rsyslog will eliminate the need of another moving
>> piece.
>>
>
> well, it's another moving piece in any case, whether you call it part of
> rsyslog or separate :-)


I think you know it's not really the same thing. I mean, I get what you're
saying: this option implies more complexity than simply forwarding via TCP.
But compare this:

rsyslog + rabbitmq output + rabbitmq + elasticsearch rabbitmq river +
elasticsearch

vs

rsyslog + rest output + elasticsearch rsyslog river + elasticsearch

What annoys me about the first setup is not just the extra piece of
software. It's the fact that you need to scale it, and that messages get
parsed multiple times, wasting resources, because RabbitMQ has to parse
those incoming messages and store them.

To be clear: I'm not implying that a pull model is superior to a push
model. For example, in the way I use it now, I prefer push. But a pull
model is preferable for some use-cases (see below).


>
>
>  There's also the problem that giving outside programs access to the
>>> rsyslog internals would be fragile, it wouldn't work with any other
>>> syslog
>>> daemon (destroying the value of standardization), and it would probably
>>> not
>>> continue to work even with rsyslog in the face of updates or even config
>>> changes.
>>>
>>>
>> Ah, but this has nothing to do with syslog standards. Think of it as a
>> different output, like omelasticsearch. Only instead of pushing logs,
>> rsyslog will store them so they can get pulled.
>>
>
> ahh, but anything that pretends to be serious about dealing with logs can
> handle syslog, if you create a new mechanism, then you will have to modify
> every tool to deal with the new protocol and format. It's still lots of new
> modules to write, just now it's a new module for each piece of software
> that will consume the logs rather than a module for rsyslog to feed it to
> the other software more efficiently.


I mostly agree with this. To be specific, I agree that most tools
understand syslog and that an output module is the most efficient way for
rsyslog to push logs forward. The problem is, if you look closely, most
tools know about syslog, but they don't know about all of its flavors.
They might know about RFC5424, but maybe not about CEE. They might support
TCP, but maybe not octet-counted framing. Or maybe not RELP.
Rsyslog offers a lot of possibilities now, but a lot of log-processing
tools up or down the chain are not yet up to date with all of them. Of
course, the best solution here would be to update those tools, but that's
not always trivial. I'm a big fan of modern syslog protocols and I'd be
really sad to see them fail to be widely adopted because other tools don't
make use of their functionality in a timely fashion.
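For the record, octet-counted framing (the RFC 6587 scheme) is simple: each
message on the TCP stream is prefixed with its length in decimal ASCII plus
a single space, so the receiver never has to guess where a message ends and
messages may safely contain newlines. A minimal illustration in Python:

```python
# Octet-counted TCP framing (RFC 6587): "<LEN> <MSG>", back to back.
# The length prefix contains no space, so the first space after the
# prefix unambiguously separates it from the message body.

def frame(msg: str) -> bytes:
    """Encode one message with octet-counted framing."""
    data = msg.encode("utf-8")
    return str(len(data)).encode("ascii") + b" " + data

def unframe(stream: bytes) -> list[str]:
    """Decode a byte stream of back-to-back octet-counted messages."""
    msgs = []
    i = 0
    while i < len(stream):
        sp = stream.index(b" ", i)           # end of the length prefix
        length = int(stream[i:sp])
        start = sp + 1
        msgs.append(stream[start:start + length].decode("utf-8"))
        i = start + length
    return msgs
```

Contrast that with newline-delimited framing, where a multi-line message
(a Java stack trace, say) gets split into several bogus messages.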

Being able to pull syslog data from anywhere would open the door to
workarounds that fill that gap. Until, hopefully, the new stuff becomes
just as ubiquitous as RFC3164 over UDP is.

And I'm not pretending such a REST API will help rsyslog process logs more
efficiently. Clearly, the push model with output plugins is the best way
to go if that's the goal.

I'm talking about functionality here. For example, let's say you need your
logs in Cassandra (or just pick some data store that's not yet supported).
A REST API will help in the following use-cases:
- your firewalls are set up in such a way that it's preferable to pull data
rather than push
- you want to make sure Cassandra is not overwhelmed, so you want to make a
plugin for Cassandra that pulls logs from rsyslog rather than the other way
around
- you just need something quick, and the REST API allows you to write a
one-page script in ANY programming language that takes logs from rsyslog
and pushes them to Cassandra. Not the best way, but you can always fire up
multiple such scripts, and that will get you quite far until you manage to
get an omcassandra working. Think of this REST API as omprog on steroids.


>
>
>  As for the logstash recommendations, I can tell you that the idea of
>>> putting all your logs through a single chain of processing the way they
>>> do
>>> doesn't scale that well. At really high log volumes, you need to be able
>>> to
>>> split your processing not only across different servers, but across
>>> different farms of servers.
>>>
>>>
>> The "ASCII diagram" was just a sketch of the flow, I didn't include
>> scaling
>> in there. With Logstash in particular, you can have multiple "receivers"
>> across multiple servers, then if you use Redis as a buffer you can cluster
>> that, and you can have multiple Logstash "indexers" to send logs to
>> Elasticsearch, which can be clustered, too.
>>
>> Maybe my drawing was confusing, sorry for that. I wanted to show the chain
>> of processing in terms of technology, not in terms of scale.
>>
>
> but even if there are multiple systems in each set, the idea of having all
> the logs go to one set of systems, then get forwarded to the next set of
> systems, etc is not really the best way to go.
>
> Yes, you want to get all the logs funneled into one set of central
> systems, but from there you want to fan out to multiple different
> destinations as quickly as possible, one destination may be your
> elasticsearch cluster, another may be a simple archive to disk, while
> another is an event correlation cluster, etc.
>
> take a look at the architecture I outline in this paper
> https://www.usenix.org/conference/lisa12/building-100k-logsec-logging-infrastructure
>

Aaaah, I see what you mean now. I actually watched that talk quite a while
ago, and I've been paying more attention to your emails ever since :)
Very nice, thanks for sharing.

Back to the fan out thing, I agree that it's useful in many use-cases. And
it's easier to fan-out by doing push to many destinations, rather than
having multiple destinations pull.
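To make the fan-out-by-push idea concrete, here's a rough sketch of what
such a central stage could look like in rsyslog's v7 config syntax — one
incoming stream, several independent actions, each with its own
disk-assisted queue so a slow destination can't stall the others. The
hostnames, file names, and queue names are placeholders:

```
module(load="omelasticsearch")

# destination 1: the Elasticsearch cluster
action(type="omelasticsearch" server="es.example.com" bulkmode="on"
       queue.type="LinkedList" queue.filename="q_es")

# destination 2: a simple archive to disk
action(type="omfile" file="/var/log/archive/all.log")

# destination 3: the event-correlation cluster, via plain TCP forwarding
action(type="omfwd" target="correlate.example.com" port="514"
       protocol="tcp" queue.type="LinkedList" queue.filename="q_corr")
```

Each queue.filename makes that action's queue disk-assisted, which is
what buys the independence between destinations.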

That said, I think/hope the trend is to reduce the number of destinations,
because that uses fewer resources and implies fewer moving parts (yes. I
know. I have a few ideas and I insist on them :p).

For example, with search engines like Elasticsearch or Solr, not long ago
people typically stored data somewhere else and used the search engine for
real-time search only. Now the storage layer in both is a lot more stable,
plus you get the extra performance of doing just one request instead of a
search followed by a pull from the "main" data store. So more deployments
are now using ES/Solr as a NoSQL data store that provides fast search.
Still, something like HDFS may be deployed alongside them for complex
analytics via map/reduce. But the analytics capabilities of those two are
developing rapidly (e.g. pivot faceting in Solr
<http://docs.lucidworks.com/display/solr/Faceting#Faceting-Pivot%28DecisionTree%29Faceting>
and aggregations in ES
<https://github.com/elasticsearch/elasticsearch/issues/3300>).

Best regards,
Radu
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
