Re: [rsyslog] Logging to central server / data loss ....

Gareth Bult Thu, 11 Sep 2014 01:57:12 -0700

Hi David,

>If you can't tell, I've done a little bit of work in this direction :-)

Yes, I can tell .. :-)

I wasn't suggesting for a second that an entire system could be built in such
a way that it never failed; there's no accounting for Armageddon! However, I
still think it would be nice (even if impossible) if the logging system [in
principle] was 100% within the scope of its own environment.

> "a person with one watch knows what time it is, a person with two watches is 
> never sure"

Yes, I like that, indeed it was the point I was trying to make .. :)

>If you can show what the normal operation of the system is, even having the 
>logging system modify the logs as they flow through the system is acceptable 
>(as long as you can show the changes are done consistently)

Mmm, quite possibly. However my experience of being in court and facing an
expert witness is that as soon as you accept a [potential] flaw in your
mechanism for capturing information to be used as evidence, in real terms that
evidence becomes worthless. So if we start with a view that we can't achieve
100% in terms of integrity, I have to wonder if it's worth worrying about
signing at all .. (?)

I would agree that signing each message is too expensive; what I was
suggesting was a tool to verify the integrity of a message. (If the physical
process involves verifying the integrity of "more" than the message, i.e. a
block of data, so be it.)

In terms of hardware, sure, it's always been a problem and although I think
this will change, it's beyond what I want to worry about in terms of
syslog .. :)

Gareth.

----- Original Message -----
From: "David Lang" <[email protected]>
To: "rsyslog-users" <[email protected]>
Sent: Wednesday, 10 September, 2014 7:39:54 PM
Subject: Re: [rsyslog] Logging to central server / data loss ....

On Wed, 10 Sep 2014, Gareth Bult wrote:

> Hi David,
>
>> Whatever you do with your logging system, at some point it is going to break
>
> An interesting premise, and looking at the options in front of me I would be
> inclined to agree (at least to some extent), but I don't think it "has" to
> be that way.

I can make a good argument that it WILL break at some point. All the redundancy 
and failover you put in place can recover from some failures, but those layers 
add additional failure modes themselves.

I've had (high-end, tier 1, $100k+) computers with redundant power supplies 
that died because of a bug in the module that combined the power from the 
systems.

I've run systems that had multiple motherboards locked in sync comparing 
results and had them fail.

There isn't a computer system ever created with 100% reliability in either 
hardware or software.

So the probability of failure may be low, but you do still need to account for 
it.

>> As a result, the application really needs to double log.
>
> Absolutely, at the moment I'm logging to local storage and to a remote central
> server. I don't like the idea of local logging, there are many issues with
> data storage and persistence with regards to automatic (re-)provisioning, and
> I would far rather double log to two network targets .. however .. (!)

Log with failover rather than to two systems.

"a person with one watch knows what time it is, a person with two watches is 
never sure"
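
The difference between failover and double logging can be sketched roughly as
follows. This is a minimal Python illustration, not how rsyslog implements it:
`send_with_failover` and the target callables are hypothetical names. Each
message goes to exactly one target (the first one that accepts it), so there
is only ever one copy to treat as authoritative.

```python
from typing import Callable, Sequence


def send_with_failover(message: str,
                       targets: Sequence[Callable[[str], None]]) -> int:
    """Deliver message to the first target that accepts it.

    Each target is a hypothetical callable that raises OSError (for example
    ConnectionError) when it cannot take the message. Returns the index of
    the target that accepted it. Raises RuntimeError if every target fails,
    so the caller can decide whether to buffer locally or refuse the action.
    """
    errors = []
    for index, send in enumerate(targets):
        try:
            send(message)
            return index  # delivered once; do not also send to later targets
        except OSError as exc:
            errors.append(exc)
    raise RuntimeError(f"all {len(targets)} log targets failed: {errors}")
```

The later targets are only tried when an earlier one fails, which is what
avoids the "two watches" problem of two copies that may disagree.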

> First issue with this is signing. If you don't have a local (signed) copy of
> your logs but instead have two (remote) copies, the problem is proving that
> logging information that's passed over a network connection hasn't been
> tampered with in transit. (not necessarily a technical issue, more of a
> 'convince a court' type issue)
>
> Second issue, if you end up with two remote logs that differ, how do you prove
> which is authoritative and moreover, given they differ, how do you prove the
> system itself is trustable, which leads on to triple logging etc etc.

This is actually far less of an issue in practice (at my last job I was in 
security and dealing with this sort of thing on a regular basis)

If you can show what the normal operation of the system is, even having the 
logging system modify the logs as they flow through the system is acceptable 
(as long as you can show the changes are done consistently).

As I note above, things happen with logs. If you make the claim that your logs 
are 100% perfect, when you go to court, anything less than that is a problem. 
However, if you document the things that can happen and can explain the 
differences, you seldom have much trouble.

The key thing that you want to be able to do in court is to prove that nobody 
tampered with your logs AFTER the point where the issue that caused the case 
came to light. You can do this just fine on a central server.

In a large company dealing with banking logs, it was acceptable to periodically 
gather the logs, compress them, and sign them. If you then send a hash of the 
file offsite to someplace that logs access so you can show from their 
independent logs that it wasn't changed, even better.
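
A rough sketch of that periodic "gather, compress, sign, send the hash
offsite" step, in Python. The function names are hypothetical, and the HMAC
here is only a stand-in for a real signature: a production setup would more
likely use an asymmetric signing key, and how the digest gets offsite is left
out entirely.

```python
import gzip
import hashlib
import hmac


def seal_log_batch(log_lines, key):
    """Compress a batch of log lines and compute its integrity values.

    Returns (compressed_bytes, sha256_hex, hmac_hex). The SHA-256 digest is
    the value you would send offsite; the HMAC stands in for a signature.
    """
    raw = "\n".join(log_lines).encode("utf-8")
    compressed = gzip.compress(raw)
    digest = hashlib.sha256(compressed).hexdigest()
    tag = hmac.new(key, compressed, hashlib.sha256).hexdigest()
    return compressed, digest, tag


def verify_log_batch(compressed, key, expected_digest, expected_tag):
    """Return True only if the archive matches both recorded values."""
    digest = hashlib.sha256(compressed).hexdigest()
    tag = hmac.new(key, compressed, hashlib.sha256).hexdigest()
    return (hmac.compare_digest(digest, expected_digest)
            and hmac.compare_digest(tag, expected_tag))
```

Because the digest is held by an independent party, a later change to the
archive shows up as a mismatch even if the local key has been compromised.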

As a practical matter, if someone with root access to your systems really wants 
to change logs that are being generated _now_, they will be able to. So you 
can't prove that they haven't been tampered with since the app generated them 
without having the app sign them (and then you have to prove that someone 
didn't get hold of a copy of the key to sign replacement logs.......)

> [insert more issues here ...]
>
> I guess 'my' ideal solution would be (ok, so this is an off-the-cuff design);
>
> o Sign logs as they happen [stream/sliding window etc]
> o Hold local copies for buffering / backlogging only
> o Double log the data to two remotes
> o Independently double log the signing data to two (other) remotes
> o Transit should ensure no data loss
> o Storage should ensure retries / retransmits in the event of any sort of
>   failure
> o A comprehensive tool for 'proving' the integrity of a specific message
> o System should gracefully shut-down [with no data loss] @ 99% disk usage
>
> Maybe a 'pie in the sky' solution, and I may have missed out a lot, but if
> something appeared with this spec I'd be more than a little inclined to try
> it ...:-)
>
> As far as disk flushing/speed is concerned, I'm happy to call that a hardware
> issue. You can have real speed and data integrity, but it needs a device with
> a battery backed RAM cache [things are heading in that direction, RAPID mode
> on Samsung SSDs for example] for writes, which is a platform issue and not my
> problem .. ;-)

It's not just the media speed; there is a huge amount of overhead in waiting 
for the write to happen _now_, and in all the syscalls needed to make small 
writes. I was using a very high end device in my testing and it was still slow 
with all the queueing and caching disabled. Filesystems are also not designed 
for lots of tiny writes; they are designed to consolidate small writes into 
larger ones. This isn't just a matter of fewer transactions to disk, it's also 
manipulating the metadata for the file (recording its size, changing the 
allocation of disk, etc). The newer filesystems are actually significantly 
worse in this sort of environment than ext2.
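
The per-write overhead can be observed with a small experiment along these
lines. It is a Python sketch with hypothetical function names, not the
benchmark described above, and the numbers depend heavily on the device and
filesystem, so none are quoted here.

```python
import os
import tempfile
import time


def write_synced(path, lines):
    """Write every line separately and force it to disk before the next."""
    with open(path, "wb") as f:
        for line in lines:
            f.write(line)
            f.flush()             # push Python's buffer to the OS
            os.fsync(f.fileno())  # wait until the OS reports it is on disk


def write_batched(path, lines):
    """Write all lines in one go and sync once at the end."""
    with open(path, "wb") as f:
        f.write(b"".join(lines))
        f.flush()
        os.fsync(f.fileno())


def compare(count=200):
    """Time both strategies on the same data; returns seconds per strategy."""
    lines = [b"test message %d\n" % i for i in range(count)]
    timings = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name, writer in (("synced", write_synced),
                             ("batched", write_batched)):
            path = os.path.join(tmp, name + ".log")
            start = time.perf_counter()
            writer(path, lines)
            timings[name] = time.perf_counter() - start
    return timings
```

Both functions leave identical file contents; the only difference is how many
times the process has to stop and wait for the storage stack.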

Signing is a very expensive thing to do, so signing each message individually 
is not practical (there is actually a syslog standard for signed messages, but 
in practice it's not used by anyone).

The key thing here is "perfect is the enemy of good enough". Work to make the 
logs more reliable, but keep in mind that cost vs reliability is a curve: the 
closer you try to get to perfect reliability, the more expensive it gets to 
improve the reliability by the same amount.

Remember that if you are going for perfect reliability, you also have to defend 
against a meteor hitting your building (or an earthquake, or a terrorist 
bomb...), so you need to get the logs to multiple buildings. That means you are 
now depending on wires between the buildings that can be taken out by a 
backhoe, so you need to make sure that each building is powered by two 
different power grids, approaching the building from different directions (so 
that someone running into power lines won't take it out). Your networks also 
need to be redundant, with multiple paths to multiple ISPs approaching your 
buildings from different directions, and you need to use different AS numbers 
so that BGP routing problems won't take you out.

If you can't tell, I've done a little bit of work in this direction :-)

Rather than setting the goal of perfection, start with a fairly simple scenario 
and document the failure modes and look at which ones you can afford to deal 
with.

For just about everywhere, redundant servers with failover, periodic signing of 
logs (and then shipping the signed, encrypted logs offsite to external storage) 
is well beyond merely 'good enough', even while still allowing the logs to be 
cached in memory for some amount of time and therefore a chance for them to be 
lost if the machine catches fire (something I've also had happen).

Remember that unless the application does intent logging, the existence or 
absence of a log doesn't guarantee that the event did or did not happen, even 
with perfect logging.
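
Intent logging, as described earlier in the thread (log "I intend to take
action X", act, then log the outcome), might look roughly like this in Python.
The record format and the names `run_logged` and `find_unresolved_intents`
are hypothetical, and `log` is any callable that stores a record.

```python
def run_logged(action_id, action, log):
    """Log the intent, run the action, then log how it turned out."""
    log(("intent", action_id))
    try:
        action()
    except Exception:
        log(("failed", action_id))
        raise
    log(("succeeded", action_id))


def find_unresolved_intents(records):
    """Return ids that have an intent record but no outcome record.

    These are the cases to investigate by hand: the process may have died
    before the action, during it, or after it but before the second log.
    """
    pending = []
    for kind, action_id in records:
        if kind == "intent":
            pending.append(action_id)
        elif kind in ("succeeded", "failed") and action_id in pending:
            pending.remove(action_id)
    return pending
```

An intent without an outcome does not say whether the action happened, only
that someone has to go and find out.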

If you are going to go this route with rsyslog, then to start with you must use 
disk-only queues for everything, set your checkpoint interval to 1, and sync 
the queue files. Make sure that you don't use disk hardware that lies about 
when the write is safe on disk.

The disk queueing code should also be audited to make sure that there are no 
codepaths or failure modes that can lose something, even with these settings.

David Lang

> Regards,
> Gareth.
>
> ----- Original Message -----
> From: "David Lang" <[email protected]>
> To: "rsyslog-users" <[email protected]>
> Sent: Tuesday, 9 September, 2014 8:16:58 PM
> Subject: Re: [rsyslog] Logging to central server / data loss ....
>
> On Tue, 9 Sep 2014, Gareth Bult wrote:
>
>> Hi Rainer,
>>
>> Many thanks for looking, I appreciate you're busy.
>>
>> If it looked trivial I might've tried to patch it, but it "looks" like
>> it's pulling from the queue and then running the send plugins, so my initial
>> impression is that various bits of code need reordering - which is too much
>> for me. I would guess it needs to be peeking the queue and only de-queueing
>> once all the output modules have been satisfied ..
>>
>> It's interesting how things develop, back in "the good-old-days" central
>> logging was useful for spotting problems without sshing to lots of boxes, and
>> some data loss / the use of udp was quite acceptable. Today however, people
>> seem to be using it for collecting 'important' information where 100%
>> accuracy and log signing are critical .. a paradigm shift in "use-case"
>> really ...
>
> been there, done that, and found that people didn't really want what they
> claimed they wanted :-)
>
> Whatever you do with your logging system, at some point it is going to break
> (disk fills up, fails, etc)
>
> A question that you have to ask your users/management is "what do you want to
> happen when a log cannot be written?" If the answer is that they would rather
> have the application fail and present the user with an error than to take an
> action that's not logged, then they are potentially a candidate for what I
> call "Audit grade logging". Keep in mind that the application includes login
> and ssh if you do this to all logs.
>
> When you shift to using Audit Grade logging, things slow down a LOT, something
> on the order of 1000x. I was doing benchmarking of this a few years ago, and
> with a high end PCI SSD drive, I was able to get between 2K and 8K logs/sec
> (depending on filesystem, ext3 being 2K) compared to 400K logs/sec on the same
> system with a simple SATA drive for normal logging.
>
> Also think about failure modes of the application. If it logs before it takes
> the action, then something may happen before the action is taken and the log
> is telling you that something happened that didn't.
>
> If the application takes an action and then logs it, it may take the action
> and then die before sending out the log.
>
> As a result, the application really needs to double log.
>
> First, log "I intend to take action X", then take the action and log "I
> succeeded/failed to take action X". You then need to watch for the first
> message without the second and investigate if the action did or did not take
> place in those cases.
>
> If you are still wanting to pursue this, we can talk more and get into more
> details about what this requires.
>
> David Lang
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com/professional-services/
> What's up with rsyslog? Follow https://twitter.com/rgerhards
> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
> sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T 
> LIKE THAT.
>
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.