On Wed, 14 Dec 2016, David Lang wrote:

no such thing as a temporary queue

> as we are looking for 0% loss...

under all conditions? including a fire in the server room or a meteor hitting the building?

0% loss is a very strong statement, and really needs to be clarified. If you really do mean 0% loss under all conditions (including the building being destroyed), then rsyslog may not actually be the right tool for you.

Due to the performance impact of trying to comply with such a requirement, the cost of implementing such a system would also be extremely high.

But almost nobody really means 0% loss under all conditions, so the first thing you need to do is think really hard about what the real requirements are. Are there conditions where you would be willing to have data lost, such as the building being destroyed, or a server losing power?

Adding to this, if you have your systems sending logs to two different datacenters around the world, you create an additional window in which a message can be lost.

If the sending server 'goes away' before the systems several thousand miles away have received the logs, the logs will be lost.

<soapbox>
databases have ACID guarantees [1], which mean that a database transaction (in this case a write) is:

Atomic: the entire write succeeds or none of it does

Consistent: there is never a window where the database is not in a valid state

Isolated: no interactions between different transactions happening at the same time are allowed (they all must only depend on transactions that were complete before they started)

Durable: once a transaction is complete, a system crash will not erase the transaction
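The atomicity guarantee is easy to see in practice. Here is a minimal sketch using Python's stdlib sqlite3 module: a two-row "transfer" is interrupted mid-transaction, and the database rolls back to its previous valid state rather than keeping the half-finished write. (The table and the simulated crash are purely illustrative.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
        # simulate a failure between the debit and the commit
        raise RuntimeError("crash before commit")
except RuntimeError:
    pass

# The partial debit was rolled back: the transfer happened atomically or not at all.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the simulated crash, `balances` still shows alice at 100 and bob at 0: none of the interrupted transaction survived.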

But it's also important to note that this does not protect against all loss of data. Disk failure, OS corruption, fire, etc are not part of the threat model that databases are protecting themselves against.

To protect against these sorts of problems, databases use backups and replication, which always leave a window of vulnerability where the most recent data can be lost (absent all other issues, the speed of light prevents truly simultaneous replication).

One way to deal with this is two-phase commit [2], which still has failure modes that require the admin to go in and figure out what should have happened. It just makes it so that the admin has the ability to use their judgement.
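The idea behind two-phase commit can be sketched in a few lines (this is an illustration of the protocol shape, not a real implementation; the class and method names are invented): a coordinator first asks every participant to prepare, and only if all of them vote yes does it tell them to commit, otherwise it tells everyone to abort.

```python
class Participant:
    """A toy transaction participant for illustrating 2PC."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: durably record the pending change, then vote yes/no.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"  # phase 2a

    def abort(self):
        self.state = "aborted"    # phase 2b


def two_phase_commit(participants):
    # If every participant votes yes in phase 1, commit everywhere;
    # otherwise abort everywhere.
    if all(p.prepare() for p in participants):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

The failure mode the paragraph above alludes to is visible in the gap between the phases: if the coordinator dies after some participants have voted yes but before the commit/abort decision arrives, those participants are stuck in the "prepared" state until someone (often an admin) resolves the outcome.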



It's never possible to guarantee a 0% chance of loss. Even the various space programs, with virtually unlimited budgets, are unable to do that. Just look at the various high-profile Mars lander failures and you will see that bugs are going to happen, however much you try to avoid them.


Once you accept the fact that you really don't have a hard requirement for 0% loss, you then need to start talking about what your real requirement is.

If you are a university and someone blows up your datacenter, is it really a requirement that the log messages generated between the moment the bomb went off and the blast destroying the server be safe somewhere? Or would people understand that logs generated less than a minute before the bomb went off are not going to be recoverable?

0% loss while everything is operating normally is a very reasonable requirement to talk about.

0% loss in the face of network outages is a reasonable requirement to talk about (how long an outage are you required to survive? a few seconds, a day, a week, a year? the answer will change what you implement)

0% loss in the face of server shutdowns is a reasonable requirement to talk about (even many power failures can be handled)

but there are categories of failures that you cannot guarantee that you will survive. These include:

  software bugs

  system failures (crashes, some power failures)

  failures of multiple components of the systems

There are things that you can do to reduce the probability of such failures causing you to lose logs, but these all involve some form of redundancy and probability calculations along the lines of "one server has a 90% chance of working, so two servers have a 99% chance of working, three servers have a 99.9% chance of working..."
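That back-of-the-envelope arithmetic is just: if each server is up with probability p, at least one of n servers is up with probability 1 - (1 - p)^n. A tiny sketch (assuming, unrealistically, that failures are independent; correlated failures are exactly what the next paragraph warns about):

```python
def availability(p: float, n: int) -> float:
    """Probability that at least one of n independent servers is working,
    given each works with probability p."""
    return 1 - (1 - p) ** n

for n in range(1, 4):
    print(n, availability(0.9, n))
# one server: 0.9, two servers: 0.99, three servers: 0.999
```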



But even with all of this, you need to remember that more complex systems are more likely to fail. The more redundancy you add to a system, the more parts there are that can fail, and the redundancy mechanism itself is one of the things that can fail.

I've had very expensive IBM servers running mission-critical applications fail BECAUSE of the redundancy built in to them. They had multiple power supplies in the system (to protect against power failures), but the board that coordinated the power and determined which power supplies were working had a problem and declared that they had all failed when in fact they were all working properly. This caused multiple outages over months before it was tracked down.


You protect against failures by adding complexity.
The more complex something is, the more likely it is to fail.

You really need to do a cost-benefit analysis and understand what the price of failure is before you declare that you absolutely must protect against a failure at all costs.

</soapbox>

David Lang


[1] https://en.wikipedia.org/wiki/ACID

[2] https://en.wikipedia.org/wiki/Two-phase_commit_protocol
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/