On Wed, 14 Dec 2016, David Lang wrote:

no such thing as a temporary queue

> as we are looking for 0% loss...

under all conditions? including a fire in the server room or a meteor hitting the building?

0% loss is a very strong statement, and really needs to be clarified. If you really do mean 0% loss under all conditions (including the building being destroyed), then rsyslog may not actually be the right tool for you.

Due to the performance impact of trying to comply with such a requirement, the cost of implementing such a system would also be extremely high.

But almost nobody really means 0% loss under all conditions, so the first thing you need to do is think really hard about what the real requirements are. Are there conditions where you would be willing to have data lost, such as the building being destroyed, or a server losing power?

Adding to this, if you have your systems sending logs to two different datacenters around the world, you create an additional window in which a message can be lost.

If the sending server 'goes away' before the systems several thousand miles away have received the logs, the logs will be lost.

<soapbox>
databases have ACID guarantees [1], which mean that a database transaction (in this case a write) is:

Atomic: the entire write succeeds or none of it does

Consistent: there is never a window where the database is not in a valid state

Isolated: no interactions between different transactions happening at the same time are allowed (they all must only depend on transactions that were complete before they started)

Durable: once a transaction is complete, a system crash will not erase the transaction
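The atomicity guarantee is easy to see in practice. Here is a minimal sketch using Python's stdlib sqlite3 module: a two-row "transfer" is interrupted mid-transaction, and the database rolls back to its previous valid state rather than keeping the half-finished write. (The table and the simulated crash are purely illustrative.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
        # simulate a failure between the debit and the commit
        raise RuntimeError("crash before commit")
except RuntimeError:
    pass

# The partial debit was rolled back: the transfer happened atomically or not at all.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the simulated crash, `balances` still shows alice at 100 and bob at 0: none of the interrupted transaction survived.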

But it's also important to note that this does not protect against all loss of data. Disk failure, OS corruption, fire, etc are not part of the threat model that databases are protecting themselves against.

To protect against these sorts of problems, databases use backups and replication, which always leave a window of vulnerability where the most recent data can be lost (absent all other issues, the speed of light prevents truly simultaneous replication).

One way to deal with this is two-phase commit [2], which still has failure modes that require the admin to go in and figure out what should have happened. It just makes it so that the admin has the ability to use their judgement.
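The idea behind two-phase commit can be sketched in a few lines (this is an illustration of the protocol shape, not a real implementation; the class and method names are invented): a coordinator first asks every participant to prepare, and only if all of them vote yes does it tell them to commit, otherwise it tells everyone to abort.

```python
class Participant:
    """A toy transaction participant for illustrating 2PC."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: durably record the pending change, then vote yes/no.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"  # phase 2a

    def abort(self):
        self.state = "aborted"    # phase 2b


def two_phase_commit(participants):
    # If every participant votes yes in phase 1, commit everywhere;
    # otherwise abort everywhere.
    if all(p.prepare() for p in participants):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

The failure mode the paragraph above alludes to is visible in the gap between the phases: if the coordinator dies after some participants have voted yes but before the commit/abort decision arrives, those participants are stuck in the "prepared" state until someone (often an admin) resolves the outcome.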



It's never possible to guarantee a 0% chance of loss. Even the various space programs, with virtually unlimited budgets, are unable to do that. Just look at the various high-profile Mars lander failures and you will see that bugs are going to happen, however much you try to avoid them.


Once you accept the fact that you really don't have a hard requirement for 0% loss, you then need to start talking about what your real requirement is.

If you are a university and someone blows up your datacenter, is it really a requirement that the log messages generated between the moment the bomb went off and the blast destroying the server be safe somewhere? Or would people understand that logs generated less than a minute before the bomb went off are not going to be recoverable?

0% loss while everything is operating normally is a very reasonable requirement to talk about.

0% loss in the face of network outages is a reasonable requirement to talk about (how long an outage are you required to survive? a few seconds, a day, a week, a year? the answer will change what you implement)

0% loss in the face of server shutdowns is a reasonable requirement to talk about (even many power failures can be handled)

but there are categories of failures that you cannot guarantee that you will survive. These include:

  software bugs

  system failures (crashes, some power failures)

  failures of multiple components of the systems

There are things that you can do to reduce the probability of such failures causing you to lose logs, but these all involve some form of redundancy and probability calculations along the lines of "one server has a 90% chance of working, so two servers have a 99% chance of working, three servers have a 99.9% chance of working..."
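That back-of-the-envelope arithmetic is just: if each server is up with probability p, at least one of n servers is up with probability 1 - (1 - p)^n. A tiny sketch (assuming, unrealistically, that failures are independent; correlated failures are exactly what the next paragraph warns about):

```python
def availability(p: float, n: int) -> float:
    """Probability that at least one of n independent servers is working,
    given each works with probability p."""
    return 1 - (1 - p) ** n

for n in range(1, 4):
    print(n, availability(0.9, n))
# one server: 0.9, two servers: 0.99, three servers: 0.999
```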



But even with all of this, you need to remember that more complex systems are more likely to fail. The more redundancy you add to a system, the more parts there are that can fail, and the redundancy mechanism itself is one of the things that can fail.

I've had very expensive IBM servers running mission-critical applications fail BECAUSE of the redundancy built in to them. They had multiple power supplies in the system (to protect against power failures), but the board that coordinated the power and determined which power supplies were working had a problem and declared that they had all failed when in fact they were all working properly. This caused multiple outages over months before it was tracked down.


You protect against failures by adding complexity.
The more complex something is, the more likely it is to fail.

You really need to do a cost-benefit analysis and understand what the price of failure is before you declare that you absolutely must protect against a failure at all costs.

</soapbox>

David Lang


[1] https://en.wikipedia.org/wiki/ACID

[2] https://en.wikipedia.org/wiki/Two-phase_commit_protocol
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/