On Wed, 14 Dec 2016, David Lang wrote:
no such thing as a temporary queue
as we are looking for 0% loss...
Under all conditions? Including a fire in the server room or a meteor hitting
the building?
0% loss is a very strong statement, and really needs to be clarified. If you
really do mean 0% loss under all conditions (including the building being
destroyed), then rsyslog may not actually be the right tool for you.
Due to the performance impact of trying to comply with such a requirement,
the cost of implementing such a system would also be extremely high.
But almost nobody really means 0% loss under all conditions, so the first
thing you need to do is think hard about what the real requirements are.
Are there conditions where you would be willing to have data lost, such as
the building being destroyed or some server losing power?
Adding to this, if you have your systems sending logs to two different
datacenters around the world you are generating an additional window for the
loss of the message.
If the sending server 'goes away' before the systems several thousand miles
away have received the logs, the log will be lost.
<soapbox>
Databases have ACID guarantees [1], which mean that a database transaction (in
this case a write) is:
Atomic: The entire write succeeds or none of it does
Consistent: there is never a window where the database is not in a valid state
Isolation: no interactions between different transactions happening at the same
time are allowed (they all must only depend on transactions that were complete
before they started)
Durable: once a transaction is complete, a system crash will not erase the
transaction
But it's also important to note that this does not protect against all loss of
data. Disk failure, OS corruption, fire, etc. are not part of the threat model
that databases are protecting themselves against.
To protect against these sorts of problems, databases use backups and
replication, which always leave a window of vulnerability where the most recent
data can be lost (absent all other issues, the speed of light prevents truly
simultaneous replication).
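To put a rough number on the speed-of-light limit, here is a minimal Python
sketch. The distances are illustrative assumptions, and the figure is only a
physical lower bound: real networks (fiber, routers, protocol handshakes) are
noticeably slower.

```python
# Lower bound on replication delay imposed by the speed of light.
# The distances used below are illustrative assumptions, not measurements.

SPEED_OF_LIGHT_KM_S = 299_792.458  # in vacuum; signals in fiber are slower
KM_PER_MILE = 1.609344


def min_round_trip_ms(distance_miles: float) -> float:
    """Smallest possible time to send a record and get an acknowledgement back."""
    distance_km = distance_miles * KM_PER_MILE
    one_way_seconds = distance_km / SPEED_OF_LIGHT_KM_S
    return 2 * one_way_seconds * 1000


if __name__ == "__main__":
    for miles in (100, 1000, 5000):
        ms = min_round_trip_ms(miles)
        print(f"{miles:>5} miles: at least {ms:.2f} ms before an ack can arrive")
```

Anything written during that interval exists only on the sending side, which
is the window of vulnerability described above.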
One way to deal with this is two-phase commit [2], which still has failure
modes that require the admin to go in and figure out what should have
happened. It just gives the admin the ability to use their judgement.
It's never possible to guarantee 0% chance of loss. Even the various space
programs with virtually unlimited budgets are unable to do that. Just look at
the various high-profile Mars lander failures and you will see that bugs are
going to happen, however much you try to avoid them.
Once you accept the fact that you really don't have a hard requirement for 0%
loss, you then need to start talking about what your real requirement is.
If you are a university and someone blows up your datacenter, is it really a
requirement that the log messages generated between the moment the bomb went
off and the moment the blast destroyed the server be safe somewhere else? Or
would people understand that logs generated less than a minute before the bomb
went off are not going to be recoverable?
0% loss while everything is operating normally is a very reasonable requirement
to talk about.
0% loss in the face of network outages is a reasonable requirement to talk about
(how long an outage are you required to survive? a few seconds, a day, a week, a
year? the answer will change what you implement)
0% loss in the face of server shutdowns is a reasonable requirement to talk
about (even many power failures can be handled)
But there are categories of failures that you cannot guarantee that you will
survive. These include:
software bugs
system failures (crashes, some power failures)
failures of multiple components of the systems
There are things that you can do to reduce the probability of such failures
causing you to lose logs, but these all involve some form of redundancy and
probability calculations along the lines of "one server has a 90% chance of
working, so two servers have a 99% chance of working, three servers have a 99.9%
chance of working..."
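That kind of calculation can be sketched in a few lines of Python. It assumes
that servers fail independently of each other, which is an idealization:
shared power, shared software bugs, or a shared building all break that
assumption.

```python
# Chance that at least one of n redundant servers is working,
# ASSUMING the servers fail independently (rarely true in practice).

def chance_at_least_one_works(p_single: float, n_servers: int) -> float:
    """p_single is the chance that any one server is working."""
    chance_all_fail = (1 - p_single) ** n_servers
    return 1 - chance_all_fail


if __name__ == "__main__":
    for n in range(1, 5):
        print(f"{n} server(s): {chance_at_least_one_works(0.9, n):.4%}")
```

Each extra server removes nine tenths of the remaining risk, but the result
never reaches 100%.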
But even with all of this, you need to remember that more complex systems are
more likely to fail. So the more redundancy you add to a system, the more likely
that something is going to fail, and the redundancy mechanism is one of the
things that can fail.
I've had very expensive IBM servers running mission critical applications fail
BECAUSE of the redundancy built into them. They had multiple power supplies in
the system (to protect against power failures), but the board that coordinated
the power and determined which power supplies were working had a problem and
declared that they had all failed when in fact they were all working properly.
This caused multiple outages over months before it was tracked down.
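The power-supply story can be put into rough numbers with a small Python
sketch. All probabilities here are invented for illustration, and the model
(independent components, plus one coordinator that everything depends on) is a
deliberate simplification:

```python
# Redundant components behind a single coordinator that can itself fail.
# All probabilities are hypothetical; components are assumed independent.

def redundant_system(p_component, n_components, p_coordinator):
    """Chance the system works: coordinator works AND at least one component works."""
    at_least_one_component = 1 - (1 - p_component) ** n_components
    return p_coordinator * at_least_one_component


if __name__ == "__main__":
    single = 0.99  # one power supply, no coordinator needed
    print(f"single supply:                 {single:.4%}")
    for p_coord in (1.0, 0.999, 0.95):
        combined = redundant_system(0.99, 2, p_coord)
        print(f"two supplies, coordinator {p_coord}: {combined:.4%}")
```

With a perfect coordinator the pair beats the single supply, but once the
coordinator is less reliable than the part it protects, the "redundant" system
is worse than having no redundancy at all.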
You protect against failures by adding complexity.
The more complex something is, the more likely it is to fail.
You really need to do a cost-benefit analysis of things and understand what the
price of failure is before you declare that you absolutely must protect against
the failure at all costs.
</soapbox>
David Lang
[1] https://en.wikipedia.org/wiki/ACID
[2] https://en.wikipedia.org/wiki/Two-phase_commit_protocol
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog