On Sat, 24 Jan 2015, singh.janmejay wrote:
Which is why I treat K-safety as a basic design parameter. If K nodes
disappear, data will be lost.
With this kind of coarse-grained failure-mode, messages can easily be kept
in memory.
whatever machine sends the ack saying that the message was delivered will
have to do disk queues or other equivalent, otherwise the message can be
lost after the ack is sent.
In the cluster, ACK for 'replicated till seq_no' is done only by the queue
in the last tier (Tier-3, in the example).
The last tier node is not special (or different) from any other node in the
cluster. It still queues things up in memory, and sends an ack saying it has
received messages till a certain seq_no. Because it's the last tier (i.e. the
k-th tier), this ack also means the message has been k-way replicated. Let us
say the process / machine crashes at this point. The tier-2 machine will
correct this by re-transmitting all messages from what it thinks is the
current 'delivered till seq_no'.
if the ack has been sent, why would anything now think that the message hasn't
been delivered?
Anything that requires replicating state to every machine in the cluster is
going to be a major performance and reliability problem.
How would you have all the different machines write to the same file? or insert
into the same database, or send to the same remote machine?
This is different from 'replicated till
seq_no' (and is always <= replicated till). This is the value that tells
us that the downstream cluster has replicated messages till a certain
point. That is, replicated_till_seq_no(system N+1) ==
delivered_till_seq_no(system N), and similarly replicated_till_seq_no(system
N) == delivered_till_seq_no(system N-1), etc.
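The relationship between the two counters can be sketched as a small consistency check (a toy model; the system list and the numbers in it are made up purely for illustration):

```python
# Toy model of the two counters described above (illustrative values only).
# replicated_till: highest seq_no this system has replicated internally.
# delivered_till:  highest seq_no the next (downstream) system has replicated.
systems = [  # ordered upstream -> downstream
    {"name": "system N-1", "replicated_till": 120, "delivered_till": 100},
    {"name": "system N",   "replicated_till": 100, "delivered_till": 90},
    {"name": "system N+1", "replicated_till": 90,  "delivered_till": 80},
]

for s in systems:
    # within one system, delivered_till is always <= replicated_till
    assert s["delivered_till"] <= s["replicated_till"]

for upstream, downstream in zip(systems, systems[1:]):
    # the downstream system's replicated_till equals the upstream's delivered_till
    assert downstream["replicated_till"] == upstream["delivered_till"]
```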
Ok, I'll admit here that I didn't read through all the details that you posted.
Before we get into all those details, let's talk about the big picture.
you have one machine generating the message.
it goes through some nebulous cloud of redundant machines.
then it needs to be in one place at the end (where you will then make use of the
log).
You could have the log end up in multiple places, but then you have to reconcile
these places against each other, and you again get down to one place that you
look to read the log.
can we agree on this much? If not, please explain how you use the logs that are
distributed across the multiple machines.
delivering to one of N machines is trivial to do and requires no software
changes. CLUSTERIP managed by corosync will do this using standard syslog
software today. Specifically, this allows you to have X machines in a
cluster and a system sending a log to the cluster doesn't care which
machine it's delivered to. It just connects to the cluster IP and sends a
message. If the machine it's hitting fails or is dead, it just tries
connecting again and will almost always hit a different box immediately
(and if it's unlucky enough, just try again). Corosync will remove the
machine from the cluster (and add it back as appropriate) without any need
for the clients to know about it.
Thanks for pointing this out. I was thinking of using something like
zookeeper for discovery (ephemeral z-nodes that disappear when syslog nodes
die, etc.), but this may be a simpler way of doing it.
Have never used it before, so need to learn the details, but from how you
describe it, seems like it does exactly what I want.
The only requirement is that all the receiving machines must be on the same
broadcast domain. The arriving packet is delivered to every machine by the
switch (a multicast bit is set in the MAC address, but this is NOT IP multicast,
that takes special software to use, this will work with ssh). Each machine then
hashes one or more of source IP, destination IP, source port, destination port
(depending on your config, if the destination IP and port are always going to be
the same, don't bother hashing on them), and the packets are assigned to one of
N buckets (number of machines), and the config on each machine tells it that it
is machine X of N. If the hash matches that machine, the packet is delivered to
the application, otherwise it's dropped and the software running on the machine
never sees it.
since each new connection will use a different source port, each connection
attempt is likely to hash to a different machine.
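The bucket selection works something like the following toy model (the real hashing is done by the ipt_CLUSTERIP kernel module with its own hash function, so the CRC and the function names here are purely illustrative):

```python
import zlib

NUM_NODES = 3  # N machines behind the shared cluster IP

def bucket_for(src_ip: str, src_port: int) -> int:
    """Hash the connection tuple into one of N buckets (0..N-1)."""
    return zlib.crc32(f"{src_ip}:{src_port}".encode()) % NUM_NODES

def accepts(node_index: int, src_ip: str, src_port: int) -> bool:
    """Every node sees every packet; only the node whose configured index
    matches the hash delivers it to the application, the rest drop it."""
    return bucket_for(src_ip, src_port) == node_index

# Exactly one node accepts each connection; different source ports
# will generally land on different nodes.
for port in (40000, 40001, 40002):
    owners = [n for n in range(NUM_NODES) if accepts(n, "10.0.0.7", port)]
    assert len(owners) == 1
```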
I give details and examples in the paper I presented at LISA in 2012
https://www.usenix.org/conference/lisa12/technical-sessions/presentation/lang_david
play around with it on a couple of machines, set up the CLUSTERIP with the
iptables rules and then ssh to the VIP, see what machine you are on, disconnect
and repeat.
Corosync keeps track of the health of all the boxes in the cluster, and will do
the iptables config for you to make sure that there is always (allowing for
detection time) a machine responding to every value of X for your "I'm box X of
N" VIP.
the problem is at the destination. If the ruleset on tier 3 says that the
message should be delivered to multiple destinations (say Elasticsearch, a
couple of local files, and another remote machine), does the tier 3 machine
send the ack? or does it need to get acks from all the destinations? what
if one local file can be written and the other can't? What if the ruleset
says to discard the message? what if the ruleset says to discard the
message based on the current value of a global variable?
This is what I addressed in the 'user configuration level implementation'
where I said the user needs to choose one destination (one action) that is
used as the reliable destination (once written, data wouldn't be lost).
However, we can implement composite ack-aggregators that ack when each
destination is written to, and pass the ack up to the queue only when all
desired destinations have ack'ed.
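A composite ack-aggregator of the kind described could look roughly like this (a hypothetical sketch; rsyslog has no such object today, and the class and destination names are made up):

```python
class CompositeAckAggregator:
    """Ack upstream only once every configured destination has confirmed
    the message (illustrative sketch, not actual rsyslog code)."""

    def __init__(self, destinations):
        self.destinations = set(destinations)
        self.pending = {}  # seq_no -> set of destinations still unacked

    def track(self, seq_no):
        """Start waiting for acks for this seq_no from every destination."""
        self.pending[seq_no] = set(self.destinations)

    def ack(self, seq_no, destination):
        """Record one destination's ack; return True only when ALL
        destinations have acked, i.e. it is safe to ack upstream."""
        waiting = self.pending.get(seq_no)
        if waiting is None:
            return False
        waiting.discard(destination)
        if not waiting:
            del self.pending[seq_no]
            return True
        return False

agg = CompositeAckAggregator(["elasticsearch", "local_file", "remote"])
agg.track(42)
assert agg.ack(42, "elasticsearch") is False  # still waiting on two
assert agg.ack(42, "local_file") is False     # still waiting on one
assert agg.ack(42, "remote") is True          # every destination confirmed
```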
When does an ack get sent?
What happens if a box in the middle is supposed to modify the message?
This is not a problem. Any of the boxes may modify messages in any way.
This is fine because the downstream tier considers the modified message the
source of truth, so as far as a tier is concerned, it got the message, and
it needs to deliver it downstream. If it fails, the modified message will
be re-transmitted to another node picked from the same tier (by hashing, or
using Corosync, or some other way).
but if for example, the middle box is adding info about the path the message is
taking (the fact that it was relayed through box 3 vs 10 for example), then the
messages that arrive with the same ID are now different.
One more thing to reconcile
These problems make it so that I don't see how you would reasonably manage
this sort of environment.
I would suggest that you think hard about what your requirements really are.
It may be that you are only sending to one place, in which case, you
really want to just be inserting your messages into an ACID compliant
database.
It may be that your requirements for absolute reliability are not quite as
severe as you are initially thinking that they are, and that you can then
use the existing hop-by-hop reliability. Or they are even less severe and
you can accept some amount of memory buffering to get a few orders of
magnitude better performance from your logging. Remember that we are
talking about performance differences of 10,000x on normal hardware. A bit
less, but still 100x or so on esoteric, high-end hardware.
Yep, I completely agree. In most cases extreme reliability such as this is
not required, and is best avoided for cost reasons.
But for select applications it is a lifesaver.
true, but at that point, is rsyslog the right software to be used?
I think syslog is the right tool for log-movement. The biggest reason is that
the same tooling applies to both un-acked and fully-reliable pipelines.
So any user configuration can be used interchangeably (be it message
modification, log-rotation mechanism, queues, actions, rulesets; it could
be anything at all).
The problem is that to implement this in rsyslog, you would have to have the
ability to go backwards through the queues in rsyslog, and the changes to
support that would be massive. I'm not yet convinced that this gives us much in
the way of capability that we can't already get.
I will also say that there are messaging systems that claim to have the
properties that you are looking for (Flume for example), but almost nobody
operates them in their full reliability mode because of the performance
issues. And they do not have the filtering and multiple destination
capabilities that *syslog provides.
Yes, Flume is one of the best options. But it comes with some unique
problems too (it's not lightweight enough to run on the producer side, and
managed-environment overhead (GC etc.) causes its own set of problems).
There is also value in offering the same interface to producers for
ingestion into un-acked and reliable pipelines (because a lot of other
things, like integration with other systems, can be reused). It also keeps
things simple because producers do all operations in one way, with one
tool, regardless of the ingestion mechanism being acked/replicated etc.
Reliability in this case is built end-to-end, so building stronger
guarantees over-the-wire parts of the pipeline doesn't seem very valuable
to me. Why do you feel RELP will be necessary?
what you are doing is basically just extending RELP from being per-hop to
being from the original source to the final destination. This requires that
the original source hold on to the message the entire time rather than
being able to delegate it to whatever box is just before the breakage. As I
say above, it also requires that the source know about all of the final
destinations.
Not really. Take a look at the following diagram:
ingestion_relay_6_purge_shown.png
<https://docs.google.com/file/d/0B_XhUZLNFT4dSk9obFRLYXdOY0k/edit?usp=drive_web>
I have added what the sliding windows in nodes from different tiers and
upstream look like for the first diagram.
I haven't shown the sliding-window details for all message exchanges to
avoid unnecessary verbosity in the diagram, but it conveys the idea.
Note that the last tier discards messages when the downstream system has
replicated them.
Tiers in the same cluster discard messages in response to the last tier
delivering them to the downstream system.
So the first tier (or upstream system) doesn't need to hold messages any
longer than it takes for the next system to replicate them. This allows for
bursts as large as the queue size that the next system can provide. It's a
typical store-and-forward design.
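The purge rule can be sketched like this (an illustrative toy model of the sliding window, not actual rsyslog code; all names are made up):

```python
from collections import OrderedDict

class TierQueue:
    """Toy model of one tier's buffer: messages are held until the
    downstream system reports them as replicated, then purged."""

    def __init__(self):
        self.buffer = OrderedDict()  # seq_no -> message, in arrival order
        self.delivered_till = 0      # highest seq_no downstream has replicated

    def enqueue(self, seq_no, message):
        self.buffer[seq_no] = message

    def on_downstream_replicated(self, seq_no):
        """Downstream acked replication till seq_no: slide the window."""
        self.delivered_till = max(self.delivered_till, seq_no)
        for s in [s for s in self.buffer if s <= self.delivered_till]:
            del self.buffer[s]

    def retransmit_from(self):
        """On a downstream failure, resend everything after delivered_till."""
        return [m for s, m in self.buffer.items() if s > self.delivered_till]

q = TierQueue()
for i in range(1, 6):
    q.enqueue(i, f"msg-{i}")
q.on_downstream_replicated(3)
assert list(q.buffer) == [4, 5]                  # 1..3 purged
assert q.retransmit_from() == ["msg-4", "msg-5"]  # only the unacked tail
```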
At that point you no longer have the syslog async multiple relay
architecture.
What advantage do you get by having multiple special machines in the middle
as opposed to just having the source machine talk directly to the
destination machine? You are not allowing the machines in the middle to
store the message, and you are waiting until the final machine responds
before you can release the message.
The advantage is burstability. If downstream is capable of keeping up with
very high bursts of messages, the intermediate cluster may not be required.
However, in most cases that kind of over-provisioning for a database is
hard. The intermediate cluster allows smoothing out bursts for the final
tier.
well, your database isn't going to be part of your third tier machines in any
case. If you think provisioning a database for performance is hard, try
provisioning it for active-active (where you can insert to any node at any time
and have all reads see all data), that's where things get _really_ nasty.
So you are back to the fact that at some point in your architecture, you are
going to have to reconcile all your copies down to one and do the database
insert.
How large a burst you can handle on your system will depend on how quickly you
can manage to replicate your state. shared state is the bugaboo of clustering,
the most successful clusters are the ones that require the least shared state,
and the most common examples (webservers) typically share no state between the
different machines in the cluster.
With rsyslog today you could have a source box do RELP through multiple
tiers of redundant routers or dumb TCP proxies to get to a rsyslog box
receiving the message and delivering it (with the need for disk queues,
fsync on every output message, etc. that cripple speed)
no need for any sequence-id, message-id, or any message changes at all.
Yep, but this imposes 2 important restrictions. One is fsync on each
message, which restricts the throughput of the relay.
I think that you would find that an fsync is as fast as replicating the message
to every machine in the cluster.
The second is that it has no
protection from disk failures, which are going to be common in a very
large cluster.
not true; if the node has a RAID (simple mirroring, for example), it will
survive any single drive failing. Linux md RAID 1 arrays can have N devices in
them, not just two, so you can have up to N-1 devices fail without an issue.
The throughput limitation makes scaling this cluster very expensive. High
throughput is necessary to keep it cheap.
replicating state over the network and keeping it synchronized across machines
is extremely expensive.
In fact, you can have two different copies of rsyslog running on the
machines, one operating normally, one in audit-grade reliability mode and
the client just decides which copy to send a particular log message to
(/dev/log can only be one, but the client can connect to localhost on
multiple ports)
I wasn't thinking of 2 different instances. It was actually 2 different
pipelines, that is, different inputs bound to different rulesets and
actions (one that is acked, the other unacked).
I think you could do my version that way with current rsyslog as well, given the
new syntax and the definitions of rulesets and the associated queues. It's just
easier to audit and trace if you keep them separate (the user wouldn't know;
they would connect to the two inputs the same way whether they are running on
one instance or two).
David Lang
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE
THAT.