On Sun, 25 Jan 2015, singh.janmejay wrote:

Ok, let me try to contrast properties of the originally proposed
architecture vs. the two RELP based alternate architectures.

I was trying to start at the problem, not the proposed solution.

- Throughput problems (with per-message fsync):
 * Per-message fsync means one fsync per 1k bytes
 * Each fsync does a round-trip to disk (or the RAID controller)
 * Read-around-write; the effect is more pronounced under memory pressure
 * For data-centers being built now, 10G network cards on servers are
the norm
 * A general-purpose commodity server has limited file-IO bandwidth:
130 MB/s per spindle (for a 10k SAS disk) * 6 spindles per box (1U
chassis) = 780 MB/s; with RAID-10 that becomes 3 effective spindles,
so 390 MB/s, which is still a good 800 MB/s shy of 10G
 * Note that 130 MB/s is 100% fsync-free throughput; with frequent
fsync this is going to drop to less than 20% of the original, which
would be < 100 MB/s on RAID-10

Spinning-rust disks can do at most one fsync per rotation, and using a disk queue and then writing the results requires multiple fsyncs per message. A 15K RPM disk can do ~160 fsyncs/sec.
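Both sets of numbers above can be sanity-checked with quick arithmetic. All inputs are the figures assumed in this thread; the ~160 fsyncs/sec follows if each fsync costs roughly a full rotation plus an average half rotation of latency, which is an assumption on my part:

```python
# Disk bandwidth figures from the bullet list above (assumed inputs).
mb_per_spindle = 130            # 10k SAS, sequential, fsync-free
raw = mb_per_spindle * 6        # 6 spindles in a 1U chassis
raid10 = raw // 2               # mirroring halves usable write bandwidth
with_fsync = raid10 * 0.20      # frequent fsync keeps < 20% of throughput
print(raw, raid10, with_fsync)  # 780 390 78.0 -- vs 1250 MB/s for 10GbE

# Max fsync rate for a 15K RPM disk: one full rotation plus an average
# half rotation of latency per fsync (assumed cost model).
rotations_per_sec = 15_000 / 60
fsyncs_per_sec = rotations_per_sec / 1.5
print(round(fsyncs_per_sec))    # ~167
```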

Now, SSDs are starting to hit commodity price levels, and they can handle higher rates, at a longevity cost.

If you want true guarantees, then you need to replicate your data to durable storage. If you are willing to settle for statistically reliable, then replicating to RAM on multiple machines may be 'good enough', but it's not the same.

The problem with replicating to the cluster isn't simple bandwidth; it's latency, and the problem of figuring out how long you should wait for the other nodes (which may have failed) to respond.

- Serviceability concerns:
 * One of the constraints large scale imposes is that you can't handle
problems manually anymore. One of the economical things in a cloud,
for instance, is to terminate and start a new instance as the first
response to any trouble, using online diagnostics to find machines
with problems and fix them out of band.

This causes data loss if there is any data unique to that machine, be it in RAM or on disk.

- Impact of message modification
 * The message routing and ingestion into the datastore is transparent
to the user. So by definition, no elements of the routing path must
leak into the functional representation of the message. So the user
will not know, or care, about which nodes the messages traversed.

As an admin of the system, I may want to know which nodes it traversed, even if the person generating the message doesn't. If some messages are being delayed significantly or corrupted, it's really helpful to have the messages tagged with the nodes they traverse so you can see which node is likely to be having problems.

- Kind of complexity that exists in writing/reading from the storage tier
 * With a data volume of a few million messages per second, a
distributed datastore is necessary anyway (no single-instance
datastore built for commodity hardware can ingest data at that rate).

So now you are talking about a unique data store that rsyslog doesn't currently support.

Development of that should be its own project.

 * We provide an at-least-once delivery guarantee, which means the
datastore should have a way to identify duplicates either during
ingestion or when serving reads. While multiple solutions are
possible here, a rather simple scheme comes to mind: hash the message
by some key (say session_id) and send it to the same machine in the
datastore cluster, and maintain a session-level idempotent-ingestion
mechanism (with a fixed sliding window of just sequence-number
ranges, using a range representation to make it efficient) on that
node.
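The sliding-window scheme just described could be sketched like this (illustrative Python; the class name is made up, and a real store node would bound and persist the window):

```python
# Per-session idempotent ingestion: remember accepted sequence numbers
# as merged, disjoint (lo, hi) ranges so the window stays compact.
class SessionWindow:
    def __init__(self):
        self.ranges = []  # sorted, disjoint, inclusive (lo, hi) ranges

    def accept(self, seq):
        """Return True if seq is new (ingest it), False if a duplicate."""
        for i, (lo, hi) in enumerate(self.ranges):
            if lo <= seq <= hi:
                return False                      # duplicate
            if seq == hi + 1:                     # extend range upward
                self.ranges[i] = (lo, seq)
                self._merge(i)
                return True
            if seq == lo - 1:                     # extend range downward
                self.ranges[i] = (seq, hi)
                return True
            if seq < lo:                          # new gap range
                self.ranges.insert(i, (seq, seq))
                return True
        self.ranges.append((seq, seq))
        return True

    def _merge(self, i):
        # merge with the following range if they became adjacent
        if i + 1 < len(self.ranges) and self.ranges[i][1] + 1 == self.ranges[i + 1][0]:
            self.ranges[i:i + 2] = [(self.ranges[i][0], self.ranges[i + 1][1])]

w = SessionWindow()
print([w.accept(s) for s in (1, 2, 2, 4, 3)])  # [True, True, False, True, True]
print(w.ranges)                                # [(1, 4)]
```

Out-of-order arrivals and retransmits collapse back into a single range, which is what keeps the per-session state small.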

You can't have your farm of tier-1 nodes picking the unique id for the message; coordinating that would be too much overhead. Any unique id for the message should be per-sender, and it would need to have a sender-id, generation-id and sequence-id. (You can reliably store a generation-id and not re-use it if you restart, but trying to reliably store a sequence-id gets you back to the fsync problem. So you accept that you will re-use sequence-ids on the same machine, but make sure you never use the same sequence-id with the same generation-id on the same machine.)
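As a sketch of that id scheme (the names and file layout are invented; the point is one fsync per restart for the generation counter, and none per message):

```python
import os

def next_generation(path):
    """Bump and durably record the generation counter.
    This is the only fsync: one per process start, never per message."""
    try:
        with open(path) as f:
            gen = int(f.read()) + 1
    except FileNotFoundError:
        gen = 1
    with open(path, "w") as f:
        f.write(str(gen))
        f.flush()
        os.fsync(f.fileno())
    return gen

class MessageIds:
    """(sender-id, generation-id, sequence-id) tuples: sequence-ids may
    repeat across restarts, but never within the same generation."""
    def __init__(self, sender, gen_path):
        self.sender = sender
        self.generation = next_generation(gen_path)
        self.seq = 0

    def next_id(self):
        self.seq += 1
        return (self.sender, self.generation, self.seq)
```

After a restart the generation bumps, so a reused sequence number can never collide with one from the previous run.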

- Replication topology differences
 * Replication topology can be of 2 types, as you correctly pointed out:
K-way fan-out vs hop-by-hop with K hops.
 * The K-way fan-out design (using omrelp or imrelp) that the
RELP-based approach adopts requires another complex component which,
in case of node failure, needs to decide what the surviving replicas
must do. Such a component must be aware of replica placement, must be
able to elect a master and allow the master to propagate changes
further, and must re-run master election if one of the replicas is
lost. This requires distributed coordination, which is a hard
problem, as you pointed out earlier. Not just that: it must also
elect a new replica from a pool to replace the lost one, as the
replication factor would have fallen by 1.

There is a huge advantage in having the cluster maintain its state and control membership, rather than having the things sending to the cluster maintain that state.

First, you will have a lot fewer machines in the cluster than sending to the cluster.

Second, determining whether a machine is healthy involves a lot more than whether you can connect to it, and that sort of knowledge can only be gathered from inside the cluster. So either the cluster runs health checks and reports them to the clients, which then decide whether a node is healthy enough to use, or the cluster runs health checks and reports them to the cluster-management software on the cluster itself.

Third, having multiple views of what machines are in the cluster (one per client) is a recipe for confusion, as the different clients can end up with different opinions on which nodes are healthy.

 * Impact on solution: Burstability needs to be built in front of the
datastore. Between replicated queues vs. backpressure and backlog
buildup on the producer, replicated queues are preferable for
redundancy reasons.

Assuming that your distributed datastore allows you to feed the same message to it multiple times and have it weed out duplicates. I think that this is a much harder problem than you are thinking.

- RELP or not
 * My benchmarks on a 1G 20-core box achieved a maximum of 27 MB/s
(uncompressed) with RELP with multiple producers. Compare that with
ptcp: 98 MB/s uncompressed, 220 MB/s with stream compression. So
efficiency is one of the points of contention.
 * Per-message acks do not bring a whole lot of value over
seq_no-range-based acks. The entire range will have to be
re-transmitted in case of error, but if errors are infrequent, that
won't be a problem.

So let's look at improving RELP, something that will help many people who aren't going to do the millions of logs/sec with the dozens of nodes per layer that you are describing.

Where are the bottlenecks in RELP? It does allow a window of X messages to be waiting for acks. Does that window need to be larger? Do we need to allow for multiple acks to be sent as one message?

Your approach assumes that if message 45 is acked, then messages 1-44 are also acked, which is not necessarily a valid assumption.
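For illustration, here's what non-cumulative, range-acked sender bookkeeping might look like (a sketch of the idea, not RELP's actual wire protocol):

```python
class AckWindow:
    """Sender-side tracking that does NOT assume cumulative acks:
    every sequence number stays outstanding until a range covering it
    is acknowledged, so an ack of 45 says nothing about 1-44."""
    def __init__(self, size):
        self.size = size
        self.outstanding = set()

    def can_send(self):
        return len(self.outstanding) < self.size

    def sent(self, seq):
        self.outstanding.add(seq)

    def ack_range(self, lo, hi):
        # one ack message covers a whole contiguous range
        self.outstanding -= set(range(lo, hi + 1))

    def needs_retransmit(self):
        return sorted(self.outstanding)

w = AckWindow(size=128)
for seq in range(1, 6):
    w.sent(seq)
w.ack_range(2, 4)
print(w.needs_retransmit())   # [1, 5]
```

Note that after acking 2-4, messages 1 and 5 are still individually outstanding; a cumulative-ack scheme would have wrongly forgotten message 1.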

 * In a hop-by-hop replication topology, we care about ACKs flowing
end-to-end, so additional reliability between each hop doesn't add a
lot of value.

You are again arguing based on your solution rather than the problem. You are also contradicting yourself:

If you are doing end-to-end acks, then the sender cannot forget the message until the final destination acks.

If you are doing hop-by-hop acks, then the sender can forget the message as soon as it is reliably at the next tier.

which is it?

 * Given wide-enough sliding windows and pipelining (one of my
original sequence diagrams shows pipelining; I don't show it in every
diagram for brevity), anything built over ptcp will run as close to
un-acked speed as possible.
 * Given that it's TCP, and we have last-hop-to-source L7 acks, we
can't really lose an intermediate message. So per-message ACKs are
not valuable in this case.

Only if all messages follow the same path. If some messages go to machines (tier-node) 1-1, 2-1, 3, other messages go 1-1, 2-2, 3, and yet others go 1-2, 2-1, 3, you no longer have a single TCP flow, so messages will not necessarily arrive in the same order.

- Broadcast-domain concern
 * The message producer being in the same broadcast domain is
impossible, because broadcast domains larger than 1024 chatty hosts
are inefficient; the larger, the worse (wrt ARP table sizes, noise on
the network, etc.).

Cluster IP only requires that the recipients be in the same broadcast domain.

Also, if you are moving to statistically reliable rather than actually guaranteed reliable (since you are storing things in RAM, so if the whole system loses power you lose the message), you can actually consider using UDP for your relay. At that point the sender can put one copy on the wire and the switch will duplicate it for each machine in the cluster, avoiding bandwidth limits on the sender (and no, commodity hardware doesn't have 10G interfaces on every system, it has 1G on every system).
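A sketch of what the sender side of that could look like (the multicast group and port are invented examples; real receivers would also need to join the group):

```python
import socket

GROUP, PORT = "239.1.1.1", 5514   # example multicast group/port

def send_log(msg: bytes, addr=(GROUP, PORT)):
    """Put one copy of the datagram on the wire; for a multicast
    address, the network replicates it to every subscribed cluster
    node, so the sender's bandwidth cost is a single copy."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    s.sendto(msg, addr)
    s.close()
```

The trade-off is exactly the one described above: no acks, so delivery is statistical rather than guaranteed.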

 * Techniques like ECMP can be used to manage redundancy behind an IP and
for obtaining effective load-balancing too.

You are adding a LOT of complexity here. I don't know what scale you are thinking you will be operating at, but I've worked at Google and know a bit about what they do for logging, and they are not large enough to need what you seem to be envisioning. I've also worked in banking, and even they do not need the super-strong guarantees that you are talking about for even 0.1% of their logs. A stock exchange may need the volume and reliability that you are talking about, but they are not going to use general-purpose software, because its flexibility would cause unacceptable latency for them. There are also not enough stock exchanges out there for it to be worth significantly disrupting rsyslog on the theory that they may use it.

I agree that implementing it in rsyslog will be a significant amount
of work (but it'll be a significant amount of work even if it's
implemented in a dedicated ingestion relay that only does message
movement, nothing else). In my head the counter-weight is:
- Same interface for users (they talk rulesets, actions, plugins,
etc.) across ack'ed and un-ack'ed pipelines
- Capability gain: being able to work at very large scale with
efficiency using syslog

How much complexity will it add to rsyslog? How many people would actually use this architecture? And how close can we come to achieving the results without having to change the fundamental way that rsyslog works?

Specifically, the fact that rsyslog queues are one-directional: an input is finished processing a message once it's added to the queue. Implementing end-to-end acks requires that the input keep track of the sender until after the output has processed the message, so that the ack can flow back upstream to the sender.

I want us to figure out what we can do without breaking this fundamental design factor of rsyslog.



Thinking about it a bit more, RELP between nodes may not be needed, because you are already saying that if a machine hiccups, you are willing to lose all messages that machine is processing. So if you just use TCP as your transport, as long as you don't have any SPOF between your sender and next-hop recipients that could cut all your connections at once, you can count on that redundancy to protect you. If you don't guarantee that there is no SPOF, then you do need an application-level ack, and we are back to RELP.

The sender can already craft and add a sequence-id, and its hostname can be the sender-id. Generation-id is a bit harder, but could be done in a wrapper around rsyslog or by (mis)using the table-lookup mechanism to read a value from disk after rsyslog starts. Generating a unique sequence-id will slow rsyslog's throughput because it requires coordination between threads, but a modification to have the output module create a sequence-id whose bottom bits aren't incremented, but instead hold the thread-id, would address this performance bottleneck.
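One possible shape for that thread-sharded counter (illustrative only; the bit width is arbitrary, and rsyslog itself would do this in C inside the output module):

```python
import threading

SHARD_BITS = 8   # low bits carry a per-thread shard id (assumed width)

class ShardedSeq:
    """Each thread owns its own counter; the shard id in the low bits
    keeps ids unique without taking a lock on the per-message path."""
    def __init__(self):
        self._local = threading.local()
        self._lock = threading.Lock()   # taken once per thread, not per id
        self._next_shard = 0

    def next_id(self):
        if not hasattr(self._local, "shard"):
            with self._lock:            # one-time shard assignment
                self._local.shard = self._next_shard
                self._next_shard += 1
            self._local.count = 0
        self._local.count += 1
        return (self._local.count << SHARD_BITS) | self._local.shard

seq = ShardedSeq()
print(seq.next_id(), seq.next_id())   # 256 512 on the first thread
```

Two threads can never collide, because their low bits differ even when their counters hold the same value.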

I think that the biggest bottleneck that you would end up running into would be the formatting of the output, especially if using JSON formatting so that you can send the sequence-id independently of the original message contents.

David Lang
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.
