OK, let me try to contrast properties of the originally proposed architecture vs. the two RELP-based alternate architectures.
I'll enumerate the points (as top-level bullet items), and explain the trade-offs for each point as sub-bullets below it. I have picked each top-level item based on one of the following reasons:
- Architectural alignment (ability to provide the desired level of reliability, ease of failure-mode definition, easy serviceability, etc.)
- Effective cost of maintenance as a large-scale system
- Concerns you brought up (around the nature of distributed systems, the complexity involved, data-management concerns, etc.)
- Highlighting the important trade-offs between solutions effectively.

I'll also point out the impact of each one on the options we are left with. So here is the list:

- Throughput problems (with per-message fsync):
  * Per-message fsync means one fsync per ~1k bytes.
  * Each fsync does a round-trip to disk (or to the RAID controller).
  * Read-around-write makes the effect more pronounced under memory pressure.
  * For data-centers being built now, 10G network cards on servers are the norm.
  * A general-purpose commodity server has limited file-IO bandwidth: 130 MB/s per spindle (for a 10k SAS disk). 6 spindles per box (1U chassis) = 780 MB/s; with RAID-10 that becomes 3 effective spindles, so 390 MB/s, which is still a good 800 MB/s shy of 10G.
  * Note that 130 MB/s is 100% fsync-free throughput; with frequent fsync this drops to less than 20% of the original, which would be < 100 MB/s on RAID-10.
  * A 1G machine running ptcp with stream compression does 220 MB/s of uncompressed throughput (around 110 MB/s compressed, almost line-rate) with CPU utilization remaining below 10%.
  * Impact on solution: per-message fsync is not an option from an efficiency and resource-utilization point of view.
- Serviceability concerns:
  * One of the constraints large scale puts on you is that you can't handle problems manually anymore.
    One of the economical things in a cloud, for instance, is to terminate and start a new instance as the first response to any trouble, using online diagnostics to find machines with problems and fix them out of band.
  * Even though storage has redundancy, a significant percentage of node losses happen because of non-storage-related issues (freeze-ups, management-software issues, false alarms, bad memory, network blips, etc.).
  * Even if a node is degraded because of a bad disk, notifying the DC team to replace the disk, perform RAID repair, and resume operation is time-consuming and expensive if done machine by machine, on demand.
  * Ideal server maintenance is done in batches: identify the bad rack/U-number and have an end-of-the-day run to address them in one shot.
  * Architectures that make sense at large scale are the ones that can self-heal. Basically: plan for failure, and run services undegraded in the face of bounded (within the planned range) partial failures.
  * Acking based on just writing data to disk (fsync or no fsync), with no way to recover data when a node dies and has to be replaced by a new node, directly hurts self-healing capability.
  * Impact on solution: recovery from degradation on node replacement is an important trait, so replication matters. (I understand the RELP write-to-2-clusters solution is in alignment with this; just stating it explicitly.)
- Overall effective reliability as a large-scale system (in terms of guarantees provided as a log-management service to end users):
  * Recovery after a crash (i.e. not being able to recover within a few seconds) on a large cluster directly affects lag-related guarantees (lag being the time from message generation to surfacing).
- Impact of message modification:
  * The message routing and ingestion into the datastore is transparent to the user, so by definition no elements of the routing path must leak into the functional representation of the message. The user will not know, or care, what nodes the messages traversed.
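To make the physical-vs-logical distinction concrete, here is a toy illustration (the framing shown is a simplified sketch in the spirit of RFC 6587 octet counting, not the actual RELP/syslog wire format):

```python
# Toy illustration: two different physical representations of the same
# logical message. Re-framing in some tier changes the bytes on the wire,
# but not the functional content the user sees.

def frame_octet_counted(msg: bytes) -> bytes:
    # "<length> <payload>" framing (simplified octet counting)
    return str(len(msg)).encode() + b" " + msg

def frame_lf_delimited(msg: bytes) -> bytes:
    # "<payload>\n" framing
    return msg + b"\n"

def logical_content(framed: bytes) -> bytes:
    # recover the payload regardless of physical framing
    if framed.endswith(b"\n"):
        return framed[:-1]
    length, payload = framed.split(b" ", 1)
    assert len(payload) == int(length)
    return payload

msg = b"<134>app: session=42 seq=7 user logged in"
a = frame_octet_counted(msg)   # physical representation in Tier-1
b = frame_lf_delimited(msg)    # physical representation after re-framing

assert a != b                                            # physical bytes differ...
assert logical_content(a) == logical_content(b) == msg   # ...logical content is identical
```

Any re-transmit downstream of the tier that re-framed the message would naturally use the re-framed bytes, which is harmless precisely because the logical content is unchanged.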
  * Message modification in terms of a change of physical representation (while keeping the logical content the same) is the only relevant type of modification. It can be done in any arbitrary tier (say Tier-1); all recovery (by re-transmit) in any tier after Tier-1 then happens using the modified messages. It could equally happen in Tier-2, in which case the same holds for any tier after Tier-2 (and any recovery before Tier-2 would happen by re-transmitting unmodified messages).
- Kind of complexity that exists in writing to / reading from the storage tier:
  * With a data volume of a few million messages per second, a distributed datastore is necessary anyway (no single-instance datastore built for commodity hardware can ingest data at that rate).
  * We provide an at-least-once delivery guarantee, which means the datastore needs a way to identify duplicates either during ingestion or while serving reads. While multiple solutions are possible here, a rather simple scheme comes to mind: hash the message by some key (say session_id) and send it to the same machine in the datastore cluster, and maintain a session-level idempotent-ingestion mechanism on that node (with a fixed sliding window of just sequence-number ranges; use a range representation to make it efficient).
  * This has to live with the complexity of a multi-master design because of the scale of the problem. The fact that we have multiple nodes in Tier-K of our design has no bearing on that complexity.
- Replication topology differences:
  * The replication topology can be of 2 types, as you correctly pointed out: K-way fan-out vs. hop-by-hop with K hops.
  * The K-way fan-out design (using omrelp or imrelp) that the RELP-based design adopts requires another complex component which, in case of node failure, needs to decide what the surviving replicas must do. Such a component must be aware of replica placement, must be able to elect a master and allow the master to propagate changes further, and must re-run master election in case one of the replicas is lost.
    This requires distributed coordination, which is a hard problem, as you pointed out earlier. Not just that: it must also elect a new replica from a pool to replace the lost one, as the replication factor would have fallen by 1.
  * On the other hand, the hop-by-hop architecture makes the node responsible for propagating changes downstream implicit, and provides failover without coordination.
  * The network-partition problem (what you called split-brain earlier) becomes much more pronounced in an architecture that has a coordinator, dynamic role change, coordinator-assisted failover, etc.
  * Also, it won't be slow because of end-to-end ACKs, provided the sliding window has been tuned correctly. Transmission continues because we are using pipelining.
- Ease of building burstability in ingestion rate:
  * It's difficult to build burstability into the indexing rate of a datastore because a lot more things are at play. For any typical datastore you need to worry about the impact on checkpointing, the replication mechanism's ability to catch up, worker-pool sizes, data-segment merging, the impact on available file-IO bandwidth, etc.
  * It's easier to build a queue in front of the datastore, because all you need to worry about is where to keep the data. No complex data structures, reorganization of data, etc. are involved.
  * Backpressure is another good way of handling this, but in the context of an ingestion-relay it causes a higher volume of data to keep sitting in the producer's local store, which is unreplicated.
  * Impact on solution: burstability needs to be built before the datastore. Between replicated queues vs. backpressure with backlog build-up on the producer, replicated queues are preferable for redundancy reasons.
- RELP or not:
  * My benchmarks on a 1G 20-core box achieved a maximum of 27 MB/s (uncompressed) with RELP with multiple producers. Compare that with ptcp: 98 MB/s uncompressed, 220 MB/s with stream compression. So efficiency is one of the points of contention.
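As an aside, the pipelining/sliding-window point can be made concrete with a toy model (the class and window size are invented for illustration; this is not rsyslog or RELP code):

```python
# Toy model of pipelining with cumulative (range) acks: the sender streams
# messages without waiting for per-message acks, and a single ack for
# sequence number n retires the whole range [base, n] at once.

class Sender:
    def __init__(self, window_size):
        self.window_size = window_size
        self.base = 0          # lowest un-acked seq_no
        self.next_seq = 0      # next seq_no to transmit

    def can_send(self):
        # pipelining: keep transmitting until the window is full
        return self.next_seq - self.base < self.window_size

    def send(self):
        seq = self.next_seq
        self.next_seq += 1
        return seq

    def on_range_ack(self, acked_through):
        # one ack retires every message up to and including acked_through;
        # on error, the receiver would instead request re-transmit from base
        self.base = max(self.base, acked_through + 1)

s = Sender(window_size=4)
for _ in range(4):
    assert s.can_send()
    s.send()
assert not s.can_send()   # window full: without any ack, transmission stalls here
s.on_range_ack(2)         # one cumulative ack retires seq 0, 1 and 2
assert s.can_send()       # window slides forward, the pipeline keeps moving
```

With a wide enough window and infrequent errors, the sender almost never blocks, which is why acked throughput can approach un-acked throughput. The same compact sequence-range representation can also back the session-level dedup window mentioned under the storage-tier bullet above.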
  * Per-message acks don't bring a whole lot of value over seq_no-range-based acks. The entire range has to be re-transmitted in case of error, but if errors are infrequent, that won't be a problem.
  * In a hop-by-hop replication topology we care about ACKs flowing end-to-end, so additional reliability between each hop doesn't add a lot of value.
  * Given wide-enough sliding windows and pipelining (one of my original sequence diagrams talks about pipelining; I don't show it in every diagram for brevity), anything built over ptcp will run as close to un-acked speed as possible.
  * Given it's TCP, and we have last-hop-to-source L7 acks, we can't really lose an intermediate message. So per-message ACKs are not valuable in this case.
- Broadcast-domain concern:
  * Having the message producer in the same broadcast domain is impossible, because broadcast domains larger than 1024 chatty hosts are inefficient, and the larger, the worse (w.r.t. ARP table sizes, noise on the network, etc.).
  * Techniques like ECMP can be used to manage redundancy behind an IP and to obtain effective load-balancing too.

I agree that implementing it in rsyslog will be a significant amount of work (but it would be a significant amount of work even if it were implemented in a dedicated ingestion-relay that only does message movement, nothing else). In my head the counter-weights are:
- Same interface for users (they talk rulesets, actions, plugins, etc. across acked and un-acked pipelines)
- Capability gain: being able to work at very large scale, with efficiency, using syslog.

On Sat, Jan 24, 2015 at 1:03 PM, David Lang <[email protected]> wrote:
> On Fri, 23 Jan 2015, David Lang wrote:
>
>> On Sat, 24 Jan 2015, singh.janmejay wrote:
>>
>>> That is alright. I wanted to float the idea and get an agreement. I'll be
>>> happy to implement it, if we see value in it.
>>
>> Let's back up here and talk about the problem rather than getting bogged
>> down in the solution.
>>
>> Rsyslog already has the ability to relay the message from one machine to
>> another reliably, the problem you are trying to address is what happens if
>> the box dies after the receiving machine acks the message, correct?
>>
>> Currently we can (at a performance cost) configure the receiving machine
>> to ack the message only after it's safe on redundant disks, but the message
>> still won't be delivered until after the machine is repaired (worst case)
>>
>> You are looking for a way to deal with this problem.
>>
>> Is this a reasonable problem statement?
>>
>
> following up (to try and keep things in bite-size pieces), there are two
> parts to the problem.
>
> 1. delivering messages to multiple machines
>
> 2. combining the resulting messages again at the end.
>
> for the moment let's ignore #2
>
> Looking at #1.
>
> There are two fundamental approaches you can take.
>
> 1. you can have the sender deliver it to multiple destinations and not
> consider it delivered until it's been accepted by enough destinations.
>
> 2. you can have the receiver replicate it out to multiple peers and not
> consider it delivered (and therefore not send the ack) until it's been
> accepted by enough peers.
>
> the first option would mostly modify omrelp while the second would mostly
> modify imrelp
>
> Since we have many cases where people want to replicate the data,
> sometimes to different networks, I think we would be best off doing it on
> the output side.
>
> Actually, now that I think about it. If you have a ruleset with a queue on
> it that contains two relp actions, no message will be considered delivered
> until both actions have succeeded in delivering it. As long as you have
> some way of ensuring that both actions aren't going to deliver to the same
> destination, no code changes are needed. If you make the destinations be
> two clusters of machines, at least one machine in each cluster will receive
> the message.
>
> Getting fancier than this, if RELP has the receiver identify themselves,
> you could modify the RELP output module to allow multiple destinations in
> some way, and if it receives the same ID that it already has for one of its
> existing connections, close the connection and retry. This way you could
> have all the machines in a single cluster and guarantee that you would
> deliver to at least two of them.
>
> by the way, one other issue you will run into in modern datacenters is
> figuring out how to make sure that you don't have the machines that you are
> using for redundancy sharing a single point of failure. That SPoF could be
> that they are both VMs on the same host machine, both are plugged in to the
> same power source, both need to talk through the same switch, both need to
> talk through the same router, etc. Then you have to worry about what
> happens if communication breaks down between these independent pieces but
> they can still talk to the outside world (this is referred to as the "split
> brain" problem). The complexity in ensuring that you have the machines
> completely independent is severe enough that I personally would be happier
> having two independent clusters rather than replicating to N nodes of the
> same cluster and trying to make sure that you have hit machines that are
> completely independent of each other.
>
> So it may be that this part of the problem is doable today, with a bit of
> ugly config, but no code changes needed (although it's very possible to
> make things simpler).
>
>
> David Lang
>
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com/professional-services/
> What's up with rsyslog? Follow https://twitter.com/rgerhards
> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
> DON'T LIKE THAT.
--
Regards,
Janmejay
http://codehunk.wordpress.com

