On Sun, 25 Jan 2015, singh.janmejay wrote:

Ok, let me try to contrast properties of the originally proposed
architecture vs. the two RELP based alternate architectures.

I was trying to start at the problem, not the proposed solution.

- Throughput problems (with per-message fsync):
 * Per-message fsync means one fsync per 1k bytes
 * Each fsync does a round-trip to disk (or the RAID controller)
 * Read-around-write; the effect is more pronounced under memory pressure
 * For data-centers being built now, 10G network cards on servers are
the norm
 * A general-purpose commodity server has limited file-IO bandwidth:
130 MB/s per spindle (for a 10k SAS disk) * 6 spindles per box (1U
chassis) = 780 MB/s; with RAID-10 that becomes 3 effective spindles,
so 390 MB/s, which is still a good 800 MB/s shy of 10G
 * Note that 130 MB/s is 100% fsync-free throughput; with frequent
fsync this is going to drop to less than 20% of the original, which
would be < 100 MB/s on RAID-10

Spinning-rust disks can do at most one fsync per rotation, and using a disk queue and then writing the results requires multiple fsyncs per message. A 15K RPM disk can do ~160 fsyncs/sec.
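Both sets of numbers above can be sanity-checked with quick arithmetic. All inputs are the figures assumed in this thread; the ~160 fsyncs/sec follows if each fsync costs roughly a full rotation plus an average half rotation of latency, which is an assumption on my part:

```python
# Disk bandwidth figures from the bullet list above (assumed inputs).
mb_per_spindle = 130            # 10k SAS, sequential, fsync-free
raw = mb_per_spindle * 6        # 6 spindles in a 1U chassis
raid10 = raw // 2               # mirroring halves usable write bandwidth
with_fsync = raid10 * 0.20      # frequent fsync keeps < 20% of throughput
print(raw, raid10, with_fsync)  # 780 390 78.0 -- vs 1250 MB/s for 10GbE

# Max fsync rate for a 15K RPM disk: one full rotation plus an average
# half rotation of latency per fsync (assumed cost model).
rotations_per_sec = 15_000 / 60
fsyncs_per_sec = rotations_per_sec / 1.5
print(round(fsyncs_per_sec))    # ~167
```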

Now, SSDs are starting to hit commodity price levels, and they can handle higher rates, at a longevity cost.

If you want true guarantees, then you need to replicate your data to durable storage. If you are willing to settle for statistically reliable, then replicating to RAM on multiple machines may be 'good enough', but it's not the same.

The problem with replicating to the cluster isn't simple bandwidth; it's latency, and the problem of figuring out how long you should wait for the other nodes (which may have failed) to respond.

- Serviceability concerns:
 * One of the constraints large scale imposes is that you can't handle
problems manually anymore. One of the economical things in a cloud,
for instance, is to terminate and start a new instance as the first
response to any trouble, using online diagnostics to find machines
with problems and fix them out of band.

This causes data loss if there is any data unique to that machine, be it in RAM or on disk.

- Impact of message modification
 * The message routing and ingestion into the datastore is transparent
to the user. So by definition, no elements of the routing path must
leak into the functional representation of the message. So the user
will not know, or care, about which nodes the messages traversed.

As an admin of the system, I may want to know which nodes it traversed, even if the person generating the message doesn't. If some messages are being delayed significantly or corrupted, it's really helpful to have the messages tagged with the nodes they traverse so you can see which node is likely to be having problems.

- Kind of complexity that exists in writing/reading from the storage tier
 * With a data volume of a few million messages per second, a
distributed datastore is necessary anyway (no single-instance
datastore built for commodity hardware can ingest data at that rate).

So now you are talking about a unique data store that rsyslog doesn't currently support.

Development of that should be its own project.

 * We provide an at-least-once delivery guarantee, which means the
datastore should have a way to identify duplicates either during
ingestion or when serving reads. While multiple solutions are
possible here, a rather simple scheme comes to mind: hash the message
by some key (say session_id) and send it to the same machine in the
datastore cluster, and maintain a session-level idempotent-ingestion
mechanism (with a fixed sliding window of just sequence-number
ranges, using a range representation to make it efficient) on that
node.
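The sliding-window scheme just described could be sketched like this (illustrative Python; the class name is made up, and a real store node would bound and persist the window):

```python
# Per-session idempotent ingestion: remember accepted sequence numbers
# as merged, disjoint (lo, hi) ranges so the window stays compact.
class SessionWindow:
    def __init__(self):
        self.ranges = []  # sorted, disjoint, inclusive (lo, hi) ranges

    def accept(self, seq):
        """Return True if seq is new (ingest it), False if a duplicate."""
        for i, (lo, hi) in enumerate(self.ranges):
            if lo <= seq <= hi:
                return False                      # duplicate
            if seq == hi + 1:                     # extend range upward
                self.ranges[i] = (lo, seq)
                self._merge(i)
                return True
            if seq == lo - 1:                     # extend range downward
                self.ranges[i] = (seq, hi)
                return True
            if seq < lo:                          # new gap range
                self.ranges.insert(i, (seq, seq))
                return True
        self.ranges.append((seq, seq))
        return True

    def _merge(self, i):
        # merge with the following range if they became adjacent
        if i + 1 < len(self.ranges) and self.ranges[i][1] + 1 == self.ranges[i + 1][0]:
            self.ranges[i:i + 2] = [(self.ranges[i][0], self.ranges[i + 1][1])]

w = SessionWindow()
print([w.accept(s) for s in (1, 2, 2, 4, 3)])  # [True, True, False, True, True]
print(w.ranges)                                # [(1, 4)]
```

Out-of-order arrivals and retransmits collapse back into a single range, which is what keeps the per-session state small.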

You can't have your farm of tier-1 nodes picking the unique id for the message; coordinating that would be too much overhead. Any unique id for the message should be per-sender, and it would need to have a sender-id, generation-id and sequence-id. (You can reliably store a generation-id and not re-use it if you restart, but trying to reliably store a sequence-id gets you back to the fsync problem. So you accept that you will re-use sequence-ids on the same machine, but make sure you never use the same sequence-id with the same generation-id on the same machine.)
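As a sketch of that id scheme (the names and file layout are invented; the point is one fsync per restart for the generation counter, and none per message):

```python
import os

def next_generation(path):
    """Bump and durably record the generation counter.
    This is the only fsync: one per process start, never per message."""
    try:
        with open(path) as f:
            gen = int(f.read()) + 1
    except FileNotFoundError:
        gen = 1
    with open(path, "w") as f:
        f.write(str(gen))
        f.flush()
        os.fsync(f.fileno())
    return gen

class MessageIds:
    """(sender-id, generation-id, sequence-id) tuples: sequence-ids may
    repeat across restarts, but never within the same generation."""
    def __init__(self, sender, gen_path):
        self.sender = sender
        self.generation = next_generation(gen_path)
        self.seq = 0

    def next_id(self):
        self.seq += 1
        return (self.sender, self.generation, self.seq)
```

After a restart the generation bumps, so a reused sequence number can never collide with one from the previous run.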

- Replication topology differences
 * Replication topology can be of 2 types, as you correctly pointed out:
K-way fan-out vs hop-by-hop with K hops.
 * The K-way fan-out design (using omrelp or imrelp) that the
RELP-based approach adopts requires another complex component which,
in case of node failure, needs to decide what the surviving replicas
must do. Such a component must be aware of replica placement, must be
able to elect a master and allow the master to propagate changes
further, and must re-run master election if one of the replicas is
lost. This requires distributed coordination, which is a hard
problem, as you pointed out earlier. Not just that: it must also
elect a new replica from a pool to replace the lost one, as the
replication factor would have fallen by 1.

There is a huge advantage in having the cluster maintain its state and control membership, rather than having the things sending to the cluster maintain that state.

First, you will have a lot fewer machines in the cluster than sending to the cluster.

Second, determining whether a machine is healthy involves a lot more than whether you can connect to it, and that sort of knowledge can only be gathered from inside the cluster. So either the cluster runs health checks and reports them to the clients, which then decide whether a node is healthy enough to use, or the cluster runs health checks and reports them to the cluster-management software on the cluster itself.

Third, having multiple views of what machines are in the cluster (one per client) is a recipe for confusion, as the different clients can end up with different opinions on which nodes are healthy.

 * Impact on solution: Burstability needs to be built in front of the
datastore. Between replicated queues vs. backpressure and backlog
buildup on the producer, replicated queues are preferable for
redundancy reasons.

Assuming that your distributed datastore allows you to feed the same message to it multiple times and have it weed out duplicates. I think that this is a much harder problem than you are thinking.

- RELP or not
 * My benchmarks on a 1G 20-core box achieved a maximum of 27 MB/s
(uncompressed) with RELP with multiple producers. Compare that with
ptcp: 98 MB/s uncompressed, 220 MB/s with stream compression. So
efficiency is one of the points of contention.
 * Per-message acks do not bring a whole lot of value over
seq_no-range-based acks. The entire range will have to be
re-transmitted in case of error, but if errors are infrequent, that
won't be a problem.

So let's look at improving RELP, something that will help many people who aren't going to do the millions of logs/sec with the dozens of nodes per layer that you are describing.

Where are the bottlenecks in RELP? It does allow a window of X messages to be waiting for acks. Does that window need to be larger? Do we need to allow for multiple acks to be sent as one message?

Your approach assumes that if message 45 is acked, then messages 1-44 are also acked, which is not necessarily a valid assumption.
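For illustration, here's what non-cumulative, range-acked sender bookkeeping might look like (a sketch of the idea, not RELP's actual wire protocol):

```python
class AckWindow:
    """Sender-side tracking that does NOT assume cumulative acks:
    every sequence number stays outstanding until a range covering it
    is acknowledged, so an ack of 45 says nothing about 1-44."""
    def __init__(self, size):
        self.size = size
        self.outstanding = set()

    def can_send(self):
        return len(self.outstanding) < self.size

    def sent(self, seq):
        self.outstanding.add(seq)

    def ack_range(self, lo, hi):
        # one ack message covers a whole contiguous range
        self.outstanding -= set(range(lo, hi + 1))

    def needs_retransmit(self):
        return sorted(self.outstanding)

w = AckWindow(size=128)
for seq in range(1, 6):
    w.sent(seq)
w.ack_range(2, 4)
print(w.needs_retransmit())   # [1, 5]
```

Note that after acking 2-4, messages 1 and 5 are still individually outstanding; a cumulative-ack scheme would have wrongly forgotten message 1.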

 * In a hop-by-hop replication topology, we care about ACKs flowing
end-to-end, so additional reliability between each hop doesn't add a
lot of value.

You are again arguing based on your solution rather than the problem. You are also contradicting yourself:

If you are doing end-to-end acks, then the sender cannot forget the message until the final destination acks.

If you are doing hop-by-hop acks, then the sender can forget the message as soon as it is reliably at the next tier.

which is it?

 * Given wide-enough sliding windows and pipelining (one of my
original sequence diagrams shows pipelining; I don't show it in every
diagram for brevity), anything built over ptcp will run as close to
un-acked speed as possible.
 * Given that it's TCP, and we have last-hop-to-source L7 acks, we
can't really lose an intermediate message. So per-message ACKs are
not valuable in this case.

Only if all messages follow the same path. If some messages go to machines (tier-node) 1-1, 2-1, 3, other messages go 1-1, 2-2, 3, and yet others go 1-2, 2-1, 3, you no longer have a single TCP flow, so messages will not necessarily arrive in the same order.

- Broadcast-domain concern
 * The message producer being in the same broadcast domain is
impossible, because broadcast domains larger than 1024 chatty hosts
are inefficient; the larger, the worse (wrt ARP table sizes, noise on
the network, etc.).

Cluster IP only requires that the recipients be in the same broadcast domain.

Also, if you are moving to statistically reliable rather than actually guaranteed reliable (since you are storing things in RAM, so if the whole system loses power you lose the message), you can actually consider using UDP for your relay. At that point the sender can put one copy on the wire and the switch will duplicate it for each machine in the cluster, avoiding bandwidth limits on the sender (and no, commodity hardware doesn't have 10G interfaces on every system, it has 1G on every system).
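A sketch of what the sender side of that could look like (the multicast group and port are invented examples; real receivers would also need to join the group):

```python
import socket

GROUP, PORT = "239.1.1.1", 5514   # example multicast group/port

def send_log(msg: bytes, addr=(GROUP, PORT)):
    """Put one copy of the datagram on the wire; for a multicast
    address, the network replicates it to every subscribed cluster
    node, so the sender's bandwidth cost is a single copy."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    s.sendto(msg, addr)
    s.close()
```

The trade-off is exactly the one described above: no acks, so delivery is statistical rather than guaranteed.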

 * Techniques like ECMP can be used to manage redundancy behind an IP and
for obtaining effective load-balancing too.

You are adding a LOT of complexity here. I don't know what scale you are thinking you will be operating at, but I've worked at Google and know a bit about what they do for logging, and they are not large enough to need what you seem to be envisioning. I've also worked in banking, and even they do not need the super-strong guarantees that you are talking about for even 0.1% of their logs. A stock exchange may need the volume and reliability that you are talking about, but they are not going to use general-purpose software, because its flexibility would cause unacceptable latency for them. There are also not enough stock exchanges out there for it to be worth significantly disrupting rsyslog on the theory that they may use it.

I agree that implementing it in rsyslog will be a significant amount
of work (but it'll be a significant amount of work even if it's
implemented in a dedicated ingestion relay that only does message
movement, nothing else). In my head the counter-weight is:
- Same interface for users (they talk rulesets, actions, plugins,
etc.) across ack'ed and un-ack'ed pipelines
- Capability gain: being able to work at very large scale with
efficiency using syslog

How much complexity will it add to rsyslog? How many people would actually use this architecture? And how close can we come to achieving the results without having to change the fundamental way that rsyslog works?

Specifically, the fact that rsyslog queues are one-directional: an input is finished processing a message once it's added to the queue. Implementing end-to-end acks requires that the input keep track of the sender until after the output has processed the message, so that the ack can flow back upstream to the sender.

I want us to figure out what we can do without breaking this fundamental design factor of rsyslog.



Thinking about it a bit more, RELP between nodes may not be needed, because you are already saying that if a machine hiccups, you are willing to lose all messages that machine is processing. So if you just use TCP as your transport, as long as you don't have any SPOF between your sender and next-hop recipients that could cut all your connections at once, you can count on that redundancy to protect you. If you don't guarantee that there is no SPOF, then you do need an application-level ack, and we are back to RELP.

The sender can already craft and add a sequence-id, and its hostname can be the sender-id. Generation-id is a bit harder, but could be done in a wrapper around rsyslog or by (mis)using the table-lookup mechanism to read a value from disk after rsyslog starts. Generating a unique sequence-id will slow rsyslog's throughput because it requires coordination between threads, but a modification to have the output module create a sequence-id whose bottom bits aren't incremented, but instead hold the thread-id, would address this performance bottleneck.
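One possible shape for that thread-sharded counter (illustrative only; the bit width is arbitrary, and rsyslog itself would do this in C inside the output module):

```python
import threading

SHARD_BITS = 8   # low bits carry a per-thread shard id (assumed width)

class ShardedSeq:
    """Each thread owns its own counter; the shard id in the low bits
    keeps ids unique without taking a lock on the per-message path."""
    def __init__(self):
        self._local = threading.local()
        self._lock = threading.Lock()   # taken once per thread, not per id
        self._next_shard = 0

    def next_id(self):
        if not hasattr(self._local, "shard"):
            with self._lock:            # one-time shard assignment
                self._local.shard = self._next_shard
                self._next_shard += 1
            self._local.count = 0
        self._local.count += 1
        return (self._local.count << SHARD_BITS) | self._local.shard

seq = ShardedSeq()
print(seq.next_id(), seq.next_id())   # 256 512 on the first thread
```

Two threads can never collide, because their low bits differ even when their counters hold the same value.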

I think that the biggest bottleneck that you would end up running into would be the formatting of the output, especially if using JSON formatting so that you can send the sequence-id independently of the original message contents.

David Lang
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.
