On Sun, Jan 25, 2015 at 5:09 AM, David Lang <[email protected]> wrote:

> On Sun, 25 Jan 2015, singh.janmejay wrote:
>
>  if you want true guarantees, then you need to replicate your data to
>>> storage. If you are willing to settle for statistically reliable, then
>>> replicating to RAM on multiple machines may be 'good enough', but it's
>>> not
>>> the same.
>>>
>>>
>> I'm working with a simple definition of reliability. If something goes
>> wrong, be it an rsyslog crash, a hardware failure, or anything in between,
>> as far as the cluster is concerned, it's a node loss. The cluster should
>> be able to tolerate the loss of up to K nodes without losing data.
>>
>
> I'm looking a little wider than that. I'm looking at what happens if you
> lose a chunk of network or all the power in an area of the datacenter.
> Things that can cause you to 'lose' lots of nodes at one time.
>

Yes, losing more than K nodes can cause data loss.

You point out later that this is easier said than done at the
cloud-provisioning level. I had a simple scheme in mind; obviously a bigger
outage radius will still defeat it, but I think it's worth putting out.

I was thinking of each tier of the cluster being built in a different
availability zone. So power problems, minor disaster scenarios (limited to
one availability zone), etc. will not cause any data loss, even though losing
an entire availability zone will pause data movement completely (until the
tier is partly rebuilt using replacements).

Basically, whatever is deleted by the upstream is already replicated across
tiers, so a single-tier loss can't destroy it.



>
>   - Serviceability concerns:
>>>
>>>>  * One of the constraints large scale imposes is that you can't handle
>>>> problems manually anymore. One of the economical things in a cloud, for
>>>> instance, is to terminate and start a new instance as a first response
>>>> to any trouble, and to use online diagnostics to find machines with
>>>> problems and fix them out of band.
>>>>
>>>>
>>> this causes data loss if there is any data unique to that machine, be it
>>> in ram or on disk
>>>
>>>
>> No, it doesn't. It causes node-loss, but once the upper-tier, or
>> upper-system detects the loss, it re-floods un-delivered messages to the
>> replacement. The last 2 sequence-diagrams in the first mail walk through 2
>> ways of recovering from node-failures without any data-loss.
>>
>
> imprecise wording on my part, I meant that there is no attempt to preserve
> that data so that the node will send it later. That _copy_ of the data is
> lost.
>

True. Writing to disk is, I think, a valid option, and a valuable one when
the trade-off is acceptable. But in my opinion it shouldn't be the only one
(and I don't think you are proposing it as the only option either).


>
>   - Impact of message modification
>>>
>>>>  * The message routing and ingestion into the datastore is transparent
>>>> to
>>>> user. So by definition, no elements of routing path must leak into
>>>> functional representation of the message. So user will not know, or
>>>> care,
>>>> about what nodes the messages traversed.
>>>>
>>>>
>>> as an admin of the system I may want to know what nodes it traversed,
>>> even
>>> if the person generating the message doesn't. If you have some messages
>>> being delayed significantly or corrupted, it's really helpful to have the
>>> messages tagged with the nodes they traverse so you can see what node is
>>> likely to be having problems.
>>>
>>>
>> Yes, but then you would rather depend on failure-logs from cluster nodes,
>> and stats around lag across tiers (lag percentiles etc.), wouldn't you?
>>
>
> where do you get those stats? you have to have info to extract those stats
> from.
>

Oh, I thought you were talking about the output that goes to the data-store.

For messages that get exchanged between tiers, and even between systems,
we'll have to figure out some way of carrying session_id, seq_no, etc.
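For illustration only, here is one minimal sketch of such an inter-tier envelope. The JSON framing and field names are my assumptions, not a settled wire format; the optional route field is the kind of per-node metadata the diagnostics discussion above would need:

```python
import json
import uuid

def wrap(msg, session_id, seq_no, route=None):
    """Wrap a raw message in an inter-tier envelope.

    Field names are illustrative. 'route' optionally lists the node-ids
    the message traversed, for out-of-band diagnostics.
    """
    return json.dumps({
        "session_id": session_id,
        "seq_no": seq_no,
        "route": route or [],
        "msg": msg,
    })

def unwrap(wire):
    env = json.loads(wire)
    return env["msg"], env["session_id"], env["seq_no"], env["route"]

# Example round trip between two tiers.
sid = str(uuid.uuid4())
wire = wrap("app started", sid, 42, route=["tier1-a"])
msg, session_id, seq_no, route = unwrap(wire)
```

Anything heavier (a binary framing, for instance) would work just as well; the point is only that session_id and seq_no travel alongside the message, not inside its functional representation.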


>
>  Also, even with routing info being added to the message,
>> idempotent-ingestion-layer can choose to store all duplicate message's
>> metadata in a way which allows such reporting.
>>
>
> I'm saying that you have to have the metadata being passed around in the
> first place. This means that you have to add info to what's being passed
> around.


Agree.


>
>
>   - Kind of complexity that exists in writing/reading from the storage tier
>>>
>>>>  * With data volume of a few million messages per second, a distributed
>>>> datastore is anyway necessary (no single-instance datastore built for
>>>> commodity hardware can ingest data at that rate).
>>>>
>>>>
>>> So now you are talking about a unique data storage that rsyslog doesn't
>>> currently support.
>>>
>>> Development of that should be its own project.
>>>
>>>
>> Yes, that is true. But assuming the output module required for that is
>> out of the scope of this discussion, it doesn't matter much.
>>
>> But you are right, its usefulness in the absence of integration with
>> such a data-store is limited.
>>
>> However, HDFS can be considered such a data-store. All output-modules need
>> to log to the same directory, but in different files. Idempotency will be
>> taken care of in reduce-phase of data processing.
>>
>>
>>
>>>   * We provide at-least-once delivery guarantee, which means datastore
>>>
>>>> should have a way to identify duplicates either during ingestion, or
>>>> during
>>>> serving reads. While multiple solutions are possible here, a rather
>>>> simple
>>>> scheme that comes to my mind: hash the message by some key (say
>>>> session_id)
>>>> and send it to the same machine in data-store cluster, and maintain
>>>> session-level idempotent-ingestion mechanism (with fixed sliding window
>>>> of
>>>> just sequence number ranges, use range-representation to make it
>>>> efficient)
>>>> on that node.
>>>>
>>>>
>>> you can't have your farm of tier-1 nodes picking the unique id for the
>>> message, coordinating that would be too much overhead. Any unique id for
>>> the message should be per-sender, and it would need to have sender-id,
>>> generation-id and sequence-id (you can reliably store a generation-id and
>>> not re-use it if you restart, but trying to reliably store a sequence-id
>>> gets you back to the fsync problem, so you accept that you will re-use
>>> sequence-ids on the same machine, but make sure you will never use the
>>> same
>>> sequence-id with the same generation-id on the same machine)
>>>
>>>
>> No, the cluster doesn't generate session_ids. Session_ids are random
>> UUIDs generated by the message producer.
>>
>
> Ok.
>
>
>   - Replication topology differences
>>>
>>>>  * Replication topology can be of 2 types, as you correctly pointed out:
>>>> K-way fan-out vs hop-by-hop with K hops.
>>>>  * K-way fanout design(using omrelp, or imrelp) that RELP based design
>>>> adopts, requires another complex component, which in case of
>>>> node-failure,
>>>> needs to decide what the surviving replica must do.
>>>> Such a component must be aware of placement of replicas, must be able to
>>>> elect master and allow master to propagate changes further, and must
>>>> re-run
>>>> master-election in case one of the replicas is lost. This requires
>>>> distributed coordination which is a hard problem, as you pointed out
>>>> earlier. Not just this, it must elect new replica as well from a pool to
>>>> replace the lost replica, as replication factor would have fallen by 1.
>>>>
>>>>
>>> there is a huge advantage in having the cluster maintain its state and
>>> control membership rather than having the things sending to the cluster
>>> maintain that state.
>>>
>>> First, you will have a lot fewer machines in the cluster than sending to
>>> the cluster.
>>>
>>> Second, determining if a machine is healthy is a lot more than can you
>>> connect to it, and that sort of knowledge can only be determined from
>>> inside the machines of the cluster, so you either have the cluster
>>> running
>>> health checks and reporting them to the clients for them to decide if the
>>> node is healthy enough to use, or you have the cluster running health
>>> checks and reporting it to the cluster management software on the
>>> cluster.
>>>
>>> Third, having multiple views of what machines are in the cluster (one
>>> per client) is a recipe for confusion, as the different clients can end
>>> up with different opinions on which nodes are healthy.
>>>
>>>
>> Membership, yes. The zookeeper (ephemeral znodes) based discovery solution
>> that I mentioned was one of the ways I was thinking of solving that.
>> Corosync based solution was another way.
>>
>> Membership info is essential, so all producers can easily discover Tier-1
>> nodes, Tier-1 nodes can discover Tier-2 nodes etc. My contention is with
>> having to manage failed nodes, re-replicate data on failure discovery,
>> identify replacement node centrally etc. In event of network partitions,
>> the way such systems fail becomes much worse with complex cluster
>> management techniques. Self-organizing techniques in my observation work
>> better (fewer nodes talking, fewer failure modes).
>>
>
> I agree that fewer nodes talking means fewer failure modes. That's why I'm
> saying that having the clients figure this out is wrong. It needs to be
> done by the much smaller number of nodes in the cluster, not by every
> machine that could send data to them.
>

From systems that I have some experience with, having dedicated controller
nodes for making distributed-system decisions allows arbitrary complexity
in decision making. It eventually leads to undesirable complexity and more
types of failure as a distributed system.

I have been trying to work towards a design where there is no need for a
separate decision-maker. This obviously only works when each node can make
decisions in isolation, which forces us to keep the decision-making simple.

In this case, all a node needs to know (from a discovery mechanism or
otherwise) is how to reach some machine in the next tier (or in the first
tier of the next system, for the inter-system case). It doesn't need to know
or reason about whether all data is K-way replicated, what data is not,
which node it should be replicated to, which of those nodes must send data
downstream, how to agree upon which node makes such decisions, how to fail
over to some other node in the event of network partitions of various
types, etc.

Using tiers for replication is a way to achieve coordinator-free design.

Clients (as in message producers) are nothing more than a general upstream
system (as in the sequence diagrams). They are just a single-replica system
(with disk-based message ingestion, but a single replica at the node level
anyway), so the quicker we move data out of them, the better.
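As a sketch of how little a node then has to decide on its own. The discovered member list and the reachability check are assumptions about whatever mechanism provides them (e.g. zookeeper ephemeral znodes); the names are mine:

```python
import random

def pick_next_hop(discovered, is_reachable):
    """Coordinator-free next-hop choice.

    Filter the discovered next-tier members by a purely local
    reachability check and pick one at random. No shared
    decision-maker, no agreement protocol, no master election.
    """
    candidates = [n for n in discovered if is_reachable(n)]
    if not candidates:
        # Whole tier unreachable: buffer locally / apply backpressure
        # until the tier is (partly) rebuilt.
        return None
    return random.choice(candidates)

tier2 = ["t2-a", "t2-b", "t2-c"]
# From this node's point of view, t2-b looks partitioned away.
hop = pick_next_hop(tier2, lambda n: n != "t2-b")
```

During a partition, each node simply converges on the subset it can reach; there is no cluster-wide decision that has to be agreed upon first.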


>
>    * Impact on solution: Burstability needs to be built before datastore.
>>>
>>>> Between replicated queues vs. backpressure and backlog-buildup on
>>>> producer,
>>>> replicated queues are preferable for redundancy reasons.
>>>>
>>>>
>>> assuming that your distributed datastore allows you to feed the same
>>> message to it multiple times and have it weed out duplicates. I think
>>> that
>>> this is a much harder problem than you are thinking.
>>>
>>>
>> Yes, it is certainly hard. But is there a way to avoid it? Regardless of
>> what we do, that problem has to be solved.
>>
>
> only if multiple deliveries are common.
>

Well, that depends on the guarantees that users of this data desire. For an
audit kind of use case over important transactions, for instance, even one
duplicate would be undesirable.

But did you have a scheme in mind which reduces the duplicate problem
enough for us to be able to ignore it (like 1 in a million, which could be
handled by visual inspection if the data is ever referred to)?
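For what it's worth, the session-level scheme I sketched earlier (a sliding window of sequence-number ranges) is fairly cheap, so exact suppression may be feasible. A rough illustration, purely my sketch, with window trimming (discarding ranges below the acked low-watermark) left out for brevity:

```python
class SessionDedup:
    """Per-session idempotent-ingestion sketch.

    Seen sequence numbers are remembered as merged inclusive ranges,
    so memory is proportional to the number of gaps, not the number
    of messages. Trimming ranges below the low-watermark is omitted.
    """
    def __init__(self):
        self.ranges = []  # sorted, disjoint [lo, hi] inclusive ranges

    def accept(self, seq):
        """Return True if seq is new (ingest it), False if duplicate."""
        for lo, hi in self.ranges:
            if lo <= seq <= hi:
                return False
        # Insert as a singleton range, then merge adjacent/overlapping ones.
        self.ranges.append([seq, seq])
        self.ranges.sort()
        merged = [self.ranges[0]]
        for lo, hi in self.ranges[1:]:
            if lo <= merged[-1][1] + 1:
                merged[-1][1] = max(merged[-1][1], hi)
            else:
                merged.append([lo, hi])
        self.ranges = merged
        return True

d = SessionDedup()
first = d.accept(1)
dup = d.accept(1)      # re-delivery of the same message
for s in (2, 3, 5):    # 4 is still in flight somewhere
    d.accept(s)
```

After the calls above, the window is just two ranges, [1, 3] and [5, 5], however many messages were ingested.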


>
>  Where are the bottlenecks in RELP? it does allow a window of X messages to
>>> be waiting for acks. Does that window need to be larger? Do we need to
>>> allow for multiple acks to be sent as one message?
>>>
>>> your approach assumes that if message 45 is acked, then messages 1-44 are
>>> also acked, not necessarily a valid assumption.
>>>
>>>
>> No, it doesn't assume that. It expects acks to work in terms of ranges.
>> 1-30 may not be acked while 31-45 is. The state-machine will force a
>> retry on ack-wait timeout. Eventually, it'll prevent the sliding window
>> from moving forward, so it will have to be handled.
>>
>
> so your windows are a micro-optimization over a list of acked messages.
> Not a significant architectural issue.


Agree.


>
>
>    * In hop-by-hop replication topology, we care about ACKs flowing
>>>
>>>> end-to-end, so additional reliability between each hop doesn't add a lot
>>>> of
>>>> value.
>>>>
>>>>
>>> you are again arguing based on your solution rather than the problem. You
>>> are also contradicting yourself
>>>
>>> If you are doing end-to-end acks, then the sender cannot forget the
>>> message until the final destination acks.
>>>
>>> If you are doing hop-by-hop acks, then the sender can forget the message
>>> as soon as the it is reliably at the next tier.
>>>
>>> which is it?
>>>
>>>
>> Sorry, I think I should have done a better job of explaining inter-system
>> ack mechanism upfront.
>>
>> It is end-to-end acking, but end-to-end is not first-system to
>> last-system,
>> it is end to end within the cluster.
>>
>> Assume multiple such clusters, A, B, C and D.
>>
>> Let us say message flows from A -> B -> C -> D.
>>
>> We expect each of these to have the capability to keep messages
>> replicated and safe as long as the next system hasn't confirmed that it
>> has replicated them to the desired level.
>>
>> Once C has replicated it sufficiently, B can discard it. C has multiple
>> tiers, and acks carry the two watermark values from the last tier to the
>> first tier.
>>
>> C's internal ack flow is what I mean by end-to-end ack flow.
>>
>> If we assume A, B, C and D to be single-node systems, then you can think
>> of this as next-hop acking (not end-to-end acking).
>>
>> The sequence-diagrams I shared mark these systems, A, B and C in pink,
>> grey
>> and blue respectively (there is no D in that).
>>
>> I think if you look at how acks are used to purge old messages (shown in
>> black on ack arrow) across systems in the diagrams, that may explain it
>> better than the text here.
>>
>
> so that is what I consider hop-by-hop acks, not end-to-end acks, and other
> than the fact that it's going to multiple machines, this is exactly what
> RELP does today.
>
> If you were to modify imrelp so that it didn't send its ack until it had
> sent the message out to all the other machines in the cluster, you would
> have exactly what you are talking about implementing, right?
>

Well, whether it's RELP or something else is not a critical decision, the
way I am seeing it.

Consider this: we are talking about building ack-flow across queues,
backwards. If this is built, it'll allow any input/output protocol to
utilize it. RELP can utilize it out of the box (though the contract won't
be the same as it stands today, as it'll be acking on completion of
replication and delivery rather than on reaching the input-bound queue).
But other protocols like plain tcp will be able to utilize it as well. It's
up to the implementer of an input/output module to decide whether to
support this.

If we address the throughput limitations of RELP, we should be able to use
it just fine. Or we can enhance ptcp to support this. RELP is still a notch
ahead in functionality, because it is capable of acking each message, and
of acking on delivery to the input-bound queue, but the two are not quite
the same thing.
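To make the intended contract concrete, here is a toy sketch of the sender-side state for the backwards ack-flow. This is not RELP's actual wire protocol, and all names are mine:

```python
class AckWindow:
    """Sender-side sliding-window sketch for range-based acks.

    Acks may arrive out of order (e.g. 31-45 before 1-30); a message
    leaves the window only when its range is acked, and anything still
    pending past the ack-wait timeout is retried.
    """
    def __init__(self):
        self.pending = {}  # seq -> message, awaiting downstream ack

    def send(self, seq, msg):
        self.pending[seq] = msg

    def ack_range(self, lo, hi):
        """Downstream confirmed replication/delivery of seqs lo..hi."""
        for seq in range(lo, hi + 1):
            self.pending.pop(seq, None)

    def to_retry(self):
        """Seqs to re-send when the ack-wait timeout fires."""
        return sorted(self.pending)

w = AckWindow()
for i in range(1, 46):
    w.send(i, "m%d" % i)
w.ack_range(31, 45)   # out-of-order ack: 31-45 acked before 1-30
retry = w.to_retry()  # 1..30 still unacked
```

The key difference from today's RELP contract is only *when* ack_range would be called: on completion of replication and delivery downstream, not on reaching the input-bound queue.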


>
>    * Given wide-enough sliding-windows and pipelining (one of my original
>>>
>>>> sequence diagrams talk about pipelining, I don't show it in every
>>>> diagram
>>>> for brevity reasons), anything built over ptcp will run as close to
>>>> un-acked speed as possible.
>>>>  * Given its TCP, and we have last-hop-to-source L7 acks, we can't
>>>> really
>>>> lose an intermediate message. So per message ACKs are not valuable in
>>>> this
>>>> case.
>>>>
>>>>
>>> only if all messages follow the same path. If some messages go to machine
>>> (tier-node) 1-1, 2-1, 3 and other messages go 1-1, 2-2, 3, and yet others
>>> go 1-2, 2-1, 3 you no longer have a single TCP flow and so message will
>>> not
>>> necessarily arrive in the same order
>>>
>>>
>> They need not arrive in the same order. It doesn't matter because we
>> aren't making any ordering guarantees. As long as all messages come
>> through between hosts from Tier-n to Tier-n+1, we are ok.
>>
>
> It's sounding more and more like you are talking about a modified version
> of RELP, with the added requirement to replicate across machines before
> sending the ack.


A modified version of RELP will do.


>
>
>   I agree that implementing it in Rsyslog will be significant amount of
>>> work
>>>
>>>>  (but it'll be a significant amount of work even if it's implemented in a
>>>> dedicated ingestion-relay that only does message movement, nothing
>>>> else).
>>>> In my head the counter-weight is:
>>>> - Same interface for users (they talk rulesets, actions, plugins etc
>>>> across
>>>> ack'ed and un-ack'ed pipelines
>>>> - Capability gain. Being able to work at very large scale with
>>>> efficiency
>>>> using syslog.
>>>>
>>>>
>>> How much complexity will it add to rsyslog? and how many people would
>>> actually use this architecture? How close can we come to achieving the
>>> results without having to change the fundamental way that rsyslog works?
>>>
>>> Specifically, the fact that rsyslog queues are one-directional. An input
>>> is finished processing a message once it's added to the queue.
>>> Implementing
>>> end-to-end acks requires that the input still keep track of the sender
>>> until after the output has processed the message so that the ack can flow
>>> back upstream to the sender.
>>>
>>> I want us to figure out what we can do without breaking this fundamental
>>> design factor of rsyslog.
>>>
>>>
>>>
>>> thinking about it a bit more, RELP between nodes may not be needed,
>>> because you are already saying that if a machine hiccups, you are willing
>>> to lose all messages that machine is processing. So if you just use TCP
>>> as
>>> your transport, as long as you don't have any SPOF between your sender
>>> and
>>> next hop recipients that could cut all your connections at once, you can
>>> count on that redundancy to protect you. If you don't guarantee that
>>> there
>>> is no SPOF, then you do need an application level ack and we are back to
>>> RELP.
>>>
>>>
>> I agree. The provisioning system will have to be smart to provision nodes
>> from different availability zones, different chassis etc.
>>
>
> it's messier than that makes it sound
>

As I mentioned earlier, it's as simple as allocating each tier of a cluster
in a separate availability zone.

The allocation stack needs to be smart enough to have a concept of
availability zones, but not much more.


>
>  The sender can already craft and add a sequence-id and its hostname can
>>> be sender-id. Generation-id is a bit harder, but could be done in a
>>> wrapper
>>> around rsyslog or by (mis)using the table lookup mechanism to read a
>>> value
>>> from disk after rsyslog starts. Generating a unique sequence-id will slow
>>> the throughput of rsyslog because it requires coordination between
>>> threads,
>>> but a modification to rsyslog to have the output module create a
>>> sequence-id where the bottom bits of the id aren't incremented, but
>>> instead
>>> are the thread-id would address this performance bottleneck.
>>>
>>>
>> I was thinking of both session_id and sequence_id being handled by
>> producer.
>>
>> Say producer is a multi-threaded system. Each logging thread can generate
>> a
>> random-uuid and keep it as a thread-local. All logging uses it as
>> session_id, and each thread has a thread-local counter, which serves as
>> seq_id, gets incremented for each log etc.
>>
>> That way all that complexity will stay out of rsyslog.
>>
>
> Well, if it's in the message delivered to rsyslog in the first place, then
> rsyslog doesn't care about it.
>
> But as I say, I think it's very reasonable to have the leaf rsyslog node
> generate the data.
>

The reason I think it should live with the source system: it allows them to
control order. We may deliver messages to the data-store out of order (so a
newer event may become visible before an older one), but if users have an
option to see them in order, it's valuable to let them decide the order too.
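A sketch of the producer-side scheme described above (a thread-local random-uuid session_id plus a per-thread counter as seq_id); illustrative only, nothing here is rsyslog code:

```python
import itertools
import threading
import uuid

class LogSession(threading.local):
    """Per-thread logging session.

    Each thread that logs gets its own random-uuid session_id and a
    monotonically increasing seq_id, so ordering within a session is
    fixed at the source, with no cross-thread coordination.
    """
    def __init__(self):
        self.session_id = str(uuid.uuid4())
        self.counter = itertools.count(1)

    def tag(self, msg):
        return {
            "session_id": self.session_id,
            "seq_id": next(self.counter),
            "msg": msg,
        }

sess = LogSession()
a = sess.tag("first")
b = sess.tag("second")
```

Because session_ids are random UUIDs, two threads (or two restarts) never collide, which is what lets all of this complexity stay out of rsyslog itself.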


>
>  I think that the biggest bottleneck that you would end up running into
>>> would be the formatting of the output, especially if using JSON
>>> formatting
>>> so that you can send the sequence-id independently of the original
>>> message
>>> contents.
>>>
>>
>>
>> Yes, in some cases it may be easier to change functional definition to
>> have
>> a wrapper which carries this non-functional stuff too.
>>
>
> that doesn't solve the problem, it just shifts it slightly. Something
> still needs to take the provided data and the metadata and create a
> sequence of bytes that will be sent over the wire.
>

Yes, this was due to confusion on my part, my bad. I assumed you meant the
data that is sent from the last tier to the data-store.


>
> Rsyslog has the ability to have string modules (sm*) that do this in C
> code. It used to be that all of the standard message formats were created
> using the template engine, but after I raised the issue, we found that
> optimizing the common formats down to C was enough of a win to result in a
> ~10% improvement in rsyslog's overall performance.
>
> I think we are going to find that we really should do something along the
> same lines for JSON output. And if you opt to send the data in some other
> structure, you will need to do the same for your format.


Sorry, I lost you on this one. Can you elaborate a little on why JSON is
relevant?


>
>
> David Lang
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com/professional-services/
> What's up with rsyslog? Follow https://twitter.com/rgerhards
> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
> DON'T LIKE THAT.
>



-- 
Regards,
Janmejay
http://codehunk.wordpress.com