Greetings,
This is a proposal for a new feature, and I am inviting thoughts.
The aim is to use a set of rsyslog nodes (let us call it a cluster) to
move messages reliably from source to destination.
Let us make a few assumptions so we can define the expected properties
clearly.
Assumptions:
- Data once successfully delivered to the Destination (typically a
datastore) is considered safe.
- A source crashing with an incomplete message hand-off to the cluster is
outside the scope of this proposal. In such a case, the source must retry.
- The cluster must be designed to tolerate up to K node failures without
any message loss.
Here are the properties that may be desirable in such a service (the
cluster is an implementation of this service):
- No message should ever be lost once handed over to the delivery network,
except in a disaster scenario.
- A disaster scenario is a condition in which more than k nodes in the
cluster fail.
- Each source may pick a desirable value of k, where k <= K.
- Any cluster node must retransmit a message after a timeout T if the
downstream node fails to ACK it before the timeout.
- Such a cluster should ideally be composable, in the sense that a user
should be able to chain multiple such clusters.
This requires the cluster to support k-way replication of messages in the
cluster.
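The retransmit-at-timeout-T property above can be sketched as follows (a
minimal illustration; the class and field names are mine, not any existing
rsyslog API):

```python
import time

class PendingSend:
    """Tracks one un-ACKed transmission so it can be retried after timeout T."""
    def __init__(self, seq_no, payload, timeout):
        self.seq_no = seq_no
        self.payload = payload
        self.timeout = timeout          # the timeout T for this hop
        self.sent_at = time.monotonic()

    def due_for_retransmit(self, now=None):
        # Retransmit if the downstream node has not ACKed within T seconds.
        now = time.monotonic() if now is None else now
        return (now - self.sent_at) >= self.timeout
```

A node would keep one such record per in-flight message and rescan them
periodically, retransmitting whatever is due.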
Implementation:
High level:
- The cluster is divided into multiple tiers (let us call them
replication-tiers, or rep-tiers).
- The cluster can handle multiple sessions at a time.
- Session_ids are unique and are generated by the producer system when it
starts producing messages.
- Within a session, we have a notion of a sequence number (or seq_no),
which is a monotonically increasing number (incremented by 1 per message).
This requirement can possibly be relaxed for performance reasons, and gaps
in seq_no may be acceptable.
- Replication is managed by lower tiers sending data over to higher tiers
within the cluster, until the replica-number (an attribute each message
carries) falls to 1.
- When the replica-number falls to 1, we transmit the message to the
desired destination. (This can alternatively be done at the earliest
opportunity, i.e. in Tier-1, under special circumstances, but let us
discuss that later if we find enough interest in doing so.)
- There must be several nodes in each tier, allocated to minimize the
possibility of all of them going down at once (across availability zones,
different chassis, etc.).
- There must be a mechanism that allows nodes of the upstream system to
discover Tier-1 nodes of the cluster, Tier-1 nodes to discover Tier-2
nodes, and so on. Likewise, nodes in Tier-K of the cluster should be able
to discover downstream nodes.
- Each session (or multiple sessions bundled according to arbitrary logic,
such as hashing) must pick one node from each tier as its
downstream-tier-node.
- Each node must maintain 2 watermarks:
  * Replicated-till seq_no: the sequence number up to which messages have
been k-way replicated in the cluster
  * Delivered-till seq_no: the sequence number up to which messages have
been delivered to the downstream system
- Each send-operation (i.e. transmission of messages) from upstream to the
cluster's Tier-1, or from a lower tier to a higher tier within the cluster,
will pass messages such that the highest seq_no of any message (per
session) in the transmitted batch is known.
- Each receive-operation in the cluster's Tier-1, or in higher tiers within
the cluster, must reply to the transmitter with the two watermark values
(i.e. replicated-till seq_no and delivered-till seq_no) per session.
- Lower tiers (within the cluster) are free to discard all messages with
seq_no <= the delivered-till seq_no.
- The upstream system is free to discard all messages with seq_no <= the
cluster's replicated-till seq_no.
- Upstream and downstream systems can themselves be instances of such
clusters, chained if need be.
- The maximum replication factor 'K' is dictated by cluster design (the
number of tiers).
- The desired replication factor 'k' is a per-message controllable
attribute (decided by the upstream).
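The high-level flow above can be sketched in Python. This is purely
illustrative: all names are mine, the hashing strategy is just one possible
bundling choice, and nothing here is an existing rsyslog interface.

```python
import hashlib

class SessionWatermarks:
    """The two per-session watermarks each node maintains."""
    def __init__(self):
        self.replicated_till = 0  # k-way replicated in the cluster up to here
        self.delivered_till = 0   # delivered to the downstream system up to here

    def advance(self, replicated=None, delivered=None):
        # Watermarks only move forward; a stale ACK must never regress them.
        if replicated is not None:
            self.replicated_till = max(self.replicated_till, replicated)
        if delivered is not None:
            self.delivered_till = max(self.delivered_till, delivered)

def pick_node(session_id, tier_nodes):
    """Deterministically bind a session to one node of a tier by hashing
    the session_id (hashing is one possible bundling logic, as noted)."""
    digest = hashlib.sha256(session_id.encode()).digest()
    return tier_nodes[int.from_bytes(digest[:8], "big") % len(tier_nodes)]

def handle_message(msg, forward_to_next_tier, deliver_to_destination):
    """Replicate to the next (higher) tier until replica_number falls to 1,
    then transmit to the destination."""
    if msg["replica_number"] > 1:
        copy = dict(msg, replica_number=msg["replica_number"] - 1)
        forward_to_next_tier(copy)
    else:
        deliver_to_destination(msg)

def prune(buffer, wm, upstream_of_cluster):
    """Apply the discard rules: the upstream may drop everything already
    replicated; a lower tier may drop everything already delivered."""
    limit = wm.replicated_till if upstream_of_cluster else wm.delivered_till
    return {seq: m for seq, m in buffer.items() if seq > limit}
```

A message entering with replica_number == k would thus hop through k tier
nodes (chosen via pick_node), each decrementing the counter, with the last
one delivering; each ACK carries both watermarks back so buffers can be
pruned.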
The sequence-diagrams below explain this visually:
Here is a case with an upstream sending messages with k = K :
ingestion_relay_1_max_replication.png
<https://docs.google.com/file/d/0B_XhUZLNFT4dN21TLTZBQjZMdUk/edit?usp=drive_web>
This is a case with k < K :
ingestion_relay_2_low_replication.png
<https://docs.google.com/file/d/0B_XhUZLNFT4da1lKMnRKdU9JUkU/edit?usp=drive_web>
The above 2 cases show only one transmission going from the upstream system
to the downstream system serially; this one shows it pipelined:
ingestion_relay_3_pipelining.png
<https://docs.google.com/file/d/0B_XhUZLNFT4dQUpTZGRDdVVXLVU/edit?usp=drive_web>
This demonstrates failure of a node in the cluster, and how it recovers in
the absence of continued transmission (recovery happens via timeout and
retransmission):
ingestion_relay_4_timeout_based_recovery.png
<https://docs.google.com/file/d/0B_XhUZLNFT4dMm5kUWtaTlVfV1U/edit?usp=drive_web>
This demonstrates failure of a node in the cluster, and how it recovers due
to continued transmission :
ingestion_relay_5_broken_transmission_based_recovery.png
<https://docs.google.com/file/d/0B_XhUZLNFT4dd3M0SXpUYjFXdlk/edit?usp=drive_web>
Rsyslog level implementation sketch:
- Let us assume there is a way to identify the set of inputs, queues,
rulesets and actions that need to participate as reliable-pipeline
components on a cluster node.
- Each participating queue will expect messages to contain a session_id.
- A consumer bound to a queue will be expected to provide values for both
watermarks, per session, in order to dequeue more messages.
- A producer bound to a queue will be given values for both watermarks, per
session, as a return value when enqueueing more messages.
- The inputs will transmit (either broadcast or unicast) both watermark
values to upstream actions (unicast is sent over the relevant connection,
broadcast is sent across all connections). (Please note this has nothing to
do with network broadcast domains, as everything is over TCP.)
- Actions will receive the two watermarks and push them back to the queue
the action is bound to, in order to dequeue more messages.
- Rulesets will need to pick the relevant value across multiple
action-queues according to user-provided configuration, and propagate it
backwards.
- An action must have the ability to set an arbitrary value for the
replica-number when passing a message to the downstream system (so that
chaining is possible).
- Inputs may produce the new value for the replicated-till seq_no when
receiving a message with replica_number == 1.
- An action may produce the new value for the delivered-till seq_no after
having successfully delivered a message with replica_number == 1.
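The queue contract described above (producer learns the watermarks on
enqueue, consumer must report them to dequeue) could look roughly like this.
This is a sketch of the proposed semantics, not of rsyslog's actual queue
code:

```python
from collections import deque

class ReliableQueue:
    """Sketch of a participating queue: enqueue returns the current
    watermarks, and a consumer must report watermarks to dequeue more."""
    def __init__(self):
        self.q = deque()
        self.watermarks = {"replicated_till": 0, "delivered_till": 0}

    def enqueue(self, msg):
        self.q.append(msg)
        # The producer learns both watermarks as the return value.
        return dict(self.watermarks)

    def dequeue(self, replicated_till, delivered_till):
        # The consumer reports watermark progress before receiving more
        # messages; watermarks never move backwards.
        self.watermarks["replicated_till"] = max(
            self.watermarks["replicated_till"], replicated_till)
        self.watermarks["delivered_till"] = max(
            self.watermarks["delivered_till"], delivered_till)
        return self.q.popleft() if self.q else None
```

(A real implementation would track the watermarks per session_id; a single
pair is used here to keep the sketch short.)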
Rsyslog configuration required (from the user):
- The user will need to identify the machines that are part of the cluster.
- These machines will have to be divided into multiple replication tiers
(as replication happens only across machines in different tiers).
- The user can pass a message to the next cluster by setting its
replica_number back to a desired number and passing it to an action which
writes it to one of the nodes in a downstream cluster.
- The user needs to check replica_number in the ruleset and take a special
action (writing the message to the downstream system) when
replica_number == 1.
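The per-ruleset logic of the last two points might look like this
(hypothetical sketch: NEXT_CLUSTER_K and the two write hooks are my
placeholders, not rsyslog configuration objects):

```python
NEXT_CLUSTER_K = 2  # desired replication factor in the downstream cluster

def ruleset_dispatch(msg, write_to_downstream_cluster, write_to_datastore,
                     chain=False):
    """When replica_number reaches 1 this node holds the last replica, so
    the ruleset takes the special action: either deliver to the final
    datastore, or reset replica_number and hand off to the next cluster."""
    if msg["replica_number"] == 1:
        if chain:
            # Chaining: restore the counter so the next cluster replicates
            # the message again (composability property from above).
            write_to_downstream_cluster(
                dict(msg, replica_number=NEXT_CLUSTER_K))
        else:
            write_to_datastore(msg)
        return True   # message handled at this node
    return False      # still replicating inside this cluster
```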
Does this have any overlap with RELP?
I haven't studied RELP in depth yet, but as far as I understand it, it
tries to solve the problem of delivering messages losslessly between a
single producer and a single consumer (it specifically targets different
kinds of loss scenarios). In addition, its scope is limited to ensuring no
messages are lost during transportation: in the event of a crash of the
receiver node before it can handle a received message reliably, some
messages may be lost. Someone with deeper knowledge of RELP should chime in.
Thoughts?
--
Regards,
Janmejay
http://codehunk.wordpress.com
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards