I admit I just skimmed the messages. The Windows architecture is quite a bit different; the input waits there in some cases.
Sent from phone, thus brief.

On 23.01.2015 23:35, "David Lang" <[email protected]> wrote:
> for this case it would have to feed back across the queues to the input
> module so that the input module could send something out.
>
> That's the really hard part.
>
> David Lang
>
> On Fri, 23 Jan 2015, Rainer Gerhards wrote:
>
>> Well, an output can block when forwarding. We do a similar thing with
>> the Windows products, via another protocol. But as you say, David,
>> it's pretty problematic in many cases.
>>
>> But as I said, I really can't help develop this right now. That
>> includes in-depth design discussions.
>>
>> Rainer
>>
>> Sent from phone, thus brief.
>> On 23.01.2015 22:54, "David Lang" <[email protected]> wrote:
>>
>>> The huge problem would be how to pass data (the ack) back up from the
>>> final receiver to the original sender. I don't think rsyslog would be
>>> able to do this without a MAJOR rewrite (currently the earlier
>>> machines would have completely forgotten about the message by the
>>> time the ack gets generated).
>>>
>>> David Lang
>>>
>>> On Fri, 23 Jan 2015, Rainer Gerhards wrote:
>>>
>>>> Sorry to be that blunt, but I simply have no time to participate in
>>>> developing this. But I would be very open to merge any results.
>>>>
>>>> Rainer
>>>>
>>>> Sent from phone, thus brief.
>>>> On 23.01.2015 22:17, "singh.janmejay" <[email protected]> wrote:
>>>>
>>>>> On Sat, Jan 24, 2015 at 2:19 AM, David Lang <[email protected]> wrote:
>>>>>
>>>>> > RELP is the network protocol you need for this sort of
>>>>> > reliability. However, you would also need to not allow any
>>>>> > message to be stored in memory (because it would be lost if
>>>>> > rsyslog crashes or the system reboots unexpectedly).
>>>>> > You would have to use disk queues (not disk-assisted queues)
>>>>> > everywhere and do some other settings (checkpoint interval of 1,
>>>>> > for example).
>>>>> >
>>>>> > This would absolutely cripple your performance due to the disk
>>>>> > I/O limitations. I did some testing of this a few years ago. I
>>>>> > was using a high-end PCI SSD (a 160G card cost >$5K at the time)
>>>>> > and, depending on the filesystem I used, I could get rsyslog to
>>>>> > receive between 2K and 8K messages/sec. The same hardware writing
>>>>> > to a 7200rpm SATA drive with memory buffering allowed could
>>>>> > handle 380K messages/sec (the limiting factor was the Gig-E
>>>>> > network).
>>>>> >
>>>>> > Doing this sort of reliability on a 15Krpm SAS drive would limit
>>>>> > you to ~50 logs/sec. Modern SSDs would be able to do better; I
>>>>> > would guess a few hundred logs/sec from a good drive, but you
>>>>> > would be chewing through the drive lifetime several thousand
>>>>> > times faster than if you were allowing memory buffering.
>>>>> >
>>>>> > Very few people have logs that are critical enough to warrant
>>>>> > this sort of performance degradation.
>>>>>
>>>>> I didn't particularly have disk-based queues in mind for
>>>>> reliability reasons. However, messages may need to overflow to
>>>>> disk to manage bursts (but only for burstability reasons). For a
>>>>> large architecture of this nature, it's generally useful to
>>>>> classify failures in a broad way (rather than the very granular
>>>>> failure modes we identify for transactional databases etc.). The
>>>>> reason for this ties back to self-healing. It's easier to build
>>>>> self-healing mechanisms assuming only one kind of failure: node
>>>>> loss. It could happen for multiple reasons, but if we treat it
>>>>> that way, all we have to do is build room for managing the cluster
>>>>> when 1 (or k) nodes are lost.
>>>>>
>>>>> So thinking of it that way, an rsyslog crash, a machine crash or a
>>>>> disk failure are all the same to me. They are just node loss (we
>>>>> may be able to bring the node back with some offline procedure),
>>>>> but it'll come back as a fresh machine with no state.
>>>>>
>>>>> Which is why I treat K-safety as a basic design parameter. If more
>>>>> than K nodes disappear, data will be lost.
>>>>>
>>>>> With this kind of coarse-grained failure mode, messages can easily
>>>>> be kept in memory.
>>>>>
>>>>> > In addition, this sort of reliability is saying that you would
>>>>> > rather have your applications freeze than have them do something
>>>>> > and not have it logged. And that you are willing to have your
>>>>> > application slow down to the speed of the logging. Very few
>>>>> > people are willing to do this.
>>>>> >
>>>>> > You are proposing doing the application ack across multiple hops
>>>>> > instead of doing it hop-by-hop. This would avoid the problem that
>>>>> > can happen with hop-by-hop acks where a machine that has acked a
>>>>> > message then dies and needs to be recovered before the message
>>>>> > can get delivered (assuming you have redundant storage and enough
>>>>> > of the storage survives to be able to be read, the message would
>>>>> > eventually get through).
>>>>> >
>>>>> > But you now have the problem that the sender needs to know how
>>>>> > many destinations the logs are going to. If you have any filters
>>>>> > to decide what to do with the logs, the sender needs to know if
>>>>> > the log got lost, or if a filter decided to not write the log. If
>>>>> > the rules would deliver the logs to multiple places, the sender
>>>>> > will need to know how many places it's going to be delivered to
>>>>> > so that it can know how many different acks it's supposed to get
>>>>> > back.
>>>>>
>>>>> So the design expects clusters to be broken into multiple tiers.
>>>>> Let us take a 3-tier example.
>>>>>
>>>>> Say we have 100 machines; we break them into 3 tiers of 34, 33 and
>>>>> 33 machines.
>>>>>
>>>>> Assuming every producer wants at most 2-safety, I can use 3 tiers
>>>>> to build this design.
>>>>>
>>>>> So first the producer discovers Tier-1 nodes and hashes its
>>>>> session_id to pick one of the 34 nodes. If it is not able to
>>>>> connect, it discards that node from the collection (now we end up
>>>>> with 33 nodes in Tier-1) and hashes to a different node (again a
>>>>> Tier-1 node, of course).
>>>>>
>>>>> Once it finds a node that it can connect to, it sends its
>>>>> session_id and message-batch to it.
>>>>>
>>>>> The selected Tier-1 node now hashes the session_id and finds a
>>>>> Tier-2 node (it again discovers all Tier-2 nodes via an external
>>>>> discovery mechanism). If it fails to connect, it discards that node
>>>>> and hashes again to one of the remaining 32 nodes, and so on.
>>>>>
>>>>> Eventually it reaches Tier-3, which is where the ruleset has a
>>>>> clause which checks for replica_number == 1 and handles the message
>>>>> differently. It is handed over to an action which delivers it to
>>>>> the downstream system (which may in turn again be a syslog cluster,
>>>>> or a datastore etc.).
>>>>>
>>>>> So each node only has to worry about the next hop that it needs to
>>>>> deliver to.
>>>>>
>>>>> > These problems make it so that I don't see how you would
>>>>> > reasonably manage this sort of environment.
>>>>> >
>>>>> > I would suggest that you think hard about what your requirements
>>>>> > really are.
>>>>> >
>>>>> > It may be that you are only sending to one place, in which case,
>>>>> > you really want to just be inserting your messages into an ACID
>>>>> > compliant database.
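The discard-and-rehash node selection described above can be sketched as follows. This is a minimal illustrative sketch, not rsyslog code; the function names and the particular hash choice are assumptions, and a real implementation would likely use consistent hashing to reduce reshuffling.

```python
import hashlib

def pick_node(session_id, nodes):
    """Hash the session_id onto the list of live nodes in a tier.
    The caller retries with a shrunken list on connect failure."""
    if not nodes:
        raise RuntimeError("no reachable nodes left in this tier")
    digest = hashlib.sha256(session_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]

def route(session_id, tier, can_connect):
    """Walk one tier: try the hashed node; if it is unreachable,
    discard it (treat it as node loss) and rehash over the rest."""
    candidates = list(tier)
    while candidates:
        node = pick_node(session_id, candidates)
        if can_connect(node):
            return node
        candidates.remove(node)  # drop the dead node, rehash
    raise RuntimeError("tier exhausted")
```

For example, with Tier-1 as `["n1", "n2", "n3"]` and `n2` down, `route("sess-42", tier1, lambda n: n != "n2")` returns one of the surviving nodes. Each tier repeats the same procedure independently for its next hop.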
>>>>> >
>>>>> > It may be that your requirements for absolute reliability are not
>>>>> > quite as severe as you are initially thinking that they are, and
>>>>> > that you can then use the existing hop-by-hop reliability. Or
>>>>> > they are even less severe and you can accept some amount of
>>>>> > memory buffering to get a few orders of magnitude better
>>>>> > performance from your logging. Remember that we are talking about
>>>>> > performance differences of 10,000x on normal hardware. A bit
>>>>> > less, but still 100x or so on esoteric, high-end hardware.
>>>>>
>>>>> Yep, I completely agree. In most cases extreme reliability such as
>>>>> this is not required, and is best avoided for cost reasons.
>>>>>
>>>>> But for select applications it is a lifesaver.
>>>>>
>>>>> > I will also say that there are messaging systems that claim to
>>>>> > have the properties that you are looking for (Flume for example),
>>>>> > but almost nobody operates them in their full reliability mode
>>>>> > because of the performance issues. And they do not have the
>>>>> > filtering and multiple-destination capabilities that *syslog
>>>>> > provides.
>>>>>
>>>>> Yes, Flume is one of the best options. But it comes with some
>>>>> unique problems too (it's not lightweight enough for running
>>>>> producer-side, and managed-environment overhead (GC etc.) causes
>>>>> its own set of problems). There is also value in offering the same
>>>>> interface to producers for ingestion into un-acked and reliable
>>>>> pipelines (because a lot of other things, like integration with
>>>>> other systems, can be reused). It also keeps things simple because
>>>>> producers do all operations in one way, with one tool, regardless
>>>>> of the ingestion mechanism being acked/replicated etc.
>>>>>
>>>>> Reliability in this case is built end-to-end, so building stronger
>>>>> guarantees into the over-the-wire parts of the pipeline doesn't
>>>>> seem very valuable to me. Why do you feel RELP will be necessary?
>>>>>
>>>>> > David Lang
>>>>> >
>>>>> > On Sat, 24 Jan 2015, singh.janmejay wrote:
>>>>> >
>>>>> >> Date: Sat, 24 Jan 2015 01:48:18 +0530
>>>>> >> From: singh.janmejay <[email protected]>
>>>>> >> Reply-To: rsyslog-users <[email protected]>
>>>>> >> To: rsyslog-users <[email protected]>
>>>>> >> Subject: [rsyslog] [RFC: Ingestion Relay] End-to-end reliable
>>>>> >> 'at-least-once' message delivery at large scale
>>>>> >>
>>>>> >> Greetings,
>>>>> >>
>>>>> >> This is a proposal for a new feature, and I'm inviting thoughts.
>>>>> >>
>>>>> >> The aim is to use a set of rsyslog nodes (let us call it a
>>>>> >> cluster) to be able to move messages reliably from source to
>>>>> >> destination.
>>>>> >>
>>>>> >> Let us make a few assumptions so we can define the expected
>>>>> >> properties clearly.
>>>>> >>
>>>>> >> Assumptions:
>>>>> >>
>>>>> >> - Data once successfully delivered to the Destination (typically
>>>>> >> a datastore) is considered safe.
>>>>> >> - The source crashing with incomplete message hand-off to the
>>>>> >> cluster is outside the scope of this. In such a case, the source
>>>>> >> must retry.
>>>>> >> - The cluster must be designed to support a maximum of K node
>>>>> >> failures without any message loss.
>>>>> >>
>>>>> >> Here are the properties that may be desirable in such a service
>>>>> >> (the cluster is an implementation of this service):
>>>>> >>
>>>>> >> - No message should ever be lost once handed over to the
>>>>> >> delivery-network, except in a disaster scenario.
>>>>> >> - A disaster scenario is a condition where more than k nodes in
>>>>> >> the cluster fail.
>>>>> >> - Each source may pick a desirable value of k, where k <= K.
>>>>> >> - Any cluster node must re-transmit messages at a timeout T, if
>>>>> >> the downstream fails to ACK them before the timeout.
>>>>> >> - Such a cluster should ideally be composable, in the sense that
>>>>> >> the user should be able to chain multiple such clusters.
>>>>> >>
>>>>> >> This requires the cluster to support k-way replication of
>>>>> >> messages in the cluster.
>>>>> >>
>>>>> >> Implementation:
>>>>> >>
>>>>> >> High level:
>>>>> >> - The cluster is divided into multiple tiers (let us call them
>>>>> >> replication-tiers, or rep-tiers).
>>>>> >> - The cluster can handle multiple sessions at a time.
>>>>> >> - Session_ids are unique and are generated by the producer
>>>>> >> system when it starts producing messages.
>>>>> >> - Within a session, we have a notion of sequence-number (or
>>>>> >> seq_no), which is a monotonically increasing number (incremented
>>>>> >> by 1 per message). This requirement can possibly be relaxed for
>>>>> >> performance reasons, and gaps in seq_no may be acceptable.
>>>>> >> - Replication is basically managed by lower tiers sending data
>>>>> >> over to higher tiers within the cluster, until replica-number
>>>>> >> (an attribute each message carries) falls to 1.
>>>>> >> - When replica-number reaches 1, we transmit the message to the
>>>>> >> desired destination.
>>>>> >> (This can alternatively be done at the earliest opportunity,
>>>>> >> i.e. in Tier-1, under special circumstances, but let us discuss
>>>>> >> that later if we find enough interest in doing so.)
>>>>> >> - There must be several nodes in each tier, allocated to
>>>>> >> minimize the possibility of all of them going down at once
>>>>> >> (across availability zones, different chassis etc.).
>>>>> >> - There must be a mechanism which allows nodes from the upstream
>>>>> >> system to discover nodes of Tier-1 of the cluster, Tier-1 nodes
>>>>> >> to discover nodes in Tier-2 of the cluster, and so on. Likewise,
>>>>> >> nodes in Tier-K of the cluster should be able to discover
>>>>> >> downstream nodes.
>>>>> >> - Each session (or multiple sessions bundled according to
>>>>> >> arbitrary logic, such as hashing) must pick one node from each
>>>>> >> tier as its downstream-tier-node.
>>>>> >> - Each node must maintain 2 watermarks:
>>>>> >>   * Replicated till seq_no: till what sequence number messages
>>>>> >>     have been k-way replicated in the cluster
>>>>> >>   * Delivered till seq_no: till what sequence number messages
>>>>> >>     have been delivered to the downstream system
>>>>> >> - Each send-operation (i.e. transmission of messages) from
>>>>> >> upstream to the cluster's Tier-1, or from a lower tier to a
>>>>> >> higher tier in the cluster, will pass messages such that the
>>>>> >> highest seq_no of any message (per session) in the transmitted
>>>>> >> batch is known.
>>>>> >> - Each receive-operation in the cluster's Tier-1 or in upper
>>>>> >> tiers within the cluster must respond/reply to the transmitter
>>>>> >> with the two watermark values (i.e. Replicated seq_no and
>>>>> >> Delivered seq_no) per session.
>>>>> >> - Lower tiers (within the cluster) are free to discard all
>>>>> >> messages with seq_no <= Delivered till seq_no.
>>>>> >> - The upstream system is free to discard all messages with
>>>>> >> seq_no <= Replicated till seq_no of the cluster.
>>>>> >> - Upstream and downstream systems can be chained as instances of
>>>>> >> such clusters if need be.
>>>>> >> - The maximum replication factor 'K' is dictated by the cluster
>>>>> >> design (number of tiers).
>>>>> >> - The desired replication factor 'k' is a per-message
>>>>> >> controllable attribute (decided by the upstream).
>>>>> >>
>>>>> >> The sequence diagrams below explain this visually:
>>>>> >>
>>>>> >> Here is a case with an upstream sending messages with k = K:
>>>>> >> ingestion_relay_1_max_replication.png
>>>>> >> <https://docs.google.com/file/d/0B_XhUZLNFT4dN21TLTZBQjZMdUk/edit?usp=drive_web>
>>>>> >>
>>>>> >> This is a case with k < K:
>>>>> >> ingestion_relay_2_low_replication.png
>>>>> >> <https://docs.google.com/file/d/0B_XhUZLNFT4da1lKMnRKdU9JUkU/edit?usp=drive_web>
>>>>> >>
>>>>> >> The above 2 cases show only one transmission going from the
>>>>> >> upstream system to the downstream system serially; this one
>>>>> >> shows it pipelined:
>>>>> >> ingestion_relay_3_pipelining.png
>>>>> >> <https://docs.google.com/file/d/0B_XhUZLNFT4dQUpTZGRDdVVXLVU/edit?usp=drive_web>
>>>>> >>
>>>>> >> This demonstrates failure of a node in the cluster, and how it
>>>>> >> recovers in the absence of continued transmission (it is
>>>>> >> recovered by timeout and retransmission):
>>>>> >> ingestion_relay_4_timeout_based_recovery.png
>>>>> >> <https://docs.google.com/file/d/0B_XhUZLNFT4dMm5kUWtaTlVfV1U/edit?usp=drive_web>
>>>>> >>
>>>>> >> This demonstrates failure of a node in the cluster, and how it
>>>>> >> recovers due to continued transmission:
>>>>> >> ingestion_relay_5_broken_transmission_based_recovery.png
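The two-watermark exchange and discard rules above can be sketched as the bookkeeping a lower-tier node might keep per session. This is a minimal illustrative sketch; the class and field names are assumptions, not rsyslog state.

```python
from dataclasses import dataclass, field

@dataclass
class SessionState:
    """Per-session state on a cluster node (illustrative names)."""
    replicated_till: int = 0  # highest seq_no known to be k-way replicated
    delivered_till: int = 0   # highest seq_no delivered downstream
    buffer: dict = field(default_factory=dict)  # seq_no -> buffered message

    def on_ack(self, replicated_till, delivered_till):
        """Handle the reply from the next tier: advance both watermarks
        and discard what the rules say is safe to drop. A lower tier may
        discard every message with seq_no <= Delivered till seq_no."""
        self.replicated_till = max(self.replicated_till, replicated_till)
        self.delivered_till = max(self.delivered_till, delivered_till)
        for seq in [s for s in self.buffer if s <= self.delivered_till]:
            del self.buffer[seq]
```

An upstream system outside the cluster would apply the weaker rule instead, discarding by `replicated_till`, since k-way replication already meets its safety requirement even before delivery completes.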
>>>>> >> <https://docs.google.com/file/d/0B_XhUZLNFT4dd3M0SXpUYjFXdlk/edit?usp=drive_web>
>>>>> >>
>>>>> >> Rsyslog-level implementation sketch:
>>>>> >>
>>>>> >> - Let us assume there is a way to identify the set of inputs,
>>>>> >> queues, rulesets and actions that need to participate as
>>>>> >> reliable pipeline components in a cluster node.
>>>>> >> - Each participating queue will expect messages to contain a
>>>>> >> session_id.
>>>>> >> - A consumer bound to a queue will be expected to provide values
>>>>> >> for both watermarks per session to dequeue more messages.
>>>>> >> - A producer bound to a queue will be provided values for both
>>>>> >> watermarks per session as a return value when enqueueing more
>>>>> >> messages.
>>>>> >> - The inputs will transmit (either broadcast or unicast) both
>>>>> >> watermark values to upstream actions (unicast is sent over
>>>>> >> relevant connections, broadcast is sent across all connections).
>>>>> >> (Please note this has nothing to do with network broadcast
>>>>> >> domains, as everything is over TCP.)
>>>>> >> - Actions will receive the two watermarks and push them back to
>>>>> >> the queue the action is bound to, in order to dequeue more
>>>>> >> messages.
>>>>> >> - Rulesets will need to pick the relevant action's value across
>>>>> >> multiple action-queues according to user-provided configuration,
>>>>> >> and propagate it backwards.
>>>>> >> - An action must have the ability to set an arbitrary value for
>>>>> >> replica-number when passing a message to the downstream system
>>>>> >> (so that chaining is possible).
>>>>> >> - Inputs may produce the new value for Replicated till seq_no
>>>>> >> when receiving a message with replica_number == 1.
>>>>> >> - An action may produce the new value for Delivered till seq_no
>>>>> >> after having successfully delivered a message with
>>>>> >> replica_number == 1.
>>>>> >>
>>>>> >> Rsyslog configuration required (from the user):
>>>>> >>
>>>>> >> - The user will need to identify machines that are part of the
>>>>> >> cluster.
>>>>> >> - These machines will have to be divided into multiple
>>>>> >> replication tiers (as replication will happen only across
>>>>> >> machines in different tiers).
>>>>> >> - The user can pass a message to the next cluster by setting
>>>>> >> replica_number back to a desired number and passing it to an
>>>>> >> action which writes it to one of the nodes in a downstream
>>>>> >> cluster.
>>>>> >> - The user needs to check replica_number in the ruleset and take
>>>>> >> special action (to write it to the downstream system) when
>>>>> >> replica_number == 1.
>>>>> >>
>>>>> >> Does this have any overlap with RELP?
>>>>> >>
>>>>> >> I haven't studied RELP in depth yet, but as far as I understand
>>>>> >> it, it tries to solve the problem of delivering messages
>>>>> >> reliably and losslessly between a single producer and a single
>>>>> >> consumer (it targets different kinds of loss scenarios
>>>>> >> specifically). In addition, its scope is limited to ensuring no
>>>>> >> messages are lost during transportation. In the event of a crash
>>>>> >> of the receiver node before it can handle a received message
>>>>> >> reliably, some messages may be lost. Someone with deeper
>>>>> >> knowledge of RELP should chime in.
>>>>> >>
>>>>> >> Thoughts?
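The replica-number walk through the tiers (each tier keeps a copy and decrements, and the node that sees replica_number == 1 delivers downstream) can be sketched like this. This is an illustrative sketch in Python, not rsyslog's ruleset language; `tiers` and `deliver` are placeholders for the hashed node choice and the downstream-writing action.

```python
def relay(message, replica_number, tiers, deliver):
    """Walk a message down the replication tiers. Each tier retains a
    replica and forwards with replica_number decremented; the node that
    receives replica_number == 1 delivers downstream instead of
    forwarding (the ruleset's replica_number == 1 clause)."""
    replicas = []
    for tier in tiers:
        node = tier[0]  # stand-in for the session_id-hashed node choice
        replicas.append((node, message))  # this tier now holds a copy
        if replica_number == 1:
            deliver(node, message)  # write to the downstream system
            return replicas
        replica_number -= 1
    raise RuntimeError("ran out of tiers before replica_number reached 1")
```

With three tiers and a producer asking for replica_number = 3, the message is replicated on one node in each tier and delivered from the Tier-3 node; chaining clusters corresponds to `deliver` resetting replica_number and calling into the next cluster's Tier-1.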
>>>>> >>
>>>>> >> --
>>>>> >> Regards,
>>>>> >> Janmejay
>>>>> >> http://codehunk.wordpress.com
>>>>> >> _______________________________________________
>>>>> >> rsyslog mailing list
>>>>> >> http://lists.adiscon.net/mailman/listinfo/rsyslog
>>>>> >> http://www.rsyslog.com/professional-services/
>>>>> >> What's up with rsyslog? Follow https://twitter.com/rgerhards
>>>>> >> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by
>>>>> >> a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO
>>>>> >> NOT POST if you DON'T LIKE THAT.
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Janmejay
>>>>> http://codehunk.wordpress.com

