@Thomas: This is not about testing and quantifying loss during a test. It's about quantifying it during normal operation. I see it as a choice between:

A. deploy the strongest protocol at every system boundary, test each one and each change rigorously to identify or bound loss under test conditions, and expect nothing unexpected to show up in production
B. do the former, and additionally measure loss in production to identify that something unexpected happened
C. deploy efficient protocols at all system boundaries and measure loss (as long as loss stays within an acceptable level, the deployment benefits from all the efficiency gains)
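To make the "measure loss" part of B/C concrete, here is a rough sketch (illustrative only, not an rsyslog feature; all names are made up) of comparing free-running, never-reset counters on both sides of a boundary:

```python
# Illustrative sketch only: quantify loss across a boundary by sampling
# free-running (never-reset) send/receive counters at interval boundaries.

COUNTER_BITS = 32          # assume fixed-width counters that wrap around
WRAP = 1 << COUNTER_BITS

def counter_delta(earlier, later, wrap=WRAP):
    # Modular subtraction gives the true count between two samples,
    # correct across at most one wrap-around of the counter.
    return (later - earlier) % wrap

def loss_over_interval(sent_t0, sent_t1, recv_t0, recv_t1):
    # Loss in an interval = messages the sender counted minus messages
    # the receiver counted over the same interval.
    return counter_delta(sent_t0, sent_t1) - counter_delta(recv_t0, recv_t1)

# Example: the sender's counter wrapped between the two daily samples.
print(loss_over_interval(WRAP - 10, 90, 500, 597))  # 100 sent, 97 received -> 3
```

The modular subtraction is what makes "let the counter count up and wrap around" safe, provided samples are taken more often than the counter can wrap once.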
I am talking in the context of C. If/when loss rises above the acceptable level, one needs to debug and fix the problem. Both B and C provide the data required to identify the situations when such debugging needs to happen.

The approach of stamping on one end and measuring on the other treats all intermediate hops as a black box. For instance, it can be used to quantify losses in the face of frequent machine failures, downtime-free maintenance, etc.

@David: As of now, I am thinking of an end-of-the-day style measurement (basically, report the number of messages lost at a good-enough granularity, say host x severity). I am thinking of this as something independent of the frequency of outages and unrelated to maintenance windows. I'm thinking of it as a report that captures the extent of loss, where one can pull down several months of this data and verify that loss was never beyond an acceptable level, or compare it across days when the load profile was very different (the day when too many circuit-breakers engaged, etc.).

I haven't thought this through fully, but a reset may not be required. Basically, let the counter count up and wrap around (as long as wrap-around is well-defined behavior which is accounted for during measurement).

On Sat, Feb 13, 2016 at 5:13 AM, David Lang <[email protected]> wrote:
> On Sat, 13 Feb 2016, singh.janmejay wrote:
>
>> The ideal solution would be one that identifies host, log-source and
>> time of loss, along with an accurate number of messages lost.
>>
>> pstats makes sense, but correlating data from stats across a large
>> number of machines will be difficult (some machines may send stats
>> slightly delayed, which may skew aggregation, etc).
>
> if you don't reset the counters, they keep increasing, so over time the
> error due to the slew becomes a very minor component.
>
>> One approach I can think of: slap a stream-identifier and
>> sequence-number on each received message, then find gaps in the
>> sequence numbers for a session-id on the other side (as a query over
>> the log-store, etc).
>
> I'll point out that generating/checking a monotonic sequence number
> destroys parallelism, and so it can seriously hurt performance.
>
> Are you trying to detect problems 'on the fly' as they happen, or at the
> end of the hour/day, saying 'hey, there was a problem at some point'?
>
> How frequent do you think problems are? I would suggest that you run some
> stress tests on your equipment/network and push things until you do have
> problems, so you can track when they happen. I expect that you will find
> that they don't start happening until you have much higher loads than you
> expect (at least after a bit of tuning), and this can make it so that the
> most invasive solutions aren't needed.
>
> David Lang
>
>> Large issues such as a producer suddenly going silent can be detected
>> using macro mechanisms (like pstats).
>>
>> On Sat, Feb 13, 2016 at 2:56 AM, David Lang <[email protected]> wrote:
>>> On Sat, 13 Feb 2016, Andre wrote:
>>>
>>>> The easiest way I found to do that is to have a control system and
>>>> send two streams of data to two or more different destinations.
>>>>
>>>> In the case of rsyslog processing a large message volume over UDP,
>>>> the loss has always been noticeable.
>>>
>>> this depends on your setup. I was able to send UDP logs at gig-E wire
>>> speed with no losses, but it required tuning the receiving system to
>>> not do DNS lookups, have sufficient RAM for buffering, etc.
>>>
>>> I never was able to get my hands on 10G equipment to push up from there.
>>>
>>> David Lang
>>>
>>> _______________________________________________
>>> rsyslog mailing list
>>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>>> http://www.rsyslog.com/professional-services/
>>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
>>> myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST
>>> if you DON'T LIKE THAT.
--
Regards,
Janmejay
http://codehunk.wordpress.com
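P.S. Since the sequence-number idea came up above: a minimal receiver-side sketch (illustrative names, not an rsyslog API) that keeps the sequence per stream rather than using one global monotonic counter, which also sidesteps the parallelism concern David raised:

```python
# Illustrative sketch only: find sequence-number gaps per stream on the
# receiving side, given (stream_id, seq) stamps applied by the sender.

def find_gaps(records):
    """records: iterable of (stream_id, seq) in the order seen.
    Returns {stream_id: [(first_missing, last_missing), ...]}."""
    last_seen = {}
    gaps = {}
    for stream, seq in records:
        prev = last_seen.get(stream)
        if prev is not None and seq > prev + 1:
            # Messages prev+1 .. seq-1 never arrived on this stream.
            gaps.setdefault(stream, []).append((prev + 1, seq - 1))
        if prev is None or seq > prev:
            last_seen[stream] = seq
    return gaps

# Example: stream "a" lost messages 3 and 4; stream "b" lost nothing.
print(find_gaps([("a", 1), ("a", 2), ("a", 5), ("b", 1), ("b", 2)]))
# {'a': [(3, 4)]}
```

Caveat: out-of-order (but not lost) delivery would show up as a false gap here; a real query over the log-store would sort by sequence number within each stream first, and would also need to handle sequence wrap-around the same way the counters do.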

