@Thomas: This is not about testing and quantifying loss during a test. It's about quantifying it during normal operation. I see it as a choice between:

A. deploy the strongest protocol at every system boundary, test each one and each change rigorously to identify or bound loss under test conditions, and expect nothing unexpected to show up in production
B. do the former, and additionally measure loss in production to identify that something unexpected happened
C. deploy efficient protocols at all system boundaries and measure loss (as long as loss stays within an acceptable level, the deployment benefits from all the efficiency gains)
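To make the "measure loss" part of B/C concrete, here is a rough sketch (illustrative only, not an rsyslog feature; all names are made up) of comparing free-running, never-reset counters on both sides of a boundary:

```python
# Illustrative sketch only: quantify loss across a boundary by sampling
# free-running (never-reset) send/receive counters at interval boundaries.

COUNTER_BITS = 32          # assume fixed-width counters that wrap around
WRAP = 1 << COUNTER_BITS

def counter_delta(earlier, later, wrap=WRAP):
    # Modular subtraction gives the true count between two samples,
    # correct across at most one wrap-around of the counter.
    return (later - earlier) % wrap

def loss_over_interval(sent_t0, sent_t1, recv_t0, recv_t1):
    # Loss in an interval = messages the sender counted minus messages
    # the receiver counted over the same interval.
    return counter_delta(sent_t0, sent_t1) - counter_delta(recv_t0, recv_t1)

# Example: the sender's counter wrapped between the two daily samples.
print(loss_over_interval(WRAP - 10, 90, 500, 597))  # 100 sent, 97 received -> 3
```

The modular subtraction is what makes "let the counter count up and wrap around" safe, provided samples are taken more often than the counter can wrap once.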
I am talking in the context of C. If/when loss rises above the acceptable level, one needs to debug and fix the problem. Both B and C provide the data required to identify the situations when such debugging needs to happen.

The approach of stamping on one end and measuring on the other treats all intermediate hops as a black box. For instance, it can be used to quantify losses in the face of frequent machine failures, downtime-free maintenance, etc.

@David: As of now, I am thinking of an end-of-the-day style measurement (basically, report the number of messages lost at a good-enough granularity, say host x severity). I am thinking of this as something independent of the frequency of outages and unrelated to maintenance windows. I'm thinking of it as a report that captures the extent of loss, where one can pull down several months of this data and verify that loss was never beyond an acceptable level, or compare it across days when the load profile was very different (the day when too many circuit-breakers engaged, etc.).

I haven't thought this through fully, but a reset may not be required. Basically, let the counter count up and wrap around (as long as wrap-around is well-defined behavior which is accounted for during measurement).

On Sat, Feb 13, 2016 at 5:13 AM, David Lang <[email protected]> wrote:
> On Sat, 13 Feb 2016, singh.janmejay wrote:
>
>> The ideal solution would be one that identifies host, log-source and
>> time of loss, along with an accurate number of messages lost.
>>
>> pstats makes sense, but correlating data from stats across a large
>> number of machines will be difficult (some machines may send stats
>> slightly delayed, which may skew aggregation, etc).
>
> if you don't reset the counters, they keep increasing, so over time the
> error due to the slew becomes a very minor component.
>
>> One approach I can think of: slap a stream-identifier and
>> sequence-number on each received message, then find gaps in the
>> sequence numbers for a session-id on the other side (as a query over
>> the log-store, etc).
>
> I'll point out that generating/checking a monotonic sequence number
> destroys parallelism, and so it can seriously hurt performance.
>
> Are you trying to detect problems 'on the fly' as they happen, or at the
> end of the hour/day, saying 'hey, there was a problem at some point'?
>
> How frequent do you think problems are? I would suggest that you run some
> stress tests on your equipment/network and push things until you do have
> problems, so you can track when they happen. I expect that you will find
> that they don't start happening until you have much higher loads than you
> expect (at least after a bit of tuning), and this can make it so that the
> most invasive solutions aren't needed.
>
> David Lang
>
>> Large issues such as a producer suddenly going silent can be detected
>> using macro mechanisms (like pstats).
>>
>> On Sat, Feb 13, 2016 at 2:56 AM, David Lang <[email protected]> wrote:
>>> On Sat, 13 Feb 2016, Andre wrote:
>>>
>>>> The easiest way I found to do that is to have a control system and
>>>> send two streams of data to two or more different destinations.
>>>>
>>>> In the case of rsyslog processing a large message volume over UDP,
>>>> the loss has always been noticeable.
>>>
>>> this depends on your setup. I was able to send UDP logs at gig-E wire
>>> speed with no losses, but it required tuning the receiving system to
>>> not do DNS lookups, have sufficient RAM for buffering, etc.
>>>
>>> I never was able to get my hands on 10G equipment to push up from there.
>>>
>>> David Lang
>>>
>>> _______________________________________________
>>> rsyslog mailing list
>>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>>> http://www.rsyslog.com/professional-services/
>>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
>>> myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST
>>> if you DON'T LIKE THAT.
--
Regards,
Janmejay
http://codehunk.wordpress.com
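P.S. Since the sequence-number idea came up above: a minimal receiver-side sketch (illustrative names, not an rsyslog API) that keeps the sequence per stream rather than using one global monotonic counter, which also sidesteps the parallelism concern David raised:

```python
# Illustrative sketch only: find sequence-number gaps per stream on the
# receiving side, given (stream_id, seq) stamps applied by the sender.

def find_gaps(records):
    """records: iterable of (stream_id, seq) in the order seen.
    Returns {stream_id: [(first_missing, last_missing), ...]}."""
    last_seen = {}
    gaps = {}
    for stream, seq in records:
        prev = last_seen.get(stream)
        if prev is not None and seq > prev + 1:
            # Messages prev+1 .. seq-1 never arrived on this stream.
            gaps.setdefault(stream, []).append((prev + 1, seq - 1))
        if prev is None or seq > prev:
            last_seen[stream] = seq
    return gaps

# Example: stream "a" lost messages 3 and 4; stream "b" lost nothing.
print(find_gaps([("a", 1), ("a", 2), ("a", 5), ("b", 1), ("b", 2)]))
# {'a': [(3, 4)]}
```

Caveat: out-of-order (but not lost) delivery would show up as a false gap here; a real query over the log-store would sort by sequence number within each stream first, and would also need to handle sequence wrap-around the same way the counters do.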

