Hi Jörg,

Thanks for sharing your gray failure! With a lifespan of a few years, it might well be the oldest gray failure ever monitored continuously :-)

I'm pretty sure you guys have exhausted all options already, but... did you check for micro-bursts that may cause sudden buffer overflows? Or is your probing traffic perhaps already high priority?

Best,
Laurent

> On 8 Jul 2021, at 15:58, Jörg Kost <[email protected]> wrote:
>
> We have a similar gray issue, where switches in a virtual chassis
> configuration with layer3-configuration seem to lose transit ICMP messages
> like echo or echo-reply randomly. Once we estimated it around 0.00012% ( let
> alone variances, or errors in measuring ).
>
> We noticed this when we replaced Nagios with some more bursting,
> trigger-happy monitoring software a few years back. Since then, it's
> reporting false positives from time to time, and this can become annoying.
>
> Besides spending a lot of time debugging this, we never had a breakthrough in
> finding the root cause, just looking to replace things in the next year.
>
> On 8 Jul 2021, at 15:28, Mark Tinka wrote:
>
>> On 7/8/21 15:22, Vanbever Laurent wrote:
>>
>>> Did you folks manage to understand what was causing the gray issue in the
>>> first place?
>>
>> Nope, still chasing it. We suspect a FIB issue on a transit device, but
>> currently building a test to confirm.
>>
>> Mark.

